Couchbase Server
MB-58462

AWS VM Host/CPU kernel issue - [10B, KV 2% RR, GSI 5% RR]: KV nodes failing over during GSI node swap rebalance due to a compute scale-up operation.


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version: 7.6.0
    • Affects Version: 7.2.1
    • Labels: build, qe
    • Environment: AWS, Enterprise Edition 7.2.1 build 5921

    Description

      Cluster Config

      [
          {
              "compute": "c5.12xlarge",
              "services": ["data"],
              "size": 3,
              "storage": {
                  "IOPS": 16000,
                  "size": 10250,
                  "type": "GP3"
              }
          },
          {
              "compute": "m5.16xlarge",
              "services": ["index"],
              "size": 3,
              "storage": {
                  "IOPS": 16000,
                  "size": 10250,
                  "type": "GP3"
              }
          },
          {
              "compute": "c5.9xlarge",
              "services": ["query"],
              "size": 2,
              "storage": {
                  "IOPS": 3000,
                  "size": 50,
                  "type": "GP3"
              }
          }
      ]
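
      To confirm the deployed topology matches this config, the cluster REST API can be queried from any node. A minimal sketch, assuming Administrator credentials (the hostname and password are placeholders); /pools/default is the standard cluster-info endpoint:

      # Each node's hostname, services, health, and membership state.
      curl -s -u Administrator:password \
          http://svc-d-node-001:8091/pools/default | \
          jq '.nodes[] | {hostname, services, status, clusterMembership}'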
      

      Steps:

      1. Load 10B items across 2 collections, 5B in each.
      2. Build GSI indexes: 2 on one collection and 1 on the other (see the index sketch after this list).
      3. Start an N1QL load asynchronously.
      4. Trigger a compute scale-up for the KV and GSI services.
      5. KV scaling completes on all 3 nodes, but during the GSI node swap rebalance the KV nodes start failing over.
      6. Failed over ['ns_1@svc-d-node-009.jfbm369m3mnjrs-s.sandbox.nonprod-project-avengers.com']: ok
      7. The control plane (CP) tried to bring the cluster back to a healthy state by adding all the nodes back, including the failed-over node, which became reachable again after a while.
      8. During this rebalance another KV node was ready to be failed over, but auto-failover was blocked: "Could not automatically fail over nodes (['ns_1@svc-d-node-010.jfbm369m3mnjrs-s.sandbox.nonprod-project-avengers.com']). Rebalance is running." (See the monitoring sketch below.)
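
      Step 2's index builds can be reproduced manually with the cbq shell. A minimal sketch, assuming the GleamBook bucket from the QE test below; the scope, collection, and field names (s1, c1, c2, body, mutated) are placeholders, not the actual test schema:

      # Two indexes on one collection, one on the other (step 2).
      cbq -e http://svc-q-node-001:8093 -u Administrator -p password \
          -s "CREATE INDEX idx_c1_a ON GleamBook.s1.c1(body);"
      cbq -e http://svc-q-node-001:8093 -u Administrator -p password \
          -s "CREATE INDEX idx_c1_b ON GleamBook.s1.c1(mutated);"
      cbq -e http://svc-q-node-001:8093 -u Administrator -p password \
          -s "CREATE INDEX idx_c2_a ON GleamBook.s1.c2(body);"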
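
      The failover/rebalance interplay in steps 5-8 can be watched from the cluster REST API. A minimal sketch, assuming Administrator credentials (hostname and password are placeholders); /settings/autoFailover and /pools/default/tasks are standard endpoints:

      # Current auto-failover policy (enabled, timeout, max events).
      curl -s -u Administrator:password \
          http://svc-d-node-001:8091/settings/autoFailover | jq .

      # Running tasks; an in-flight rebalance blocks further auto-failover,
      # which matches the "Rebalance is running" message in step 8.
      curl -s -u Administrator:password \
          http://svc-d-node-001:8091/pools/default/tasks | \
          jq '.[] | select(.type == "rebalance")'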

      QE Test

      sudo guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/couchbase_capella_volume_2_new.ini -p bucket_storage=magma,bucket_eviction_policy=fullEviction,rerun=False -t aGoodDoctor.hostedHospital.Murphy.test_rebalance,num_items=5000000000,num_buckets=1,bucket_names=GleamBook,bucket_type=membase,iterations=1,batch_size=1000,sdk_timeout=60,log_level=debug,infra_log_level=debug,rerun=False,skip_cleanup=True,key_size=18,randomize_doc_size=False,randomize_value=True,maxttl=10,pc=20,gsi_nodes=3,cbas_nodes=3,fts_nodes=3,kv_nodes=3,n1ql_nodes=2,kv_disk=10240,n1ql_disk=50,gsi_disk=10240,fts_disk=50,cbas_disk=50,kv_compute=c5.9xlarge,gsi_compute=m5.12xlarge,n1ql_compute=c5.4xlarge,fts_compute=c5.4xlarge,cbas_compute=c5.4xlarge,mutation_perc=100,key_type=CircularKey,capella_run=true,services=data-index-query,rebl_services=data-index-query,max_rebl_nodes=27,provider=AWS,region=us-east-1,type=GP3,size=50,collections=2,ops_rate=120000,skip_teardown_cleanup=true,wait_timeout=14400,index_timeout=86400,runtype=dedicated,skip_init=true,rebl_ops_rate=10000,nimbus=true,expiry=false,v_scaling=true,h_scaling=false,horizontal_scale=1,clients_per_db=1 -m rest'
      

    People

      Assignee: ritesh.agarwal (Ritesh Agarwal)
      Reporter: ritesh.agarwal (Ritesh Agarwal)
