  1. Couchbase Server
  2. MB-57354

[Provisioned/GCP]: Analytics Swap rebalance triggered by CP due to disk size reduction is failing forever in a loop. Rebalance timing out waiting for all nodes to join every 5 mins.


Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • None
    • 7.1.4
    • analytics
    • Enterprise Edition 7.1.4 build 3638

    Description

      Test steps

      1. Deploy a GCP cluster with 3 nodes each for KV, GSI, Query and FTS, deployed separately.
      2. Create a Magma bucket with a single replica, plus 1 scope with 2 collections in addition to the _default._default keyspace.
      3. Load 5M docs into each of the 2 collections.
      4. Create GSI indexes, wait for the indexes to come online, and run queries against them.
      5. Start a KV workload at 10k/s.
      6. Increase the disk size by 5G for all service groups.
      7. The online scaling operation completes without any issues.
      8. Decrease the disk size by 5G for all service groups. This triggers a swap rebalance for all the nodes, one at a time. The swap rebalance for cbas fails:

        Rebalance Activity

        Starting rebalance, KeepNodes = ['ns_1@svc-a-node-011.np7cmh-hxy7-vgln.sandbox.nonprod-project-avengers.com',
        'ns_1@svc-a-node-012.np7cmh-hxy7-vgln.sandbox.nonprod-project-avengers.com',
        'ns_1@svc-a-node-018.np7cmh-hxy7-vgln.sandbox.nonprod-project-avengers.com',
        'ns_1@svc-d-node-013.np7cmh-hxy7-vgln.sandbox.nonprod-project-avengers.com',
        'ns_1@svc-d-node-014.np7cmh-hxy7-vgln.sandbox.nonprod-project-avengers.com',
        'ns_1@svc-d-node-015.np7cmh-hxy7-vgln.sandbox.nonprod-project-avengers.com',
        'ns_1@svc-qi-node-004.np7cmh-hxy7-vgln.sandbox.nonprod-project-avengers.com',
        'ns_1@svc-qi-node-005.np7cmh-hxy7-vgln.sandbox.nonprod-project-avengers.com',
        'ns_1@svc-qi-node-006.np7cmh-hxy7-vgln.sandbox.nonprod-project-avengers.com',
        'ns_1@svc-s-node-009.np7cmh-hxy7-vgln.sandbox.nonprod-project-avengers.com',
        'ns_1@svc-s-node-016.np7cmh-hxy7-vgln.sandbox.nonprod-project-avengers.com',
        'ns_1@svc-s-node-017.np7cmh-hxy7-vgln.sandbox.nonprod-project-avengers.com'], EjectNodes = ['ns_1@svc-a-node-010.np7cmh-hxy7-vgln.sandbox.nonprod-project-avengers.com'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 327a570d2e5aa3e0647921f14d434456
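
The node lists in the rebalance entry above can be summarized per service group to confirm which group is being swapped. The following is a minimal sketch: host names are shortened to their prefix, and the assumption that the tag between "svc-" and "-node-" identifies a service group comes only from the naming pattern in this log, not from any documented convention.

```python
import re
from collections import Counter

# Host prefixes copied from the KeepNodes / EjectNodes lists in the log above
# (the common domain suffix and the "ns_1@" prefix are dropped for brevity).
KEEP_NODES = [
    "svc-a-node-011", "svc-a-node-012", "svc-a-node-018",
    "svc-d-node-013", "svc-d-node-014", "svc-d-node-015",
    "svc-qi-node-004", "svc-qi-node-005", "svc-qi-node-006",
    "svc-s-node-009", "svc-s-node-016", "svc-s-node-017",
]
EJECT_NODES = ["svc-a-node-010"]


def service_group(host):
    """Return the tag from a 'svc-<tag>-node-<n>' host prefix (assumed naming pattern)."""
    match = re.fullmatch(r"svc-([a-z]+)-node-\d+", host)
    if match is None:
        raise ValueError(f"unexpected host name: {host}")
    return match.group(1)


keep_by_group = Counter(service_group(h) for h in KEEP_NODES)
eject_by_group = Counter(service_group(h) for h in EJECT_NODES)
print(dict(keep_by_group), dict(eject_by_group))
```

Under that assumption, the only ejected node belongs to the "a" group, which keeps three nodes after the swap.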
        

      First failure

      Rebalance exited with reason {service_rebalance_failed,cbas,
      {worker_died,
      {'EXIT',<0.23349.29>,
      {rebalance_failed,
      {service_error,
      <<"Rebalance 85b674cc8e804b9c806b6ae210376a3b failed: see analytics_info.log for details">>}}}}}.
      Rebalance Operation Id = fdf6a79b02b963dd3be3c30a111c506a
      

      Analytics Service unable to successfully rebalance e933bc4742d4866e9f78de073471adc8 due to 'java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [130fd128ae000f61de18d44341223852], state: UNUSABLE)'; see analytics_info.log for details
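
When the same failure repeats in a loop, it can help to pull the missing node IDs and the cluster state out of each message and compare them across attempts. The sketch below does this for the message above. The regular expression is inferred from this single message and is an assumption; other builds may format the error differently.

```python
import re

# Message copied verbatim from the rebalance failure reported above.
MESSAGE = (
    "Analytics Service unable to successfully rebalance "
    "e933bc4742d4866e9f78de073471adc8 due to "
    "'java.lang.IllegalStateException: timed out waiting for all nodes to join "
    "& cluster active (missing nodes: [130fd128ae000f61de18d44341223852], "
    "state: UNUSABLE)'; see analytics_info.log for details"
)

# Assumed format: "missing nodes: [id, id, ...], state: NAME"
PATTERN = re.compile(r"missing nodes: \[(?P<nodes>[^\]]*)\], state: (?P<state>\w+)")


def parse_join_timeout(message):
    """Return (missing_node_ids, cluster_state), or None if the message does not match."""
    match = PATTERN.search(message)
    if match is None:
        return None
    nodes = [n.strip() for n in match.group("nodes").split(",") if n.strip()]
    return nodes, match.group("state")


print(parse_join_timeout(MESSAGE))
```

If the same node ID is reported as missing on every retry, that points at one node never joining rather than at a transient timeout.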
      

      QE Test

      sudo guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/capella.ini -p bucket_storage=magma,bucket_eviction_policy=fullEviction,rerun=False -t aGoodDoctor.hostedHospital.Murphy.test_rebalance,graceful=True,skip_cleanup=True,num_buckets=1,bucket_names=GleamBook,bucket_type=membase,eviction_policy=fullEviction,iterations=10,batch_size=1000,sdk_timeout=60,log_level=debug,infra_log_level=debug,rerun=False,skip_cleanup=True,key_size=24,randomize_doc_size=False,randomize_value=True,maxttl=10,pc=20,gsi_nodes=3,cbas_nodes=3,fts_nodes=3,kv_nodes=3,n1ql_nodes=3,mutation_perc=100,key_type=RandomKey,capella_run=true,services=data-query:index-search-analytics,max_rebl_nodes=27,kv_compute=n2-standard-16,gsi_compute=n2-standard-16,n1ql_compute=n2-standard-16,fts_compute=n2-standard-16,cbas_compute=n2-standard-16,kv_disk=500,n1ql_disk=50,gsi_disk=500,cbas_disk=500,provider=GCP,region=us-central1,type=PD-SSD,skip_teardown_cleanup=true,wait_timeout=14400,index_timeout=28800,runtype=dedicated,track_failures=True,skip_init=False,key_type=CircularKey,rebalance_type=disk,clients_per_db=1 -m rest'
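
The -t argument above repeats some keys, which matters when reproducing the run. The sketch below lists repeated keys from an abbreviated copy of that argument (most parameters are omitted; the ones shown keep their original order). Which of the repeated values the test runner actually honors is not stated in this report.

```python
from collections import defaultdict

# Abbreviated copy of the -t argument above: the test name, then key=value pairs.
TEST_SPEC = (
    "aGoodDoctor.hostedHospital.Murphy.test_rebalance,graceful=True,"
    "skip_cleanup=True,num_buckets=1,rerun=False,skip_cleanup=True,"
    "cbas_nodes=3,key_type=RandomKey,cbas_disk=500,provider=GCP,"
    "key_type=CircularKey,rebalance_type=disk"
)


def repeated_params(spec):
    """Map each key that occurs more than once to its values, in order of appearance."""
    _test_name, *pairs = spec.split(",")
    seen = defaultdict(list)
    for pair in pairs:
        key, _, value = pair.partition("=")
        seen[key].append(value)
    return {key: values for key, values in seen.items() if len(values) > 1}


print(repeated_params(TEST_SPEC))
```

In this excerpt, skip_cleanup is repeated with the same value, while key_type is given as both RandomKey and CircularKey.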
      

          People

            ritesh.agarwal Ritesh Agarwal