  1. Couchbase Server
  2. MB-57744

[AWS/Collocated]: Nodes get failed over while building CBAS datasets and indexes. CP performs add-back and recovery followed by rebalance-in, which leads to rebalance failures.


Details

    Description

      1. Create a 3-node cluster with colocated services on AWS.
      2. Create a bucket, 2 collections and load 75M items in each collection.
      3. Create CBAS datasets and indexes. Wait for them to build/ingest data.
      4. While this was in progress, node 001 was failed over. CP tried to add the node back and rebalance it in.

        Analytics Service unable to successfully rebalance 943d20d4e3c0fe8bcb36e3b25842a9a0 due to 'java.lang.Exception: replica com.couchbase.analytics.control.rebalance.TopologyCoordinator$TimedReplicaStatus@582ced3e inactivity timeout; 300 seconds passed with no progress'; see analytics_info.log for details
         
        Failed over ['ns_1@svc-dqisa-node-001.b8ea3-02qejf6ihx.sandbox.nonprod-project-avengers.com']. Failover couldn't complete on some nodes:
        ['ns_1@svc-dqisa-node-001.b8ea3-02qejf6ihx.sandbox.nonprod-project-avengers.com']
        

      5. Rebalance failed:

        Starting rebalance, KeepNodes = ['ns_1@svc-dqisa-node-001.b8ea3-02qejf6ihx.sandbox.nonprod-project-avengers.com',
        'ns_1@svc-dqisa-node-002.b8ea3-02qejf6ihx.sandbox.nonprod-project-avengers.com',
        'ns_1@svc-dqisa-node-003.b8ea3-02qejf6ihx.sandbox.nonprod-project-avengers.com'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 99b90a3489aee7df57aa0cf81626aeaa
         
        Analytics Service unable to successfully rebalance 9c47373917a92825e779ee891df88715 due to 'java.lang.Exception: replica com.couchbase.analytics.control.rebalance.TopologyCoordinator$TimedReplicaStatus@747ed24d inactivity timeout; 300 seconds passed with no progress'; see analytics_info.log for details
        

      6. Next Rebalance attempt:

        Starting rebalance, KeepNodes = ['ns_1@svc-dqisa-node-001.b8ea3-02qejf6ihx.sandbox.nonprod-project-avengers.com',
        'ns_1@svc-dqisa-node-002.b8ea3-02qejf6ihx.sandbox.nonprod-project-avengers.com',
        'ns_1@svc-dqisa-node-003.b8ea3-02qejf6ihx.sandbox.nonprod-project-avengers.com'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = d6acbc11307f34ea9381b627b3bd3850
         
        Analytics Service unable to successfully rebalance 4b3462081972c9fbf923e89a12a2259b due to 'java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [f42c94909e21cfa5316f5317c7f34e78], state: ACTIVE)'; see analytics_info.log for details
        

      7. Links seem to be broken and data ingestion is stuck.
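
For triage, the Analytics failure messages quoted in steps 4 to 6 can be grouped by cause. The following is a minimal sketch, assuming the message shape stays as quoted in this report; the `classify` function and the category names are hypothetical and are not part of Couchbase Server or the test framework.

```python
import re

# Pattern derived only from the message shape quoted in this report (assumption:
# other builds may word the message differently).
REBALANCE_FAILURE = re.compile(
    r"Analytics Service unable to successfully rebalance "
    r"(?P<rebalance_id>[0-9a-f]+) due to "
    r"'(?P<exception>[\w.$]+): (?P<reason>.*)'; see analytics_info\.log"
)


def classify(message):
    """Return (rebalance_id, exception, category), or None if the message does not match."""
    match = REBALANCE_FAILURE.search(message)
    if match is None:
        return None
    reason = match.group("reason")
    # Category labels are illustrative names, not official error codes.
    if "inactivity timeout" in reason:
        category = "replica-inactivity-timeout"
    elif "timed out waiting for all nodes to join" in reason:
        category = "nodes-not-joined"
    else:
        category = "other"
    return match.group("rebalance_id"), match.group("exception"), category
```

Applied to the three messages above, this separates the two replica inactivity timeouts (steps 4 and 5) from the later failure in step 6, where a node never joined the Analytics cluster.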

      Ali Alsuliman, this run was on AWS and it could be related to MB-57636. Please have a look.

      QE Test

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/couchbase_capella_volume_2_new.ini bucket_storage=magma,bucket_eviction_policy=fullEviction,rerun=False -t aGoodDoctor.hostedHospital.Murphy.test_rebalance,num_items=75000000,num_buckets=1,bucket_names=GleamBook,bucket_type=membase,iterations=5,batch_size=1000,sdk_timeout=60,log_level=debug,infra_log_level=debug,rerun=False,skip_cleanup=True,key_size=18,randomize_doc_size=False,randomize_value=True,maxttl=10,pc=20,gsi_nodes=3,cbas_nodes=3,fts_nodes=3,kv_nodes=3,n1ql_nodes=3,kv_disk=500,n1ql_disk=50,gsi_disk=500,fts_disk=500,cbas_disk=500,kv_compute=m5.4xlarge,gsi_compute=m5.4xlarge,n1ql_compute=m5.4xlarge,fts_compute=m5.4xlarge,cbas_compute=m5.4xlarge,mutation_perc=20,key_type=CircularKey,capella_run=true,services=data:query:index:analytics:search,max_rebl_nodes=27,provider=AWS,region=us-east-1,type=GP3,size=500,skip_teardown_cleanup=false,wait_timeout=14400,index_timeout=28800,runtype=dedicated,sanity=True'
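
The `-t` argument above packs the test name and its parameters into one comma-separated string, which is hard to read. The following is a minimal sketch for splitting it into a test name and a parameter dictionary when comparing runs. `parse_test_spec` is a hypothetical helper, not part of testrunner, and `sample` is a shortened excerpt of the command above.

```python
def parse_test_spec(spec):
    """Split 'test.name,key=value,...' into (test_name, params).

    Items without '=' are treated as the test name; if a key repeats,
    the last value wins.
    """
    test_name = None
    params = {}
    for item in spec.split(","):
        key, sep, value = item.partition("=")
        if sep:
            params[key.strip()] = value.strip()
        elif item.strip():
            test_name = item.strip()
    return test_name, params


# Shortened excerpt of the -t argument from the command above.
sample = (
    "aGoodDoctor.hostedHospital.Murphy.test_rebalance,"
    "num_items=75000000,num_buckets=1,bucket_names=GleamBook,"
    "services=data:query:index:analytics:search,provider=AWS,region=us-east-1"
)

name, params = parse_test_spec(sample)
services = params["services"].split(":")  # services are colon-separated
```

Values stay as strings here; converting `num_items` or the timeouts to integers is left to the caller.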
      

      Attachments

        1. testLogs_5855.txt
          502 kB
          Ritesh Agarwal
        2. testLogs_5819.txt
          12.25 MB
          Ritesh Agarwal
        3. Screenshot 2023-07-10 at 3.46.27 PM.png
          476 kB
          Murtadha Hubail
        4. screenshot-1.png
          316 kB
          Ritesh Agarwal
        5. java_memory.png
          625 kB
          Murtadha Hubail
        6. java_cpu.png
          954 kB
          Murtadha Hubail
        7. image-2023-07-21-09-28-41-703.png
          448 kB
          Neelima Premsankar
        8. image-2023-07-21-09-25-46-954.png
          547 kB
          Neelima Premsankar
        9. image-2023-07-20-11-17-31-558.png
          65 kB
          Neelima Premsankar
        10. image-2023-07-20-11-14-21-752.png
          57 kB
          Neelima Premsankar


            People

              ritesh.agarwal Ritesh Agarwal