Couchbase Server / MB-57744

[AWS/Collocated]: Nodes are failed over while building CBAS datasets and indexes. CP adds the node back with recovery, and the subsequent rebalance-in fails.


Details

    Description

      1. Create a 3-node cluster with colocated services on AWS.
      2. Create a bucket, 2 collections and load 75M items in each collection.
      3. Create CBAS datasets and indexes. Wait for them to build/ingest data.
      4. While this was in progress, node 001 was failed over. CP tried to add the node back and rebalance it in.

        Analytics Service unable to successfully rebalance 943d20d4e3c0fe8bcb36e3b25842a9a0 due to 'java.lang.Exception: replica com.couchbase.analytics.control.rebalance.TopologyCoordinator$TimedReplicaStatus@582ced3e inactivity timeout; 300 seconds passed with no progress'; see analytics_info.log for details
         
        Failed over ['ns_1@svc-dqisa-node-001.b8ea3-02qejf6ihx.sandbox.nonprod-project-avengers.com']. Failover couldn't complete on some nodes:
        ['ns_1@svc-dqisa-node-001.b8ea3-02qejf6ihx.sandbox.nonprod-project-avengers.com']
        

      5. Rebalance failed:

        Starting rebalance, KeepNodes = ['ns_1@svc-dqisa-node-001.b8ea3-02qejf6ihx.sandbox.nonprod-project-avengers.com',
        'ns_1@svc-dqisa-node-002.b8ea3-02qejf6ihx.sandbox.nonprod-project-avengers.com',
        'ns_1@svc-dqisa-node-003.b8ea3-02qejf6ihx.sandbox.nonprod-project-avengers.com'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 99b90a3489aee7df57aa0cf81626aeaa
         
        Analytics Service unable to successfully rebalance 9c47373917a92825e779ee891df88715 due to 'java.lang.Exception: replica com.couchbase.analytics.control.rebalance.TopologyCoordinator$TimedReplicaStatus@747ed24d inactivity timeout; 300 seconds passed with no progress'; see analytics_info.log for details
        

      6. Next Rebalance attempt:

        Starting rebalance, KeepNodes = ['ns_1@svc-dqisa-node-001.b8ea3-02qejf6ihx.sandbox.nonprod-project-avengers.com',
        'ns_1@svc-dqisa-node-002.b8ea3-02qejf6ihx.sandbox.nonprod-project-avengers.com',
        'ns_1@svc-dqisa-node-003.b8ea3-02qejf6ihx.sandbox.nonprod-project-avengers.com'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = d6acbc11307f34ea9381b627b3bd3850
         
        Analytics Service unable to successfully rebalance 4b3462081972c9fbf923e89a12a2259b due to 'java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [f42c94909e21cfa5316f5317c7f34e78], state: ACTIVE)'; see analytics_info.log for details
        

      7. Links seem to be broken and data ingestion is stuck.
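
      For triage, the Analytics rebalance failures quoted in steps 4 to 6 can be grouped by cause. The following is a minimal sketch, assuming the message format matches the three excerpts above; the pattern is inferred from this report only, and the names `FAILURE_RE`, `RebalanceFailure`, and `classify` are hypothetical helpers, not part of Couchbase or the test framework.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the messages quoted in this report (assumption);
# other server versions may word the message differently.
FAILURE_RE = re.compile(
    r"Analytics Service unable to successfully rebalance "
    r"(?P<rebalance_id>[0-9a-f]+) due to "
    r"'(?P<exception>[\w.$]+): (?P<detail>.*)'; see analytics_info\.log"
)


class RebalanceFailure(NamedTuple):
    rebalance_id: str
    exception: str
    kind: str  # "inactivity_timeout", "join_timeout", or "other"


def classify(line: str) -> Optional[RebalanceFailure]:
    """Return the parsed failure, or None if the line is not a failure message."""
    match = FAILURE_RE.search(line)
    if match is None:
        return None
    detail = match.group("detail")
    if "inactivity timeout" in detail:
        kind = "inactivity_timeout"   # steps 4 and 5 in this report
    elif "timed out waiting for all nodes to join" in detail:
        kind = "join_timeout"         # step 6 in this report
    else:
        kind = "other"
    return RebalanceFailure(
        match.group("rebalance_id"), match.group("exception"), kind
    )
```

      Running `classify` over each line of the attached test logs would separate the two replica inactivity timeouts from the later "missing nodes" join timeout, which may point to different underlying problems.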

      Ali Alsuliman, this run was on AWS and it could be related to MB-57636. Please have a look.

      QE Test

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/couchbase_capella_volume_2_new.ini bucket_storage=magma,bucket_eviction_policy=fullEviction,rerun=False -t aGoodDoctor.hostedHospital.Murphy.test_rebalance,num_items=75000000,num_buckets=1,bucket_names=GleamBook,bucket_type=membase,iterations=5,batch_size=1000,sdk_timeout=60,log_level=debug,infra_log_level=debug,rerun=False,skip_cleanup=True,key_size=18,randomize_doc_size=False,randomize_value=True,maxttl=10,pc=20,gsi_nodes=3,cbas_nodes=3,fts_nodes=3,kv_nodes=3,n1ql_nodes=3,kv_disk=500,n1ql_disk=50,gsi_disk=500,fts_disk=500,cbas_disk=500,kv_compute=m5.4xlarge,gsi_compute=m5.4xlarge,n1ql_compute=m5.4xlarge,fts_compute=m5.4xlarge,cbas_compute=m5.4xlarge,mutation_perc=20,key_type=CircularKey,capella_run=true,services=data:query:index:analytics:search,max_rebl_nodes=27,provider=AWS,region=us-east-1,type=GP3,size=500,skip_teardown_cleanup=false,wait_timeout=14400,index_timeout=28800,runtype=dedicated,sanity=True'
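
      The parameter string above is long and repeats at least one key (`rerun=False` appears twice). A minimal sketch for inspecting such a string when reproducing the run follows; `parse_test_params` is a hypothetical helper, and how the testrunner itself resolves repeated keys is not stated in this report, so the "last value wins" behaviour here is an assumption.

```python
from collections import Counter
from typing import Dict, List, Tuple


def parse_test_params(params: str) -> Tuple[Dict[str, str], List[str]]:
    """Split a comma-separated key=value string.

    Returns (params, duplicate_keys). Items without '=' (for example the
    test name) are skipped. Assumption: for repeated keys the last value
    wins, as in a plain dict update.
    """
    pairs = [item.split("=", 1) for item in params.split(",") if "=" in item]
    counts = Counter(key for key, _ in pairs)
    duplicates = sorted(key for key, n in counts.items() if n > 1)
    return {key: value for key, value in pairs}, duplicates
```

      Passing either comma-separated segment of the `args` value (before or after `-t`) gives a dictionary that is easier to compare between runs than the raw command line.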
      

      Attachments

        1. screenshot-1.png (316 kB)
        2. Screenshot 2023-07-10 at 3.46.27 PM.png (476 kB)
        3. testLogs_5819.txt (12.25 MB)
        4. testLogs_5855.txt (502 kB)
        5. java_memory.png (625 kB)
        6. java_cpu.png (954 kB)
        7. image-2023-07-20-11-14-21-752.png (57 kB)
        8. image-2023-07-20-11-17-31-558.png (65 kB)
        9. image-2023-07-21-09-25-46-954.png (547 kB)
        10. image-2023-07-21-09-28-41-703.png (448 kB)

            People

              ritesh.agarwal Ritesh Agarwal