Couchbase Server / MB-48468

[System test upgrade] : Post upgrade analytics rebalance fails with "Rebalance failed: timed out waiting for all nodes to join & cluster active (missing nodes:"


Details

    • Untriaged
    • Centos 64-bit
    • 1
    • No
    • CX Sprint 263

    Description

      Steps to Repro
      1. Run the following longevity script on 6.6.3 for 5 days.

      ./sequoia -client 172.23.104.254:2375 -provider file:centos_second_cluster.yml -test tests/integration/test_allFeatures_madhatter_durability.yml -scope tests/integration/scope_Xattrs_Madhatter.yml -scale 3 -repeat 0 -log_level 0 -version 6.6.3-9808 -skip_setup=true -skip_test=false -skip_teardown=true -skip_cleanup=false -continue=false -collect_on_error=false -stop_on_error=false -duration=604800 -show_topology=true
      

      At this point the cluster should have 27 nodes (9 KV, 6 index, 3 analytics, 3 FTS, 3 eventing and 3 N1QL).
      2. Create 10k metakv tombstones. This has been part of our testing since MB-44838 was fixed. We used to create a total of around 25k for CC; here it is reduced to around 12k.

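       #!/bin/bash
      # Brace expansion ({0..10000}) needs bash, not plain sh.
      for i in {0..10000}
          do
              # Create the key, then delete it so metakv keeps a tombstone.
              curl -X PUT -u Administrator:password "http://localhost:8091/_metakv/key${i}" -d 'value=foo1'
              curl -X DELETE -v -u Administrator:password "http://localhost:8091/_metakv/key${i}"
          done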
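
The loop above can be wrapped in a function so that it can be rehearsed before it touches a cluster. This is a minimal sketch, not the script used in the test: the function name `create_metakv_tombstones` and the `CURL`, `CB_HOST` and `CB_AUTH` variables are hypothetical, and only the endpoint and credentials are taken from the script above.

```shell
#!/bin/sh
# Hypothetical helper: PUT then DELETE each key so every key leaves a tombstone.
# CURL may be pointed at a stub command to print the requests instead of sending them.
CURL="${CURL:-curl}"
CB_HOST="${CB_HOST:-http://localhost:8091}"
CB_AUTH="${CB_AUTH:-Administrator:password}"

create_metakv_tombstones() {
    count="$1"
    i=0
    while [ "$i" -lt "$count" ]; do
        # Stop at the first failed request instead of silently continuing.
        "$CURL" -sS -X PUT -u "$CB_AUTH" "$CB_HOST/_metakv/key$i" -d 'value=foo1' || return 1
        "$CURL" -sS -X DELETE -u "$CB_AUTH" "$CB_HOST/_metakv/key$i" || return 1
        i=$((i + 1))
    done
}
```

Calling `create_metakv_tombstones 10000` would issue 10000 PUT/DELETE pairs against `CB_HOST`.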
      

      3. Swap rebalance 6 nodes, 1 of each service, with 7.0.2 nodes. The rebalance completes successfully.
      4. Fail over 6 of the 6.6.3 nodes, 1 of each service (graceful failover for KV), upgrade these nodes to 7.0.2, recover all 6 nodes (delta recovery for KV) and rebalance.
      5. Repeat step 4 until all the nodes in the cluster are upgraded to 7.0.2.
      6. Now run the following commands to enable IPv4 only and set the encryption level to strict.

       /opt/couchbase/bin/couchbase-cli ip-family -c http://localhost:8091 -u Administrator -p password --set --ipv4only
       /opt/couchbase/bin/couchbase-cli node-to-node-encryption -c http://localhost:8091 -u Administrator -p password --enable
       /opt/couchbase/bin/couchbase-cli setting-security -c http://localhost:8091 -u Administrator -p password --set --cluster-encryption-level strict
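
The three commands in step 6 share the same connection arguments, so they can be chained to stop at the first failure. This is a sketch under assumptions: the `cb` and `enforce_strict_ipv4` function names and the `CB_CLI` and `CB_CLUSTER` variables are hypothetical, while the subcommands and flags are exactly the ones listed above.

```shell
#!/bin/sh
# Hypothetical wrapper around the three couchbase-cli calls from step 6.
# CB_CLI may be pointed at a stub command to rehearse the sequence.
CB_CLI="${CB_CLI:-/opt/couchbase/bin/couchbase-cli}"
CB_CLUSTER="${CB_CLUSTER:-http://localhost:8091}"

cb() {
    # $1 is the subcommand; the remaining arguments are passed through unchanged.
    sub="$1"
    shift
    "$CB_CLI" "$sub" -c "$CB_CLUSTER" -u Administrator -p password "$@"
}

enforce_strict_ipv4() {
    # Same order as step 6; each call runs only if the previous one succeeded.
    cb ip-family --set --ipv4only &&
    cb node-to-node-encryption --enable &&
    cb setting-security --set --cluster-encryption-level strict
}
```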
      

      7. Add new 7.0.2 nodes, remove a few 7.0.2 nodes and start a rebalance (Operation Id: 015dc7f6b30f1864adf4611a37435014). This rebalance had to be stopped and restarted due to an unrelated issue (see MB-48449). The retried rebalance (Operation Id: 2535978d0ed7e241b4a93065d1fcf79e) failed as shown below.

      ns_1@172.23.106.136 2:11:41 AM   15 Sep, 2021

      Starting rebalance, KeepNodes = ['ns_1@172.23.106.134','ns_1@172.23.106.136', 'ns_1@172.23.106.137','ns_1@172.23.106.138', 'ns_1@172.23.120.58','ns_1@172.23.120.73', 'ns_1@172.23.120.74','ns_1@172.23.120.75', 'ns_1@172.23.120.77','ns_1@172.23.120.81', 'ns_1@172.23.120.86','ns_1@172.23.121.118', 'ns_1@172.23.121.77','ns_1@172.23.123.24', 'ns_1@172.23.123.25','ns_1@172.23.123.26', 'ns_1@172.23.123.31','ns_1@172.23.123.32', 'ns_1@172.23.123.33','ns_1@172.23.96.122', 'ns_1@172.23.96.14','ns_1@172.23.96.243', 'ns_1@172.23.96.254','ns_1@172.23.96.48', 'ns_1@172.23.97.105','ns_1@172.23.97.110', 'ns_1@172.23.97.112','ns_1@172.23.97.148', 'ns_1@172.23.97.149','ns_1@172.23.97.150', 'ns_1@172.23.97.151','ns_1@172.23.97.241', 'ns_1@172.23.97.74'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 2535978d0ed7e241b4a93065d1fcf79e
      

      ns_1@172.23.97.241 2:18:22 AM 15 Sep, 2021

      Analytics Service unable to successfully rebalance d41b688310a12c6cf599bee64c6afde6 due to 'java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [79b50a33da8ff241d7aae2df002048d6], state: ACTIVE)'; see analytics_info.log for details
      

      ns_1@172.23.106.136 2:18:22 AM   15 Sep, 2021

      Rebalance exited with reason {service_rebalance_failed,cbas, {worker_died, {'EXIT',<0.14871.1636>, {rebalance_failed, {service_error, <<"Rebalance d41b688310a12c6cf599bee64c6afde6 failed: timed out waiting for all nodes to join & cluster active (missing nodes: [172.23.123.32:8091 (79b50a33da8ff241d7aae2df002048d6)], state: ACTIVE)">>}}}}}. Rebalance Operation Id = 2535978d0ed7e241b4a93065d1fcf79e
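
When triaging the failure above, the missing node's address and ID can be pulled out of the message. This is a minimal sketch: the function name `missing_nodes` is hypothetical, it handles only the form that includes the address (as in the "Rebalance exited" entry), and the comma-separated handling of several missing nodes is an assumption, since the logs here list only one.

```shell
#!/bin/sh
# Hypothetical helper: read a rebalance failure message on stdin and print
# "host:port node-id" for each entry in its "missing nodes: [...]" list.
missing_nodes() {
    # 1) keep only the text between the square brackets
    # 2) split on commas (assumed separator when several nodes are missing)
    # 3) turn "host:port (id)" into "host:port id"
    sed -n 's/.*missing nodes: \[\([^]]*\)\].*/\1/p' |
        tr ',' '\n' |
        sed -n 's/^ *\([^ ]*\) (\([0-9a-f]*\)) *$/\1 \2/p'
}
```

Fed the "Rebalance exited" entry above, this prints `172.23.123.32:8091 79b50a33da8ff241d7aae2df002048d6`.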
      

      cbcollect_info attached. This is the first time we are running this system test upgrade on 7.0.2, so there is no baseline and no last working build.

          People

            umang.agrawal Umang
            Balakumaran.Gopal Balakumaran Gopal