[System test upgrade] : Post upgrade analytics rebalance fails with "Rebalance failed: timed out waiting for all nodes to join & cluster active (missing nodes:"



      Steps to Repro
      1. Run the following longevity script on 6.6.3 for 5 days.

      ./sequoia -client -provider file:centos_second_cluster.yml -test tests/integration/test_allFeatures_madhatter_durability.yml -scope tests/integration/scope_Xattrs_Madhatter.yml -scale 3 -repeat 0 -log_level 0 -version 6.6.3-9808 -skip_setup=true -skip_test=false -skip_teardown=true -skip_cleanup=false -continue=false -collect_on_error=false -stop_on_error=false -duration=604800 -show_topology=true

      At this point the cluster should have 27 nodes (9 KV, 6 index, 3 analytics, 3 FTS, 3 eventing and 3 N1QL).
      2. Create 10k metakv tombstones. This has been part of our testing since MB-44838 was fixed. We used to create around 25k in total for CC; here we have reduced it to around 12k.

      for i in {0..10000}; do
              # Create the key, then delete it so that a tombstone is left behind in metakv
              curl -X PUT -u Administrator:password "http://localhost:8091/_metakv/key${i}" -d 'value=foo1'
              curl -X DELETE -v -u Administrator:password "http://localhost:8091/_metakv/key${i}"
      done

      3. Swap rebalance 6 nodes, 1 of each service, with 7.0.2 nodes. The rebalance completes successfully.
      4. Fail over 6 nodes (6.6.3 nodes), 1 of each service (graceful failover for KV), upgrade these nodes to 7.0.2, recover all 6 nodes (delta recovery for KV) and rebalance.
      5. Repeat step 4 until all the nodes in the cluster are upgraded to 7.0.2.
      6. Now run the following commands to enable IPv4 only and set the encryption level to strict:

       /opt/couchbase/bin/couchbase-cli ip-family -c http://localhost:8091 -u Administrator -p password --set --ipv4only
       /opt/couchbase/bin/couchbase-cli node-to-node-encryption -c http://localhost:8091 -u Administrator -p password --enable
       /opt/couchbase/bin/couchbase-cli setting-security -c http://localhost:8091 -u Administrator -p password --set --cluster-encryption-level strict

      7. Add new 7.0.2 nodes, remove a few 7.0.2 nodes and start a rebalance (Operation Id: 015dc7f6b30f1864adf4611a37435014). This rebalance had to be stopped and restarted due to an unrelated issue (see MB-48449). The retried rebalance (Operation Id: 2535978d0ed7e241b4a93065d1fcf79e) failed as shown below.

      ns_1@ 2:11:41 AM   15 Sep, 2021

      Starting rebalance, KeepNodes = ['ns_1@','ns_1@', 'ns_1@','ns_1@', 'ns_1@','ns_1@', 'ns_1@','ns_1@', 'ns_1@','ns_1@', 'ns_1@','ns_1@', 'ns_1@','ns_1@', 'ns_1@','ns_1@', 'ns_1@','ns_1@', 'ns_1@','ns_1@', 'ns_1@','ns_1@', 'ns_1@','ns_1@', 'ns_1@','ns_1@', 'ns_1@','ns_1@', 'ns_1@','ns_1@', 'ns_1@','ns_1@', 'ns_1@'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 2535978d0ed7e241b4a93065d1fcf79e

      ns_1@ 2:18:22 AM 15 Sep, 2021

      Analytics Service unable to successfully rebalance d41b688310a12c6cf599bee64c6afde6 due to 'java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [79b50a33da8ff241d7aae2df002048d6], state: ACTIVE)'; see analytics_info.log for details

      ns_1@ 2:18:22 AM   15 Sep, 2021

      Rebalance exited with reason {service_rebalance_failed,cbas, {worker_died, {'EXIT',<0.14871.1636>, {rebalance_failed, {service_error, <<"Rebalance d41b688310a12c6cf599bee64c6afde6 failed: timed out waiting for all nodes to join & cluster active (missing nodes: [ (79b50a33da8ff241d7aae2df002048d6)], state: ACTIVE)">>}}}}}. Rebalance Operation Id = 2535978d0ed7e241b4a93065d1fcf79e
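For triage, it can help to pull the missing node ID out of these messages and then search analytics_info.log for it. The following is only a minimal sketch: the helper name `missing_nodes` is hypothetical, and it assumes the message keeps the "missing nodes: [...]" shape seen in the two log lines above (with or without the extra parentheses).

```shell
#!/bin/sh
# missing_nodes (hypothetical helper): reads log text on stdin and prints each
# distinct node ID that appears inside "missing nodes: [...]".
missing_nodes() {
    # 1. keep only the text between the square brackets
    # 2. drop spaces and the optional parentheses around each ID
    # 3. split a comma-separated list into one ID per line and de-duplicate
    sed -n 's/.*missing nodes: \[\([^]]*\)\].*/\1/p' \
        | tr -d ' ()' \
        | tr ',' '\n' \
        | sort -u
}
```

One possible use, assuming the UI log lines were saved to a file named rebalance_errors.txt (a made-up name): `missing_nodes < rebalance_errors.txt` prints the ID, which can then be passed to `grep -n` against analytics_info.log from the cbcollect_info bundle.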

      cbcollect_info attached. This is the first time we are running this system test upgrade on 7.0.2, so there is no baseline and no last working build.
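For anyone reproducing steps 4 and 5 by hand, one round of the failover, upgrade, recovery and rebalance cycle might look roughly like the sketch below. It is not the automation used by the test. The helper names (`run`, `cb`, `upgrade_round`) are hypothetical, the node addresses are supplied by the caller, and two points are assumptions rather than facts from this report: that the non-KV nodes are failed over with `--hard`, and that they are recovered with `--recovery-type full`. By default the script only prints the couchbase-cli commands (DRY_RUN=1).

```shell
#!/bin/sh
CLI=/opt/couchbase/bin/couchbase-cli
CLUSTER=http://localhost:8091

# run: print the command when DRY_RUN=1 (the default), execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# cb SUBCOMMAND ARGS...: add the cluster address and credentials used in this report.
cb() {
    sub=$1; shift
    run "$CLI" "$sub" -c "$CLUSTER" -u Administrator -p password "$@"
}

# upgrade_round KV_NODE OTHER_NODE...
# One round of step 4: fail over, upgrade, recover, rebalance.
upgrade_round() {
    kv_node=$1; shift
    # KV node: graceful failover (used when --hard is not given).
    cb failover --server-failover "$kv_node"
    # Non-KV nodes: hard failover (assumption, see lead-in).
    for node in "$@"; do
        cb failover --server-failover "$node" --hard
    done
    # The package upgrade itself happens outside couchbase-cli.
    echo "# upgrade $kv_node $* to 7.0.2 now"
    # KV node: delta recovery, as in step 4.
    cb recovery --server-recovery "$kv_node" --recovery-type delta
    # Non-KV nodes: full recovery (assumption, see lead-in).
    for node in "$@"; do
        cb recovery --server-recovery "$node" --recovery-type full
    done
    cb rebalance
}
```

Calling `upgrade_round` with the KV node first and the other five nodes after it prints the command sequence for one round; setting DRY_RUN=0 would execute the commands instead, in which case the script would need a pause at the upgrade step.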


