  1. Couchbase Server
  2. MB-48468

[System test upgrade] : Post upgrade analytics rebalance fails with "Rebalance failed: timed out waiting for all nodes to join & cluster active (missing nodes:"


Details

    • Untriaged
    • Centos 64-bit
    • 1
    • No
    • CX Sprint 263

    Description

      Steps to Repro
      1. Run the following longevity script on 6.6.3 for 5 days.

      ./sequoia -client 172.23.104.254:2375 -provider file:centos_second_cluster.yml -test tests/integration/test_allFeatures_madhatter_durability.yml -scope tests/integration/scope_Xattrs_Madhatter.yml -scale 3 -repeat 0 -log_level 0 -version 6.6.3-9808 -skip_setup=true -skip_test=false -skip_teardown=true -skip_cleanup=false -continue=false -collect_on_error=false -stop_on_error=false -duration=604800 -show_topology=true
      

      At this point the cluster should have 27 nodes (9 KV, 6 index, 3 analytics, 3 FTS, 3 eventing, and 3 N1QL).
      2. Create 10k metakv tombstones. This has been part of our testing since MB-44838 was fixed. We used to have a total of around 25k for CC; we have reduced it here to around 12k.

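      #!/bin/bash
      # Brace expansion ({0..10000}) requires bash, not plain sh.
      # Each key is created and then deleted so that it leaves a metakv tombstone.
      for i in {0..10000}
          do
              curl -X PUT -u Administrator:password "http://localhost:8091/_metakv/key${i}" -d 'value=foo1'
              curl -X DELETE -v -u Administrator:password "http://localhost:8091/_metakv/key${i}"
          done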
      

      3. Swap rebalance 6 nodes, 1 of each service, with 7.0.2 nodes. The rebalance completes successfully.
      4. Fail over 6 of the 6.6.3 nodes, 1 of each service (graceful failover for KV), upgrade these nodes to 7.0.2, recover all 6 nodes (delta recovery for KV), and rebalance.
      5. Repeat step 4 until all the nodes in the cluster are upgraded to 7.0.2.
      6. Run the following commands to enable IPv4 only and set the cluster encryption level to strict:

       /opt/couchbase/bin/couchbase-cli ip-family -c http://localhost:8091 -u Administrator -p password --set --ipv4only
       /opt/couchbase/bin/couchbase-cli node-to-node-encryption -c http://localhost:8091 -u Administrator -p password --enable
       /opt/couchbase/bin/couchbase-cli setting-security -c http://localhost:8091 -u Administrator -p password --set --cluster-encryption-level strict
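
      Before starting the rebalance in step 7, it may help to confirm that all three settings took effect. The following is a minimal sketch, assuming each of the subcommands above accepts --get to print its current value. The function name show_security_settings and the CB_CLI, CB_HOST, CB_USER and CB_PASS variables are illustrative and were not part of the original test run.

```shell
#!/bin/bash
# Hypothetical helper (not from the original test): print the current value of
# each setting changed in step 6 by calling the same subcommands with --get.
# Assumption: ip-family, node-to-node-encryption and setting-security all
# support --get in the installed couchbase-cli.
CB_CLI="${CB_CLI:-/opt/couchbase/bin/couchbase-cli}"
CB_HOST="${CB_HOST:-http://localhost:8091}"
CB_USER="${CB_USER:-Administrator}"
CB_PASS="${CB_PASS:-password}"

show_security_settings() {
    for sub in ip-family node-to-node-encryption setting-security; do
        echo "== ${sub} =="
        "$CB_CLI" "$sub" -c "$CB_HOST" -u "$CB_USER" -p "$CB_PASS" --get
    done
}
```

      Calling show_security_settings on any cluster node prints one section per subcommand. The exact output format depends on the couchbase-cli version.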
      

      7. Add new 7.0.2 nodes, remove a few 7.0.2 nodes, and start a rebalance (Operation id: 015dc7f6b30f1864adf4611a37435014). This rebalance had to be stopped and restarted due to an unrelated issue (see MB-48449). The retried rebalance (Operation id: 2535978d0ed7e241b4a93065d1fcf79e) failed as shown below.

      ns_1@172.23.106.136 2:11:41 AM   15 Sep, 2021

      Starting rebalance, KeepNodes = ['ns_1@172.23.106.134','ns_1@172.23.106.136', 'ns_1@172.23.106.137','ns_1@172.23.106.138', 'ns_1@172.23.120.58','ns_1@172.23.120.73', 'ns_1@172.23.120.74','ns_1@172.23.120.75', 'ns_1@172.23.120.77','ns_1@172.23.120.81', 'ns_1@172.23.120.86','ns_1@172.23.121.118', 'ns_1@172.23.121.77','ns_1@172.23.123.24', 'ns_1@172.23.123.25','ns_1@172.23.123.26', 'ns_1@172.23.123.31','ns_1@172.23.123.32', 'ns_1@172.23.123.33','ns_1@172.23.96.122', 'ns_1@172.23.96.14','ns_1@172.23.96.243', 'ns_1@172.23.96.254','ns_1@172.23.96.48', 'ns_1@172.23.97.105','ns_1@172.23.97.110', 'ns_1@172.23.97.112','ns_1@172.23.97.148', 'ns_1@172.23.97.149','ns_1@172.23.97.150', 'ns_1@172.23.97.151','ns_1@172.23.97.241', 'ns_1@172.23.97.74'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 2535978d0ed7e241b4a93065d1fcf79e
      

      ns_1@172.23.97.241 2:18:22 AM 15 Sep, 2021

      Analytics Service unable to successfully rebalance d41b688310a12c6cf599bee64c6afde6 due to 'java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [79b50a33da8ff241d7aae2df002048d6], state: ACTIVE)'; see analytics_info.log for details
      

      ns_1@172.23.106.136 2:18:22 AM   15 Sep, 2021

      Rebalance exited with reason {service_rebalance_failed,cbas, {worker_died, {'EXIT',<0.14871.1636>, {rebalance_failed, {service_error, <<"Rebalance d41b688310a12c6cf599bee64c6afde6 failed: timed out waiting for all nodes to join & cluster active (missing nodes: [172.23.123.32:8091 (79b50a33da8ff241d7aae2df002048d6)], state: ACTIVE)">>}}}}}. Rebalance Operation Id = 2535978d0ed7e241b4a93065d1fcf79e
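
      When triaging this failure, the node the Analytics service is waiting on can be pulled out of the error text. The following is a rough sketch based only on the two message formats shown above. The function names missing_nodes and missing_node_hosts are illustrative, and the assumption that multiple missing nodes would be comma-separated is not confirmed by these logs.

```shell
#!/bin/bash
# Hypothetical triage helpers (not part of the original report).

# Reads log text on stdin and prints the contents of each
# "missing nodes: [...]" list, one line per matching input line.
missing_nodes() {
    sed -n 's/.*missing nodes: \[\([^]]*\)].*/\1/p'
}

# Prints only the first token of each entry, which is host:port when the
# message includes it, or the bare node id when it does not.
# Assumption: multiple entries, if present, are separated by commas.
missing_node_hosts() {
    missing_nodes | tr ',' '\n' | sed 's/^ *//; s/ .*$//'
}
```

      For example, piping the ns_server log through missing_node_hosts would list the nodes to inspect first.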
      

      cbcollect_info is attached. This is the first time we are running this system test upgrade on 7.0.2, so there is no baseline and no last working build.


        Activity

          Balakumaran Gopal added a comment:

          Marking this as a Blocker because I am unable to rebalance out the nodes. Any workaround would be highly appreciated.
          The cluster is available for debugging, and the links and auth details are available in the above comment.

          Ali Alsuliman added a comment:

          Hi Balakumaran Gopal,

          The rebalance (Operation id: 2535978d0ed7e241b4a93065d1fcf79e) is failing because the node at 172.23.123.32 encounters an issue when rejoining the cluster. This happens because the earlier rebalance that was removing that node (Operation id: 015dc7f6b30f1864adf4611a37435014) was canceled. The workaround is to kill the CBAS process on node 172.23.123.32, after which the rebalance should proceed.

          Wayne Siu, could we get this issue approved for 7.0.2? We have the fix ready.

          Balakumaran Gopal added a comment:

          The above workaround unblocked me. Reducing the priority of the bug.

          Couchbase Build Team added a comment:

          Build couchbase-server-7.0.2-6683 contains cbas commit 3cf8b71 with commit message:
          MB-48468: Keep http server running

          Couchbase Build Team added a comment:

          Build couchbase-server-7.1.0-1307 contains cbas commit 3cf8b71 with commit message:
          MB-48468: Keep http server running
          Umang added a comment:

          Verified with 7.0.2 build 6678

          People

            umang.agrawal Umang
            Balakumaran.Gopal Balakumaran Gopal