Couchbase Server / MB-46285

[system test upgrade] : Analytics rebalance fails with "java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active" during upgrade from 6.6.2 -> 7.0.0

Details

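    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Cheshire-Cat
    • Fix Version/s: 7.0.0
    • Component/s: analytics
    • Environment: 6.6.2-9588 -> 7.0.0-5141
    • Triage: Untriaged
    • Operating System: Centos 64-bit
    • Story Points: 1
    • Is this a Regression?: Yes
    • Sprint: CX Sprint 247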

    Description

      Scripts to Repro
      1. Run the 6.6.2 longevity test for 3 days.

      ./sequoia -client 172.23.96.162:2375 -provider file:centos_third_cluster.yml -test tests/integration/test_allFeatures_madhatter_durability.yml -scope tests/integration/scope_Xattrs_Madhatter.yml -scale 3 -repeat 0 -log_level 0 -version 6.6.2-9588 -skip_setup=false -skip_test=false -skip_teardown=true -skip_cleanup=false -continue=false -collect_on_error=false -stop_on_error=false -duration=604800 -show_topology=true
      

      2. The cluster had 27 nodes at the end of the test.
      3. Added six 7.0.0 nodes (172.23.105.102, 172.23.105.62, 172.23.106.232, 172.23.106.239, 172.23.106.37, 172.23.106.246) and rebalanced them in, then removed six 6.6.2 nodes (172.23.110.75, 172.23.110.76, 172.23.105.61, 172.23.106.191, 172.23.106.209, 172.23.106.70) and rebalanced them out.
      4. Failed over 6 nodes, then performed graceful failover + recovery + rebalance.
      5. Swap rebalanced 6 nodes (2 data + 2 index + 1 eventing + 1 analytics), as shown below.

      ns_1@172.23.105.102 11:42:57 PM 11 May, 2021

      Starting rebalance, KeepNodes = ['ns_1@172.23.104.15','ns_1@172.23.104.214',
      'ns_1@172.23.104.232','ns_1@172.23.104.244',
      'ns_1@172.23.104.245','ns_1@172.23.105.102',
      'ns_1@172.23.105.109','ns_1@172.23.105.112',
      'ns_1@172.23.105.118','ns_1@172.23.105.164',
      'ns_1@172.23.105.61','ns_1@172.23.105.62',
      'ns_1@172.23.105.90','ns_1@172.23.105.93',
      'ns_1@172.23.106.117','ns_1@172.23.106.191',
      'ns_1@172.23.106.207','ns_1@172.23.106.209',
      'ns_1@172.23.106.232','ns_1@172.23.106.239',
      'ns_1@172.23.106.246','ns_1@172.23.106.32',
      'ns_1@172.23.106.37','ns_1@172.23.106.70',
      'ns_1@172.23.110.75','ns_1@172.23.110.76'], EjectNodes = ['ns_1@172.23.106.54',
      'ns_1@172.23.105.210',
      'ns_1@172.23.105.25',
      'ns_1@172.23.105.86',
      'ns_1@172.23.105.206',
      'ns_1@172.23.106.225'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 7e7071f79333e252943a2259497d743d
      

      The above rebalance failed as shown below. This is related to MB-46246.
      ns_1@172.23.105.102 12:09:57 AM 12 May, 2021

      Rebalance exited with reason {service_rebalance_failed,eventing,
      {agent_died,<31276.23862.7>,
      {lost_connection,
      {'ns_1@172.23.106.70',shutdown}}}}.
      Rebalance Operation Id = 7e7071f79333e252943a2259497d743d
      

      I then retried the failed rebalance.
      ns_1@172.23.105.102 12:25:53 AM 12 May, 2021

      Starting rebalance, KeepNodes = ['ns_1@172.23.104.15','ns_1@172.23.104.214',
      'ns_1@172.23.104.232','ns_1@172.23.104.244',
      'ns_1@172.23.104.245','ns_1@172.23.105.102',
      'ns_1@172.23.105.109','ns_1@172.23.105.112',
      'ns_1@172.23.105.118','ns_1@172.23.105.164',
      'ns_1@172.23.105.61','ns_1@172.23.105.62',
      'ns_1@172.23.105.90','ns_1@172.23.105.93',
      'ns_1@172.23.106.117','ns_1@172.23.106.191',
      'ns_1@172.23.106.207','ns_1@172.23.106.209',
      'ns_1@172.23.106.232','ns_1@172.23.106.239',
      'ns_1@172.23.106.246','ns_1@172.23.106.32',
      'ns_1@172.23.106.37','ns_1@172.23.106.70',
      'ns_1@172.23.110.75','ns_1@172.23.110.76'], EjectNodes = ['ns_1@172.23.106.54',
      'ns_1@172.23.105.210',
      'ns_1@172.23.105.25',
      'ns_1@172.23.105.86',
      'ns_1@172.23.105.206',
      'ns_1@172.23.106.225'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = e5d19839baa473b6d0c1155448d81eeb 
      

      This rebalance hung at the indexing service for more than 6 hours, stuck at 53.69318181818181%. See MB-46274 for more details.
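A hang like this can be flagged automatically rather than noticed hours later. The following is a minimal sketch, assuming progress samples are already being collected as (timestamp, percent) pairs; the function name `is_stalled` and the sampling scheme are illustrative and not part of the test framework used here.

```python
from typing import Iterable, Tuple


def is_stalled(samples: Iterable[Tuple[float, float]], max_idle_seconds: float) -> bool:
    """Return True if the latest progress value has been unchanged for
    at least max_idle_seconds.

    samples: (timestamp_seconds, progress_percent) pairs, oldest first.
    """
    samples = list(samples)
    if not samples:
        return False

    last_time, last_progress = samples[-1]
    idle_since = last_time

    # Walk backwards over the trailing run of samples that share the
    # latest progress value, remembering the earliest one.
    for ts, progress in reversed(samples):
        if progress != last_progress:
            break
        idle_since = ts

    return (last_time - idle_since) >= max_idle_seconds


if __name__ == "__main__":
    hour = 3600
    observed = [
        (0 * hour, 10.0),
        (1 * hour, 53.69318181818181),
        (4 * hour, 53.69318181818181),
        (7 * hour, 53.69318181818181),
    ]
    print(is_stalled(observed, max_idle_seconds=6 * hour))
```

The threshold is a judgment call: too low and slow index builds are reported as hangs, too high and the run wastes hours before anyone intervenes.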

      To proceed with the upgrade of the entire cluster, I stopped the above rebalance and retried it. The retried rebalance failed as shown below.

      ns_1@172.23.105.102 7:58:38 PM 12 May, 2021

      Rebalance exited with reason {service_rebalance_failed,cbas,
      {worker_died,
      {'EXIT',<0.25904.1315>,
      {rebalance_failed,
      {service_error,
      <<"Rebalance f4cda0e6b5a1cc69f95ea635aaaf4942 failed: timed out waiting for all nodes to join & cluster active (missing nodes: [e6f0383d4902ece226bc1f2329d23993], state: ACTIVE)">>}}}}}.
      Rebalance Operation Id = a5f54b861372ce2c5a86d6a1f34d8daa
      

       

      At exactly the same time, the analytics service reported the failure shown below, which I believe caused the above rebalance to fail.
      ns_1@172.23.106.209 7:58:38 PM 12 May, 2021

      Analytics Service unable to successfully rebalance f4cda0e6b5a1cc69f95ea635aaaf4942 due to 'java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [e6f0383d4902ece226bc1f2329d23993], state: ACTIVE)'; see analytics_info.log for details
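When triaging many such failures, it can help to pull the rebalance ID, the missing node IDs, and the reported state out of the message programmatically. The sketch below assumes the message wording matches the two log lines quoted above; the function name `parse_join_timeout` and the returned dictionary layout are illustrative only.

```python
import re
from typing import Any, Dict, Optional

# Matches the timeout text exactly as it appears in the log lines above.
_TIMEOUT = re.compile(
    r"timed out waiting for all nodes to join & cluster active "
    r"\(missing nodes: \[(?P<missing>[^\]]*)\], state: (?P<state>\w+)\)"
)
# Rebalance IDs in these messages are 32 lowercase hex characters.
_REBALANCE_ID = re.compile(r"[Rr]ebalance (?P<id>[0-9a-f]{32})")


def parse_join_timeout(message: str) -> Optional[Dict[str, Any]]:
    """Extract details from a 'timed out waiting for all nodes to join'
    message, or return None if the message is something else."""
    match = _TIMEOUT.search(message)
    if match is None:
        return None

    missing = [n.strip() for n in match.group("missing").split(",") if n.strip()]
    rebalance = _REBALANCE_ID.search(message)
    return {
        "rebalance_id": rebalance.group("id") if rebalance else None,
        "missing_nodes": missing,
        "state": match.group("state"),
    }


if __name__ == "__main__":
    line = (
        "Analytics Service unable to successfully rebalance "
        "f4cda0e6b5a1cc69f95ea635aaaf4942 due to "
        "'java.lang.IllegalStateException: timed out waiting for all nodes "
        "to join & cluster active (missing nodes: "
        "[e6f0383d4902ece226bc1f2329d23993], state: ACTIVE)'; "
        "see analytics_info.log for details"
    )
    print(parse_join_timeout(line))
```

Note that the missing node is reported as an ID rather than an IP address, so mapping it back to a host still requires the analytics logs.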
      

      cbcollect_info attached. This was not seen during the upgrade from 6.6.2-9588 to 7.0.0-5033.

          Activity

            michael.blow Michael Blow added a comment -

            This is a duplicate of MB-45869. The problematic node (172.23.105.210) is still 6.6.2, so it does not have the fix.

            A workaround would be to restart Couchbase Server, or run killall -9 cbas, on 172.23.105.210 to bring the analytics service on that node back to life. Alternatively, you could fail over the node and offline upgrade it to 7.0.0.
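For illustration only, the restart-style workaround could be scripted roughly as follows. This sketch only builds the command and does not run it. The helper name `workaround_command`, the use of ssh, and the `couchbase-server` systemd service name are assumptions rather than part of the comment above; only `killall -9 cbas` comes from it.

```python
import shlex
from typing import List


def workaround_command(host: str, restart_server: bool = False) -> List[str]:
    """Build (but do not execute) an ssh argv for the workaround.

    restart_server=False -> kill the analytics process (from the comment).
    restart_server=True  -> restart the whole server (service name assumed).
    """
    if restart_server:
        remote = ["systemctl", "restart", "couchbase-server"]  # assumed service name
    else:
        remote = ["killall", "-9", "cbas"]  # taken from the comment above
    # ssh takes the remote command as a single string argument.
    return ["ssh", host, shlex.join(remote)]


if __name__ == "__main__":
    # Dry run: print the command instead of executing it.
    print(workaround_command("172.23.105.210"))
```

Passing the result to `subprocess.run` would execute it. That step is deliberately left out here because both commands interrupt a live service.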


            mihir.kamdar Mihir Kamdar (Inactive) added a comment -

            Closing all non-fixed issues

            michael.blow Michael Blow added a comment -

            I had an epiphany that I believe we can handle this in the upgrade case as well, reopening.

            michael.blow Michael Blow added a comment -

            This should no longer occur once there is a build w/ the associated patch in, even in upgrade scenarios (6.0, 6.5, 6.6).


            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.0-5170 contains cbas commit 1c9bd24 with commit message:
            MB-46285: prevent unupgraded nodes from experiencing MB-45869

            Balakumaran.Gopal Balakumaran Gopal added a comment -

            Marking this closed, as the upgrade from 6.6.2-9588 -> 7.0.0-5226 completed successfully.

            People

              Assignee: Balakumaran Gopal
              Reporter: Balakumaran Gopal