Couchbase Server / MB-50545

[System test upgrade] - Multi node failover + recovery fails post upgrade to 7.1 with "java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active"

Details

    • Untriaged
    • Centos 64-bit
    • 1
    • No
    • CX Sprint 280

    Description

      Steps to Repro
      1. Run the 6.6.5 longevity test for 5-6 days:

      ./sequoia -client 172.23.104.254:2375 -provider file:centos_second_cluster.yml -test tests/integration/test_allFeatures_madhatter_durability.yml -scope tests/integration/scope_Xattrs_Madhatter.yml -scale 3 -repeat 0 -log_level 0 -version 6.6.5-10076 -skip_setup=false -skip_test=false -skip_teardown=true -skip_cleanup=false -continue=false -collect_on_error=false -stop_on_error=false -duration=604800 -show_topology=true
      

      2. Online upgrade to 7.1 using swap rebalance and graceful failover/recovery strategies.
      3. Failed over 2 analytics nodes (172.23.120.77 and 172.23.120.73).

      Starting failover of nodes ['ns_1@172.23.120.77']. Operation Id = 937473f9db74d764c3af79f332cf9fc5
      Starting failover of nodes ['ns_1@172.23.120.73']. Operation Id = 5d6af9c1f2c1162b5014b499de159f3b
      

      4. Performed a full recovery of those 2 nodes, then started a rebalance.
      172.23.120.197 9:28:24 PM 23 Jan, 2022

      Starting rebalance, KeepNodes = ['ns_1@172.23.106.134','ns_1@172.23.106.136',
      'ns_1@172.23.106.137','ns_1@172.23.106.138',
      'ns_1@172.23.120.197','ns_1@172.23.120.58',
      'ns_1@172.23.120.73','ns_1@172.23.120.74',
      'ns_1@172.23.120.75','ns_1@172.23.120.77',
      'ns_1@172.23.120.81','ns_1@172.23.120.86',
      'ns_1@172.23.121.118','ns_1@172.23.121.77',
      'ns_1@172.23.123.26','ns_1@172.23.123.31',
      'ns_1@172.23.123.32','ns_1@172.23.123.33',
      'ns_1@172.23.96.14','ns_1@172.23.96.243',
      'ns_1@172.23.96.254','ns_1@172.23.97.105',
      'ns_1@172.23.97.110','ns_1@172.23.97.112',
      'ns_1@172.23.97.148','ns_1@172.23.97.149',
      'ns_1@172.23.97.151'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 1601627790a0b959e180c49e12e57399
      

      Rebalance failed as shown below.
      172.23.120.73 9:30:51 PM 23 Jan, 2022

      Analytics Service unable to successfully rebalance 1a44f742b97e16491ad11c1f826ffd7c due to 'java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [e4922efbaa722c9ef303d72aee04090c], state: UNUSABLE)'; see analytics_info.log for details
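
      For triage, the missing node IDs and the cluster state can be extracted from this message programmatically. The following is a minimal sketch, assuming the message keeps the "(missing nodes: [...], state: ...)" suffix seen in this report; the function name parse_rebalance_error is hypothetical and not part of any Couchbase tool.

```python
import re

# Assumption: the error ends with "(missing nodes: [id, id], state: NAME)",
# as in the message quoted in this report.
_PATTERN = re.compile(r"missing nodes: \[([^\]]*)\], state: (\w+)")


def parse_rebalance_error(message):
    """Return (missing_node_ids, cluster_state), or None if the suffix is absent."""
    match = _PATTERN.search(message)
    if match is None:
        return None
    # Split the bracketed list on commas and drop empty fragments.
    ids = [part.strip() for part in match.group(1).split(",") if part.strip()]
    return ids, match.group(2)


message = (
    "java.lang.IllegalStateException: timed out waiting for all nodes to join "
    "& cluster active (missing nodes: [e4922efbaa722c9ef303d72aee04090c], "
    "state: UNUSABLE)"
)
print(parse_rebalance_error(message))
```

      The returned IDs are Analytics node IDs, not host addresses, so they still have to be matched against the node list in the rebalance request logged in analytics_info.log.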
      

      172.23.120.73 analytics_info.log

      2022-01-23T23:33:08.670-08:00 ERRO CBAS.rebalance.Rebalance [Executor-16:ClusterController] Rebalance 1cd8195171ae0b6c9d9c941492bdb344 failed
      java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [e4922efbaa722c9ef303d72aee04090c], state: UNUSABLE)
              at com.couchbase.analytics.control.rebalance.Rebalance.ensureNodesClusterActive(Rebalance.java:526) ~[cbas-server-7.1.0-2117.jar:7.1.0-2117]
              at com.couchbase.analytics.control.rebalance.Rebalance.adjustClusterBeforeRebalance(Rebalance.java:683) ~[cbas-server-7.1.0-2117.jar:7.1.0-2117]
              at com.couchbase.analytics.control.rebalance.Rebalance.doRebalance(Rebalance.java:205) ~[cbas-server-7.1.0-2117.jar:7.1.0-2117]
              at com.couchbase.analytics.control.rebalance.Rebalance.doCall(Rebalance.java:166) [cbas-server-7.1.0-2117.jar:7.1.0-2117]
              at com.couchbase.analytics.control.rebalance.Rebalance.doCall(Rebalance.java:84) [cbas-server-7.1.0-2117.jar:7.1.0-2117]
              at com.couchbase.analytics.runtime.WriteLockCallable.call(WriteLockCallable.java:27) [cbas-connector-7.1.0-2117.jar:7.1.0-2117]
              at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
              at java.lang.Thread.run(Thread.java:829) [?:?]
      2022-01-23T23:33:08.671-08:00 WARN CBAS.rebalance.Rebalance [Executor-16:ClusterController] exit Rebalance 1cd8195171ae0b6c9d9c941492bdb344
      2022-01-23T23:33:08.671-08:00 INFO CBAS.rebalance.RebalanceProgress [Executor-18:ClusterController] dataset size fetcher interrupted
      2022-01-23T23:33:08.904-08:00 ERRO CBAS.servlet.RebalanceServlet [HttpExecutor(port:9111)-15] Rebalance 1cd8195171ae0b6c9d9c941492bdb344 failed
      java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [e4922efbaa722c9ef303d72aee04090c], state: UNUSABLE)
              at com.couchbase.analytics.control.rebalance.Rebalance.ensureNodesClusterActive(Rebalance.java:526) ~[cbas-server-7.1.0-2117.jar:7.1.0-2117]
              at com.couchbase.analytics.control.rebalance.Rebalance.adjustClusterBeforeRebalance(Rebalance.java:683) ~[cbas-server-7.1.0-2117.jar:7.1.0-2117]
              at com.couchbase.analytics.control.rebalance.Rebalance.doRebalance(Rebalance.java:205) ~[cbas-server-7.1.0-2117.jar:7.1.0-2117]
              at com.couchbase.analytics.control.rebalance.Rebalance.doCall(Rebalance.java:166) ~[cbas-server-7.1.0-2117.jar:7.1.0-2117]
              at com.couchbase.analytics.control.rebalance.Rebalance.doCall(Rebalance.java:84) ~[cbas-server-7.1.0-2117.jar:7.1.0-2117]
              at com.couchbase.analytics.runtime.WriteLockCallable.call(WriteLockCallable.java:27) ~[cbas-connector-7.1.0-2117.jar:7.1.0-2117]
              at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
              at java.lang.Thread.run(Thread.java:829) [?:?]
      2022-01-23T23:33:08.921-08:00 INFO CBAS.cbas requesting isBalanced for 1cd8195171ae0b6c9d9c941492bdb344 from driver
      2022-01-23T23:33:08.923-08:00 INFO CBAS.servlet.RebalanceServlet [HttpExecutor(port:9111)-1] +post request: {"nodes":[{"nodeId":"4914cd856897180e302068cb33eb6642","priority":1970329131941888,"opaque":{"cbas-version":"7.1.0-2117","cc-http-port":"9111","controller-id":"43","host":"172.23.106.138","ns-server-port":"8091","num-iodevices":"8","starting-partition-id":"344","svc-http-port":"8095"}},{"nodeId":"5ffd65353607e8c3ef50cd4240c725ed","priority":1970329131942144,"opaque":{"cbas-version":"6.6.5-10076","cc-http-port":"9111","controller-id":"41","host":"172.23.120.73","ns-server-port":"8091","num-iodevices":"8","starting-partition-id":"328","svc-http-port":"8095"}},{"nodeId":"e4922efbaa722c9ef303d72aee04090c","priority":1970329131941888,"opaque":{"cbas-version":"6.6.5-10076","cc-http-port":"9111","controller-id":"42","host":"172.23.120.77","ns-server-port":"8091","num-iodevices":"8","starting-partition-id":"336","svc-http-port":"8095"}}],"id":"1cd8195171ae0b6c9d9c941492bdb344","type":"topology-change-rebalance","ccNodeId":"5ffd65353607e8c3ef50cd4240c725ed","metadataNodeId":"5ffd65353607e8c3ef50cd4240c725ed","metadataPartition":-1,"rev":507,"configVersion":1,"balanceState":"unknown","keepNodesUpdated":false,"keepNodes":["4914cd856897180e302068cb33eb6642","5ffd65353607e8c3ef50cd4240c725ed","e4922efbaa722c9ef303d72aee04090c"],"inPlaceNumReplicas":0}
       
      2022-01-23T23:33:08.923-08:00 INFO CBAS.rebalance.TopologyManager [HttpExecutor(port:9111)-1] found missing masters in partitions topology
      2022-01-23T23:33:08.923-08:00 INFO CBAS.rebalance.Rebalance [HttpExecutor(port:9111)-1] latest partitions topology is not balanced
      2022-01-23T23:33:08.924-08:00 INFO CBAS.cbas setting balanced state to unbalanced for 1cd8195171ae0b6c9d9c941492bdb344
      2022-01-23T23:33:08.926-08:00 INFO CBAS.cbas updating balance state unbalanced for 1cd8195171ae0b6c9d9c941492bdb344
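
      The node ID reported as missing can be matched to a host using the "+post request" line above. The sketch below is illustrative only: the JSON is trimmed to the nodeId, host, cbas-version, and ccNodeId fields copied from that log line, and the helper name describe_nodes is hypothetical. In this log, the missing node e4922efbaa722c9ef303d72aee04090c resolves to 172.23.120.77, one of the two recovered nodes, and both recovered nodes carry cbas-version 6.6.5-10076 in the request.

```python
import json

# Trimmed copy of the "+post request" payload from analytics_info.log;
# only the fields needed for the lookup are kept.
request = json.loads("""
{"nodes": [
  {"nodeId": "4914cd856897180e302068cb33eb6642",
   "opaque": {"cbas-version": "7.1.0-2117", "host": "172.23.106.138"}},
  {"nodeId": "5ffd65353607e8c3ef50cd4240c725ed",
   "opaque": {"cbas-version": "6.6.5-10076", "host": "172.23.120.73"}},
  {"nodeId": "e4922efbaa722c9ef303d72aee04090c",
   "opaque": {"cbas-version": "6.6.5-10076", "host": "172.23.120.77"}}
 ],
 "ccNodeId": "5ffd65353607e8c3ef50cd4240c725ed"}
""")


def describe_nodes(request, missing_ids):
    """Summarize each node: host, reported version, CC role, and missing flag."""
    missing = set(missing_ids)
    rows = []
    for node in request["nodes"]:
        rows.append({
            "node_id": node["nodeId"],
            "host": node["opaque"]["host"],
            "cbas_version": node["opaque"]["cbas-version"],
            "is_cc": node["nodeId"] == request["ccNodeId"],
            "missing": node["nodeId"] in missing,
        })
    return rows


rows = describe_nodes(request, ["e4922efbaa722c9ef303d72aee04090c"])
for row in rows:
    print(row)
```

      The same lookup can be run against the full payload; the trimmed copy is used here only to keep the example short.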
      

      Retried the failed rebalance multiple times. It still did not succeed, and the Rebalance button remains enabled.

      cbcollect_info attached.


            People

              Balakumaran.Gopal Balakumaran Gopal
              Balakumaran.Gopal Balakumaran Gopal
