Couchbase Server / MB-56955

Analytics Service unable to successfully rebalance 4c95651b68cfdfc22d0c1d306ff4d7c1 due to 'java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [453dcde5201e809268a0df89fe474ebe], state: UNUSABLE)'


Details

    • Untriaged
    • Centos 64-bit
    • 0
    • Unknown
    • Analytics Sprint 20

    Description

      Steps to Repro
      1. Run a longevity test on 7.1.4 for 2 days.

      ./sequoia -client 172.23.104.27:2375 -provider file:centos_pine.yml -test tests/integration/neo/test_neo.yml -scope tests/integration/neo/scope_neo_magma.yml -scale 3 -repeat 0 -log_level 0 -version 7.1.4-3601 -skip_setup=false -skip_test=false -skip_teardown=true -skip_cleanup=false -continue=false -collect_on_error=false -stop_on_error=false -duration=604800 -show_topology=true
      

      2. Upgrade to 7.2.0-5324 using an online upgrade with the failover/recovery strategy.
      3. Enable CDC on all buckets and on some collections after the upgrade.
      4. Hard fail over nodes (one of each service type), perform a full recovery, and rebalance. The rebalance succeeds overall, but the Analytics side of the rebalance reports failures. This was tried a couple of times with the same result.
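      The report does not say how step 4 was driven (UI, CLI, or the test framework). As a rough sketch only, assuming the cluster manager REST endpoints /controller/failOver, /controller/setRecoveryType and /controller/rebalance are used, the request sequence for one node might look like the following. The function name failover_recovery_requests is hypothetical, and the sketch only builds the requests; it does not send them or wait for each step to finish.

```python
from urllib.parse import urlencode


def failover_recovery_requests(target_node, known_nodes):
    """Build the ordered (path, form_body) pairs for one node.

    Assumption: hard failover, full recovery and rebalance are issued as
    form-encoded POSTs to the cluster manager. Nothing is sent here; the
    caller would POST each body with admin credentials and wait for the
    previous step to complete before issuing the next one.
    """
    if target_node not in known_nodes:
        raise ValueError(f"{target_node} is not in the known node list")

    return [
        # Step 4a: hard failover of the chosen node.
        ("/controller/failOver", urlencode({"otpNode": target_node})),
        # Step 4b: mark the failed-over node for full recovery.
        (
            "/controller/setRecoveryType",
            urlencode({"otpNode": target_node, "recoveryType": "full"}),
        ),
        # Step 4c: rebalance with every node kept and none ejected,
        # matching "EjectNodes = []" in the messages quoted in this report.
        (
            "/controller/rebalance",
            urlencode({"knownNodes": ",".join(known_nodes), "ejectedNodes": ""}),
        ),
    ]
```

      Each pair would be posted to the cluster manager on one of the cluster nodes (port 8091 is assumed here), once per failed-over node.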

      172.23.120.75 9:14:07 PM 15 May, 2023

      Starting rebalance, KeepNodes = ['ns_1@172.23.120.58','ns_1@172.23.120.73',
      'ns_1@172.23.120.74','ns_1@172.23.120.75',
      'ns_1@172.23.120.77','ns_1@172.23.120.81',
      'ns_1@172.23.120.86','ns_1@172.23.121.77',
      'ns_1@172.23.123.25','ns_1@172.23.123.26',
      'ns_1@172.23.123.31','ns_1@172.23.123.32',
      'ns_1@172.23.123.33','ns_1@172.23.96.122',
      'ns_1@172.23.96.243','ns_1@172.23.96.254',
      'ns_1@172.23.96.48','ns_1@172.23.97.105',
      'ns_1@172.23.97.110','ns_1@172.23.97.112',
      'ns_1@172.23.97.148','ns_1@172.23.97.241',
      'ns_1@172.23.97.74'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = ff9177e6beb670e1b5e19414fccf4d3
      

      172.23.120.86 10:57:16 PM 15 May, 2023

      Analytics Service unable to successfully rebalance 9d3bc1e5cf6f0a1cca0b06e36ea29f36 due to 'java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [453dcde5201e809268a0df89fe474ebe], state: UNUSABLE)'; see analytics_info.log for details
      

      172.23.120.75 10:41:01 AM 16 May, 2023

      Starting rebalance, KeepNodes = ['ns_1@172.23.120.58','ns_1@172.23.120.73',
      'ns_1@172.23.120.74','ns_1@172.23.120.75',
      'ns_1@172.23.120.77','ns_1@172.23.120.81',
      'ns_1@172.23.120.86','ns_1@172.23.121.77',
      'ns_1@172.23.123.25','ns_1@172.23.123.26',
      'ns_1@172.23.123.31','ns_1@172.23.123.32',
      'ns_1@172.23.123.33','ns_1@172.23.96.122',
      'ns_1@172.23.96.243','ns_1@172.23.96.254',
      'ns_1@172.23.96.48','ns_1@172.23.97.105',
      'ns_1@172.23.97.110','ns_1@172.23.97.112',
      'ns_1@172.23.97.148','ns_1@172.23.97.241',
      'ns_1@172.23.97.74'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 4cc40c996dae66055c2073dafe54ce53
      

      172.23.120.86 11:22:14 AM 16 May, 2023

      Analytics Service unable to successfully rebalance 4c95651b68cfdfc22d0c1d306ff4d7c1 due to 'java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [453dcde5201e809268a0df89fe474ebe], state: UNUSABLE)'; see analytics_info.log for details
      

      analytics_info.log(172.23.120.86)

      2023-05-15T22:57:06.246-07:00 INFO CBAS.work.NotifyShutdownWork [Worker:ClusterController] Received unsolicted shutdown notification from node 453dcde5201e809268a0df89fe474ebe
      2023-05-15T22:57:16.021-07:00 ERRO CBAS.rebalance.Rebalance [Executor-187:ClusterController] Rebalance 9d3bc1e5cf6f0a1cca0b06e36ea29f36 failed
      java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [453dcde5201e809268a0df89fe474ebe], state: UNUSABLE)
              at com.couchbase.analytics.control.rebalance.Rebalance.ensureNodesClusterActive(Rebalance.java:535) ~[cbas-server-7.2.0-5324.jar:7.2.0-5324]
              at com.couchbase.analytics.control.rebalance.Rebalance.adjustClusterBeforeRebalance(Rebalance.java:692) ~[cbas-server-7.2.0-5324.jar:7.2.0-5324]
              at com.couchbase.analytics.control.rebalance.Rebalance.doRebalance(Rebalance.java:205) ~[cbas-server-7.2.0-5324.jar:7.2.0-5324]
              at com.couchbase.analytics.control.rebalance.Rebalance.doCall(Rebalance.java:166) ~[cbas-server-7.2.0-5324.jar:7.2.0-5324]
              at com.couchbase.analytics.control.rebalance.Rebalance.doCall(Rebalance.java:84) ~[cbas-server-7.2.0-5324.jar:7.2.0-5324]
              at com.couchbase.analytics.runtime.WriteLockCallable.call(WriteLockCallable.java:27) ~[cbas-connector-7.2.0-5324.jar:7.2.0-5324]
              at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
              at java.lang.Thread.run(Thread.java:829) ~[?:?]
      2023-05-15T22:57:16.021-07:00 WARN CBAS.rebalance.Rebalance [Executor-187:ClusterController] exit Rebalance 9d3bc1e5cf6f0a1cca0b06e36ea29f36
      2023-05-15T22:57:16.021-07:00 INFO CBAS.rebalance.RebalanceProgress [Executor-188:ClusterController] dataset size fetcher interrupted
      2023-05-15T22:57:16.349-07:00 ERRO CBAS.servlet.RebalanceServlet [HttpExecutor(port:9111)-5] Rebalance 9d3bc1e5cf6f0a1cca0b06e36ea29f36 failed
      java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [453dcde5201e809268a0df89fe474ebe], state: UNUSABLE)
              at com.couchbase.analytics.control.rebalance.Rebalance.ensureNodesClusterActive(Rebalance.java:535) ~[cbas-server-7.2.0-5324.jar:7.2.0-5324]
              at com.couchbase.analytics.control.rebalance.Rebalance.adjustClusterBeforeRebalance(Rebalance.java:692) ~[cbas-server-7.2.0-5324.jar:7.2.0-5324]
              at com.couchbase.analytics.control.rebalance.Rebalance.doRebalance(Rebalance.java:205) ~[cbas-server-7.2.0-5324.jar:7.2.0-5324]
              at com.couchbase.analytics.control.rebalance.Rebalance.doCall(Rebalance.java:166) ~[cbas-server-7.2.0-5324.jar:7.2.0-5324]
              at com.couchbase.analytics.control.rebalance.Rebalance.doCall(Rebalance.java:84) ~[cbas-server-7.2.0-5324.jar:7.2.0-5324]
              at com.couchbase.analytics.runtime.WriteLockCallable.call(WriteLockCallable.java:27) ~[cbas-connector-7.2.0-5324.jar:7.2.0-5324]
              at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
              at java.lang.Thread.run(Thread.java:829) ~[?:?]
      2023-05-15T22:57:16.433-07:00 INFO CBAS.cbas requesting isBalanced for 9d3bc1e5cf6f0a1cca0b06e36ea29f36 from driver
      ....
      ....
      ....
      2023-05-16T11:22:14.865-07:00 ERRO CBAS.rebalance.Rebalance [Executor-201:ClusterController] Rebalance 4c95651b68cfdfc22d0c1d306ff4d7c1 failed
      java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [453dcde5201e809268a0df89fe474ebe], state: UNUSABLE)
              at com.couchbase.analytics.control.rebalance.Rebalance.ensureNodesClusterActive(Rebalance.java:535) ~[cbas-server-7.2.0-5324.jar:7.2.0-5324]
              at com.couchbase.analytics.control.rebalance.Rebalance.adjustClusterBeforeRebalance(Rebalance.java:692) ~[cbas-server-7.2.0-5324.jar:7.2.0-5324]
              at com.couchbase.analytics.control.rebalance.Rebalance.doRebalance(Rebalance.java:205) ~[cbas-server-7.2.0-5324.jar:7.2.0-5324]
              at com.couchbase.analytics.control.rebalance.Rebalance.doCall(Rebalance.java:166) ~[cbas-server-7.2.0-5324.jar:7.2.0-5324]
              at com.couchbase.analytics.control.rebalance.Rebalance.doCall(Rebalance.java:84) ~[cbas-server-7.2.0-5324.jar:7.2.0-5324]
              at com.couchbase.analytics.runtime.WriteLockCallable.call(WriteLockCallable.java:27) ~[cbas-connector-7.2.0-5324.jar:7.2.0-5324]
              at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
              at java.lang.Thread.run(Thread.java:829) ~[?:?]
      2023-05-16T11:22:14.865-07:00 WARN CBAS.rebalance.Rebalance [Executor-201:ClusterController] exit Rebalance 4c95651b68cfdfc22d0c1d306ff4d7c1
      2023-05-16T11:22:14.865-07:00 INFO CBAS.rebalance.RebalanceProgress [Executor-202:ClusterController] dataset size fetcher interrupted
      2023-05-16T11:22:15.244-07:00 ERRO CBAS.servlet.RebalanceServlet [HttpExecutor(port:9111)-3] Rebalance 4c95651b68cfdfc22d0c1d306ff4d7c1 failed
      java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [453dcde5201e809268a0df89fe474ebe], state: UNUSABLE)
              at com.couchbase.analytics.control.rebalance.Rebalance.ensureNodesClusterActive(Rebalance.java:535) ~[cbas-server-7.2.0-5324.jar:7.2.0-5324]
              at com.couchbase.analytics.control.rebalance.Rebalance.adjustClusterBeforeRebalance(Rebalance.java:692) ~[cbas-server-7.2.0-5324.jar:7.2.0-5324]
              at com.couchbase.analytics.control.rebalance.Rebalance.doRebalance(Rebalance.java:205) ~[cbas-server-7.2.0-5324.jar:7.2.0-5324]
              at com.couchbase.analytics.control.rebalance.Rebalance.doCall(Rebalance.java:166) ~[cbas-server-7.2.0-5324.jar:7.2.0-5324]
              at com.couchbase.analytics.control.rebalance.Rebalance.doCall(Rebalance.java:84) ~[cbas-server-7.2.0-5324.jar:7.2.0-5324]
              at com.couchbase.analytics.runtime.WriteLockCallable.call(WriteLockCallable.java:27) ~[cbas-connector-7.2.0-5324.jar:7.2.0-5324]
              at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
      

      cbcollect_info attached.
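
      To check whether every failed Analytics rebalance in analytics_info.log names the same missing node, a small parser along these lines could help. It is a minimal sketch that assumes the line layout shown in the excerpt above (an ERRO line from CBAS.rebalance.Rebalance followed by the exception message). The function name summarize_rebalance_failures is hypothetical.

```python
import re

# Both patterns are derived from the log lines quoted in this report;
# other builds may format these messages differently.
FAILURE_RE = re.compile(
    r"^(?P<ts>\S+) ERRO CBAS\.rebalance\.Rebalance \[[^\]]*\] "
    r"Rebalance (?P<rebalance_id>[0-9a-f]+) failed$"
)
MISSING_RE = re.compile(
    r"missing nodes: \[(?P<nodes>[^\]]*)\], state: (?P<state>\w+)"
)


def summarize_rebalance_failures(lines):
    """Return one dict per failed rebalance reported by CBAS.rebalance.Rebalance.

    Each dict holds the timestamp, the rebalance id, the missing node ids
    and the cluster state taken from the exception message that follows.
    The duplicate report from CBAS.servlet.RebalanceServlet is skipped.
    """
    results = []
    pending = None
    for raw in lines:
        line = raw.strip()
        failure = FAILURE_RE.match(line)
        if failure:
            pending = {
                "timestamp": failure["ts"],
                "rebalance_id": failure["rebalance_id"],
                "missing_nodes": [],
                "state": None,
            }
            results.append(pending)
            continue
        # Attach the first "missing nodes" message seen after a failure line.
        if pending is not None and pending["state"] is None:
            missing = MISSING_RE.search(line)
            if missing:
                pending["missing_nodes"] = [
                    node.strip()
                    for node in missing["nodes"].split(",")
                    if node.strip()
                ]
                pending["state"] = missing["state"]
    return results
```

      Passing an open file handle for analytics_info.log works, since the function only iterates over lines. Comparing the missing_nodes lists across the returned entries shows whether the same node id is blocking each attempt.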

            People

              Balakumaran Gopal (Balakumaran.Gopal)