  Couchbase Server / MB-45869

[System Test][Analytics] Rebalance failed with error - java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [xxxx], state: ACTIVE)


Details

    • Bug
    • Status: Closed
    • Critical
    • Resolution: Fixed
    • Cheshire-Cat
    • 7.0.0
    • analytics
    • Untriaged
    • 1
    • Unknown
    • CX Sprint 246

    Description

      7.0.0-4955

      Test:
      -test tests/integration/cheshirecat/test_cheshirecat_kv_gsi_coll_xdcr_backup_sgw_fts_itemct_txns_eventing_cbas_scale3.yml -scope tests/integration/cheshirecat/scope_cheshirecat_with_backup.yml
      Scale 3
      Iteration 3

      .108.103 diag.log:

      2021-04-21T15:23:09.425-07:00, ns_vbucket_mover:0:info:message(ns_1@172.23.104.137) - Bucket "default" rebalance appears to be swap rebalance
      2021-04-21T15:23:11.139-07:00, ns_node_disco:4:info:node up(ns_1@172.23.104.67) - Node 'ns_1@172.23.104.67' saw that node 'ns_1@172.23.105.111' came up. Tags: [] (repeated 1 times, last seen 61.342812 secs ago)
      2021-04-21T15:23:17.661-07:00, ns_node_disco:4:info:node up(ns_1@172.23.105.107) - Node 'ns_1@172.23.105.107' saw that node 'ns_1@172.23.105.111' came up. Tags: [] (repeated 1 times, last seen 68.368038 secs ago)
      2021-04-21T15:23:18.407-07:00, ns_node_disco:4:info:node up(ns_1@172.23.104.157) - Node 'ns_1@172.23.104.157' saw that node 'ns_1@172.23.105.111' came up. Tags: [] (repeated 1 times, last seen 69.015704 secs ago)
      2021-04-22T05:58:24.082-07:00, analytics:0:warning:message(ns_1@172.23.104.157) - Analytics Service unable to successfully rebalance 60133c5ac910f395614e1c0d59647e48 due to 'java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [1151f6cfc86035b5768e4a5ab36bbb8f], state: ACTIVE)'; see analytics_info.log for details
      2021-04-22T05:58:24.588-07:00, ns_orchestrator:0:critical:message(ns_1@172.23.104.137) - Rebalance exited with reason {service_rebalance_failed,cbas,
                                    {worker_died,
                                     {'EXIT',<0.28306.2352>,
                                      {rebalance_failed,
                                       {service_error,
                                        <<"Rebalance 60133c5ac910f395614e1c0d59647e48 failed: timed out waiting for all nodes to join & cluster active (missing nodes: [1151f6cfc86035b5768e4a5ab36bbb8f], state: ACTIVE)">>}}}}}.
      Rebalance Operation Id = 759b7fb5e9eb86629ddf589999b711a4
      

      Console:

      [2021-04-21T15:21:49-07:00, sequoiatools/couchbase-cli:7.0:1d0175] server-add -c 172.23.108.103:8091 --server-add https://172.23.105.111 -u Administrator -p password --server-add-username Administrator --server-add-password password --services index
      [2021-04-21T15:22:23-07:00, sequoiatools/couchbase-cli:7.0:8b292a] rebalance -c 172.23.108.103:8091 -u Administrator -p password
       
      Error occurred on container - sequoiatools/couchbase-cli:7.0:[rebalance -c 172.23.108.103:8091 -u Administrator -p password]
       
      docker logs 8b292a
      docker start 8b292a
       
      *Unable to display progress bar on this os
      JERROR: Rebalance failed. See logs for detailed reason. You can try again.
      [2021-04-22T05:58:31-07:00, sequoiatools/cmd:a0f7ff] 60
      

      .104.157 analytics_info.log:

      2021-04-22T05:58:22.850-07:00 INFO CBAS.server.QueryServiceServlet [HttpExecutor(port:8095)-10] handleRequest: <ud>{"host":"172.23.104.157:8095","path":"/query/service","statement":"select avg(price) as AvgPrice, min(price) as MinPrice, max(price) as MaxPrice from dv_8.ds_29 where free_breakfast=True and free_parking=True and price is not null and array_count(public_likes)>5 and `type`='Hotel' group by country limit 100;","pretty":true,"mode":"immediate","clientContextID":"query_thread_129617","format":"CLEAN_JSON","timeout":3600000,"maxResultReads":1,"planFormat":"JSON","expressionTree":false,"rewrittenExpressionTree":false,"logicalPlan":false,"optimizedLogicalPlan":false,"job":false,"profile":"counts","signature":true,"multiStatement":true,"parseOnly":false,"readOnly":false,"maxWarnings":0,"scanConsistency":null,"scanWait":null}</ud>
      2021-04-22T05:58:22.851-07:00 INFO CBAS.messaging.CCMessageBroker [Executor-758:ClusterController] Received message: ExecuteStatementRequestMessage(id=38079, from=47c864f2a9e597168db197820602939e): <ud>select avg(price) as AvgPrice, min(price) as MinPrice, max(price) as MaxPrice from dv_8.ds_29 where free_breakfast=True and free_parking=True and price is not null and array_count(public_likes)>5 and `type`='Hotel' group by country limit 100;</ud>
      2021-04-22T05:58:22.908-07:00 INFO CBAS.work.JobCleanupWork [Worker:ClusterController] Cleanup for job: JID:5.80502
      2021-04-22T05:58:22.910-07:00 INFO CBAS.messaging.NCMessageBroker [Worker:47c864f2a9e597168db197820602939e] Received message: ExecuteStatementResponseMessage(id=38079): 71 characters
      2021-04-22T05:58:24.085-07:00 ERRO CBAS.rebalance.Rebalance [Executor-755:ClusterController] Rebalance 60133c5ac910f395614e1c0d59647e48 failed
      java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [1151f6cfc86035b5768e4a5ab36bbb8f], state: ACTIVE)
      	at com.couchbase.analytics.control.rebalance.Rebalance.ensureNodesClusterActive(Rebalance.java:484) ~[cbas-server.jar:7.0.0-4955]
      	at com.couchbase.analytics.control.rebalance.Rebalance.adjustClusterBeforeRebalance(Rebalance.java:613) ~[cbas-server.jar:7.0.0-4955]
      	at com.couchbase.analytics.control.rebalance.Rebalance.doRebalance(Rebalance.java:190) ~[cbas-server.jar:7.0.0-4955]
      	at com.couchbase.analytics.control.rebalance.Rebalance.doCall(Rebalance.java:152) [cbas-server.jar:7.0.0-4955]
      	at com.couchbase.analytics.control.rebalance.Rebalance.doCall(Rebalance.java:80) [cbas-server.jar:7.0.0-4955]
      	at com.couchbase.analytics.runtime.WriteLockCallable.call(WriteLockCallable.java:27) [cbas-connector.jar:7.0.0-4955]
      	at java.util.concurrent.FutureTask.run(Unknown Source) [?:?]
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
      	at java.lang.Thread.run(Unknown Source) [?:?]
      2021-04-22T05:58:24.086-07:00 WARN CBAS.rebalance.Rebalance [Executor-755:ClusterController] exit Rebalance 60133c5ac910f395614e1c0d59647e48
      2021-04-22T05:58:24.086-07:00 INFO CBAS.rebalance.RebalanceProgress [Executor-754:ClusterController] dataset size fetcher interrupted
      2021-04-22T05:58:24.576-07:00 ERRO CBAS.servlet.RebalanceServlet [HttpExecutor(port:9111)-10] Rebalance 60133c5ac910f395614e1c0d59647e48 failed
      java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [1151f6cfc86035b5768e4a5ab36bbb8f], state: ACTIVE)
      	at com.couchbase.analytics.control.rebalance.Rebalance.ensureNodesClusterActive(Rebalance.java:484) ~[cbas-server.jar:7.0.0-4955]
      	at com.couchbase.analytics.control.rebalance.Rebalance.adjustClusterBeforeRebalance(Rebalance.java:613) ~[cbas-server.jar:7.0.0-4955]
      	at com.couchbase.analytics.control.rebalance.Rebalance.doRebalance(Rebalance.java:190) ~[cbas-server.jar:7.0.0-4955]
      	at com.couchbase.analytics.control.rebalance.Rebalance.doCall(Rebalance.java:152) ~[cbas-server.jar:7.0.0-4955]
      	at com.couchbase.analytics.control.rebalance.Rebalance.doCall(Rebalance.java:80) ~[cbas-server.jar:7.0.0-4955]
      	at com.couchbase.analytics.runtime.WriteLockCallable.call(WriteLockCallable.java:27) ~[cbas-connector.jar:7.0.0-4955]
      	at java.util.concurrent.FutureTask.run(Unknown Source) [?:?]
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
      	at java.lang.Thread.run(Unknown Source) [?:?]
      2021-04-22T05:58:24.589-07:00 INFO CBAS.cbas requesting isBalanced for 60133c5ac910f395614e1c0d59647e48 from driver
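
      For reference, the IllegalStateException above is thrown from Rebalance.ensureNodesClusterActive when the Analytics cluster controller gives up on a bounded wait for the listed node to become active. Below is a minimal, hypothetical sketch of that kind of check (NodeJoinWaiter and ClusterView are illustrative names, not the actual cbas implementation); it only shows the shape of the timeout and of the error message:

      import java.util.HashSet;
      import java.util.Set;
      import java.util.concurrent.TimeUnit;

      // Hypothetical sketch of a "wait until all expected nodes are active" check.
      // Not the real cbas code; the production logic lives in
      // com.couchbase.analytics.control.rebalance.Rebalance#ensureNodesClusterActive.
      public class NodeJoinWaiter {

          /** Hypothetical read-only view of cluster membership. */
          public interface ClusterView {
              Set<String> activeNodeIds();
              String state();
          }

          /**
           * Blocks until every node in expectedNodes is reported active, or the timeout
           * expires. On timeout it throws IllegalStateException naming the nodes that
           * never showed up, which matches the shape of the error in this ticket.
           */
          public static void awaitNodesActive(Set<String> expectedNodes, ClusterView cluster,
                                              long timeout, TimeUnit unit) throws InterruptedException {
              long deadline = System.nanoTime() + unit.toNanos(timeout);
              while (true) {
                  Set<String> missing = new HashSet<>(expectedNodes);
                  missing.removeAll(cluster.activeNodeIds());
                  if (missing.isEmpty()) {
                      return; // all expected nodes joined and the cluster is active
                  }
                  if (System.nanoTime() >= deadline) {
                      throw new IllegalStateException(
                              "timed out waiting for all nodes to join & cluster active (missing nodes: "
                                      + missing + ", state: " + cluster.state() + ")");
                  }
                  TimeUnit.SECONDS.sleep(1); // poll until the deadline
              }
          }
      }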
      

      Cluster config:
      backup : 1 ===== > [172.23.123.28:8091] ###########
      index : 6 ===== > [172.23.104.137:8091 172.23.105.107:8091 172.23.121.117:8091 172.23.96.252:8091 172.23.96.253:8091 172.23.99.11:8091]
      fts : 2 ===== > [172.23.104.155:8091 172.23.96.148:8091] ###########
      cbas : 4 ===== > [172.23.104.157:8091 172.23.104.5:8091 172.23.106.188:8091 172.23.97.242:8091] ###########
      eventing : 4 ===== > [172.23.104.67:8091 172.23.123.27:8091 172.23.97.239:8091 172.23.98.135:8091] ###########
      kv : 11 ===== > [172.23.104.69:8091 172.23.104.70:8091 172.23.106.100:8091 172.23.108.103:8091 172.23.121.3:8091 172.23.97.119:8091 172.23.97.121:8091 172.23.97.122:8091 172.23.99.20:8091 172.23.99.21:8091 172.23.99.25:8091] ###########
      n1ql : 2 ===== > [172.23.120.245:8091 172.23.96.251:8091] ###########

      Logs:
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.104.137.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.104.155.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.104.157.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.104.5.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.104.67.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.104.69.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.104.70.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.105.107.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.105.111.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.106.100.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.106.188.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.108.103.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.120.245.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.121.117.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.121.3.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.123.27.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.123.28.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.96.148.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.96.251.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.96.252.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.96.253.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.97.119.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.97.121.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.97.122.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.97.239.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.97.242.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.98.135.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.99.11.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.99.20.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.99.21.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.99.25.zip


          Activity

            michael.blow Michael Blow added a comment -

            There seems to be some confusion with the topology in this cluster: the node 1151f6cfc86035b5768e4a5ab36bbb8f (172.23.106.188) was rebalanced out as part of rebalance 669ad0d085b724d1f62eceac6b0dc7f1:

            2021-04-21T08:35:21.368-07:00 INFO CBAS.servlet.RebalanceServlet [HttpExecutor(port:9111)-8] +post request: {"topologyChange":{"id":"669ad0d085b724d1f62eceac6b0dc7f1","currentTopologyRev":null,"type":"topology-change-rebalance","keepNodes":[{"nodeInfo":{"nodeId":"47c864f2a9e597168db197820602939e","priority":0,"opaque":{"cbas-version":"7.0.0-4955","cc-http-port":"9111","controller-id":"5","host":"172.23.104.157","num-iodevices":"8","starting-partition-id":"44","svc-http-port":"8095"}},"recoveryType":"recovery-full"},{"nodeInfo":{"nodeId":"8f2b8af91b250f31e02143ecf103617b","priority":0,"opaque":{"cbas-version":"7.0.0-4955","cc-http-port":"9111","controller-id":"8","host":"172.23.104.5","num-iodevices":"8","starting-partition-id":"68","svc-http-port":"8095"}},"recoveryType":"recovery-full"},{"nodeInfo":{"nodeId":"69e7ccf80fd909080b2bd6fd4c0fbd0b","priority":0,"opaque":{"cbas-version":"7.0.0-4955","cc-http-port":"9111","controller-id":"7","host":"172.23.97.242","num-iodevices":"8","starting-partition-id":"60","svc-http-port":"8095"}},"recoveryType":"recovery-full"}],"ejectNodes":[{"nodeId":"1151f6cfc86035b5768e4a5ab36bbb8f","priority":0,"opaque":{"cbas-version":"7.0.0-4955","cc-http-port":"9111","controller-id":"0","host":"172.23.106.188","num-iodevices":"8","starting-partition-id":"0","svc-http-port":"8095"}}]}
            

            which completed at approximately 2021-04-21T08:40:30.351-07:00:

            2021-04-21T08:40:30.351-07:00 WARN CBAS.rebalance.Rebalance [Executor-1836:ClusterController] exit Rebalance 669ad0d085b724d1f62eceac6b0dc7f1
            

            Somehow, on the next rebalance 60133c5ac910f395614e1c0d59647e48, the ejected node is re-presented in keepNodes to .157 some 21+ hours later:

            2021-04-22T05:56:23.952-07:00 WARN CBAS.rebalance.Rebalance [Executor-755:ClusterController] enter Rebalance 60133c5ac910f395614e1c0d59647e48: {"id":"60133c5ac910f395614e1c0d59647e48","currentTopologyRev":null,"type":"topology-change-rebalance","keepNodes":[{"nodeInfo":{"nodeId":"47c864f2a9e597168db197820602939e","priority":0,"opaque":{"cbas-version":"7.0.0-4955","cc-http-port":"9111","controller-id":"5","host":"172.23.104.157","num-iodevices":"8","starting-partition-id":"44","svc-http-port":"8095"}},"recoveryType":"recovery-full"},{"nodeInfo":{"nodeId":"8f2b8af91b250f31e02143ecf103617b","priority":0,"opaque":{"cbas-version":"7.0.0-4955","cc-http-port":"9111","controller-id":"8","host":"172.23.104.5","num-iodevices":"8","starting-partition-id":"68","svc-http-port":"8095"}},"recoveryType":"recovery-full"},{"nodeInfo":{"nodeId":"1151f6cfc86035b5768e4a5ab36bbb8f","priority":0,"opaque":{"cbas-version":"7.0.0-4955","cc-http-port":"9111","controller-id":"9","host":"172.23.106.188","num-iodevices":"8","starting-partition-id":"76","svc-http-port":"8095"}},"recoveryType":"recovery-full"},{"nodeInfo":{"nodeId":"69e7ccf80fd909080b2bd6fd4c0fbd0b","priority":0,"opaque":{"cbas-version":"7.0.0-4955","cc-http-port":"9111","controller-id":"7","host":"172.23.97.242","num-iodevices":"8","starting-partition-id":"60","svc-http-port":"8095"}},"recoveryType":"recovery-full"}],"ejectNodes":[]}
            

            Note that ns_server made no attempt to restart cbas on .188; there is no activity in the analytics logs on .188 after the rebalance out as part of rebalance 669ad0d085b724d1f62eceac6b0dc7f1.

            Reassigning to ns_server for further investigation; from what I can tell, analytics is working as intended.
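
            To make the anomaly concrete: comparing the ejectNodes of the completed rebalance 669ad0d085b724d1f62eceac6b0dc7f1 against the keepNodes of the later rebalance 60133c5ac910f395614e1c0d59647e48 flags exactly the node in question. A small, hypothetical sketch (the node ids are copied from the payloads quoted above; TopologyDiff is an illustrative helper, not part of the product):

            import java.util.HashSet;
            import java.util.Set;

            // Hypothetical helper illustrating the anomaly described above: a node id that an
            // earlier, completed topology change ejected reappears in the keepNodes of a later
            // topology change without the service on that node ever being restarted.
            public class TopologyDiff {

                /** Returns the node ids that were ejected earlier but reappear in keepNodes now. */
                public static Set<String> reappearingNodes(Set<String> previouslyEjected,
                                                           Set<String> currentKeepNodes) {
                    Set<String> reappearing = new HashSet<>(previouslyEjected);
                    reappearing.retainAll(currentKeepNodes);
                    return reappearing;
                }

                public static void main(String[] args) {
                    // Node ids taken from the two topology-change payloads quoted above.
                    Set<String> ejectedBy669ad0 = Set.of("1151f6cfc86035b5768e4a5ab36bbb8f");
                    Set<String> keepNodesOf60133c = Set.of(
                            "47c864f2a9e597168db197820602939e",
                            "8f2b8af91b250f31e02143ecf103617b",
                            "1151f6cfc86035b5768e4a5ab36bbb8f",
                            "69e7ccf80fd909080b2bd6fd4c0fbd0b");
                    // Prints [1151f6cfc86035b5768e4a5ab36bbb8f]: the ejected node is back in keepNodes.
                    System.out.println(reappearingNodes(ejectedBy669ad0, keepNodesOf60133c));
                }
            }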

            timofey.barmin Timofey Barmin added a comment - - edited

            I see that the rebalance that was supposed to eject node 172.23.106.188 failed:

            2500889 [user:info,2021-04-21T04:04:51.677-07:00,ns_1@172.23.108.103:<0.23324.0>:ns_orchestrator:idle:773]Starting rebalance, KeepNodes = ['ns_1@172.23.104.137','ns_1@172.23.104.155',
            2500890                                  'ns_1@172.23.104.157','ns_1@172.23.104.5',
            2500891                                  'ns_1@172.23.104.67','ns_1@172.23.104.69',
            2500892                                  'ns_1@172.23.104.70','ns_1@172.23.105.107',
            2500893                                  'ns_1@172.23.106.100','ns_1@172.23.108.103',
            2500894                                  'ns_1@172.23.120.245','ns_1@172.23.121.117',
            2500895                                  'ns_1@172.23.121.3','ns_1@172.23.123.27',
            2500896                                  'ns_1@172.23.123.28','ns_1@172.23.96.148',
            2500897                                  'ns_1@172.23.96.251','ns_1@172.23.96.252',
            2500898                                  'ns_1@172.23.96.253','ns_1@172.23.97.119',
            2500899                                  'ns_1@172.23.97.121','ns_1@172.23.97.122',
            2500900                                  'ns_1@172.23.97.239','ns_1@172.23.97.242',
            2500901                                  'ns_1@172.23.98.135','ns_1@172.23.99.11',
            2500902                                  'ns_1@172.23.99.20','ns_1@172.23.99.21',
            2500903                                  'ns_1@172.23.99.25'], EjectNodes = ['ns_1@172.23.106.188'],
             
            ....
             
            2861252 [user:error,2021-04-21T08:41:40.426-07:00,ns_1@172.23.108.103:<0.23324.0>:ns_orchestrator:log_rebalance_completion:1404]Rebalance exited with reason {service_rebalance_failed,eventing,
            2861253                               {agent_died,<30247.24974.0>,
            2861254                                {linked_process_died,<30247.30085.1993>,
            2861255                                 {timeout,
            2861256                                  {gen_server,call,
            2861257                                   [<30247.25022.0>,
            2861258                                    {call,"ServiceAPI.PrepareTopologyChange",
            2861259                                     #Fun<json_rpc_connection.0.77329884>},
            2861260                                    60000]}}}}}.
            

            But that rebalance for cbas completed successfully:

            2854650 [rebalance:info,2021-04-21T08:35:12.708-07:00,ns_1@172.23.108.103:service_rebalancer-cbas-worker<0.17039.3492>:service_rebalancer:rebalance_worker:147]Rebalancing service cbas with id <<"669
            2854651 KeepNodes: ['ns_1@172.23.104.157','ns_1@172.23.104.5','ns_1@172.23.97.242']
            2854652 EjectNodes: ['ns_1@172.23.106.188']
            2854653 DeltaNodes: []
            ....
            2860342 [ns_server:debug,2021-04-21T08:40:40.002-07:00,ns_1@172.23.108.103:service_rebalancer-cbas<0.11401.3492>:service_rebalancer:run_rebalance_worker:116]Worker terminated normally
            

            So there are two problems here:
            1) The rebalance at 2021-04-21T04:04:51.677-07:00 fails because of a gen_server:call timeout (at the eventing rebalance step)
            2) The rebalance at 2021-04-21T15:22:42.000-07:00 fails because cbas doesn't expect to see an already-ejected node in the list of nodes

            I also noticed that between those two rebalances the orchestrator moved from 172.23.108.103 to 172.23.104.137, which was caused by a restart of 172.23.108.103:

            14702 [ns_server:info,2021-04-21T14:30:54.357-07:00,babysitter_of_ns_1@cb.local:<0.105.0>:ns_babysitter:stop:162]Received shutdown request. Terminating.
            14703 [ns_server:info,2021-04-21T14:30:58.892-07:00,babysitter_of_ns_1@cb.local:<0.105.0>:ns_babysitter:init_logging:131]Brought up babysitter logging
            


            It probably makes sense to create a separate bug for problem #1.

            I'm still looking into this issue.


            timofey.barmin Timofey Barmin added a comment -

            I see that node .188 was removed from the cbas service_map after the first cbas rebalance:

            [ns_server:debug,2021-04-21T08:40:40.018-07:00,ns_1@172.23.108.103:<0.32754.3402>:ns_cluster_membership:set_service_map:447]Set service map for service cbas to ['ns_1@172.23.104.157',
                                                 'ns_1@172.23.104.5','ns_1@172.23.97.242']
            

            And added back to service_map before the second cbas rebalance:

            [ns_server:debug,2021-04-22T05:56:19.835-07:00,ns_1@172.23.104.137:<0.7339.2075>:ns_cluster_membership:set_service_map:447]Set service map for service cbas to ['ns_1@172.23.104.157',
                                                 'ns_1@172.23.104.5',
                                                 'ns_1@172.23.106.188',
                                                 'ns_1@172.23.97.242']
            

            So it looks like node .188 is being added back to the cluster by that rebalance, which is correct behavior. That is why the node is present in keepNodes.
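
            As a rough illustration of that bookkeeping (hypothetical; ns_server is implemented in Erlang, this Java sketch only models the set arithmetic, and the ServiceMap class is invented for this example):

            import java.util.LinkedHashSet;
            import java.util.Set;

            // Hypothetical model of the service_map bookkeeping described above: a successful
            // service rebalance-out drops the node from the map, while a later rebalance that
            // keeps the node adds it back, so the service sees the same node id in keepNodes again.
            public class ServiceMap {
                private final Set<String> nodes = new LinkedHashSet<>();

                public void applyRebalance(Set<String> keepNodes, Set<String> ejectNodes) {
                    nodes.addAll(keepNodes);     // nodes kept or being rebalanced in
                    nodes.removeAll(ejectNodes); // nodes successfully rebalanced out
                }

                public Set<String> current() {
                    return Set.copyOf(nodes);
                }

                public static void main(String[] args) {
                    ServiceMap cbas = new ServiceMap();
                    // After the first cbas rebalance: .188 ejected successfully.
                    cbas.applyRebalance(
                            Set.of("172.23.104.157", "172.23.104.5", "172.23.97.242"),
                            Set.of("172.23.106.188"));
                    System.out.println(cbas.current()); // three nodes, .188 gone
                    // The next rebalance adds .188 back, so it reappears in the map
                    // (and therefore in the keepNodes handed to the service).
                    cbas.applyRebalance(
                            Set.of("172.23.104.157", "172.23.104.5",
                                   "172.23.106.188", "172.23.97.242"),
                            Set.of());
                    System.out.println(cbas.current()); // four nodes, .188 back
                }
            }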

            Michael Blow, could you please clarify what the expected behavior of ns_server is when a node is being added back?


            michael.blow Michael Blow added a comment -

            If the node is rebalanced out and added back, it should not have the same uuid (1151f6cfc86035b5768e4a5ab36bbb8f), right? I would expect it not to get the same uuid.


            timofey.barmin Timofey Barmin added a comment -

            Michael Blow
            Since that node has not fully left the cluster, it keeps its uuid.


            michael.blow Michael Blow added a comment -

            On rebalance out, 'cbas' has always gone into a halted state, waiting for ns_server to kill us. It sounds like you're saying that is now (or always has been) wrong, and that we should instead be prepared to be resurrected by some future rebalance; is that right, Timofey Barmin?

            Moving back to analytics based on this information and will investigate further.


            timofey.barmin Timofey Barmin added a comment -

            Since the rebalance-out for that node didn't finish, the node didn't get wiped. I think that's the reason why you see that cbas is not killed. I think it's always been like this; we've just never tested it (that's my guess, actually).

            michael.blow Michael Blow added a comment -

            I haven't tried to reproduce this on 6.6, but it seems odd that I've gone from never seeing this to seeing it three times in a week. I wonder if previously the babysitter restarted services that had successfully completed the rebalance out, on an ultimately failed rebalance.

            Either way, I have updated our service to accommodate the [new] behavior.
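
            Editorial note: a rough, hypothetical sketch of what accommodating this behavior can look like on the service side; NodeLifecycle is an invented name and this is not the actual change from commit 144562f:

            import java.util.Set;

            // Hypothetical sketch of "support rebalance-in on a rebalanced-out node": instead of
            // treating rebalance-out as terminal and waiting to be killed, the node clears its
            // halted state when a later topology change lists it in keepNodes again.
            // Not the actual change from commit 144562f.
            public class NodeLifecycle {

                enum State { ACTIVE, REBALANCED_OUT }

                private final String localNodeId;
                private State state = State.ACTIVE;

                public NodeLifecycle(String localNodeId) {
                    this.localNodeId = localNodeId;
                }

                public void onRebalancedOut() {
                    // Previously this was effectively terminal: wait for ns_server to stop the service.
                    state = State.REBALANCED_OUT;
                }

                public void onTopologyChange(Set<String> keepNodes) {
                    if (state == State.REBALANCED_OUT && keepNodes.contains(localNodeId)) {
                        // The accommodated behavior: a rebalanced-out (but never wiped) node can be
                        // rebalanced back in, so it rejoins rather than staying halted.
                        state = State.ACTIVE;
                    }
                }

                public State state() {
                    return state;
                }
            }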


            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.0-5064 contains cbas commit 144562f with commit message:
            MB-45869: support rebalance-in on rebalanced-out node


            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.0-5170 contains cbas commit 1c9bd24 with commit message:
            MB-46285: prevent unupgraded nodes from experiencing MB-45869


            till Till Westmann added a comment -

            Bulk labeling closed issues as triaged.


            People

              Assignee: Arunkumar Senthilnathan
              Reporter: Arunkumar Senthilnathan