Details
-
Bug
-
Status: Closed
-
Critical
-
Resolution: Fixed
-
Cheshire-Cat
-
Untriaged
-
1
-
Unknown
-
CX Sprint 246
Description
7.0.0-4955
Test:
-test tests/integration/cheshirecat/test_cheshirecat_kv_gsi_coll_xdcr_backup_sgw_fts_itemct_txns_eventing_cbas_scale3.yml -scope tests/integration/cheshirecat/scope_cheshirecat_with_backup.yml
Scale 3
Iteration 3
.108.103 diag.log:
2021-04-21T15:23:09.425-07:00, ns_vbucket_mover:0:info:message(ns_1@172.23.104.137) - Bucket "default" rebalance appears to be swap rebalance
|
2021-04-21T15:23:11.139-07:00, ns_node_disco:4:info:node up(ns_1@172.23.104.67) - Node 'ns_1@172.23.104.67' saw that node 'ns_1@172.23.105.111' came up. Tags: [] (repeated 1 times, last seen 61.342812 secs ago)
|
2021-04-21T15:23:17.661-07:00, ns_node_disco:4:info:node up(ns_1@172.23.105.107) - Node 'ns_1@172.23.105.107' saw that node 'ns_1@172.23.105.111' came up. Tags: [] (repeated 1 times, last seen 68.368038 secs ago)
|
2021-04-21T15:23:18.407-07:00, ns_node_disco:4:info:node up(ns_1@172.23.104.157) - Node 'ns_1@172.23.104.157' saw that node 'ns_1@172.23.105.111' came up. Tags: [] (repeated 1 times, last seen 69.015704 secs ago)
|
2021-04-22T05:58:24.082-07:00, analytics:0:warning:message(ns_1@172.23.104.157) - Analytics Service unable to successfully rebalance 60133c5ac910f395614e1c0d59647e48 due to 'java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [1151f6cfc86035b5768e4a5ab36bbb8f], state: ACTIVE)'; see analytics_info.log for details
|
2021-04-22T05:58:24.588-07:00, ns_orchestrator:0:critical:message(ns_1@172.23.104.137) - Rebalance exited with reason {service_rebalance_failed,cbas,
|
{worker_died,
|
{'EXIT',<0.28306.2352>,
|
{rebalance_failed,
|
{service_error,
|
<<"Rebalance 60133c5ac910f395614e1c0d59647e48 failed: timed out waiting for all nodes to join & cluster active (missing nodes: [1151f6cfc86035b5768e4a5ab36bbb8f], state: ACTIVE)">>}}}}}.
|
Rebalance Operation Id = 759b7fb5e9eb86629ddf589999b711a4
|
Console:
[2021-04-21T15:21:49-07:00, sequoiatools/couchbase-cli:7.0:1d0175] server-add -c 172.23.108.103:8091 --server-add https://172.23.105.111 -u Administrator -p password --server-add-username Administrator --server-add-password password --services index
|
[2021-04-21T15:22:23-07:00, sequoiatools/couchbase-cli:7.0:8b292a] rebalance -c 172.23.108.103:8091 -u Administrator -p password
|
ā
|
|
Error occurred on container - sequoiatools/couchbase-cli:7.0:[rebalance -c 172.23.108.103:8091 -u Administrator -p password]
|
|
docker logs 8b292a
|
docker start 8b292a
|
|
*Unable to display progress bar on this os
|
JERROR: Rebalance failed. See logs for detailed reason. You can try again.
|
[2021-04-22T05:58:31-07:00, sequoiatools/cmd:a0f7ff] 60
|
.104.157 analytics info.log:
2021-04-22T05:58:22.850-07:00 INFO CBAS.server.QueryServiceServlet [HttpExecutor(port:8095)-10] handleRequest: <ud>{"host":"172.23.104.157:8095","path":"/query/service","statement":"select avg(price) as AvgPrice, min(price) as MinPrice, max(price) as MaxPrice from dv_8.ds_29 where free_breakfast=True and free_parking=True and price is not null and array_count(public_likes)>5 and `type`='Hotel' group by country limit 100;","pretty":true,"mode":"immediate","clientContextID":"query_thread_129617","format":"CLEAN_JSON","timeout":3600000,"maxResultReads":1,"planFormat":"JSON","expressionTree":false,"rewrittenExpressionTree":false,"logicalPlan":false,"optimizedLogicalPlan":false,"job":false,"profile":"counts","signature":true,"multiStatement":true,"parseOnly":false,"readOnly":false,"maxWarnings":0,"scanConsistency":null,"scanWait":null}</ud>
|
2021-04-22T05:58:22.851-07:00 INFO CBAS.messaging.CCMessageBroker [Executor-758:ClusterController] Received message: ExecuteStatementRequestMessage(id=38079, from=47c864f2a9e597168db197820602939e): <ud>select avg(price) as AvgPrice, min(price) as MinPrice, max(price) as MaxPrice from dv_8.ds_29 where free_breakfast=True and free_parking=True and price is not null and array_count(public_likes)>5 and `type`='Hotel' group by country limit 100;</ud>
|
2021-04-22T05:58:22.908-07:00 INFO CBAS.work.JobCleanupWork [Worker:ClusterController] Cleanup for job: JID:5.80502
|
2021-04-22T05:58:22.910-07:00 INFO CBAS.messaging.NCMessageBroker [Worker:47c864f2a9e597168db197820602939e] Received message: ExecuteStatementResponseMessage(id=38079): 71 characters
|
2021-04-22T05:58:24.085-07:00 ERRO CBAS.rebalance.Rebalance [Executor-755:ClusterController] Rebalance 60133c5ac910f395614e1c0d59647e48 failed
|
java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [1151f6cfc86035b5768e4a5ab36bbb8f], state: ACTIVE)
|
at com.couchbase.analytics.control.rebalance.Rebalance.ensureNodesClusterActive(Rebalance.java:484) ~[cbas-server.jar:7.0.0-4955]
|
at com.couchbase.analytics.control.rebalance.Rebalance.adjustClusterBeforeRebalance(Rebalance.java:613) ~[cbas-server.jar:7.0.0-4955]
|
at com.couchbase.analytics.control.rebalance.Rebalance.doRebalance(Rebalance.java:190) ~[cbas-server.jar:7.0.0-4955]
|
at com.couchbase.analytics.control.rebalance.Rebalance.doCall(Rebalance.java:152) [cbas-server.jar:7.0.0-4955]
|
at com.couchbase.analytics.control.rebalance.Rebalance.doCall(Rebalance.java:80) [cbas-server.jar:7.0.0-4955]
|
at com.couchbase.analytics.runtime.WriteLockCallable.call(WriteLockCallable.java:27) [cbas-connector.jar:7.0.0-4955]
|
at java.util.concurrent.FutureTask.run(Unknown Source) [?:?]
|
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
|
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
|
at java.lang.Thread.run(Unknown Source) [?:?]
|
2021-04-22T05:58:24.086-07:00 WARN CBAS.rebalance.Rebalance [Executor-755:ClusterController] exit Rebalance 60133c5ac910f395614e1c0d59647e48
|
2021-04-22T05:58:24.086-07:00 INFO CBAS.rebalance.RebalanceProgress [Executor-754:ClusterController] dataset size fetcher interrupted
|
2021-04-22T05:58:24.576-07:00 ERRO CBAS.servlet.RebalanceServlet [HttpExecutor(port:9111)-10] Rebalance 60133c5ac910f395614e1c0d59647e48 failed
|
java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [1151f6cfc86035b5768e4a5ab36bbb8f], state: ACTIVE)
|
at com.couchbase.analytics.control.rebalance.Rebalance.ensureNodesClusterActive(Rebalance.java:484) ~[cbas-server.jar:7.0.0-4955]
|
at com.couchbase.analytics.control.rebalance.Rebalance.adjustClusterBeforeRebalance(Rebalance.java:613) ~[cbas-server.jar:7.0.0-4955]
|
at com.couchbase.analytics.control.rebalance.Rebalance.doRebalance(Rebalance.java:190) ~[cbas-server.jar:7.0.0-4955]
|
at com.couchbase.analytics.control.rebalance.Rebalance.doCall(Rebalance.java:152) ~[cbas-server.jar:7.0.0-4955]
|
at com.couchbase.analytics.control.rebalance.Rebalance.doCall(Rebalance.java:80) ~[cbas-server.jar:7.0.0-4955]
|
at com.couchbase.analytics.runtime.WriteLockCallable.call(WriteLockCallable.java:27) ~[cbas-connector.jar:7.0.0-4955]
|
at java.util.concurrent.FutureTask.run(Unknown Source) [?:?]
|
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
|
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
|
at java.lang.Thread.run(Unknown Source) [?:?]
|
2021-04-22T05:58:24.589-07:00 INFO CBAS.cbas requesting isBalanced for 60133c5ac910f395614e1c0d59647e48 from driver
|
Cluster config:
backup : 1 ===== > [172.23.123.28:8091] ###########
index : 6 ===== > [172.23.104.137:8091 172.23.105.107:8091 172.23.121.117:8091 172.23.96.252:8091 172.23.96.253:8091 172.23.99.11:8091]
fts : 2 ===== > [172.23.104.155:8091 172.23.96.148:8091] ###########
cbas : 4 ===== > [172.23.104.157:8091 172.23.104.5:8091 172.23.106.188:8091 172.23.97.242:8091] ###########
eventing : 4 ===== > [172.23.104.67:8091 172.23.123.27:8091 172.23.97.239:8091 172.23.98.135:8091] ###########
kv : 11 ===== > [172.23.104.69:8091 172.23.104.70:8091 172.23.106.100:8091 172.23.108.103:8091 172.23.121.3:8091 172.23.97.119:8091 172.23.97.121:8091 172.23.97.122:8091 172.23.99.20:8091 172.23.99.21:8091 172.23.99.25:8091] ###########
n1ql : 2 ===== > [172.23.120.245:8091 172.23.96.251:8091] ###########
Logs:
url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.104.137.zip
url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.104.155.zip
url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.104.157.zip
url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.104.5.zip
url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.104.67.zip
url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.104.69.zip
url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.104.70.zip
url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.105.107.zip
url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.105.111.zip
url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.106.100.zip
url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.106.188.zip
url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.108.103.zip
url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.120.245.zip
url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.121.117.zip
url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.121.3.zip
url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.123.27.zip
url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.123.28.zip
url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.96.148.zip
url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.96.251.zip
url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.96.252.zip
url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.96.253.zip
url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.97.119.zip
url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.97.121.zip
url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.97.122.zip
url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.97.239.zip
url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.97.242.zip
url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.98.135.zip
url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.99.11.zip
url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.99.20.zip
url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.99.21.zip
url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1619103127/collectinfo-2021-04-22T145210-ns_1%40172.23.99.25.zip
Attachments
Issue Links
- backports to
-
MB-46782 [BP 6.6.3][System Test][Analytics] Rebalance failed with error - java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [xxxx], state: ACTIVE)
-
- Closed
-
- is duplicated by
-
MB-45970 [Windows]cbbackupmgr: Backup failed with reason "failed to transfer Analytics metadata"
-
- Closed
-
-
MB-46285 [system test upgrade] : Analytics rebalance fails with "java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active" during upgrade from 6.6.2 -> 7.0.0
-
- Closed
-
-
MB-46781 [BP 6.6.3][system test upgrade] : Analytics rebalance fails with "java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active" during upgrade from 6.6.2 -> 7.0.0
-
- Closed
-
- relates to
-
MB-45954 Rebalance fails because of PrepareTopologyChange timeout
-
- Closed
-
-
MB-46865 [BP 6.6.3] Unhelpful error message on: timed out waiting for all nodes to join & cluster active (missing nodes: [e6f0383d4902ece226bc1f2329d23993]
-
- Closed
-
-
MB-46293 [CX] Unhelpful error message on: timed out waiting for all nodes to join & cluster active (missing nodes: [e6f0383d4902ece226bc1f2329d23993]
-
- Closed
-
There seems to be something confused with the topology in this cluster- the node 1151f6cfc86035b5768e4a5ab36bbb8f (172.23.106.188) was rebalanced out as part of rebalance 669ad0d085b724d1f62eceac6b0dc7f1:
2021-04-21T08:35:21.368-07:00 INFO CBAS.servlet.RebalanceServlet [HttpExecutor(port:9111)-8] +post request: {"topologyChange":{"id":"669ad0d085b724d1f62eceac6b0dc7f1","currentTopologyRev":null,"type":"topology-change-rebalance","keepNodes":[{"nodeInfo":{"nodeId":"47c864f2a9e597168db197820602939e","priority":0,"opaque":{"cbas-version":"7.0.0-4955","cc-http-port":"9111","controller-id":"5","host":"172.23.104.157","num-iodevices":"8","starting-partition-id":"44","svc-http-port":"8095"}},"recoveryType":"recovery-full"},{"nodeInfo":{"nodeId":"8f2b8af91b250f31e02143ecf103617b","priority":0,"opaque":{"cbas-version":"7.0.0-4955","cc-http-port":"9111","controller-id":"8","host":"172.23.104.5","num-iodevices":"8","starting-partition-id":"68","svc-http-port":"8095"}},"recoveryType":"recovery-full"},{"nodeInfo":{"nodeId":"69e7ccf80fd909080b2bd6fd4c0fbd0b","priority":0,"opaque":{"cbas-version":"7.0.0-4955","cc-http-port":"9111","controller-id":"7","host":"172.23.97.242","num-iodevices":"8","starting-partition-id":"60","svc-http-port":"8095"}},"recoveryType":"recovery-full"}],"ejectNodes":[{"nodeId":"1151f6cfc86035b5768e4a5ab36bbb8f","priority":0,"opaque":{"cbas-version":"7.0.0-4955","cc-http-port":"9111","controller-id":"0","host":"172.23.106.188","num-iodevices":"8","starting-partition-id":"0","svc-http-port":"8095"}}]}
which completed at approximately 2021-04-21T08:40:30.351-07:00:
2021-04-21T08:40:30.351-07:00 WARN CBAS.rebalance.Rebalance [Executor-1836:ClusterController] exit Rebalance 669ad0d085b724d1f62eceac6b0dc7f1
somehow, on the next rebalance 60133c5ac910f395614e1c0d59647e48 the ejected node is re-presented in keepNodes to .157 some 21+ hours later:
2021-04-22T05:56:23.952-07:00 WARN CBAS.rebalance.Rebalance [Executor-755:ClusterController] enter Rebalance 60133c5ac910f395614e1c0d59647e48: {"id":"60133c5ac910f395614e1c0d59647e48","currentTopologyRev":null,"type":"topology-change-rebalance","keepNodes":[{"nodeInfo":{"nodeId":"47c864f2a9e597168db197820602939e","priority":0,"opaque":{"cbas-version":"7.0.0-4955","cc-http-port":"9111","controller-id":"5","host":"172.23.104.157","num-iodevices":"8","starting-partition-id":"44","svc-http-port":"8095"}},"recoveryType":"recovery-full"},{"nodeInfo":{"nodeId":"8f2b8af91b250f31e02143ecf103617b","priority":0,"opaque":{"cbas-version":"7.0.0-4955","cc-http-port":"9111","controller-id":"8","host":"172.23.104.5","num-iodevices":"8","starting-partition-id":"68","svc-http-port":"8095"}},"recoveryType":"recovery-full"},{"nodeInfo":{"nodeId":"1151f6cfc86035b5768e4a5ab36bbb8f","priority":0,"opaque":{"cbas-version":"7.0.0-4955","cc-http-port":"9111","controller-id":"9","host":"172.23.106.188","num-iodevices":"8","starting-partition-id":"76","svc-http-port":"8095"}},"recoveryType":"recovery-full"},{"nodeInfo":{"nodeId":"69e7ccf80fd909080b2bd6fd4c0fbd0b","priority":0,"opaque":{"cbas-version":"7.0.0-4955","cc-http-port":"9111","controller-id":"7","host":"172.23.97.242","num-iodevices":"8","starting-partition-id":"60","svc-http-port":"8095"}},"recoveryType":"recovery-full"}],"ejectNodes":[]}
Note, ns_server made no attempt to restart cbas on .188, there is no activity in the analytics logs on .188 after the rebalance out as part of rebalance 669ad0d085b724d1f62eceac6b0dc7f1.
Reassigning to ns_server for further investigation, from what I can tell, analytics is working as intended.