Details
Bug
Resolution: Fixed
Critical
7.1.0
6.6.5-10076 -> 7.1.0-2117
Untriaged
Centos 64-bit
1
No
CX Sprint 280
Description
Steps to Repro
1. Run 6.6.5 longevity test for 5-6 days.
./sequoia -client 172.23.104.254:2375 -provider file:centos_second_cluster.yml -test tests/integration/test_allFeatures_madhatter_durability.yml -scope tests/integration/scope_Xattrs_Madhatter.yml -scale 3 -repeat 0 -log_level 0 -version 6.6.5-10076 -skip_setup=false -skip_test=false -skip_teardown=true -skip_cleanup=false -continue=false -collect_on_error=false -stop_on_error=false -duration=604800 -show_topology=true
2. Perform an online upgrade to 7.1 using swap rebalance and graceful failover/recovery strategies.
3. Failed over 2 analytics nodes (172.23.120.77 and 172.23.120.73).
Starting failover of nodes ['ns_1@172.23.120.77']. Operation Id = 937473f9db74d764c3af79f332cf9fc5
Starting failover of nodes ['ns_1@172.23.120.73']. Operation Id = 5d6af9c1f2c1162b5014b499de159f3b
4. Performed a full recovery of those 2 nodes and started a rebalance.
172.23.120.197 9:28:24 PM 23 Jan, 2022
Starting rebalance, KeepNodes = ['ns_1@172.23.106.134','ns_1@172.23.106.136',
'ns_1@172.23.106.137','ns_1@172.23.106.138',
'ns_1@172.23.120.197','ns_1@172.23.120.58',
'ns_1@172.23.120.73','ns_1@172.23.120.74',
'ns_1@172.23.120.75','ns_1@172.23.120.77',
'ns_1@172.23.120.81','ns_1@172.23.120.86',
'ns_1@172.23.121.118','ns_1@172.23.121.77',
'ns_1@172.23.123.26','ns_1@172.23.123.31',
'ns_1@172.23.123.32','ns_1@172.23.123.33',
'ns_1@172.23.96.14','ns_1@172.23.96.243',
'ns_1@172.23.96.254','ns_1@172.23.97.105',
'ns_1@172.23.97.110','ns_1@172.23.97.112',
'ns_1@172.23.97.148','ns_1@172.23.97.149',
'ns_1@172.23.97.151'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 1601627790a0b959e180c49e12e57399
Rebalance failed as shown below.
172.23.120.73 9:30:51 PM 23 Jan, 2022
Analytics Service unable to successfully rebalance 1a44f742b97e16491ad11c1f826ffd7c due to 'java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [e4922efbaa722c9ef303d72aee04090c], state: UNUSABLE)'; see analytics_info.log for details
172.23.120.73 analytics_info.log
2022-01-23T23:33:08.670-08:00 ERRO CBAS.rebalance.Rebalance [Executor-16:ClusterController] Rebalance 1cd8195171ae0b6c9d9c941492bdb344 failed
java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [e4922efbaa722c9ef303d72aee04090c], state: UNUSABLE)
at com.couchbase.analytics.control.rebalance.Rebalance.ensureNodesClusterActive(Rebalance.java:526) ~[cbas-server-7.1.0-2117.jar:7.1.0-2117]
at com.couchbase.analytics.control.rebalance.Rebalance.adjustClusterBeforeRebalance(Rebalance.java:683) ~[cbas-server-7.1.0-2117.jar:7.1.0-2117]
at com.couchbase.analytics.control.rebalance.Rebalance.doRebalance(Rebalance.java:205) ~[cbas-server-7.1.0-2117.jar:7.1.0-2117]
at com.couchbase.analytics.control.rebalance.Rebalance.doCall(Rebalance.java:166) [cbas-server-7.1.0-2117.jar:7.1.0-2117]
at com.couchbase.analytics.control.rebalance.Rebalance.doCall(Rebalance.java:84) [cbas-server-7.1.0-2117.jar:7.1.0-2117]
at com.couchbase.analytics.runtime.WriteLockCallable.call(WriteLockCallable.java:27) [cbas-connector-7.1.0-2117.jar:7.1.0-2117]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
2022-01-23T23:33:08.671-08:00 WARN CBAS.rebalance.Rebalance [Executor-16:ClusterController] exit Rebalance 1cd8195171ae0b6c9d9c941492bdb344
2022-01-23T23:33:08.671-08:00 INFO CBAS.rebalance.RebalanceProgress [Executor-18:ClusterController] dataset size fetcher interrupted
2022-01-23T23:33:08.904-08:00 ERRO CBAS.servlet.RebalanceServlet [HttpExecutor(port:9111)-15] Rebalance 1cd8195171ae0b6c9d9c941492bdb344 failed
java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [e4922efbaa722c9ef303d72aee04090c], state: UNUSABLE)
at com.couchbase.analytics.control.rebalance.Rebalance.ensureNodesClusterActive(Rebalance.java:526) ~[cbas-server-7.1.0-2117.jar:7.1.0-2117]
at com.couchbase.analytics.control.rebalance.Rebalance.adjustClusterBeforeRebalance(Rebalance.java:683) ~[cbas-server-7.1.0-2117.jar:7.1.0-2117]
at com.couchbase.analytics.control.rebalance.Rebalance.doRebalance(Rebalance.java:205) ~[cbas-server-7.1.0-2117.jar:7.1.0-2117]
at com.couchbase.analytics.control.rebalance.Rebalance.doCall(Rebalance.java:166) ~[cbas-server-7.1.0-2117.jar:7.1.0-2117]
at com.couchbase.analytics.control.rebalance.Rebalance.doCall(Rebalance.java:84) ~[cbas-server-7.1.0-2117.jar:7.1.0-2117]
at com.couchbase.analytics.runtime.WriteLockCallable.call(WriteLockCallable.java:27) ~[cbas-connector-7.1.0-2117.jar:7.1.0-2117]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
2022-01-23T23:33:08.921-08:00 INFO CBAS.cbas requesting isBalanced for 1cd8195171ae0b6c9d9c941492bdb344 from driver
2022-01-23T23:33:08.923-08:00 INFO CBAS.servlet.RebalanceServlet [HttpExecutor(port:9111)-1] +post request: {"nodes":[{"nodeId":"4914cd856897180e302068cb33eb6642","priority":1970329131941888,"opaque":{"cbas-version":"7.1.0-2117","cc-http-port":"9111","controller-id":"43","host":"172.23.106.138","ns-server-port":"8091","num-iodevices":"8","starting-partition-id":"344","svc-http-port":"8095"}},{"nodeId":"5ffd65353607e8c3ef50cd4240c725ed","priority":1970329131942144,"opaque":{"cbas-version":"6.6.5-10076","cc-http-port":"9111","controller-id":"41","host":"172.23.120.73","ns-server-port":"8091","num-iodevices":"8","starting-partition-id":"328","svc-http-port":"8095"}},{"nodeId":"e4922efbaa722c9ef303d72aee04090c","priority":1970329131941888,"opaque":{"cbas-version":"6.6.5-10076","cc-http-port":"9111","controller-id":"42","host":"172.23.120.77","ns-server-port":"8091","num-iodevices":"8","starting-partition-id":"336","svc-http-port":"8095"}}],"id":"1cd8195171ae0b6c9d9c941492bdb344","type":"topology-change-rebalance","ccNodeId":"5ffd65353607e8c3ef50cd4240c725ed","metadataNodeId":"5ffd65353607e8c3ef50cd4240c725ed","metadataPartition":-1,"rev":507,"configVersion":1,"balanceState":"unknown","keepNodesUpdated":false,"keepNodes":["4914cd856897180e302068cb33eb6642","5ffd65353607e8c3ef50cd4240c725ed","e4922efbaa722c9ef303d72aee04090c"],"inPlaceNumReplicas":0}
2022-01-23T23:33:08.923-08:00 INFO CBAS.rebalance.TopologyManager [HttpExecutor(port:9111)-1] found missing masters in partitions topology
2022-01-23T23:33:08.923-08:00 INFO CBAS.rebalance.Rebalance [HttpExecutor(port:9111)-1] latest partitions topology is not balanced
2022-01-23T23:33:08.924-08:00 INFO CBAS.cbas setting balanced state to unbalanced for 1cd8195171ae0b6c9d9c941492bdb344
2022-01-23T23:33:08.926-08:00 INFO CBAS.cbas updating balance state unbalanced for 1cd8195171ae0b6c9d9c941492bdb344
Retried the failed rebalance multiple times. The rebalance still did not succeed, and the Rebalance button is still enabled.
cbcollect_info attached.
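For triage, the missing node ID in the exception can be matched against the `+post request` payload logged by RebalanceServlet to see which host and `cbas-version` it reports. The following is a minimal sketch, not part of the product or the test framework: the helper names `missing_node_ids` and `describe_missing` are hypothetical, and `REQUEST` is a hand-trimmed copy of the logged payload that keeps only the fields used here. Whether the `cbas-version` value in the payload reflects the version actually running on the node is an assumption this sketch does not verify.

```python
import json
import re

# Error text copied from the rebalance failure in analytics_info.log.
ERROR = ("timed out waiting for all nodes to join & cluster active "
         "(missing nodes: [e4922efbaa722c9ef303d72aee04090c], state: UNUSABLE)")

# Hand-trimmed copy of the logged "+post request" payload (assumption:
# only nodeId, host, and cbas-version are needed for this check).
REQUEST = json.loads("""
{"nodes": [
  {"nodeId": "4914cd856897180e302068cb33eb6642",
   "opaque": {"cbas-version": "7.1.0-2117", "host": "172.23.106.138"}},
  {"nodeId": "5ffd65353607e8c3ef50cd4240c725ed",
   "opaque": {"cbas-version": "6.6.5-10076", "host": "172.23.120.73"}},
  {"nodeId": "e4922efbaa722c9ef303d72aee04090c",
   "opaque": {"cbas-version": "6.6.5-10076", "host": "172.23.120.77"}}
 ],
 "ccNodeId": "5ffd65353607e8c3ef50cd4240c725ed"}
""")


def missing_node_ids(error_text):
    """Extract the node IDs listed after 'missing nodes:' in the error text."""
    match = re.search(r"missing nodes: \[([^\]]*)\]", error_text)
    if not match:
        return []
    return [part.strip() for part in match.group(1).split(",") if part.strip()]


def describe_missing(error_text, request):
    """Return (nodeId, host, cbas-version) for each missing node.

    Host and version are None when the node ID is not in the payload.
    """
    by_id = {node["nodeId"]: node["opaque"] for node in request["nodes"]}
    rows = []
    for node_id in missing_node_ids(error_text):
        opaque = by_id.get(node_id, {})
        rows.append((node_id, opaque.get("host"), opaque.get("cbas-version")))
    return rows


if __name__ == "__main__":
    for node_id, host, version in describe_missing(ERROR, REQUEST):
        print(f"{node_id} host={host} cbas-version={version}")
```

With the values from this ticket, the missing node resolves to 172.23.120.77, one of the two nodes that were failed over and recovered, and the payload lists it with `cbas-version` 6.6.5-10076.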