Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: 7.1.0
Affects Version/s: 7.1.0
Component/s: analytics
Labels:
Environment: 6.6.5-10076 -> 7.1.0-2117
Triage: Untriaged
Operating System: CentOS 64-bit
Story Points: 1
Is this a Regression?: No
Sprint: CX Sprint 280

Description

Steps to Repro
1. Run 6.6.5 longevity test for 5-6 days.

./sequoia -client 172.23.104.254:2375 -provider file:centos_second_cluster.yml -test tests/integration/test_allFeatures_madhatter_durability.yml -scope tests/integration/scope_Xattrs_Madhatter.yml -scale 3 -repeat 0 -log_level 0 -version 6.6.5-10076 -skip_setup=false -skip_test=false -skip_teardown=true -skip_cleanup=false -continue=false -collect_on_error=false -stop_on_error=false -duration=604800 -show_topology=true

2. Online upgrade to 7.1 using swap rebalance and graceful failover/recovery strategies.
3. Failed over 2 analytics nodes (172.23.120.77 and 172.23.120.73).

Starting failover of nodes ['ns_1@172.23.120.77']. Operation Id = 937473f9db74d764c3af79f332cf9fc5

Starting failover of nodes ['ns_1@172.23.120.73']. Operation Id = 5d6af9c1f2c1162b5014b499de159f3b

4. Performed a full recovery of both nodes, then started a rebalance.
172.23.120.197 9:28:24 PM 23 Jan, 2022

Starting rebalance, KeepNodes = ['ns_1@172.23.106.134','ns_1@172.23.106.136',
'ns_1@172.23.106.137','ns_1@172.23.106.138',
'ns_1@172.23.120.197','ns_1@172.23.120.58',
'ns_1@172.23.120.73','ns_1@172.23.120.74',
'ns_1@172.23.120.75','ns_1@172.23.120.77',
'ns_1@172.23.120.81','ns_1@172.23.120.86',
'ns_1@172.23.121.118','ns_1@172.23.121.77',
'ns_1@172.23.123.26','ns_1@172.23.123.31',
'ns_1@172.23.123.32','ns_1@172.23.123.33',
'ns_1@172.23.96.14','ns_1@172.23.96.243',
'ns_1@172.23.96.254','ns_1@172.23.97.105',
'ns_1@172.23.97.110','ns_1@172.23.97.112',
'ns_1@172.23.97.148','ns_1@172.23.97.149',
'ns_1@172.23.97.151'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 1601627790a0b959e180c49e12e57399

Rebalance failed as shown below.
172.23.120.73 9:30:51 PM 23 Jan, 2022

Analytics Service unable to successfully rebalance 1a44f742b97e16491ad11c1f826ffd7c due to 'java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [e4922efbaa722c9ef303d72aee04090c], state: UNUSABLE)'; see analytics_info.log for details

172.23.120.73 analytics_info.log

2022-01-23T23:33:08.670-08:00 ERRO CBAS.rebalance.Rebalance [Executor-16:ClusterController] Rebalance 1cd8195171ae0b6c9d9c941492bdb344 failed

java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [e4922efbaa722c9ef303d72aee04090c], state: UNUSABLE)

        at com.couchbase.analytics.control.rebalance.Rebalance.ensureNodesClusterActive(Rebalance.java:526) ~[cbas-server-7.1.0-2117.jar:7.1.0-2117]

        at com.couchbase.analytics.control.rebalance.Rebalance.adjustClusterBeforeRebalance(Rebalance.java:683) ~[cbas-server-7.1.0-2117.jar:7.1.0-2117]

        at com.couchbase.analytics.control.rebalance.Rebalance.doRebalance(Rebalance.java:205) ~[cbas-server-7.1.0-2117.jar:7.1.0-2117]

        at com.couchbase.analytics.control.rebalance.Rebalance.doCall(Rebalance.java:166) [cbas-server-7.1.0-2117.jar:7.1.0-2117]

        at com.couchbase.analytics.control.rebalance.Rebalance.doCall(Rebalance.java:84) [cbas-server-7.1.0-2117.jar:7.1.0-2117]

        at com.couchbase.analytics.runtime.WriteLockCallable.call(WriteLockCallable.java:27) [cbas-connector-7.1.0-2117.jar:7.1.0-2117]

        at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]

        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]

        at java.lang.Thread.run(Thread.java:829) [?:?]

2022-01-23T23:33:08.671-08:00 WARN CBAS.rebalance.Rebalance [Executor-16:ClusterController] exit Rebalance 1cd8195171ae0b6c9d9c941492bdb344

2022-01-23T23:33:08.671-08:00 INFO CBAS.rebalance.RebalanceProgress [Executor-18:ClusterController] dataset size fetcher interrupted

2022-01-23T23:33:08.904-08:00 ERRO CBAS.servlet.RebalanceServlet [HttpExecutor(port:9111)-15] Rebalance 1cd8195171ae0b6c9d9c941492bdb344 failed

java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [e4922efbaa722c9ef303d72aee04090c], state: UNUSABLE)

        at com.couchbase.analytics.control.rebalance.Rebalance.ensureNodesClusterActive(Rebalance.java:526) ~[cbas-server-7.1.0-2117.jar:7.1.0-2117]

        at com.couchbase.analytics.control.rebalance.Rebalance.adjustClusterBeforeRebalance(Rebalance.java:683) ~[cbas-server-7.1.0-2117.jar:7.1.0-2117]

        at com.couchbase.analytics.control.rebalance.Rebalance.doRebalance(Rebalance.java:205) ~[cbas-server-7.1.0-2117.jar:7.1.0-2117]

        at com.couchbase.analytics.control.rebalance.Rebalance.doCall(Rebalance.java:166) ~[cbas-server-7.1.0-2117.jar:7.1.0-2117]

        at com.couchbase.analytics.control.rebalance.Rebalance.doCall(Rebalance.java:84) ~[cbas-server-7.1.0-2117.jar:7.1.0-2117]

        at com.couchbase.analytics.runtime.WriteLockCallable.call(WriteLockCallable.java:27) ~[cbas-connector-7.1.0-2117.jar:7.1.0-2117]

        at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]

        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]

        at java.lang.Thread.run(Thread.java:829) [?:?]

2022-01-23T23:33:08.921-08:00 INFO CBAS.cbas requesting isBalanced for 1cd8195171ae0b6c9d9c941492bdb344 from driver

2022-01-23T23:33:08.923-08:00 INFO CBAS.servlet.RebalanceServlet [HttpExecutor(port:9111)-1] +post request: {"nodes":[{"nodeId":"4914cd856897180e302068cb33eb6642","priority":1970329131941888,"opaque":{"cbas-version":"7.1.0-2117","cc-http-port":"9111","controller-id":"43","host":"172.23.106.138","ns-server-port":"8091","num-iodevices":"8","starting-partition-id":"344","svc-http-port":"8095"}},{"nodeId":"5ffd65353607e8c3ef50cd4240c725ed","priority":1970329131942144,"opaque":{"cbas-version":"6.6.5-10076","cc-http-port":"9111","controller-id":"41","host":"172.23.120.73","ns-server-port":"8091","num-iodevices":"8","starting-partition-id":"328","svc-http-port":"8095"}},{"nodeId":"e4922efbaa722c9ef303d72aee04090c","priority":1970329131941888,"opaque":{"cbas-version":"6.6.5-10076","cc-http-port":"9111","controller-id":"42","host":"172.23.120.77","ns-server-port":"8091","num-iodevices":"8","starting-partition-id":"336","svc-http-port":"8095"}}],"id":"1cd8195171ae0b6c9d9c941492bdb344","type":"topology-change-rebalance","ccNodeId":"5ffd65353607e8c3ef50cd4240c725ed","metadataNodeId":"5ffd65353607e8c3ef50cd4240c725ed","metadataPartition":-1,"rev":507,"configVersion":1,"balanceState":"unknown","keepNodesUpdated":false,"keepNodes":["4914cd856897180e302068cb33eb6642","5ffd65353607e8c3ef50cd4240c725ed","e4922efbaa722c9ef303d72aee04090c"],"inPlaceNumReplicas":0}

2022-01-23T23:33:08.923-08:00 INFO CBAS.rebalance.TopologyManager [HttpExecutor(port:9111)-1] found missing masters in partitions topology

2022-01-23T23:33:08.923-08:00 INFO CBAS.rebalance.Rebalance [HttpExecutor(port:9111)-1] latest partitions topology is not balanced

2022-01-23T23:33:08.924-08:00 INFO CBAS.cbas setting balanced state to unbalanced for 1cd8195171ae0b6c9d9c941492bdb344

2022-01-23T23:33:08.926-08:00 INFO CBAS.cbas updating balance state unbalanced for 1cd8195171ae0b6c9d9c941492bdb344
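For triage, it can help to map the node ID reported as missing in the exception back to a host using the rebalance post request logged above. The following is a minimal sketch, not part of any Couchbase tooling: the helper names `missing_node_ids` and `describe_missing` are hypothetical, and the JSON is a trimmed copy of the logged request that keeps only the fields used here.

```python
import json
import re

# Exception text copied from the analytics_info.log excerpt above.
ERROR_LINE = (
    "java.lang.IllegalStateException: timed out waiting for all nodes to join "
    "& cluster active (missing nodes: [e4922efbaa722c9ef303d72aee04090c], "
    "state: UNUSABLE)"
)

# Trimmed copy of the logged rebalance post request (only the fields used here).
REQUEST = json.loads("""
{"nodes": [
  {"nodeId": "4914cd856897180e302068cb33eb6642",
   "opaque": {"cbas-version": "7.1.0-2117", "host": "172.23.106.138"}},
  {"nodeId": "5ffd65353607e8c3ef50cd4240c725ed",
   "opaque": {"cbas-version": "6.6.5-10076", "host": "172.23.120.73"}},
  {"nodeId": "e4922efbaa722c9ef303d72aee04090c",
   "opaque": {"cbas-version": "6.6.5-10076", "host": "172.23.120.77"}}
]}
""")


def missing_node_ids(line):
    """Return the node IDs listed after 'missing nodes:' (hypothetical helper)."""
    match = re.search(r"missing nodes: \[([^\]]*)\]", line)
    if not match:
        return []
    return [part.strip() for part in match.group(1).split(",") if part.strip()]


def describe_missing(line, request):
    """Pair each missing node ID with the host and version from the request."""
    by_id = {node["nodeId"]: node["opaque"] for node in request["nodes"]}
    result = []
    for node_id in missing_node_ids(line):
        opaque = by_id.get(node_id, {})
        result.append((node_id, opaque.get("host"), opaque.get("cbas-version")))
    return result


for node_id, host, version in describe_missing(ERROR_LINE, REQUEST):
    print(node_id, host, version)
```

With the values from this report, the missing node ID resolves to 172.23.120.77, one of the two failed-over analytics nodes, and the request lists its `cbas-version` as 6.6.5-10076.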

Retried the failed rebalance multiple times. It did not succeed, and the Rebalance button is still enabled.

cbcollect_info attached.

Attachments

Issue Links

links to

AsterixDB commit


Activity

People

Assignee: Balakumaran Gopal
Reporter: Balakumaran Gopal
Votes: 0
Watchers: 3

Dates

Created: 23/Jan/22 11:48 PM
Updated: 06/Mar/22 8:41 PM
Resolved: 28/Jan/22 6:54 AM

Gerrit Reviews

There are no open Gerrit changes

There is 1 closed Gerrit change

MB-50545: Log changes to node active partitions

[System test upgrade] - Multi node failover + recovery fails post upgrade to 7.1 with "java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active"
