[high-bucket] - 30 multi bucket test rebalance failed with buckets_cleanup_failed error
Activity
Jyotsna Nayak April 22, 2022 at 5:53 PM (edited)
Analysis after increasing the sleep time:
I have rerun the test after increasing the sleep between the rebalances from 1 hour to:
1. 2 hours (the test failed with 6 cbas mutations left to catch up; link to the job: here)
Error message:
{"stageInfo":{"analytics":{"totalProgress":5.700000000000002e-13,"perNodeProgress":
{"ns_1@172.23.99.160":5.700000000000002e-15,"ns_1@172.23.96.23":5.700000000000002e-15}
,"startTime":"2022-04-12T20:26:36.001-07:00","completedTime":false,"timeTaken":56506},"eventing":{"startTime":false,"completedTime":false,"timeTaken":false},"search":{"totalProgress":100,"perNodeProgress":
{"ns_1@172.23.96.20":1}
,"startTime":"2022-04-12T20:26:32.355-07:00","completedTime":"2022-04-12T20:26:32.869-07:00","timeTaken":514},"index":{"totalProgress":100,"perNodeProgress":
{"ns_1@172.23.96.15":1,"ns_1@172.23.96.19":1}
,"startTime":"2022-04-12T20:26:32.869-07:00","completedTime":"2022-04-12T20:26:36.001-07:00","timeTaken":3132},"data":{"totalProgress":100,"perNodeProgress":
{"ns_1@172.23.99.157":1,"ns_1@172.23.99.158":1,"ns_1@172.23.99.159":1}
,"startTime":"2022-04-12T20:26:22.889-07:00","completedTime":"2022-04-12T20:26:32.355-07:00","timeTaken":9466},"query":{"startTime":false,"completedTime":false,"timeTaken":false}},"rebalanceId":"fef9a523cd142ca550b5671cb67f02ec","nodesInfo":
{"active_nodes":["ns_1@172.23.99.157","ns_1@172.23.99.158","ns_1@172.23.99.159","ns_1@172.23.96.19","ns_1@172.23.96.15","ns_1@172.23.97.177","ns_1@172.23.96.23","ns_1@172.23.96.20","ns_1@172.23.99.160"],"keep_nodes":["ns_1@172.23.99.157","ns_1@172.23.99.158","ns_1@172.23.99.159","ns_1@172.23.96.19","ns_1@172.23.96.15","ns_1@172.23.97.177","ns_1@172.23.96.23","ns_1@172.23.96.20","ns_1@172.23.99.160"],"eject_nodes":[],"delta_nodes":[],"failed_nodes":[]}
,"masterNode":"ns_1@172.23.99.157","startTime":"2022-04-12T20:26:22.880-07:00","completedTime":"2022-04-12T20:27:32.508-07:00","timeTaken":69628,"completionMessage":"Rebalance exited with reason {service_rebalance_failed,cbas,\n {worker_died,\n {'EXIT',<0.23164.614>,\n {rebalance_failed,\n {service_error,\n <<\"Rebalance cf90e012469a96b7555ad9eb9a0902cc failed: CBAS0001: Analytics collections in different partitions have different DCP states. Mutations needed to catch up = 6. User action: Try again later\">>}}}}}."}
2. 3 hours (the test failed with 1 cbas mutation left to catch up; link to the job: here)
Error message:
{"stageInfo":{"analytics":{"totalProgress":5.729979539608404,"perNodeProgress":
{"ns_1@172.23.99.160":0.05729979539608404,"ns_1@172.23.96.23":0.05729979539608404}
,"startTime":"2022-04-21T22:45:58.519-07:00","completedTime":false,"timeTaken":481388},"eventing":{"startTime":false,"completedTime":false,"timeTaken":false},"search":{"totalProgress":100,"perNodeProgress":
{"ns_1@172.23.96.20":1}
,"startTime":"2022-04-21T22:45:54.745-07:00","completedTime":"2022-04-21T22:45:55.266-07:00","timeTaken":520},"index":{"totalProgress":100,"perNodeProgress":
{"ns_1@172.23.96.15":1,"ns_1@172.23.96.19":1}
,"startTime":"2022-04-21T22:45:55.266-07:00","completedTime":"2022-04-21T22:45:58.519-07:00","timeTaken":3253},"data":{"totalProgress":100,"perNodeProgress":
{"ns_1@172.23.99.157":1,"ns_1@172.23.99.158":1,"ns_1@172.23.99.159":1}
,"startTime":"2022-04-21T22:45:45.579-07:00","completedTime":"2022-04-21T22:45:54.745-07:00","timeTaken":9166},"query":{"startTime":false,"completedTime":false,"timeTaken":false}},"rebalanceId":"a484886399b811651e3c3a8386bdb95c","nodesInfo":
{"active_nodes":["ns_1@172.23.99.157","ns_1@172.23.99.158","ns_1@172.23.99.159","ns_1@172.23.96.19","ns_1@172.23.96.15","ns_1@172.23.97.177","ns_1@172.23.96.23","ns_1@172.23.96.20","ns_1@172.23.99.160"],"keep_nodes":["ns_1@172.23.99.157","ns_1@172.23.99.158","ns_1@172.23.99.159","ns_1@172.23.96.19","ns_1@172.23.96.15","ns_1@172.23.97.177","ns_1@172.23.96.23","ns_1@172.23.96.20","ns_1@172.23.99.160"],"eject_nodes":[],"delta_nodes":[],"failed_nodes":[]}
,"masterNode":"ns_1@172.23.99.157","startTime":"2022-04-21T22:45:45.574-07:00","completedTime":"2022-04-21T22:53:59.906-07:00","timeTaken":494332,"completionMessage":"Rebalance exited with reason {service_rebalance_failed,cbas,\n {worker_died,\n {'EXIT',<0.17599.784>,\n {rebalance_failed,\n {service_error,\n <<\"Rebalance 861ea35e761c76836acfa59ee14411da failed: CBAS0001: Analytics collections in different partitions have different DCP states. Mutations needed to catch up = 1. User action: Try again later\">>}}}}}."}
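To compare runs, the number of outstanding mutations can be pulled out of the `completionMessage` text. The snippet below is only a rough sketch: the function name `mutations_behind` and the regular expression are assumptions based on the wording of the two messages above, not something taken from the test framework.

```python
import re

# Assumed pattern, derived from the CBAS0001 text quoted in the reports above.
CATCH_UP_RE = re.compile(
    r"CBAS0001:.*?Mutations needed to catch up = (\d+)", re.DOTALL
)


def mutations_behind(completion_message):
    """Return the reported mutation lag, or None if this is not a CBAS0001 catch-up failure."""
    match = CATCH_UP_RE.search(completion_message)
    return int(match.group(1)) if match else None


# Message text copied from the 3 hour run above.
msg = (
    "Rebalance 861ea35e761c76836acfa59ee14411da failed: CBAS0001: Analytics "
    "collections in different partitions have different DCP states. "
    "Mutations needed to catch up = 1. User action: Try again later"
)
print(mutations_behind(msg))
```

A message without the CBAS0001 text yields `None`, so the same helper can be used to tell this failure apart from other rebalance errors.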
Wayne Siu April 12, 2022 at 3:35 AM
@Jyotsna Nayak @Murtadha Al Hubail
I'm closing this ticket based on the latest updates (the originally reported issue is no longer observed).
Please open a new ticket should there be a new issue from the re-run with a new sleep time. Thanks.
Jyotsna Nayak April 6, 2022 at 4:07 PM
The test ran from end to end, and the cluster appeared balanced in the UI a few minutes after the run completed.
Link to the job: http://perf.jenkins.couchbase.com/job/themis_multibucket/121/
I will rerun the test after increasing the sleep time between the rebalances.
The issue for which this bug was initially filed is no longer observed.
Murtadha Al Hubail March 31, 2022 at 1:10 PM
@Jyotsna Nayak,
This is expected when the DCP stream is disconnected from Analytics ungracefully (e.g. as a result of a KV topology change). As the rebalance failure message suggests, some data partitions are 1738 mutations behind. It usually takes less than a minute for all partitions to catch up to the same DCP state. If you try the rebalance after a minute or so, the rebalance should proceed.
Jyotsna Nayak March 31, 2022 at 11:55 AM (edited)
@Murtadha Al Hubail, I have run the test as mentioned in the comment above, with the parameter set to 2173600 bytes.
The test fails after rebalancing all the components, with the following error:
The cluster is not balanced
On checking the rebalance logs, this is the message printed:
{"stageInfo":{"analytics":{"totalProgress":2.484999999999952e-11,"perNodeProgress":
{"ns_1@172.23.99.160":2.484999999999952e-13,"ns_1@172.23.96.23":2.484999999999952e-13}
,"startTime":"2022-03-30T18:04:08.826-07:00","completedTime":false,"timeTaken":2554572},"eventing":{"startTime":false,"completedTime":false,"timeTaken":false},"search":{"totalProgress":100,"perNodeProgress":
{"ns_1@172.23.96.20":1}
,"startTime":"2022-03-30T18:04:05.482-07:00","completedTime":"2022-03-30T18:04:05.936-07:00","timeTaken":453},"index":{"totalProgress":100,"perNodeProgress":
{"ns_1@172.23.96.15":1,"ns_1@172.23.96.19":1}
,"startTime":"2022-03-30T18:04:05.936-07:00","completedTime":"2022-03-30T18:04:08.826-07:00","timeTaken":2890},"data":{"totalProgress":100,"perNodeProgress":
{"ns_1@172.23.99.157":1,"ns_1@172.23.99.158":1,"ns_1@172.23.99.159":1}
,"startTime":"2022-03-30T18:03:55.918-07:00","completedTime":"2022-03-30T18:04:05.482-07:00","timeTaken":9565},"query":{"startTime":false,"completedTime":false,"timeTaken":false}},"rebalanceId":"9d7d027beca1eaf5d1746604e115a43f","nodesInfo":{"active_nodes":["ns_1@172.23.99.157","ns_1@172.23.99.158","ns_1@172.23.99.159","ns_1@172.23.96.19","ns_1@172.23.96.15","ns_1@172.23.97.177","ns_1@172.23.96.23","ns_1@172.23.96.20","ns_1@172.23.99.160"],"keep_nodes":["ns_1@172.23.99.157","ns_1@172.23.99.158","ns_1@172.23.99.159","ns_1@172.23.96.19","ns_1@172.23.96.15","ns_1@172.23.97.177","ns_1@172.23.96.23","ns_1@172.23.96.20","ns_1@172.23.99.160"],"eject_nodes":[],"delta_nodes":[],"failed_nodes":[]},"masterNode":"ns_1@172.23.99.157","startTime":"2022-03-30T18:03:55.913-07:00","completedTime":"2022-03-30T18:46:43.398-07:00","timeTaken":2567486,"completionMessage":"Rebalance exited with reason {service_rebalance_failed,cbas,\n {worker_died,\n {'EXIT',<0.25460.435>,\n {rebalance_failed,\n
{service_error,\n <<\"Rebalance 5692dee195b5f22cd3fb646ea3a742a8 failed: CBAS0001: Analytics collections in different partitions have different DCP states. Mutations needed to catch up = 1738. User action: Try again later\">>}
}}}}."}
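When reading a report like the one above, it helps to list the stages that started but never completed. The following is a rough sketch, assuming the report has already been parsed into a Python dict (for example with `json.loads`). The helper name `unfinished_stages` is made up for illustration, and the sample dict is a trimmed copy of the report above.

```python
def unfinished_stages(report):
    """Return the names of stages that have a startTime but no completedTime."""
    stages = report.get("stageInfo", {})
    return sorted(
        name
        for name, info in stages.items()
        if info.get("startTime") and not info.get("completedTime")
    )


# Trimmed copy of the report above (JSON false becomes Python False).
report = {
    "stageInfo": {
        "analytics": {
            "startTime": "2022-03-30T18:04:08.826-07:00",
            "completedTime": False,
        },
        "eventing": {"startTime": False, "completedTime": False},
        "search": {
            "startTime": "2022-03-30T18:04:05.482-07:00",
            "completedTime": "2022-03-30T18:04:05.936-07:00",
        },
    }
}
print(unfinished_stages(report))
```

Stages that never started, such as eventing in this report, are deliberately left out, so only the stage that stalled is reported.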
Link to the job : http://perf.jenkins.couchbase.com/job/themis_multibucket/121/
logs:
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.96.15.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.96.19.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.96.20.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.96.23.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.97.177.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.99.157.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.99.158.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.99.159.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.99.160.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.99.161.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/tools.zip
Environment: 7.0.0-5295
Test: 30-bucket test with all the components
Failed at: Rebalance step
Error message:
Link to the job : http://perf.jenkins.couchbase.com/view/Eventing/job/themis_multibucket/102/
Steps of the test:
1. Load the buckets with documents
2. Create n1ql indexes
3. Initialise XDCR (init_only_xdcr())
4. Create the eventing functions
5. Create FTS indexes
6. Create Analytics dataset
7. Run rebalance for each phase as follows:
   - KV rebalance
     - Rebalance in with mutations
     - Rebalance swap
     - Rebalance out
   - Index rebalance
     - Rebalance in
     - Rebalance swap
     - Rebalance out
   - Eventing rebalance
     - Rebalance in
     - Rebalance swap
     - Rebalance out
   - CBAS rebalance
     - Rebalance in
     - Rebalance swap
     - Rebalance out
   - Backup
   - FTS swap rebalance
The test failed while the Eventing swap rebalance was being executed (marked in red).
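For reference, the rebalance sequence listed above can be written down as data to show where in the run the failure occurred. This is only an illustrative sketch: `REBALANCE_PLAN` and `rebalance_steps` are made-up names rather than part of the test framework, and the Backup step is left out because it is not a rebalance.

```python
# Rebalance phases and operations, in the order given in the steps above.
REBALANCE_PLAN = [
    ("KV", ["in with mutations", "swap", "out"]),
    ("Index", ["in", "swap", "out"]),
    ("Eventing", ["in", "swap", "out"]),
    ("CBAS", ["in", "swap", "out"]),
    ("FTS", ["swap"]),
]


def rebalance_steps(plan):
    """Flatten the plan into an ordered list of step names."""
    return [
        f"{service} rebalance {operation}"
        for service, operations in plan
        for operation in operations
    ]


steps = rebalance_steps(REBALANCE_PLAN)
failed_at = steps.index("Eventing rebalance swap") + 1
print(f"failed at rebalance {failed_at} of {len(steps)}")
```

Counted this way, the Eventing swap rebalance is the eighth of thirteen rebalances in the run.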
Cluster setup and the cluster details are mentioned in the screenshot attached below.