Analysis after increasing the sleep duration:
I have rerun the test after increasing the sleep between the rebalances from 1 hour to:
1. 2 hours (test failed due to 6 cbas mutations left to catch up; link to the job: here)
Error message:
{"stageInfo":{"analytics":{"totalProgress":5.700000000000002e-13,"perNodeProgress":
{"ns_1@172.23.99.160":5.700000000000002e-15,"ns_1@172.23.96.23":5.700000000000002e-15}
,"startTime":"2022-04-12T20:26:36.001-07:00","completedTime":false,"timeTaken":56506},"eventing":{"startTime":false,"completedTime":false,"timeTaken":false},"search":{"totalProgress":100,"perNodeProgress":
{"ns_1@172.23.96.20":1}
,"startTime":"2022-04-12T20:26:32.355-07:00","completedTime":"2022-04-12T20:26:32.869-07:00","timeTaken":514},"index":{"totalProgress":100,"perNodeProgress":
{"ns_1@172.23.96.15":1,"ns_1@172.23.96.19":1}
,"startTime":"2022-04-12T20:26:32.869-07:00","completedTime":"2022-04-12T20:26:36.001-07:00","timeTaken":3132},"data":{"totalProgress":100,"perNodeProgress":
{"ns_1@172.23.99.157":1,"ns_1@172.23.99.158":1,"ns_1@172.23.99.159":1}
,"startTime":"2022-04-12T20:26:22.889-07:00","completedTime":"2022-04-12T20:26:32.355-07:00","timeTaken":9466},"query":{"startTime":false,"completedTime":false,"timeTaken":false}},"rebalanceId":"fef9a523cd142ca550b5671cb67f02ec","nodesInfo":
{"active_nodes":["ns_1@172.23.99.157","ns_1@172.23.99.158","ns_1@172.23.99.159","ns_1@172.23.96.19","ns_1@172.23.96.15","ns_1@172.23.97.177","ns_1@172.23.96.23","ns_1@172.23.96.20","ns_1@172.23.99.160"],"keep_nodes":["ns_1@172.23.99.157","ns_1@172.23.99.158","ns_1@172.23.99.159","ns_1@172.23.96.19","ns_1@172.23.96.15","ns_1@172.23.97.177","ns_1@172.23.96.23","ns_1@172.23.96.20","ns_1@172.23.99.160"],"eject_nodes":[],"delta_nodes":[],"failed_nodes":[]}
,"masterNode":"ns_1@172.23.99.157","startTime":"2022-04-12T20:26:22.880-07:00","completedTime":"2022-04-12T20:27:32.508-07:00","timeTaken":69628,"completionMessage":"Rebalance exited with reason {service_rebalance_failed,cbas,\n {worker_died,\n {'EXIT',<0.23164.614>,\n {rebalance_failed,\n {service_error,\n <<\"Rebalance cf90e012469a96b7555ad9eb9a0902cc failed: CBAS0001: Analytics collections in different partitions have different DCP states. Mutations needed to catch up = 6. User action: Try again later\">>}}}}}."}
2. 3 hours (test failed due to 1 cbas mutation left to catch up; link to the job: here)
Error message:
{"stageInfo":{"analytics":{"totalProgress":5.729979539608404,"perNodeProgress":
{"ns_1@172.23.99.160":0.05729979539608404,"ns_1@172.23.96.23":0.05729979539608404}
,"startTime":"2022-04-21T22:45:58.519-07:00","completedTime":false,"timeTaken":481388},"eventing":{"startTime":false,"completedTime":false,"timeTaken":false},"search":{"totalProgress":100,"perNodeProgress":
{"ns_1@172.23.96.20":1}
,"startTime":"2022-04-21T22:45:54.745-07:00","completedTime":"2022-04-21T22:45:55.266-07:00","timeTaken":520},"index":{"totalProgress":100,"perNodeProgress":
{"ns_1@172.23.96.15":1,"ns_1@172.23.96.19":1}
,"startTime":"2022-04-21T22:45:55.266-07:00","completedTime":"2022-04-21T22:45:58.519-07:00","timeTaken":3253},"data":{"totalProgress":100,"perNodeProgress":
{"ns_1@172.23.99.157":1,"ns_1@172.23.99.158":1,"ns_1@172.23.99.159":1}
,"startTime":"2022-04-21T22:45:45.579-07:00","completedTime":"2022-04-21T22:45:54.745-07:00","timeTaken":9166},"query":{"startTime":false,"completedTime":false,"timeTaken":false}},"rebalanceId":"a484886399b811651e3c3a8386bdb95c","nodesInfo":
{"active_nodes":["ns_1@172.23.99.157","ns_1@172.23.99.158","ns_1@172.23.99.159","ns_1@172.23.96.19","ns_1@172.23.96.15","ns_1@172.23.97.177","ns_1@172.23.96.23","ns_1@172.23.96.20","ns_1@172.23.99.160"],"keep_nodes":["ns_1@172.23.99.157","ns_1@172.23.99.158","ns_1@172.23.99.159","ns_1@172.23.96.19","ns_1@172.23.96.15","ns_1@172.23.97.177","ns_1@172.23.96.23","ns_1@172.23.96.20","ns_1@172.23.99.160"],"eject_nodes":[],"delta_nodes":[],"failed_nodes":[]}
,"masterNode":"ns_1@172.23.99.157","startTime":"2022-04-21T22:45:45.574-07:00","completedTime":"2022-04-21T22:53:59.906-07:00","timeTaken":494332,"completionMessage":"Rebalance exited with reason {service_rebalance_failed,cbas,\n {worker_died,\n {'EXIT',<0.17599.784>,\n {rebalance_failed,\n {service_error,\n <<\"Rebalance 861ea35e761c76836acfa59ee14411da failed: CBAS0001: Analytics collections in different partitions have different DCP states. Mutations needed to catch up = 1. User action: Try again later\">>}}}}}."}
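To compare runs without reading the whole report, the remaining mutation count can be pulled out of the completionMessage. This is only a rough sketch: the helper name mutations_left is made up for illustration, and the pattern assumes the CBAS0001 wording stays exactly as it appears in the reports above.

```python
import re

# Pattern for the CBAS0001 text as it appears in the reports above.
# Assumption: the wording of the message does not change between builds.
_CATCH_UP = re.compile(r"Mutations needed to catch up = (\d+)")


def mutations_left(completion_message: str):
    """Return the remaining mutation count, or None if the text is absent."""
    match = _CATCH_UP.search(completion_message)
    return int(match.group(1)) if match else None


# Message text copied from the 2-hour run above.
message = (
    "Rebalance cf90e012469a96b7555ad9eb9a0902cc failed: CBAS0001: "
    "Analytics collections in different partitions have different DCP states. "
    "Mutations needed to catch up = 6. User action: Try again later"
)
print(mutations_left(message))
```

A message without the CBAS0001 text yields None, so the same helper can be run over successful rebalance reports as well.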
Murtadha Hubail, I have run the test as mentioned in the comment above, with the parameter set to 2173600 bytes.
The test fails after rebalancing all the components, with the following error:
The cluster is not balanced
The rebalance logs show the following message:
{"stageInfo":{"analytics":{"totalProgress":2.484999999999952e-11,"perNodeProgress":
{"ns_1@172.23.99.160":2.484999999999952e-13,"ns_1@172.23.96.23":2.484999999999952e-13},"startTime":"2022-03-30T18:04:08.826-07:00","completedTime":false,"timeTaken":2554572},"eventing":{"startTime":false,"completedTime":false,"timeTaken":false},"search":{"totalProgress":100,"perNodeProgress":
{"ns_1@172.23.96.20":1},"startTime":"2022-03-30T18:04:05.482-07:00","completedTime":"2022-03-30T18:04:05.936-07:00","timeTaken":453},"index":{"totalProgress":100,"perNodeProgress":
{"ns_1@172.23.96.15":1,"ns_1@172.23.96.19":1},"startTime":"2022-03-30T18:04:05.936-07:00","completedTime":"2022-03-30T18:04:08.826-07:00","timeTaken":2890},"data":{"totalProgress":100,"perNodeProgress":
{"ns_1@172.23.99.157":1,"ns_1@172.23.99.158":1,"ns_1@172.23.99.159":1},"startTime":"2022-03-30T18:03:55.918-07:00","completedTime":"2022-03-30T18:04:05.482-07:00","timeTaken":9565},"query":{"startTime":false,"completedTime":false,"timeTaken":false}},"rebalanceId":"9d7d027beca1eaf5d1746604e115a43f","nodesInfo":
{"active_nodes":["ns_1@172.23.99.157","ns_1@172.23.99.158","ns_1@172.23.99.159","ns_1@172.23.96.19","ns_1@172.23.96.15","ns_1@172.23.97.177","ns_1@172.23.96.23","ns_1@172.23.96.20","ns_1@172.23.99.160"],"keep_nodes":["ns_1@172.23.99.157","ns_1@172.23.99.158","ns_1@172.23.99.159","ns_1@172.23.96.19","ns_1@172.23.96.15","ns_1@172.23.97.177","ns_1@172.23.96.23","ns_1@172.23.96.20","ns_1@172.23.99.160"],"eject_nodes":[],"delta_nodes":[],"failed_nodes":[]},"masterNode":"ns_1@172.23.99.157","startTime":"2022-03-30T18:03:55.913-07:00","completedTime":"2022-03-30T18:46:43.398-07:00","timeTaken":2567486,"completionMessage":"Rebalance exited with reason {service_rebalance_failed,cbas,\n {worker_died,\n {'EXIT',<0.25460.435>,\n {rebalance_failed,\n
{service_error,\n <<\"Rebalance 5692dee195b5f22cd3fb646ea3a742a8 failed: CBAS0001: Analytics collections in different partitions have different DCP states. Mutations needed to catch up = 1738. User action: Try again later\">>}}}}}."}
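For quick triage, the stages that started but never completed can be listed straight from the report JSON. The snippet below is a sketch under assumptions: incomplete_stages is a hypothetical helper, the input is a trimmed copy of the report above (perNodeProgress and nodesInfo left out), and timeTaken is treated as milliseconds, which matches the start and completion timestamps in the report.

```python
import json


def incomplete_stages(report: dict):
    """List (service, totalProgress, timeTaken) for stages that started
    but have no completedTime. Unstarted services are skipped."""
    pending = []
    for service, info in report["stageInfo"].items():
        started = info.get("startTime") not in (False, None)
        completed = info.get("completedTime") not in (False, None)
        if started and not completed:
            pending.append(
                (service, info.get("totalProgress"), info.get("timeTaken"))
            )
    return pending


# Trimmed copy of the report above; only three services are kept.
raw = """{"stageInfo": {
  "analytics": {"totalProgress": 2.484999999999952e-11,
                "startTime": "2022-03-30T18:04:08.826-07:00",
                "completedTime": false, "timeTaken": 2554572},
  "eventing": {"startTime": false, "completedTime": false, "timeTaken": false},
  "search": {"totalProgress": 100,
             "startTime": "2022-03-30T18:04:05.482-07:00",
             "completedTime": "2022-03-30T18:04:05.936-07:00",
             "timeTaken": 453}
}}"""

for service, progress, ms in incomplete_stages(json.loads(raw)):
    # timeTaken is assumed to be in milliseconds.
    print(f"{service}: progress={progress}, ran for {ms / 1000:.0f} s")
```

On this report only the analytics stage is listed; eventing is skipped because it never started.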
Link to the job: http://perf.jenkins.couchbase.com/job/themis_multibucket/121/
Logs:
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.96.15.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.96.19.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.96.20.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.96.23.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.97.177.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.99.157.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.99.158.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.99.159.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.99.160.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.99.161.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/tools.zip