[high-bucket] - 30 multi bucket test rebalance failed with buckets_cleanup_failed error

Description

Environment: 7.0.0-5295
Test : 30 bucket test with all the components 
Failed at : Rebalance step 
Error message: 

"completionMessage":"Rebalance exited with reason {buckets_cleanup_failed,['ns_1@172.23.96.20']}."}

Link to the job : http://perf.jenkins.couchbase.com/view/Eventing/job/themis_multibucket/102/ 

Steps of the test :

  1. Load the buckets with documents 

  2. Create n1ql indexes 

  3. Initialise XDCR (init_only_xdcr() )

  4. Create the eventing functions

  5. Create FTS indexes

  6. Create the Analytics dataset

  7. Run rebalance for each phase as follows:

    1. KV rebalance 

      1. Rebalance in with mutations

      2. Rebalance swap 

      3. Rebalance out 

    2. Index rebalance

      1. Rebalance in 

      2. Rebalance swap

      3. Rebalance Out 

    3. Eventing rebalance

      1. Rebalance in 

      2. Rebalance swap

      3. Rebalance Out 

    4. CBAS rebalance 

      1. Rebalance in 

      2. Rebalance swap

      3. Rebalance Out 

  8. Backup

  9. FTS swap rebalance

The test failed while the Eventing swap rebalance (step 7.3.2 above) was being executed.
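The rebalance ordering in step 7 can be written down as a small schedule, which makes it easy to refer to a failing step by its number. This is an illustrative sketch only: the service and action labels and the `rebalance_schedule` helper are assumptions made here, not names taken from the test harness.

```python
# Hypothetical labels for the four phases and three actions listed in step 7.
# Note: in the ticket, only the KV "rebalance in" is run with mutations.
SERVICES = ["kv", "index", "eventing", "cbas"]
ACTIONS = ["in", "swap", "out"]


def rebalance_schedule():
    """Return (step_label, service, action) tuples in the order of step 7."""
    schedule = []
    for s_idx, service in enumerate(SERVICES, start=1):
        for a_idx, action in enumerate(ACTIONS, start=1):
            schedule.append((f"7.{s_idx}.{a_idx}", service, action))
    return schedule


schedule = rebalance_schedule()
# Locate the step that failed in this ticket: the Eventing swap rebalance.
failed = next(step for step in schedule if step[1:] == ("eventing", "swap"))
print(failed)
```

Under these assumptions the schedule has 12 entries, and the Eventing swap rebalance is the eighth of them.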

Cluster setup and the cluster details are mentioned in the screenshot attached below.

Environment

None

Link to Log File, atop/blg, CBCollectInfo, Core dump

https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-102/172.23.96.15.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-102/172.23.96.19.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-102/172.23.96.20.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-102/172.23.96.23.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-102/172.23.97.177.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-102/172.23.99.157.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-102/172.23.99.158.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-102/172.23.99.159.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-102/172.23.99.160.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-102/172.23.99.161.zip

Release Notes Description

None

Activity

Jyotsna Nayak April 22, 2022 at 5:53 PM
Edited

Analysis after increasing the sleep time:

I reran the test after increasing the sleep between the rebalances from 1 hour to:
1. 2 hours (the test failed with 6 CBAS mutations left to catch up; link to job: here)
Error message:

{"stageInfo":{"analytics":{"totalProgress":5.700000000000002e-13,"perNodeProgress":{"ns_1@172.23.99.160":5.700000000000002e-15,"ns_1@172.23.96.23":5.700000000000002e-15},"startTime":"2022-04-12T20:26:36.001-07:00","completedTime":false,"timeTaken":56506},"eventing":{"startTime":false,"completedTime":false,"timeTaken":false},"search":{"totalProgress":100,"perNodeProgress":{"ns_1@172.23.96.20":1},"startTime":"2022-04-12T20:26:32.355-07:00","completedTime":"2022-04-12T20:26:32.869-07:00","timeTaken":514},"index":{"totalProgress":100,"perNodeProgress":{"ns_1@172.23.96.15":1,"ns_1@172.23.96.19":1},"startTime":"2022-04-12T20:26:32.869-07:00","completedTime":"2022-04-12T20:26:36.001-07:00","timeTaken":3132},"data":{"totalProgress":100,"perNodeProgress":{"ns_1@172.23.99.157":1,"ns_1@172.23.99.158":1,"ns_1@172.23.99.159":1},"startTime":"2022-04-12T20:26:22.889-07:00","completedTime":"2022-04-12T20:26:32.355-07:00","timeTaken":9466},"query":{"startTime":false,"completedTime":false,"timeTaken":false}},"rebalanceId":"fef9a523cd142ca550b5671cb67f02ec","nodesInfo":{"active_nodes":["ns_1@172.23.99.157","ns_1@172.23.99.158","ns_1@172.23.99.159","ns_1@172.23.96.19","ns_1@172.23.96.15","ns_1@172.23.97.177","ns_1@172.23.96.23","ns_1@172.23.96.20","ns_1@172.23.99.160"],"keep_nodes":["ns_1@172.23.99.157","ns_1@172.23.99.158","ns_1@172.23.99.159","ns_1@172.23.96.19","ns_1@172.23.96.15","ns_1@172.23.97.177","ns_1@172.23.96.23","ns_1@172.23.96.20","ns_1@172.23.99.160"],"eject_nodes":[],"delta_nodes":[],"failed_nodes":[]},"masterNode":"ns_1@172.23.99.157","startTime":"2022-04-12T20:26:22.880-07:00","completedTime":"2022-04-12T20:27:32.508-07:00","timeTaken":69628,"completionMessage":"Rebalance exited with reason {service_rebalance_failed,cbas,\n                              {worker_died,\n                               {'EXIT',<0.23164.614>,\n                                {rebalance_failed,\n                                 {service_error,\n                                  <<\"Rebalance cf90e012469a96b7555ad9eb9a0902cc failed: CBAS0001: Analytics collections in different partitions have different DCP states. Mutations needed to catch up = 6. User action: Try again later\">>}}}}}."}

2. 3 hours (the test failed with 1 CBAS mutation left to catch up; link to the job: here)
Error message:
{"stageInfo":{"analytics":{"totalProgress":5.729979539608404,"perNodeProgress":{"ns_1@172.23.99.160":0.05729979539608404,"ns_1@172.23.96.23":0.05729979539608404},"startTime":"2022-04-21T22:45:58.519-07:00","completedTime":false,"timeTaken":481388},"eventing":{"startTime":false,"completedTime":false,"timeTaken":false},"search":{"totalProgress":100,"perNodeProgress":{"ns_1@172.23.96.20":1},"startTime":"2022-04-21T22:45:54.745-07:00","completedTime":"2022-04-21T22:45:55.266-07:00","timeTaken":520},"index":{"totalProgress":100,"perNodeProgress":{"ns_1@172.23.96.15":1,"ns_1@172.23.96.19":1},"startTime":"2022-04-21T22:45:55.266-07:00","completedTime":"2022-04-21T22:45:58.519-07:00","timeTaken":3253},"data":{"totalProgress":100,"perNodeProgress":{"ns_1@172.23.99.157":1,"ns_1@172.23.99.158":1,"ns_1@172.23.99.159":1},"startTime":"2022-04-21T22:45:45.579-07:00","completedTime":"2022-04-21T22:45:54.745-07:00","timeTaken":9166},"query":{"startTime":false,"completedTime":false,"timeTaken":false}},"rebalanceId":"a484886399b811651e3c3a8386bdb95c","nodesInfo":{"active_nodes":["ns_1@172.23.99.157","ns_1@172.23.99.158","ns_1@172.23.99.159","ns_1@172.23.96.19","ns_1@172.23.96.15","ns_1@172.23.97.177","ns_1@172.23.96.23","ns_1@172.23.96.20","ns_1@172.23.99.160"],"keep_nodes":["ns_1@172.23.99.157","ns_1@172.23.99.158","ns_1@172.23.99.159","ns_1@172.23.96.19","ns_1@172.23.96.15","ns_1@172.23.97.177","ns_1@172.23.96.23","ns_1@172.23.96.20","ns_1@172.23.99.160"],"eject_nodes":[],"delta_nodes":[],"failed_nodes":[]},"masterNode":"ns_1@172.23.99.157","startTime":"2022-04-21T22:45:45.574-07:00","completedTime":"2022-04-21T22:53:59.906-07:00","timeTaken":494332,"completionMessage":"Rebalance exited with reason {service_rebalance_failed,cbas,\n                              {worker_died,\n                               {'EXIT',<0.17599.784>,\n                                {rebalance_failed,\n                                 {service_error,\n                                  <<\"Rebalance 861ea35e761c76836acfa59ee14411da failed: CBAS0001: Analytics collections in different partitions have different DCP states. Mutations needed to catch up = 1. User action: Try again later\">>}}}}}."}
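When comparing reruns, the two facts of interest in each report are which service failed and how many mutations were still pending. A minimal sketch of pulling both out of the `completionMessage` string is shown here; the helper name `summarize_failure` and the regular expressions are assumptions based only on the message format visible in this ticket, and the sample message is abbreviated.

```python
import re

# Patterns inferred from the completionMessage text quoted in this ticket.
SERVICE_RE = re.compile(r"\{service_rebalance_failed,(\w+),")
CATCH_UP_RE = re.compile(r"Mutations needed to catch up = (\d+)")


def summarize_failure(completion_message):
    """Extract the failing service and pending mutation count, if present."""
    service = SERVICE_RE.search(completion_message)
    pending = CATCH_UP_RE.search(completion_message)
    return {
        "service": service.group(1) if service else None,
        "pending_mutations": int(pending.group(1)) if pending else None,
    }


# Abbreviated form of the message from the 3-hour rerun.
message = (
    "Rebalance exited with reason {service_rebalance_failed,cbas,\n"
    "{worker_died, CBAS0001: Analytics collections in different partitions "
    "have different DCP states. Mutations needed to catch up = 1. "
    "User action: Try again later"
)
summary = summarize_failure(message)
print(summary)
```

A message in a different format, such as the original `buckets_cleanup_failed` error, yields `None` for both fields rather than raising.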

Wayne Siu April 12, 2022 at 3:35 AM


I'm closing this ticket based on the latest updates (the originally reported issue is no longer observed).
Please open a new ticket should there be a new issue from the re-run with the new sleep time. Thanks.

Jyotsna Nayak April 6, 2022 at 4:07 PM

The test has run end to end, and the cluster appeared balanced in the UI a few minutes after the run completed.
Link to the job: http://perf.jenkins.couchbase.com/job/themis_multibucket/121/
I will rerun the test after increasing the sleep time between the rebalances.
The issue for which this bug was initially filed is no longer observed.

Murtadha Al Hubail March 31, 2022 at 1:10 PM

This is expected when the DCP stream is disconnected from Analytics ungracefully (e.g. as a result of a KV topology change). As the rebalance failure message suggests, some data partitions are 1738 mutations behind. It usually takes less than a minute for all partitions to catch up to the same DCP state. If you try the rebalance after a minute or so, the rebalance should proceed.
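The suggestion above, waiting about a minute and retrying, could be automated in a test harness along the following lines. This is a sketch under stated assumptions: `start_rebalance` is a hypothetical callable that triggers a rebalance and returns the `completionMessage` string (empty on success), and the retry count and wait time are illustrative defaults, not values from the test.

```python
import time


def rebalance_with_retry(start_rebalance, retries=3, wait_seconds=60,
                         sleep=time.sleep):
    """Retry a rebalance while it fails with CBAS0001; return the attempt count.

    start_rebalance is a hypothetical callable returning the completion
    message of the rebalance, or an empty string on success.
    """
    message = ""
    for attempt in range(1, retries + 1):
        message = start_rebalance()
        if not message:
            return attempt  # rebalance succeeded
        if "CBAS0001" not in message:
            # Any other failure is not the transient DCP catch-up case.
            raise RuntimeError(message)
        if attempt < retries:
            sleep(wait_seconds)  # give the partitions time to catch up
    raise RuntimeError(
        f"rebalance still failing after {retries} attempts: {message}"
    )
```

Only the CBAS0001 case is retried, since that is the one failure described in this ticket as transient; anything else is surfaced immediately.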

Jyotsna Nayak March 31, 2022 at 11:55 AM
Edited

I have run the test as mentioned in the comment above, with the parameter set to 2173600 bytes.
The test fails after rebalancing all the components, with the following error:
The cluster is not balanced
Upon checking the rebalance logs, this is the message printed:

{"stageInfo":{"analytics":{"totalProgress":2.484999999999952e-11,"perNodeProgress":{"ns_1@172.23.99.160":2.484999999999952e-13,"ns_1@172.23.96.23":2.484999999999952e-13},"startTime":"2022-03-30T18:04:08.826-07:00","completedTime":false,"timeTaken":2554572},"eventing":{"startTime":false,"completedTime":false,"timeTaken":false},"search":{"totalProgress":100,"perNodeProgress":{"ns_1@172.23.96.20":1},"startTime":"2022-03-30T18:04:05.482-07:00","completedTime":"2022-03-30T18:04:05.936-07:00","timeTaken":453},"index":{"totalProgress":100,"perNodeProgress":{"ns_1@172.23.96.15":1,"ns_1@172.23.96.19":1},"startTime":"2022-03-30T18:04:05.936-07:00","completedTime":"2022-03-30T18:04:08.826-07:00","timeTaken":2890},"data":{"totalProgress":100,"perNodeProgress":{"ns_1@172.23.99.157":1,"ns_1@172.23.99.158":1,"ns_1@172.23.99.159":1},"startTime":"2022-03-30T18:03:55.918-07:00","completedTime":"2022-03-30T18:04:05.482-07:00","timeTaken":9565},"query":{"startTime":false,"completedTime":false,"timeTaken":false}},"rebalanceId":"9d7d027beca1eaf5d1746604e115a43f","nodesInfo":{"active_nodes":["ns_1@172.23.99.157","ns_1@172.23.99.158","ns_1@172.23.99.159","ns_1@172.23.96.19","ns_1@172.23.96.15","ns_1@172.23.97.177","ns_1@172.23.96.23","ns_1@172.23.96.20","ns_1@172.23.99.160"],"keep_nodes":["ns_1@172.23.99.157","ns_1@172.23.99.158","ns_1@172.23.99.159","ns_1@172.23.96.19","ns_1@172.23.96.15","ns_1@172.23.97.177","ns_1@172.23.96.23","ns_1@172.23.96.20","ns_1@172.23.99.160"],"eject_nodes":[],"delta_nodes":[],"failed_nodes":[]},"masterNode":"ns_1@172.23.99.157","startTime":"2022-03-30T18:03:55.913-07:00","completedTime":"2022-03-30T18:46:43.398-07:00","timeTaken":2567486,"completionMessage":"Rebalance exited with reason {service_rebalance_failed,cbas,\n                              {worker_died,\n                               {'EXIT',<0.25460.435>,\n                                {rebalance_failed,\n                                 {service_error,\n                                  <<\"Rebalance 5692dee195b5f22cd3fb646ea3a742a8 failed: CBAS0001: Analytics collections in different partitions have different DCP states. Mutations needed to catch up = 1738. User action: Try again later\">>}}}}}."}
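The `stageInfo` section of a report like the one above also shows which service stage never finished: a stage that has a `startTime` but whose `completedTime` is `false`. The following sketch lists such stages. The helper name `unfinished_stages` is an assumption, and the sample is a trimmed copy of three stages from the report, not the full payload.

```python
import json


def unfinished_stages(report):
    """Return names of stages that started but recorded no completion time."""
    return sorted(
        name
        for name, stage in report["stageInfo"].items()
        if stage.get("startTime") and not stage.get("completedTime")
    )


# Trimmed sample: three stages taken from the report quoted in this comment.
report = json.loads("""
{"stageInfo": {
  "analytics": {"totalProgress": 2.484999999999952e-11,
                "startTime": "2022-03-30T18:04:08.826-07:00",
                "completedTime": false, "timeTaken": 2554572},
  "eventing": {"startTime": false, "completedTime": false, "timeTaken": false},
  "search": {"totalProgress": 100,
             "startTime": "2022-03-30T18:04:05.482-07:00",
             "completedTime": "2022-03-30T18:04:05.936-07:00",
             "timeTaken": 453}
}}
""")
stalled = unfinished_stages(report)
print(stalled)
```

Stages that never started, such as `eventing` in this sample, are deliberately excluded, so only the stage that was in progress when the rebalance exited is reported.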

Link to the job :  http://perf.jenkins.couchbase.com/job/themis_multibucket/121/

logs:

https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.96.15.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.96.19.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.96.20.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.96.23.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.97.177.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.99.157.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.99.158.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.99.159.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.99.160.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/172.23.99.161.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-121/tools.zip

Fixed

Details

Assignee

Jyotsna Nayak(Deactivated)

Reporter

Is this a Regression?

No

Triage

Untriaged

Story Points

1

Created June 30, 2021 at 5:11 AM
Updated October 23, 2024 at 9:52 AM
Resolved February 17, 2022 at 11:38 PM