Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: 6.6.3
Affects Version/s: 6.6.2
Component/s: eventing
Labels:
- system_test_upgrade
- upgrade
Environment: 6.6.2-9588 -> 7.0.0-5226
Triage: Untriaged
Operating System: Centos 64-bit
Story Points: 1
Is this a Regression?: Yes

Description

Steps to Repro
1. Run the following longevity test on 6.6.2 for 3-4 days:

./sequoia -client 172.23.96.162:2375 -provider file:centos_third_cluster.yml -test tests/integration/test_allFeatures_madhatter_durability.yml -scope tests/integration/scope_Xattrs_Madhatter.yml -scale 3 -repeat 0 -log_level 0 -version 6.6.2-9588 -skip_setup=false -skip_test=false -skip_teardown=true -skip_cleanup=false -continue=false -collect_on_error=false -stop_on_error=false -duration=604800 -show_topology=true

2. The cluster has 27 nodes on 6.6.2.
3. Add 6 nodes (one per service, on 7.0.0-5226), remove 6 nodes (on 6.6.2), and run a swap rebalance to upgrade the cluster.
4. Fail over 6 nodes (one per service, on 6.6.2), upgrade them, then perform a recovery and rebalance.
5. Repeat these steps for the remaining nodes in the cluster. One of the rebalances failed, as shown below.

ns_1@172.23.106.70 7:18:13 AM 26 May, 2021
Starting rebalance, KeepNodes = ['ns_1@172.23.104.15','ns_1@172.23.104.214',
'ns_1@172.23.104.232','ns_1@172.23.104.244',
'ns_1@172.23.104.245','ns_1@172.23.105.102',
'ns_1@172.23.105.109','ns_1@172.23.105.112',
'ns_1@172.23.105.118','ns_1@172.23.105.206',
'ns_1@172.23.105.210','ns_1@172.23.105.25',
'ns_1@172.23.105.29','ns_1@172.23.105.61',
'ns_1@172.23.105.86','ns_1@172.23.105.90',
'ns_1@172.23.106.117','ns_1@172.23.106.191',
'ns_1@172.23.106.207','ns_1@172.23.106.225',
'ns_1@172.23.106.232','ns_1@172.23.106.239',
'ns_1@172.23.106.246','ns_1@172.23.106.37',
'ns_1@172.23.106.54','ns_1@172.23.106.70',
'ns_1@172.23.110.75'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 57cca96fe563d50d27549ba664c85dfe

ns_1@172.23.106.70 7:53:28 AM 26 May, 2021
Rebalance exited with reason {service_rebalance_failed,eventing,
{worker_died,
{'EXIT',<0.15454.774>,
{rebalance_failed,
{service_error,
<<"eventing rebalance hasn't made progress for past 1200 secs">>}}}}}.
Rebalance Operation Id = 57cca96fe563d50d27549ba664c85dfe
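For triaging many such runs, the failing service and the stall duration can be pulled out of the rebalance failure text with a small script. The following is only a sketch: `parse_rebalance_failure` and both regular expressions are hypothetical helpers written against the message format shown in this ticket, not part of any Couchbase tooling, and other versions may format the message differently.

```python
import re

# Hypothetical patterns, written against the log text quoted in this ticket.
# Assumption: other Couchbase versions may word the failure differently.
FAILURE_RE = re.compile(
    r"Rebalance exited with reason \{(?P<reason>\w+),\s*(?P<service>\w+),"
    r'.*?<<"(?P<message>[^"]*)">>',
    re.DOTALL,
)
STALL_RE = re.compile(r"hasn't made progress for past (\d+) secs")


def parse_rebalance_failure(text):
    """Return reason, service, message and stall_secs, or None if no match."""
    match = FAILURE_RE.search(text)
    if match is None:
        return None
    info = match.groupdict()
    # The stall duration is only present for "no progress" style failures.
    stall = STALL_RE.search(info["message"])
    info["stall_secs"] = int(stall.group(1)) if stall else None
    return info


# Failure text copied from the log entry in this ticket.
SAMPLE = """Rebalance exited with reason {service_rebalance_failed,eventing,
{worker_died,
{'EXIT',<0.15454.774>,
{rebalance_failed,
{service_error,
<<"eventing rebalance hasn't made progress for past 1200 secs">>}}}}}."""

if __name__ == "__main__":
    print(parse_rebalance_failure(SAMPLE))
```

For the entry above, the helper reports the service as `eventing` and a stall of 1200 seconds, which matches the 7:18:13 AM start and 7:53:28 AM failure being more than 20 minutes apart.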

cbcollect logs will be attached shortly.
This failure was not seen on the upgrade from 6.6.2-9588 to 7.0.0-5141.

Issue Links

Clones

MB-46564 [System test]Online upgrade using graceful failover + full recovery + rebalance fails in eventing with "service_rebalance_failed,eventing, {worker_died,"

Closed

is a backport of

MB-46763 [System Test]: Eventing rebalance hung

Closed

People

Assignee: Chanabasappa Ghali
Reporter: Ankit Prabhu
Votes: 0
Watchers: 4

Dates

Due: 01/Jun/21
Created: 14/Jun/21 2:03 AM
Updated: 24/Nov/21 1:29 AM
Resolved: 23/Jun/21 2:25 AM

Gerrit Reviews

There are no open Gerrit changes. There are 2 closed Gerrit changes:

MB-46887: Return invalid feed if feed don't have connection with any kv node: Gerrit Review:

MB-46887: Return invalid feed when feed for master is not present: Gerrit Review:

[BP of MB-46564] [System test]Online upgrade using graceful failover + full recovery + rebalance fails in eventing with "service_rebalance_failed,eventing, {worker_died,"
