
Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: 7.0.0
Affects Version/s: Cheshire-Cat
Component/s: analytics
Labels:
- triaged
- upgrade
Environment:
6.6.2-9588 -> 7.0.0-5141

Triage:
Untriaged
Operating System:
Centos 64-bit
Story Points:
1
Is this a Regression?:
Yes
Sprint:
CX Sprint 247

Description

Scripts to Repro
1. Run the 6.6.2 longevity test for 3 days.

./sequoia -client 172.23.96.162:2375 -provider file:centos_third_cluster.yml -test tests/integration/test_allFeatures_madhatter_durability.yml -scope tests/integration/scope_Xattrs_Madhatter.yml -scale 3 -repeat 0 -log_level 0 -version 6.6.2-9588 -skip_setup=false -skip_test=false -skip_teardown=true -skip_cleanup=false -continue=false -collect_on_error=false -stop_on_error=false -duration=604800 -show_topology=true

2. The cluster had 27 nodes at the end of the test.
3. Added six 7.0.0 nodes (172.23.105.102, 172.23.105.62, 172.23.106.232, 172.23.106.239, 172.23.106.37, 172.23.106.246) and rebalanced them in, then removed six 6.6.2 nodes (172.23.110.75, 172.23.110.76, 172.23.105.61, 172.23.106.191, 172.23.106.209, 172.23.106.70) and rebalanced them out.
4. Failed over 6 nodes, then did graceful failover + recovery + rebalance.
5. Then swap-rebalanced 6 nodes (2 data + 2 index + 1 eventing + 1 analytics), as shown below.

ns_1@172.23.105.102 11:42:57 PM 11 May, 2021

Starting rebalance, KeepNodes = ['ns_1@172.23.104.15','ns_1@172.23.104.214',
'ns_1@172.23.104.232','ns_1@172.23.104.244',
'ns_1@172.23.104.245','ns_1@172.23.105.102',
'ns_1@172.23.105.109','ns_1@172.23.105.112',
'ns_1@172.23.105.118','ns_1@172.23.105.164',
'ns_1@172.23.105.61','ns_1@172.23.105.62',
'ns_1@172.23.105.90','ns_1@172.23.105.93',
'ns_1@172.23.106.117','ns_1@172.23.106.191',
'ns_1@172.23.106.207','ns_1@172.23.106.209',
'ns_1@172.23.106.232','ns_1@172.23.106.239',
'ns_1@172.23.106.246','ns_1@172.23.106.32',
'ns_1@172.23.106.37','ns_1@172.23.106.70',
'ns_1@172.23.110.75','ns_1@172.23.110.76'], EjectNodes = ['ns_1@172.23.106.54',
'ns_1@172.23.105.210',
'ns_1@172.23.105.25',
'ns_1@172.23.105.86',
'ns_1@172.23.105.206',
'ns_1@172.23.106.225'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 7e7071f79333e252943a2259497d743d
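
When comparing rebalance attempts, it can help to pull the node lists out of a "Starting rebalance" entry rather than reading them by eye. The following is a minimal sketch, assuming the entry text has the `KeepNodes = [...]` / `EjectNodes = [...]` shape shown above; the function name `parse_rebalance_start` is hypothetical and is not part of any Couchbase tool.

```python
import re


def parse_rebalance_start(entry):
    """Extract the KeepNodes and EjectNodes lists from a 'Starting rebalance' entry.

    Assumes the textual layout seen in this report: each list is written as
    Key = ['node', 'node', ...], possibly wrapped across several lines.
    """
    result = {}
    for key in ("KeepNodes", "EjectNodes"):
        # Lazily match everything between the first '[' after the key and the next ']'.
        match = re.search(key + r"\s*=\s*\[(.*?)\]", entry, re.DOTALL)
        # Node names are single-quoted Erlang atoms such as 'ns_1@172.23.104.15'.
        result[key] = re.findall(r"'([^']+)'", match.group(1)) if match else []
    return result
```

Applied to the entry above, this should yield 26 nodes under `KeepNodes` and 6 under `EjectNodes`, which matches a swap rebalance of 6 nodes.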

The above rebalance failed as shown below. This is related to ~~MB-46246~~.
ns_1@172.23.105.102 12:09:57 AM 12 May, 2021

Rebalance exited with reason {service_rebalance_failed,eventing,
{agent_died,<31276.23862.7>,
{lost_connection,
{'ns_1@172.23.106.70',shutdown}}}}.

Rebalance Operation Id = 7e7071f79333e252943a2259497d743d

I then retried the failed rebalance.
ns_1@172.23.105.102 12:25:53 AM 12 May, 2021

Starting rebalance, KeepNodes = ['ns_1@172.23.104.15','ns_1@172.23.104.214',
'ns_1@172.23.104.232','ns_1@172.23.104.244',
'ns_1@172.23.104.245','ns_1@172.23.105.102',
'ns_1@172.23.105.109','ns_1@172.23.105.112',
'ns_1@172.23.105.118','ns_1@172.23.105.164',
'ns_1@172.23.105.61','ns_1@172.23.105.62',
'ns_1@172.23.105.90','ns_1@172.23.105.93',
'ns_1@172.23.106.117','ns_1@172.23.106.191',
'ns_1@172.23.106.207','ns_1@172.23.106.209',
'ns_1@172.23.106.232','ns_1@172.23.106.239',
'ns_1@172.23.106.246','ns_1@172.23.106.32',
'ns_1@172.23.106.37','ns_1@172.23.106.70',
'ns_1@172.23.110.75','ns_1@172.23.110.76'], EjectNodes = ['ns_1@172.23.106.54',
'ns_1@172.23.105.210',
'ns_1@172.23.105.25',
'ns_1@172.23.105.86',
'ns_1@172.23.105.206',
'ns_1@172.23.106.225'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = e5d19839baa473b6d0c1155448d81eeb

This rebalance hung at the indexing service for more than 6 hours, stuck at 53.69318181818181%. See ~~MB-46274~~ for more details.

To proceed with the upgrade of the entire cluster, I stopped the above rebalance and started it again. This retry failed as shown below.

ns_1@172.23.105.102 7:58:38 PM 12 May, 2021

Rebalance exited with reason {service_rebalance_failed,cbas,
{worker_died,
{'EXIT',<0.25904.1315>,
{rebalance_failed,
{service_error,
<<"Rebalance f4cda0e6b5a1cc69f95ea635aaaf4942 failed: timed out waiting for all nodes to join & cluster active (missing nodes: [e6f0383d4902ece226bc1f2329d23993], state: ACTIVE)">>}}}}}.

Rebalance Operation Id = a5f54b861372ce2c5a86d6a1f34d8daa

At exactly the same time, the analytics service reported the failure shown below, which I believe caused the above rebalance to fail.
ns_1@172.23.106.209 7:58:38 PM 12 May, 2021

Analytics Service unable to successfully rebalance f4cda0e6b5a1cc69f95ea635aaaf4942 due to 'java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [e6f0383d4902ece226bc1f2329d23993], state: ACTIVE)'; see analytics_info.log for details
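
For triage across the linked tickets, the rebalance id and the missing node ids can be extracted from this message mechanically. Below is a minimal sketch, assuming the message wording is exactly as quoted in this report (other builds may phrase it differently); the helper name `parse_analytics_timeout` is hypothetical.

```python
import re

# Pattern for the timeout text quoted in this report. The wording is an
# assumption taken from the log lines above, not from the analytics source.
TIMEOUT_RE = re.compile(
    r"timed out waiting for all nodes to join & cluster active "
    r"\(missing nodes: \[(?P<nodes>[^\]]*)\], state: (?P<state>\w+)\)"
)
# Rebalance ids in this report are 32 lowercase hex characters.
REBALANCE_ID_RE = re.compile(r"[Rr]ebalance (?P<id>[0-9a-f]{32})")


def parse_analytics_timeout(message):
    """Return the rebalance id, missing node ids, and cluster state, or None."""
    match = TIMEOUT_RE.search(message)
    if match is None:
        return None
    rebalance = REBALANCE_ID_RE.search(message)
    nodes = [n.strip() for n in match.group("nodes").split(",") if n.strip()]
    return {
        "rebalance_id": rebalance.group("id") if rebalance else None,
        "missing_nodes": nodes,
        "state": match.group("state"),
    }
```

The missing node id returned here is an analytics node id, not an `ns_1@...` hostname, so mapping it back to a machine still requires the analytics logs in the attached cbcollect_info.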

cbcollect_info attached. This was not seen in the upgrade from 6.6.2-9588 to 7.0.0-5033.

Attachments


rebalanceReport.json
4.89 MB
12/May/21 10:19 PM

Issue Links

backports to

MB-46781 [BP 6.6.3][system test upgrade] : Analytics rebalance fails with "java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active" during upgrade from 6.6.2 -> 7.0.0

Closed

duplicates

MB-45869 [System Test][Analytics] Rebalance failed with error - java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [xxxx], state: ACTIVE)

Closed

MB-46782 [BP 6.6.3][System Test][Analytics] Rebalance failed with error - java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [xxxx], state: ACTIVE)

Closed

relates to

MB-46865 [BP 6.6.3] Unhelpful error message on: timed out waiting for all nodes to join & cluster active (missing nodes: [e6f0383d4902ece226bc1f2329d23993]

Closed

MB-46293 [CX] Unhelpful error message on: timed out waiting for all nodes to join & cluster active (missing nodes: [e6f0383d4902ece226bc1f2329d23993]

Closed

[system test upgrade] : Analytics rebalance fails with "java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active" during upgrade from 6.6.2 -> 7.0.0
