Details
- Bug
- Resolution: Fixed
- Critical
- 7.1.4, 7.2.0
- 7.1.4-3601 -> 7.2.0-5324
- Untriaged
- Centos 64-bit
- 0
- No
- Analytics Sprint 20
Description
Steps to Repro
1. Run a longevity test on 7.1.4 for 2 days.
./sequoia -client 172.23.104.27:2375 -provider file:centos_pine.yml -test tests/integration/neo/test_neo.yml -scope tests/integration/neo/scope_neo_magma.yml -scale 3 -repeat 0 -log_level 0 -version 7.1.4-3601 -skip_setup=false -skip_test=false -skip_teardown=true -skip_cleanup=false -continue=false -collect_on_error=false -stop_on_error=false -duration=604800 -show_topology=true
2. Upgrade to 7.2.0-5324 using an online upgrade with the failover/recovery strategy.
3. Enable CDC on all buckets and on some collections post upgrade.
I then failed over one node of each service (data, index, query, analytics, eventing, search)
and started a rebalance to remove the failed-over nodes. That rebalance failed.
172.23.120.58 1:12:55 AM 15 May, 2023
Starting rebalance, KeepNodes = ['ns_1@172.23.120.75','ns_1@172.23.120.81',
'ns_1@172.23.120.86','ns_1@172.23.121.77',
'ns_1@172.23.123.25','ns_1@172.23.123.26',
'ns_1@172.23.123.31','ns_1@172.23.123.33',
'ns_1@172.23.96.243','ns_1@172.23.96.254',
'ns_1@172.23.96.48','ns_1@172.23.97.105',
'ns_1@172.23.97.110','ns_1@172.23.97.112',
'ns_1@172.23.97.148','ns_1@172.23.97.241',
'ns_1@172.23.97.74'], EjectNodes = [], Failed over and being ejected nodes = ['ns_1@172.23.120.58',
'ns_1@172.23.120.73',
'ns_1@172.23.120.74',
'ns_1@172.23.120.77',
'ns_1@172.23.123.32',
'ns_1@172.23.96.122']; no delta recovery nodes; Operation Id = 0b87a72990070d13f578d3d9630d8b70
172.23.120.58 3:40:06 AM 15 May, 2023
Rebalance exited with reason {service_rebalance_failed,cbas,
{worker_died,
{'EXIT',<0.32352.1156>,
{rebalance_failed,
{service_error,
<<"Rebalance 7fef0ad83705736f70d24831ccdf0c6e failed: Index with resource ID 6245 already exists.">>}}}}}.
Rebalance Operation Id = 0b87a72990070d13f578d3d9630d8b70
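For triage, the failing service and the service error text can be pulled out of an exit reason like the one above. This is a minimal sketch, not Couchbase tooling: the helper name `parse_rebalance_failure` and the regular expression are assumptions made for illustration, and the pattern is only known to fit the shape of this one log entry.

```python
import re

# Hypothetical pattern: captures the atom after service_rebalance_failed
# (the service name) and the text of the first <<"...">> binary.
EXIT_REASON = re.compile(
    r'\{service_rebalance_failed,\s*(?P<service>\w+),.*?<<"(?P<message>.*?)">>',
    re.DOTALL,
)


def parse_rebalance_failure(log_text):
    """Return (service, message) from an exit-reason entry, or None if absent."""
    match = EXIT_REASON.search(log_text)
    if match is None:
        return None
    return match.group("service"), match.group("message")


# The exit reason exactly as reported in this ticket.
sample = """Rebalance exited with reason {service_rebalance_failed,cbas,
{worker_died,
{'EXIT',<0.32352.1156>,
{rebalance_failed,
{service_error,
<<"Rebalance 7fef0ad83705736f70d24831ccdf0c6e failed: Index with resource ID 6245 already exists.">>}}}}}."""

if __name__ == "__main__":
    service, message = parse_rebalance_failure(sample)
    print(service)
    print(message)
```

Here the service is `cbas` (Analytics), which matches the component this ticket was filed against.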
I retried the failed rebalance:
172.23.120.75 3:45:40 AM 15 May, 2023
Starting rebalance, KeepNodes = ['ns_1@172.23.120.75','ns_1@172.23.120.81',
'ns_1@172.23.120.86','ns_1@172.23.121.77',
'ns_1@172.23.123.25','ns_1@172.23.123.26',
'ns_1@172.23.123.31','ns_1@172.23.123.33',
'ns_1@172.23.96.243','ns_1@172.23.96.254',
'ns_1@172.23.96.48','ns_1@172.23.97.105',
'ns_1@172.23.97.110','ns_1@172.23.97.112',
'ns_1@172.23.97.148','ns_1@172.23.97.241',
'ns_1@172.23.97.74'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 3c4e20642bb3df706e7b2a68bafe3faf
I noticed the rebalance progress keeps increasing beyond 100%.
balakumaran.g@Balakumarans-MacBook-Pro-2 sequoia % curl -u Administrator:password http://172.23.97.74:8091/pools/default/rebalanceProgress | jq
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 690 100 690 0 0 1296 0 --:--:-- --:--:-- --:--:-- 1294
{
  "status": "running",
  "ns_1@172.23.97.110": {
    "progress": 0
  },
  "ns_1@172.23.120.81": {
    "progress": 1
  },
  "ns_1@172.23.97.241": {
    "progress": 1
  },
  "ns_1@172.23.123.31": {
    "progress": 1
  },
  "ns_1@172.23.97.112": {
    "progress": 1
  },
  "ns_1@172.23.96.243": {
    "progress": 1
  },
  "ns_1@172.23.123.33": {
    "progress": 1
  },
  "ns_1@172.23.97.74": {
    "progress": 1
  },
  "ns_1@172.23.96.254": {
    "progress": 0
  },
  "ns_1@172.23.120.75": {
    "progress": 0
  },
  "ns_1@172.23.123.25": {
    "progress": 0
  },
  "ns_1@172.23.97.105": {
    "progress": 1
  },
  "ns_1@172.23.120.86": {
    "progress": 2.569999999999968e-13
  },
  "ns_1@172.23.123.26": {
    "progress": 1
  },
  "ns_1@172.23.121.77": {
    "progress": 2.569999999999968e-13
  },
  "ns_1@172.23.96.48": {
    "progress": 2.569999999999968e-13
  },
  "ns_1@172.23.97.148": {
    "progress": 0
  }
}
balakumaran.g@Balakumarans-MacBook-Pro-2 sequoia %
balakumaran.g@Balakumarans-MacBook-Pro-2 sequoia % curl -u Administrator:password http://172.23.97.74:8091/pools/default/rebalanceProgress | jq
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 690 100 690 0 0 864 0 --:--:-- --:--:-- --:--:-- 863
{
  "status": "running",
  "ns_1@172.23.97.110": {
    "progress": 0
  },
  "ns_1@172.23.120.81": {
    "progress": 1
  },
  "ns_1@172.23.97.241": {
    "progress": 1
  },
  "ns_1@172.23.123.31": {
    "progress": 1
  },
  "ns_1@172.23.97.112": {
    "progress": 1
  },
  "ns_1@172.23.96.243": {
    "progress": 1
  },
  "ns_1@172.23.123.33": {
    "progress": 1
  },
  "ns_1@172.23.97.74": {
    "progress": 1
  },
  "ns_1@172.23.96.254": {
    "progress": 0
  },
  "ns_1@172.23.120.75": {
    "progress": 0
  },
  "ns_1@172.23.123.25": {
    "progress": 0
  },
  "ns_1@172.23.97.105": {
    "progress": 1
  },
  "ns_1@172.23.120.86": {
    "progress": 4.123000000000275e-13
  },
  "ns_1@172.23.123.26": {
    "progress": 1
  },
  "ns_1@172.23.121.77": {
    "progress": 4.123000000000275e-13
  },
  "ns_1@172.23.96.48": {
    "progress": 4.123000000000275e-13
  },
  "ns_1@172.23.97.148": {
    "progress": 0
  }
}
balakumaran.g@Balakumarans-MacBook-Pro-2 sequoia %
My guess is that the earlier rebalance failure is the cause of this unusual behaviour. Would this result in a hang? If so, is there a workaround?
cbcollect_info attached.
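To make the difference between the two rebalanceProgress polls above easier to see, the responses can be compared programmatically. The sketch below is illustrative only: the helper names `node_progress`, `changed_nodes` and `out_of_range` are hypothetical, the sample data is a trimmed copy of the two responses in this ticket, and treating `progress` as a fraction between 0 and 1 is an assumption based on the values shown here.

```python
import json


def node_progress(snapshot):
    """Map node name -> progress, skipping non-node keys such as "status"."""
    return {
        node: info["progress"]
        for node, info in snapshot.items()
        if isinstance(info, dict) and "progress" in info
    }


def changed_nodes(before, after):
    """Return {node: (old, new)} for nodes whose progress differs between polls."""
    prev, curr = node_progress(before), node_progress(after)
    return {
        node: (prev[node], curr[node])
        for node in sorted(prev.keys() & curr.keys())
        if prev[node] != curr[node]
    }


def out_of_range(snapshot, low=0.0, high=1.0):
    """Return nodes whose progress lies outside [low, high] (assumed fraction scale)."""
    return sorted(
        node for node, value in node_progress(snapshot).items()
        if not low <= value <= high
    )


# Trimmed copies of the two responses shown in this ticket.
first = json.loads("""{
  "status": "running",
  "ns_1@172.23.97.110": {"progress": 0},
  "ns_1@172.23.120.81": {"progress": 1},
  "ns_1@172.23.120.86": {"progress": 2.569999999999968e-13}
}""")
second = json.loads("""{
  "status": "running",
  "ns_1@172.23.97.110": {"progress": 0},
  "ns_1@172.23.120.81": {"progress": 1},
  "ns_1@172.23.120.86": {"progress": 4.123000000000275e-13}
}""")

if __name__ == "__main__":
    print(changed_nodes(first, second))
    print(out_of_range(second))
```

Applied to the full responses, the only nodes that change between polls are 172.23.120.86, 172.23.121.77 and 172.23.96.48, and the change is on the order of 1e-13; every other node stays at exactly 0 or 1.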
Attachments
Issue Links
- blocks
- MB-56955 Analytics Service unable to successfully rebalance 4c95651b68cfdfc22d0c1d306ff4d7c1 due to 'java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [453dcde5201e809268a0df89fe474ebe], state: UNUSABLE)
- Closed
- links to