Steps:
1. Create an 85-node cluster
2. Create a bucket and 20 collections
3. Start load: maxttl=60, durability=Majority
4. Rebalance in 172.23.96.89. Rebalance failed with bad_replicas, then buckets_cleanup_failed
Daniel Owen
added a comment - - edited Looking at the orchestrator (172.23.106.155) we see two different rebalance failures.
Rebalance failure 1
[user:info,2021-05-04T05:16:30.211-07:00,ns_1@172.23.106.155:<0.764.0>:ns_orchestrator:idle:773]Starting rebalance, KeepNodes = ['ns_1@172.23.106.62','ns_1@172.23.121.76',
'ns_1@172.23.123.6','ns_1@172.23.121.128',
'ns_1@172.23.120.124','ns_1@172.23.122.226',
'ns_1@172.23.121.136','ns_1@172.23.122.248',
'ns_1@172.23.107.18','ns_1@172.23.122.243',
'ns_1@172.23.121.153','ns_1@172.23.96.89',
'ns_1@172.23.121.127','ns_1@172.23.120.125',
'ns_1@172.23.120.129','ns_1@172.23.122.221',
'ns_1@172.23.121.123','ns_1@172.23.121.131',
'ns_1@172.23.122.225','ns_1@172.23.106.99',
'ns_1@172.23.107.12','ns_1@172.23.123.5',
'ns_1@172.23.107.110','ns_1@172.23.121.162',
'ns_1@172.23.122.247','ns_1@172.23.121.154',
'ns_1@172.23.120.128','ns_1@172.23.120.120',
'ns_1@172.23.123.10','ns_1@172.23.123.4',
'ns_1@172.23.120.123','ns_1@172.23.106.53',
'ns_1@172.23.121.155','ns_1@172.23.121.152',
'ns_1@172.23.107.112','ns_1@172.23.107.1',
'ns_1@172.23.121.144','ns_1@172.23.123.7',
'ns_1@172.23.106.164','ns_1@172.23.122.242',
'ns_1@172.23.107.33','ns_1@172.23.106.159',
'ns_1@172.23.121.130','ns_1@172.23.121.141',
'ns_1@172.23.106.56','ns_1@172.23.121.133',
'ns_1@172.23.122.223','ns_1@172.23.122.231',
'ns_1@172.23.107.31','ns_1@172.23.106.155',
'ns_1@172.23.121.48','ns_1@172.23.121.139',
'ns_1@172.23.121.156','ns_1@172.23.122.229',
'ns_1@172.23.122.245','ns_1@172.23.121.142',
'ns_1@172.23.106.163','ns_1@172.23.122.240',
'ns_1@172.23.121.134','ns_1@172.23.123.12',
'ns_1@172.23.121.116','ns_1@172.23.122.228',
'ns_1@172.23.120.126','ns_1@172.23.107.32',
'ns_1@172.23.120.122','ns_1@172.23.121.157',
'ns_1@172.23.107.16','ns_1@172.23.121.143',
'ns_1@172.23.122.241','ns_1@172.23.121.140',
'ns_1@172.23.121.115','ns_1@172.23.121.132',
'ns_1@172.23.122.208','ns_1@172.23.121.135',
'ns_1@172.23.120.170','ns_1@172.23.121.129',
'ns_1@172.23.122.224','ns_1@172.23.121.124',
'ns_1@172.23.121.126','ns_1@172.23.122.227',
'ns_1@172.23.106.95','ns_1@172.23.122.249',
'ns_1@172.23.120.127','ns_1@172.23.123.11',
'ns_1@172.23.120.119','ns_1@172.23.121.79'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 41a4e6ef0de9120bda567a6716bde64a
[user:error,2021-05-04T05:17:15.953-07:00,ns_1@172.23.106.155:<0.764.0>:ns_orchestrator:log_rebalance_completion:1405]Rebalance exited with reason bad_replicas.
A bit more info:
[user:info,2021-05-04T05:17:15.951-07:00,ns_1@172.23.106.155:<0.9679.17>:ns_rebalancer:verify_replication:879]Bad replicators after rebalance:
Missing = [{'ns_1@172.23.106.155','ns_1@172.23.107.1',208},
{'ns_1@172.23.106.155','ns_1@172.23.107.1',209},
{'ns_1@172.23.106.155','ns_1@172.23.107.18',91},
{'ns_1@172.23.106.155','ns_1@172.23.107.18',328},
{'ns_1@172.23.106.155','ns_1@172.23.107.32',69},
{'ns_1@172.23.106.155','ns_1@172.23.120.122',109},
{'ns_1@172.23.106.155','ns_1@172.23.120.125',38},
{'ns_1@172.23.106.155','ns_1@172.23.120.128',868},
{'ns_1@172.23.106.155','ns_1@172.23.121.128',74},
{'ns_1@172.23.106.155','ns_1@172.23.121.140',520},
{'ns_1@172.23.106.155','ns_1@172.23.121.143',664},
{'ns_1@172.23.106.155','ns_1@172.23.121.156',278},
{'ns_1@172.23.106.164','ns_1@172.23.106.155',0},
{'ns_1@172.23.106.164','ns_1@172.23.106.155',1},
{'ns_1@172.23.120.123','ns_1@172.23.106.155',124},
{'ns_1@172.23.121.126','ns_1@172.23.106.155',532},
{'ns_1@172.23.121.133','ns_1@172.23.106.155',332},
{'ns_1@172.23.121.142','ns_1@172.23.106.155',119},
{'ns_1@172.23.121.153','ns_1@172.23.106.155',714},
{'ns_1@172.23.121.162','ns_1@172.23.106.155',700},
{'ns_1@172.23.122.224','ns_1@172.23.106.155',916},
{'ns_1@172.23.122.229','ns_1@172.23.106.155',892},
{'ns_1@172.23.122.245','ns_1@172.23.106.155',761},
{'ns_1@172.23.122.245','ns_1@172.23.106.155',810}]
Extras = []
[ns_server:info,2021-05-04T05:17:15.951-07:00,ns_1@172.23.106.155:rebalance_agent<0.815.0>:rebalance_agent:handle_down:290]Rebalancer process <0.9679.17> died (reason bad_replicas).
The common node is 172.23.106.155.
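As a quick illustration of that reasoning (not ns_server code; a minimal Python sketch over the Missing tuples above, with most entries elided):

# Each Missing entry is {Source, Destination, VBucket}; the node that appears
# in every entry is the common suspect.
missing = [
    ("ns_1@172.23.106.155", "ns_1@172.23.107.1", 208),
    ("ns_1@172.23.106.164", "ns_1@172.23.106.155", 0),
    ("ns_1@172.23.122.245", "ns_1@172.23.106.155", 810),
    # ... remaining entries from the log omitted
]
common = set.intersection(*({src, dst} for src, dst, _vb in missing))
print(common)  # {'ns_1@172.23.106.155'}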
On 172.23.106.155 we see memcached exited with status 137, which means it was externally terminated (possibly using kill):
[ns_server:info,2021-05-04T05:17:01.303-07:00,babysitter_of_ns_1@cb.local:<0.132.0>:supervisor_cushion:handle_info:54]Cushion managed supervisor for memcached failed: {abnormal,137}
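For reference, an exit status above 128 encodes the terminating signal as status minus 128, so 137 corresponds to SIGKILL (9). A minimal Python check:

import signal

status = 137
if status > 128:
    # 137 - 128 = 9 -> SIGKILL, i.e. the process was killed from outside
    print(signal.Signals(status - 128).name)  # prints SIGKILL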
Rebalance failure 2
[user:info,2021-05-04T05:21:36.329-07:00,ns_1@172.23.106.155:<0.764.0>:ns_orchestrator:idle:773]Starting rebalance, KeepNodes = ['ns_1@172.23.106.62','ns_1@172.23.121.76',
[user:error,2021-05-04T05:21:52.145-07:00,ns_1@172.23.106.155:<0.764.0>:ns_orchestrator:log_rebalance_completion:1405]Rebalance exited with reason {buckets_cleanup_failed,
Daniel Owen
added a comment - Hi Ritesh Agarwal. For the first rebalance failure the memcached process was externally terminated (with kill). It does not appear to be due to a Linux OOM. Does the test perform a kill of the memcached process?
thanks
Ritesh Agarwal
added a comment - Hi Daniel Owen, you are right: the first rebalance was an intentional abort, and when we resumed the rebalance we saw the second rebalance failure.
Please note that when I restarted the test I did not see this issue, so it is an intermittent problem.
Daniel Owen
added a comment - OK, thanks Ritesh Agarwal. So the rebalance of interest is the second one, which failed due to buckets_cleanup_failed. I will take a look at it. Thanks
Daniel Owen
added a comment - Looking at rebalance failure 2 on 172.23.106.155 we see
[user:error,2021-05-04T05:21:52.145-07:00,ns_1@172.23.106.155:<0.764.0>:ns_orchestrator:log_rebalance_completion:1405]Rebalance exited with reason {buckets_cleanup_failed,
['ns_1@172.23.120.126',
'ns_1@172.23.122.223']}.
On node 172.23.120.126 the last ns_server.debug.log entry is at 2021-05-04T05:21:04.468.
On node 172.23.122.223 the last ns_server.debug.log entry is at 2021-05-04T05:21:05.408.
Therefore the logs don't cover the period when buckets_cleanup_failed is reported.
However, on 172.23.106.155 we do see:
[rebalance:error,2021-05-04T05:21:52.137-07:00,ns_1@172.23.106.155:<0.25724.23>:ns_rebalancer:maybe_cleanup_old_buckets:941]Failed to cleanup old buckets on node 'ns_1@172.23.120.126': {badrpc,
{'EXIT',
timeout}}
[rebalance:error,2021-05-04T05:21:52.137-07:00,ns_1@172.23.106.155:<0.25724.23>:ns_rebalancer:maybe_cleanup_old_buckets:941]Failed to cleanup old buckets on node 'ns_1@172.23.122.223': {badrpc,
{'EXIT',
timeout}}
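A quick sanity check of the timeline (plain Python, nothing Couchbase-specific): the node-side debug logs end roughly 47 seconds before the orchestrator reports the cleanup failure, so they cannot cover the failure window.

from datetime import datetime

failure = datetime.fromisoformat("2021-05-04T05:21:52.145-07:00")
last_debug_entry = {
    "172.23.120.126": "2021-05-04T05:21:04.468-07:00",
    "172.23.122.223": "2021-05-04T05:21:05.408-07:00",
}
for node, ts in last_debug_entry.items():
    gap = (failure - datetime.fromisoformat(ts)).total_seconds()
    print(f"{node}: debug log ends {gap:.1f}s before the reported failure")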
Strongly suspect it is a duplicate of MB-45594.
However, I will let ns_server confirm.
Meni Hillel (Inactive)
added a comment - I agree. This seems quite similar to MB-45594 and a couple of other tickets with the same behavior. Assigning to Aliaksey for confirmation.
Aliaksey Artamonau (Inactive)
added a comment - - edited I've not had a lot of luck with the logs from this ticket, but the information on MB-45594 was somewhat useful.
The first observation is that these issues always occur during log collection. The cluster from MB-45594 fortunately has atop collection enabled with 10-second granularity. Looking at the atop archives, all three nodes that timed out had the disk partition that chronicle writes to at 100% utilization, and all of it appears to be cbcollect_info writing temporary files.
That being said, I don't know whether that alone is enough to explain the timeouts: the faulty nodes lag behind by up to a minute compared to the other nodes, and it's not as if there are lots of mutations happening.
I'll keep looking.
In addition, I'll use this ticket to add more diagnostics in chronicle.
Aliaksey Artamonau (Inactive)
added a comment - Some improvements to make (a rough sketch of both ideas follows the list).
1. I did some basic testing, and I see that there's an order-of-magnitude difference in the time that fsync and fdatasync take on a system with a lot of writes generated by external processes (trying to simulate cbcollect_info here). I've seen fsync latency over 100ms, but fdatasync latency never broke 10ms. And I suspect that on the test clusters the fsync latency may actually be as high as 1 second (based on some circumstantial evidence). My speculation as to why there's such a significant difference between fsync and fdatasync is that the former needs to sync the filesystem journal, but it can probably only sync all of the journal, so the effects of outside writes are more visible. I remember that fdatasync posed some complexities, which is why I decided not to use it. But it may be worth reconsidering.
2. Currently there's no logic in chronicle that would try to coalesce adjacent writes on downstream nodes, and each write results in its own fsync. This makes some sense for nodes that are up to date: writes are batched on the leader, and each node needs to respond as soon as possible for those writes to be acknowledged. But for nodes that lag behind it doesn't make sense to keep fsyncing each individual write.
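Neither item is chronicle code; below is a rough, self-contained sketch (Python, Linux) of both ideas: time worst-case fsync against fdatasync on an append-only file while something else (e.g. cbcollect_info) is saturating the same disk, and batch a lagging follower's backlog so that each fsync covers many writes instead of one.

import os, time

def worst_sync_latency(path, use_fdatasync, writes=200, size=4096):
    # Append small records and track the worst-case latency of the chosen sync call.
    sync = os.fdatasync if use_fdatasync else os.fsync
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    worst = 0.0
    try:
        for _ in range(writes):
            os.write(fd, os.urandom(size))
            start = time.monotonic()
            sync(fd)
            worst = max(worst, time.monotonic() - start)
    finally:
        os.close(fd)
    return worst

print("worst fsync:     %.1f ms" % (1000 * worst_sync_latency("/tmp/sync_fsync.log", False)))
print("worst fdatasync: %.1f ms" % (1000 * worst_sync_latency("/tmp/sync_fdatasync.log", True)))

def apply_backlog(fd, backlog, max_batch=64):
    # Idea 2: for a follower that has fallen behind, group adjacent log entries
    # and issue one fsync per batch rather than one per entry.
    batch = []
    for entry in backlog:
        batch.append(entry)
        if len(batch) >= max_batch:
            os.write(fd, b"".join(batch))
            os.fsync(fd)
            batch.clear()
    if batch:
        os.write(fd, b"".join(batch))
        os.fsync(fd)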
Aliaksey Artamonau (Inactive)
added a comment - - edited Have there been any recent occurrences of this issue? If so, I'd be interested to look at the logs.
Ritesh Agarwal
added a comment - Hello Aliaksey Artamonau, we do not run the 130-node test very often, so it is difficult to say whether there have been any recent occurrences.
MB-47169 may be another instance of this issue; I'm not too sure, but the error seems similar.
Aliaksey Artamonau (Inactive)
added a comment - I've merged changes to chronicle to mitigate this. In addition, cbcollect_info (which is what triggered the issue in this test) should be less disruptive now, so I'm going to close the ticket.
The investigation into the related MB-47169 is still ongoing.
Couchbase Build Team
added a comment - Build couchbase-server-7.0.1-5960 contains ns_server commit 80e2872 with commit message:
MB-46099 [cbcollect_info] Control the amount of outstanding IO.
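The actual change is in ns_server commit 80e2872; purely as an illustration of the general idea of bounding outstanding IO while collecting large artifacts, a copy loop might periodically flush and sync so the page cache never builds up a backlog that competes with latency-sensitive fsyncs (the chunk and sync sizes below are made up for the example):

import os

def copy_with_bounded_io(src_path, dst_path, chunk=1 << 20, sync_every=8 << 20):
    # Copy in 1 MiB chunks; flush and fsync after every 8 MiB of writes so
    # dirty data stays bounded instead of accumulating until writeback kicks in.
    written_since_sync = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            data = src.read(chunk)
            if not data:
                break
            dst.write(data)
            written_since_sync += len(data)
            if written_since_sync >= sync_every:
                dst.flush()
                os.fsync(dst.fileno())
                written_since_sync = 0
        dst.flush()
        os.fsync(dst.fileno())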
Couchbase Build Team
added a comment - Build couchbase-server-7.1.0-1096 contains ns_server commit 80e2872 with commit message:
MB-46099 [cbcollect_info] Control the amount of outstanding IO.