Details
-
Bug
-
Resolution: Fixed
-
Critical
-
4.0.0
-
Security Level: Public
-
Sherlock RC4 4.0.0-4047 - This symptom *probably* existed before Sherlock RC1; we only just got to the bottom of triaging it.
-
Untriaged
-
Centos 64-bit
-
Unknown
-
KV: Sep 14 - Oct 2
Description
The test first loads 100M documents and does a graceful failover. That step was fine.
The test then adds back the node (.14) and starts a rebalance. The rebalance didn't complete.
(If I then manually trigger the rebalance again, it is fine.)
(Also, if run with 10M documents total, the test passes.)
(The 100M case is very reproducible on Ares.)
A REST call to pools/default/tasks returned this:
{u'status': u'notRunning', u'statusIsStale': False, u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try rebalance again.', u'type': u'rebalance', u'masterRequestTimedOut': False}
Here is a log snippet from the console:
Failed to wait deletion of some buckets on some nodes: [{'ns_1@172.23.96.14',
{'EXIT',
}}]
Here is something possibly relevant in the ns_server.debug.log on the .14 node:
[ns_server:error,2015-08-25T00:47:09.552-07:00,ns_1@172.23.96.14:timeout_diag_logger<0.129.0>:timeout_diag_logger:do_diag:105]Got timeout {slow_bucket_stop,{{single_bucket_kv_sup,"bucket-1"},
<0.369.0>,supervisor,
[single_bucket_kv_sup]}}
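For anyone reproducing this, the check against pools/default/tasks described above can be sketched roughly as follows. This is only an illustration, not the actual test harness code: the helper name `find_failed_rebalance` is hypothetical, and it assumes the endpoint's JSON has already been fetched and parsed into a list of task dicts with the same keys as the response quoted above.

```python
def find_failed_rebalance(tasks):
    """Return the rebalance task dict if it reports a failure, else None.

    `tasks` is assumed to be the parsed JSON from pools/default/tasks,
    i.e. a list of dicts like the one quoted in this report.
    """
    for task in tasks:
        if task.get("type") != "rebalance":
            continue
        # A stopped rebalance that carries an errorMessage is treated as failed.
        if task.get("status") == "notRunning" and task.get("errorMessage"):
            return task
    return None


# Sample data copied from the response in this report (not a live call).
sample_tasks = [
    {
        "status": "notRunning",
        "statusIsStale": False,
        "errorMessage": "Rebalance failed. See logs for detailed reason. "
                        "You can try rebalance again.",
        "type": "rebalance",
        "masterRequestTimedOut": False,
    }
]

failed = find_failed_rebalance(sample_tasks)
if failed is not None:
    print("rebalance failed:", failed["errorMessage"])
```

A harness could use a check like this to decide whether to retry the rebalance, which matches the observation that a manual second rebalance succeeds.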
Issue Links
- duplicates
-
MB-15374 [system test] Hard Fail Over -> add back with Full Recovery: Rebalance exited with reason {buckets_shutdown_wait_failed, {old_buckets_shutdown_wait_failed,
- Closed