Details
- Bug
- Resolution: Fixed
- Critical
- 5.0.0
- Untriaged
- CentOS 64-bit
- Yes
Description
Affects builds 5.0.0-3192 and later.
Setup:
- 4 data nodes
- 1 Couchbase bucket, 100M documents
- Mixed workload (10K ops/sec)
- Auto-failover timeout is 5 seconds
Steps:
- Load data
- Start workload
- Trigger graceful or hard failover after a few minutes (1 node - 172.23.99.206)
- Add the node back
- Trigger recovery after a few minutes
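The cluster-side steps above can be sketched as a sequence of REST calls. This is only an illustration: the ticket does not say how the test harness drives the cluster, so the use of the Couchbase Server REST endpoints below, the helper name `reproduction_plan`, and the choice to list requests rather than send them are all assumptions. Loading data and running the workload are left out.

```python
from urllib.parse import urlencode

# Node and cluster membership as they appear in the ticket.
NODE = "ns_1@172.23.99.206"
CLUSTER_NODES = [f"ns_1@172.23.99.{i}" for i in (203, 204, 205, 206)]


def reproduction_plan(graceful=True):
    """Return the (path, form fields) POST requests for the steps, in order.

    Hypothetical helper: it only builds the requests, it does not send them.
    """
    failover = (
        "/controller/startGracefulFailover" if graceful else "/controller/failOver"
    )
    return [
        # Setup: auto-failover timeout of 5 seconds.
        ("/settings/autoFailover", {"enabled": "true", "timeout": "5"}),
        # Fail over one node (graceful or hard).
        (failover, {"otpNode": NODE}),
        # Add the node back, marked for delta recovery.
        ("/controller/setRecoveryType", {"otpNode": NODE, "recoveryType": "delta"}),
        # Recovery itself happens as part of a rebalance with no ejected nodes.
        (
            "/controller/rebalance",
            {"knownNodes": ",".join(CLUSTER_NODES), "ejectedNodes": ""},
        ),
    ]


for path, form in reproduction_plan():
    print("POST", path, urlencode(form))
```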
That worked just fine until this patch:
https://github.com/couchbase/ns_server/commit/683a6944de4fb86170dfa06973ab23d3fa493dad
This is what we observe in our tests now:
[user:info,2017-06-24T15:31:02.497-07:00,ns_1@172.23.99.203:<0.32134.2>:ns_rebalancer:run_graceful_failover:1323]Starting vbucket moves for graceful failover of 'ns_1@172.23.99.206'
[user:info,2017-06-24T15:31:10.696-07:00,ns_1@172.23.99.203:<0.32134.2>:ns_rebalancer:orchestrate_failover:79]Starting failing over 'ns_1@172.23.99.206'
[user:info,2017-06-24T15:31:10.744-07:00,ns_1@172.23.99.203:<0.32134.2>:ns_rebalancer:orchestrate_failover:82]Failed over 'ns_1@172.23.99.206': ok
[user:info,2017-06-24T15:51:20.208-07:00,ns_1@172.23.99.203:<0.1274.0>:ns_orchestrator:idle:666]Starting rebalance, KeepNodes = ['ns_1@172.23.99.203','ns_1@172.23.99.204','ns_1@172.23.99.205','ns_1@172.23.99.206'], EjectNodes = [], Failed over and being ejected nodes = [], Delta recovery nodes = ['ns_1@172.23.99.206'], Delta recovery buckets = all
[user:info,2017-06-24T15:51:25.083-07:00,ns_1@172.23.99.203:<0.1274.0>:ns_orchestrator:rebalancing:858]Stopping rebalance as we received a {try_autofailover,'ns_1@172.23.99.206'} request
[user:info,2017-06-24T15:52:20.228-07:00,ns_1@172.23.99.203:<0.7517.5>:ns_rebalancer:orchestrate_failover:79]Starting failing over 'ns_1@172.23.99.206'
[user:info,2017-06-24T15:52:20.299-07:00,ns_1@172.23.99.203:<0.7517.5>:ns_rebalancer:orchestrate_failover:82]Failed over 'ns_1@172.23.99.206': ok
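In short, the node that we are trying to recover is being auto-failed over: the delta recovery rebalance starts at 15:51:20 and is stopped by a try_autofailover request for that same node about 5 seconds later, at 15:51:25, which matches the configured auto-failover timeout. That doesn't make sense for a node that is in the middle of recovery.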
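The gap between the rebalance start and the auto-failover request can be checked directly from the log timestamps. The sketch below is only an illustration: the two entries are copied from the excerpt above with the first message shortened, and the regular expression and the `parse` helper are assumptions about the header layout, not part of ns_server.

```python
import re
from datetime import datetime

# Two entries from the excerpt; the "Starting rebalance" message is shortened.
LOG = """\
[user:info,2017-06-24T15:51:20.208-07:00,ns_1@172.23.99.203:<0.1274.0>:ns_orchestrator:idle:666]Starting rebalance
[user:info,2017-06-24T15:51:25.083-07:00,ns_1@172.23.99.203:<0.1274.0>:ns_orchestrator:rebalancing:858]Stopping rebalance as we received a {try_autofailover,'ns_1@172.23.99.206'} request
"""

# Assumed header layout: [category:level,timestamp,node:<pid>:module:function:line]message
ENTRY = re.compile(r"^\[[^,]+,(?P<ts>[^,]+),[^\]]*\](?P<msg>.*)$")


def parse(line):
    """Split one log line into (timestamp, message)."""
    match = ENTRY.match(line)
    if match is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    return datetime.fromisoformat(match["ts"]), match["msg"]


entries = [parse(line) for line in LOG.splitlines()]
started = next(ts for ts, msg in entries if msg.startswith("Starting rebalance"))
stopped = next(ts for ts, msg in entries if "try_autofailover" in msg)

# Seconds between the rebalance start and the auto-failover request.
elapsed = (stopped - started).total_seconds()
print(f"rebalance interrupted after {elapsed:.3f} s")
```

The result is just under the 5-second auto-failover timeout from the setup.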
Issue Links
- relates to MB-24242 Rebalance does not fail immediately when there are node failures (Closed)