Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: 5.0.0
Affects Version/s: 5.0.0
Component/s: ns_server
Labels:
- performance

Triage: Untriaged
Operating System: Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump:

https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hestia-1512/172.23.99.203.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hestia-1512/172.23.99.204.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hestia-1512/172.23.99.205.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hestia-1512/172.23.99.206.zip

Is this a Regression?: Yes

Description

Affects builds 5.0.0-3192 and later.

Setup:

4 data nodes
1 Couchbase bucket, 100M documents
Mixed workload (10K ops/sec)
Auto-failover timeout is 5 seconds

Steps:

Load data
Start workload
Trigger graceful or hard failover after a few minutes (1 node - 172.23.99.206)
Add the node back
Trigger recovery after a few minutes
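
The failover and recovery steps above can be sketched as an ordered list of REST calls against the cluster manager. This is only an illustrative plan, not the test harness used for this report: the function name `failover_recovery_plan` is hypothetical, the endpoint paths and form fields are assumptions based on the public Couchbase Server REST API and should be checked against the build under test, and the data load, the workload and the "few minutes" waits between steps are not modelled. Nothing is sent over the network.

```python
def otp_node(host):
    """Build the OTP node name used in the logs, e.g. ns_1@172.23.99.206."""
    return "ns_1@" + host


def failover_recovery_plan(nodes, target, graceful=True, timeout=5):
    """Return the (path, form fields) pairs for the failover/recovery steps.

    Endpoint paths are assumptions taken from the Couchbase REST API;
    the caller is expected to POST them to port 8091 with admin credentials.
    """
    failover_path = (
        "/controller/startGracefulFailover" if graceful else "/controller/failOver"
    )
    return [
        # Setup: auto-failover enabled with a 5 second timeout.
        ("/settings/autoFailover", {"enabled": "true", "timeout": timeout}),
        # Step: graceful or hard failover of one node.
        (failover_path, {"otpNode": otp_node(target)}),
        # Step: add the node back, marked for delta recovery.
        ("/controller/setRecoveryType",
         {"otpNode": otp_node(target), "recoveryType": "delta"}),
        # Step: trigger recovery by rebalancing with all nodes kept.
        ("/controller/rebalance",
         {"knownNodes": ",".join(otp_node(n) for n in nodes), "ejectedNodes": ""}),
    ]


if __name__ == "__main__":
    cluster = ["172.23.99.203", "172.23.99.204", "172.23.99.205", "172.23.99.206"]
    for path, fields in failover_recovery_plan(cluster, "172.23.99.206"):
        print("POST", path, fields)
```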

This scenario worked until the following patch:

https://github.com/couchbase/ns_server/commit/683a6944de4fb86170dfa06973ab23d3fa493dad

This is what we observe in our tests now:

[user:info,2017-06-24T15:31:02.497-07:00,ns_1@172.23.99.203:<0.32134.2>:ns_rebalancer:run_graceful_failover:1323]Starting vbucket moves for graceful failover of 'ns_1@172.23.99.206'

[user:info,2017-06-24T15:31:10.696-07:00,ns_1@172.23.99.203:<0.32134.2>:ns_rebalancer:orchestrate_failover:79]Starting failing over 'ns_1@172.23.99.206'

[user:info,2017-06-24T15:31:10.744-07:00,ns_1@172.23.99.203:<0.32134.2>:ns_rebalancer:orchestrate_failover:82]Failed over 'ns_1@172.23.99.206': ok

[user:info,2017-06-24T15:51:20.208-07:00,ns_1@172.23.99.203:<0.1274.0>:ns_orchestrator:idle:666]Starting rebalance, KeepNodes = ['ns_1@172.23.99.203','ns_1@172.23.99.204','ns_1@172.23.99.205','ns_1@172.23.99.206'], EjectNodes = [], Failed over and being ejected nodes = [], Delta recovery nodes = ['ns_1@172.23.99.206'],  Delta recovery buckets = all

[user:info,2017-06-24T15:51:25.083-07:00,ns_1@172.23.99.203:<0.1274.0>:ns_orchestrator:rebalancing:858]Stopping rebalance as we received a {try_autofailover,'ns_1@172.23.99.206'} request

[user:info,2017-06-24T15:52:20.228-07:00,ns_1@172.23.99.203:<0.7517.5>:ns_rebalancer:orchestrate_failover:79]Starting failing over 'ns_1@172.23.99.206'

[user:info,2017-06-24T15:52:20.299-07:00,ns_1@172.23.99.203:<0.7517.5>:ns_rebalancer:orchestrate_failover:82]Failed over 'ns_1@172.23.99.206': ok

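In short, the rebalance that performs delta recovery is stopped by a try_autofailover request for the very node being recovered, and that node is then auto-failed over again. Auto-failing over a node that is in the middle of recovery does not make sense.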
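
The timing in the log excerpt above can be checked with a short script. This is a minimal sketch and not part of the original report: the helper names `log_time` and `seconds_between` are hypothetical, the prefix pattern is inferred from the quoted lines rather than from any ns_server specification, and the "Starting rebalance" message is cut at its first physical line.

```python
import re
from datetime import datetime

# Prefix layout as seen in the excerpt: [category:level,timestamp,node:...]message
PREFIX = re.compile(r"^\[[^,]+,(?P<ts>[^,]+),")

REBALANCE_START = (
    "[user:info,2017-06-24T15:51:20.208-07:00,ns_1@172.23.99.203:<0.1274.0>:"
    "ns_orchestrator:idle:666]Starting rebalance, KeepNodes = "
    "['ns_1@172.23.99.203','ns_1@172.23.99.204',"
)
REBALANCE_STOP = (
    "[user:info,2017-06-24T15:51:25.083-07:00,ns_1@172.23.99.203:<0.1274.0>:"
    "ns_orchestrator:rebalancing:858]Stopping rebalance as we received a "
    "{try_autofailover,'ns_1@172.23.99.206'} request"
)
FAILOVER_START = (
    "[user:info,2017-06-24T15:52:20.228-07:00,ns_1@172.23.99.203:<0.7517.5>:"
    "ns_rebalancer:orchestrate_failover:79]Starting failing over "
    "'ns_1@172.23.99.206'"
)


def log_time(line):
    """Extract the ISO 8601 timestamp from a log line prefix."""
    match = PREFIX.match(line)
    if match is None:
        raise ValueError("unrecognised log line: " + line[:40])
    return datetime.fromisoformat(match.group("ts"))


def seconds_between(earlier, later):
    """Return the elapsed seconds between two log lines."""
    return (log_time(later) - log_time(earlier)).total_seconds()


if __name__ == "__main__":
    print("rebalance start -> stop:", seconds_between(REBALANCE_START, REBALANCE_STOP))
    print("rebalance stop -> failover:", seconds_between(REBALANCE_STOP, FAILOVER_START))
```

On these lines the rebalance is stopped a little under 5 seconds after it starts, which is consistent with the 5-second auto-failover timeout listed in the setup.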

Issue Links

relates to

MB-24242 Rebalance does not fail immediately when there are node failures

Closed

People

Assignee: Pavel Paulau (Inactive)

Reporter: Pavel Paulau (Inactive)

Votes: 0

Watchers: 2

Dates

Created: 29/Jun/17 4:52 PM

Updated: 24/Jul/17 8:45 AM

Resolved: 19/Jul/17 3:27 PM

Gerrit Reviews

There are no open Gerrit changes

There is 1 closed Gerrit change:

MB-25088: Auto-failover to not interrupt rebalance...: Gerrit Review:

Delta recovery is interrupted: stopping rebalance as we received a "try_autofailover" request
