Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: 6.5.0
Affects Version/s: 5.5.1
Component/s: ns_server
Labels: request-dev-verify
Triage: Untriaged
Is this a Regression?: Unknown

Description

Setup steps:

Build out a 3-node cluster using 5.5.1.
Enable auto-failover with a 10-second timeout for up to 1 event.
Add two buckets with 1 replica each. They can be empty.
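For anyone scripting the setup, the auto-failover step could look roughly like the sketch below. It assumes the standard `/settings/autoFailover` REST endpoint on port 8091 with the `enabled`, `timeout`, and `maxCount` form fields. The host name and the function name are placeholders, not something from this report. The request is only built, not sent, and admin credentials would still have to be attached before sending it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def autofailover_request(host, timeout_s=10, max_count=1, port=8091):
    """Build (but do not send) the POST that enables auto-failover.

    Assumption: the cluster exposes /settings/autoFailover and accepts
    the form fields enabled, timeout and maxCount.
    """
    body = urlencode(
        {"enabled": "true", "timeout": timeout_s, "maxCount": max_count}
    )
    return Request(
        f"http://{host}:{port}/settings/autoFailover",
        data=body.encode("ascii"),
        method="POST",
    )


# Hypothetical host name, used only to show the resulting request.
req = autofailover_request("node1.example.com")
print(req.get_method(), req.full_url, req.data.decode())
```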

The goal here is to set up a cluster where a rebalance operation takes longer than the auto-failover timeout. In our case we used physical machines with HDD storage. Rebalance operations took about 40 seconds, which is fast enough for humans to iterate over a bunch of them but slow enough to exceed the 10-second auto-failover timeout.

If you try to reproduce this in a VM environment or with really fast storage, it may be necessary to increase the bucket count or take other measures to ensure the rebalance runs longer than the auto-failover timeout.

With the above setup, here's how to reproduce:

Pick a node and do a graceful failover.
Mark that node for add-back with delta recovery.
Rebalance.
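The three repro steps map onto REST calls roughly as sketched here. This assumes the `/controller/startGracefulFailover`, `/controller/setRecoveryType`, and `/controller/rebalance` endpoints and their usual form fields. The function name and the node names in the usage lines are illustrative only. The sketch just produces the ordered calls; sending them (with credentials) and waiting for the graceful failover to finish before step 2 are left to the caller.

```python
from urllib.parse import urlencode


def delta_recovery_plan(target, known_nodes):
    """Return ordered (path, form-body) pairs for the three repro steps.

    `target` and `known_nodes` are otpNode names such as "ns_1@host".
    """
    if target not in known_nodes:
        raise ValueError("target must be one of known_nodes")
    return [
        # Step 1: graceful failover of the chosen node.
        ("/controller/startGracefulFailover",
         urlencode({"otpNode": target})),
        # Step 2: mark the node for add-back with delta recovery.
        ("/controller/setRecoveryType",
         urlencode({"otpNode": target, "recoveryType": "delta"})),
        # Step 3: rebalance with every node kept in the cluster.
        ("/controller/rebalance",
         urlencode({"knownNodes": ",".join(known_nodes),
                    "ejectedNodes": ""})),
    ]


# Hypothetical node names for a 3-node cluster.
nodes = ["ns_1@node1", "ns_1@node2", "ns_1@node3"]
for path, body in delta_recovery_plan("ns_1@node1", nodes):
    print(path, body)
```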

Expected results:

The node rebalances back into the cluster cleanly, without warnings, side effects, or other unexpected behavior.

Actual results:

The message "Could not automatically fail over nodes (['ns_1@lca1-app0541.stg.linkedin.com']). Rebalance is running" is displayed on the Logs page of the web UI 10 seconds into the delta-recovery rebalance.
If the rebalance completes successfully, no further symptoms are exhibited.

However, a second test produces worse behavior. Follow the repro steps above (graceful failover, then add-back with delta recovery), but this time interrupt the rebalance partway through, once 10 seconds have elapsed. In this case we would expect the node to remain a member of the cluster, with the cluster left in an unbalanced state. Instead, the node is immediately auto-failed over out of the cluster again.

The above seems like very wrong behavior.
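The interruption in the second test could be scripted along these lines. This is a minimal sketch, assuming the `/controller/stopRebalance` endpoint. The `post` argument is a hypothetical callable taking `(path, form_body)` that the caller supplies to send an authenticated request; it is not part of any Couchbase library.

```python
import time


def interrupt_after_timeout(post, timeout_s=10, sleep=time.sleep):
    """Stop a running rebalance shortly after the auto-failover timeout.

    `post` is a caller-supplied function (path, form_body) -> response.
    """
    # Wait just past the auto-failover timeout so the "Could not
    # automatically fail over nodes" condition has already been reached.
    sleep(timeout_s + 1)
    # Ask the cluster to stop the in-flight rebalance (empty form body).
    return post("/controller/stopRebalance", "")
```

Call it immediately after the delta-recovery rebalance has been started, with the same timeout value that auto-failover is configured with.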

I have come up with some ways to fail a node out and bring it back to the cluster without triggering the above condition. Any one will work:

Use full recovery instead of delta
Disable auto-failover and then do delta recovery
Set the auto-failover timeout really high (higher than the time a rebalance takes to complete) and then do delta recovery
Leave auto-failover enabled with a low setting (10 seconds) and start a delta recovery, but abort it after just a couple of seconds (before the auto-failover timeout threshold is crossed). A subsequent rebalance then exhibits no strange behavior.
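In tooling, the second workaround (disable auto-failover, do the delta recovery, re-enable it) might be wrapped as below. This is a sketch under the assumption that `/settings/autoFailover` accepts `enabled=false` and the `enabled`/`timeout`/`maxCount` fields. As before, `post` is a hypothetical caller-supplied `(path, form_body)` sender, and the restore values default to the settings used in this report.

```python
from contextlib import contextmanager


@contextmanager
def autofailover_disabled(post, restore_timeout_s=10, restore_max_count=1):
    """Turn auto-failover off for the duration of the with-block."""
    post("/settings/autoFailover", "enabled=false")
    try:
        yield
    finally:
        # Re-enable auto-failover even if the recovery step raised.
        post(
            "/settings/autoFailover",
            f"enabled=true&timeout={restore_timeout_s}"
            f"&maxCount={restore_max_count}",
        )
```

The delta-recovery rebalance, including waiting for it to complete, would go inside the `with autofailover_disabled(post):` block.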

We are about to begin upgrading a large number of our production Couchbase clusters using graceful failover and delta recovery, and would rather this behavior be fixed than have to work around it in our automation and tooling.

People

Assignee: Balakumaran Gopal

Reporter: bweir

Votes: 1

Watchers: 12

Dates

Created: 20/Sep/18 12:36 PM

Updated: 02/Jan/20 10:26 AM

Resolved: 22/Oct/18 5:43 PM

Gerrit Reviews

There are no open Gerrit changes

There is 1 closed Gerrit change

MB-24242, MB-31366: Set relevant vBuckets to ...: Gerrit Review:

delta recovery and autofailover do not play nicely together
