MB-36017: Easier to cause conflict in bucket config in Mad Hatter when compared with 6.0



    Description

      It's significantly easier in Mad Hatter to generate bucket config conflicts than it is in 6.0, at least for the test I tried. This is the test I ran (a rough scripted sketch of the steps follows the list):

      1. Start a 4 node cluster; set the auto-failover time to 5 s for convenience
      2. Remove node 4 (or at least a node which is not the orchestrator)
      3. Start rebalance
      4. More or less right away send a SIGSTOP to the orchestrator node
      5. Wait until another node has picked up orchestrator-ship and completed failover of the old orchestrator
      6. Send SIGCONT to the old orchestrator
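
      For convenience, here is the sequence above as a rough scripted sketch. It assumes a local cluster_run style cluster (n_0 through n_3 on ports 9000-9003) and Administrator/asdasd credentials, and the orchestrator's beam PID has to be looked up by hand; /settings/autoFailover and /controller/rebalance are the standard REST endpoints, everything else is purely illustrative.

      import os, signal, time
      import requests

      AUTH = ("Administrator", "asdasd")            # assumed cluster_run credentials
      ORCH = "http://127.0.0.1:9000"                # n_0, the initial orchestrator
      NODES = ["n_%d@127.0.0.1" % i for i in range(4)]

      # 1. Set the auto-failover timeout to 5 s for convenience.
      requests.post(ORCH + "/settings/autoFailover", auth=AUTH,
                    data={"enabled": "true", "timeout": 5}).raise_for_status()

      # 2 & 3. Mark n_3 (a non-orchestrator node) for removal and start the rebalance.
      requests.post(ORCH + "/controller/rebalance", auth=AUTH,
                    data={"knownNodes": ",".join(NODES),
                          "ejectedNodes": NODES[3]}).raise_for_status()

      # 4. More or less right away, suspend the orchestrator's ns_server.
      orch_pid = int(os.environ["N0_BEAM_PID"])     # hypothetical: look the PID up by hand
      os.kill(orch_pid, signal.SIGSTOP)

      # 5. Give another node time to take over and auto-fail the old orchestrator over.
      time.sleep(60)

      # 6. Resume the old orchestrator and watch the logs for the config conflict.
      os.kill(orch_pid, signal.SIGCONT)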

      In Mad Hatter, what mostly happens is that the old orchestrator has pending changes to the vbucket map which get sent after it is resumed post-failover. This causes a conflict in the vbucket map, effectively rolling back the failover. In particular, the vbucket map ends up with entries referring to nodes that have since been failed over (a conceptual sketch of the merge behavior follows the excerpt). E.g.

      [user:info,2019-09-16T13:15:08.564-07:00,n_0@127.0.0.1:ns_config_rep<0.12972.1>:ns_config:merge_values_using_timestamps:1358]Conflicting configuration changes to field buckets:
      [{'_vclock',[{<<"5f297d430cc59ebb80412011aefb1cc1">>,{86,63735883915}},
      ...
                   {<<"e4fdefb5da810b83581a58ba409c08ee">>,{94,63735883506}}]},
       {configs,[{"default",
                  [{deltaRecoveryMap,undefined},
                   {uuid,<<"dc7266278f1c9bc404bf66ab9eb649ff">>}
      ...
                  {map,[['n_0@127.0.0.1','n_1@127.0.0.1'],
                         ['n_0@127.0.0.1','n_1@127.0.0.1'],
                         ['n_0@127.0.0.1','n_1@127.0.0.1'],
                         ['n_0@127.0.0.1','n_1@127.0.0.1'],
      ...
      and
      [{'_vclock',[{<<"5f297d430cc59ebb80412011aefb1cc1">>,{86,63735883915}},
      ...
                   {<<"e4fdefb5da810b83581a58ba409c08ee">>,{94,63735883506}}]},
       {configs,[{"default",
                  [{deltaRecoveryMap,undefined},
                   {uuid,<<"dc7266278f1c9bc404bf66ab9eb649ff">>},
      ...
                   {map,[['n_1@127.0.0.1',undefined],
                         ['n_1@127.0.0.1',undefined],
                         ['n_1@127.0.0.1',undefined],
                         ['n_1@127.0.0.1',undefined],
                   {fastForwardMap,undefined}]}]}], choosing the former, which looks newer.
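
      Conceptually the resolution behaves like the sketch below; this is a rough model for illustration, not the actual ns_config:merge_values_using_timestamps code. When vector clocks conflict, the merge keeps whichever value carries the newest wall-clock timestamp, so the suspended orchestrator's post-resume map sync, which is stamped after the failover, "looks newer" and wins.

      def latest_timestamp(vclock):
          # vclock entries look like (node_uuid, (counter, gregorian_seconds)),
          # e.g. ("5f297d43...", (86, 63735883915)) in the excerpt above
          return max(ts for _uuid, (_counter, ts) in vclock)

      def merge_conflicting(local, remote):
          """On a vclock conflict, keep whichever value looks newer by wall clock."""
          if latest_timestamp(local["_vclock"]) >= latest_timestamp(remote["_vclock"]):
              return local
          return remote

      # The resumed orchestrator's map sync happens last in wall-clock terms, so a
      # timestamp-based tiebreak cannot tell the stale map from a genuinely newer one.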
      

      Here's the full sequence of events:

      # start rebalancing bucket default
      [user:info,2019-09-16T13:13:56.888-07:00,n_0@127.0.0.1:<0.24818.1>:ns_rebalancer:rebalance_bucket:606]Started rebalancing bucket default
      ...
      # here's where ns_server gets suspended and then, about a minute later, resumed
      [ns_server:info,2019-09-16T13:14:00.748-07:00,n_0@127.0.0.1:<0.13790.1>:ns_memcached:do_handle_call:550]Changed bucket "default" vbucket 56 state to replica
      [ns_server:debug,2019-09-16T13:15:08.533-07:00,n_0@127.0.0.1:janitor_agent-default<0.13775.1>:janitor_agent:handle_info:859]Got done message from subprocess: <0.26165.1> (ok)
      ...
      # bucket config is in conflict
      [user:info,2019-09-16T13:15:08.564-07:00,n_0@127.0.0.1:ns_config_rep<0.12972.1>:ns_config:merge_values_using_timestamps:1358]Conflicting configuration changes to field buckets:
      ...
      # trace showing that we just sent a map sync from the old orchestrator - which caused the conflict
      [ns_server:debug,2019-09-16T13:15:08.619-07:00,n_0@127.0.0.1:<0.24985.1>:ns_vbucket_mover:map_sync:320]Batched 1 vbucket map syncs together.
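
      In case it helps anyone triaging the attached cbcollects, here is a tiny convenience sketch (a hypothetical helper, not part of any existing tooling) that pulls the relevant lines out of an ns_server debug log; the marker strings are taken from the excerpts above.

      import sys

      MARKERS = ("Conflicting configuration changes to field buckets",
                 "vbucket map syncs together")

      # Usage: python find_conflict.py path/to/ns_server.debug.log
      with open(sys.argv[1], errors="replace") as log:
          for line in log:
              if any(marker in line for marker in MARKERS):
                  print(line.rstrip())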
      

      There is no easy way to recover from this situation; effectively the vbucket map is corrupt.
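
      To make the "effectively corrupt" claim concrete, here is a small illustrative check; stale_chain_entries is a hypothetical helper, not an ns_server API. It flags vbucket chains that name nodes which are no longer active, which is exactly the state the winning map above is left in: n_0 is failed over yet still listed as the active copy.

      def stale_chain_entries(vbucket_map, active_nodes):
          """Return (vbucket_id, chain) pairs whose chain names a node that is
          no longer active in the cluster."""
          active = set(active_nodes)
          return [(vb, chain)
                  for vb, chain in enumerate(vbucket_map)
                  if any(node is not None and node not in active for node in chain)]

      # Mirroring the excerpt: after the failover only n_1..n_3 are active, yet the
      # winning map still lists n_0 as the active copy of these vbuckets.
      winning_map = [["n_0@127.0.0.1", "n_1@127.0.0.1"]] * 4
      print(stale_chain_entries(winning_map,
                                ["n_1@127.0.0.1", "n_2@127.0.0.1", "n_3@127.0.0.1"]))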

      We have a delay of 5 ms on the vbucket map sync to allow batching of the map synchronizations, as they are somewhat expensive. I tried reducing the delay to 0 and was still able to reproduce this issue, though it seemed harder to hit.
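
      For reference, the batching referred to here works roughly like the sketch below; this is assumed behavior for illustration, not the actual ns_vbucket_mover code. Map updates are buffered for a short window (5 ms by default here) so that several vbucket moves land in a single config sync. Note that even with a 0 ms window the flush is still asynchronous, which is consistent with the issue remaining reproducible at a 0 ms delay.

      import threading

      class MapSyncBatcher:
          def __init__(self, sync_fn, delay_s=0.005):        # 5 ms batching window
              self.sync_fn = sync_fn
              self.delay_s = delay_s
              self.pending = []
              self.timer = None
              self.lock = threading.Lock()

          def update(self, move):
              with self.lock:
                  self.pending.append(move)
                  if self.timer is None:                     # first update opens the window
                      self.timer = threading.Timer(self.delay_s, self._flush)
                      self.timer.start()

          def _flush(self):
              with self.lock:
                  batch, self.pending, self.timer = self.pending, [], None
              self.sync_fn(batch)                            # one config write for the whole batch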

      Then I tried reproducing the same problem in 6.0 and could not reproduce it even once. Other conflicts were produced (on the counters and cbas_memory_quota keys), but these are not very consequential and the situation was entirely recoverable. Interestingly, the old orchestrator frequently resumed its position as orchestrator.

      Mad Hatter logs (with standard delay of 5 ms):
      https://cb-engineering.s3.amazonaws.com/davef/bucket_conflicts/collectinfo-2019-09-16T201626-n_0%40127.0.0.1.zip
      https://cb-engineering.s3.amazonaws.com/davef/bucket_conflicts/collectinfo-2019-09-16T201626-n_1%40127.0.0.1.zip
      https://cb-engineering.s3.amazonaws.com/davef/bucket_conflicts/collectinfo-2019-09-16T201626-n_2%40127.0.0.1.zip
      https://cb-engineering.s3.amazonaws.com/davef/bucket_conflicts/collectinfo-2019-09-16T201626-n_3%40127.0.0.1.zip

      Mad Hatter logs (with delay of 0 ms):
      https://cb-engineering.s3.amazonaws.com/davef/bucket_conflicts_0_delay/collectinfo-2019-09-16T204624-n_0%40127.0.0.1.zip
      https://cb-engineering.s3.amazonaws.com/davef/bucket_conflicts_0_delay/collectinfo-2019-09-16T204624-n_1%40127.0.0.1.zip
      https://cb-engineering.s3.amazonaws.com/davef/bucket_conflicts_0_delay/collectinfo-2019-09-16T204624-n_2%40127.0.0.1.zip
      https://cb-engineering.s3.amazonaws.com/davef/bucket_conflicts_0_delay/collectinfo-2019-09-16T204624-n_3%40127.0.0.1.zip

      6.0 logs:
      https://cb-engineering.s3.amazonaws.com/davef/no_bucket_conflicts_6.0/collectinfo-2019-09-17T034835-n_0%40127.0.0.1.zip
      https://cb-engineering.s3.amazonaws.com/davef/no_bucket_conflicts_6.0/collectinfo-2019-09-17T034835-n_1%40127.0.0.1.zip
      https://cb-engineering.s3.amazonaws.com/davef/no_bucket_conflicts_6.0/collectinfo-2019-09-17T034835-n_2%40127.0.0.1.zip
      https://cb-engineering.s3.amazonaws.com/davef/no_bucket_conflicts_6.0/collectinfo-2019-09-17T034835-n_3%40127.0.0.1.zip


          People

            Assignee: dfinlay (Dave Finlay)
            Reporter: dfinlay (Dave Finlay)
            Votes: 0
            Watchers: 6
