Details
- Type: Bug
- Resolution: Incomplete
- Priority: Critical
- Affects Version: 6.5.0
- Triage: Untriaged
- Is this a Regression?: Unknown
Description
It's significantly easier in Mad Hatter to generate bucket config conflicts than it is in 6.0, at least with the following test:
- Start a 4 node cluster; set the auto-failover time to 5 s for convenience
- Remove node 4 (or at least a node which is not the orchestrator)
- Start rebalance
- More or less right away send a SIGSTOP to the orchestrator node
- Wait until another node has picked up orchestrator-ship and completed failover of the old orchestrator
- Send SIGCONT to the old orchestrator
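The suspend/resume in the last steps can be driven with plain POSIX signals. A minimal sketch of that part, using a placeholder `sleep` process in place of the orchestrator node's ns_server process (locating the real PID on the node is left out), with the waits shortened:

```python
import os
import signal
import subprocess
import time

# Placeholder standing in for the orchestrator's ns_server process; in the
# real test you would find the PID of that process on the orchestrator node.
proc = subprocess.Popen(["sleep", "300"])

# Right after rebalance starts, suspend the orchestrator.
os.kill(proc.pid, signal.SIGSTOP)

# Wait for auto-failover (5 s in the test) plus failover to complete;
# shortened here, the actual test waited on the order of a minute.
time.sleep(1)

# Resume the old orchestrator; any pending config writes now go out.
os.kill(proc.pid, signal.SIGCONT)

proc.terminate()
proc.wait()
```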
In Mad Hatter, what mostly happens is that the old orchestrator has pending changes to the vbucket map that get sent once it is resumed after the failover. This causes a conflict in the vbucket map, effectively rolling back the failover: in particular, the map ends up with entries referring to nodes that have already been failed over. E.g.
[user:info,2019-09-16T13:15:08.564-07:00,n_0@127.0.0.1:ns_config_rep<0.12972.1>:ns_config:merge_values_using_timestamps:1358]Conflicting configuration changes to field buckets:
[{'_vclock',[{<<"5f297d430cc59ebb80412011aefb1cc1">>,{86,63735883915}},
             ...
             {<<"e4fdefb5da810b83581a58ba409c08ee">>,{94,63735883506}}]},
 {configs,[{"default",
            [{deltaRecoveryMap,undefined},
             {uuid,<<"dc7266278f1c9bc404bf66ab9eb649ff">>}
             ...
             {map,[['n_0@127.0.0.1','n_1@127.0.0.1'],
                   ['n_0@127.0.0.1','n_1@127.0.0.1'],
                   ['n_0@127.0.0.1','n_1@127.0.0.1'],
                   ['n_0@127.0.0.1','n_1@127.0.0.1'],
                   ...
and
[{'_vclock',[{<<"5f297d430cc59ebb80412011aefb1cc1">>,{86,63735883915}},
             ...
             {<<"e4fdefb5da810b83581a58ba409c08ee">>,{94,63735883506}}]},
 {configs,[{"default",
            [{deltaRecoveryMap,undefined},
             {uuid,<<"dc7266278f1c9bc404bf66ab9eb649ff">>},
             ...
             {map,[['n_1@127.0.0.1',undefined],
                   ['n_1@127.0.0.1',undefined],
                   ['n_1@127.0.0.1',undefined],
                   ['n_1@127.0.0.1',undefined],
             {fastForwardMap,undefined}]}]}], choosing the former, which looks newer.
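The "choosing the former, which looks newer" decision is what lets the stale map win. A simplified model of that merge rule (an illustration, not the actual ns_config code, and the vclock values are loosely modelled on the log above): when neither vclock dominates the other, fall back to comparing timestamps, so the resumed orchestrator's pre-failover map with the larger timestamp beats the post-failover one:

```python
def vclock_dominates(a, b):
    """True if vclock a has seen at least every update recorded in b.

    Vclocks are modelled as {node_id: (counter, timestamp)} dicts."""
    return all(a.get(node, (0, 0))[0] >= counter
               for node, (counter, _ts) in b.items())

def merge_conflicting(a, b):
    """Rough model of merge-by-timestamp: if one vclock dominates it wins
    outright; otherwise the value whose vclock carries the larger
    timestamp is kept, regardless of which one is semantically newer."""
    if vclock_dominates(a["vclock"], b["vclock"]):
        return a
    if vclock_dominates(b["vclock"], a["vclock"]):
        return b
    newest_ts = lambda v: max(t for _c, t in v["vclock"].values())
    return a if newest_ts(a) >= newest_ts(b) else b

# Hypothetical values echoing the log: the resumed orchestrator's stale map
# carries timestamp 63735883915, the post-failover map only 63735883506,
# so the stale (pre-failover) map is chosen.
stale = {"vclock": {"5f29...": (86, 63735883915)}, "map": "pre-failover"}
fresh = {"vclock": {"e4fd...": (94, 63735883506)}, "map": "post-failover"}
print(merge_conflicting(stale, fresh)["map"])  # prints "pre-failover"
```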
Here's the full sequence of events:
# start rebalancing bucket default
[user:info,2019-09-16T13:13:56.888-07:00,n_0@127.0.0.1:<0.24818.1>:ns_rebalancer:rebalance_bucket:606]Started rebalancing bucket default
...
# here's where ns_server gets suspended and then, about a minute later, resumed
[ns_server:info,2019-09-16T13:14:00.748-07:00,n_0@127.0.0.1:<0.13790.1>:ns_memcached:do_handle_call:550]Changed bucket "default" vbucket 56 state to replica
[ns_server:debug,2019-09-16T13:15:08.533-07:00,n_0@127.0.0.1:janitor_agent-default<0.13775.1>:janitor_agent:handle_info:859]Got done message from subprocess: <0.26165.1> (ok)
...
# bucket config is in conflict
[user:info,2019-09-16T13:15:08.564-07:00,n_0@127.0.0.1:ns_config_rep<0.12972.1>:ns_config:merge_values_using_timestamps:1358]Conflicting configuration changes to field buckets:
...
# trace showing that we just sent a map sync from the old orchestrator - which caused the conflict
[ns_server:debug,2019-09-16T13:15:08.619-07:00,n_0@127.0.0.1:<0.24985.1>:ns_vbucket_mover:map_sync:320]Batched 1 vbucket map syncs together.
There is no easy way to recover from this situation; effectively the vbucket map is corrupt.
We have a delay of 5 ms on the vbucket map sync to allow batching the map synchronizations, as they are somewhat expensive. I tried reducing the delay to 0 and was still able to reproduce the issue, though it seemed harder to hit.
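The batching that the delay enables can be sketched as follows. This is an illustrative model of the pattern, not the ns_vbucket_mover implementation, and the queue/thread names are invented: requests that arrive within the window after the first one are flushed in a single config write, which also means completed moves can sit briefly with their map update not yet written.

```python
import queue
import threading
import time

SYNC_DELAY = 0.005  # the 5 ms batching delay described above

pending = queue.Queue()  # map-sync requests from completed vbucket moves
batches = []             # each entry models one batched config write

def map_sync_loop(stop):
    """Drain map-sync requests, waiting SYNC_DELAY after the first one so
    that several vbucket moves can be flushed in a single config write."""
    while not stop.is_set():
        try:
            first = pending.get(timeout=0.05)
        except queue.Empty:
            continue
        time.sleep(SYNC_DELAY)          # batching window
        batch = [first]
        while True:
            try:
                batch.append(pending.get_nowait())
            except queue.Empty:
                break
        batches.append(batch)           # one config write per batch

stop = threading.Event()
syncer = threading.Thread(target=map_sync_loop, args=(stop,))
syncer.start()

# Three vbucket moves complete in quick succession...
for vbucket in (56, 57, 58):
    pending.put(vbucket)

time.sleep(0.2)
stop.set()
syncer.join()
print(batches)  # e.g. [[56, 57, 58]]: a single batched sync for three moves
```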
Then I tried reproducing the same problem in 6.0 and could not reproduce it even once. Other conflicts were produced (on the counters and cbas_memory_quota keys), but these are not very consequential and the situation was entirely recoverable. Interestingly, the old orchestrator frequently resumed its position as orchestrator.
Mad Hatter logs (with standard delay of 5 ms):
https://cb-engineering.s3.amazonaws.com/davef/bucket_conflicts/collectinfo-2019-09-16T201626-n_0%40127.0.0.1.zip
https://cb-engineering.s3.amazonaws.com/davef/bucket_conflicts/collectinfo-2019-09-16T201626-n_1%40127.0.0.1.zip
https://cb-engineering.s3.amazonaws.com/davef/bucket_conflicts/collectinfo-2019-09-16T201626-n_2%40127.0.0.1.zip
https://cb-engineering.s3.amazonaws.com/davef/bucket_conflicts/collectinfo-2019-09-16T201626-n_3%40127.0.0.1.zip
Mad Hatter logs (with delay of 0 ms):
https://cb-engineering.s3.amazonaws.com/davef/bucket_conflicts_0_delay/collectinfo-2019-09-16T204624-n_0%40127.0.0.1.zip
https://cb-engineering.s3.amazonaws.com/davef/bucket_conflicts_0_delay/collectinfo-2019-09-16T204624-n_1%40127.0.0.1.zip
https://cb-engineering.s3.amazonaws.com/davef/bucket_conflicts_0_delay/collectinfo-2019-09-16T204624-n_2%40127.0.0.1.zip
https://cb-engineering.s3.amazonaws.com/davef/bucket_conflicts_0_delay/collectinfo-2019-09-16T204624-n_3%40127.0.0.1.zip
6.0 logs:
https://cb-engineering.s3.amazonaws.com/davef/no_bucket_conflicts_6.0/collectinfo-2019-09-17T034835-n_0%40127.0.0.1.zip
https://cb-engineering.s3.amazonaws.com/davef/no_bucket_conflicts_6.0/collectinfo-2019-09-17T034835-n_1%40127.0.0.1.zip
https://cb-engineering.s3.amazonaws.com/davef/no_bucket_conflicts_6.0/collectinfo-2019-09-17T034835-n_2%40127.0.0.1.zip
https://cb-engineering.s3.amazonaws.com/davef/no_bucket_conflicts_6.0/collectinfo-2019-09-17T034835-n_3%40127.0.0.1.zip