  Couchbase Server / MB-59364

Config updates from a past orchestrator task take place during a current one



    Description

      A cluster test does this:
      1) Start a 3-node cluster with nodes A, B and C, and add a bucket
      2) Fail over node A

      We see that during startup the janitor initially runs and tries to generate the initial vbucket map for the bucket:

      [ns_server:info,2023-10-19T19:27:31.179Z,n_0@127.0.0.1:<0.2453.0>:ns_janitor:cleanup_with_membase_bucket_check_map:198]janitor decided to generate initial vbucket map
      

       

      We then see the janitor run being interrupted and canceled by failover (as expected from the test steps, since the test executes failover right away after startup):

      [ns_server:debug,2023-10-19T19:27:31.279Z,n_0@127.0.0.1:cleanup_process<0.2438.0>:misc:executing_on_new_process_body:1431]Aborting <0.2450.0> (body is #Fun<leader_activities.0.26169839>) because we are interrupted by an exit message {'EXIT',
                                                                                                                      <0.1191.0>,
                                                                                                                      shutdown}
      [ns_server:debug,2023-10-19T19:27:31.279Z,n_0@127.0.0.1:cleanup_process<0.2438.0>:misc:with_trap_exit_maybe_exit:2893]Terminating due to exit message {'EXIT',<0.1191.0>,shutdown}
      [user:info,2023-10-19T19:27:31.286Z,n_0@127.0.0.1:<0.1194.0>:ns_orchestrator:handle_start_failover:1861]Starting failover of nodes ['n_0@127.0.0.1'] AllowUnsafe = false Operation Id = 4cab98544233b93238edf400cf944b8e
      

       

      We know that failover does a config sync at the start and then takes a config snapshot that is used throughout the failover task.
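
      For reference, the "sync, then snapshot" shape of the task looks roughly like the sketch below. All module and function names are hypothetical placeholders (with local stubs so the sketch compiles); this is not the actual ns_server/chronicle code:

      %% failover_sketch.erl -- rough shape of the "sync, then snapshot"
      %% pattern described above. Hypothetical names throughout.
      -module(failover_sketch).
      -export([run_failover/1]).

      run_failover(OtherNodes) ->
          %% 1) Pull config from the other nodes (pre_failover_config_sync
          %%    in the logs above).
          ok = config_sync(OtherNodes),
          %% 2) Take one config snapshot and use it for the whole task.
          Snapshot = config_snapshot(),
          %% 3) All later decisions, e.g. "does the bucket have a vbucket
          %%    map?", are made against Snapshot, not the live config.
          case maps:find({bucket, "testbucket", map}, Snapshot) of
              error     -> {skipped, bucket_has_no_vbuckets};
              {ok, Map} -> {failing_over, Map}
          end.

      %% Placeholder stubs standing in for the real chronicle interaction.
      config_sync(_OtherNodes) -> ok.
      config_snapshot() -> #{}.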

       

      In this case it does not find any vbucket map for the bucket, so it does not do any bucket failover. Initially this makes sense, as the janitor run that was trying to set the initial map earlier was interrupted by failover and therefore canceled:

      [ns_server:info,2023-10-19T19:27:31.286Z,n_0@127.0.0.1:<0.2517.0>:failover:pre_failover_config_sync:202]Going to pull config from ['n_1@127.0.0.1','n_2@127.0.0.1'] before failover
      [user:info,2023-10-19T19:27:31.286Z,n_0@127.0.0.1:<0.1194.0>:ns_orchestrator:handle_start_failover:1861]Starting failover of nodes ['n_0@127.0.0.1'] AllowUnsafe = false Operation Id = 4cab98544233b93238edf400cf944b8e
      [ns_server:info,2023-10-19T19:27:31.286Z,n_0@127.0.0.1:<0.2517.0>:failover:pre_failover_config_sync:202]Going to pull config from ['n_1@127.0.0.1','n_2@127.0.0.1'] before failover
      [ns_server:debug,2023-10-19T19:27:31.292Z,n_0@127.0.0.1:<0.2517.0>:failover:failover_bucket_prep:480]Skipping failover of bucket "testbucket" because it has no vbuckets.
      

       

      However, we notice that while this failover is running (as an orchestrator task), and before it finishes, the initial map update takes place in chronicle, essentially changing the underlying config while the failover is running:

      [ns_server:debug,2023-10-19T19:27:31.297Z,n_0@127.0.0.1:chronicle_kv_log<0.543.0>:chronicle_kv_log:log:59]update (key: {bucket,"testbucket",props}, rev: {<<"7619eaf54303391ba53296a9fb690ddb">>,
                                                      43})
      [{map,[{0,[],['n_0@127.0.0.1','n_1@127.0.0.1']},
             {1,[],['n_0@127.0.0.1','n_1@127.0.0.1']},
             {2,[],['n_0@127.0.0.1','n_1@127.0.0.1']},
             {3,[],['n_0@127.0.0.1','n_2@127.0.0.1']},
             {4,[],['n_0@127.0.0.1','n_2@127.0.0.1']},
             {5,[],['n_0@127.0.0.1','n_2@127.0.0.1']},
             {6,[],['n_1@127.0.0.1','n_0@127.0.0.1']},
             {7,[],['n_1@127.0.0.1','n_0@127.0.0.1']},
             {8,[],['n_1@127.0.0.1','n_0@127.0.0.1']},
             {9,[],['n_1@127.0.0.1','n_2@127.0.0.1']},
             {10,[],['n_1@127.0.0.1','n_2@127.0.0.1']},
             {11,[],['n_2@127.0.0.1','n_0@127.0.0.1']},
             {12,[],['n_2@127.0.0.1','n_0@127.0.0.1']},
             {13,[],['n_2@127.0.0.1','n_1@127.0.0.1']},
             {14,[],['n_2@127.0.0.1','n_1@127.0.0.1']},
             {15,[],['n_2@127.0.0.1','n_1@127.0.0.1']}]},
      

       

      The takeaway is that config updates from an older, previously canceled orchestrator task can be applied some time later, even while a different orchestrator task is running. In this case, the canceled janitor task's config update took effect while failover was running. This sort of behavior seems like a good recipe for general config corruption bugs.

       

      In this specific case, the resulting config inconsistency is the kind the janitor normally fixes when it finds an inconsistent bucket config. However, the test was also ejecting n_0 right after, which broke things in an unpleasant way, and the janitor was not able to remediate the situation.

       

      Why does it happen?
      Looking at the chronicle code, the short answer is that, under the hood, the chronicle master batches both config apply/update requests and config sync requests. The sync requests are batched separately from the apply/update requests, and the default batching intervals also differ: syncs are batched every 5ms and commands (i.e. config updates) every 20ms by default.
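
      As a toy illustration of that batching behavior (a simplified sketch only, not chronicle's actual implementation; all names are made up), consider a process that keeps two independent batches and flushes each on its own timer:

      %% batcher_sketch.erl -- toy illustration of the batching described
      %% above: commands (config updates) and syncs go into two independent
      %% batches, each flushed on its own timer.
      -module(batcher_sketch).
      -export([start/0, submit_command/2, sync/1]).

      -define(COMMAND_INTERVAL, 20).  %% ms, like the default command batch
      -define(SYNC_INTERVAL, 5).      %% ms, like the default sync batch

      start() ->
          spawn(fun() -> loop([], []) end).

      %% Fire-and-forget config update; it just joins the command batch.
      submit_command(Pid, Cmd) ->
          Pid ! {command, Cmd},
          ok.

      %% Blocking sync; returns once the sync batch it joined is flushed.
      sync(Pid) ->
          Ref = make_ref(),
          Pid ! {sync, self(), Ref},
          receive {Ref, done} -> ok end.

      loop(Commands, Syncs) ->
          receive
              {command, Cmd} ->
                  %% First command in an empty batch arms the command timer.
                  Commands =:= [] andalso
                      erlang:send_after(?COMMAND_INTERVAL, self(), flush_commands),
                  loop([Cmd | Commands], Syncs);
              {sync, From, Ref} ->
                  Syncs =:= [] andalso
                      erlang:send_after(?SYNC_INTERVAL, self(), flush_syncs),
                  loop(Commands, [{From, Ref} | Syncs]);
              flush_syncs ->
                  %% Only the sync batch is flushed here; pending commands
                  %% stay queued and are applied later, after the callers'
                  %% syncs have already returned.
                  lists:foreach(fun({From, Ref}) -> From ! {Ref, done} end, Syncs),
                  loop(Commands, []);
              flush_commands ->
                  %% Apply all queued config updates in one go.
                  lists:foreach(fun apply_command/1, lists:reverse(Commands)),
                  loop([], Syncs)
          end.

      apply_command(Cmd) ->
          io:format("applying ~p~n", [Cmd]).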

       

      Therefore a config sync at the start of an orchestrator task is not, by itself, enough to prevent scenarios like this: it is entirely possible that when the sync batch is flushed, there are still items in the "command" batch that will only be processed later, when that batch is flushed. So if an orchestrator task takes a config snapshot after a config sync, the snapshot can still diverge from the true config of the cluster at any time, due to an update submitted earlier by a previously canceled orchestrator task. As long as the previous task was doing some config update, and that update made it into the master's current command batch, this scenario is possible, even though the task was canceled.
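
      In terms of the toy batcher sketched above, the interleaving would look something like the following (again hypothetical; it reuses the batcher_sketch module from the previous sketch):

      %% race_sketch.erl -- the interleaving described above, driven against
      %% the toy batcher_sketch module. All names are hypothetical.
      -module(race_sketch).
      -export([demo/0]).

      demo() ->
          B = batcher_sketch:start(),
          %% A previous task (the canceled janitor run in this ticket) already
          %% got its update into the command batch before it was killed.
          batcher_sketch:submit_command(B, {set_initial_map, "testbucket"}),
          %% The new orchestrator task (failover) starts with a config sync.
          %% The sync batch flushes after ~5ms, but the command batch does not,
          %% so a snapshot taken right here still has no vbucket map.
          ok = batcher_sketch:sync(B),
          %% ~20ms in, the command batch flushes and the map appears in the
          %% live config, underneath the still-running task.
          timer:sleep(30),
          ok.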

       

      In theory, we could handle a situation like this better by ensuring that a config sync also flushes out all pending config updates in the command batch. That would make the config snapshot taken at the start of an orchestrator task a reliable, up-to-date view of the cluster, with no surprise config updates landing later. Requires more thought.
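
      In terms of the toy batcher above, the idea would roughly correspond to replacing its flush_syncs clause so that queued config updates are applied before the syncs are acknowledged (a sketch of the idea only, not a chronicle patch):

              flush_syncs ->
                  %% Drain any config updates that were queued before the sync...
                  lists:foreach(fun apply_command/1, lists:reverse(Commands)),
                  %% ...and only then acknowledge the syncs, so a snapshot taken
                  %% right after sync() cannot be invalidated later by an older,
                  %% already-queued update.
                  lists:foreach(fun({From, Ref}) -> From ! {Ref, done} end, Syncs),
                  loop([], []);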


          People

            Assignee: Navdeep Boparai
            Reporter: Navdeep Boparai
