MB-36017: Easier to cause conflict in bucket config in Mad Hatter when compared with 6.0



    Description

      It's significantly easier in Mad Hatter to generate bucket config conflicts than it is in 6.0, at least for the test I tried. This is the test I ran (a rough scripted sketch of the steps follows the list):

      1. Start a 4 node cluster; set the auto-failover time to 5 s for convenience
      2. Remove node 4 (or at least a node which is not the orchestrator)
      3. Start rebalance
      4. More or less right away send a SIGSTOP to the orchestrator node
      5. Wait until another node has picked up orchestrator-ship and completed failover of the old orchestrator
      6. Send SIGCONT to the old orchestrator
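
      For convenience, here is the sequence above as a rough scripted sketch. It assumes a local cluster_run style cluster (n_0 through n_3 on ports 9000-9003) and Administrator/asdasd credentials, and the orchestrator's beam PID has to be looked up by hand; /settings/autoFailover and /controller/rebalance are the standard REST endpoints, everything else is purely illustrative.

      import os, signal, time
      import requests

      AUTH = ("Administrator", "asdasd")            # assumed cluster_run credentials
      ORCH = "http://127.0.0.1:9000"                # n_0, the initial orchestrator
      NODES = ["n_%d@127.0.0.1" % i for i in range(4)]

      # 1. Set the auto-failover timeout to 5 s for convenience.
      requests.post(ORCH + "/settings/autoFailover", auth=AUTH,
                    data={"enabled": "true", "timeout": 5}).raise_for_status()

      # 2 & 3. Mark n_3 (a non-orchestrator node) for removal and start the rebalance.
      requests.post(ORCH + "/controller/rebalance", auth=AUTH,
                    data={"knownNodes": ",".join(NODES),
                          "ejectedNodes": NODES[3]}).raise_for_status()

      # 4. More or less right away, suspend the orchestrator's ns_server.
      orch_pid = int(os.environ["N0_BEAM_PID"])     # hypothetical: look the PID up by hand
      os.kill(orch_pid, signal.SIGSTOP)

      # 5. Give another node time to take over and auto-fail the old orchestrator over.
      time.sleep(60)

      # 6. Resume the old orchestrator and watch the logs for the config conflict.
      os.kill(orch_pid, signal.SIGCONT)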

      In Mad Hatter, what mostly happens is that the old orchestrator has pending changes to the vbucket map which get sent after it is resumed post-failover. This causes a conflict in the vbucket map, effectively rolling back the failover. In particular, the vbucket map ends up with entries referring to nodes that have since been failed over (a conceptual sketch of the merge behavior follows the excerpt). E.g.

      [user:info,2019-09-16T13:15:08.564-07:00,n_0@127.0.0.1:ns_config_rep<0.12972.1>:ns_config:merge_values_using_timestamps:1358]Conflicting configuration changes to field buckets:
      [{'_vclock',[{<<"5f297d430cc59ebb80412011aefb1cc1">>,{86,63735883915}},
      ...
                   {<<"e4fdefb5da810b83581a58ba409c08ee">>,{94,63735883506}}]},
       {configs,[{"default",
                  [{deltaRecoveryMap,undefined},
                   {uuid,<<"dc7266278f1c9bc404bf66ab9eb649ff">>}
      ...
                  {map,[['n_0@127.0.0.1','n_1@127.0.0.1'],
                         ['n_0@127.0.0.1','n_1@127.0.0.1'],
                         ['n_0@127.0.0.1','n_1@127.0.0.1'],
                         ['n_0@127.0.0.1','n_1@127.0.0.1'],
      ...
      and
      [{'_vclock',[{<<"5f297d430cc59ebb80412011aefb1cc1">>,{86,63735883915}},
      ...
                   {<<"e4fdefb5da810b83581a58ba409c08ee">>,{94,63735883506}}]},
       {configs,[{"default",
                  [{deltaRecoveryMap,undefined},
                   {uuid,<<"dc7266278f1c9bc404bf66ab9eb649ff">>},
      ...
                   {map,[['n_1@127.0.0.1',undefined],
                         ['n_1@127.0.0.1',undefined],
                         ['n_1@127.0.0.1',undefined],
                         ['n_1@127.0.0.1',undefined],
                   {fastForwardMap,undefined}]}]}], choosing the former, which looks newer.
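
      Conceptually the resolution behaves like the sketch below; this is a rough model for illustration, not the actual ns_config:merge_values_using_timestamps code. When vector clocks conflict, the merge keeps whichever value carries the newest wall-clock timestamp, so the suspended orchestrator's post-resume map sync, which is stamped after the failover, "looks newer" and wins.

      def latest_timestamp(vclock):
          # vclock entries look like (node_uuid, (counter, gregorian_seconds)),
          # e.g. ("5f297d43...", (86, 63735883915)) in the excerpt above
          return max(ts for _uuid, (_counter, ts) in vclock)

      def merge_conflicting(local, remote):
          """On a vclock conflict, keep whichever value looks newer by wall clock."""
          if latest_timestamp(local["_vclock"]) >= latest_timestamp(remote["_vclock"]):
              return local
          return remote

      # The resumed orchestrator's map sync happens last in wall-clock terms, so a
      # timestamp-based tiebreak cannot tell the stale map from a genuinely newer one.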
      

      Here's the full sequence of events:

      # start rebalancing bucket default
      [user:info,2019-09-16T13:13:56.888-07:00,n_0@127.0.0.1:<0.24818.1>:ns_rebalancer:rebalance_bucket:606]Started rebalancing bucket default
      ...
      # here's where ns_server gets suspended and then, about a minute later, resumed
      [ns_server:info,2019-09-16T13:14:00.748-07:00,n_0@127.0.0.1:<0.13790.1>:ns_memcached:do_handle_call:550]Changed bucket "default" vbucket 56 state to replica
      [ns_server:debug,2019-09-16T13:15:08.533-07:00,n_0@127.0.0.1:janitor_agent-default<0.13775.1>:janitor_agent:handle_info:859]Got done message from subprocess: <0.26165.1> (ok)
      ...
      # bucket config is in conflict
      [user:info,2019-09-16T13:15:08.564-07:00,n_0@127.0.0.1:ns_config_rep<0.12972.1>:ns_config:merge_values_using_timestamps:1358]Conflicting configuration changes to field buckets:
      ...
      # trace showing that we just sent a map sync from the old orchestrator - which caused the conflict
      [ns_server:debug,2019-09-16T13:15:08.619-07:00,n_0@127.0.0.1:<0.24985.1>:ns_vbucket_mover:map_sync:320]Batched 1 vbucket map syncs together.
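
      In case it helps anyone triaging the attached cbcollects, here is a tiny convenience sketch (a hypothetical helper, not part of any existing tooling) that pulls the relevant lines out of an ns_server debug log; the marker strings are taken from the excerpts above.

      import sys

      MARKERS = ("Conflicting configuration changes to field buckets",
                 "vbucket map syncs together")

      # Usage: python find_conflict.py path/to/ns_server.debug.log
      with open(sys.argv[1], errors="replace") as log:
          for line in log:
              if any(marker in line for marker in MARKERS):
                  print(line.rstrip())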
      

      There is no easy way to recover from this situation; effectively the vbucket map is corrupt.
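
      To make the "effectively corrupt" claim concrete, here is a small illustrative check; stale_chain_entries is a hypothetical helper, not an ns_server API. It flags vbucket chains that name nodes which are no longer active, which is exactly the state the winning map above is left in: n_0 is failed over yet still listed as the active copy.

      def stale_chain_entries(vbucket_map, active_nodes):
          """Return (vbucket_id, chain) pairs whose chain names a node that is
          no longer active in the cluster."""
          active = set(active_nodes)
          return [(vb, chain)
                  for vb, chain in enumerate(vbucket_map)
                  if any(node is not None and node not in active for node in chain)]

      # Mirroring the excerpt: after the failover only n_1..n_3 are active, yet the
      # winning map still lists n_0 as the active copy of these vbuckets.
      winning_map = [["n_0@127.0.0.1", "n_1@127.0.0.1"]] * 4
      print(stale_chain_entries(winning_map,
                                ["n_1@127.0.0.1", "n_2@127.0.0.1", "n_3@127.0.0.1"]))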

      We have a delay of 5 ms on the vbucket map sync to allow batching of the map synchronizations, as they are somewhat expensive. I tried reducing the delay to 0 and was still able to reproduce this issue, though it seemed harder to hit.
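
      For reference, the batching referred to here works roughly like the sketch below; this is assumed behavior for illustration, not the actual ns_vbucket_mover code. Map updates are buffered for a short window (5 ms by default here) so that several vbucket moves land in a single config sync. Note that even with a 0 ms window the flush is still asynchronous, which is consistent with the issue remaining reproducible at a 0 ms delay.

      import threading

      class MapSyncBatcher:
          def __init__(self, sync_fn, delay_s=0.005):        # 5 ms batching window
              self.sync_fn = sync_fn
              self.delay_s = delay_s
              self.pending = []
              self.timer = None
              self.lock = threading.Lock()

          def update(self, move):
              with self.lock:
                  self.pending.append(move)
                  if self.timer is None:                     # first update opens the window
                      self.timer = threading.Timer(self.delay_s, self._flush)
                      self.timer.start()

          def _flush(self):
              with self.lock:
                  batch, self.pending, self.timer = self.pending, [], None
              self.sync_fn(batch)                            # one config write for the whole batch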

      Then I tried reproducing the same problem in 6.0 and could not reproduce it even once. Other conflicts were produced (on the counters and cbas_memory_quota keys), but these are not very consequential and the situation was entirely recoverable. Interestingly, the old orchestrator frequently resumed its position as orchestrator.

      Mad Hatter logs (with standard delay of 5 ms):
      https://cb-engineering.s3.amazonaws.com/davef/bucket_conflicts/collectinfo-2019-09-16T201626-n_0%40127.0.0.1.zip
      https://cb-engineering.s3.amazonaws.com/davef/bucket_conflicts/collectinfo-2019-09-16T201626-n_1%40127.0.0.1.zip
      https://cb-engineering.s3.amazonaws.com/davef/bucket_conflicts/collectinfo-2019-09-16T201626-n_2%40127.0.0.1.zip
      https://cb-engineering.s3.amazonaws.com/davef/bucket_conflicts/collectinfo-2019-09-16T201626-n_3%40127.0.0.1.zip

      Mad Hatter logs (with delay of 0 ms):
      https://cb-engineering.s3.amazonaws.com/davef/bucket_conflicts_0_delay/collectinfo-2019-09-16T204624-n_0%40127.0.0.1.zip
      https://cb-engineering.s3.amazonaws.com/davef/bucket_conflicts_0_delay/collectinfo-2019-09-16T204624-n_1%40127.0.0.1.zip
      https://cb-engineering.s3.amazonaws.com/davef/bucket_conflicts_0_delay/collectinfo-2019-09-16T204624-n_2%40127.0.0.1.zip
      https://cb-engineering.s3.amazonaws.com/davef/bucket_conflicts_0_delay/collectinfo-2019-09-16T204624-n_3%40127.0.0.1.zip

      6.0 logs:
      https://cb-engineering.s3.amazonaws.com/davef/no_bucket_conflicts_6.0/collectinfo-2019-09-17T034835-n_0%40127.0.0.1.zip
      https://cb-engineering.s3.amazonaws.com/davef/no_bucket_conflicts_6.0/collectinfo-2019-09-17T034835-n_1%40127.0.0.1.zip
      https://cb-engineering.s3.amazonaws.com/davef/no_bucket_conflicts_6.0/collectinfo-2019-09-17T034835-n_2%40127.0.0.1.zip
      https://cb-engineering.s3.amazonaws.com/davef/no_bucket_conflicts_6.0/collectinfo-2019-09-17T034835-n_3%40127.0.0.1.zip


          People

            Assignee: dfinlay (Dave Finlay)
            Reporter: dfinlay (Dave Finlay)
            Votes: 0
            Watchers: 6
