Couchbase Server - MB-41277

[System Test]: Graceful failover exited with badmatch, leader_activities_error


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Not a Bug
    • Affects Version/s: 6.6.0
    • Fix Version/s: 6.6.0
    • Component/s: ns_server
    • Labels:

      Description

      Build: 6.6.0-7918 (MP1)

      Test: MH longevity

      Cycle: 3

      Day: 2nd

      Test Step

      [2020-09-02T18:57:34-07:00, sequoiatools/couchbase-cli:6.5:dd7d72] server-add -c 172.23.108.103:8091 --server-add https://172.23.106.100 -u Administrator -p password --server-add-username Administrator --server-add-password password --services data
      [2020-09-02T19:04:29-07:00, sequoiatools/couchbase-cli:6.5:34b52f] failover -c 172.23.108.103:8091 --server-failover 172.23.97.239:8091 -u Administrator -p password
      →
      Error occurred on container - sequoiatools/couchbase-cli:6.5:[failover -c 172.23.108.103:8091 --server-failover 172.23.97.239:8091 -u Administrator -p password]

      docker logs 34b52f
      docker start 34b52f

      WARNING: couchbase-cli version 6.5.0-3216-enterprise does not match couchbase server version 6.6.0-7918-enterprise
      Unable to display progress bar on this os
      ERROR: Graceful failover failed. See logs for detailed reason. You can try again.
      [2020-09-02T19:15:24-07:00, sequoiatools/couchbase-cli:6.5:f074ca] failover -c 172.23.108.103:8091 --server-failover 172.23.99.20:8091 -u Administrator -p password --force
      [2020-09-02T19:16:21-07:00, sequoiatools/couchbase-cli:6.5:2e2e36] rebalance -c 172.23.108.103:8091 -u Administrator -p password
      [2020-09-02T19:29:48-07:00, sequoiatools/cmd:8be21f] 60 

      Error logs

      [ns_server:error,2020-09-02T19:09:06.631-07:00,ns_1@172.23.108.103:<0.32085.869>:leader_activities:report_error:1011]Activity {default,{ns_janitor,"HISTORY",apply_config}} failed with error {quorum_lost,
                                                                                {lease_lost,
                                                                                 'ns_1@172.23.104.67'}}
      [user:error,2020-09-02T19:09:06.633-07:00,ns_1@172.23.108.103:<0.11364.0>:ns_orchestrator:log_rebalance_completion:1445]Graceful failover exited with reason {{badmatch,
                                             {leader_activities_error,
                                              {default,
                                               {ns_janitor,"HISTORY",apply_config}},
                                              {quorum_lost,
                                               {lease_lost,'ns_1@172.23.104.67'}}}},
                                            [{ns_janitor,cleanup_apply_config,4,
                                              [{file,"src/ns_janitor.erl"},
                                               {line,213}]},
                                             {ns_janitor,
                                              '-cleanup_membase_bucket/4-fun-0-',4,
                                              [{file,"src/ns_janitor.erl"},
                                               {line,65}]},
                                             {async,'-async_init/4-fun-1-',3,
                                              [{file,"src/async.erl"},{line,197}]}]}.
      Rebalance Operation Id = e30cc2fa709686920d4b76d18a97dbec 

        Attachments


          Activity

          dfinlay Dave Finlay added a comment -

          The graceful failover stops because the quorum is lost on the janitor run prior to the HISTORY bucket being gracefully failed over.

          [ns_server:debug,2020-09-02T19:09:06.630-07:00,ns_1@172.23.108.103:leader_activities<0.11336.0>:leader_activities:terminate_activities:635]Terminating activities (reason is {shutdown,
                                             {quorum_lost,
                                              {lease_lost,'ns_1@172.23.104.67'}}}):
          [{activity,<0.31690.869>,#Ref<0.2360701235.947388423.130974>,default,
                     <<"32eb65b8b6ffe3fbf3ec4559094b6036">>,
                     [graceful_failover,
                      {ns_janitor,"HISTORY",cleanup},
                      {ns_janitor,"HISTORY",apply_config}],
                     {all,{set,11,16,16,8,80,48,
                               {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                               {{['ns_1@172.23.99.20'],
                                 ['ns_1@172.23.99.21','ns_1@172.23.97.121',
                                  'ns_1@172.23.104.61'],
                                 ['ns_1@172.23.97.122'],
                                 ['ns_1@172.23.108.103'],
                                 [],
                                 ['ns_1@172.23.99.25'],
                                 [],
                                 ['ns_1@172.23.104.67','ns_1@172.23.104.137'],
                                 [],
                                 ['ns_1@172.23.97.239','ns_1@172.23.97.119'],
                                 [],[],[],[],[],[]}}}},
                     [quiet]}]
          

           This part of the janitor requires a unanimous quorum of the KV nodes - it's more than a majority quorum, and it has always been the case that the janitor requires all non-failed-over KV nodes to be up. The lease from node .67 is lost, and that's enough to terminate the graceful failover after 5 buckets have already been gracefully failed over.
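
           To make the distinction concrete, here is a minimal sketch (a hypothetical illustration, not the actual ns_server leader_activities code) of the difference between the majority check most activities run under and the all-nodes check that the janitor's cleanup_apply_config step effectively needs:

           -module(quorum_sketch).
           -export([have_majority/2, have_all/2]).

           %% HeldLeases: nodes whose leases the orchestrator currently holds.
           %% RequiredNodes: nodes the activity needs, e.g. the non-failed-over KV nodes.
           have_majority(HeldLeases, RequiredNodes) ->
               Held = [N || N <- RequiredNodes, lists:member(N, HeldLeases)],
               2 * length(Held) > length(RequiredNodes).

           have_all(HeldLeases, RequiredNodes) ->
               lists:all(fun (N) -> lists:member(N, HeldLeases) end, RequiredNodes).

           With only the lease on 'ns_1@172.23.104.67' lost, have_majority/2 would still hold for this cluster, but have_all/2 returns false - which matches the {quorum_lost,{lease_lost,...}} failure reported above.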

           At this point the CPU usage on the orchestrator node is in the vicinity of 95%.

           During the graceful failover we can see that leases are lost on quite a number of nodes and subsequently reacquired. These lost leases don't cause quorum loss because, at these earlier points, the activity only requires a majority quorum or the leases are lost on non-KV nodes.

          [ns_server:warn,2020-09-02T19:05:16.930-07:00,ns_1@172.23.108.103:<0.14973.759>:leader_lease_acquire_worker:handle_acquire_timeout:112]Timeout while trying to acquire lease from 'ns_1@172.23.99.11'.
          Acquire options were [{timeout,0},{period,15000}]
          --
          [ns_server:warn,2020-09-02T19:05:16.930-07:00,ns_1@172.23.108.103:<0.30045.554>:leader_lease_acquire_worker:handle_acquire_timeout:112]Timeout while trying to acquire lease from 'ns_1@172.23.104.61'.
          Acquire options were [{timeout,0},{period,15000}]
          --
          [ns_server:info,2020-09-02T19:05:25.492-07:00,ns_1@172.23.108.103:<0.30045.554>:leader_lease_acquire_worker:handle_fresh_lease_acquired:302]Acquired lease from node 'ns_1@172.23.104.61' (lease uuid: <<"de96a54b9b6ea9570c884be89beba701">>)
          [ns_server:info,2020-09-02T19:05:25.589-07:00,ns_1@172.23.108.103:<0.14973.759>:leader_lease_acquire_worker:handle_fresh_lease_acquired:302]Acquired lease from node 'ns_1@172.23.99.11' (lease uuid: <<"de96a54b9b6ea9570c884be89beba701">>)
          [ns_server:debug,2020-09-02T19:05:25.822-07:00,ns_1@172.23.108.103:<0.23991.867>:ns_memcached:run_check_and_maybe_update_config:814]ensure_bucket took too long: 2132006 us
          --
          [ns_server:warn,2020-09-02T19:06:13.517-07:00,ns_1@172.23.108.103:<0.15864.838>:leader_lease_acquire_worker:handle_acquire_timeout:112]Timeout while trying to acquire lease from 'ns_1@172.23.97.242'.
          Acquire options were [{timeout,0},{period,15000}]
          --
          [ns_server:info,2020-09-02T19:06:14.248-07:00,ns_1@172.23.108.103:<0.15864.838>:leader_lease_acquire_worker:handle_fresh_lease_acquired:302]Acquired lease from node 'ns_1@172.23.97.242' (lease uuid: <<"de96a54b9b6ea9570c884be89beba701">>)
          [ns_server:info,2020-09-02T19:06:14.289-07:00,ns_1@172.23.108.103:<0.7310.868>:compaction_daemon:maybe_compact_vbucket:739]Compaction of <<"ITEM/11">> has finished with ok
          --
          [ns_server:warn,2020-09-02T19:07:32.271-07:00,ns_1@172.23.108.103:<0.14973.759>:leader_lease_acquire_worker:handle_acquire_timeout:112]Timeout while trying to acquire lease from 'ns_1@172.23.99.11'.
          Acquire options were [{timeout,0},{period,15000}]
          --
          [ns_server:warn,2020-09-02T19:07:32.271-07:00,ns_1@172.23.108.103:<0.30270.692>:leader_lease_acquire_worker:handle_acquire_timeout:112]Timeout while trying to acquire lease from 'ns_1@172.23.96.148'.
          Acquire options were [{timeout,0},{period,15000}]
          --
          [ns_server:info,2020-09-02T19:07:37.029-07:00,ns_1@172.23.108.103:<0.14973.759>:leader_lease_acquire_worker:handle_fresh_lease_acquired:302]Acquired lease from node 'ns_1@172.23.99.11' (lease uuid: <<"de96a54b9b6ea9570c884be89beba701">>)
          [ns_server:info,2020-09-02T19:07:37.031-07:00,ns_1@172.23.108.103:<0.30270.692>:leader_lease_acquire_worker:handle_fresh_lease_acquired:302]Acquired lease from node 'ns_1@172.23.96.148' (lease uuid: <<"de96a54b9b6ea9570c884be89beba701">>)
          [ns_server:info,2020-09-02T19:07:37.066-07:00,ns_1@172.23.108.103:<0.2978.869>:compaction_daemon:maybe_compact_vbucket:736]Compacting 'WAREHOUSE/228', DataSize = 33604, FileSize = 237659, Options = {1599095221,
          --
          [ns_server:warn,2020-09-02T19:09:06.627-07:00,ns_1@172.23.108.103:<0.15380.195>:leader_lease_acquire_worker:handle_acquire_timeout:112]Timeout while trying to acquire lease from 'ns_1@172.23.104.67'.
          Acquire options were [{timeout,0},{period,15000}]
          

          In each case we can see that the lease is lost with a 0 timeout - which means the lease acquirer process was woken up after the lease had already expired.
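
           As a rough sketch of why a CPU-starved node produces this {timeout,0} pattern (again a hypothetical illustration, not the leader_lease_acquire_worker implementation): the time budget left for renewing a lease is derived from when the lease was last granted, so if the renewal process is scheduled late enough, the budget is already zero by the time it runs.

           -module(lease_timeout_sketch).
           -export([remaining_timeout/2]).

           %% AcquiredAtMs: monotonic timestamp (ms) when the lease was last granted.
           %% PeriodMs: lease period; 15000 ms in the logs above.
           %% Under heavy CPU load the renewal process can be scheduled so late
           %% that this returns 0, i.e. the lease expired before we tried to renew.
           remaining_timeout(AcquiredAtMs, PeriodMs) ->
               Now = erlang:monotonic_time(millisecond),
               max(0, AcquiredAtMs + PeriodMs - Now).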

           This isn't a bug - the node's CPU usage is too high for us to expect graceful failover to always succeed.


            People

            Assignee:
            dfinlay Dave Finlay
            Reporter:
            vikas.chaudhary Vikas Chaudhary
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

              Dates

              Created:
              Updated:
              Resolved:

                Gerrit Reviews

                There are no open Gerrit changes
