Couchbase Server / MB-46532

[BP MB-43285 to 6.6.3] Rebalance exited with reason {{badmatch : Timeout while trying to acquire lease

Details

    • Untriaged
    • 1
    • Unknown

    Description

      This bug covers two different failures with different proximate causes, more than one month apart, that possibly share the same root cause. The test scenario was created to stress metakv:

      • Dave Finlay analyzed the first one, from 2020-12-10, which has 10 log URLs, then requested a newer run.
      • Kevin Cherkauer analyzed the second one, from 2021-01-21, which has a Supportal snapshot. This run failed for a different reason than the first one.

       

      Description of the first failure (2020-12-10):

      Build: 6.6.0-7909

      • Cluster with 3 kv, 2 index+n1ql, and 4 search nodes
      • 6 buckets with 5000 docs each
      • Built 200 GSI indexes with replica 1 (50 indexes on each of 4 buckets)
      • Created 30 FTS custom indexes (10 indexes on each of 3 buckets), just to add more entries to metakv
      • Created and dropped 100 GSI indexes sequentially on each of 4 buckets (adding create/drop entries for 400 more indexes)
      • Created and dropped 50 FTS indexes on 3 buckets
      • Note for QE: examples for creating and dropping FTS and GSI indexes can be found here: https://github.com/couchbaselabs/productivitynautomation/tree/master/create-drop-gsi-fts-indexes
      • Graceful failover of a node with the kv service. The failover rebalance fails with the error below:

      {view_fragmentation_threshold,{30,undefined}}]
      [ns_server:warn,2020-12-14T09:40:56.993-08:00,ns_1@172.23.106.245:<0.28966.245>:leader_lease_acquire_worker:handle_acquire_timeout:112]Timeout while trying to acquire lease from 'ns_1@172.23.121.47'.
      Acquire options were [{timeout,0},{period,15000}]
      [ns_server:warn,2020-12-14T09:40:56.993-08:00,ns_1@172.23.106.245:<0.28988.245>:leader_lease_acquire_worker:handle_acquire_timeout:112]Timeout while trying to acquire lease from 'ns_1@172.23.107.197'.
      Acquire options were [{timeout,0},{period,15000}]
      [ns_server:warn,2020-12-14T09:40:56.993-08:00,ns_1@172.23.106.245:<0.29045.245>:leader_lease_acquire_worker:handle_acquire_timeout:112]Timeout while trying to acquire lease from 'ns_1@172.23.107.87'.
      Acquire options were [{timeout,0},{period,15000}]
      [ns_server:warn,2020-12-14T09:40:56.993-08:00,ns_1@172.23.106.245:<0.21396.245>:leader_lease_acquire_worker:handle_acquire_timeout:112]Timeout while trying to acquire lease from 'ns_1@172.23.121.41'.
      Acquire options were [{timeout,0},{period,15000}]
      [ns_server:warn,2020-12-14T09:40:56.994-08:00,ns_1@172.23.106.245:<0.29093.245>:leader_lease_acquire_worker:handle_acquire_timeout:112]Timeout while trying to acquire lease from 'ns_1@172.23.106.245'.
      Acquire options were [{timeout,0},{period,15000}]
      [ns_server:debug,2020-12-14T09:40:56.994-08:00,ns_1@172.23.106.245:leader_lease_agent<0.23775.0>:leader_lease_agent:handle_lease_expired:286]Lease held by {lease_holder,<<"5d86e482ae739d52262a6ebd2d87c1ca">>,
                                  'ns_1@172.23.106.245'} expired. Starting expirer.
      [ns_server:debug,2020-12-14T09:40:56.994-08:00,ns_1@172.23.106.245:leader_activities<0.23774.0>:leader_activities:terminate_activities:635]Terminating activities (reason is {shutdown,
                                         {quorum_lost,
                                          {lease_lost,'ns_1@172.23.106.245'}}}):
      [{activity,<0.26428.889>,#Ref<0.2150457706.1058275335.246268>,default,
                 <<"59a2588ae4959d8162566e0e9c3ab763">>,
                 [rebalance],
                 majority,[]}]
      [ns_server:info,2020-12-14T09:40:57.000-08:00,ns_1@172.23.106.245:rebalance_agent<0.23812.0>:rebalance_agent:handle_down:296]Rebalancer process <0.26591.889> died (reason shutdown).
      [ns_server:error,2020-12-14T09:40:57.000-08:00,ns_1@172.23.106.245:<0.26556.889>:leader_activities:report_error:1011]Activity {default,rebalance} failed with error {quorum_lost,
                                                      {lease_lost,
                                                       'ns_1@172.23.106.245'}}
      [ns_server:debug,2020-12-14T09:40:57.004-08:00,ns_1@172.23.106.245:ns_config_log<0.201.0>:ns_config_log:log_common:229]config change:
      {local_changes_count,<<"fda11948c11fa3107f8aa6d83b739109">>} ->
      [{'_vclock',[{<<"fda11948c11fa3107f8aa6d83b739109">>,{1572,63775186856}}]}]
      [user:error,2020-12-14T09:40:57.008-08:00,ns_1@172.23.106.245:<0.29120.245>:ns_orchestrator:log_rebalance_completion:1445]Rebalance exited with reason {{badmatch,
                                     {leader_activities_error,
                                      {default,rebalance},
                                      {quorum_lost,
                                       {lease_lost,'ns_1@172.23.106.245'}}}},
                                    [{ns_rebalancer,rebalance,5,
                                      [{file,"src/ns_rebalancer.erl"},{line,480}]},
                                     {proc_lib,init_p_do_apply,3,
                                      [{file,"proc_lib.erl"},{line,247}]}]}.
      Rebalance Operation Id = f5876e978c98291d711dd1704249b07b
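
For triage, the lease timeouts in an excerpt like the one above can be pulled out with a short script. This is a minimal sketch, not an ns_server tool: the regular expression is inferred only from the warning lines shown in this ticket, and the names LEASE_TIMEOUT, lease_timeouts and SAMPLE are hypothetical.

```python
import re

# Assumption: the line layout is inferred from the warnings quoted in this
# ticket; other ns_server versions may format these lines differently.
LEASE_TIMEOUT = re.compile(
    r"\[ns_server:warn,(?P<ts>[^,]+),(?P<reporter>[^:]+):.*?"
    r"Timeout while trying to acquire lease from '(?P<target>[^']+)'"
)

# Two warnings copied verbatim from the excerpt above.
SAMPLE = """\
[ns_server:warn,2020-12-14T09:40:56.993-08:00,ns_1@172.23.106.245:<0.28966.245>:leader_lease_acquire_worker:handle_acquire_timeout:112]Timeout while trying to acquire lease from 'ns_1@172.23.121.47'.
Acquire options were [{timeout,0},{period,15000}]
[ns_server:warn,2020-12-14T09:40:56.994-08:00,ns_1@172.23.106.245:<0.29093.245>:leader_lease_acquire_worker:handle_acquire_timeout:112]Timeout while trying to acquire lease from 'ns_1@172.23.106.245'.
Acquire options were [{timeout,0},{period,15000}]
"""


def lease_timeouts(log_text):
    """Return (timestamp, reporting node, target node) per lease-acquire timeout."""
    return [
        (m["ts"], m["reporter"], m["target"])
        for m in LEASE_TIMEOUT.finditer(log_text)
    ]


def self_timeouts(rows):
    """Keep only the timeouts where a node failed to acquire its own lease."""
    return [row for row in rows if row[1] == row[2]]


if __name__ == "__main__":
    for ts, reporter, target in lease_timeouts(SAMPLE):
        print(ts, reporter, "->", target)
```

In the excerpt, the last timeout targets the reporting node itself ('ns_1@172.23.106.245'), which is the same node named in the subsequent lease_lost error.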
      

      Logs:
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1607975035/collectinfo-2020-12-14T194357-ns_1%40172.23.106.245.zip (kv node)
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1607975035/collectinfo-2020-12-14T194357-ns_1%40172.23.107.104.zip (kv node)
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1607975035/collectinfo-2020-12-14T194357-ns_1%40172.23.107.197.zip (kv node)
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1607975035/collectinfo-2020-12-14T194357-ns_1%40172.23.107.87.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1607975035/collectinfo-2020-12-14T194357-ns_1%40172.23.121.41.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1607975035/collectinfo-2020-12-14T194357-ns_1%40172.23.121.45.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1607975035/collectinfo-2020-12-14T194357-ns_1%40172.23.121.46.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1607975035/collectinfo-2020-12-14T194357-ns_1%40172.23.121.47.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1607975035/collectinfo-2020-12-14T194357-ns_1%40172.23.121.49.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1607975035/collectinfo-2020-12-14T194357-ns_1%40172.23.121.52.zip
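
The node names in the log excerpt use '@', while the archive names above carry it percent-encoded as '%40'. The sketch below maps a log URL back to its node name so that archives can be matched to nodes mentioned in the logs. It assumes the file-name layout seen in the URLs listed here; node_from_url and ARCHIVE_NAME are hypothetical names.

```python
import re
from urllib.parse import unquote, urlparse

# Assumption: archives are named collectinfo-<date>T<time>-<node>.zip,
# as in the URLs listed in this ticket.
ARCHIVE_NAME = re.compile(
    r"collectinfo-(?P<stamp>\d{4}-\d{2}-\d{2}T\d{6})-(?P<node>.+)\.zip$"
)


def node_from_url(url):
    """Return the node name encoded in a collectinfo URL, or None if absent."""
    # Take the last path segment and decode %40 back to '@'.
    file_name = unquote(urlparse(url).path.rsplit("/", 1)[-1])
    match = ARCHIVE_NAME.match(file_name)
    return match["node"] if match else None


if __name__ == "__main__":
    url = (
        "https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1607975035/"
        "collectinfo-2020-12-14T194357-ns_1%40172.23.106.245.zip"
    )
    print(node_from_url(url))
```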

            People

              girish.benakappa Girish Benakappa
              kevin.cherkauer Kevin Cherkauer (Inactive)