Couchbase Server
MB-43290

Rebalance failure observed in build sanity after backup service is added


Details


    Description

      Issue observed in: 7.0.0-4025

      Test:
      ./testrunner -i node_conf.ini -p get-cbcollect-info=True,get-couch-dbinfo=True,skip_cleanup=False,skip_log_scan=False -t ent_backup_restore.enterprise_backup_restore_test.EnterpriseBackupRestoreTest.test_backup_restore_sanity,items=1000
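For context, the test harness decides that a rebalance failed by polling ns_server's reported status until it either completes or exits with an error. A minimal sketch of such a polling loop, with the status source injected so nothing depends on a live cluster (`fetch_status` and the status strings here are illustrative assumptions, not the harness's real API; ns_server exposes similar information via `/pools/default/rebalanceProgress`):

```python
import time

def wait_for_rebalance(fetch_status, timeout_s=300, poll_s=1.0):
    """Poll a status source until the rebalance finishes or fails.

    fetch_status() is assumed to return "running" while in progress,
    "none" once finished, or any other string on error -- a
    simplification of what ns_server actually reports.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()
        if status == "none":
            return True    # rebalance completed successfully
        if status != "running":
            return False   # rebalance exited with an error, as in this bug
        time.sleep(poll_s)
    raise TimeoutError("rebalance did not finish in time")
```

With a stub status source, `wait_for_rebalance` returns `False` as soon as the reported status is anything other than "running" or "none", which is the condition the test above trips on.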

      From diag.log:

2020-12-14T11:05:49.498-08:00, memcached_config_mgr:0:info:message(ns_1@172.23.105.153) - Hot-reloaded memcached.json for config change of the following keys: [<<"scramsha_fallback_salt">>]
      2020-12-14T11:05:50.109-08:00, ns_orchestrator:0:info:message(ns_1@172.23.105.151) - Starting rebalance, KeepNodes = ['ns_1@172.23.105.151','ns_1@172.23.105.153'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 4f9a3be20d968903fc7ea27ccb5b3b56
      2020-12-14T11:05:52.360-08:00, ns_orchestrator:0:critical:message(ns_1@172.23.105.151) - Rebalance exited with reason {{badmatch,failed},
                                    [{ns_rebalancer,rebalance_body,5,
                                         [{file,"src/ns_rebalancer.erl"},
                                          {line,532}]},
                                     {async,'-async_init/4-fun-1-',3,
                                         [{file,"src/async.erl"},{line,197}]}]}.
      Rebalance Operation Id = 4f9a3be20d968903fc7ea27ccb5b3b56
      2020-12-14T11:06:00.305-08:00, menelaus_web:102:warning:client-side error report(ns_1@172.23.105.151) - Client-side error-report for user "<ud>Administrator</ud>" on node 'ns_1@172.23.105.151':
      User-Agent:Python-httplib2/0.13.1 (gzip)
      Starting rebalance from test, ejected nodes ['ns_1@172.23.105.153']
      2020-12-14T11:06:00.313-08:00, ns_orchestrator:0:info:message(ns_1@172.23.105.151) - Starting rebalance, KeepNodes = ['ns_1@172.23.105.151'], EjectNodes = ['ns_1@172.23.105.153'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 0045d716e47be11e253d0725577b86cf
      2020-12-14T11:06:10.439-08:00, ns_cluster:1:info:message(ns_1@172.23.105.153) - Node 'ns_1@172.23.105.153' is leaving cluster.
      2020-12-14T11:06:10.447-08:00, ns_orchestrator:0:info:message(ns_1@172.23.105.151) - Rebalance completed successfully.
      Rebalance Operation Id = 0045d716e47be11e253d0725577b86cf
      2020-12-14T11:06:10.644-08:00, ns_node_disco:5:warning:node down(ns_1@172.23.105.151) - Node 'ns_1@172.23.105.151' saw that node 'ns_1@172.23.105.153' went down. Details: [{nodedown_reason,
                                                                                           connection_closed}]
      2020-12-14T11:07:01.831-08:00, ns_cookie_manager:3:info:cookie update(ns_1@172.23.105.151) - Initial otp cookie generated: {sanitized,
                                        <<"VOL7MTlDuCj/QIAJDPiYpZNWoVQkVkznD/h9HETT13E=">>}
      2020-12-14T11:07:01.957-08:00, menelaus_sup:1:info:web start ok(ns_1@172.23.105.151) - Couchbase Server has started on web port 8091 on node 'ns_1@172.23.105.151'. Version: "7.0.0-4025-enterprise".
      2020-12-14T11:07:02.094-08:00, mb_master:0:info:message(ns_1@172.23.105.151) - I'm the only node, so I'm the master.
      2020-12-14T11:07:02.170-08:00, compat_mode_manager:0:warning:message(ns_1@172.23.105.151) - Changed cluster compat mode from undefined to [7,0]
      2020-12-14T11:07:02.203-08:00, auto_failover:0:info:message(ns_1@172.23.105.151) - Enabled auto-failover with timeout 120 and max count 1
      2020-12-14T11:07:08.878-08:00, menelaus_web:102:warning:client-side error report(ns_1@172.23.105.151) - Client-side error-report for user "<ud>Administrator</ud>" on node 'ns_1@172.23.105.151':
      User-Agent:Python-httplib2/0.13.1 (gzip)
      2020-12-14 11:07:08.856707 : test_backup_restore_sanity finished 
      -------------------------------
       
       
      per_node_processes('ns_1@172.23.105.151') =
           {<0.5656.0>,
            [{backtrace,
                 [<<"Program counter: 0x00007f261dcf6ff0 (diag_handler:'-collect_diag_per_node/1-fun-1-'/2 + 112)">>,
                  <<"CP: 0x0000000000000000 (invalid)">>,<<>>,
                  <<"0x00007f25d7f7a470 Return addr 0x00007f26653d6390 (proc_lib:init_p/3 + 200)">>,
                  <<"y(0)     <0.5655.0>">>,<<>>,
                  <<"0x00007f25d7f7a480 Return addr 0x0000000000986fa8 (<terminate process normally>)">>,
                  <<"y(0)     []">>,<<"y(1)     []">>,
                  <<"y(2)     Catch 0x00007f26653d63a0 (proc_lib:init_p/3 + 216)">>,
                  <<>>]},
             {messages,[]},
             {dictionary,
                 [{'$ancestors',[<0.5655.0>]},
                  {'$initial_call',
                      {diag_handler,'-collect_diag_per_node/1-fun-1-',0}}]},
             {registered_name,[]},
             {status,waiting},
             {initial_call,{proc_lib,init_p,3}},
             {error_handler,error_handler},
             {garbage_collection,
                 [{max_heap_size,#{error_logger => true,kill => true,size => 0}},
                  {min_bin_vheap_size,46422},
                  {min_heap_size,233},
                  {fullsweep_after,512},
                  {minor_gcs,0}]},
             {garbage_collection_info,
                 [{old_heap_block_size,0},
                  {heap_block_size,233},
                  {mbuf_size,0},
                  {recent_size,0},
                  {stack_size,6},
                  {old_heap_size,0},
                  {heap_size,32},
                  {bin_vheap_size,0},
                  {bin_vheap_block_size,46422},
                  {bin_old_vheap_size,0},
                  {bin_old_vheap_block_size,46422}]},
             {links,[<0.5655.0>]},
             {monitors,[{process,<0.339.0>},{process,<0.5655.0>}]},
             {monitored_by,[]},
             {memory,2860},
             {message_queue_len,0},
             {reductions,13},
             {trap_exit,false},
             {current_location,
                 {diag_handler,'-collect_diag_per_node/1-fun-1-',2,
                     [{file,"src/diag_handler.erl"},{line,228}]}}]}
      

We added the backup service to build sanity and started seeing this failure; attaching logs.
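For readers less familiar with ns_server's Erlang conventions: the `{{badmatch,failed}}` crash in `ns_rebalancer:rebalance_body/5` above is the Erlang idiom of asserting a result via pattern match, where the code expected a success value but got the atom `failed`, so the rebalancer process crashed at that line. A rough Python analogue of the idiom (function names are illustrative only, not the actual ns_server code):

```python
def rebalance_body(run_rebalance):
    # Erlang equivalent: `done = run_rebalance(...)`.
    # If run_rebalance() returns 'failed' instead of the expected value,
    # the process crashes with {badmatch, failed}, as seen in the log.
    result = run_rebalance()
    if result != "done":
        raise RuntimeError(f"badmatch: {result!r}")
    return result
```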

      Attachments


        Activity


carlos.gonzalez Carlos Gonzalez Betancort (Inactive) added a comment - Raju Suravarjjala No, not yet. I came back from holiday on Monday, and I can only reproduce the issue with the testrunner test, so I have spent most of my time setting that up and creating toy builds to see exactly why and how it is happening.


carlos.gonzalez Carlos Gonzalez Betancort (Inactive) added a comment - I have an update: after some investigation, it seems the canceling of the original rebalance is not happening fast enough, which is causing the issue. I have made a toy build where the canceling happens faster, and that seems to do the trick. Locally the test still fails, but no longer due to the rebalance; it now fails on something sizing-related that seems unrelated. I will upload the change so it gets reviewed.
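The fix described in the comment above amounts to bounding cancellation latency: if a service only checks for a cancel request once per long poll interval, a newly requested rebalance can start while the previous one is still winding down. A generic sketch of the principle (none of this is cbbs code; it only shows that worst-case cancellation latency is one poll interval):

```python
import threading
import time

def cancellable_work(cancel: threading.Event, poll_s: float) -> float:
    """Run until cancelled; return how long cancellation took to observe.

    The worker only notices the cancel request at its next check, so the
    worst-case cancellation latency is one poll interval. Shrinking
    poll_s is the "cancel in a more timely manner" idea from the fix.
    """
    start = time.monotonic()
    while not cancel.wait(timeout=poll_s):
        pass  # one unit of interruptible work per iteration
    return time.monotonic() - start

# Request cancellation shortly after the worker starts; with a short
# poll interval it stops almost immediately.
cancel = threading.Event()
latency = []
worker = threading.Thread(
    target=lambda: latency.append(cancellable_work(cancel, poll_s=0.01)))
worker.start()
time.sleep(0.05)
cancel.set()
worker.join()
```

With `poll_s=0.01` the observed latency stays close to the 0.05 s it took to request cancellation; with a multi-second poll interval the old operation would linger long enough to collide with the next rebalance, which matches the failure seen here.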


carlos.gonzalez Carlos Gonzalez Betancort (Inactive) added a comment - The newest patch did the trick locally; at least there were no more rebalance errors. Please reopen if it persists.


          build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.0-4216 contains cbbs commit b62a68d with commit message: MB-43290 Cancel in a more timely matter


          arunkumar Arunkumar Senthilnathan (Inactive) added a comment - Issue not observed in 7.0.0-4218

People

  arunkumar Arunkumar Senthilnathan (Inactive)
