  1. Couchbase Server
  2. MB-48661

Rebalance out a node failed. reason: setup_replications_failed

Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • None
    • 7.1.0
    • couchbase-bucket
    • 7.1.0-1363

    Description

      Steps To Reproduce:

      1. Create a 10 node KV cluster
      2. Create a Magma bucket with 1 replica. Create 20 collections.
      3. Load 10M (0-10M, 0-50k per collection) items and upsert them once.
      4. Load another 1M (10M-20M, 10M-20M per collection) items and upsert them.
      5. Start CRUD load per collection as below:

        Read Start: 0
        Read End: 500000
        Update Start: 1000000
        Update End: 10000000
        Expiry Start: 0
        Expiry End: 0
        Delete Start: 500000
        Delete End: 1000000
        Create Start: 1000000
        Create End: 10000000
        Final Start: 1000000
        Final End: 10000000
        

      6. Rebalance in one node. Abort and resume the rebalance at 20%, 40%, 60%, and 80%. The rebalance passed.
      7. Crash Magma/memcached on all 10 nodes while docs are loading, with a random sleep of random.randint(60, 120) between kills. After every kill, wait for bucket warmup. This step went fine: no crashes were found and there were no critical messages in memcached.log.
      8. Rebalance out one node. The rebalance failed:

        {u'code': 0, u'module': u'ns_orchestrator', u'type': u'critical', u'node': u'ns_1@172.23.120.170', u'tstamp': 1632909123361L, u'shortText': u'message', u'serverTime': u'2021-09-29T02:52:03.361Z', u'text': u'Rebalance exited with reason {mover_crashed,\n                              {unexpected_exit,\n                               {\'EXIT\',<0.5065.41>,\n                                {{{badmatch,\n                                   {error,\n                                    {setup_replications_failed,\n                                     [{\'ns_1@172.23.120.170\',\n                                       {errors,[{10,64}]}}]}}},\n                                  [{janitor_agent,handle_apply_vbucket_state,\n                                    2,\n                                    [{file,"src/janitor_agent.erl"},\n                                     {line,1074}]},\n                                   {janitor_agent,\n                                    apply_vbucket_states_worker_loop,0,\n                                    [{file,"src/janitor_agent.erl"},\n                                     {line,1063}]},\n                                   {proc_lib,init_p,3,\n                                    [{file,"proc_lib.erl"},{line,234}]}]},\n                                 {gen_server,call,\n                                  [{\'janitor_agent-GleamBookUsers0\',\n                                    \'ns_1@172.23.121.127\'},\n                                   {if_rebalance,<0.3860.41>,\n                                    {wait_dcp_data_move,\n                                     [\'ns_1@172.23.121.129\',\n                                      \'ns_1@172.23.121.115\'],\n                                     698}},\n                                   infinity]}}}}}.\nRebalance Operation Id = 694a80c21b7d0a2eb1c7118d1781ff67'}
        2021-09-29 02:52:12,555 | test  | ERROR   | pool-3-thread-4 | [rest_client:print_UI_logs:2786] {u'code': 0, u'module': u'ns_vbucket_mover', u'type': u'critical', u'node': u'ns_1@172.23.120.170', u'tstamp': 1632909123311L, u'shortText': u'message', u'serverTime': u'2021-09-29T02:52:03.311Z', u'text': u'Worker <0.5714.41> (for action {move,{698,\n                                      [\'ns_1@172.23.121.127\',\n                                       \'ns_1@172.23.121.129\'],\n                                      [\'ns_1@172.23.121.129\',\n                                       \'ns_1@172.23.121.115\'],\n                                      []}}) exited with reason {unexpected_exit,\n                                                                {\'EXIT\',\n                                                                 <0.5065.41>,\n                                                                 {{{badmatch,\n                                                                    {error,\n                                                                     {setup_replications_failed,\n                                                                      [{\'ns_1@172.23.120.170\',\n                                                                        {errors,\n                                                                         [{10,\n                                                                           64}]}}]}}},\n                                                                   [{janitor_agent,\n                                                                     handle_apply_vbucket_state,\n                                                                     2,\n                                                                     [{file,\n                                                                       "src/janitor_agent.erl"},\n                                                                      {line,\n                                  
                                     1074}]},\n                                                                    {janitor_agent,\n                                                                     apply_vbucket_states_worker_loop,\n                                                                     0,\n                                                                     [{file,\n                                                                       "src/janitor_agent.erl"},\n                                                                      {line,\n                                                                       1063}]},\n                                                                    {proc_lib,\n                                                                     init_p,3,\n                                                                     [{file,\n                                                                       "proc_lib.erl"},\n                                                                      {line,\n                                                                       234}]}]},\n                                                                  {gen_server,\n                                                                   call,\n                                                                   [{\'janitor_agent-GleamBookUsers0\',\n                                                                     \'ns_1@172.23.121.127\'},\n                                                                    {if_rebalance,\n                                                                     <0.3860.41>,\n                                                                     {wait_dcp_data_move,\n                                                                      [\'ns_1@172.23.121.129\',\n                                                                       \'ns_1@172.23.121.115\'],\n                                                             
         698}},\n                                                                    infinity]}}}}'}
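
      When triaging, it can help to pull the failing node and the error tuples out of the rebalance message instead of reading the nested Erlang term by eye. The following is a minimal sketch, not part of the test framework. It assumes the input is the decoded `text` field of the UI log entry (so `\n` sequences are real newlines and quotes are unescaped), it only reads the first node entry, and it makes no claim about what the two numbers in `{10,64}` mean. The function and constant names are hypothetical.

```python
import re

# Excerpt of the ns_orchestrator message quoted above, whitespace collapsed.
LOG_TEXT = (
    "Rebalance exited with reason {mover_crashed, {unexpected_exit, "
    "{'EXIT',<0.5065.41>, {{{badmatch, {error, {setup_replications_failed, "
    "[{'ns_1@172.23.120.170', {errors,[{10,64}]}}]}}}, "
    "[{janitor_agent,handle_apply_vbucket_state,2,"
)

# Matches: setup_replications_failed, [{'<node>', {errors,[<pairs>]
FAILURE_RE = re.compile(
    r"setup_replications_failed,\s*\[\{'(?P<node>[^']+)',"
    r"\s*\{errors,\s*\[(?P<errors>[^\]]*)\]"
)
# Matches one {A,B} integer pair, tolerating line breaks after the comma.
PAIR_RE = re.compile(r"\{(\d+),\s*(\d+)\}")


def parse_setup_replications_failed(text):
    """Return (node, [(a, b), ...]) for the first failing node, or None."""
    match = FAILURE_RE.search(text)
    if match is None:
        return None
    pairs = [(int(a), int(b)) for a, b in PAIR_RE.findall(match.group("errors"))]
    return match.group("node"), pairs


if __name__ == "__main__":
    print(parse_setup_replications_failed(LOG_TEXT))
```

      Because the patterns use `\s*` between tokens, the same function should also work on the more deeply indented ns_vbucket_mover message.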
        

      Expected Result:
      Rebalance should progress and should not fail.
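
      For readers unfamiliar with the test, steps 6 and 7 above can be sketched roughly as follows. This is an illustration only, not the TAF implementation: the callables `start`, `stop`, `progress`, `kill`, and `wait_for_warmup` are hypothetical hooks that a caller would wire to the cluster, and the report does not say how the real test tracks progress across a resume or what unit the random sleep uses.

```python
import random
import time


def rebalance_with_aborts(start, stop, progress,
                          thresholds=(20, 40, 60, 80),
                          poll_interval=1.0, sleep=time.sleep):
    """Step 6 sketch: abort and resume a rebalance at each threshold.

    start()    -> begins or resumes the rebalance (hypothetical hook)
    stop()     -> aborts the running rebalance (hypothetical hook)
    progress() -> current progress in percent, 100 when finished
    Returns the list of thresholds at which an abort was issued.
    """
    pending = sorted(thresholds)
    aborted_at = []
    start()
    while True:
        pct = progress()
        if pct >= 100:
            return aborted_at
        if pending and pct >= pending[0]:
            aborted_at.append(pending.pop(0))
            stop()
            start()  # resume immediately after the abort
        else:
            sleep(poll_interval)


def crash_loop(kill, wait_for_warmup, crashes, rng=random, sleep=time.sleep):
    """Step 7 sketch: sleep a random interval, kill, then wait for warmup."""
    for _ in range(crashes):
        sleep(rng.randint(60, 120))  # same bounds as in the report
        kill()
        wait_for_warmup()
```

      Passing `sleep` and `rng` as parameters keeps the sketch testable without real delays.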

      QE Test

      git fetch "http://review.couchbase.org/TAF" refs/changes/97/162297/1 && git checkout FETCH_HEAD
      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/magma_temp_job4.ini -p bucket_storage=magma,bucket_eviction_policy=fullEviction,rerun=False,iterations=2,sdk_timeout=60,log_level=debug,infra_log_level=debug,skip_cleanup=True -t aGoodDoctor.Hospital.Murphy.SystemTestMagma,nodes_init=10,graceful=True,skip_cleanup=True,num_items=500000,num_buckets=1,bucket_names=GleamBook,doc_size=2048,key_size=18,assert_crashes_on_load=True,num_collections=20,maxttl=10,num_indexes=20,pc=10,index_nodes=0,query_nodes=0,cbas_nodes=0,fts_nodes=0,ops_rate=50000,doc_ops=create:update:delete:read,durability=Majority,crashes=10,max_commit_points=0 -m rest'
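
      The `-t` argument above packs the test name and its parameters into one comma-separated string, which is hard to scan. A small sketch like the following (the helper names are hypothetical, and this is not how TAF itself parses the string) turns it into a dictionary so individual settings can be checked against the steps.

```python
# The -t value copied from the command above.
TEST_SPEC = (
    "aGoodDoctor.Hospital.Murphy.SystemTestMagma,nodes_init=10,graceful=True,"
    "skip_cleanup=True,num_items=500000,num_buckets=1,bucket_names=GleamBook,"
    "doc_size=2048,key_size=18,assert_crashes_on_load=True,num_collections=20,"
    "maxttl=10,num_indexes=20,pc=10,index_nodes=0,query_nodes=0,cbas_nodes=0,"
    "fts_nodes=0,ops_rate=50000,doc_ops=create:update:delete:read,"
    "durability=Majority,crashes=10,max_commit_points=0"
)


def parse_params(raw):
    """Split 'k1=v1,k2=v2' into a dict; values are kept as strings."""
    params = {}
    for item in raw.split(","):
        key, sep, value = item.partition("=")
        if sep:  # ignore items without '='
            params[key.strip()] = value.strip()
    return params


def parse_test_spec(spec):
    """Return (test_name, params) from a '-t' style value."""
    name, _, rest = spec.partition(",")
    return name, parse_params(rest)


if __name__ == "__main__":
    name, params = parse_test_spec(TEST_SPEC)
    print(name)
    print(int(params["num_items"]) * int(params["num_collections"]))
```

      Here `num_items` multiplied by `num_collections` gives 10,000,000, which is consistent with the 10M items in step 3, although the report does not state that `num_items` is a per-collection count.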
      

      Daniel Owen, the plan wasn't to run this test at this stage, but I ended up running it because I had to verify another Magma bug, and that is when I encountered this one.

      Test Category: Unbounded Volume test that includes rebalance aborts and crashes: https://docs.google.com/spreadsheets/d/1AKutwtUlGX4UckfGPkJSKZu_7wfz_EwMMuoajCYUub8/edit#gid=1608573032&range=G7

            People

              ritesh.agarwal Ritesh Agarwal
              ritesh.agarwal Ritesh Agarwal