Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-41674

Rebalance in/out fails with mover_crashed

    XMLWordPrintable

Details

    • Bug
    • Resolution: Not a Bug
    • Major
    • None
    • 6.6.1
    • couchbase-bucket
    • 6.6.1-9074-enterprise
    • Untriaged
    • Centos 64-bit
    • 1
    • No

    Description

      Script to Repro

      ./testrunner -i /tmp/win10-bucket-ops.ini -p get-cbcollect-info=True,heartbeat_interval=500,timeout_interval_count=3,lease_time=5000,lease_grace_time=2000,lease_renew_after=500 -t rebalance.rebalance_high_ops_pillowfight.RebalanceHighOpsWithPillowFight.test_rebalance_in_out,nodes_in=2,nodes_out=1,replicas=2,nodes_init=3,items=2000000,batch_size=1000,rate_limit=100000,instances=2,threads=5,loader=high_ops
      

      This is the first time we running with non default orchestrator heartbeats and timeouts(see MB-41562)

      Steps to Repro
      1. Create a 3 node cluster 172.23.98.196, 172.23.98.195, 172.23.120.206
      2. Created a default bucket with 2 replicas
      3. Configure orchestrator heartbeats and timeouts( see MB-41562)

      [2020-09-24 04:41:53,823] - [rest_client:1536] INFO - /diag/eval status on 172.23.98.196:8091: True content: 'ns_config:set({mb_master, heartbeat_interval},500).' command: 'ns_config:set({mb_master, heartbeat_interval},500).'
      [2020-09-24 04:41:53,849] - [rest_client:1536] INFO - /diag/eval status on 172.23.98.196:8091: True content: 'ns_config:set({mb_master, timeout_interval_count},3).' command: 'ns_config:set({mb_master, timeout_interval_count},3).'
      [2020-09-24 04:41:53,871] - [rest_client:1536] INFO - /diag/eval status on 172.23.98.196:8091: True content: 'ns_config:set({leader_lease_acquire_worker, lease_time},5000).' command: 'ns_config:set({leader_lease_acquire_worker, lease_time},5000).'
      [2020-09-24 04:41:53,899] - [rest_client:1536] INFO - /diag/eval status on 172.23.98.196:8091: True content: 'ns_config:set({leader_lease_acquire_worker, lease_grace_time},2000).' command: 'ns_config:set({leader_lease_acquire_worker, lease_grace_time},2000).'
      [2020-09-24 04:41:53,923] - [rest_client:1536] INFO - /diag/eval status on 172.23.98.196:8091: True content: 'ns_config:set({leader_lease_acquire_worker, lease_time},500).' command: 'ns_config:set({leader_lease_acquire_worker, lease_time},500).'
      

      4. Do initial data load. Wait for it to finish
      5. Start async data load again
      6. Add 2 nodes(172.23.104.186,172.23.121.10), remove 1 node(172.23.120.206) and start rebalance

      [2020-09-24 04:43:17,882] - [rest_client:1292] INFO - adding remote node @172.23.104.186:8091 to this cluster @172.23.98.196:8091
      [2020-09-24 04:43:23,351] - [rest_client:1292] INFO - adding remote node @172.23.121.10:8091 to this cluster @172.23.98.196:8091
      [2020-09-24 04:43:30,253] - [rest_client:1518] INFO - rebalance params : {'password': 'password', 'ejectedNodes': u'ns_1@172.23.120.206', 'user': 'Administrator', 'knownNodes': u'ns_1@172.23.120.206,ns_1@172.23.98.195,ns_1@172.23.121.10,ns_1@172.23.98.196,ns_1@172.23.104.186'}
      

      Rebalance fails as shown below

      172.23.104.186

      Service 'memcached' exited with status 137. Restarting. Messages:\n2020-09-24T04:49:46.747144-07:00 WARNING 59: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.104.186->ns_1@172.23.98.195:default - (vb:634) Stream closing, sent until seqno 10382 remaining items 0, reason: The stream closed early because the conn was disconnected\n2020-09-24T04:49:46.747154-07:00 WARNING 59: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.104.186->ns_1@172.23.98.195:default - (vb:635) Stream closing, sent until seqno 10446 remaining items 0, reason: The stream closed early because the conn was disconnected\n2020-09-24T04:49:46.747167-07:00 WARNING 59: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.104.186->ns_1@172.23.98.195:default - (vb:636) Stream closing, sent until seqno 10307 remaining items 0, reason: The stream closed early because the conn was disconnected\n2020-09-24T04:49:46.747179-07:00 WARNING 59: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.104.186->ns_1@172.23.98.195:default - (vb:637) Stream closing, sent until seqno 10361 remaining items 0, reason: The stream closed early because the conn was disconnected
      

      172.23.120.206

      Service 'memcached' exited with status 137. Restarting. Messages:\n2020-09-24T04:49:46.610087-07:00 WARNING 54: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.98.196->ns_1@172.23.120.206:default - (vb:727) Setting stream to dead state, last_seqno is 10297, unAckedBytes is 0, status is The stream closed early because the conn was disconnected\n2020-09-24T04:49:46.610090-07:00 WARNING 54: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.98.196->ns_1@172.23.120.206:default - (vb:728) Setting stream to dead state, last_seqno is 10332, unAckedBytes is 0, status is The stream closed early because the conn was disconnected\n2020-09-24T04:49:46.610093-07:00 WARNING 54: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.98.196->ns_1@172.23.120.206:default - (vb:729) Setting stream to dead state, last_seqno is 10390, unAckedBytes is 0, status is The stream closed early because the conn was disconnected\n2020-09-24T04:49:46.610097-07:00 WARNING 54: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.98.196->ns_1@172.23.120.206:default - (vb:730) Setting stream to dead state, last_seqno is 10347, unAckedBytes is 0, status is The stream closed early because the conn was disconnected
      

      172.23.98.195

      Service 'memcached' exited with status 137. Restarting. Messages:\n2020-09-24T04:49:46.720257-07:00 WARNING 68: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.98.195->ns_1@172.23.121.10:default - (vb:542) Stream closing, sent until seqno 10436 remaining items 0, reason: The stream closed early because the conn was disconnected\n2020-09-24T04:49:46.720318-07:00 WARNING 68: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.98.195->ns_1@172.23.121.10:default - (vb:543) Stream closing, sent until seqno 10383 remaining items 0, reason: The stream closed early because the conn was disconnected\n2020-09-24T04:49:46.720333-07:00 WARNING 68: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.98.195->ns_1@172.23.121.10:default - (vb:544) Stream closing, sent until seqno 10430 remaining items 0, reason: The stream closed early because the conn was disconnected\n2020-09-24T04:49:46.720347-07:00 WARNING 68: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.98.195->ns_1@172.23.121.10:default - (vb:545) Stream closing, sent until seqno 10390 remaining items 0, reason: The stream closed early because the conn was disconnected
      

      172.23.98.196

      Rebalance exited with reason {mover_crashed,\n                              {unexpected_exit,\n                               {\'EXIT\',<0.5243.85>,\n                                {{{{{child_interrupted,\n                                     {\'EXIT\',<25463.23715.18>,socket_closed}},\n                                    [{dcp_replicator,spawn_and_wait,1,\n                                      [{file,"src/dcp_replicator.erl"},\n                                       {line,266}]},\n                                     {dcp_replicator,handle_call,3,\n                                      [{file,"src/dcp_replicator.erl"},\n                                       {line,121}]},\n                                     {gen_server,try_handle_call,4,\n                                      [{file,"gen_server.erl"},{line,636}]},\n                                     {gen_server,handle_msg,6,\n                                      [{file,"gen_server.erl"},{line,665}]},\n                                     {proc_lib,init_p_do_apply,3,\n                                      [{file,"proc_lib.erl"},{line,247}]}]},\n                                   {gen_server,call,\n                                    [<25463.23713.18>,get_partitions,\n                                     infinity]}},\n                                  {gen_server,call,\n                                   [\'dcp_replication_manager-default\',\n                                    {get_replicator_pid,728},\n                                    infinity]}},\n                                 {gen_server,call,\n                                  [{\'janitor_agent-default\',\n                                    \'ns_1@172.23.121.10\'},\n                                   {if_rebalance,<0.12622.81>,\n                                    {update_vbucket_state,731,active,\n                                     undefined,undefined,\n                                     [[\'ns_1@172.23.121.10\',\n                                       \'ns_1@172.23.98.195\',\n                                       \'ns_1@172.23.104.186\']]}},\n                                   infinity]}}}}}.\nRebalance Operation Id = 2636894a43ceba87e176ae402ee5953e', u'shortText': u'message', u'serverTime': u'2020-09-24T04:49:46.632Z', u'module': u'ns_orchestrator', u'tstamp': 1600948186632, u'type': u'critical'}
      [2020-09-24 04:49:54,725] - [rest_client:3387] ERROR - {u'node': u'ns_1@172.23.98.196', u'code': 0, u'text': u"Control connection to memcached on 'ns_1@172.23.98.196' disconnected. Check logs for details.", u'shortText': u'message', u'serverTime': u'2020-09-24T04:49:46.620Z', u'module': u'ns_memcached', u'tstamp': 1600948186620, u'type': u'info'}
      [2020-09-24 04:49:54,726] - [rest_client:3387] ERROR - {u'node': u'ns_1@172.23.98.196', u'code': 0, u'text': u'Worker <0.4354.85> (for action {move,{731,\n                                      [\'ns_1@172.23.98.196\',\n                                       \'ns_1@172.23.120.206\',\n                                       \'ns_1@172.23.98.195\'],\n                                      [\'ns_1@172.23.121.10\',\n                                       \'ns_1@172.23.98.195\',\n                                       \'ns_1@172.23.104.186\'],\n                                      []}}) exited with reason {unexpected_exit,\n                                                                {\'EXIT\',\n                                                                 <0.5243.85>,\n                                                                 {{{{{child_interrupted,\n                                                                      {\'EXIT\',\n                                                                       <25463.23715.18>,\n                                                                       socket_closed}},\n                                                                     [{dcp_replicator,\n                                                                       spawn_and_wait,\n                                                                       1,\n                                                                       [{file,\n                                                                         "src/dcp_replicator.erl"},\n                                                                        {line,\n                                                                         266}]},\n                                                                      {dcp_replicator,\n                                                                       handle_call,\n                                                                       3,\n                                                                       [{file,\n                                                                         "src/dcp_replicator.erl"},\n                                                                        {line,\n                                                                         121}]},\n                                                                      {gen_server,\n                                                                       try_handle_call,\n                                                                       4,\n                                                                       [{file,\n                                                                         "gen_server.erl"},\n                                                                        {line,\n                                                                         636}]},\n                                                                      {gen_server,\n                                                                       handle_msg,\n                                                                       6,\n                                                                       [{file,\n                                                                         "gen_server.erl"},\n                                                                        {line,\n                                                                         665}]},\n                                                                      {proc_lib,\n                                                                       init_p_do_apply,\n                                                                       3,\n                                                                       [{file,\n                                                                         "proc_lib.erl"},\n                                                                        {line,\n                                                                         247}]}]},\n                                                                    {gen_server,\n                                                                     call,\n                                                                     [<25463.23713.18>,\n                                                                      get_partitions,\n                                                                      infinity]}},\n                                                                   {gen_server,\n                                                                    call,\n                                                                    [\'dcp_replication_manager-default\',\n                                                                     {get_replicator_pid,\n                                                                      728},\n                                                                     infinity]}},\n                                                                  {gen_server,\n                                                                   call,\n                                                                   [{\'janitor_agent-default\',\n                                                                     \'ns_1@172.23.121.10\'},\n                                                                    {if_rebalance,\n                                                                     <0.12622.81>,\n                                                                     {update_vbucket_state,\n                                                                      731,\n                                                                      active,\n                                                                      undefined,\n                                                                      undefined,\n                                                                      [[\'ns_1@172.23.121.10\',\n                                                                        \'ns_1@172.23.98.195\',\n                                                                        \'ns_1@172.23.104.186\']]}},\n                                                                    infinity]}}}}', u'shortText': u'message', u'serverTime': u'2020-09-24T04:49:46.615Z', u'module': u'ns_vbucket_mover
      

      cbcollect_info attached.

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          People

            Balakumaran.Gopal Balakumaran Gopal
            Balakumaran.Gopal Balakumaran Gopal
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty