Couchbase Server / MB-35096

Rebalance fails with 'unexpected exit: dcp_wait_for_data_move_failed' error [2019/7/19]


Details

    • Untriaged
    • Unknown
    • KV-Engine Mad-Hatter Beta

    Description

      testrunner 'make simple-test' is failing for all rebalance tests of views.

      [2019-07-16 10:03:50,653] - [rest_client] [140160049493760] - ERROR - {u'node': u'n_0@127.0.0.1', u'code': 0, u'text': u'Rebalance exited with reason {mover_crashed,
                                    {unexpected_exit,
                                     {'EXIT',<0.18573.0>,
                                      {{dcp_wait_for_data_move_failed,"default",
                                        1023,'n_0@127.0.0.1',
                                        ['n_1@127.0.0.1'],
                                        {error,no_stats_for_this_vbucket}},
                                       [{ns_single_vbucket_mover,
                                         '-wait_dcp_data_move/5-fun-0-',5,
                                         [{file,"src/ns_single_vbucket_mover.erl"},
                                          {line,465}]},
                                        {proc_lib,init_p,3,
                                         [{file,"proc_lib.erl"},{line,232}]}]}}}}.
      Rebalance Operation Id = 27fff0232ec662a5b2f54e27ced5c9a4', u'shortText': u'message', u'serverTime': u'2019-07-16T10:03:45.468Z', u'module': u'ns_orchestrator', u'tstamp': 1563251625468, u'type': u'critical'}
      [2019-07-16 10:03:50,653] - [rest_client] [140160049493760] - ERROR - {u'node': u'n_0@127.0.0.1', u'code': 0, u'text': u'Worker <0.18477.0> (for action {move,{1023,
                ['n_0@127.0.0.1',undefined],
                ['n_1@127.0.0.1','n_0@127.0.0.1'],
                []}}) exited with reason {unexpected_exit,
                 {'EXIT',<0.18573.0>,
                  {{dcp_wait_for_data_move_failed,"default",1023,'n_0@127.0.0.1',
                    ['n_1@127.0.0.1'],
                    {error,no_stats_for_this_vbucket}},
                   [{ns_single_vbucket_mover,'-wait_dcp_data_move/5-fun-0-',5,
                     [{file,"src/ns_single_vbucket_mover.erl"},{line,465}]},
                    {proc_lib,init_p,3,
                     [{file,"proc_lib.erl"},{line,232}]}]}}}', u'shortText': u'message', u'serverTime': u'2019-07-16T10:03:45.454Z', u'module': u'ns_vbucket_mover', u'tstamp': 1563251625454, u'type': u'critical'}
      

      Tests started failing from 2019-07-15 22:32.
      During this window there were two new commits on kv_engine. After reverting these two patches, all tests passed.
      http://ci-views.northscale.in/views-15.07.2019-21.30.fail.html

      kv_engine is at a9f305b; changes since the last good build:
       a9f305bdf MB-35053: Set correct allowedDuplicatePrepareSeqnos
       4834fe37b MB-34873: disk snapshots - Ack snapEnd seqno once persisted
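The revert experiment above can be reproduced with plain git. The following is a hypothetical sketch in a throwaway repository (file names and contents are illustrative; only the commit subjects mirror the two suspect kv_engine patches):

```shell
#!/bin/sh
# Hypothetical sketch: confirm a regression by reverting the two newest
# commits and rebuilding/rerunning the tests. Uses a throwaway repo.
set -e
dir=$(mktemp -d); cd "$dir"
git init -q repo; cd repo
git config user.email dev@example.com
git config user.name dev
echo base > engine.cc
git add engine.cc && git commit -qm "last good build"
echo change1 >> engine.cc
git commit -qam "MB-34873: disk snapshots - Ack snapEnd seqno once persisted"
echo change2 >> engine.cc
git commit -qam "MB-35053: Set correct allowedDuplicatePrepareSeqnos"
# Revert the two suspect patches (newest first), then rerun the test suite.
git revert --no-edit HEAD HEAD~1
cat engine.cc   # prints: base
```

After the reverts the working tree matches the last good build, so rerunning the failing tests isolates the regression to those two patches.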
      

       In info.log, ns_server reported dcp_wait_for_data_move_failed:

      [ns_server:info,2019-07-16T09:59:18.250+05:30,n_0@127.0.0.1:janitor_agent-default<0.1170.0>:janitor_agent:handle_info:942]Rebalancer <0.3367.0> died with reason {unexpected_exit,
                                              {'EXIT',<0.3474.0>,
                                               {{dcp_wait_for_data_move_failed,
                                                 "default",1021,'n_0@127.0.0.1',
                                                 ['n_3@127.0.0.1','n_2@127.0.0.1'],
                                                 {error,no_stats_for_this_vbucket}},
                                                [{ns_single_vbucket_mover,
                                                  '-wait_dcp_data_move/5-fun-0-',5,
                                                  [{file,
                                                    "src/ns_single_vbucket_mover.erl"},
                                                   {line,465}]},
                                                 {proc_lib,init_p,3,
                                                  [{file,"proc_lib.erl"},
                                                   {line,232}]}]}}}. Undoing temporary vbucket states caused by rebalance
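The failure path can be modeled roughly as follows. This is a hypothetical Python sketch, not the actual Erlang code in src/ns_single_vbucket_mover.erl: the mover checks per-vbucket data-move stats from memcached, and when no entry exists for the vbucket it fails with no_stats_for_this_vbucket, which crashes the rebalance as seen above.

```python
# Hypothetical, simplified model of the failure in wait_dcp_data_move/5
# (the real implementation is Erlang; names here are illustrative).

class DataMoveError(Exception):
    pass

def wait_dcp_data_move(stats_by_vbucket, bucket, vbucket, src, dst_replicas):
    """Check data-move stats for a single vbucket move."""
    stats = stats_by_vbucket.get(vbucket)
    if stats is None:
        # Mirrors {error, no_stats_for_this_vbucket} in the log above.
        raise DataMoveError(
            ("dcp_wait_for_data_move_failed", bucket, vbucket, src,
             dst_replicas, ("error", "no_stats_for_this_vbucket")))
    return stats["estimate"]  # items still to be streamed

# Example: memcached produced no stats for vb 1021, so the move crashes.
try:
    wait_dcp_data_move({}, "default", 1021, "n_0@127.0.0.1",
                       ["n_3@127.0.0.1", "n_2@127.0.0.1"])
except DataMoveError as e:
    print(e.args[0][0])  # prints: dcp_wait_for_data_move_failed
```

This matches the log: memcached had already dropped the connection, so no stats were available for vb 1021 and the mover process exited.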
      

      After this, ns_server is unable to connect to the memcached port. This happened for all rebalance-related tests.

      [user:info,2019-07-16T09:59:24.455+05:30,n_0@127.0.0.1:ns_memcached-default<0.1161.0>:ns_memcached:do_terminate:726]Control connection to memcached on 'n_0@127.0.0.1' disconnected: {lost_connection,
                                                                        [{ns_memcached,
                                                                          worker_loop,
                                                                          3,
                                                                          [{file,
                                                                            "src/ns_memcached.erl"},
                                                                           {line,
                                                                            197}]},
                                                                         {proc_lib,
                                                                          init_p_do_apply,
                                                                          3,
                                                                          [{file,
                                                                            "proc_lib.erl"},
                                                                           {line,
                                                                            247}]}]}
       
      [ns_server:warn,2019-07-16T09:59:25.556+05:30,n_0@127.0.0.1:<0.4013.0>:ns_memcached:connect:1051]Unable to connect: {error,{badmatch,{error,econnrefused}}}, retrying.
      [ns_server:warn,2019-07-16T09:59:25.702+05:30,n_0@127.0.0.1:<0.3944.0>:ns_memcached:connect:1051]Unable to connect: {error,{badmatch,{error,econnrefused}}}, retrying.
      

      Around this time memcached reported an exception; this appears to be the reason the rebalance operation exited:

      3269:2019-07-16T09:59:18.212183+05:30 ERROR 45: exception occurred in runloop during packet execution. Cookie info: [{"aiostat":"success","connection":"[ 127.0.0.1:43285 - 127.0.0.1:11999 (<ud>@ns_server</ud>) ]","engine_storage":"0x00007f2bb5c96810","ewouldblock":false,"packet":{"bodylen":8,"cas":0,"datatype":"raw","extlen":8,"key":"<ud></ud>","keylen":0,"magic":"ClientRequest","opaque":1,"opcode":"DCP_SEQNO_ACKNOWLEDGED","vbucket":1021},"refcount":1}] - closing connection ([ 127.0.0.1:43285 - 127.0.0.1:11999 (<ud>@ns_server</ud>) ]): ActiveDurabilityMonitor::State::processSeqnoAck vb:1021 seqno(58) is greater than lastTrackedSeqno(0)
      3279:2019-07-16T09:59:18.238653+05:30 ERROR 52: exception occurred in runloop during packet execution. Cookie info: [{"aiostat":"success","connection":"[ 127.0.0.1:35871 - 127.0.0.1:11999 (<ud>@ns_server</ud>) ]","engine_storage":"0x00007f2bb6245810","ewouldblock":false,"packet":{"bodylen":8,"cas":0,"datatype":"raw","extlen":8,"key":"<ud></ud>","keylen":0,"magic":"ClientRequest","opaque":1,"opcode":"DCP_SEQNO_ACKNOWLEDGED","vbucket":1021},"refcount":1}] - closing connection ([ 127.0.0.1:35871 - 127.0.0.1:11999 (<ud>@ns_server</ud>) ]): ActiveDurabilityMonitor::State::processSeqnoAck vb:1021 seqno(58) is greater than lastTrackedSeqno(0)
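The invariant that fired can be illustrated with a simplified model. The real code is C++ (ActiveDurabilityMonitor::State::processSeqnoAck in kv_engine); this hypothetical Python sketch only shows the check itself: a replica must never acknowledge a seqno beyond the highest SyncWrite the active node is tracking.

```python
# Hypothetical, simplified model of the check behind
# "processSeqnoAck vb:1021 seqno(58) is greater than lastTrackedSeqno(0)".
# Not the actual KV-Engine C++ code; names mirror the log message.

class SeqnoAckError(Exception):
    pass

class ActiveDurabilityMonitor:
    def __init__(self):
        self.last_tracked_seqno = 0  # highest seqno of any tracked SyncWrite

    def track(self, seqno):
        self.last_tracked_seqno = max(self.last_tracked_seqno, seqno)

    def process_seqno_ack(self, vbid, acked_seqno):
        # A replica may not ack beyond what the active is tracking.
        if acked_seqno > self.last_tracked_seqno:
            raise SeqnoAckError(
                f"vb:{vbid} seqno({acked_seqno}) is greater than "
                f"lastTrackedSeqno({self.last_tracked_seqno})")

# In the failing run the replica acked seqno 58 while the active tracked 0:
adm = ActiveDurabilityMonitor()
try:
    adm.process_seqno_ack(1021, 58)
except SeqnoAckError as e:
    print(e)  # prints: vb:1021 seqno(58) is greater than lastTrackedSeqno(0)
```

In the logs the DCP_SEQNO_ACKNOWLEDGED for vb:1021 arrived while the active tracked nothing, the exception closed the ns_server connection, and the rebalance failed with no_stats_for_this_vbucket.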
      


      People

        Assignee: Ankit Prabhu (ankit.prabhu)
        Reporter: Ankit Prabhu (ankit.prabhu)
