Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-43345

[Upgrade]: Swap rebalance failed with reason 'mover_crashed' during dcp_takeover

    XMLWordPrintable

Details

    • Bug
    • Status: Closed
    • Critical
    • Resolution: Fixed
    • Cheshire-Cat
    • 7.0.0
    • couchbase-bucket
    • Enterprise Edition 7.0.0 build 4060
      Enterprise Edition 6.5.0 build 4960

    Description

       

      Build: 7.0.0-4060 from 6.5.0-4960

      Scenario:

      1. 4 node KV cluster (6.5.0-4960) with couchbase bucket (replica=1)

        +----------------+----------+-----------------+------------+------------+-----------------------+-----------------------+
        | Node           | Services | CPU_utilization | Mem_total  | Mem_free   | Swap_mem_used         | Version               |
        +----------------+----------+-----------------+------------+------------+-----------------------+-----------------------+
        | 172.23.105.212 | kv       | 1.25628140704   | 4201840640 | 3647959040 | 6553600 / 3758092288  | 6.5.0-4960-enterprise |
        | 172.23.105.155 | kv       | 1.25628140704   | 4201840640 | 3691266048 | 0 / 3758092288        | 6.5.0-4960-enterprise |
        | 172.23.105.213 | kv       | 1.51133501259   | 4201840640 | 3686121472 | 55312384 / 3758092288 | 6.5.0-4960-enterprise |
        | 172.23.105.211 | kv       | 0.759493670886  | 4201840640 | 3658952704 | 14680064 / 3758092288 | 6.5.0-4960-enterprise |
        +----------------+----------+-----------------+------------+------------+-----------------------+-----------------------++---------+-----------+----------+------------+-----+-------+-------------+----------+-----------+
        | Bucket  | Type      | Replicas | Durability | TTL | Items | RAM Quota   | RAM Used | Disk Used |
        +---------+-----------+----------+------------+-----+-------+-------------+----------+-----------+
        | default | couchbase | 1        | none       | 0   | 50000 | 13434355712 | 80015424 | 171941441 |
        +---------+-----------+----------+------------+-----+-------+-------------+----------+-----------+

      1. Upgrading to 7.0.0-4060 using swap rebalance with sync-writes updates in background

      Observation:

      During upgrade of node "172.23.105.212 <-> 172.23.100.163" seeing rebalance failure with following logs

      Node: 172.23.105.212 file: memcached.log.000000.txt

      2020-12-16T23:22:49.729775-08:00 ERROR 44: exception occurred in runloop during packet execution. Cookie info: [] - closing connection ([ 172.23.100.163:59737 - 172.23.105.212:11209 (<ud>@ns_server</ud>) ]): to_string(cb::mcbp::Status): Invalid status code: 11

      UI logs:

      Worker <0.7388.3> (for action {move,{601,
      ['ns_1@172.23.105.212',
      'ns_1@172.23.100.162'],
      ['ns_1@172.23.100.163',
      'ns_1@172.23.100.162'],
      []}}) exited with reason {unexpected_exit, {'EXIT', <0.8856.3>,
      {{{{{child_interrupted, {'EXIT', <28291.16197.0>, socket_closed}},
      [{dcp_replicator, spawn_and_wait, 1, [{file, "src/dcp_replicator.erl"}, {line, 265}]},
      {dcp_replicator, handle_call, 3, [{file, "src/dcp_replicator.erl"}, {line, 127}]},
      {gen_server, try_handle_call, 4, [{file, "gen_server.erl"}, {line, 661}]},
      {gen_server, handle_msg, 6, [{file, "gen_server.erl"}, {line, 690}]},
      {proc_lib, init_p_do_apply, 3, [{file, "proc_lib.erl"}, {line, 249}]}]},
      {gen_server, call, [<28291.16195.0>, get_partitions, infinity]}},
      {gen_server, call,
      ['dcp_replication_manager-default', {get_replicator_pid, 586}, infinity]}},
      {gen_server, call, [{'janitor_agent-default',
      'ns_1@172.23.100.163'}, {if_rebalance, <0.18173.2>,
      {dcp_takeover, 'ns_1@172.23.105.212', 601}}, infinity]}}}}
       
      Rebalance exited with reason {mover_crashed,
      {unexpected_exit, {'EXIT',<0.8856.3>,
      {{{{{child_interrupted,
      {'EXIT',<28291.16197.0>,socket_closed}},
      [{dcp_replicator,spawn_and_wait,1, [{file,"src/dcp_replicator.erl"}, {line,265}]},
      {dcp_replicator,handle_call,3, [{file,"src/dcp_replicator.erl"}, {line,127}]},
      {gen_server,try_handle_call,4, [{file,"gen_server.erl"},{line,661}]},
      {gen_server,handle_msg,6, [{file,"gen_server.erl"},{line,690}]},
      {proc_lib,init_p_do_apply,3, [{file,"proc_lib.erl"},{line,249}]}]},
      {gen_server,call,
      [<28291.16195.0>,get_partitions, infinity]}}, {gen_server,call,
      ['dcp_replication_manager-default', {get_replicator_pid,586}, infinity]}},
      {gen_server,call,
      [{'janitor_agent-default', 'ns_1@172.23.100.163'},
      {if_rebalance,<0.18173.2>,
      {dcp_takeover,'ns_1@172.23.105.212',601}}, infinity]}}}}}.
      Rebalance Operation Id = 38dc297bf54d83472be688f1f6539e36

       

      Attachments

        For Gerrit Dashboard: MB-43345
        # Subject Branch Project Status CR V

        Activity

          jwalker Jim Walker added a comment -

          The 6.5 node has been sent an "OpaqueNoMatch" status code and is failing, I don't think we should of sent the status?

          can you take a look

          jwalker Jim Walker added a comment - The 6.5 node has been sent an "OpaqueNoMatch" status code and is failing, I don't think we should of sent the status? can you take a look
          richard.demellow Richard deMellow added a comment - - edited

          Looking at node 172.23.105.212 we can see it connected with 172.23.100.163 as a replica. We can also see that the consumer .100.163 has tried to enable v7_dcp_status_codes but .105.212 is 6.5.0 so don't know anything about it so should have returned ENGINE_EINVAL.

          2020-12-16T23:22:49.021348-08:00 INFO 44: DCP connection opened successfully. PRODUCER, INCLUDE_XATTRS, DELETE_TIMES [ 172.23.100.163:59737 - 172.23.105.212:11209 (<ud>@ns_server</ud>) ]
          2020-12-16T23:22:49.027910-08:00 WARNING 44: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.105.212->ns_1@172.23.100.163:default - Invalid ctrl parameter 'true' for include_deleted_user_xattrs
          2020-12-16T23:22:49.029065-08:00 WARNING 44: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.105.212->ns_1@172.23.100.163:default - Invalid ctrl parameter 'true' for v7_dcp_status_codes
          

          However, some time later we see that .105.212 receives a status code of Status:: OpaqueNoMatch.

          2020-12-16T23:22:49.729775-08:00 ERROR 44: exception occurred in runloop during packet execution. Cookie info: [] - closing connection ([ 172.23.100.163:59737 - 172.23.105.212:11209 (<ud>@ns_server</ud>) ]): to_string(cb::mcbp::Status): Invalid status code: 11
          2020-12-16T23:22:49.729798-08:00 INFO 44: (No Engine) DCP (Producer) eq_dcpq:replication:ns_1@172.23.105.212->ns_1@172.23.100.163:default - Removing connection [ 172.23.100.163:59737 - 172.23.105.212:11209 (<ud>@ns_server</ud>) ]
          

          richard.demellow Richard deMellow added a comment - - edited Looking at node 172.23.105.212 we can see it connected with 172.23.100.163 as a replica. We can also see that the consumer .100.163 has tried to enable v7_dcp_status_codes but .105.212 is 6.5.0 so don't know anything about it so should have returned ENGINE_EINVAL . 2020-12-16T23:22:49.021348-08:00 INFO 44: DCP connection opened successfully. PRODUCER, INCLUDE_XATTRS, DELETE_TIMES [ 172.23.100.163:59737 - 172.23.105.212:11209 (<ud>@ns_server</ud>) ] 2020-12-16T23:22:49.027910-08:00 WARNING 44: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.105.212->ns_1@172.23.100.163:default - Invalid ctrl parameter 'true' for include_deleted_user_xattrs 2020-12-16T23:22:49.029065-08:00 WARNING 44: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.105.212->ns_1@172.23.100.163:default - Invalid ctrl parameter 'true' for v7_dcp_status_codes However, some time later we see that .105.212 receives a status code of Status:: OpaqueNoMatch . 2020-12-16T23:22:49.729775-08:00 ERROR 44: exception occurred in runloop during packet execution. Cookie info: [] - closing connection ([ 172.23.100.163:59737 - 172.23.105.212:11209 (<ud>@ns_server</ud>) ]): to_string(cb::mcbp::Status): Invalid status code: 11 2020-12-16T23:22:49.729798-08:00 INFO 44: (No Engine) DCP (Producer) eq_dcpq:replication:ns_1@172.23.105.212->ns_1@172.23.100.163:default - Removing connection [ 172.23.100.163:59737 - 172.23.105.212:11209 (<ud>@ns_server</ud>) ]

          Looking at the DCP consumer code for 7.0.0 I can see that when I wrote DcpConsumer::enableV7DcpStatus with the assumption that it's call DcpMessageProducersIface::control() that it was blocking in the sense that it would wait for the DcpProducer's response. This is not the case and insead will return ENGINE_SUCCESS when the message is sent. The fix will need to ensure we wait till the DcpProducer has replied with ENGINE_SUCCESS before enabling the v7 DCP status codes.

          richard.demellow Richard deMellow added a comment - Looking at the DCP consumer code for 7.0.0 I can see that when I wrote DcpConsumer::enableV7DcpStatus with the assumption that it's call DcpMessageProducersIface::control() that it was blocking in the sense that it would wait for the DcpProducer's response. This is not the case and insead will return ENGINE_SUCCESS when the message is sent. The fix will need to ensure we wait till the DcpProducer has replied with ENGINE_SUCCESS before enabling the v7 DCP status codes.

          Marking resolved and assigning back to Ashwin Govindarajulu, as fix is now in Chesire-cat.

          richard.demellow Richard deMellow added a comment - Marking resolved and assigning back to Ashwin Govindarajulu , as fix is now in Chesire-cat.

          Build couchbase-server-7.0.0-4170 contains kv_engine commit e56a54d with commit message:
          MB-43345: Fix DCP_CONTROL msg v7_dcp_status_codes

          build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.0-4170 contains kv_engine commit e56a54d with commit message: MB-43345 : Fix DCP_CONTROL msg v7_dcp_status_codes

          Validated the fix using build 7.0.0-4202.

          Closing the ticket.

          ashwin.govindarajulu Ashwin Govindarajulu added a comment - Validated the fix using build 7.0.0-4202. Closing the ticket.

          People

            ashwin.govindarajulu Ashwin Govindarajulu
            ashwin.govindarajulu Ashwin Govindarajulu
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty