Couchbase Server / MB-43345

[Upgrade]: Swap rebalance failed with reason 'mover_crashed' during dcp_takeover


Details

    Description

       

      Build: upgrade from 6.5.0-4960 to 7.0.0-4060

      Scenario:

      1. 4-node KV cluster (6.5.0-4960) with a couchbase bucket (replica=1)

        +----------------+----------+-----------------+------------+------------+-----------------------+-----------------------+
        | Node           | Services | CPU_utilization | Mem_total  | Mem_free   | Swap_mem_used         | Version               |
        +----------------+----------+-----------------+------------+------------+-----------------------+-----------------------+
        | 172.23.105.212 | kv       | 1.25628140704   | 4201840640 | 3647959040 | 6553600 / 3758092288  | 6.5.0-4960-enterprise |
        | 172.23.105.155 | kv       | 1.25628140704   | 4201840640 | 3691266048 | 0 / 3758092288        | 6.5.0-4960-enterprise |
        | 172.23.105.213 | kv       | 1.51133501259   | 4201840640 | 3686121472 | 55312384 / 3758092288 | 6.5.0-4960-enterprise |
        | 172.23.105.211 | kv       | 0.759493670886  | 4201840640 | 3658952704 | 14680064 / 3758092288 | 6.5.0-4960-enterprise |
        +----------------+----------+-----------------+------------+------------+-----------------------+-----------------------+

        +---------+-----------+----------+------------+-----+-------+-------------+----------+-----------+
        | Bucket  | Type      | Replicas | Durability | TTL | Items | RAM Quota   | RAM Used | Disk Used |
        +---------+-----------+----------+------------+-----+-------+-------------+----------+-----------+
        | default | couchbase | 1        | none       | 0   | 50000 | 13434355712 | 80015424 | 171941441 |
        +---------+-----------+----------+------------+-----+-------+-------------+----------+-----------+

      2. Upgrading to 7.0.0-4060 using swap rebalance, with sync-write updates running in the background

      Observation:

      During the upgrade of node 172.23.105.212 (swapped with 172.23.100.163), the rebalance fails with the following logs:

      Node: 172.23.105.212 file: memcached.log.000000.txt

      2020-12-16T23:22:49.729775-08:00 ERROR 44: exception occurred in runloop during packet execution. Cookie info: [] - closing connection ([ 172.23.100.163:59737 - 172.23.105.212:11209 (<ud>@ns_server</ud>) ]): to_string(cb::mcbp::Status): Invalid status code: 11
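For anyone triaging similar failures across several nodes, a minimal Python sketch for pulling the interesting fields out of this kind of memcached log line follows. The pattern is hypothetical and was written only against the single line quoted above; the group names (`conn`, `peer`, `local`, `status`) are my own labels, not memcached terminology, and other log lines may well need a different pattern.

```python
import re

# Log line copied from memcached.log.000000.txt in this report.
LINE = (
    "2020-12-16T23:22:49.729775-08:00 ERROR 44: exception occurred in runloop "
    "during packet execution. Cookie info: [] - closing connection "
    "([ 172.23.100.163:59737 - 172.23.105.212:11209 (<ud>@ns_server</ud>) ]): "
    "to_string(cb::mcbp::Status): Invalid status code: 11"
)

# Hypothetical pattern, derived from the one line above.
# "conn" is assumed to be the connection identifier printed after the level.
PATTERN = re.compile(
    r"^(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<conn>\d+): .*?"
    r"\(\[ (?P<peer>[\d.]+:\d+) - (?P<local>[\d.]+:\d+) \(.*?\) \]\): "
    r".*Invalid status code: (?P<status>\d+)$"
)


def parse_invalid_status(line):
    """Return a dict of the captured fields, or None if the line does not match."""
    match = PATTERN.match(line)
    return match.groupdict() if match else None


fields = parse_invalid_status(LINE)
if fields is not None:
    print(fields["timestamp"], fields["peer"], "->", fields["local"],
          "status", fields["status"])
```

All captured values are strings; convert `status` with `int()` if it needs to be compared numerically.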

      UI logs:

      Worker <0.7388.3> (for action {move,{601,
      ['ns_1@172.23.105.212',
      'ns_1@172.23.100.162'],
      ['ns_1@172.23.100.163',
      'ns_1@172.23.100.162'],
      []}}) exited with reason {unexpected_exit, {'EXIT', <0.8856.3>,
      {{{{{child_interrupted, {'EXIT', <28291.16197.0>, socket_closed}},
      [{dcp_replicator, spawn_and_wait, 1, [{file, "src/dcp_replicator.erl"}, {line, 265}]},
      {dcp_replicator, handle_call, 3, [{file, "src/dcp_replicator.erl"}, {line, 127}]},
      {gen_server, try_handle_call, 4, [{file, "gen_server.erl"}, {line, 661}]},
      {gen_server, handle_msg, 6, [{file, "gen_server.erl"}, {line, 690}]},
      {proc_lib, init_p_do_apply, 3, [{file, "proc_lib.erl"}, {line, 249}]}]},
      {gen_server, call, [<28291.16195.0>, get_partitions, infinity]}},
      {gen_server, call,
      ['dcp_replication_manager-default', {get_replicator_pid, 586}, infinity]}},
      {gen_server, call, [{'janitor_agent-default',
      'ns_1@172.23.100.163'}, {if_rebalance, <0.18173.2>,
      {dcp_takeover, 'ns_1@172.23.105.212', 601}}, infinity]}}}}
       
      Rebalance exited with reason {mover_crashed,
      {unexpected_exit, {'EXIT',<0.8856.3>,
      {{{{{child_interrupted,
      {'EXIT',<28291.16197.0>,socket_closed}},
      [{dcp_replicator,spawn_and_wait,1, [{file,"src/dcp_replicator.erl"}, {line,265}]},
      {dcp_replicator,handle_call,3, [{file,"src/dcp_replicator.erl"}, {line,127}]},
      {gen_server,try_handle_call,4, [{file,"gen_server.erl"},{line,661}]},
      {gen_server,handle_msg,6, [{file,"gen_server.erl"},{line,690}]},
      {proc_lib,init_p_do_apply,3, [{file,"proc_lib.erl"},{line,249}]}]},
      {gen_server,call,
      [<28291.16195.0>,get_partitions, infinity]}}, {gen_server,call,
      ['dcp_replication_manager-default', {get_replicator_pid,586}, infinity]}},
      {gen_server,call,
      [{'janitor_agent-default', 'ns_1@172.23.100.163'},
      {if_rebalance,<0.18173.2>,
      {dcp_takeover,'ns_1@172.23.105.212',601}}, infinity]}}}}}.
      Rebalance Operation Id = 38dc297bf54d83472be688f1f6539e36
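The exit reason above is deeply nested, so the failing takeover is easy to miss. As a rough sketch (not an Erlang term parser), the following Python snippet picks the `dcp_takeover` tuple out of the text with a regular expression. The second element is assumed to be the source node and the third the vBucket id, which matches the `{move,{601, ...}}` action in the UI log; the function name `find_takeovers` is hypothetical.

```python
import re

# Tail of the "Rebalance exited with reason" message from this report.
REASON_TAIL = (
    "[{'janitor_agent-default', 'ns_1@172.23.100.163'}, "
    "{if_rebalance,<0.18173.2>, "
    "{dcp_takeover,'ns_1@172.23.105.212',601}}, infinity]}}}}}."
)

# Matches {dcp_takeover,'<node>',<number>} with optional whitespace after commas.
TAKEOVER = re.compile(
    r"\{dcp_takeover,\s*'(?P<node>[^']+)',\s*(?P<vbucket>\d+)\}"
)


def find_takeovers(text):
    """Return (node, vbucket) pairs for every dcp_takeover tuple found in text."""
    return [(m.group("node"), int(m.group("vbucket")))
            for m in TAKEOVER.finditer(text)]


takeovers = find_takeovers(REASON_TAIL)
print(takeovers)
```

The same pattern also matches the spaced form used in the UI log (`{dcp_takeover, 'ns_1@172.23.105.212', 601}`), since whitespace after each comma is optional.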

       

          People

            ashwin.govindarajulu Ashwin Govindarajulu