Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-47390

[Magma] - Multi node rebalance out fails with "Rebalance exited with reason {mover_crashed,{unexpected_exit, {\'EXIT\',<0.3710.1>,{{dcp_wait_for_data_move_failed"

    XMLWordPrintable

Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • None
    • 7.1.0
    • couchbase-bucket
    • 7.1.0-1068
    • Untriaged
    • Centos 64-bit
    • 1
    • No

    Description

      Script to Repro

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/win10-bucket-ops-temp_rebalance_even2-magma.ini rerun=False,get-cbcollect-info=False,quota_percent=99,crash_warning=True,retry_get_process_num=200,bucket_storage=magma,enable_dp=True,log_level=debug,infra_log_level=debug -t bucket_collections.collections_rebalance.CollectionsRebalance.test_data_load_collections_with_rebalance_out,nodes_init=5,nodes_out=2,bucket_spec=multi_bucket.buckets_for_rebalance_tests_more_collections,data_load_spec=volume_test_load_with_CRUD_on_collections,data_load_stage=before,scrape_interval=5,rebalance_moves_per_node=32,quota_percent=80,skip_validations=False,GROUP=rebalance_with_collection_crud'
      

      Steps to Repro
      1. Create a 5 node cluster
      2021-07-14 02:19:01,244 | test | INFO | pool-5-thread-7 | [table_view:display:72] Rebalance Overview
      ----------------------------------------------------------------------

      Nodes Services Version CPU Status

      ----------------------------------------------------------------------

      172.23.106.156 kv 7.1.0-1068-enterprise 1.16541353383 Cluster node
      172.23.106.159 None     <--- IN —
      172.23.106.163 None     <--- IN —
      172.23.106.164 None     <--- IN —
      172.23.106.165 None     <--- IN —

      ----------------------------------------------------------------------

      2. Create Bucket/scopes/collections/data
      2021-07-14 02:20:39,635 | test | INFO | MainThread | [table_view:display:72] Bucket statistics
      ----------------------------------------------------------------------------------------------------------------------------------------------

      Bucket Type Storage Backend Replicas Durability TTL Items RAM Quota RAM Used Disk Used ARR

      ----------------------------------------------------------------------------------------------------------------------------------------------

      c%MNWwHodAxDAKixAjh6Tx1c0Y6I9nPBOnCx%8la44VtIgp4m-11-583000 couchbase magma 2 none 0 3000000 9.77 GiB 2.73 GiB 3.04 GiB 100

      ----------------------------------------------------------------------------------------------------------------------------------------------

      3. Rebalance out 2 node

      2021-07-14 02:20:47,256 | test  | DEBUG   | MainProcess | pool-5-thread-13 | [rest_client:rebalance:1338] Rebalance params: {'password': 'password', 'knownNodes': u'ns_1@172.23.106.165,ns_1@172.23.106.163,ns_1@172.23.106.156,ns_1@172.23.106.164,ns_1@172.23.106.159', 'ejectedNodes': u'ns_1@172.23.106.164,ns_1@172.23.106.165', 'user': 'Administrator'}
      

      2021-07-14 02:20:47,322 | test | INFO | MainProcess | pool-5-thread-13 | [table_view:display:72] Rebalance Overview
      ----------------------------------------------------------------------

      Nodes Services Version CPU Status

      ----------------------------------------------------------------------

      172.23.106.165 kv 7.1.0-1068-enterprise 3.87323943662 — OUT --->
      172.23.106.163 kv 7.1.0-1068-enterprise 3.72834546824 Cluster node
      172.23.106.156 kv 7.1.0-1068-enterprise 4.52374968585 Cluster node
      172.23.106.164 kv 7.1.0-1068-enterprise 3.46081504702 — OUT --->
      172.23.106.159 kv 7.1.0-1068-enterprise 3.43789209536 Cluster node

      ----------------------------------------------------------------------

      Rebalance fails as shown below

      2021-07-14 02:23:18,486 | test  | ERROR   | MainProcess | pool-5-thread-13 | [rest_client:_rebalance_status_and_progress:1547] {u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try again.', u'type': u'rebalance', u'masterRequestTimedOut': False, u'statusId': u'cdc7d7ba66d97d2103c7ca44798a42bc', u'statusIsStale': False, u'lastReportURI': u'/logs/rebalanceReport?reportID=a7601ad898297ab91ce100a4e6ef3876', u'status': u'notRunning'} - rebalance failed
      2021-07-14 02:23:18,529 | test  | INFO    | MainProcess | pool-5-thread-13 | [rest_client:print_UI_logs:2693] Latest logs from UI on 172.23.106.156:
      2021-07-14 02:23:18,529 | test  | ERROR   | MainProcess | pool-5-thread-13 | [rest_client:print_UI_logs:2695] {u'code': 0, u'module': u'ns_orchestrator', u'type': u'critical', u'node': u'ns_1@172.23.106.156', u'tstamp': 1626254597921L, u'shortText': u'message', u'serverTime': u'2021-07-14T02:23:17.921Z', u'text': u'Rebalance exited with reason {mover_crashed,\n                              {unexpected_exit,\n                               {\'EXIT\',<0.3710.1>,\n                                {{dcp_wait_for_data_move_failed,\n                                  "c%MNWwHodAxDAKixAjh6Tx1c0Y6I9nPBOnCx%8la44VtIgp4m-11-583000",\n                                  909,\'ns_1@172.23.106.165\',\n                                  [\'ns_1@172.23.106.159\',\n                                   \'ns_1@172.23.106.163\',\n                                   \'ns_1@172.23.106.156\'],\n                                  {error,no_stats_for_this_vbucket}},\n                                 [{ns_single_vbucket_mover,\n                                   \'-wait_dcp_data_move/5-fun-0-\',5,\n                                   [{file,"src/ns_single_vbucket_mover.erl"},\n                                    {line,459}]},\n                                  {proc_lib,init_p,3,\n                                   [{file,"proc_lib.erl"},{line,234}]}]}}}}.\nRebalance Operation Id = a26bf938559f1ab59e8016f8e6a5e2cc'}
      2021-07-14 02:23:18,529 | test  | ERROR   | MainProcess | pool-5-thread-13 | [rest_client:print_UI_logs:2695] {u'code': 0, u'module': u'ns_vbucket_mover', u'type': u'critical', u'node': u'ns_1@172.23.106.156', u'tstamp': 1626254597884L, u'shortText': u'message', u'serverTime': u'2021-07-14T02:23:17.884Z', u'text': u'Worker <0.3015.1> (for action {move,{909,\n                                     [\'ns_1@172.23.106.165\',\n                                      \'ns_1@172.23.106.159\',\n                                      \'ns_1@172.23.106.163\'],\n                                     [\'ns_1@172.23.106.159\',\n                                      \'ns_1@172.23.106.163\',\n                                      \'ns_1@172.23.106.156\'],\n                                     []}}) exited with reason {unexpected_exit,\n                                                               {\'EXIT\',\n                                                                <0.3710.1>,\n                                                                {{dcp_wait_for_data_move_failed,\n                                                                  "c%MNWwHodAxDAKixAjh6Tx1c0Y6I9nPBOnCx%8la44VtIgp4m-11-583000",\n                                                                  909,\n                                                                  \'ns_1@172.23.106.165\',\n                                                                  [\'ns_1@172.23.106.159\',\n                                                                   \'ns_1@172.23.106.163\',\n                                                                   \'ns_1@172.23.106.156\'],\n                                                                  {error,\n                                                                   no_stats_for_this_vbucket}},\n                                                                 [{ns_single_vbucket_mover,\n                                                                   \'-wait_dcp_data_move/5-fun-0-\',\n                                                                   5,\n                                                                   [{file,\n                                                                     "src/ns_single_vbucket_mover.erl"},\n                                                                    {line,\n                                                                     459}]},\n                                                                  {proc_lib,\n                                                                   init_p,3,\n                                                                   [{file,\n                                                                     "proc_lib.erl"},\n                                                                    {line,\n                                                                     234}]}]}}}'}
       
      2021-07-14 02:23:18,529 | test  | ERROR   | MainProcess | pool-5-thread-13 | [rest_client:print_UI_logs:2695] {u'code': 0, u'module': u'ns_vbucket_mover', u'type': u'info', u'node': u'ns_1@172.23.106.156', u'tstamp': 1626254448139L, u'shortText': u'message', u'serverTime': u'2021-07-14T02:20:48.139Z', u'text': u'Bucket "c%MNWwHodAxDAKixAjh6Tx1c0Y6I9nPBOnCx%8la44VtIgp4m-11-583000" rebalance does not seem to be swap rebalance'}
      2021-07-14 02:23:18,530 | test  | ERROR   | MainProcess | pool-5-thread-13 | [rest_client:print_UI_logs:2695] {u'code': 0, u'module': u'ns_rebalancer', u'type': u'info', u'node': u'ns_1@172.23.106.156', u'tstamp': 1626254447682L, u'shortText': u'message', u'serverTime': u'2021-07-14T02:20:47.682Z', u'text': u'Started rebalancing bucket c%MNWwHodAxDAKixAjh6Tx1c0Y6I9nPBOnCx%8la44VtIgp4m-11-583000'}
       
      2021-07-14 02:23:18,530 | test  | ERROR   | MainProcess | pool-5-thread-13 | [rest_client:print_UI_logs:2695] {u'code': 0, u'module': u'ns_orchestrator', u'type': u'info', u'node': u'ns_1@172.23.106.156', u'tstamp': 1626254447262L, u'shortText': u'message', u'serverTime': u'2021-07-14T02:20:47.262Z', u'text': u"Starting rebalance, KeepNodes = ['ns_1@172.23.106.163','ns_1@172.23.106.156',\n                                 'ns_1@172.23.106.159'], EjectNodes = ['ns_1@172.23.106.164',\n                                                                       'ns_1@172.23.106.165'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = a26bf938559f1ab59e8016f8e6a5e2cc"}
      

      ERROR's seen on 172.23.106.165

      2021-07-14 02:23:40,032 | infra | DEBUG   | MainProcess | MainThread | [remote_util:execute_command_raw_jsch:3285] Running command on 172.23.106.165: grep -r ' ERROR ' /opt/couchbase/var/lib/couchbase/logs/memcached.log.000264.txt | grep -v 'XERROR' | grep -v 'compaction failed for vb' | grep -v 'failed - Already exists'
      2021-07-14 02:23:40,073 | test  | CRITICAL | MainProcess | MainThread | [basetestcase:check_coredump_exist:846] 172.23.106.165: Found ' ERROR ' logs - ['2021-07-14T02:23:17.786518-07:00 ERROR 844: (c%MNWwHodAxDAKixAjh6Tx1c0Y6I9nPBOnCx%8la44VtIgp4m-11-583000) DCP (Producer) eq_dcpq:replication:ns_1@172.23.106.165->ns_1@172.23.106.156:c%MNWwHodAxDAKixAjh6Tx1c0Y6I9nPBOnCx%8la44VtIgp4m-11-583000 - DcpProducer::handleResponse disconnecting, received unexpected response:{"bodylen":84,"cas":0,"datatype":["JSON"],"extlen":0,"keylen":0,"magic":"ClientResponse","opaque":164,"opcode":"DCP_MUTATION","status":"Unknown Collection"} for stream:stream name:eq_dcpq:replication:ns_1@172.23.106.165->ns_1@172.23.106.156:c%MNWwHodAxDAKixAjh6Tx1c0Y6I9nPBOnCx%8la44VtIgp4m-11-583000, vb:909, state:backfilling\n', '2021-07-14T02:23:17.786558-07:00 ERROR 844: (c%MNWwHodAxDAKixAjh6Tx1c0Y6I9nPBOnCx%8la44VtIgp4m-11-583000) DCP (Producer) eq_dcpq:replication:ns_1@172.23.106.165->ns_1@172.23.106.156:c%MNWwHodAxDAKixAjh6Tx1c0Y6I9nPBOnCx%8la44VtIgp4m-11-583000 - DcpProducer::handleResponse disconnecting, received unexpected response:{"bodylen":84,"cas":0,"datatype":["JSON"],"extlen":0,"keylen":0,"magic":"ClientResponse","opaque":164,"opcode":"DCP_MUTATION","status":"Unknown Collection"} for stream:stream name:eq_dcpq:replication:ns_1@172.23.106.165->ns_1@172.23.106.156:c%MNWwHodAxDAKixAjh6Tx1c0Y6I9nPBOnCx%8la44VtIgp4m-11-583000, vb:909, state:backfilling\n', '2021-07-14T02:23:17.786589-07:00 ERROR 844: (c%MNWwHodAxDAKixAjh6Tx1c0Y6I9nPBOnCx%8la44VtIgp4m-11-583000) DCP (Producer) eq_dcpq:replication:ns_1@172.23.106.165->ns_1@172.23.106.156:c%MNWwHodAxDAKixAjh6Tx1c0Y6I9nPBOnCx%8la44VtIgp4m-11-583000 - DcpProducer::handleResponse disconnecting, received unexpected response:{"bodylen":84,"cas":0,"datatype":["JSON"],"extlen":0,"keylen":0,"magic":"ClientResponse","opaque":164,"opcode":"DCP_MUTATION","status":"Unknown Collection"} for stream:stream name:eq_dcpq:replication:ns_1@172.23.106.165->ns_1@172.23.106.156:c%MNWwHodAxDAKixAjh6Tx1c0Y6I9nPBOnCx%8la44VtIgp4m-11-583000, vb:909, state:backfilling\n', '2021-07-14T02:23:17.786610-07:00 ERROR 844: (c%MNWwHodAxDAKixAjh6Tx1c0Y6I9nPBOnCx%8la44VtIgp4m-11-583000) DCP (Producer) eq_dcpq:replication:ns_1@172.23.106.165->ns_1@172.23.106.156:c%MNWwHodAxDAKixAjh6Tx1c0Y6I9nPBOnCx%8la44VtIgp4m-11-583000 - DcpProducer::handleResponse disconnecting, received unexpected response:{"bodylen":84,"cas":0,"datatype":["JSON"],"extlen":0,"keylen":0,"magic":"ClientResponse","opaque":164,"opcode":"DCP_MUTATION","status":"Unknown Collection"} for stream:stream name:eq_dcpq:replication:ns_1@172.23.106.165->ns_1@172.23.106.156:c%MNWwHodAxDAKixAjh6Tx1c0Y6I9nPBOnCx%8la44VtIgp4m-11-583000, vb:909, state:backfilling\n', '2021-07-14T02:23:17.786630-07:00 ERROR 844: (c%MNWwHodAxDAKixAjh6Tx1c0Y6I9nPBOnCx%8la44VtIgp4m-11-583000) DCP (Producer) eq_dcpq:replication:ns_1@172.23.106.165->ns_1@172.23.106.156:c%MNWwHodAxDAKixAjh6Tx1c0Y6I9nPBOnCx%8la44VtIgp4m-11-583000 - DcpProducer::handleResponse disconnecting, received unexpected response:{"bodylen":84,"cas":0,"datatype":["JSON"],"extlen":0,"keylen":0,"magic":"ClientResponse","opaque":164,"opcode":"DCP_MUTATION","status":"Unknown Collection"} for stream:stream name:eq_dcpq:replication:ns_1@172.23.106.165->ns_1@172.23.106.156:c%MNWwHodAxDAKixAjh6Tx1c0Y6I9nPBOnCx%8la44VtIgp4m-11-583000, vb:909, state:backfilling\n', '2021-07-14T02:23:17.786649-07:00 ERROR 844: (c%MNWwHodAxDAKixAjh6Tx1c0Y6I9nPBOnCx%8la44VtIgp4m-11-583000) DCP (Producer) eq_dcpq:replication:ns_1@172.23.106.165->ns_1@172.23.106.156:c%MNWwHodAxDAKixAjh6Tx1c0Y6I9nPBOnCx%8la44VtIgp4m-11-583000 - DcpProducer::handleResponse disconnecting, received unexpected response:{"bodylen":84,"cas":0,"datatype":["JSON"],"extlen":0,"keylen":0,"magic":"ClientResponse","opaque":164,"opcode":"DCP_MUTATION","status":"Unknown Collection"} for stream:stream name:eq_dcpq:replication:ns_1@172.23.106.165->ns_1@172.23.106.156:c%MNWwHodAxDAKixAjh6Tx1c0Y6I9nPBOnCx%8la44VtIgp4m-11-583000, vb:909, state:backfilling\n']
      

      cbcollect_info attached.

      Attachments

        1. test.log
          3.29 MB
          Balakumaran Gopal

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            People

              Balakumaran.Gopal Balakumaran Gopal
              Balakumaran.Gopal Balakumaran Gopal
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty