Couchbase Server / MB-47528

[Magma] - Multi node rebalance out fails with "{mover_crashed,{unexpected_exit,{'EXIT',<0.10706.1>,{{dcp_wait_for_data_move_failed"

Details

    • Triaged
    • Centos 64-bit
    • 1
    • No

    Description

      Script to Repro

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/win10-bucket-ops-temp_rebalance_even2-magma.ini rerun=False,get-cbcollect-info=True,quota_percent=99,crash_warning=True,retry_get_process_num=600,bucket_storage=magma,enable_dp=True -t bucket_collections.collections_drop_recreate_rebalance.CollectionsDropRecreateRebalance.test_data_load_collections_with_rebalance_out,nodes_init=5,nodes_out=2,bucket_spec=multi_bucket.buckets_1000_collections'
      

      Steps to Repro
      1. Create a 5 node cluster
      2021-07-20 22:05:38,463 | test | INFO | pool-3-thread-7 | [table_view:display:72] Rebalance Overview
      -------------------------------------------------------------------------------------
      | Nodes          | Services | Version               | CPU            | Status       |
      -------------------------------------------------------------------------------------
      | 172.23.121.135 | kv       | 7.1.0-1083-enterprise | 0.713749060856 | Cluster node |
      | 172.23.121.136 | None     |                       |                | <--- IN ---  |
      | 172.23.121.139 | None     |                       |                | <--- IN ---  |
      | 172.23.121.140 | None     |                       |                | <--- IN ---  |
      | 172.23.121.141 | None     |                       |                | <--- IN ---  |
      -------------------------------------------------------------------------------------

      2. Create bucket/scope/collections/data
      2021-07-20 22:07:15,842 | test | INFO | MainThread | [table_view:display:72] Bucket statistics
      ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      | Bucket | Type | Storage Backend | Replicas | Durability | TTL | Items | RAM Quota | RAM Used | Disk Used | ARR |
      ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      | nm3B1VRZsUKP8%uMu-txjkDFBi-NGJ4oIQx1jai4hU91hdLnYwwoOp4TKoathcSVXpoqeboBdQncke-48-878000 | couchbase | magma | 2 | none | 0 | 3000000 | 9.77 GiB | 2.78 GiB | 3.10 GiB | 100 |
      ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------

      3. Start CRUD on collections; this continues until the end of the rebalance started in the next step.

      4. Start a multi-node rebalance out (172.23.121.136 and 172.23.121.135).
      2021-07-20 22:07:16,509 | test | INFO | pool-3-thread-4 | [table_view:display:72] Rebalance Overview
      ------------------------------------------------------------------------------------
      | Nodes          | Services | Version               | CPU           | Status       |
      ------------------------------------------------------------------------------------
      | 172.23.121.140 | kv       | 7.1.0-1083-enterprise | 4.8921224285  | Cluster node |
      | 172.23.121.136 | kv       | 7.1.0-1083-enterprise | 4.82654600302 | --- OUT ---> |
      | 172.23.121.139 | kv       | 7.1.0-1083-enterprise | 4.95037064958 | Cluster node |
      | 172.23.121.135 | kv       | 7.1.0-1083-enterprise | 4.73737119879 | --- OUT ---> |
      | 172.23.121.141 | kv       | 7.1.0-1083-enterprise | 4.77506911284 | Cluster node |
      ------------------------------------------------------------------------------------
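
For anyone reproducing step 4 by hand rather than through the test framework, the rebalance out can be requested from the cluster manager REST API with POST /controller/rebalance, passing knownNodes and ejectedNodes. The following is a minimal Python 3 sketch that only builds those form parameters. The helper names are hypothetical, and it assumes every node is named ns_1@<ip>, as the node names in this ticket's logs suggest. The test itself goes through its own rest_client, not this code.

```python
from urllib.parse import urlencode


def otp_node(ip):
    """Map an IP to the node name form seen in the logs, e.g. ns_1@172.23.121.135.

    Assumption: the cluster uses the default 'ns_1@<ip>' naming.
    """
    return "ns_1@" + ip


def rebalance_out_params(cluster_ips, eject_ips):
    """Build form parameters for POST /controller/rebalance (hypothetical helper).

    knownNodes lists every node currently in the cluster; ejectedNodes lists
    the subset that should be rebalanced out.
    """
    missing = sorted(set(eject_ips) - set(cluster_ips))
    if missing:
        raise ValueError("nodes to eject are not in the cluster: %s" % missing)
    return {
        "knownNodes": ",".join(otp_node(ip) for ip in cluster_ips),
        "ejectedNodes": ",".join(otp_node(ip) for ip in eject_ips),
    }


if __name__ == "__main__":
    cluster = ["172.23.121.135", "172.23.121.136", "172.23.121.139",
               "172.23.121.140", "172.23.121.141"]
    params = rebalance_out_params(cluster, ["172.23.121.136", "172.23.121.135"])
    # Form-encoded body that would be sent to
    # http://<any-cluster-node>:8091/controller/rebalance with admin credentials.
    print(urlencode(params))
```

The sketch deliberately stops short of sending the request, so it can be checked without a live cluster.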

      This rebalance fails as shown below.

      2021-07-20 22:28:11,601 | test  | ERROR   | pool-3-thread-4 | [rest_client:_rebalance_status_and_progress:1547] {u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try again.', u'type': u'rebalance', u'masterRequestTimedOut': False, u'statusId': u'ef4929b52095d0e1b7f1ce7c96d01b84', u'statusIsStale': False, u'lastReportURI': u'/logs/rebalanceReport?reportID=a43d262e6e2c00c53e6aa53b0a22d187', u'status': u'notRunning'} - rebalance failed
      2021-07-20 22:28:11,624 | test  | INFO    | pool-3-thread-4 | [rest_client:print_UI_logs:2693] Latest logs from UI on 172.23.121.135:
      2021-07-20 22:28:11,624 | test  | ERROR   | pool-3-thread-4 | [rest_client:print_UI_logs:2695] {u'code': 0, u'module': u'ns_orchestrator', u'type': u'critical', u'node': u'ns_1@172.23.121.135', u'tstamp': 1626845288009L, u'shortText': u'message', u'serverTime': u'2021-07-20T22:28:08.009Z', u'text': u'Rebalance exited with reason {mover_crashed,\n                              {unexpected_exit,\n                               {\'EXIT\',<0.10706.1>,\n                                {{dcp_wait_for_data_move_failed,\n                                  "nm3B1VRZsUKP8%uMu-txjkDFBi-NGJ4oIQx1jai4hU91hdLnYwwoOp4TKoathcSVXpoqeboBdQncke-48-878000",\n                                  203,\'ns_1@172.23.121.135\',\n                                  [\'ns_1@172.23.121.141\',\n                                   \'ns_1@172.23.121.140\',\n                                   \'ns_1@172.23.121.139\'],\n                                  {error,no_stats_for_this_vbucket}},\n                                 [{ns_single_vbucket_mover,\n                                   \'-wait_dcp_data_move/5-fun-0-\',5,\n                                   [{file,"src/ns_single_vbucket_mover.erl"},\n                                    {line,459}]},\n                                  {proc_lib,init_p,3,\n                                   [{file,"proc_lib.erl"},{line,234}]}]}}}}.\nRebalance Operation Id = 49de3cd996b5984e6a69b46336b253fa'}
      2021-07-20 22:28:11,625 | test  | ERROR   | pool-3-thread-4 | [rest_client:print_UI_logs:2695] {u'code': 0, u'module': u'ns_vbucket_mover', u'type': u'critical', u'node': u'ns_1@172.23.121.135', u'tstamp': 1626845287930L, u'shortText': u'message', u'serverTime': u'2021-07-20T22:28:07.930Z', u'text': u'Worker <0.10565.1> (for action {move,{203,\n                                      [\'ns_1@172.23.121.135\',\n                                       \'ns_1@172.23.121.141\',\n                                       \'ns_1@172.23.121.140\'],\n                                      [\'ns_1@172.23.121.141\',\n                                       \'ns_1@172.23.121.140\',\n                                       \'ns_1@172.23.121.139\'],\n                                      []}}) exited with reason {unexpected_exit,\n                                                                {\'EXIT\',\n                                                                 <0.10706.1>,\n                                                                 {{dcp_wait_for_data_move_failed,\n                                                                   "nm3B1VRZsUKP8%uMu-txjkDFBi-NGJ4oIQx1jai4hU91hdLnYwwoOp4TKoathcSVXpoqeboBdQncke-48-878000",\n                                                                   203,\n                                                                   \'ns_1@172.23.121.135\',\n                                                                   [\'ns_1@172.23.121.141\',\n                                                                    \'ns_1@172.23.121.140\',\n                                                                    \'ns_1@172.23.121.139\'],\n                                                                   {error,\n                                                                    no_stats_for_this_vbucket}},\n                                                                  [{ns_single_vbucket_mover,\n                                          
                          \'-wait_dcp_data_move/5-fun-0-\',\n                                                                    5,\n                                                                    [{file,\n                                                                      "src/ns_single_vbucket_mover.erl"},\n                                                                     {line,\n                                                                      459}]},\n                                                                   {proc_lib,\n                                                                    init_p,3,\n                                                                    [{file,\n                                                                      "proc_lib.erl"},\n                                                                     {line,\n                                                                      234}]}]}}}'}
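
When triaging several runs, it can help to pull the key fields out of the dcp_wait_for_data_move_failed term instead of reading the nested Erlang tuple by eye. The sketch below is a Python 3 regular expression written against the shape of the message above (bucket, vBucket id, source node, target chain, error reason). The function name and the shortened bucket name in the sample are placeholders, and the pattern is an assumption based on this one log entry, not a general parser for ns_server error terms.

```python
import re

# Matches: {dcp_wait_for_data_move_failed, "<bucket>", <vb>, '<src>', [<targets>], {error,<reason>}}
FAILURE_RE = re.compile(
    r"dcp_wait_for_data_move_failed,\s*"
    r"\"(?P<bucket>[^\"]+)\",\s*"
    r"(?P<vbucket>\d+),\s*"
    r"'(?P<source>[^']+)',\s*"
    r"\[(?P<targets>[^\]]*)\],\s*"
    r"\{error,\s*(?P<reason>\w+)\}"
)


def parse_move_failure(text):
    """Return the failure fields as a dict, or None if the text does not match."""
    match = FAILURE_RE.search(text)
    if match is None:
        return None
    return {
        "bucket": match.group("bucket"),
        "vbucket": int(match.group("vbucket")),
        "source": match.group("source"),
        "targets": re.findall(r"'([^']+)'", match.group("targets")),
        "reason": match.group("reason"),
    }


# Sample shaped like the decoded 'text' field above; the bucket name is a
# shortened placeholder, not the real one from this run.
SAMPLE = (
    "{{dcp_wait_for_data_move_failed,\n"
    '  "example-bucket",\n'
    "  203,'ns_1@172.23.121.135',\n"
    "  ['ns_1@172.23.121.141',\n"
    "   'ns_1@172.23.121.140',\n"
    "   'ns_1@172.23.121.139'],\n"
    "  {error,no_stats_for_this_vbucket}},"
)

if __name__ == "__main__":
    print(parse_move_failure(SAMPLE))
```

Applied to the entry in this ticket, the fields of interest are vBucket 203, source node ns_1@172.23.121.135 (one of the nodes being rebalanced out), and the reason no_stats_for_this_vbucket.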
      

      Note that the same test hit a similar failure in MB-47390, which was marked as a duplicate of MB-42652. MB-42652 has already been fixed in the build this test was run on.

      cbcollect_info attached.

          People

            Balakumaran.Gopal Balakumaran Gopal