Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-44417

[Collections] - Collection CRUD + Multi node graceful failover + dgm + rebalance out fails with wait_seqno_persisted_failed

    XMLWordPrintable

Details

    • Triaged
    • Centos 64-bit
    • 1
    • Yes
    • KV-Engine 2021-Feb

    Description

      Script to Repro

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/win10-bucket-ops111111111111124244224.ini rerun=False,quota_percent=95,crash_warning=True -t bucket_collections.collections_rebalance.CollectionsRebalance.test_data_load_collections_with_graceful_failover_rebalance_out,nodes_init=5,nodes_failover=2,bucket_spec=dgm.buckets_for_rebalance_tests_more_collections,data_load_spec=volume_test_load_with_CRUD_on_collections,data_load_stage=during,dgm=55,GROUP=failover_with_collection_crud_dgm'
      

      Steps to Repro
      1) Create a 5 node cluster
      2021-02-17 23:06:57,332 | test | INFO | pool-1-thread-6 | [table_view:display:72] Rebalance Overview
      ----------------------------------------------------------------------

      Nodes Services Version CPU Status

      ----------------------------------------------------------------------

      172.23.106.209 kv 7.0.0-4486-enterprise 1.26598144898 Cluster node
      172.23.106.225 None     <--- IN —
      172.23.106.232 None     <--- IN —
      172.23.106.239 None     <--- IN —
      172.23.106.246 None     <--- IN —

      ----------------------------------------------------------------------
      2) Create bucket/scopes/collections/data
      2021-02-17 23:08:19,099 | test | INFO | MainThread | [table_view:display:72] Bucket statistics
      -------------------------------------------------------------------------

      Bucket Type Replicas Durability TTL Items RAM Quota RAM Used Disk Used

      -------------------------------------------------------------------------

      default couchbase 2 none 0 500000 4194304000 505249400 535876815

      -------------------------------------------------------------------------
      3) Push the bucket to DGM

      2021-02-17 23:13:04,911 | test  | INFO    | pool-1-thread-10 | [task:_load_bucket_into_dgm:2079] Active DGM 100% Replica DGM 53.4731016949% achieved for 'default'. Loaded docs: 5480000
      

      4) Graceful failover 2 nodes

      2021-02-17 23:13:08,638 | test  | INFO    | MainThread | [collections_rebalance:rebalance_operation:473] failing over nodes [ip:172.23.106.239 port:8091 ssh_username:root, ip:172.23.106.246 port:8091 ssh_username:root]
      2021-02-17 23:18:59,082 | test  | WARNING | MainThread | [rest_client:get_nodes:1706] 172.23.106.239 - Node not part of cluster inactiveFailed
      2021-02-17 23:18:59,084 | test  | WARNING | MainThread | [rest_client:get_nodes:1706] 172.23.106.246 - Node not part of cluster inactiveFailed
      

      5) Start Rebalance.
      2021-02-17 23:19:01,479 | test | INFO | pool-1-thread-28 | [table_view:display:72] Rebalance Overview
      -----------------------------------------------------------------------

      Nodes Services Version CPU Status

      -----------------------------------------------------------------------

      172.23.106.225 kv 7.0.0-4486-enterprise 60.47399605 Cluster node
      172.23.106.246 kv 7.0.0-4486-enterprise 0.350921168066 — OUT --->
      172.23.106.239 kv 7.0.0-4486-enterprise 0.526052104208 — OUT --->
      172.23.106.232 kv 7.0.0-4486-enterprise 54.9271523179 Cluster node
      172.23.106.209 kv 7.0.0-4486-enterprise 56.7614011897 Cluster node

      -----------------------------------------------------------------------

      Rebalance fails as shown below.

      2021-02-17 23:19:38,091 | test  | INFO    | pool-1-thread-28 | [rest_client:print_UI_logs:2599] Latest logs from UI on 172.23.106.209:
      2021-02-17 23:19:38,092 | test  | ERROR   | pool-1-thread-28 | [rest_client:print_UI_logs:2601] {u'code': 0, u'module': u'ns_orchestrator', u'type': u'critical', u'node': u'ns_1@172.23.106.209', u'tstamp': 1613632774771L, u'shortText': u'message', u'serverTime': u'2021-02-17T23:19:34.771Z', u'text': u'Rebalance exited with reason {mover_crashed,\n                              {unexpected_exit,\n                               {\'EXIT\',<0.13534.5>,\n                                {{wait_seqno_persisted_failed,"default",882,\n                                  8522,\n                                  [{\'ns_1@172.23.106.232\',\n                                    {\'EXIT\',\n                                     {{{{{child_interrupted,\n                                          {\'EXIT\',<27245.4009.0>,\n                                           socket_closed}},\n                                         [{dcp_replicator,spawn_and_wait,1,\n                                           [{file,"src/dcp_replicator.erl"},\n                                            {line,265}]},\n                                          {dcp_replicator,handle_call,3,\n                                           [{file,"src/dcp_replicator.erl"},\n                                            {line,121}]},\n                                          {gen_server,try_handle_call,4,\n                                           [{file,"gen_server.erl"},\n                                            {line,661}]},\n                                          {gen_server,handle_msg,6,\n                                           [{file,"gen_server.erl"},\n                                            {line,690}]},\n                                          {proc_lib,init_p_do_apply,3,\n                                           [{file,"proc_lib.erl"},\n                                            {line,249}]}]},\n                                        {gen_server,call,\n                                         [<27245.4007.0>,get_partitions,\n                                          infinity]}},\n                                       {gen_server,call,\n                                        [\'dcp_replication_manager-default\',\n                                         {get_replicator_pid,871},\n                                         infinity]}},\n                                      {gen_server,call,\n                                       [{\'janitor_agent-default\',\n                                         \'ns_1@172.23.106.232\'},\n                                        {if_rebalance,<0.26409.4>,\n                                         {wait_seqno_persisted,882,8522}},\n                                        infinity]}}}}]},\n                                 [{ns_single_vbucket_mover,\n                                   \'-wait_seqno_persisted_many/5-fun-2-\',5,\n                                   [{file,"src/ns_single_vbucket_mover.erl"},\n                                    {line,488}]},\n                                  {proc_lib,init_p,3,\n                                   [{file,"proc_lib.erl"},{line,234}]}]}}}}.\nRebalance Operation Id = 5cb6bf3a18b1d66ecdc991d84fc0bb8c'}
      

      2021-02-17 23:19:38,092 | test  | ERROR   | pool-1-thread-28 | [rest_client:print_UI_logs:2601] {u'code': 0, u'module': u'ns_vbucket_mover', u'type': u'critical', u'node': u'ns_1@172.23.106.209', u'tstamp': 1613632774703L, u'shortText': u'message', u'serverTime': u'2021-02-17T23:19:34.703Z', u'text': u'Worker <0.13484.5> (for action {move,{882,\n                                      [\'ns_1@172.23.106.225\',\n                                       \'ns_1@172.23.106.209\',undefined],\n                                      [\'ns_1@172.23.106.209\',\n                                       \'ns_1@172.23.106.225\',\n                                       \'ns_1@172.23.106.232\'],\n                                      []}}) exited with reason {unexpected_exit,\n                                                                {\'EXIT\',\n                                                                 <0.13534.5>,\n                                                                 {{wait_seqno_persisted_failed,\n                                                                   "default",\n                                                                   882,8522,\n                                                                   [{\'ns_1@172.23.106.232\',\n                                                                     {\'EXIT\',\n                                                                      {{{{{child_interrupted,\n                                                                           {\'EXIT\',\n                                                                            <27245.4009.0>,\n                                                                            socket_closed}},\n                                                                          [{dcp_replicator,\n                                                                            spawn_and_wait,\n                                                                            1,\n                                                                            [{file,\n                                                                              "src/dcp_replicator.erl"},\n                                                                             {line,\n                                                                              265}]},\n                                                                           {dcp_replicator,\n                                                                            handle_call,\n                                                                            3,\n                                                                            [{file,\n                                                                              "src/dcp_replicator.erl"},\n                                                                             {line,\n                                                                              121}]},\n                                                                           {gen_server,\n                                                                            try_handle_call,\n                                                                            4,\n                                                                            [{file,\n                                                                              "gen_server.erl"},\n                                                                             {line,\n                                                                              661}]},\n                                                                           {gen_server,\n                                                                            handle_msg,\n                                                                            6,\n                                                                            [{file,\n                                                                              "gen_server.erl"},\n                                                                             {line,\n                                                                              690}]},\n                                                                           {proc_lib,\n                                                                            init_p_do_apply,\n                                                                            3,\n                                                                            [{file,\n                                                                              "proc_lib.erl"},\n                                                                             {line,\n                                                                              249}]}]},\n                                                                         {gen_server,\n                                                                          call,\n                                                                          [<27245.4007.0>,\n                                                                           get_partitions,\n                                                                           infinity]}},\n                                                                        {gen_server,\n                                                                         call,\n                                                                         [\'dcp_replication_manager-default\',\n                                                                          {get_replicator_pid,\n                                                                           871},\n                                                                          infinity]}},\n                                                                       {gen_server,\n                                                                        call,\n                                                                        [{\'janitor_agent-default\',\n                                                                          \'ns_1@172.23.106.232\'},\n                                                                         {if_rebalance,\n                                                                          <0.26409.4>,\n                                                                          {wait_seqno_persisted,\n                                                                           882,\n                                                                           8522}},\n                                                                         infinity]}}}}]},\n                                                                  [{ns_single_vbucket_mover,\n                                                                    \'-wait_seqno_persisted_many/5-fun-2-\',\n                                                                    5,\n                                                                    [{file,\n                                                                      "src/ns_single_vbucket_mover.erl"},\n                                                                     {line,\n                                                                      488}]},\n                                                                   {proc_lib,\n                                                                    init_p,3,\n                                                                    [{file,\n                                                                      "proc_lib.erl"},\n                                                                     {line,\n                                                                      234}]}]}}}'}
      

      Strangely enough when I login to the cluster the nodes have been rebalanced out inspite of the failure.

      We also see the following ERROR messages in the logs.
      On 172.23.106.209

      2021-02-17 23:20:03,496 | test  | CRITICAL | MainThread | [basetestcase:check_coredump_exist:791] 172.23.106.209: Found ' ERROR ' logs - ['2021-02-17T23:19:34.422577-08:00 ERROR 76: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.106.209->ns_1@172.23.106.232:default - DcpProducer::handleResponse disconnecting, received unexpected response:{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"keylen":0,"magic":"ClientResponse","opaque":355,"opcode":"DCP_SYSTEM_EVENT","status":"Invalid arguments"} for stream:stream name:eq_dcpq:replication:ns_1@172.23.106.209->ns_1@172.23.106.232:default, vb:870, state:in-memory\n']
      

      cbcollect_info attached. This test had passed on 7.0.0-4454.

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            People

              Balakumaran.Gopal Balakumaran Gopal
              Balakumaran.Gopal Balakumaran Gopal
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty