Couchbase Server / MB-55731

[CDC] - graceful failover + full recovery + cdc + collection crud fails with "Rebalance exited with reason {mover_crashed,\n{unexpected_exit,\n {\'EXIT\',<0.22657.10>,\n{{wait_seqno_persisted_failed,"

Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • 7.2.0
    • couchbase-bucket
    • 7.2.0-5214-enterprise
    • Untriaged
    • Centos 64-bit
    • 0
    • No

    Description

      Script to Repro

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/win10-bucket-ops-temp_rebalance_magma.ini rerun=False,disk_optimized_thread_settings=True,get-cbcollect-info=True,autoCompactionDefined=true,default_history_retention_for_collections=True,bucket_history_retention_seconds=600,bucket_history_retention_bytes=1000000000,magma_key_tree_data_block_size=131072,magma_seq_tree_data_block_size=131072 -t bucket_collections.collections_rebalance.CollectionsRebalance.test_data_load_collections_with_graceful_failover_recovery,nodes_init=5,nodes_failover=1,recovery_type=full,bucket_spec=magma_dgm.1_percent_dgm.5_node_3_replica_magma_768_single_bucket,doc_size=768,randomize_value=True,data_load_stage=during,skip_validations=False,data_load_spec=volume_test_load_1_percent_dgm,retry_get_process_num=300,GROUP=failover_set0'
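
The CDC and magma settings in the command above are passed as a comma-separated list of key=value pairs. As a rough sketch (not the test framework's own parser; `parse_params` is a hypothetical helper), the CDC-related subset can be split into a dictionary to make the values easier to read:

```python
def parse_params(param_string):
    """Split a comma-separated 'key=value' string into a dict of strings."""
    params = {}
    for item in param_string.split(","):
        key, sep, value = item.partition("=")
        if sep:  # skip fragments that are not key=value pairs
            params[key.strip()] = value.strip()
    return params


# CDC and magma block size parameters copied from the command above.
cdc_args = (
    "default_history_retention_for_collections=True,"
    "bucket_history_retention_seconds=600,"
    "bucket_history_retention_bytes=1000000000,"
    "magma_key_tree_data_block_size=131072,"
    "magma_seq_tree_data_block_size=131072"
)

cdc_params = parse_params(cdc_args)

# 131072 bytes is 128 KiB.
block_size_kib = int(cdc_params["magma_key_tree_data_block_size"]) // 1024
```

Values are kept as strings here; any conversion to int or bool is left to the caller.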
      

      Steps to Repro
      1. Create a 5 node cluster.

      2023-02-26 23:07:03,023 | test  | INFO    | MainThread | [table_view:display:72] Cluster statistics
      +----------------+----------+-----------------+-----------+-----------+----------------------+---------------------+-----------------------+
      | Node           | Services | CPU_utilization | Mem_total | Mem_free  | Swap_mem_used        | Active / Replica    | Version               |
      +----------------+----------+-----------------+-----------+-----------+----------------------+---------------------+-----------------------+
      | 172.23.107.217 | kv       | 88.2273112831   | 23.36 GiB | 20.59 GiB | 0.0 Byte / 3.50 GiB  | 24346814 / 73027353 | 7.2.0-5214-enterprise |
      | 172.23.107.222 | kv       | 93.85629977     | 23.36 GiB | 21.04 GiB | 0.0 Byte / 3.50 GiB  | 24308029 / 72790348 | 7.2.0-5214-enterprise |
      | 172.23.107.102 | kv       | 85.0860824522   | 23.36 GiB | 21.10 GiB | 0.0 Byte / 3.50 GiB  | 24227633 / 72919459 | 7.2.0-5214-enterprise |
      | 172.23.107.99  | kv       | 84.4463419681   | 23.36 GiB | 20.99 GiB | 56.69 MiB / 3.50 GiB | 24384602 / 73038474 | 7.2.0-5214-enterprise |
      | 172.23.107.223 | kv       | 86.5693340577   | 23.36 GiB | 21.02 GiB | 0.0 Byte / 0.0 Byte  | 24345463 / 73030557 | 7.2.0-5214-enterprise |
      +----------------+----------+-----------------+-----------+-----------+----------------------+---------------------+-----------------------+
      

      2. Create bucket/scopes/collections/data and move the bucket to 1% DGM. Also set the following params: default_history_retention_for_collections=True,bucket_history_retention_seconds=600,bucket_history_retention_bytes=1000000000,magma_key_tree_data_block_size=131072,magma_seq_tree_data_block_size=131072

      2023-02-26 23:07:18,953 | test  | INFO    | MainThread | [table_view:display:72] Bucket statistics
      +---------+-----------+-----------------+----------+------------+-----+-----------+-----------+----------+------------+---------------+
      | Bucket  | Type      | Storage Backend | Replicas | Durability | TTL | Items     | RAM Quota | RAM Used | Disk Used  | ARR           |
      +---------+-----------+-----------------+----------+------------+-----+-----------+-----------+----------+------------+---------------+
      | default | couchbase | magma           | 3        | none       | 0   | 122100000 | 3.75 GiB  | 2.86 GiB | 262.13 GiB | 1.32214905815 |
      +---------+-----------+-----------------+----------+------------+-----+-----------+-----------+----------+------------+---------------+
      

      3. Gracefully fail over a node (172.23.107.102).

      2023-02-26 23:07:21,293 | test  | INFO    | MainThread | [collections_rebalance:rebalance_operation:371] Starting rebalance operation of type : graceful_failover_recovery
      2023-02-26 23:07:21,295 | test  | INFO    | MainThread | [collections_rebalance:rebalance_operation:687] failing over nodes [ip:172.23.107.102 port:8091 ssh_username:root]
      2023-02-26 23:15:34,986 | test  | WARNING | MainThread | [rest_client:get_nodes:1782] 172.23.107.102 - Node not part of cluster inactiveFailed
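
The two log lines above bracket the graceful failover. A minimal sketch, assuming the timestamp format seen in these test logs (`log_time` is a hypothetical helper), computes the interval between the failover request and the first line that reports the node as inactiveFailed. This is only an upper bound on the failover duration, since the second line is a later status check and not a completion event:

```python
from datetime import datetime

# Timestamp layout used by the test log lines in this report.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def log_time(line):
    """Parse the leading timestamp of a test log line (text before the first '|')."""
    return datetime.strptime(line.split("|", 1)[0].strip(), LOG_TIME_FORMAT)


start_line = (
    "2023-02-26 23:07:21,295 | test  | INFO    | MainThread | "
    "[collections_rebalance:rebalance_operation:687] failing over nodes "
    "[ip:172.23.107.102 port:8091 ssh_username:root]"
)
end_line = (
    "2023-02-26 23:15:34,986 | test  | WARNING | MainThread | "
    "[rest_client:get_nodes:1782] 172.23.107.102 - Node not part of cluster inactiveFailed"
)

elapsed = log_time(end_line) - log_time(start_line)
print(f"failover request to inactiveFailed: {elapsed.total_seconds():.3f} s")
```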
      

      4. Start CRUD on data and collections.
      5. Do a full recovery of the node (172.23.107.102) and rebalance.

      2023-02-26 23:15:36,911 | test  | INFO    | pool-3-thread-7 | [table_view:display:72] Rebalance Overview
      +----------------+----------+-----------------------+----------------+--------------+-----------------------+
      | Nodes          | Services | Version               | CPU            | Status       | Membership / Recovery |
      +----------------+----------+-----------------------+----------------+--------------+-----------------------+
      | 172.23.107.217 | kv       | 7.2.0-5214-enterprise | 67.0829713     | Cluster node | active / none         |
      | 172.23.107.222 | kv       | 7.2.0-5214-enterprise | 62.4108960845  | Cluster node | active / none         |
      | 172.23.107.102 | kv       | 7.2.0-5214-enterprise | 0.379022780819 | Cluster node | inactiveAdded / full  |
      | 172.23.107.99  | kv       | 7.2.0-5214-enterprise | 85.9448718028  | Cluster node | active / none         |
      | 172.23.107.223 | kv       | 7.2.0-5214-enterprise | 76.3748129776  | Cluster node | active / none         |
      +----------------+----------+-----------------------+----------------+--------------+-----------------------+
      

      Rebalance fails as shown below (logs from 172.23.107.222).

      2023-02-26 23:15:57,102 | test  | ERROR   | pool-3-thread-7 | [rest_client:print_UI_logs:2733] {u'code': 0, u'module': u'ns_orchestrator', u'type': u'critical', u'node': u'ns_1@172.23.107.222', u'tstamp': 1677482154203L, u'shortText': u'message', u'serverTime': u'2023-02-26T23:15:54.203Z', u'text': u'Rebalance exited with reason {mover_crashed,\n                              {unexpected_exit,\n                               {\'EXIT\',<0.22657.10>,\n                                {{wait_seqno_persisted_failed,"default",152,\n                                  123405,\n                                  [{\'ns_1@172.23.107.102\',\n                                    {\'EXIT\',\n                                     {socket_closed,\n                                      {gen_server,call,\n                                       [{\'janitor_agent-default\',\n                                         \'ns_1@172.23.107.102\'},\n                                        {if_rebalance,<0.22379.10>,\n                                         {wait_seqno_persisted,152,123405}},\n                                        infinity]}}}}]},\n                                 [{ns_single_vbucket_mover,\n                                   \'-wait_seqno_persisted_many/5-fun-2-\',5,\n                                   [{file,"src/ns_single_vbucket_mover.erl"},\n                                    {line,474}]},\n                                  {proc_lib,init_p,3,\n                                   [{file,"proc_lib.erl"},{line,211}]}]}}}}.\nRebalance Operation Id = 4228e547c88a4f9f49693b16e068ae2f'}
      2023-02-26 23:15:57,102 | test  | ERROR   | pool-3-thread-7 | [rest_client:print_UI_logs:2733] {u'code': 0, u'module': u'ns_vbucket_mover', u'type': u'critical', u'node': u'ns_1@172.23.107.222', u'tstamp': 1677482154165L, u'shortText': u'message', u'serverTime': u'2023-02-26T23:15:54.165Z', u'text': u'Worker <0.22424.10> (for action {move,{152,\n                                       [\'ns_1@172.23.107.223\',\n                                        \'ns_1@172.23.107.99\',\n                                        \'ns_1@172.23.107.222\',undefined],\n                                       [\'ns_1@172.23.107.102\',\n                                        \'ns_1@172.23.107.223\',\n                                        \'ns_1@172.23.107.99\',\n                                        \'ns_1@172.23.107.222\'],\n                                       []}}) exited with reason {unexpected_exit,\n                                                                 {\'EXIT\',\n                                                                  <0.22657.10>,\n                                                                  {{wait_seqno_persisted_failed,\n                                                                    "default",\n                                                                    152,\n                                                                    123405,\n                                                                    [{\'ns_1@172.23.107.102\',\n                                                                      {\'EXIT\',\n                                                                       {socket_closed,\n                                                                        {gen_server,\n                                                                         call,\n                                                                         [{\'janitor_agent-default\',\n                                                                     
      \'ns_1@172.23.107.102\'},\n                                                                          {if_rebalance,\n                                                                           <0.22379.10>,\n                                                                           {wait_seqno_persisted,\n                                                                            152,\n                                                                            123405}},\n                                                                          infinity]}}}}]},\n                                                                   [{ns_single_vbucket_mover,\n                                                                     \'-wait_seqno_persisted_many/5-fun-2-\',\n                                                                     5,\n                                                                     [{file,\n                                                                       "src/ns_single_vbucket_mover.erl"},\n                                                                      {line,\n                                                                       474}]},\n                                                                    {proc_lib,\n                                                                     init_p,3,\n                                                                     [{file,\n                                                                       "proc_lib.erl"},\n                                                                      {line,\n                                                                       211}]}]}}}'}
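
Both messages above carry the same wait_seqno_persisted_failed term, which appears to name the bucket, the vbucket, the sequence number being waited on, and the node whose janitor_agent call failed. A minimal sketch for pulling those fields out of the text (the regular expression and `parse_seqno_failure` are assumptions for illustration, not part of ns_server or the test framework):

```python
import re

# Matches the head of the wait_seqno_persisted_failed term as printed in the logs above.
PATTERN = re.compile(
    r'wait_seqno_persisted_failed,\s*"(?P<bucket>[^"]+)",\s*'
    r"(?P<vbucket>\d+),\s*(?P<seqno>\d+),\s*"
    r"\[\{'(?P<node>[^']+)',\s*\{'EXIT',\s*\{(?P<reason>\w+)"
)


def parse_seqno_failure(text):
    """Return the fields of the first wait_seqno_persisted_failed term, or None."""
    match = PATTERN.search(text)
    if match is None:
        return None
    fields = match.groupdict()
    fields["vbucket"] = int(fields["vbucket"])
    fields["seqno"] = int(fields["seqno"])
    return fields


# Excerpt of the error text above, with the escaped newlines expanded.
sample = (
    '{{wait_seqno_persisted_failed,"default",152,\n'
    "  123405,\n"
    "  [{'ns_1@172.23.107.102',\n"
    "    {'EXIT',\n"
    "     {socket_closed,\n"
)

failure = parse_seqno_failure(sample)
```

For the excerpt above this yields bucket default, vbucket 152, seqno 123405, node ns_1@172.23.107.102 and reason socket_closed, that is, the connection to the recovering node closed while the mover was waiting for persistence.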
      

      Also saw the following ERROR message on 172.23.107.222.

      2023-02-26 23:16:01,066 | test  | INFO    | MainThread | [basetestcase:check_coredump_exist:924] 172.23.107.222: Found ' ERROR ' logs - ['2023-02-26T23:15:53.295567-08:00 ERROR 9308: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.222->ns_1@172.23.107.102:default - DcpProducer::handleResponse disconnecting, received unexpected response:{"bodylen":83,"cas":0,"datatype":["JSON"],"extlen":0,"keylen":0,"magic":"ClientResponse","opaque":1,"opcode":"DCP_MUTATION","status":"Unknown Collection"} for stream:stream name:eq_dcpq:replication:ns_1@172.23.107.222->ns_1@172.23.107.102:default, vb:101, state:in-memory\n']
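
The ERROR line above embeds the unexpected response as JSON. According to the stream name, the DCP producer on 172.23.107.222 was replicating to 172.23.107.102 when it received "Unknown Collection" in reply to a DCP_MUTATION. A minimal sketch for extracting that payload when scanning many such logs (`extract_response` and the marker string are assumptions based on this one log line):

```python
import json

# Text that immediately precedes the JSON payload in the log line above.
MARKER = "received unexpected response:"


def extract_response(log_line):
    """Decode the JSON object that follows MARKER, ignoring any trailing text."""
    start = log_line.index(MARKER) + len(MARKER)
    response, _end = json.JSONDecoder().raw_decode(log_line, start)
    return response


log_line = (
    "DcpProducer::handleResponse disconnecting, received unexpected response:"
    '{"bodylen":83,"cas":0,"datatype":["JSON"],"extlen":0,"keylen":0,'
    '"magic":"ClientResponse","opaque":1,"opcode":"DCP_MUTATION",'
    '"status":"Unknown Collection"} for stream:stream name:'
    "eq_dcpq:replication:ns_1@172.23.107.222->ns_1@172.23.107.102:default, "
    "vb:101, state:in-memory"
)

response = extract_response(log_line)
```

`raw_decode` is used because the JSON object is followed by free text, which `json.loads` would reject.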
      

      We are seeing "status":"Unknown Collection" on a lot of rebalance tests. This is the first time we are running with both CDC and the magma block size changes set together.
      cbcollect_info attached.

          People

            Balakumaran.Gopal Balakumaran Gopal