Details
-
Bug
-
Resolution: Duplicate
-
Major
-
None
-
7.2.0
-
7.2.0-5214-enterprise
-
Untriaged
-
Centos 64-bit
-
0
-
No
Description
Script to Repro
guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/win10-bucket-ops-temp_rebalance_magma.ini rerun=False,disk_optimized_thread_settings=True,get-cbcollect-info=True,autoCompactionDefined=true,default_history_retention_for_collections=True,bucket_history_retention_seconds=600,bucket_history_retention_bytes=1000000000,magma_key_tree_data_block_size=131072,magma_seq_tree_data_block_size=131072 -t bucket_collections.collections_rebalance.CollectionsRebalance.test_data_load_collections_with_graceful_failover_recovery,nodes_init=5,nodes_failover=1,recovery_type=full,bucket_spec=magma_dgm.1_percent_dgm.5_node_3_replica_magma_768_single_bucket,doc_size=768,randomize_value=True,data_load_stage=during,skip_validations=False,data_load_spec=volume_test_load_1_percent_dgm,retry_get_process_num=300,GROUP=failover_set0'
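The bucket-level settings passed in the command above are easier to sanity-check once converted to familiar units. The following is a minimal sketch, not part of the test framework: `parse_params` is a hypothetical helper, and reading the values as seconds and bytes is an assumption based on the parameter names.

```python
# Hypothetical helper (not from testrunner): split "k=v,k=v" into a dict.
def parse_params(param_string):
    params = {}
    for item in param_string.split(","):
        key, _, value = item.partition("=")
        params[key.strip()] = value.strip()
    return params


# Parameter values copied from the repro command.
BUCKET_PARAMS = (
    "default_history_retention_for_collections=True,"
    "bucket_history_retention_seconds=600,"
    "bucket_history_retention_bytes=1000000000,"
    "magma_key_tree_data_block_size=131072,"
    "magma_seq_tree_data_block_size=131072"
)

params = parse_params(BUCKET_PARAMS)

# Assumption: units follow the parameter names (seconds, bytes).
retention_minutes = int(params["bucket_history_retention_seconds"]) / 60.0
retention_gib = int(params["bucket_history_retention_bytes"]) / float(1024 ** 3)
key_block_kib = int(params["magma_key_tree_data_block_size"]) // 1024
seq_block_kib = int(params["magma_seq_tree_data_block_size"]) // 1024

print("history retention: {:.0f} min / {:.2f} GiB".format(retention_minutes, retention_gib))
print("key tree block: {} KiB, seq tree block: {} KiB".format(key_block_kib, seq_block_kib))
```

Under those assumptions, history retention is capped at 10 minutes or roughly 0.93 GiB, and both magma data block sizes are 128 KiB.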
Steps to Repro
1. Create a 5 node cluster.
2023-02-26 23:07:03,023 | test | INFO | MainThread | [table_view:display:72] Cluster statistics
+----------------+----------+-----------------+-----------+-----------+----------------------+---------------------+-----------------------+
| Node           | Services | CPU_utilization | Mem_total | Mem_free  | Swap_mem_used        | Active / Replica    | Version               |
+----------------+----------+-----------------+-----------+-----------+----------------------+---------------------+-----------------------+
| 172.23.107.217 | kv       | 88.2273112831   | 23.36 GiB | 20.59 GiB | 0.0 Byte / 3.50 GiB  | 24346814 / 73027353 | 7.2.0-5214-enterprise |
| 172.23.107.222 | kv       | 93.85629977     | 23.36 GiB | 21.04 GiB | 0.0 Byte / 3.50 GiB  | 24308029 / 72790348 | 7.2.0-5214-enterprise |
| 172.23.107.102 | kv       | 85.0860824522   | 23.36 GiB | 21.10 GiB | 0.0 Byte / 3.50 GiB  | 24227633 / 72919459 | 7.2.0-5214-enterprise |
| 172.23.107.99  | kv       | 84.4463419681   | 23.36 GiB | 20.99 GiB | 56.69 MiB / 3.50 GiB | 24384602 / 73038474 | 7.2.0-5214-enterprise |
| 172.23.107.223 | kv       | 86.5693340577   | 23.36 GiB | 21.02 GiB | 0.0 Byte / 0.0 Byte  | 24345463 / 73030557 | 7.2.0-5214-enterprise |
+----------------+----------+-----------------+-----------+-----------+----------------------+---------------------+-----------------------+
2. Create the bucket, scopes, collections, and data, and move the bucket to 1% DGM. Also set the following params: default_history_retention_for_collections=True,bucket_history_retention_seconds=600,bucket_history_retention_bytes=1000000000,magma_key_tree_data_block_size=131072,magma_seq_tree_data_block_size=131072
2023-02-26 23:07:18,953 | test | INFO | MainThread | [table_view:display:72] Bucket statistics
+---------+-----------+-----------------+----------+------------+-----+-----------+-----------+----------+------------+---------------+
| Bucket  | Type      | Storage Backend | Replicas | Durability | TTL | Items     | RAM Quota | RAM Used | Disk Used  | ARR           |
+---------+-----------+-----------------+----------+------------+-----+-----------+-----------+----------+------------+---------------+
| default | couchbase | magma           | 3        | none       | 0   | 122100000 | 3.75 GiB  | 2.86 GiB | 262.13 GiB | 1.32214905815 |
+---------+-----------+-----------------+----------+------------+-----+-----------+-----------+----------+------------+---------------+
3. Gracefully fail over a node (172.23.107.102).
2023-02-26 23:07:21,293 | test | INFO | MainThread | [collections_rebalance:rebalance_operation:371] Starting rebalance operation of type : graceful_failover_recovery
2023-02-26 23:07:21,295 | test | INFO | MainThread | [collections_rebalance:rebalance_operation:687] failing over nodes [ip:172.23.107.102 port:8091 ssh_username:root]
2023-02-26 23:15:34,986 | test | WARNING | MainThread | [rest_client:get_nodes:1782] 172.23.107.102 - Node not part of cluster inactiveFailed
4. Start CRUD on data and collections.
5. Do a full recovery of the node (172.23.107.102) and rebalance.
2023-02-26 23:15:36,911 | test | INFO | pool-3-thread-7 | [table_view:display:72] Rebalance Overview
+----------------+----------+-----------------------+----------------+--------------+-----------------------+
| Nodes          | Services | Version               | CPU            | Status       | Membership / Recovery |
+----------------+----------+-----------------------+----------------+--------------+-----------------------+
| 172.23.107.217 | kv       | 7.2.0-5214-enterprise | 67.0829713     | Cluster node | active / none         |
| 172.23.107.222 | kv       | 7.2.0-5214-enterprise | 62.4108960845  | Cluster node | active / none         |
| 172.23.107.102 | kv       | 7.2.0-5214-enterprise | 0.379022780819 | Cluster node | inactiveAdded / full  |
| 172.23.107.99  | kv       | 7.2.0-5214-enterprise | 85.9448718028  | Cluster node | active / none         |
| 172.23.107.223 | kv       | 7.2.0-5214-enterprise | 76.3748129776  | Cluster node | active / none         |
+----------------+----------+-----------------------+----------------+--------------+-----------------------+
The rebalance fails with the following errors, reported by 172.23.107.222:
2023-02-26 23:15:57,102 | test | ERROR | pool-3-thread-7 | [rest_client:print_UI_logs:2733] {u'code': 0, u'module': u'ns_orchestrator', u'type': u'critical', u'node': u'ns_1@172.23.107.222', u'tstamp': 1677482154203L, u'shortText': u'message', u'serverTime': u'2023-02-26T23:15:54.203Z', u'text': u'Rebalance exited with reason {mover_crashed,\n {unexpected_exit,\n {\'EXIT\',<0.22657.10>,\n {{wait_seqno_persisted_failed,"default",152,\n 123405,\n [{\'ns_1@172.23.107.102\',\n {\'EXIT\',\n {socket_closed,\n {gen_server,call,\n [{\'janitor_agent-default\',\n \'ns_1@172.23.107.102\'},\n {if_rebalance,<0.22379.10>,\n {wait_seqno_persisted,152,123405}},\n infinity]}}}}]},\n [{ns_single_vbucket_mover,\n \'-wait_seqno_persisted_many/5-fun-2-\',5,\n [{file,"src/ns_single_vbucket_mover.erl"},\n {line,474}]},\n {proc_lib,init_p,3,\n [{file,"proc_lib.erl"},{line,211}]}]}}}}.\nRebalance Operation Id = 4228e547c88a4f9f49693b16e068ae2f'}
2023-02-26 23:15:57,102 | test | ERROR | pool-3-thread-7 | [rest_client:print_UI_logs:2733] {u'code': 0, u'module': u'ns_vbucket_mover', u'type': u'critical', u'node': u'ns_1@172.23.107.222', u'tstamp': 1677482154165L, u'shortText': u'message', u'serverTime': u'2023-02-26T23:15:54.165Z', u'text': u'Worker <0.22424.10> (for action {move,{152,\n [\'ns_1@172.23.107.223\',\n \'ns_1@172.23.107.99\',\n \'ns_1@172.23.107.222\',undefined],\n [\'ns_1@172.23.107.102\',\n \'ns_1@172.23.107.223\',\n \'ns_1@172.23.107.99\',\n \'ns_1@172.23.107.222\'],\n []}}) exited with reason {unexpected_exit,\n {\'EXIT\',\n <0.22657.10>,\n {{wait_seqno_persisted_failed,\n "default",\n 152,\n 123405,\n [{\'ns_1@172.23.107.102\',\n {\'EXIT\',\n {socket_closed,\n {gen_server,\n call,\n [{\'janitor_agent-default\',\n \'ns_1@172.23.107.102\'},\n {if_rebalance,\n <0.22379.10>,\n {wait_seqno_persisted,\n 152,\n 123405}},\n infinity]}}}}]},\n [{ns_single_vbucket_mover,\n \'-wait_seqno_persisted_many/5-fun-2-\',\n 5,\n [{file,\n "src/ns_single_vbucket_mover.erl"},\n {line,\n 474}]},\n {proc_lib,\n init_p,3,\n [{file,\n "proc_lib.erl"},\n {line,\n 211}]}]}}}'}
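Both UI log entries above carry the same `wait_seqno_persisted_failed` tuple. When triaging many failed runs it can help to pull the bucket, vBucket, seqno, and node out of the message text. The snippet below is a rough sketch under the assumption that the tuple keeps the field order seen in this ticket; `extract_seqno_failure` is a hypothetical helper name, and the sample text is an excerpt of the first log entry with the escapes decoded.

```python
import re

# Assumed field order, as seen in this ticket:
# {wait_seqno_persisted_failed, "<bucket>", <vbucket>, <seqno>, [{'<node>', ...
FAILURE_RE = re.compile(
    r'wait_seqno_persisted_failed,\s*"(?P<bucket>[^"]+)",\s*'
    r"(?P<vbucket>\d+),\s*(?P<seqno>\d+),\s*"
    r"\[\{'(?P<node>[^']+)'"
)


def extract_seqno_failure(text):
    """Return the failure fields as a dict, or None if the text does not match."""
    match = FAILURE_RE.search(text)
    if match is None:
        return None
    return {
        "bucket": match.group("bucket"),
        "vbucket": int(match.group("vbucket")),
        "seqno": int(match.group("seqno")),
        "node": match.group("node"),
    }


# Excerpt of the 'text' field from the ns_orchestrator entry in this ticket.
SAMPLE_TEXT = (
    "Rebalance exited with reason {mover_crashed,\n"
    " {unexpected_exit,\n"
    " {'EXIT',<0.22657.10>,\n"
    ' {{wait_seqno_persisted_failed,"default",152,\n'
    " 123405,\n"
    " [{'ns_1@172.23.107.102',\n"
    " {'EXIT',\n"
    " {socket_closed,\n"
)

failure = extract_seqno_failure(SAMPLE_TEXT)
print(failure)
```

Because the pattern tolerates arbitrary whitespace between fields, it should match both the compact form in the ns_orchestrator entry and the more spread-out form in the ns_vbucket_mover entry.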
Also saw the following ERROR message on 172.23.107.222.
2023-02-26 23:16:01,066 | test | INFO | MainThread | [basetestcase:check_coredump_exist:924] 172.23.107.222: Found ' ERROR ' logs - ['2023-02-26T23:15:53.295567-08:00 ERROR 9308: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.222->ns_1@172.23.107.102:default - DcpProducer::handleResponse disconnecting, received unexpected response:{"bodylen":83,"cas":0,"datatype":["JSON"],"extlen":0,"keylen":0,"magic":"ClientResponse","opaque":1,"opcode":"DCP_MUTATION","status":"Unknown Collection"} for stream:stream name:eq_dcpq:replication:ns_1@172.23.107.222->ns_1@172.23.107.102:default, vb:101, state:in-memory\n']
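Since the same DCP error shows up across several tests, a small script can pick the embedded JSON response out of each memcached ERROR line and report the opcode, status, and vBucket. This is a minimal sketch that assumes the line layout shown above (a JSON object immediately after `received unexpected response:` and a `vb:<id>` token later in the line); `parse_dcp_error` is a hypothetical helper, not part of any Couchbase tooling.

```python
import json
import re

MARKER = "received unexpected response:"


def parse_dcp_error(line):
    """Extract opcode, status, and vBucket from a DcpProducer::handleResponse error line."""
    start = line.find(MARKER)
    if start == -1:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever text follows it on the line.
    response, _ = json.JSONDecoder().raw_decode(line, start + len(MARKER))
    vb_match = re.search(r"vb:(\d+)", line)
    return {
        "opcode": response.get("opcode"),
        "status": response.get("status"),
        "vbucket": int(vb_match.group(1)) if vb_match else None,
    }


# Log line copied from this ticket.
LOG_LINE = (
    "2023-02-26T23:15:53.295567-08:00 ERROR 9308: (default) DCP (Producer) "
    "eq_dcpq:replication:ns_1@172.23.107.222->ns_1@172.23.107.102:default - "
    "DcpProducer::handleResponse disconnecting, received unexpected response:"
    '{"bodylen":83,"cas":0,"datatype":["JSON"],"extlen":0,"keylen":0,'
    '"magic":"ClientResponse","opaque":1,"opcode":"DCP_MUTATION",'
    '"status":"Unknown Collection"} for stream:stream name:'
    "eq_dcpq:replication:ns_1@172.23.107.222->ns_1@172.23.107.102:default, "
    "vb:101, state:in-memory"
)

dcp_error = parse_dcp_error(LOG_LINE)
print(dcp_error)
```

Note that the DCP error is for vb:101, while the rebalance failure reported by ns_server is for vBucket 152.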
Seeing "status":"Unknown Collection" on a lot of rebalance tests. This is the first time we are running with both CDC and the magma block size changes set together.
cbcollect_info attached.