Details
-
Bug
-
Resolution: Fixed
-
Critical
-
Cheshire-Cat
-
7.0.0-4486-enterprise
-
Triaged
-
Centos 64-bit
-
1
-
Yes
-
KV-Engine 2021-Feb
Description
Script to Repro
guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/win10-bucket-ops111111111111124244224.ini rerun=False,quota_percent=95,crash_warning=True -t bucket_collections.collections_rebalance.CollectionsRebalance.test_data_load_collections_with_graceful_failover_rebalance_out,nodes_init=5,nodes_failover=2,bucket_spec=dgm.buckets_for_rebalance_tests_more_collections,data_load_spec=volume_test_load_with_CRUD_on_collections,data_load_stage=during,dgm=55,GROUP=failover_with_collection_crud_dgm'
|
Steps to Repro
1) Create a 5 node cluster
2021-02-17 23:06:57,332 | test | INFO | pool-1-thread-6 | [table_view:display:72] Rebalance Overview
----------------------------------------------------------------------
Nodes | Services | Version | CPU | Status |
----------------------------------------------------------------------
172.23.106.209 | kv | 7.0.0-4486-enterprise | 1.26598144898 | Cluster node |
172.23.106.225 | None | <--- IN — | ||
172.23.106.232 | None | <--- IN — | ||
172.23.106.239 | None | <--- IN — | ||
172.23.106.246 | None | <--- IN — |
----------------------------------------------------------------------
2) Create bucket/scopes/collections/data
2021-02-17 23:08:19,099 | test | INFO | MainThread | [table_view:display:72] Bucket statistics
-------------------------------------------------------------------------
Bucket | Type | Replicas | Durability | TTL | Items | RAM Quota | RAM Used | Disk Used |
-------------------------------------------------------------------------
default | couchbase | 2 | none | 0 | 500000 | 4194304000 | 505249400 | 535876815 |
-------------------------------------------------------------------------
3) Push the bucket to DGM
2021-02-17 23:13:04,911 | test | INFO | pool-1-thread-10 | [task:_load_bucket_into_dgm:2079] Active DGM 100% Replica DGM 53.4731016949% achieved for 'default'. Loaded docs: 5480000
|
4) Graceful failover 2 nodes
2021-02-17 23:13:08,638 | test | INFO | MainThread | [collections_rebalance:rebalance_operation:473] failing over nodes [ip:172.23.106.239 port:8091 ssh_username:root, ip:172.23.106.246 port:8091 ssh_username:root]
|
2021-02-17 23:18:59,082 | test | WARNING | MainThread | [rest_client:get_nodes:1706] 172.23.106.239 - Node not part of cluster inactiveFailed
|
2021-02-17 23:18:59,084 | test | WARNING | MainThread | [rest_client:get_nodes:1706] 172.23.106.246 - Node not part of cluster inactiveFailed
|
5) Start Rebalance.
2021-02-17 23:19:01,479 | test | INFO | pool-1-thread-28 | [table_view:display:72] Rebalance Overview
-----------------------------------------------------------------------
Nodes | Services | Version | CPU | Status |
-----------------------------------------------------------------------
172.23.106.225 | kv | 7.0.0-4486-enterprise | 60.47399605 | Cluster node |
172.23.106.246 | kv | 7.0.0-4486-enterprise | 0.350921168066 | — OUT ---> |
172.23.106.239 | kv | 7.0.0-4486-enterprise | 0.526052104208 | — OUT ---> |
172.23.106.232 | kv | 7.0.0-4486-enterprise | 54.9271523179 | Cluster node |
172.23.106.209 | kv | 7.0.0-4486-enterprise | 56.7614011897 | Cluster node |
-----------------------------------------------------------------------
Rebalance fails as shown below.
2021-02-17 23:19:38,091 | test | INFO | pool-1-thread-28 | [rest_client:print_UI_logs:2599] Latest logs from UI on 172.23.106.209:
|
2021-02-17 23:19:38,092 | test | ERROR | pool-1-thread-28 | [rest_client:print_UI_logs:2601] {u'code': 0, u'module': u'ns_orchestrator', u'type': u'critical', u'node': u'ns_1@172.23.106.209', u'tstamp': 1613632774771L, u'shortText': u'message', u'serverTime': u'2021-02-17T23:19:34.771Z', u'text': u'Rebalance exited with reason {mover_crashed,\n {unexpected_exit,\n {\'EXIT\',<0.13534.5>,\n {{wait_seqno_persisted_failed,"default",882,\n 8522,\n [{\'ns_1@172.23.106.232\',\n {\'EXIT\',\n {{{{{child_interrupted,\n {\'EXIT\',<27245.4009.0>,\n socket_closed}},\n [{dcp_replicator,spawn_and_wait,1,\n [{file,"src/dcp_replicator.erl"},\n {line,265}]},\n {dcp_replicator,handle_call,3,\n [{file,"src/dcp_replicator.erl"},\n {line,121}]},\n {gen_server,try_handle_call,4,\n [{file,"gen_server.erl"},\n {line,661}]},\n {gen_server,handle_msg,6,\n [{file,"gen_server.erl"},\n {line,690}]},\n {proc_lib,init_p_do_apply,3,\n [{file,"proc_lib.erl"},\n {line,249}]}]},\n {gen_server,call,\n [<27245.4007.0>,get_partitions,\n infinity]}},\n {gen_server,call,\n [\'dcp_replication_manager-default\',\n {get_replicator_pid,871},\n infinity]}},\n {gen_server,call,\n [{\'janitor_agent-default\',\n \'ns_1@172.23.106.232\'},\n {if_rebalance,<0.26409.4>,\n {wait_seqno_persisted,882,8522}},\n infinity]}}}}]},\n [{ns_single_vbucket_mover,\n \'-wait_seqno_persisted_many/5-fun-2-\',5,\n [{file,"src/ns_single_vbucket_mover.erl"},\n {line,488}]},\n {proc_lib,init_p,3,\n [{file,"proc_lib.erl"},{line,234}]}]}}}}.\nRebalance Operation Id = 5cb6bf3a18b1d66ecdc991d84fc0bb8c'}
|
2021-02-17 23:19:38,092 | test | ERROR | pool-1-thread-28 | [rest_client:print_UI_logs:2601] {u'code': 0, u'module': u'ns_vbucket_mover', u'type': u'critical', u'node': u'ns_1@172.23.106.209', u'tstamp': 1613632774703L, u'shortText': u'message', u'serverTime': u'2021-02-17T23:19:34.703Z', u'text': u'Worker <0.13484.5> (for action {move,{882,\n [\'ns_1@172.23.106.225\',\n \'ns_1@172.23.106.209\',undefined],\n [\'ns_1@172.23.106.209\',\n \'ns_1@172.23.106.225\',\n \'ns_1@172.23.106.232\'],\n []}}) exited with reason {unexpected_exit,\n {\'EXIT\',\n <0.13534.5>,\n {{wait_seqno_persisted_failed,\n "default",\n 882,8522,\n [{\'ns_1@172.23.106.232\',\n {\'EXIT\',\n {{{{{child_interrupted,\n {\'EXIT\',\n <27245.4009.0>,\n socket_closed}},\n [{dcp_replicator,\n spawn_and_wait,\n 1,\n [{file,\n "src/dcp_replicator.erl"},\n {line,\n 265}]},\n {dcp_replicator,\n handle_call,\n 3,\n [{file,\n "src/dcp_replicator.erl"},\n {line,\n 121}]},\n {gen_server,\n try_handle_call,\n 4,\n [{file,\n "gen_server.erl"},\n {line,\n 661}]},\n {gen_server,\n handle_msg,\n 6,\n [{file,\n "gen_server.erl"},\n {line,\n 690}]},\n {proc_lib,\n init_p_do_apply,\n 3,\n [{file,\n "proc_lib.erl"},\n {line,\n 249}]}]},\n {gen_server,\n call,\n [<27245.4007.0>,\n get_partitions,\n infinity]}},\n {gen_server,\n call,\n [\'dcp_replication_manager-default\',\n {get_replicator_pid,\n 871},\n infinity]}},\n {gen_server,\n call,\n [{\'janitor_agent-default\',\n \'ns_1@172.23.106.232\'},\n {if_rebalance,\n <0.26409.4>,\n {wait_seqno_persisted,\n 882,\n 8522}},\n infinity]}}}}]},\n [{ns_single_vbucket_mover,\n \'-wait_seqno_persisted_many/5-fun-2-\',\n 5,\n [{file,\n "src/ns_single_vbucket_mover.erl"},\n {line,\n 488}]},\n {proc_lib,\n init_p,3,\n [{file,\n "proc_lib.erl"},\n {line,\n 234}]}]}}}'}
|
Strangely enough when I login to the cluster the nodes have been rebalanced out inspite of the failure.
We also see the following ERROR messages in the logs.
On 172.23.106.209
2021-02-17 23:20:03,496 | test | CRITICAL | MainThread | [basetestcase:check_coredump_exist:791] 172.23.106.209: Found ' ERROR ' logs - ['2021-02-17T23:19:34.422577-08:00 ERROR 76: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.106.209->ns_1@172.23.106.232:default - DcpProducer::handleResponse disconnecting, received unexpected response:{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"keylen":0,"magic":"ClientResponse","opaque":355,"opcode":"DCP_SYSTEM_EVENT","status":"Invalid arguments"} for stream:stream name:eq_dcpq:replication:ns_1@172.23.106.209->ns_1@172.23.106.232:default, vb:870, state:in-memory\n']
|
cbcollect_info attached. This test had passed on 7.0.0-4454.
Attachments
Issue Links
For Gerrit Dashboard: MB-44417 | ||||||
---|---|---|---|---|---|---|
# | Subject | Branch | Project | Status | CR | V |
146451,6 | MB-44417: Stop concurrent dereference calls throwing | master | kv_engine | Status: MERGED | +2 | +1 |