Details
-
Bug
-
Status: Closed
-
Critical
-
Resolution: Fixed
-
6.5.0
-
Enterprise Edition 6.5.0 build 4676
-
Triaged
-
Centos 64-bit
-
-
Unknown
-
KV-Engine Mad-Hatter GA
Description
Steps to Reproduce:
Create a 7 node cluster.
+----------------+----------+--------------+
|
| Nodes | Services | Status |
|
+----------------+----------+--------------+
|
| 172.23.106.134 | [u'kv'] | Cluster node | |
| 172.23.106.136 | None | <--- IN --- | |
| 172.23.106.137 | None | <--- IN --- | |
| 172.23.106.138 | None | <--- IN --- | |
| 172.23.105.168 | None | <--- IN --- | |
| 172.23.106.82 | None | <--- IN --- | |
| 172.23.106.83 | None | <--- IN --- | |
+----------------+----------+--------------+Â
|
    1. Create an ephemeral bucket with replicas=1, eviction policy=noEviction, compression=off.
    2. Load 10M docs with durability = MAJORITY. This step was successfully completed. Bucket Stats after this step:
+----------------+-----------+----------+-----+----------+--------------+-------------+-----------+
|
| Bucket | Type | Replicas | TTL | Items | RAM Quota | RAM Used | Disk Used |
|
+----------------+-----------+----------+-----+----------+--------------+-------------+-----------+
|
| GleamBookUsers | ephemeral | 1 | 0 | 10000000 | 139732189184 | 29729510584 | 238 | |
+----------------+-----------+----------+-----+----------+--------------+-------------+-----------+
|
|
   3. Rebalance In 1 node(172.23.106.85) with another 2M creates, 4M updates with durability=MAJORITY in parallel. This step was successfully completed. Bucket Stats after this Step:
+----------------+-----------+----------+-----+----------+--------------+-------------+-----------+
|
| Bucket | Type | Replicas | TTL | Items | RAM Quota | RAM Used | Disk Used |
|
+----------------+-----------+----------+-----+----------+--------------+-------------+-----------+
|
| GleamBookUsers | ephemeral | 1 | 0 | 12000000 | 159693930496 | 37605497792 | 272 | |
+----------------+-----------+----------+-----+----------+--------------+-------------+-----------+
|
|
   4. Rebalance Out 1 node(172.23.106.83) with another 2M creates, 4M updates, 2M deletes with durability=MAJORITY in parallel. This step was successfully completed. Bucket Stats after this Step:
+----------------+-----------+----------+-----+----------+--------------+-------------+-----------+
|
| Bucket | Type | Replicas | TTL | Items | RAM Quota | RAM Used | Disk Used |
|
+----------------+-----------+----------+-----+----------+--------------+-------------+-----------+
|
| GleamBookUsers | ephemeral | 1 | 0 | 12000000 | 139732189184 | 40522783760 | 238 | |
+----------------+-----------+----------+-----+----------+--------------+-------------+-----------+
|
  5. Rebalance In 2 nodes(172.23.106.83, 172.23.106.86) and Rebalance Out 1 node(172.23.106.137) with another 2M creates, 4M updates, 2M deletes with durability=MAJORITY in parallel. Rebalance Operation failed in this step.
Error Messages:
Rebalance exited with reason {mover_crashed,
|
{unexpected_exit,
|
{'EXIT',<0.15239.12>, |
{{{{{child_interrupted,
|
{'EXIT',<23442.27660.5>,socket_closed}}, |
[{dcp_replicator,spawn_and_wait,1, |
[{file,"src/dcp_replicator.erl"}, |
{line,266}]}, |
{dcp_replicator,handle_call,3, |
[{file,"src/dcp_replicator.erl"}, |
{line,121}]}, |
{gen_server,try_handle_call,4, |
[{file,"gen_server.erl"},{line,636}]}, |
{gen_server,handle_msg,6, |
[{file,"gen_server.erl"},{line,665}]}, |
{proc_lib,init_p_do_apply,3, |
[{file,"proc_lib.erl"},{line,247}]}]}, |
{gen_server,call,
|
[<23442.27659.5>, |
{setup_replication,[416,430,431,432,433]}, |
infinity]}},
|
{gen_server,call,
|
['replication_manager-GleamBookUsers', |
{change_vbucket_replication,430, |
'ns_1@172.23.106.136'}, |
infinity]}},
|
{gen_server,call,
|
[{'janitor_agent-GleamBookUsers', |
'ns_1@172.23.106.83'}, |
{if_rebalance,<0.6571.12>, |
{update_vbucket_state,966,active, |
undefined,undefined,
|
[['ns_1@172.23.106.83', |
'ns_1@172.23.106.86']]}}, |
infinity]}}}}}.
|
Rebalance Operation Id = 6e1498b429d0c4f57e09d5db488c9f4e
|
Worker <0.6765.12> (for action {move,{966, |
['ns_1@172.23.106.137', |
'ns_1@172.23.106.136'], |
['ns_1@172.23.106.83', |
'ns_1@172.23.106.86'], |
[]}}) exited with reason {unexpected_exit,
|
{'EXIT', |
<0.15239.12>, |
{{{{{child_interrupted,
|
{'EXIT', |
<23442.27660.5>, |
socket_closed}},
|
[{dcp_replicator,
|
spawn_and_wait,
|
1, |
[{file,
|
"src/dcp_replicator.erl"}, |
{line,
|
266}]}, |
{dcp_replicator,
|
handle_call,
|
3, |
[{file,
|
"src/dcp_replicator.erl"}, |
{line,
|
121}]}, |
{gen_server,
|
try_handle_call,
|
4, |
[{file,
|
"gen_server.erl"}, |
{line,
|
636}]}, |
{gen_server,
|
handle_msg,
|
6, |
[{file,
|
"gen_server.erl"}, |
{line,
|
665}]}, |
{proc_lib,
|
init_p_do_apply,
|
3, |
[{file,
|
"proc_lib.erl"}, |
{line,
|
247}]}]}, |
{gen_server,
|
call,
|
[<23442.27659.5>, |
{setup_replication,
|
[416, |
430, |
431, |
432, |
433]}, |
infinity]}},
|
{gen_server,
|
call,
|
['replication_manager-GleamBookUsers', |
{change_vbucket_replication,
|
430, |
'ns_1@172.23.106.136'}, |
infinity]}},
|
{gen_server,
|
call,
|
[{'janitor_agent-GleamBookUsers', |
'ns_1@172.23.106.83'}, |
{if_rebalance,
|
<0.6571.12>, |
{update_vbucket_state,
|
966, |
active,
|
undefined,
|
undefined,
|
[['ns_1@172.23.106.83', |
'ns_1@172.23.106.86']]}}, |
infinity]}}}}
|
Note: This warning message was not found in memcached logs.
Disconnecting the connection as there is no memory to complete replication
|
Â
Timestamp of rebalance failure is:
[ns_server:info,2019-10-25T01:34:57.603-07:00,ns_1@172.23.106.136:rebalance_agent<0.3477.0>:rebalance_agent:handle_down:296]Rebalancer process <23133.6008.12> died (reason {mover_crashed,
This corresponds to the following error on node .106.83, which causes the DCP connection to be closed (and hence rebalance fails):
2019-10-25T01:34:57.241398-07:00 ERROR 93: exception occurred in runloop during packet execution. Cookie info: [{"aiostat":"success","connection":"[ 127.0.0.1:43619 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]","engine_storage":"0x00007fd93ad9ae10","ewouldblock":false,"packet":{"bodylen":30,"cas":0,"datatype":"raw","extlen":16,"key":"<ud>Users-01024899</ud>","keylen":14,"magic":"ClientRequest","opaque":30,"opcode":"DCP_ABORT","vbucket":416},"refcount":1}] - closing connection ([ 127.0.0.1:43619 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]): EphemeralVBucket::abortStoredValue: Cannot call on a non-Pending StoredValue
Some context leading up to this error for the affected vb:416:
2019-10-25T01:33:47.672699-07:00 INFO (GleamBookUsers) VBucket::setState: transitioning vb:416 with high seqno:0 from:replica to:replica
2019-10-25T01:33:47.675395-07:00 INFO 79: (GleamBookUsers) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.106.134->ns_1@172.23.106.83:GleamBookUsers - (vb:416) Attempting to add stream: opaque_:16, start_seqno_:0, end_seqno_:18446744073709551615, vb_uuid:132115003169186, snap_start_seqno_:0
, snap_end_seqno_:0, last_seqno:0, stream_req_value:{"uid":"0"}
... 36s later ...
2019-10-25T01:34:23.326619-07:00 INFO 79: (GleamBookUsers) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.106.134->ns_1@172.23.106.83:GleamBookUsers - (vb:416) Setting stream to dead state, last_seqno is 47118, unAckedBytes is 0, status is The stream closed due to a close stream message
2019-10-25T01:34:23.326652-07:00 INFO 79: (GleamBookUsers) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.106.134->ns_1@172.23.106.83:GleamBookUsers - (vb:416) End stream received with reason 1
... 34s later ...
2019-10-25T01:34:57.237037-07:00 INFO 93: (GleamBookUsers) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.106.136->ns_1@172.23.106.83:GleamBookUsers - (vb:416) Attempting to add stream: opaque_:30, start_seqno_:47118, end_seqno_:18446744073709551615, vb_uuid:173839965648806, snap_start_seqno_:47118, snap_end_seqno_:47118, last_seqno:47118, stream_req_value:{"uid":"0"}
2019-10-25T01:34:57.241398-07:00 ERROR 93: exception occurred in runloop during packet execution. Cookie info: [{"aiostat":"success","connection":"[ 127.0.0.1:43619 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]","engine_storage":"0x00007fd93ad9ae10","ewouldblock":false,"packet":{"bodylen":30,"cas":0,"datatype":"raw","extlen":16,"key":"<ud>Users-01024899</ud>","keylen":14,"magic":"ClientRequest","opaque":30,"opcode":"DCP_ABORT","vbucket":416},"refcount":1}] - closing connection ([ 127.0.0.1:43619 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]): EphemeralVBucket::abortStoredValue: Cannot call on a non-Pending StoredValue