The issue occurred 5 days into a longevity test with ephemeral buckets that have no eviction policy.
The logs show that a rebalance started, after which we got several metadata overhead warnings, followed by an ns_server backtrace:
2017-09-13T07:34:51.604-07:00, ns_orchestrator:4:info:message(ns_1@172.23.106.14) - Starting rebalance, KeepNodes = ['ns_1@172.23.105.60','ns_1@172.23.105.61',
'ns_1@172.23.105.62','ns_1@172.23.105.63',
'ns_1@172.23.106.14','ns_1@172.23.106.213',
'ns_1@172.23.106.96','ns_1@172.23.99.168',
'ns_1@172.23.99.253'], EjectNodes = ['ns_1@172.23.105.83'], Failed over and being ejected nodes = []; no delta recovery nodes
2017-09-13T07:40:32.197-07:00, ns_vbucket_mover:0:info:message(ns_1@172.23.106.14) - Bucket "default" rebalance appears to be swap rebalance
2017-09-13T08:02:01.695-07:00, menelaus_web_alerts_srv:0:info:message(ns_1@172.23.99.253) - Metadata overhead warning. Over 50% of RAM allocated to bucket "default" on node "172.23.99.253" is taken up by keys and metadata.
2017-09-13T08:02:22.551-07:00, menelaus_web_alerts_srv:0:info:message(ns_1@172.23.99.253) - Metadata overhead warning. Over 50% of RAM allocated to bucket "default" on node "172.23.99.253" is taken up by keys and metadata. (repeated 6 times)

per_node_processes('ns_1@172.23.106.14') =
{<0.32569.4081>,
[{registered_name,[]},
{status,waiting},
{initial_call,{proc_lib,init_p,5}},
{backtrace,
[<<"Program counter: 0x00007f460af7b288 (ns_single_vbucket_mover:spawn_and_wait/1 + 72)">>,
<<"CP: 0x0000000000000000 (invalid)">>,<<"arity = 0">>,<<>>,
<<"0x00007f4609bdd678 Return addr 0x00007f46533eee90 (misc:try_with_maybe_ignorant_after/2 + 80)">>,
<<"y(0) []">>,<<"y(1) []">>,<<"y(2) <0.20357.4080>">>,
<<>>,
<<"0x00007f4609bdd698 Return addr 0x00007f460af7b0d8 (ns_single_vbucket_mover:mover/5 + 896)">>,
<<"y(0) []">>,<<"y(1) []">>,<<"y(2) []">>,
<<"y(3) []">>,
<<"y(4) #Fun<ns_single_vbucket_mover.3.48828051>">>,
<<"y(5) Catch 0x00007f46533eeeb0 (misc:try_with_maybe_ignorant_after/2 + 112)">>,
<<>>,
<<"0x00007f4609bdd6d0 Return addr 0x00007f465befc198 (proc_lib:init_p_do_apply/3 + 56)">>,
<<"y(0) []">>,<<"y(1) true">>,
<<"y(2) ['ns_1@172.23.105.62','ns_1@172.23.106.213']">>,
<<"y(3) ['ns_1@172.23.105.62','ns_1@172.23.105.83']">>,
<<"y(4) 27">>,<<"y(5) <0.25037.4080>">>,<<>>,
<<"0x00007f4609bdd708 Return addr 0x0000000000893588 (<terminate process normally>)">>,
<<"y(0) Catch 0x00007f465befc1b8 (proc_lib:init_p_do_apply/3 + 88)">>,
<<>>]},

As a result, the rebalance hangs in the cluster.
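
For triage, it can help to reduce a process dump like the one above to its call chain. The following is a minimal Python sketch, not part of ns_server or any Couchbase tooling: the helper name `extract_frames` and the regular expression are assumptions made for illustration, based only on the "Program counter" and "Return addr" lines as they appear in this report.

```python
import re

# Assumed pattern: "(module:function/arity + offset)" as printed in the dump.
FRAME_RE = re.compile(r"\(([a-z_0-9]+):([A-Za-z_0-9]+)/(\d+) \+ \d+\)")


def extract_frames(backtrace_text):
    """Return 'module:function/arity' for each program-counter or
    return-address line, innermost frame first. Catch lines and stack
    slots (y(N) ...) are skipped."""
    frames = []
    for line in backtrace_text.splitlines():
        if "Program counter" in line or "Return addr" in line:
            match = FRAME_RE.search(line)
            if match:
                frames.append(
                    f"{match.group(1)}:{match.group(2)}/{match.group(3)}"
                )
    return frames


# Frame lines copied from the backtrace in this report.
sample = """\
<<"Program counter: 0x00007f460af7b288 (ns_single_vbucket_mover:spawn_and_wait/1 + 72)">>,
<<"0x00007f4609bdd678 Return addr 0x00007f46533eee90 (misc:try_with_maybe_ignorant_after/2 + 80)">>,
<<"0x00007f4609bdd698 Return addr 0x00007f460af7b0d8 (ns_single_vbucket_mover:mover/5 + 896)">>,
<<"y(5) Catch 0x00007f46533eeeb0 (misc:try_with_maybe_ignorant_after/2 + 112)">>,
<<"0x00007f4609bdd6d0 Return addr 0x00007f465befc198 (proc_lib:init_p_do_apply/3 + 56)">>,
"""

frames = extract_frames(sample)
for frame in frames:
    print(frame)
```

Run against the dump in this report, the first frame is ns_single_vbucket_mover:spawn_and_wait/1, which matches the process status of waiting shown above.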