Details
Bug
Resolution: Unresolved
Major
4.5.1, 4.6.0
4.5.1-2842
Untriaged
Unknown
Description
While running the kv/view stress test, a rebalance failed with the errors shown below.
The test passed on build 4.5.1-2838 but failed in a recent run with 2342, so this may not be a regression (that would need to be confirmed by triage).
Rebalance exited with reason {unexpected_exit,
{'EXIT',<0.24477.0>,
{bulk_set_vbucket_state_failed,
[{'ns_1@172.23.108.105',
{'EXIT',
{{nodedown,'ns_1@172.23.108.105'},
{gen_server,call,
[{'janitor_agent-default',
'ns_1@172.23.108.105'},
{if_rebalance,<0.19074.0>,
{update_vbucket_state,646,replica,
undefined,undefined}},
infinity]}}}}]}}}
The debug log shows dcp_replicator exiting at the time of the rebalance failure:
[ns_server:debug,2016-09-20T08:47:48.746-07:00,ns_1@172.23.108.105:dcp_replicator-default-ns_1@172.23.108.97<0.3125.0>:dcp_replicator:spawn_and_wait:228]Received exit with reason {shutdown,nuke} from <0.3040.0>. Killing child process <0.11265.0>
...
=========================CRASH REPORT=========================
crasher:
initial call: dcp_replicator:init/1
pid: <0.3125.0>
registered_name: 'dcp_replicator-default-ns_1@172.23.108.97'
exception exit: {{child_interrupted,{'EXIT',<0.3040.0>,{shutdown,nuke}}},
[{dcp_replicator,spawn_and_wait,1,
[{file,"src/dcp_replicator.erl"},
{line,231}]},
{dcp_replicator,handle_call,3,
[{file,"src/dcp_replicator.erl"},
{line,115}]},
{gen_server,handle_msg,5,
[{file,"gen_server.erl"},{line,585}]},
{proc_lib,init_p_do_apply,3,
[{file,"proc_lib.erl"},{line,239}]}]}
in function gen_server:terminate/6 (gen_server.erl, line 744)
ancestors: ['dcp_sup-default','single_bucket_kv_sup-default',
ns_bucket_sup,ns_bucket_worker_sup,ns_server_sup,
ns_server_nodes_sup,<0.2289.0>,ns_server_cluster_sup,
<0.88.0>]
messages: [{'EXIT',<0.3127.0>,killed},{'EXIT',<0.3126.0>,killed}]
links: [<0.3028.0>]
dictionary: []
trap_exit: true
status: running
heap_size: 1598
stack_size: 27
reductions: 3399
neighbours:
...
[ns_server:debug,2016-09-20T08:47:48.748-07:00,ns_1@172.23.108.105:replication_manager-default<0.3030.0>:replication_manager:terminate:105]Replication manager died {{{{child_interrupted,
{'EXIT',<0.3040.0>,{shutdown,nuke}}},
[{dcp_replicator,spawn_and_wait,1,
[{file,"src/dcp_replicator.erl"},{line,231}]},
There are also very slow SET operations on node .105 shortly before the failure, some taking 10 s or longer:
2016-09-20T08:46:58.934116-07:00 WARNING 67: Slow SET operation on connection: 3596 ms ([ 172.23.108.94:45782 - 172.23.108.105:11210 ])
2016-09-20T08:46:59.088348-07:00 WARNING 65: Slow SET operation on connection: 10 s ([ 172.23.108.94:46225 - 172.23.108.105:11210 ])
2016-09-20T08:46:59.189076-07:00 WARNING 118: Slow SET operation on connection: 4723 ms ([ 172.23.108.94:46664 - 172.23.108.105:11210 ])
2016-09-20T08:46:59.217688-07:00 WARNING 124: Slow SET operation on connection: 3049 ms ([ 172.23.108.94:46347 - 172.23.108.105:11210 ])
2016-09-20T08:47:00.084680-07:00 WARNING 122: Slow SET operation on connection: 2210 ms ([ 172.23.108.94:46341 - 172.23.108.105:11210 ])
2016-09-20T08:47:00.483088-07:00 WARNING 65: Slow SET operation on connection: 1394 ms ([ 172.23.108.94:46225 - 172.23.108.105:11210 ])
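For triage, it may help to pull the durations out of these warnings and compare them across nodes or runs. The following is a minimal sketch, assuming the warning format is exactly as in the lines above; the function name slow_op_durations_ms and the regular expression are illustrative and not part of any Couchbase tooling.

```python
import re

# Matches warnings of the form seen in the log excerpt above, e.g.
#   ... WARNING 65: Slow SET operation on connection: 10 s ([ ... ])
# The pattern is an assumption based on those lines only.
SLOW_OP = re.compile(
    r"Slow (?P<op>\w+) operation on connection: (?P<value>\d+) (?P<unit>ms|s)\b"
)


def slow_op_durations_ms(lines):
    """Return (operation, duration in ms) for each slow-operation warning."""
    results = []
    for line in lines:
        match = SLOW_OP.search(line)
        if match is None:
            continue  # not a slow-operation warning
        value = int(match.group("value"))
        if match.group("unit") == "s":
            value *= 1000  # normalise seconds to milliseconds
        results.append((match.group("op"), value))
    return results


sample = [
    "2016-09-20T08:46:58.934116-07:00 WARNING 67: Slow SET operation on "
    "connection: 3596 ms ([ 172.23.108.94:45782 - 172.23.108.105:11210 ])",
    "2016-09-20T08:46:59.088348-07:00 WARNING 65: Slow SET operation on "
    "connection: 10 s ([ 172.23.108.94:46225 - 172.23.108.105:11210 ])",
]
durations = slow_op_durations_ms(sample)
print(max(ms for _, ms in durations))  # slowest operation in ms
```

Run against the full memcached log, the same function would give a list that can be sorted or bucketed by time to see whether the slow operations cluster around the rebalance failure.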
One more note: the last time we ran this test on 2842, there was a kernel panic related to beam.smp on node .105:
Sep 19 14:46:31 centos-64-x64 kernel: beam.smp: page allocation failure. order:0, mode:0x20
Sep 19 14:46:31 centos-64-x64 kernel: Pid: 30830, comm: beam.smp Not tainted 2.6.32-358.el6.x86_64 #1
Sep 19 14:46:31 centos-64-x64 kernel: Call Trace:
Sep 19 14:46:31 centos-64-x64 kernel: <IRQ> [<ffffffff8112c127>] ? __alloc_pages_nodemask+0x757/0x8d0
http://qa.sc.couchbase.com/job/centos-systest-launcher/235/console