After the failure in MB-45893, the new 7.0 node that had been added to the cluster (172.23.107.129) was removed. That rebalance operation completed without issue.
The next step was to perform a swap rebalance to take out the 6.6.2 node 172.23.107.126 and bring 172.23.107.129 back in. This swap rebalance failed after ~10 minutes.
Rebalance exited with reason {mover_crashed,
{unexpected_exit,
{'EXIT',<0.27190.4>,
{{wait_seqno_persisted_failed,"default",469,
1553,
[{'ns_1@172.23.107.142',
{'EXIT',
{{{badmatch,{error,timeout}},
[{mc_client_binary,cmd_vocal_recv,5,
[{file,"src/mc_client_binary.erl"},
{line,152}]},
{mc_client_binary,select_bucket,2,
[{file,"src/mc_client_binary.erl"},
{line,344}]},
{ns_memcached_sockets_pool,
take_socket,2,
[{file,
"src/ns_memcached_sockets_pool.erl"},
{line,40}]},
{ns_memcached_sockets_pool,
'-executing_on_socket/3-fun-0-',3,
[{file,
"src/ns_memcached_sockets_pool.erl"},
{line,82}]},
{async,'-async_init/4-fun-1-',3,
[{file,"src/async.erl"},
{line,191}]}]},
{gen_server,call,
[{'janitor_agent-default',
'ns_1@172.23.107.142'},
{if_rebalance,<0.12682.4>,
{wait_seqno_persisted,469,1553}},
infinity]}}}}]},
[{ns_single_vbucket_mover,
'-wait_seqno_persisted_many/5-fun-2-',5,
[{file,"src/ns_single_vbucket_mover.erl"},
{line,482}]},
{proc_lib,init_p,3,
[{file,"proc_lib.erl"},{line,234}]}]}}}}.
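The proximate error in this trace is {badmatch,{error,timeout}} inside mc_client_binary:select_bucket, raised while ns_memcached_sockets_pool:take_socket was preparing a connection for the wait_seqno_persisted call. In other words, memcached was so unresponsive that even the select-bucket handshake timed out, and the mover for vbucket 469 gave up waiting for seqno 1553 to be persisted. A minimal sketch of this wait-with-deadline pattern (Python; get_persisted_seqno and WaitSeqnoPersistedFailed are illustrative stand-ins, not ns_server's actual API):

import time

class WaitSeqnoPersistedFailed(Exception):
    # Stand-in for ns_server's {wait_seqno_persisted_failed, ...} exit reason.
    pass

def wait_seqno_persisted(get_persisted_seqno, bucket, vbucket, target_seqno,
                         timeout_s=30.0, poll_interval_s=0.1):
    # get_persisted_seqno is a placeholder for the stats call the real
    # janitor_agent/memcached path performs over a pooled connection; in this
    # incident even establishing that connection (select_bucket) timed out.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_persisted_seqno(bucket, vbucket) >= target_seqno:
            return
        time.sleep(poll_interval_s)
    raise WaitSeqnoPersistedFailed(
        f"{bucket!r}: vb {vbucket} did not persist seqno {target_seqno} "
        f"within {timeout_s}s")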
Rebalance Operation Id = 81f43662ad33d73c76a515c82d352a89
Nodes in the cluster: 172.23.107.142 (7.0 node), 172.23.107.126 (outgoing 6.6.2 node), 172.23.107.129 (incoming 7.0 node).
The initial steps to set up the cluster are the same as in MB-45893.
The first rebalance attempt (truncated "Starting rebalance" log line):
'ns_1@172.23.107.142'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = fda046d63e6f8a8b8014af2288bb470a
Failed here:
[ns_server:debug,2021-04-23T15:56:13.288-07:00,ns_1@172.23.107.142:leader_activities<0.14526.8>:leader_activities:handle_activity_down:505]Activity terminated with reason {shutdown,
[user:info,2021-04-23T16:12:48.988-07:00,ns_1@172.23.107.142:<0.11676.1>:ns_orchestrator:idle:773]Starting rebalance, KeepNodes = ['ns_1@172.23.107.129','ns_1@172.23.107.142'], EjectNodes = ['ns_1@172.23.107.126'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 81f43662ad33d73c76a515c82d352a89
Daniel Owen
added a comment (edited):
Analysis of Rebalance Failure 1
The rebalance only lasted from 2021-04-23T15:56:02 to 2021-04-23T15:56:13, i.e. 11 seconds.
During this time I don't see anything of interest in memcached.log.
I'm not sure why we see {error,wait_for_memcached_failed}. Dave Finlay, maybe you can help highlight why?
It looks like a duplicate of the rebalance failure seen in MB-45893.
I see a gap in the stats at this time, suggesting the system is struggling (but in this case not due to XDCR).
Analysis of Rebalance Failure 2
The rebalance started at 2021-04-23T16:12:48 and terminated at 2021-04-23T16:23:32, so it was running for ~10 minutes.
[error_logger:error,2021-04-23T16:23:32.697-07:00,ns_1@172.23.107.142:<0.27190.4>:ale_error_logger_handler:do_log:101]
=========================CRASH REPORT=========================
crasher:
initial call: ns_single_vbucket_mover:'-wait_seqno_persisted_many/5-fun-2-'/0
pid: <0.27190.4>
registered_name: []
exception error: {wait_seqno_persisted_failed,"default",469,1553,
Interestingly, in memcached.log we see:
2021-04-23T16:21:30.215813-07:00 WARNING 1776: Slow operation. {"cid":"172.23.107.142:61496/0","duration":"10 s","trace":"request=254819401286496:10205175 execute=254819401286496:22 execute=254829606410512:51","command":"SASL_AUTH","peer":"{"ip":"172.23.107.142","port":61496}","bucket":"","packet":{"bodylen":45,"cas":0,"datatype":"raw","extlen":0,"key":"<ud>PLAIN</ud>","keylen":5,"magic":"ClientRequest","opaque":0,"opcode":"SASL_AUTH","vbucket":0}}
<48 second gap>
2021-04-23T16:22:18.890585-07:00 WARNING 3136: Slow operation. {"cid":"172.23.107.126:64195/e03de","duration":"48 s","trace":"request=254829827705696:48442700 execute=254829827705696:48442700","command":"STAT","peer":"{"ip":"172.23.107.126","port":64195}","bucket":"beer-sample","packet":{"bodylen":15,"cas":0,"datatype":"raw","extlen":0,"key":"<ud>vbucket-details</ud>","keylen":15,"magic":"ClientRequest","opaque":3724742144,"opcode":"STAT","vbucket":0}}
<53 second gap>
2021-04-23T16:23:11.839941-07:00 WARNING 1336: Slow operation. {"cid":"172.23.107.142:57387/e03de","duration":"53 s","trace":"request=254878270969408:52954717 execute=254878270969408:52954717","command":"STAT","peer":"{"ip":"172.23.107.142","port":57387}","bucket":"beer-sample","packet":{"bodylen":15,"cas":0,"datatype":"raw","extlen":0,"key":"<ud>vbucket-details</ud>","keylen":15,"magic":"ClientRequest","opaque":3724742144,"opcode":"STAT","vbucket":0}}
2021-04-23T16:23:11.840663-07:00 WARNING 1656: Slow operation. {"cid":"172.23.107.142:61683/0","duration":"100 s","trace":"request=254831496057008:99730372 execute=254831496057008:22 execute=254931226331488:97","command":"SASL_AUTH","peer":"{"ip":"172.23.107.142","port":61683}","bucket":"","packet":{"bodylen":45,"cas":0,"datatype":"raw","extlen":0,"key":"<ud>PLAIN</ud>","keylen":5,"magic":"ClientRequest","opaque":0,"opcode":"SASL_AUTH","vbucket":0}}
2021-04-23T16:23:11.841113-07:00 WARNING 3312: Slow operation. {"cid":"172.23.107.142:61691/0","duration":"99 s","trace":"request=254831889792304:99337104 execute=254831889792304:26 execute=254931226821840:74","command":"SASL_AUTH","peer":"{"ip":"172.23.107.142","port":61691}","bucket":"","packet":{"bodylen":45,"cas":0,"datatype":"raw","extlen":0,"key":"<ud>PLAIN</ud>","keylen":5,"magic":"ClientRequest","opaque":0,"opcode":"SASL_AUTH","vbucket":0}}
2021-04-23T16:23:11.841128-07:00 WARNING 4788: Slow operation. {"cid":"172.23.107.142:61701/0","duration":"99 s","trace":"request=254832016855472:99210076 execute=254832016855472:27 execute=254931226896288:35","command":"SASL_AUTH","peer":"{"ip":"172.23.107.142","port":61701}","bucket":"","packet":{"bodylen":46,"cas":0,"datatype":"raw","extlen":0,"key":"<ud>PLAIN</ud>","keylen":5,"magic":"ClientRequest","opaque":0,"opcode":"SASL_AUTH","vbucket":0}}
2021-04-23T16:23:11.842429-07:00 WARNING 4784: Slow operation. {"cid":"172.23.107.142:61703/0","duration":"99 s","trace":"request=254832014190400:99214031 execute=254832014190400:18 execute=254931228186592:34","command":"SASL_AUTH","peer":"{"ip":"172.23.107.142","port":61703}","bucket":"","packet":{"bodylen":45,"cas":0,"datatype":"raw","extlen":0,"key":"<ud>PLAIN</ud>","keylen":5,"magic":"ClientRequest","opaque":0,"opcode":"SASL_AUTH","vbucket":0}}
2021-04-23T16:23:11.842435-07:00 WARNING 2968: Slow operation. {"cid":"172.23.107.142:61702/0","duration":"99 s","trace":"request=254832014883872:99213326 execute=254832014883872:13 execute=254931228157040:52","command":"SASL_AUTH","peer":"{"ip":"172.23.107.142","port":61702}","bucket":"","packet":{"bodylen":45,"cas":0,"datatype":"raw","extlen":0,"key":"<ud>PLAIN</ud>","keylen":5,"magic":"ClientRequest","opaque":0,"opcode":"SASL_AUTH","vbucket":0}}
2021-04-23T16:23:11.842593-07:00 WARNING 2960: Slow operation. {"cid":"172.23.107.142:61688/0","duration":"99 s","trace":"request=254831872722224:99355669 execute=254831872722224:78 execute=254931228357376:34","command":"SASL_AUTH","peer":"{"ip":"172.23.107.142","port":61688}","bucket":"","packet":{"bodylen":45,"cas":0,"datatype":"raw","extlen":0,"key":"<ud>PLAIN</ud>","keylen":5,"magic":"ClientRequest","opaque":0,"opcode":"SASL_AUTH","vbucket":0}}
Notice that:
- the first slow STAT has a duration of 48 seconds;
- the second slow STAT has a duration of 53 seconds;
- 48 + 53 = 101.
The subsequent slow SASL_AUTH commands have durations of 100 s / 99 s. It looks like they are being blocked behind the vbucket-details stat calls (the sketch below checks the numbers).
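As a rough cross-check of those durations, the spans in each trace field appear to follow name=<start_ticks>:<elapsed_us>; for example 48442700 µs lines up with the logged "48 s". A minimal parsing sketch (Python; the trace format is inferred from the samples above, not taken from the memcached source):

import re

# Each trace span looks like name=<start_ticks>:<elapsed_us> in the samples
# above; treating the last field as microseconds is an assumption that
# happens to match the logged durations.
SPAN = re.compile(r"(\w+)=(\d+):(\d+)")

def span_seconds(line):
    """Return [(span_name, elapsed_seconds), ...] for one 'Slow operation' line."""
    m = re.search(r'"trace":"([^"]*)"', line)
    if not m:
        return []
    return [(name, int(elapsed_us) / 1e6)  # microseconds -> seconds
            for name, _start, elapsed_us in SPAN.findall(m.group(1))]

For the 16:22:18 STAT line this gives ~48.4 s, and for the 16:23:11 STAT line ~53.0 s; 48 + 53 = 101 s is right in line with the 100 s / 99 s that the queued SASL_AUTH requests report.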
We can see that around this time CPU utilisation is at 100%, and of the 6 available CPUs, 4 are being used by ns_server.
I'm not sure why; again, I think I need your input on this, Dave Finlay.
As a side note, I see we are very close to the maximum of system memory.
Aliaksey Artamonau (Inactive)
added a comment (edited):
As Daniel pointed out, I think the memory usage is the problem. I see goxdcr use up to 4GiB and cbft use up to 2.5GiB on 172.23.107.142, which only has 6GiB of RAM.
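A back-of-envelope check of those numbers (Python; the figures are the peaks quoted above):

# Peak usage quoted above, in GiB; memcached, ns_server and the OS still
# have to fit in whatever is left.
goxdcr_peak = 4.0
cbft_peak = 2.5
total_ram = 6.0

print(goxdcr_peak + cbft_peak)               # 6.5 GiB
print(goxdcr_peak + cbft_peak > total_ram)   # True: over-committed before
                                             # counting any other service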
Neil Huang
added a comment:
XDCR recently had a memory leak issue resolved as MB-45935.
7.0.0-5051 should contain this change (https://issues.couchbase.com/browse/MB-45935?focusedCommentId=495915&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-495915).
Mihir Kamdar, do you mind seeing if this works with that build or later?
Neil Huang
added a comment (edited):
The XDCR logs show the same error that has been seen in MB-45893 + MB-45892:
2021-04-23T16:34:47.302-07:00 ERRO GOXDCR.PipelineMgr: Failed to start the pipeline 721e2b5f8f84ad3127296726a3122b7d/beer-sample/beer-sample-169757800
2021-04-23T16:34:47.303-07:00 ERRO GOXDCR.PipelineMgr: Failed to update pipeline 721e2b5f8f84ad3127296726a3122b7d/beer-sample/beer-sample, err=dcp_721e2b5f8f84ad3127296726a3122b7d/beer-sample/beer-sample_172.23.107.142:11210_0 : Not supported
2021-04-23T16:34:47.303-07:00 ERRO GOXDCR.PipelineMgr: Update of pipeline 721e2b5f8f84ad3127296726a3122b7d/beer-sample/beer-sample failed with errors=dcp_721e2b5f8f84ad3127296726a3122b7d/beer-sample/beer-sample_172.23.107.142:11210_0 : Not supported
Combining MB-45892, which causes repeated restarts, with MB-45935, which bloats memory on each restart, produced the XDCR memory bloat that eats up all the memory.
At the very least, 7.0.0-5038 will unblock this issue (https://issues.couchbase.com/browse/MB-45893?focusedCommentId=495809&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-495809).
This looks like it might be KV, so I will take an initial triage.