Details
- Bug
- Resolution: Fixed
- Critical
- 6.5.0
- 6.5.0-3883-enterprise
- Triaged
- Centos 64-bit
- Yes
- KV-Engine MH Beta part 2
Description
Script to Repro
./testrunner -i /tmp/testexec.3494.ini -p get-cbcollect-info=True,flusher_batch_split_trigger=10 -t rebalance.rebalance_high_ops_pillowfight.RebalanceHighOpsWithPillowFight.test_graceful_failover_addback,node_out=3,replicas=2,nodes_init=4,items=2000000,batch_size=1000,rate_limit=100000,recovery_type=delta,instances=2,threads=5,loader=high_ops,flusher_batch_split_trigger=1
Steps
- Create a 4-node cluster with 2 replicas and set flusher_batch_split_trigger=1.
- Load data with the high-ops data loader.
- Gracefully fail over a node.
- Start the high-ops data loader again.
- Perform a delta recovery.
- Start a rebalance again.
The rebalance fails with the following error:
{u'node': u'ns_1@172.23.105.105', u'code': 0, u'text': u'Rebalance exited with reason {{badmatch,\n {error,\n {failed_nodes,[\'ns_1@172.23.105.47\']}}},\n [{ns_janitor,cleanup_apply_config_body,4,\n [{file,"src/ns_janitor.erl"},{line,286}]},\n {ns_janitor,\'-cleanup_apply_config/4-fun-0-\',\n 4,\n [{file,"src/ns_janitor.erl"},{line,209}]},\n {async,\'-async_init/4-fun-2-\',3,\n [{file,"src/async.erl"},{line,211}]}]}.\nRebalance Operation Id = 28ffeff813a1d2e394ea0f10d72cbccf', u'shortText': u'message', u'serverTime': u'2019-07-27T23:42:38.878Z', u'module': u'ns_orchestrator', u'tstamp': 1564296158878, u'type': u'critical'}
[2019-07-27 23:42:48,906] - [rest_client:3250] ERROR - {u'node': u'ns_1@172.23.105.47', u'code': 0, u'text': u'Control connection to memcached on \'ns_1@172.23.105.47\' disconnected: {lost_connection,\n [{ns_memcached,\n worker_loop,\n 3,\n [{file,\n "src/ns_memcached.erl"},\n {line,\n 231}]},\n {proc_lib,\n init_p_do_apply,\n 3,\n [{file,\n "proc_lib.erl"},\n {line,\n 247}]}]}', u'shortText': u'message', u'serverTime': u'2019-07-27T23:42:38.844Z', u'module': u'ns_memcached', u'tstamp': 1564296158844, u'type': u'info'}
I also see a memcached crash on 172.23.105.47.
{u'node': u'ns_1@172.23.105.47', u'code': 0, u'text': u"Service 'memcached' exited with status 134. Restarting. Messages:\n2019-07-27T23:42:38.784342-07:00 CRITICAL /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f4c6e1e2000+0x8f213]\n2019-07-27T23:42:38.784356-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f4c68f6c000+0x70842]\n2019-07-27T23:42:38.784366-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f4c68f6c000+0xee6eb]\n2019-07-27T23:42:38.784378-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f4c68f6c000+0x13ca45]\n2019-07-27T23:42:38.784392-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f4c68f6c000+0x13cf0d]\n2019-07-27T23:42:38.784399-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f4c68f6c000+0x1362ef]\n2019-07-27T23:42:38.784404-07:00 CRITICAL /opt/couchbase/bin/../lib/libplatform_so.so.0.1.0() [0x7f4c7007d000+0x8f27]\n2019-07-27T23:42:38.784410-07:00 CRITICAL /lib64/libpthread.so.0() [0x7f4c6daad000+0x7dd5]\n2019-07-27T23:42:38.784443-07:00 CRITICAL /lib64/libc.so.6(clone+0x6d) [0x7f4c6d6e0000+0xfdead]\n[*** LOG ERROR ***] [2019-07-27 23:42:38] [spdlog_file_logger] async log: thread pool doesn't exist anymore", u'shortText': u'message', u'serverTime': u'2019-07-27T23:42:38.838Z', u'module': u'ns_log', u'tstamp': 1564296158838, u'type': u'info'}
cbcollect_info attached from all the nodes in the cluster.
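
For triage, the unsymbolized frames in the memcached crash message can be reduced to (library, offset) pairs that a symbolizer can consume alongside the matching debug symbols. The following is only a minimal sketch: the helper names `parse_frames` and `signal_from_exit_status` are hypothetical and not part of testrunner or Couchbase, and the frame format is assumed to match the `path(symbol) [0xBASE+0xOFFSET]` shape seen in the log above. It also assumes the usual shell convention that an exit status above 128 encodes 128 plus the terminating signal number, so 134 corresponds to signal 6 (SIGABRT on Linux).

```python
import posixpath
import re

# Matches frames such as:
#   /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f4c68f6c000+0x70842]
#   /lib64/libc.so.6(clone+0x6d) [0x7f4c6d6e0000+0xfdead]
FRAME_RE = re.compile(
    r"(?P<path>/\S+?)\((?P<symbol>[^)]*)\)\s+"
    r"\[0x(?P<base>[0-9a-fA-F]+)\+0x(?P<offset>[0-9a-fA-F]+)\]"
)


def parse_frames(text):
    """Return (normalized library path, symbol text, offset) for each frame."""
    frames = []
    for match in FRAME_RE.finditer(text):
        frames.append((
            posixpath.normpath(match.group("path")),  # collapse "bin/../lib"
            match.group("symbol"),                    # empty when unsymbolized
            int(match.group("offset"), 16),           # offset from load base
        ))
    return frames


def signal_from_exit_status(status):
    """Assumes the 128 + N convention; returns None for ordinary exit codes."""
    return status - 128 if status > 128 else None


# Two frames copied from the crash message in this ticket.
sample = (
    "CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f4c68f6c000+0x70842]\n"
    "CRITICAL /lib64/libc.so.6(clone+0x6d) [0x7f4c6d6e0000+0xfdead]\n"
)

for path, symbol, offset in parse_frames(sample):
    print(path, symbol or "?", hex(offset))
print("signal:", signal_from_exit_status(134))
```

The resulting offsets are relative to each library's load address, so they are only meaningful when resolved against the binaries and debug symbols of the exact build under test (6.5.0-3883-enterprise here).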