Description
In our alter-index-replica-p2 test suite, we intermittently see a rebalance failure during cleanup. It shows up most frequently in the partial-index-rollback test case, but we also see it intermittently elsewhere. I have not been able to reproduce it locally, but it appears consistently in Jenkins. The common steps in the Jenkins runs are:
we delete the default bucket
and then we remove all the nodes from the cluster (cluster setup is 4 nodes: 1 node kv,index,n1ql; 1 node kv; 2 nodes index), as sketched below
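For context, here is a minimal sketch of what that cleanup amounts to against the cluster-manager REST API. It is an approximation for illustration only: the real suite drives this through testrunner's rest_client, and the node lists, credentials, and use of the requests library below are assumptions rather than the actual test code.
{code:python}
# Hedged sketch of the cleanup sequence: delete the default bucket, then
# rebalance the remaining nodes out of the cluster and poll for completion.
# Credentials and node lists are placeholders; they vary per Jenkins run.
import time
import requests

MASTER = "http://172.23.105.103:8091"
AUTH = ("Administrator", "password")  # placeholder credentials

# Step 1: delete the default bucket.
requests.delete(MASTER + "/pools/default/buckets/default", auth=AUTH)

# Step 2: rebalance the non-master nodes out of the cluster.
known = ("ns_1@172.23.105.103,ns_1@172.23.105.105,"
         "ns_1@172.23.105.106,ns_1@172.23.105.109")
ejected = "ns_1@172.23.105.109,ns_1@172.23.105.106"
requests.post(MASTER + "/controller/rebalance", auth=AUTH,
              data={"knownNodes": known, "ejectedNodes": ejected})

# Step 3: poll until the rebalance finishes; the intermittent failure surfaces
# here as status 'none' plus an errorMessage, as in the output quoted below.
time.sleep(2)  # give the rebalance a moment to start
while True:
    progress = requests.get(MASTER + "/pools/default/rebalanceProgress",
                            auth=AUTH).json()
    if progress.get("status") != "running":
        print(progress)
        break
    time.sleep(5)
{code}
The rebalance request itself is asynchronous, so the failure only becomes visible when polling for progress, which is where the test picks up the errorMessage shown below.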
During this removal we see the error below (logs are attached; the test that runs after this cleanup fails in its setup for the same reason, so the attached logs are the only ones we have; check timestamp 14:03:xx):
{u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try again.'} - rebalance failed
[2019-11-22 14:03:35,232] - [rest_client:3330] INFO - Latest logs from UI on 172.23.105.103:
[2019-11-22 14:03:35,232] - [rest_client:3331] ERROR - {u'node': u'ns_1@172.23.105.103', u'code': 0, u'text': u'Rebalance exited with reason {service_rebalance_failed,index,\n {worker_died,\n {\'EXIT\',<0.13968.0>,\n {rebalance_failed,\n }}}}.\nRebalance Operation Id = 255ab294382df383598eb2aa71e26bc7', u'shortText': u'message', u'serverTime': u'2019-11-22T14:03:31.715Z', u'module': u'ns_orchestrator', u'tstamp': 1574460211715, u'type': u'critical'}
[2019-11-22 14:03:35,233] - [rest_client:3331] ERROR -
[2019-11-22 14:03:35,233] - [rest_client:3331] ERROR -
{u'node': u'ns_1@172.23.105.103', u'code': 102, u'text': u'Client-side error-report for user "Administrator" on node \'ns_1@172.23.105.103\':\nUser-Agent:Python-httplib2/$Rev: 259 $\nStarting rebalance from test, ejected nodes [u\'ns_1@172.23.105.109\', u\'ns_1@172.23.105.106\']', u'shortText': u'client-side error report', u'serverTime': u'2019-11-22T14:03:25.167Z', u'module': u'menelaus_web', u'tstamp': 1574460205167, u'type': u'warning'}
[2019-11-22 14:03:35,233] - [rest_client:3331] ERROR -
{u'node': u'ns_1@172.23.105.103', u'code': 11, u'text': u'Deleted bucket "default"\n', u'shortText': u'message', u'serverTime': u'2019-11-22T14:03:23.091Z', u'module': u'menelaus_web', u'tstamp': 1574460203091, u'type': u'info'}
[2019-11-22 14:03:35,233] - [rest_client:3331] ERROR -
{u'node': u'ns_1@172.23.105.103', u'code': 0, u'text': u'Shutting down bucket "default" on \'ns_1@172.23.105.103\' for deletion', u'shortText': u'message', u'serverTime': u'2019-11-22T14:03:21.434Z', u'module': u'ns_memcached', u'tstamp': 1574460201434, u'type': u'info'}
[2019-11-22 14:03:35,234] - [rest_client:3331] ERROR -
{u'node': u'ns_1@172.23.105.103', u'code': 0, u'text': u'Service \'indexer\' exited with status 143. Restarting. Messages:\n2019-11-22T14:03:11.667-08:00 [Info] [Queryport-connpool:172.23.105.103:9101] active conns 0, free conns 1\n2019-11-22T14:03:11.667-08:00 [Info] [Queryport-connpool:172.23.105.109:9101] active conns 0, free conns 1\n2019-11-22T14:03:11.701-08:00 [Info] [Queryport-connpool:172.23.105.106:9101] active conns 0, free conns 1\n2019-11-22T14:03:14.072-08:00 [Info] serviceChangeNotifier: received PoolChangeNotification\n2019-11-22T14:03:14.095-08:00 [Info] Refreshing indexer list due to cluster changes or auto-refresh.\n2019-11-22T14:03:14.095-08:00 [Info] Refreshed Indexer List: [172.23.105.103:9100 172.23.105.106:9100 172.23.105.109:9100]\n2019-11-22T14:03:14.098-08:00 [Info] switched currmeta from 20 -> 20 force true \n2019-11-22T14:03:21.086-08:00 [Info] [Queryport ":9101"] connection 172.23.105.103:46782 closed\n2019-11-22T14:03:21.086-08:00 [Error] PeerPipe.doRecieve() : ecounter error when received mesasage from Peer 172.23.105.103:50382. Error = EOF. Kill Pipe.\n2019-11-22T14:03:21.087-08:00 [Info] messageListener.start(): message channel closed. Remove peer 7d:40:6:d3:3d:bb:b4:fd as follower.\n', u'shortText': u'message', u'serverTime': u'2019-11-22T14:03:21.258Z', u'module': u'ns_log', u'tstamp': 1574460201258, u'type': u'info'}
[2019-11-22 14:03:35,234] - [rest_client:3331] ERROR -
{u'node': u'ns_1@172.23.105.103', u'code': 0, u'text': u'Rebalance completed successfully.\nRebalance Operation Id = bacd046b40bdddbd3354d9f110ab37f4', u'shortText': u'message', u'serverTime': u'2019-11-22T14:03:06.925Z', u'module': u'ns_orchestrator', u'tstamp': 1574460186925, u'type': u'info'}
[2019-11-22 14:03:35,234] - [rest_client:3331] ERROR -
{u'node': u'ns_1@172.23.105.103', u'code': 0, u'text': u'Bucket "default" rebalance appears to be swap rebalance', u'shortText': u'message', u'serverTime': u'2019-11-22T14:03:06.467Z', u'module': u'ns_vbucket_mover', u'tstamp': 1574460186467, u'type': u'info'}
[2019-11-22 14:03:35,234] - [rest_client:3331] ERROR -
{u'node': u'ns_1@172.23.105.103', u'code': 5, u'text': u"Node 'ns_1@172.23.105.103' saw that node 'ns_1@172.23.105.105' went down. Details: [{nodedown_reason,\n connection_closed}]", u'shortText': u'node down', u'serverTime': u'2019-11-22T14:03:06.399Z', u'module': u'ns_node_disco', u'tstamp': 1574460186399, u'type': u'warning'}
[2019-11-22 14:03:35,235] - [rest_client:3331] ERROR -
{u'node': u'ns_1@172.23.105.106', u'code': 5, u'text': u"Node 'ns_1@172.23.105.106' saw that node 'ns_1@172.23.105.105' went down. Details: [{nodedown_reason,\n connection_closed}]", u'shortText': u'node down', u'serverTime': u'2019-11-22T14:03:06.371Z', u'module': u'ns_node_disco', u'tstamp': 1574460186371, u'type': u'warning'}
In multiple test cases where we see this failure, the test itself fails over a node and then rebalances that node out (a rough sketch of that sequence is below); perhaps that will help narrow down the issue here?
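For reference, a hedged sketch of that failover + rebalance-out pattern via the same REST endpoints is below; the victim node, credentials, and node lists are placeholders rather than the actual test parameters.
{code:python}
# Hedged sketch of the failover + rebalance-out pattern used by the affected
# tests; node names and credentials are illustrative only.
import requests

MASTER = "http://172.23.105.103:8091"
AUTH = ("Administrator", "password")  # placeholder credentials
victim = "ns_1@172.23.105.105"        # example node the test fails over

# Hard failover of the node under test.
requests.post(MASTER + "/controller/failOver", auth=AUTH,
              data={"otpNode": victim})

# Rebalance the failed-over node out of the cluster. If the index service's
# rebalance worker dies at this point, ns_orchestrator reports the
# service_rebalance_failed / worker_died error quoted above.
known = ("ns_1@172.23.105.103,ns_1@172.23.105.105,"
         "ns_1@172.23.105.106,ns_1@172.23.105.109")
requests.post(MASTER + "/controller/rebalance", auth=AUTH,
              data={"knownNodes": known, "ejectedNodes": victim})
{code}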
We have seen this rebalance failure during cleanup in builds as far back as 6.5.0-4471.