From looking at the logs, this issue is caused by the rebalance status loop exiting early because the server returned an error. The logs show the following message while the rebalance status loop is running:
2018/08/15 23:59:06 Unsolicited response received on idle HTTP channel starting with "HTTP/1.1 400 Bad Request\r\nServer: MochiWeb/1.0 (Any of you quaids got a smint?)\r\nDate: Wed, 15 Aug 2018 23:59:06 GMT\r\nContent-Length: 0\r\n\r\n"; err=<nil>
The loop exits, and when we then check whether a rebalance is still needed, ns_server reports that it is, because the rebalance is still in progress.
time="2018-08-15T23:59:10Z" level=error msg="failed to reconcile: Failed to rebalance: cluster reports rebalance incomplete" cluster-name=test-couchbase-mjwqv module=cluster
The problem is that we need better error checking in the rebalance status loop. If the call to check the status fails, we need to check another node. Even if all nodes fail, we should still retry for a certain amount of time (maybe 60 seconds) before giving up. Below is a link to the code that needs to be improved.
Note also that we may be unable to check the status for the full 60 seconds. In that case we should either skip raising an event or raise an event indicating that the rebalance status is unknown.