This happens because janitor_agent can get stuck waiting for:
*) the tap connection "ping" (which we do in order to discover and clean up dead connections)
*) a stuck vbucket filter change request (which is sent to the "other" side, i.e. a non-local memcached)
The corresponding ebucketmigrator can get stuck there too.
So unresponsiveness of a single node can cause this critical component to be stuck on all other nodes. We cannot activate any vbuckets without first stopping replication into them, and that requires:
*) the janitor agent not being stuck
*) the corresponding ebucketmigrators not being stuck
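To illustrate the stall described above, here is a minimal sketch in Python (not ns_server's Erlang, and not the actual fix). It assumes a hypothetical `ping_peer` standing in for the tap "ping" or the filter change request, and a hypothetical `probe_peers` that bounds each wait with a timeout so that one unresponsive peer is reported instead of blocking the caller indefinitely.

```python
import concurrent.futures
import threading


def ping_peer(responsive, release):
    """Hypothetical stand-in for a tap ping or a vbucket filter change request."""
    if not responsive:
        # Simulate a peer that never answers; it is only unblocked at cleanup.
        release.wait()
    return "ok"


def probe_peers(peers, timeout_s=0.2):
    """Ping every peer, waiting at most timeout_s for each answer.

    peers maps a (hypothetical) node name to whether that node responds.
    Returns a dict of node name -> "ok" or "timeout".
    """
    release = threading.Event()
    results = {}
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(peers))
    try:
        futures = {
            name: pool.submit(ping_peer, responsive, release)
            for name, responsive in peers.items()
        }
        for name, future in futures.items():
            try:
                results[name] = future.result(timeout=timeout_s)
            except concurrent.futures.TimeoutError:
                # Without the timeout, the caller would wait here forever.
                results[name] = "timeout"
    finally:
        # Unblock the simulated stuck peers so the worker threads can exit.
        release.set()
        pool.shutdown(wait=True)
    return results
```

Calling `probe_peers({"node_a": True, "node_b": False})` marks `node_b` as timed out while still returning a result for `node_a`. A caller that waited on `node_b` with no timeout would never get that far.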
I've just revisited this problem. Ideally the fix would be made with support from the ep-engine side, which could be done as part of the UPR work.
Without ep-engine support, the fix will require significant changes in ns_server, which are harder to make right now, particularly because of 1.8.x backwards compatibility support. That would be doable, but would take at least several days of work.
Gerrit Dashboard for MB-8039:
|29051,3|MB-8039: don't ping tap connections during janitor runs|ns_server|Status: MERGED|+2|+1|