Details
Bug
Resolution: Fixed
Critical
2.0, 2.0.1
Security Level: Public
Yes
02/Sep/2013 - 20/Sep/2013
Description
SUBJ.
This happens because janitor_agent can get stuck waiting for:
*) a "ping" of tap connections (which we do in order to discover and clean up dead connections)
*) a stuck vbucket filter change request (which is sent to the "other" side, i.e. a non-local memcached)
The corresponding ebucketmigrator can get stuck there too.
So the unresponsiveness of one node can cause this critical component on all other nodes to get stuck. We cannot activate any vbuckets without stopping replication into them, and that requires:
*) the janitor agent not being stuck
*) the corresponding ebucketmigrators not being stuck
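To illustrate the blocking behavior described above, here is a toy Python model; it is not ns_server code (which is Erlang). All names in it (`ping`, `janitor_run`, `make_connections`, `run_is_stuck`, `node_a`, `node_b`) are hypothetical. Each connection is modeled as a queue of replies, and the unresponsive peer is a queue that never receives one. The timeout is used only to contrast a bounded wait with an unbounded one; it is not a description of the merged change, which per its subject line stops pinging tap connections during janitor runs.

```python
import queue
import threading
from typing import Dict, List, Optional


def ping(replies: queue.Queue, timeout: Optional[float] = None) -> bool:
    """Wait for a reply from a peer. timeout=None waits forever."""
    try:
        replies.get(timeout=timeout)
        return True
    except queue.Empty:
        return False


def janitor_run(connections: Dict[str, queue.Queue],
                timeout: Optional[float] = None) -> List[str]:
    """Ping every connection and return the names that did not answer."""
    return [name for name, replies in connections.items()
            if not ping(replies, timeout)]


def make_connections() -> Dict[str, queue.Queue]:
    """One healthy peer with a reply queued, one peer that never replies."""
    healthy: queue.Queue = queue.Queue()
    healthy.put("pong")
    unresponsive: queue.Queue = queue.Queue()
    return {"node_a": healthy, "node_b": unresponsive}


def run_is_stuck(timeout: Optional[float], wait: float = 0.5) -> bool:
    """Run the janitor in a daemon thread; report if it is still blocked."""
    worker = threading.Thread(target=janitor_run,
                              args=(make_connections(), timeout),
                              daemon=True)
    worker.start()
    worker.join(wait)
    return worker.is_alive()
```

With `timeout=None` the run never returns because of the single silent peer, so anything that depends on the janitor finishing has to wait as well. With a bounded wait the same run completes and reports `node_b` as dead.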
I've revisited this problem just now. Ideally the fix will be made with support from the ep-engine side, which could be done as part of the UPR work.
Without ep-engine support, the fix will require significant changes in ns_server, which are harder to make right now, particularly because of 1.8.x backwards-compatibility support. That would be doable but would take at least several days of work.
Attachments
For Gerrit Dashboard: MB-8039
# | Subject | Branch | Project | Status | CR | V |
---|---|---|---|---|---|---|
29051,3 | MB-8039: don't ping tap connections during janitor runs | master | ns_server | MERGED | +2 | +1 |