Details
- Task
- Resolution: Fixed
- Major
- 4.5.0
- None
- KV: June 12 - July 2
Description
Background
There have been a number of related MBs (MB-16657, MB-18679, MB-19567, MB-19695) which all revolve around CouchKVStore::getNumPersistedDeletes throwing an exception when a vbucket file doesn't exist on disk.
Each time we've tried to solve the issue, generally by trying not to call getNumPersistedDeletes() when we think a file doesn't exist. However, each time we've hit the issue again (typically with the test passing once or twice, and then failing again).
Note that prior to Watson such an error would be relatively silent: we'd just log a message and return zero. All these issues started happening when we improved our error handling to make "failed to open file" errors explicit, by raising an exception.
At this stage it seems clear that we do not fully understand the states and times at which vBucket (couchstore) files can be absent from disk.
Having all these different (partial / failed) patches to try to address the issue makes the code more confusing to read, and only makes it harder to track down what the underlying root problem is.
For Watson, the plan is to just catch the thrown exception and return '0' for the delete count (as per Sherlock) - see MB-19695.
Task
1. Identify all patches relating to this issue (see the MB list above, which may not be exhaustive).
2. Revert each patch in question, unless it actually fixes a real problem. In other words, we should end up with just the Sherlock code plus a catch that returns zero if getNumPersistedDeletes fails.
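The intended end state (Sherlock behaviour plus a catch that returns zero) might look roughly like the sketch below. This is illustrative only: `FakeKVStore` and `getNumPersistedDeletesOrZero` are hypothetical names, not the real ep-engine classes, and the use of `std::runtime_error` is an assumption, since this ticket does not state which exception type is thrown.

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the on-disk store; NOT the real CouchKVStore.
class FakeKVStore {
public:
    // Pretend a vBucket file exists on disk with this many persisted deletes.
    void addFile(uint16_t vbid, size_t deletes) {
        files[vbid] = deletes;
    }

    // Mimics the Watson behaviour described above: throws when the
    // vBucket file is missing (exception type is an assumption).
    size_t getNumPersistedDeletes(uint16_t vbid) const {
        auto it = files.find(vbid);
        if (it == files.end()) {
            throw std::runtime_error(
                    "getNumPersistedDeletes: failed to open db for vb " +
                    std::to_string(vbid));
        }
        return it->second;
    }

private:
    std::unordered_map<uint16_t, size_t> files;
};

// Caller-side wrapper: catch the exception, log it, and return 0,
// matching the pre-Watson "log and return zero" behaviour.
size_t getNumPersistedDeletesOrZero(const FakeKVStore& store, uint16_t vbid) {
    try {
        return store.getNumPersistedDeletes(vbid);
    } catch (const std::runtime_error& e) {
        std::cerr << "Warning: " << e.what() << "; returning 0\n";
        return 0;
    }
}
```

Keeping the catch in one place at the call site, rather than adding more "does the file exist?" checks before each call, is what lets the earlier partial patches be reverted without reintroducing the failure.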
Issue Links
- blocks
  - MB-19612 4.5.1 Minor Release (Closed)
- relates to
  - MB-16657 dcp-vbtakeover gets stuck on watson resulting in stuck and eventually failed rebalance (Resolved)
  - MB-19695 Rebalance out 2.5.x nodes in online upgrade failed: CouchKVStore::getNumPersistedDeletes failed to open db (Closed)
  - MB-18679 DCP rebalance failed in online upgrade from 2.5.x to 4.5.0-1789 with error supprocess died (Closed)
  - MB-19567 Rebalance fails after delta recovery of a node. (Closed)