In certain scenarios, KV will purge delete operations (tombstones). A DCP consumer that tries to resume a DCP stream from a sequence number smaller than the purge sequence number is asked to roll back to 0, since KV no longer has the deletes needed to reach eventual consistency.
Analytics ignores the rollback response and keeps trying to reconnect, since it does not see a branch in the failover log and is not aware of any other reason to roll back.
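The interaction can be sketched as a toy model. The function names, the purge-triggered rollback rule, and the failover-log check below are illustrative simplifications, not the real DCP client or KV API:

```python
# Hypothetical sketch: KV answers a stream request that resumes below the
# purge seqno with a rollback to 0; Analytics only expects a rollback when
# its vbucket UUID has branched off the producer's failover log.

PURGE_SEQNO = 1000  # illustrative value

def producer_stream_request(start_seqno, purge_seqno=PURGE_SEQNO):
    """KV cannot replay purged deletes, so any resume point below the
    purge sequence number is answered with a rollback to 0."""
    if start_seqno < purge_seqno:
        return ("rollback", 0)
    return ("ok", start_seqno)

def consumer_sees_failover_branch(consumer_uuid, producer_failover_log):
    """Analytics only recognizes a rollback reason when its stored vbucket
    UUID is absent from the producer's failover log (a history branch)."""
    return all(uuid != consumer_uuid for uuid, _ in producer_failover_log)

# Resuming from seqno 500 while KV purged up to 1000 -> rollback to 0.
assert producer_stream_request(500) == ("rollback", 0)

# But the failover log did not branch, so Analytics sees no reason to
# roll back, ignores the response, and retries the same request.
failover_log = [(0xABC, 0)]
assert consumer_sees_failover_branch(0xABC, failover_log) is False
```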
There are a few problems with the current behavior:
Analytics uses a single stream to feed multiple datasets. On creation of a new dataset, Analytics opens a stream from sequence number 0. The new dataset stores all mutations, while existing datasets ignore mutations until the stream reaches the sequence number recorded in their DCP state.
This can lead to inconsistent data in the following scenario.
1. Dataset1 stopped at sequence number 500.
2. The DCP stream is disconnected.
3. Dataset2 is created.
4. The DCP producer purges up to sequence number 1000.
5. Analytics opens the stream from sequence number 0 (because of Dataset2).
6. Once the stream reaches 500, Dataset1 starts applying mutations, not knowing that deletes between 501 and 1000 have been purged.
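The inconsistency above can be demonstrated with a toy model. The `apply` and `purged_backfill` helpers and the three-field mutation tuples are invented for illustration and do not correspond to real Analytics or KV code:

```python
def apply(keys, stream, from_seqno):
    # Apply DCP mutations to a dataset's key set, skipping seqnos the
    # dataset has already seen (it is still catching up to its DCP state).
    for seqno, key, is_delete in stream:
        if seqno <= from_seqno:
            continue
        (keys.discard if is_delete else keys.add)(key)
    return keys

def purged_backfill(history, purge_seqno):
    # A backfill from 0 after a purge sends each document's latest
    # version; tombstones at or below purge_seqno are gone entirely.
    latest = {}
    for seqno, key, is_delete in history:
        latest[key] = (seqno, is_delete)
    return sorted((seqno, key, is_delete)
                  for key, (seqno, is_delete) in latest.items()
                  if not (is_delete and seqno <= purge_seqno))

# Full history: "a" is created, then deleted at seqno 600; "b" is created.
history = [(100, "a", False), (600, "a", True), (800, "b", False)]

# Dataset1 streamed up to seqno 500 before the disconnect, so it holds "a".
ds1 = apply(set(), [m for m in history if m[0] <= 500], 0)
assert ds1 == {"a"}

# KV purges deletes up to 1000, then Analytics re-opens the stream from 0.
backfill = purged_backfill(history, purge_seqno=1000)

# Dataset2 (new) applies everything from 0 and matches the true live state.
ds2 = apply(set(), backfill, 0)
assert ds2 == {"b"}

# Dataset1 ignores seqnos <= 500 and never learns "a" was deleted.
apply(ds1, backfill, 500)
assert ds1 == {"a", "b"}  # inconsistent with ds2
```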
When two datasets have inconsistent DCP states, the current strategy is: catch up, close the stream, and re-open. In the example above, at step 5, we open the stream from 0 to 500. Once that stream ends, we open a new stream from 501 to infinity. Upon requesting that new stream, KV responds with a rollback to sequence number 0.
There is quite a lot of wasted effort here. To mitigate it, before starting the stream from 0, we will ask KV for the purge sequence number and perform any required rollbacks proactively.
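The proactive check might look like the sketch below. This is a simplification under the assumption that the purge sequence number has already been fetched from KV (in reality it would be queried per vbucket); `proactive_rollback` is a hypothetical name:

```python
def proactive_rollback(dataset_states, purge_seqno):
    # Any dataset whose resume point is below the purge seqno would be
    # told to roll back to 0 by KV anyway, so reset it before connecting
    # instead of streaming 0..N first and only then learning about it.
    return {name: (0 if resume < purge_seqno else resume)
            for name, resume in dataset_states.items()}

states = {"dataset1": 500, "dataset2": 0}

# Purge reached 1000: dataset1's resume point (500) is unusable.
assert proactive_rollback(states, purge_seqno=1000) == {"dataset1": 0, "dataset2": 0}

# Purge stopped at 400: dataset1 can still resume from 500 safely.
assert proactive_rollback(states, purge_seqno=400) == {"dataset1": 500, "dataset2": 0}
```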
When the purge sequence number is large enough, the DCP consumer is expected to stream from 0 all the way to the purge sequence number without any connection drops. See https://issues.couchbase.com/browse/MB-27800
If the connection drops, the DCP producer will ask the consumer to start over. Why is this especially bad for Analytics?
When Analytics streams a vbucket, it hashes its mutations across multiple nodes. Each of these nodes receives mutations from all vbuckets and puts them together in an LSM index, flushing components to disk when memory is full.
When Analytics detects the need for a rollback, it removes components from the end of the index until everything remaining is at or below the rollback point, and then re-opens the stream.
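A minimal sketch of the component-trimming step, assuming each component is summarized by a `(min_seqno, max_seqno)` range (an invented representation, not the real component metadata):

```python
def rollback(components, rollback_seqno):
    # Components are (min_seqno, max_seqno) ranges, oldest first. Drop
    # from the newest end until the remainder is at or below the point;
    # the stream is then re-opened from the surviving state, so any
    # mutations inside a dropped component get re-fetched.
    while components and components[-1][1] > rollback_seqno:
        components.pop()
    return components

# Components flushed as memory filled up:
comps = [(0, 400), (401, 800), (801, 1200)]
assert rollback(list(comps), rollback_seqno=500) == [(0, 400)]

# A rollback to 0 discards every component, which is why a forced
# rollback to 0 across all 1024 vbuckets is so expensive.
assert rollback(list(comps), rollback_seqno=0) == []
```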
In this scenario, we roll back all 1024 vbuckets to 0 and restart the operation for every one of them. So essentially, to get out of this situation, we need to successfully stream all 1024 vbuckets from 0 to the purge sequence number without interruption.
There is no proposed solution for this situation yet. Several options exist, but none is particularly attractive.
Testing purge scenarios is not easy. As of now, the only way to test is to set a low metadata purge interval (with a minimum of 1 hour), wait for that period, and then trigger manual compaction.
This is not ideal. Is there a better way?