If a replica vBucket is promoted to active, and the last DCP message it received was a Snapshot Marker whose first mutation had been de-duplicated, then the snapshot start of the newly-promoted active's open Checkpoint ends up greater than its high seqno.
Upon the next Flusher run (i.e. the next mutation to the vBucket), the Flusher throws an exception when trying to fetch items, which terminates KV-Engine (as the exception is thrown on a BG thread).
Confirmed to affect 6.6.1 and 6.6.5.
Confirmed to affect 7.0.0 - 7.0.3 (inclusive) - see https://review.couchbase.org/c/kv_engine/+/170373
Confirmed to not affect Neo.
When streaming data from an Active to a Replica vBucket, the extent of the Checkpoint is sent via DCP using a SnapshotMarker message, followed by N Mutation / Deletion messages. The snapshot marker may be discontinuous with the previous one if any de-duplication occurred within the Checkpoint - for example, if document "key" was written enough times in quick succession, one could end up with the following two Checkpoints on the active, and the subsequent DCP SnapshotMarkers sent to the replica:
Note how there are just two mutations remaining (at seqnos 10 and 12), and that there is a seqno "gap" at 11 (ignore meta-items, which are not sent over DCP).
When this is replicated over DCP it will be sent as:
- DCP_SNAPSHOT_MARKER(start:0, end:10, flags=CHK)
- DCP_MUTATION(seqno:10, ...)
- DCP_SNAPSHOT_MARKER(start:12, end:12, flags=CHK)
- DCP_MUTATION(seqno:12, ...)
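The de-duplication step can be sketched as follows. This is a hypothetical Python model (not actual KV-Engine code, which is C++): repeated writes to the same key within one Checkpoint collapse to the latest, so the marker for the second Checkpoint starts at 12 even though the previous marker ended at 10.

```python
# Hypothetical model of Checkpoint de-duplication; names are illustrative,
# not KV-Engine's real types.

def dedup_checkpoint(mutations):
    """Keep only the latest seqno per key, as a Checkpoint de-duplicates
    repeated writes to the same key before items are streamed over DCP."""
    latest = {}
    for seqno, key in mutations:
        latest[key] = seqno  # a later write to the same key supersedes the earlier one
    return sorted(latest.values())

# Checkpoint 2 on the active originally held "key" at seqno 11, then again
# at seqno 12; after de-duplication only seqno 12 survives.
remaining = dedup_checkpoint([(11, "key"), (12, "key")])

# The SnapshotMarker covers only the surviving items: start=12, end=12,
# discontinuous with the previous marker which ended at 10.
marker = {"start": remaining[0], "end": remaining[-1]}
```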
Once these messages are replicated over DCP, the replica vBucket should have state equivalent to the active's.
However, if the last DCP_MUTATION is not received - for example, if the active node is being failed over and the stream is closed before the DCP_MUTATION arrives - then the state of the replica, crucially its open Checkpoint, is as follows:
Note that the second SnapshotMarker being flagged as "CHK" (Checkpoint) is essential - we need the replica to end up creating a new Checkpoint with the start and end controlled by the active - a SnapshotMarker without that flag is insufficient as it just extends the existing checkpoint, increasing the checkpoint end but leaving start unaffected.
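The two behaviours can be contrasted in a small hedged sketch (a hypothetical model, not the real DcpConsumer; the flag value is taken from the DCP protocol's MARKER_FLAG_CHK but should be checked against the spec):

```python
# Hypothetical model of replica-side SnapshotMarker handling, showing why
# the CHK flag matters.

MARKER_FLAG_CHK = 0x04  # assumed value per the DCP protocol docs

def on_snapshot_marker(open_ckpt, start, end, flags):
    """With CHK, the replica opens a fresh Checkpoint whose snapshot range
    is dictated by the active; without it, the existing open Checkpoint is
    merely extended - the end grows, the start is untouched."""
    if flags & MARKER_FLAG_CHK:
        return {"snap_start": start, "snap_end": end}  # new open Checkpoint
    open_ckpt["snap_end"] = max(open_ckpt["snap_end"], end)
    return open_ckpt
```

With CHK set, a marker (12,12) arriving after a (0,10) checkpoint yields an open Checkpoint with snap_start=12; without the flag, the same marker would only have moved snap_end to 12, leaving snap_start at 0 and the bug unreachable.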
When this sequence occurs, the seqno range (11,11) of the open Checkpoint is below the snapshot range (12,12). This is problematic because it breaks an invariant on Checkpoints - that all items within a Checkpoint fall between its snapshot start and end.
This doesn't immediately cause a problem, but if this vBucket is promoted to active and starts accepting mutations itself, it will generate seqnos from the last seqno received - 10 in this case. The next mutation is therefore assigned seqno 11; when the Flusher is woken and attempts to flush it, an exception is thrown on the BG thread and KV-Engine crashes:
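The failure sequence can be condensed into a hedged sketch. This is a hypothetical Python model, not KV-Engine's actual classes, and it compresses the check into the queueing step, whereas in reality the exception fires later, when the Flusher fetches items:

```python
# Hypothetical sketch of the broken invariant: every item in a Checkpoint
# must satisfy snapshot_start <= seqno <= snapshot_end.

class OpenCheckpoint:
    def __init__(self, snap_start, snap_end):
        self.snap_start = snap_start
        self.snap_end = snap_end
        self.items = []

    def queue(self, seqno):
        # Models the invariant the Flusher effectively relies on; violating
        # it is what throws on the BG thread in the real system.
        if not (self.snap_start <= seqno <= self.snap_end):
            raise RuntimeError(
                f"seqno {seqno} outside snapshot "
                f"[{self.snap_start},{self.snap_end}]")
        self.items.append(seqno)

# The replica received only DCP_SNAPSHOT_MARKER(start:12, end:12, flags=CHK);
# its high seqno is still 10. After promotion, the next mutation gets 11:
ckpt = OpenCheckpoint(12, 12)
last_seqno = 10
try:
    ckpt.queue(last_seqno + 1)  # seqno 11 < snapshot start 12 -> exception
except RuntimeError as e:
    print(e)
```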
In theory this scenario seems reasonably easy to hit - one just needs at least one seqno to be de-duplicated (i.e. two mutations to the same key within the same Checkpoint), followed by a failover where just the SnapshotMarker is received without any of the mutations it describes. However, in practice I have not been able to trigger it on a full running cluster with default config values (yet).
This is most likely due to the requirement that the last SnapshotMarker received (before the stream is closed) must have the "CHK" flag set - i.e. it must represent a newly-created Checkpoint. In 6.6.1 a new Checkpoint is only created under certain criteria (see CheckpointManager::isCheckpointCreationForHighMemUsage_UNLOCKED and CheckpointManager::checkOpenCheckpoint_UNLOCKED) - the main ones being one of:
- The current Checkpoint has 10,000 items in it, or
- The current Checkpoint has at least 1 item and 5 seconds have passed since it was created.
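The two criteria above can be sketched as a simple predicate (a hypothetical model with illustrative names and the default limits quoted above, not the actual CheckpointManager API):

```python
# Hypothetical sketch of the 6.6.1 checkpoint-creation criteria.

MAX_CHECKPOINT_ITEMS = 10_000  # default item limit per the description above
MAX_CHECKPOINT_AGE_S = 5       # default checkpoint period in seconds

def should_create_new_checkpoint(num_items, age_seconds):
    """A new open Checkpoint is started when the current one is full, or
    when it is old enough and non-empty."""
    if num_items >= MAX_CHECKPOINT_ITEMS:
        return True
    if num_items >= 1 and age_seconds >= MAX_CHECKPOINT_AGE_S:
        return True
    return False
```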
These criteria somewhat work against the needed scenario to trigger the bug:
- For a new Checkpoint to be created because the current one holds a large number of items (10,000), it becomes increasingly unlikely that de-duplication would remove exactly the first mutation(s) of the Checkpoint.
- For a new Checkpoint to be created after 5 seconds while de-duplication also occurs, the same key must be modified within a small window, with otherwise little-to-no other traffic.
However, the problem has been observed on a live running cluster, where it affected 4 out of ~140 vBuckets when a node was failed over - so it is clearly possible to hit, given the "right" environment.