Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: 7.1.2
Affects Version/s: 6.5.2, 6.6.5, 7.0.3, 7.1.0
Component/s: couchbase-bucket
Labels:

Triage:
Triaged
Story Points:
1
Is this a Regression?:
No
Sprint:
KV May 22

Description

Found while working on MB-51689. This issue is similar to MB-51606 in terms of impact/workaround.

This issue applies only to memory snapshots, because the consumer always acks the snapshot end of a fully persisted disk snapshot.

Consider a replica receiving a memory snapshot as follows:

[1:Prepare(keyA), 2:Mutation(keyB)]

The replica receives this snapshot as three DCP messages:

1. SnapshotMarker 1-2 with flag Memory (and perhaps Checkpoint, but that is orthogonal to this issue)
2. DcpPrepare for keyA with seqno 1
3. DcpMutation for keyB with seqno 2

The flusher may run after the replica processes message 2 (the DcpPrepare) or message 3 (the DcpMutation). In this scenario the flusher runs after message 2, so the Checkpoint state is as follows:

[1:Prepare(keyA) (HPS = 1, SnapshotStart = 1, SnapshotEnd = 2)]

The flusher sees that this item is not the end of a snapshot range and so it does not attempt to persist the HPS value of the Checkpoint. The node then restarts. The HPS value loaded into memory is 0, which is correct as the replica has not yet received a full snapshot. The prepare is loaded into memory so the PDM has the correct state. On stream resumption the active now sends the following:
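The flusher behaviour described above can be modelled roughly as follows. This is a simplified sketch, not the kv_engine implementation: `Checkpoint`, `Disk` and `flush` are hypothetical names, and the rule "persist the HPS only when the flushed batch reaches the snapshot end" is taken from the description in this ticket.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    snapshot_start: int
    snapshot_end: int
    hps: int = 0  # high prepared seqno, as tracked in memory


@dataclass
class Disk:
    seqnos: list = field(default_factory=list)
    hps: int = 0  # value that would be loaded after a restart


def flush(checkpoint, batch_seqnos, disk):
    """Persist a batch of items; persist the HPS only at the snapshot end."""
    disk.seqnos.extend(batch_seqnos)
    if batch_seqnos and batch_seqnos[-1] == checkpoint.snapshot_end:
        disk.hps = checkpoint.hps


checkpoint = Checkpoint(snapshot_start=1, snapshot_end=2, hps=1)
disk = Disk()

# The flusher runs after the prepare at seqno 1 only (the case in this ticket).
flush(checkpoint, [1], disk)
hps_after_partial_flush = disk.hps

# Had the mutation at seqno 2 also been flushed, the HPS would be persisted.
flush(checkpoint, [2], disk)
hps_after_full_flush = disk.hps
```

In this model a restart between the two flushes would load an HPS of 0 even though the prepare at seqno 1 is on disk, which matches the state described above.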

SnapshotMarker 1-2 with flag Memory (and perhaps Checkpoint, but that is orthogonal to this issue)
DcpMutation for keyB with seqno 2

At this point the PassiveStream would normally notify the PDM of the snapshot end, as it has received the final item and the snapshot contained a prepare. Whether the snapshot contained a prepare is tracked in an in-memory variable, though, so the value is reset by the restart, and a PassiveStream processing these messages does not notify the PDM of the snapshot end because it has not seen a prepare. Disk snapshots always notify the PDM, so they are not affected. As the PDM is not notified, the prepare at seqno 1 is not acked back to the Producer and this node cannot contribute towards its commit. A subsequent prepare should fix this issue, as the PassiveStream should notify the PDM at the end of the snapshot that contains it.
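The sequence above can be reproduced with a small model. This is a sketch under assumptions, not the real classes: `PDM`, `PassiveStream`, `seen_prepare` and the method names are hypothetical stand-ins, and the model assumes the "seen a prepare" flag is reset on each snapshot marker and that warmup reloads the persisted prepare into the PDM.

```python
class PDM:
    """Toy passive durability monitor: tracks prepares and acks them."""

    def __init__(self, tracked=()):
        self.tracked = list(tracked)
        self.acked = []

    def add_prepare(self, seqno):
        self.tracked.append(seqno)

    def notify_snapshot_end(self, snap_end):
        # Ack every tracked prepare covered by the completed snapshot.
        self.acked.extend(s for s in self.tracked if s <= snap_end)
        self.tracked = [s for s in self.tracked if s > snap_end]


class PassiveStream:
    def __init__(self, pdm):
        self.pdm = pdm
        self.snap_end = 0
        self.seen_prepare = False  # in-memory only, so lost on restart

    def snapshot_marker(self, start, end):
        self.snap_end = end
        self.seen_prepare = False  # assumption: reset per snapshot

    def receive(self, seqno, kind):
        if kind == "prepare":
            self.pdm.add_prepare(seqno)
            self.seen_prepare = True
        # Memory snapshots notify the PDM only if a prepare was seen.
        if seqno == self.snap_end and self.seen_prepare:
            self.pdm.notify_snapshot_end(self.snap_end)


# Before the restart: marker 1-2, then the prepare at seqno 1.
stream = PassiveStream(PDM())
stream.snapshot_marker(1, 2)
stream.receive(1, "prepare")

# Restart: the prepare is reloaded into a new PDM, the stream flag is not.
pdm = PDM(tracked=[1])
stream = PassiveStream(pdm)
stream.snapshot_marker(1, 2)
stream.receive(2, "mutation")
acked_after_resume = list(pdm.acked)  # prepare at seqno 1 is stuck

# A later durable write arrives in its own snapshot and unsticks it.
stream.snapshot_marker(3, 3)
stream.receive(3, "prepare")
acked_after_second_prepare = list(pdm.acked)
```

In this model the resumed snapshot acks nothing, and the later prepare at seqno 3 causes both prepares to be acked, which is the behaviour the workaround relies on.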

Workaround
Perform another durable write to the affected vBucket and the original one will be "unstuck".

Solution

The first thought is to notify the PDM for all snapshots. We did not do that originally, perhaps to avoid a performance issue, so we could experiment with such a fix but may need to do something else.
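As a rough illustration of the idea, the notify decision could be expressed as below. The function and parameter names are hypothetical and this is not the fix that was committed; it only shows how notifying for all snapshots would cover the post-restart case, and says nothing about the possible performance cost.

```python
def should_notify_pdm(is_snapshot_end, seen_prepare, notify_all_snapshots):
    """Decide whether the stream notifies the PDM for a memory snapshot."""
    if not is_snapshot_end:
        return False
    # Behaviour described in this ticket: notify only if a prepare was seen.
    # Candidate change: notify at every snapshot end.
    return notify_all_snapshots or seen_prepare


# Post-restart case from this ticket: snapshot end reached, flag was reset.
current_behaviour = should_notify_pdm(True, False, notify_all_snapshots=False)
candidate_behaviour = should_notify_pdm(True, False, notify_all_snapshots=True)
```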