There are multiple scenarios where replica checkpoints might allocate most of the memory on a node in a state where that memory is not releasable. That can result in hard OOM and possible deadlock in scenarios like rebalance or bulk load.
CBSE-8284 is an example of livelock at rebalance. That shows that without on-going mutations we can end-up with replica disk checkpoint being stuck in the open state, which means that we cannot recover all the memory associated with them.
While those scenarios are uncommon on on-premise envs, the system breaks quite quickly on many, small bucket envs if someone attempts simple loads with (eg) low memory quotas and bigger-than-usual doc sizes.
Due to the (current) invariant / assumption there’s always one open checkpoint - hence cannot close the last one (even though we have the last marker) as we don’t know what the seqnos for the next checkpoint are going to be.
If we relaxed that for replicas (which I think makes sense given they are essentially slaved to the active) then we could close the checkpoint as soon as the last mutation arrives - and hence remove that checkpoint once it’s unreferenced.
This only works for disk checkpoints as we need to know checkpoint ends not snap ends.
Force-closing the open checkpoint at replica comes with its own issues, see historical conversation in comments for details.
In the end we solve by allowing ItemExpel to remove all the mutations in checkpoints.
Note that, differently from the original proposal, the ItemExpel fix is wider-scoped and isn't restricted to Disk Checkpoints. So that improves our memory-recovery ability on Memory Checkpoints too and any similar issue caused by those.
|The last item in a replica checkpoint was not expelled. In scenarios such as large average item size, high numbers of replicas or low Bucket quota could result in a data-node entering an unrecoverable Out-of-Memory state.||ItemExpel has been enhanced to release all the items in a checkpoint when memory conditions allow.|