Thanks Bo! I think this makes it reasonably likely that my theory is the issue that we are encountering here.
One question worth asking is where the memory is being used. In this case nothing is resident, so the HashTables use next to no memory. The other main memory hog is the CheckpointManager, which is using roughly 50% of the quota. We'll have some transient memory usage for flushes, but we never grow a large disk write queue, so this won't be high. I don't think we would have a substantial amount of memory allocated elsewhere in KV. Magma reported 390325151 bytes (roughly 390 MB) of memory usage at the end of this test.
Dropping "cursor_dropping_upper_mark" to 90 solves this issue because it allows us to free memory (by dropping cursors) before we hit the threshold at which we stop taking new mutations on the active vBuckets. In a way, it's desirable that we normally stop mutations before dropping cursors, as this allows replicas to try to catch up. That reasoning holds if the streams that these cursors belong to are in memory or are very near the end of a disk backfill. If the cursors are at the start of a disk backfill, though, then for the sake of availability it's less desirable. I'm not sure we want to change the default of this config value permanently.
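To make the ordering concrete, here's a rough sketch of how I'm thinking about the two thresholds. It's a toy model, not kv_engine code: the function name, the ">=" comparisons and the action strings are all made up for illustration, I'm taking 93% of quota as the point where we stop accepting mutations, and the 95 used below is just a stand-in for "some upper mark above that threshold".

```python
# Toy model only. Names and comparisons are assumptions, not kv_engine code.
MUTATION_THRESHOLD_PCT = 93.0  # assumed: % of quota at which new mutations are rejected


def actions_at(mem_used_pct, cursor_dropping_upper_mark_pct,
               mutation_threshold_pct=MUTATION_THRESHOLD_PCT):
    """Return the mechanisms that would be active at a given memory usage (% of quota)."""
    actions = []
    if mem_used_pct >= cursor_dropping_upper_mark_pct:
        actions.append("drop_cursors")      # frees checkpoint memory
    if mem_used_pct >= mutation_threshold_pct:
        actions.append("reject_mutations")  # active vBuckets stop taking writes
    return actions


# Upper mark at 90: cursors get dropped while mutations are still accepted.
low_mark = actions_at(91.0, cursor_dropping_upper_mark_pct=90.0)
# Hypothetical upper mark at 95: mutations stop before any cursor is dropped.
high_mark = actions_at(94.0, cursor_dropping_upper_mark_pct=95.0)
```

With the upper mark below the mutation threshold there's a window (90 to 93 in this model) where memory can be recovered without affecting writes. With it above, writes stop first and cursor dropping only kicks in if memory keeps climbing.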
I think the real solution for this is going to be hard limits on the CheckpointManager memory usage to keep us below 93% memory usage (provided the pager can run fast enough when we're above the HWM). This will be done as part of MB-38441. In the meantime, Bo-Chun Wang, I'd recommend adding this config parameter to magma tests if they fail in the same way.