Couchbase Server / MB-56611

Apply backpressure to front-end mutations when certain DCP consumers are slow



    Description

      Background

      A DCP client connecting to KV-Engine can have data sent to it from two sources - backfill from disk[*], or in-memory streaming. The intent is that a DCP client will initially receive historical data from disk; then switch to in-memory streaming for "live" data, staying in in-memory mode for the duration. In-memory streaming has much lower latency than waiting for data to first be persisted to disk, so is preferred.

      The high-level KV-Engine flow when a client requests a DCP stream for seqnos 0 to infinity (i.e. all previous data plus all changes yet to occur) is:

      1. Register a Cursor for the in-memory phase - pointing to the earliest Checkpoint available in memory.
      2. Set up and run a backfill from seqno 0 up to the seqno of the registered Cursor.
      3. Switch to in-memory streaming, reading data from Cursor and advancing it along the CheckpointManager as data is sent to the client.
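
      As a rough illustration, the flow could be sketched as follows. This is a minimal sketch with hypothetical types and names (setupStream, scheduleBackfill, transitionToInMemory etc.); it is not the actual KV-Engine code:

          #include <cstdint>

          struct Cursor { uint64_t seqno; };

          struct CheckpointManager {
              uint64_t earliestInMemorySeqno;
              Cursor registerCursor() { return Cursor{earliestInMemorySeqno}; }
          };

          struct DcpProducer {
              void scheduleBackfill(uint64_t start, uint64_t end) { /* stream from disk */ }
              void transitionToInMemory(Cursor c) { /* stream via the cursor */ }
          };

          void setupStream(CheckpointManager& cm, DcpProducer& producer) {
              // 1. Register a cursor at the earliest Checkpoint still in memory.
              Cursor cursor = cm.registerCursor();
              // 2. Backfill from disk covers seqnos [0, cursor.seqno) - the
              //    historical data which is no longer held in memory.
              producer.scheduleBackfill(0, cursor.seqno);
              // 3. Once backfill completes, switch to in-memory streaming,
              //    advancing the cursor as items are sent to the client.
              producer.transitionToInMemory(cursor);
          }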

      Each front-end mutation requires memory inside the CheckpointManager to enqueue the mutation; that memory can only be freed once all Cursors have consumed the item. If the rate of front-end mutations is such that all of the CheckpointManager quota is consumed - i.e. one or more DCP cursors are keeping the memory alive - then KV has a problem: we cannot accept any more mutations as there's no memory to hold them.

      In this case KV-Engine performs Cursor Dropping - the slowest Cursor(s) are discarded, allowing memory to be freed up. This however means that the affected DCP consumer must switch back to backfilling from disk until it has caught up again, at which point it can resume in-memory streaming.
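
      A minimal sketch of that decision, again with hypothetical names (maybeDropCursor etc.) and assuming the slowest cursor is simply the one at the lowest seqno:

          #include <algorithm>
          #include <cstdint>
          #include <vector>

          struct Cursor { uint64_t seqno; bool dropped = false; };

          void maybeDropCursor(std::vector<Cursor>& cursors,
                               uint64_t memUsed, uint64_t quota) {
              if (memUsed < quota || cursors.empty()) {
                  return; // still headroom (or nothing to drop)
              }
              // The cursor furthest behind pins the most checkpoint memory.
              auto slowest = std::min_element(
                      cursors.begin(), cursors.end(),
                      [](const Cursor& a, const Cursor& b) {
                          return a.seqno < b.seqno;
                      });
              // Drop it; the owning DCP client must now backfill from disk
              // until it has caught up with the in-memory checkpoints again.
              slowest->dropped = true;
          }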

      However, cursor dropping can sometimes be problematic for DCP clients - an example from Steve Yen:

      Consider the recent customer proof-of-concept testing on Capella. GSI ends up desperately trying to write to EBS, but I/O is at saturation, and the frontend app test harness is unaware that GSI is out of capacity, so the frontend test load keeps on continuing with its 30K mutations/sec. The cluster stays up for 3 days - and eventually the DCP queue in KV grows too big, the GSI client's seq # is considered too far behind, and KV kills the DCP stream. This leads next to a backfill-from-zero, which leads to even more cascading, disastrous pressure on the system.

      Proposal

      Provide some mechanism for DCP clients to request that they are not cursor dropped - instead KV applies back pressure (tmpOOM) to front-end clients to "slow down" their rate of mutation, such that the DCP client(s) do not fall behind.

      Initial Comments

      From a technical pov I think this is relatively straightforward - we allow GSI / other DCP consumers to opt out of cursor dropping - the feature we have where “slow” DCP consumers (ones which are lagging behind the DCP stream produced by KV) are flipped to disk-based backfilling (and the memory being kept alive by their lagging cursor can be freed). With cursor-dropping disabled, in a slow-GSI case, once the memory usage of the CheckpointManager reached 30% of the Bucket Quota, KV would start applying back pressure to new front-end mutations by returning ETMPFAIL.
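
      A minimal sketch of the proposed admission check, assuming a per-bucket flag recording whether any consumer has opted out of dropping, and the 30% threshold mentioned above (all names hypothetical):

          #include <cstdint>

          enum class Status { Success, Etmpfail };

          struct BucketState {
              uint64_t bucketQuota;        // total Bucket Quota (bytes)
              uint64_t checkpointMemUsed;  // memory held by CheckpointManager
              bool hasProtectedCursor;     // any consumer opted out of dropping?
          };

          Status admitMutation(const BucketState& b) {
              const uint64_t checkpointQuota = b.bucketQuota * 30 / 100; // 30%
              if (b.hasProtectedCursor && b.checkpointMemUsed >= checkpointQuota) {
                  // Back-pressure: the front-end client should retry later,
                  // rather than KV freeing memory by dropping a lagging cursor.
                  return Status::Etmpfail;
              }
              return Status::Success;
          }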

      Note: I initially thought this is what we do today with KV replication connections, to ensure that replicas never fall too far behind the active - but that isn't true; KV does currently support cursor-dropping on replication connections as well.

      However, while this is an interesting idea and one we could certainly explore, there are some potential downsides which we need to ensure we’ve fully understood.

      There are some issues which I can immediately think of:

      1. We are bringing GSI into the “availability” sphere of the product - an issue in GSI could cause an issue in KV. For example, consider a setup with two GSI nodes and replica indexes configured (i.e. each GSI node is maintaining a copy of some index). With such a change, if one of the GSI nodes has a problem causing a slow disk (hardware issue, mis-configuration, disk at capacity etc.) then that could result in it not consuming its DCP streams from KV, which in turn would cause KV to return TMPFAIL to client operations. Essentially the cluster is “down” for KV mutations - even though there’s still one up-to-date copy of the GSI index and so otherwise all is well.
      2. Periodic, bulk-load operations could have their throughput affected, or even error. Consider a weekly, hour-long bulk-load operation (say outside normal business hours) which updates a large number of documents. A cluster may be sized so GSI can only keep up with the “normal” online traffic; during the 1 hour bulk load it ends up lagging behind - at present that is fine, as KV will cursor-drop GSI and switch to backfilling, GSI will continue consuming data more slowly, and once the 1 hour bulk-load completes GSI will catch up and be ready for operations.
        If we make the above change, during such a workload KV would end up applying back-pressure to the bulk-load application. At best that just slows down the bulk load (if temporary failures are handled correctly by the application), although that could be undesirable for the user if the window to load that data is extended more than they desire. At worst the bulk-load may start to fail as it doesn’t expect / cannot handle the temporary failures. Arguably that’s an application bug and should be fixed; but it certainly makes it a challenge to, say, change such cursor-dropping behaviour unilaterally, by default.
      3. Initial index build could cause unavailability of KV front-end operations. Even if in steady-state GSI is capable of keeping up with the KV mutation rate, during the initial index build GSI must receive all current data from KV, which could easily take longer than the CheckpointManager capacity can absorb - such that, if we disabled cursor dropping, the normal application workload would be stopped.

      To (3), consider a system which has:

      • A front-end mutation rate of 1,000 mutations per second, each 1kB in size (i.e. 1MB of incoming data per second).
      • GSI can initially build an index at 10,000 DCP messages per second - i.e. it's capable of 10x the steady-state load.
      • KV with a 3000MB bucket quota (so 1000MB Checkpoint Quota).

      Given the above, it will take 1000 seconds (1000MB Checkpoint Quota / 1MB/s mutation rate) for the CheckpointManager quota to be consumed if a cursor is stationary - i.e. ~16.6 minutes.

      As such, if GSI takes more than 16.6 minutes to read the current state from disk (complete backfill and switch to advancing the in-memory streaming) then we will run out of CheckpointManager memory and either need to cursor drop (current behaviour) or temporarily fail incoming mutations - i.e. the front-end workload will break.

      Note that this wouldn't need that much data - more than 10GB of on-disk data would be sufficient in the example above (GSI reading at 10,000 messages/s x 1kB consumes 10MB/s, so a 10GB backfill takes the full 1000 seconds).
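
      The arithmetic above as a self-contained check (the figures are the assumed example values, not measurements):

          #include <cstdio>

          int main() {
              constexpr double mutationRateMBps = 1.0;     // 1,000 muts/s x 1kB
              constexpr double checkpointQuotaMB = 1000.0; // per the 3000MB bucket
              constexpr double gsiReadRateMBps = 10.0;     // 10,000 msgs/s x 1kB

              // Time for a stationary cursor to exhaust the checkpoint quota.
              constexpr double secondsToFill = checkpointQuotaMB / mutationRateMBps;
              // On-disk dataset size whose backfill takes exactly that long.
              constexpr double criticalDatasetMB = gsiReadRateMBps * secondsToFill;

              std::printf("Quota exhausted after %.0f s (~%.1f min)\n",
                          secondsToFill, secondsToFill / 60.0);
              std::printf("Backfill exceeds that for datasets over %.0f MB\n",
                          criticalDatasetMB);
              // Prints: Quota exhausted after 1000 s (~16.7 min)
              //         Backfill exceeds that for datasets over 10000 MB
              return 0;
          }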

      We could adjust the proposal to only disable cursor-dropping when the DCP stream is in the in-memory phase (i.e. after backfill has completed), when everyone should be in steady-state; but this doesn't avoid the potential rollback-to-zero issue if GSI backfill takes a very long time. I'm not sure if that's still useful to GSI or not...
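
      Sketched below (hypothetical names again), the protection would simply become conditional on the stream's phase:

          enum class StreamPhase { Backfilling, InMemory };

          struct StreamInfo {
              StreamPhase phase;
              bool optedOutOfDropping;
          };

          // A cursor is only protected from dropping once its stream has
          // finished backfill and entered the in-memory (steady-state) phase,
          // so a long initial build cannot stall front-end mutations.
          bool isDropProtected(const StreamInfo& s) {
              return s.optedOutOfDropping && s.phase == StreamPhase::InMemory;
          }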

      CC Steve Yen, Deepkaran Salooja, John Liang - please add any comments you have.


            People

              owend Daniel Owen
              drigby Dave Rigby (Inactive)