Details
Type: Improvement
Resolution: Fixed
Priority: Critical
Fix Version: 7.1.3
Sprint: KV 2023-4
Description
Recent CBSEs (linked) running on high-latency disks (e.g. AWS EBS, GCP Persistent Disk) have shown that DCP backfill throughput can be limited by our current AuxIO thread counts.
For example, on an 8-core n2 GCE instance, a single DCP connection which is backfilling can read from disk at a rate of only ~12MB/s (2000 IOPS, average read size ~6kB).
Currently AuxIO threads are set to #cores (min 2, max 8). As such, the maximum achievable aggregate backfill throughput is ~96MB/s (8 threads x 12MB/s), even though EBS / GCP disks can, with sufficient concurrency, in theory achieve many hundreds of MB/s.
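To spell that ceiling out, here is a minimal C++ sketch of the arithmetic (the clamp mirrors the current "#cores, min 2, max 8" rule and the per-thread rate comes from the n2 example above; this is illustrative, not the actual ExecutorPool sizing code):

```cpp
#include <algorithm>
#include <cstdio>

int main() {
    // Per-connection backfill read rate from the n2 example:
    // ~2000 IOPS * ~6kB average read size ~= 12 MB/s.
    const double perThreadMBps = 12.0;
    const unsigned cores = 8;                                 // 8-core n2 instance
    const unsigned auxIoThreads = std::clamp(cores, 2u, 8u);  // current rule: #cores, min 2, max 8
    std::printf("AuxIO threads: %u, max aggregate backfill: %.0f MB/s\n",
                auxIoThreads, auxIoThreads * perThreadMBps);  // -> 8 threads, 96 MB/s
}
```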
Looking at the %CPU consumed by a totally "busy" AuxIO thread running BackfillManagerTasks back-to-back shows values around 20-25% - i.e. the thread is spending 75-80% of its time waiting on IO. As such, it should be reasonable to "overcommit" AuxIO threads by a factor of ~4 before we actually consume that much real CPU.
(In other words, 4 AuxIO threads "spinning" performing backfills only actually consume one CPU core's worth of cycles.)
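Making that factor explicit (a rough derivation, assuming CPU cycles rather than disk bandwidth become the next bottleneck): if each busy AuxIO thread consumes a fraction u ~= 0.20-0.25 of one core, then approximately

    overcommit factor ~= 1 / u = 1 / 0.25 = 4

threads can share a single core before we saturate its cycles - which is where the coefficient of ~4 proposed below comes from.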
—
I propose we do the following:
- Increase the current AuxIO thread CPU coefficient (number of AuxIO threads per CPU core) from 1 to, say, 4 (see the sizing sketch after this list).
- Increase the cap on automatic AuxIO thread count from 8 to $bigger_number.
- Benchmark the change in DCP backfill throughput on high-latency disks (EBS, GCP), and confirm we don't swamp the other thread pools.
- If we do see undesirable impact on front-end operations, we might need to scale back the coefficient; we could also look at reducing the thread priority of AuxIO threads, as we already do for Writers (see https://github.com/couchbase/kv_engine/blob/master/executor/executorpool.cc#L183-L217, and the priority sketch below).
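For illustration, a hedged sketch of what the revised sizing rule could look like - the function and constant names below are invented for this ticket, not the actual kv_engine code, and 64 is purely a stand-in for $bigger_number:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical revised AuxIO sizing: coefficient raised from 1 to 4,
// cap raised from 8 to $bigger_number (64 used here as a placeholder).
size_t calcNumAuxIoThreads(size_t numCpus) {
    const size_t coefficient = 4;  // was 1: AuxIO threads per CPU core
    const size_t minThreads = 2;   // unchanged lower bound
    const size_t maxThreads = 64;  // was 8; stand-in for $bigger_number
    return std::clamp(numCpus * coefficient, minThreads, maxThreads);
}
```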
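And a rough sketch of the fallback option: lowering an AuxIO thread's scheduling priority on Linux, analogous in spirit to the Writer handling in the executorpool.cc link above (the helper name is invented here; the real kv_engine mechanism may differ):

```cpp
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <sys/resource.h>
#include <sys/syscall.h>
#include <unistd.h>

// Raise the nice value of the *calling* thread only: on Linux,
// setpriority() given a kernel thread id affects just that thread.
static void lowerCurrentThreadPriority(int niceValue) {
    const pid_t tid = static_cast<pid_t>(syscall(SYS_gettid));
    if (setpriority(PRIO_PROCESS, tid, niceValue) != 0) {
        std::fprintf(stderr, "setpriority failed: %s\n", std::strerror(errno));
    }
}

int main() {
    lowerCurrentThreadPriority(19); // weakest priority; IO waits are unaffected
    // ... AuxIO work (backfill disk reads) would run here ...
    return 0;
}
```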