Details
-
Improvement
-
Resolution: Unresolved
-
Major
-
7.6.0, 7.0.0, 7.1.0, 7.2.0
-
None
-
0
Description
There are several places in KV-Engine where we want to distribute vBuckets into equal-sized sets.
- KV shards - #CPUs
- ConnStore::vbConnLocks - 32 mutexes
- EPBucket::getFlusher - #CPU flushers
- EPBucket::getBgFetcher - #CPU BGFetchers
- CheckpointMemRecoveryTask::getVbucketsSortedByChkMem - checkpoint removers
- ... (might be others)
What we have in all of those places is a finite resource which we want to share between vBuckets in a uniform way. For example, if we have 256 vBuckets on a node, we want 1 vbConnLock per 8 vBuckets.
The way we've done this is by taking vbid % numTasks. This is problematic, because the vBuckets IDs on a node are not consequitive. ns_server computes the vBucket map and decides which vBucket IDs to assign to a given node. In practice, there are gaps in the ranges of vBuckets on a node.
Example
Here's an example to illustrate how that affects the task distribution.
Assuming the following 256 active vBuckets are on a node, such that we have sequences of 16 consequitive numbers, then 48 vbids which are on the other nodes in the cluster, then more vbids. This repeats at interval of 64 vbids.
0-15, 64-79, 128-143, 192-207, 256-271 ... 1008-1023
And we look at the ConnStore::vbConnLocks example where we have 32 locks and distribute them across vBuckets.
Given the vbids above, any vbid % 32 will be less than 16. This means that we will only ever use 16 of the 32 locks, because there might be no vBuckets whose number divided by 32 has a remainer above 15.
This problem would become worse with a bucket configured with 128 vBuckets.
Solution
We should make it such that we equally distribute work regardless of the exact vbids assigned to a node.
We can do this in several ways, here are two:
- We could track the "index" of the vBucket in the set of vBuckets, such that the lowest vbid is always index 0, the second lowest is index 1, etc, and we can use those indexes in place of the vbid. This seems like it needs manual bookkeeping as vBuckets are created/destroyed, and results in vBucket's assignment changing at runtime.
- We could use a hash function - we only require every output value to be ~ equally likely given an input. This will give a static allocation of vBucket -> hash.
I think we should use the second option - a hash function. We can use crc32(vbid) as we already have that in the code base and crc32c's output is uniform across the 32-bit output space. By pre-computing the crc32c for the known vbids, we can optimise away the computation and crc32(vbid) will turn into a table lookup (1024 entries).
'std::hash<Vbid>' could return this value.