As described in the Design Document for BBB, before starting the implementation I performed some tests for assessing the current performance of our Seqno-based Backfill.
I have logically split Backfill into 5 steps/components:
- Read from disk on source
- Software on source (this is our memcached+DCPProducer)
- Transfer over the network
- Software on destination (this is our memcached+DCPConsumer)
- Write to disk on destination
So I have run tests fixing steps 1, 3 and 5 (infinite-speed disks on both source and destination, approximated by SSD, and infinite bandwidth, approximated by the loopback interface) to assess the behaviour of our DCP fixture and find bottlenecks.
Tests (load fixed for all tests, measure runtimes):
A) Full kv_engine -> ns_server -> kv_engine rebalance with SBB
B) KV-Engine reads VBuckets data from disk and sends over the network with SBB (measures the performance of step (2) above)
C) KV-Engine receives VBuckets data and writes to disk with SBB (measures the performance of step (4) above)
D) VBucket file copy (SCP or equivalent) from Producer to Consumer (roughly simulate what will happen with BBB during the transfer of a VBucket file)
Note: we want to test DGM scenarios, so a prerequisite for all tests is that our data files are not cached in the OS page cache; that ensures that we actually read from disk.
CPU Intel(R) Xeon(R) CPU E5-2420 0 @ 1.90GHz
OS Ubuntu 18.04
I started from A and D.
- export LD_LIBRARY_PATH=/home/paolo/source/install/lib/ && export COUCHBASE_NUM_VBUCKETS=4 && ./cluster_run -n 2
- ./couchbase-cli cluster-init --cluster=localhost:9000 --cluster-username=admin --cluster-password=admin1 --services=data --cluster-ramsize=256
- ./couchbase-cli cluster-init --cluster=localhost:9001 --cluster-username=admin --cluster-password=admin1 --services=data --cluster-ramsize=256
- ./couchbase-cli bucket-create -c localhost:9000 -u admin -p admin1 --bucket=example --bucket-type=couchbase --bucket-ramsize=100 --bucket-replica=1 --bucket-eviction-policy=fullEviction --enable-flush=1 --wait
- cbc-pillowfight --spec="couchbase://127.0.0.1:12000/example" --username=admin --password=admin1 --batch-size=10 --num-items=1000000 --num-threads=4 --set-pct=100 --min-size=1024 --max-size=1024 --random-body --populate-only
- ./couchbase-cli server-add --cluster=127.0.0.1:9000 --username=admin --password=admin1 --server-add=127.0.0.1:9001 --server-add-username=admin --server-add-password=admin1
- sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' && time ./couchbase-cli rebalance -c localhost:9000 -u admin -p admin1
[D] SCP the same 4 vbucket files created at test [A]
- sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' && time sh -c 'scp paolo@localhost:~/source/ns_server/data/n_0/data/example/*.couch.* paolo@localhost:/ssd-ext4/paolo-data/ && sync'
Detailed test results at https://docs.google.com/spreadsheets/d/1fuRE2qx05Fw59tr7Nm9ECU5idpYnzPOVePDVU9B6rd8/edit#gid=1321604598 .
Note that we have 4 vbuckets and 1M items, so at rebalance the Producer sends 4 data files - 250k items per file/Backfill.
A) runtime ~ 180 seconds, low resource (disk, network) utilization
D) runtime ~ 15 seconds, high resource (disk, network) utilization
So where do we spend those 180 seconds at Rebalance?
Looking at logs on node n_0 (the Producer) I see:
If you do the maths, you see that we spend roughly 160 seconds in Backfill (on average 160 / 4 = 40 seconds per Backfill).
Ok, this is in line with our expectations: SBB is slow.
But note that the timings above are KV-Engine timings, i.e. memcached + EPEngine/DCP. EPEngine reads from disk and makes data available for memcached to send to the Consumer.
So where is the bottleneck?
With SBB we essentially read from disk mutation by mutation (rather than just reading disk blocks). That is performed by Couchstore. From the EPEngine point of view, a Backfill is (in general) a series of executions of KVStore::scan(startSeqno, endSeqno). Usually it is not a single call to KVStore::scan() because we may split the scan into multiple phases depending on certain OOM conditions. So, maybe CouchKVStore::scan() is the bottleneck. Let's check.
I have implemented a simple test for assessing the performance of CouchKVStore::scan(). Note that this test is essentially test [B] above (KV-Engine reads VBuckets data from disk and sends over the network with SBB) at EPEngine level (so, I just read a VBucket from disk with SBB, no memcached/network involved).
The test code is at http://review.couchbase.org/#/c/100874/. There I test the performance of a single Backfill of 250k items. It turns out that the runtime is < 3 seconds. This is unexpectedly good performance compared to the 40 seconds of the real-scenario Backfill in test [A].
So, back to test [A].
I repeated the Rebalance and profiled with `perf record`, but perf doesn't report any relevant slow code path.
Now, this is what I see when I add some log tracing in relevant functions:
Note about log messages:
- DCPBackfillDisk::run(): (vb:3) state 1. This marks a call to CouchKVStore::scan(). As mentioned above, we run it multiple times.
- BackfillManager::bytesCheckAndRead(): buffer SIZE. This is the reason why we interrupt the execution of CouchKVStore::scan(): the BackfillManager::buffer is full.
- Notifying paused connection now that DcpProducer::BufferLog is no longer full. The BufferLog was full and the connection was paused. Now that the Consumer has acknowledged received bytes, the Producer can unpause the connection and resume sending mutations.
So, we have many instances of:
- multiple and fast runs of CouchKVStore::scan()
- we hit BackfillManager::buffer = full
- we don't execute CouchKVStore::scan() for a long time (sleep windows up to 10 seconds in other runs)
- DcpProducer::BufferLog switches from full to non-full
- we resume executing CouchKVStore::scan()
Looking at code (couchbase/vulcan) explains the pattern above.
If the BackfillManager::buffer is full, then we don't even try to proceed with the Backfill (so we don't call CouchKVStore::scan()).
The buffer is set to non-full in:
So, we set 'buffer.full = false' when memcached processes the connection on a mc:worker thread.
But we have seen that there are many (and long) time windows where the connection is paused because the DcpProducer::BufferLog is full. Looking at the timings in the logs, I see that we spend ~90% of the total Backfill time in this state.
As soon as the Consumer acknowledges bytes, the connection is notified, memcached processes the connection and we set 'buffer.full = false'. At that point the Backfill resumes.
Thus, in this test scenario the slow Rebalance is mostly caused by a slow Consumer.
Note that the default size of the BufferLog is 10MB, so that is the value in the test above.
As expected, if I relax the BufferLog constraint (max size ~ 50MB) I get:
So a single Backfill's runtime drops from ~40 to ~20 seconds, and the total Rebalance runtime from ~180 to ~120 seconds.
Obviously this is not a solution. But, the faster Rebalance completion (with no OOM) suggests that our flow control is not optimal.
So, I'm carrying on the investigation by inspecting the Consumer logs and assessing our buffer/flow control.
Some final important points about SBB and BBB in the scenario described:
- contrary to what the first test-A/test-D comparison suggests, our SBB is not extremely slow compared to a pure SCP in terms of read-from-disk
- BBB embedded in the existing DCP fixture would not provide any relevant speed-up
- in particular about the read-from-disk phase, CouchKVStore::scan() shows good performance
Note that I'm repeating the same tests at a larger scale (it was just 1GB of data files so far).