Couchbase Server / MB-31715

Preliminary performance assessment for Byte-based Backfill


Details

    • Task
    • Resolution: Done
    • Major
    • 6.5.0
    • master
    • couchbase-bucket, DCP

    Description

      As described in the Design Document for BBB (Byte-based Backfill), before starting the implementation I ran some tests to assess the current performance of our Seqno-based Backfill (SBB).

      I have logically split Backfill into 5 steps/components:

      1. Read from disk on source
      2. Software on source (this is our memcached+DCPProducer)
      3. Network
      4. Software on destination (this is our memcached+DCPConsumer)
      5. Write to disk on destination

      So I have run tests with steps 1, 3 and 5 fixed (infinite-speed disks on both source and destination, approximated by SSD, and infinite bandwidth, approximated by the loopback interface) to assess the behaviour of our DCP fixture and find the bottlenecks.

      Tests (load fixed for all tests, measure runtimes):

      A) Full kv_engine -> ns_server -> kv_engine rebalance with SBB
      B) KV-Engine reads VBuckets data from disk and sends over the network with SBB (measures the performance of step (2) above)
      C) KV-Engine receives VBuckets data and writes to disk with SBB (measures the performance of step (4) above)
      D) VBucket file copy (SCP or equivalent) from Producer to Consumer (roughly simulates what will happen with BBB during the transfer of a VBucket file)

      Note: we want to test DGM scenarios, so a prerequisite for all tests is that our data files are not cached in the OS page cache; that ensures that we actually read from disk.

      Environment (mancouch):
      CPU Intel(R) Xeon(R) CPU E5-2420 0 @ 1.90GHz
      RAM 130GB
      Disk SSD/ext4
      OS Ubuntu 18.04

      I started with tests [A] and [D].

      [A]

      • export LD_LIBRARY_PATH=/home/paolo/source/install/lib/ && export COUCHBASE_NUM_VBUCKETS=4 && ./cluster_run -n 2
      • ./couchbase-cli cluster-init --cluster=localhost:9000 --cluster-username=admin --cluster-password=admin1 --services=data --cluster-ramsize=256
      • ./couchbase-cli cluster-init --cluster=localhost:9001 --cluster-username=admin --cluster-password=admin1 --services=data --cluster-ramsize=256
      • ./couchbase-cli bucket-create -c localhost:9000 -u admin -p admin1 --bucket=example --bucket-type=couchbase --bucket-ramsize=100 --bucket-replica=1 --bucket-eviction-policy=fullEviction --enable-flush=1 --wait
      • cbc-pillowfight --spec="couchbase://127.0.0.1:12000/example" --username=admin --password=admin1 --batch-size=10 --num-items=1000000 --num-threads=4 --set-pct=100 --min-size=1024 --max-size=1024 --random-body --populate-only
      • ./couchbase-cli server-add --cluster=127.0.0.1:9000 --username=admin --password=admin1 --server-add=127.0.0.1:9001 --server-add-username=admin --server-add-password=admin1
      • sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' && time ./couchbase-cli rebalance -c localhost:9000 -u admin -p admin1

      [D] SCP the same 4 vbucket files created at test [A]

      • sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' && time sh -c 'scp paolo@localhost:~/source/ns_server/data/n_0/data/example/*.couch.* paolo@localhost:/ssd-ext4/paolo-data/ && sync'

      Detailed test results at https://docs.google.com/spreadsheets/d/1fuRE2qx05Fw59tr7Nm9ECU5idpYnzPOVePDVU9B6rd8/edit#gid=1321604598 .

      Note that we have 4 vbuckets and 1M items, so at rebalance the Producer sends 4 data files - 250k items per file/Backfill.

      Results summary:
      A) runtime ~ 180 seconds, low resource (disk, network) utilization
      D) runtime ~ 15 seconds, high resource (disk, network) utilization

      So where do we spend those 180 seconds at Rebalance?

      Looking at logs on node n_0 (the Producer) I see:

      2018-09-24T15:05:34.385061Z INFO 147: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.4.39->n_1@127.0.0.1:example - (vb 3) Sending disk snapshot with start seqno 0 and end seqno 250000
      2018-09-24T15:06:10.283672Z INFO 147: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.4.39->n_1@127.0.0.1:example - (vb 3) Backfill complete, 229171 items read from disk, 20829 from memory, last seqno read: 250000, pendingBackfill : False
      2018-09-24T15:06:10.661918Z INFO 147: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.4.39->n_1@127.0.0.1:example - (vb 2) Sending disk snapshot with start seqno 0 and end seqno 250000
      2018-09-24T15:06:53.449453Z INFO 147: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.4.39->n_1@127.0.0.1:example - (vb 2) Backfill complete, 237628 items read from disk, 12372 from memory, last seqno read: 250000, pendingBackfill : False
      2018-09-24T15:06:53.702386Z INFO 147: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.4.39->n_1@127.0.0.1:example - (vb 1) Sending disk snapshot with start seqno 0 and end seqno 250000
      2018-09-24T15:07:36.596180Z INFO 147: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.4.39->n_1@127.0.0.1:example - (vb 1) Backfill complete, 246338 items read from disk, 3662 from memory, last seqno read: 250000, pendingBackfill : False
      2018-09-24T15:07:41.686645Z INFO 147: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.4.39->n_1@127.0.0.1:example - (vb 0) Sending disk snapshot with start seqno 0 and end seqno 250000
      2018-09-24T15:08:19.532608Z INFO 147: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.4.39->n_1@127.0.0.1:example - (vb 0) Backfill complete, 249413 items read from disk, 587 from memory, last seqno read: 250000, pendingBackfill : False
      

      If you do some maths, you see that we spend roughly 160 seconds at Backfill (on average 160 / 4 = 40 seconds per Backfill).
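
      Concretely, the per-vbucket Backfill durations taken from the timestamps above (disk-snapshot start to Backfill complete, rounded) are:

      vb 3: 15:05:34.4 -> 15:06:10.3   ~36 s
      vb 2: 15:06:10.7 -> 15:06:53.4   ~43 s
      vb 1: 15:06:53.7 -> 15:07:36.6   ~43 s
      vb 0: 15:07:41.7 -> 15:08:19.5   ~38 s
      total ~160 s, i.e. ~40 s per Backfill on average
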
      Ok, this is in line with our expectation: SBB is slow.
      But note that the timings above are KV-Engine timings, i.e. memcached + EPEngine/DCP. EPEngine reads from disk and makes data available for memcached to send to the Consumer.
      So where is the bottleneck?

      With SBB we essentially read from disk mutation by mutation (rather than just reading disk blocks). That is performed by Couchstore. From the EPEngine point of view, a Backfill is (in general) a series of executions of KVStore::scan(startSeqno, endSeqno). Usually it is not a single call to KVStore::scan() because we may split the scan into multiple phases depending on certain OOM conditions. So, maybe CouchKVStore::scan() is the bottleneck. Let's check.

      I have implemented a simple test for assessing the performance of CouchKVStore::scan(). Note that this test is essentially test [B] above (KV-Engine reads VBuckets data from disk and sends over the network with SBB) at EPEngine level (so, I just read a VBucket from disk with SBB, no memcached/network involved).
      The test code is at http://review.couchbase.org/#/c/100874/. There I test the performance of a single Backfill of 250k items. It turns out that the runtime is < 3 seconds. This is unexpectedly good performance compared to the ~40 seconds of the real-scenario Backfill in test [A].
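
      As a back-of-the-envelope comparison (approximate, assuming the ~1KB values loaded by pillowfight above, so one vbucket file holds roughly 250k x 1KB ≈ 250MB of item data):

      ~250MB / ~3s   ≈ 80+ MB/s   bare CouchKVStore::scan() in the test above
      ~250MB / ~40s  ≈ 6 MB/s     effective rate of a full Backfill in test [A]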

      So, back to test [A].
      I repeated the Rebalance and profiled it with `perf record`, but perf doesn't report any relevant slow code path.
      Now, this is what I see when I add some log tracing in relevant functions:

      2018-10-23T11:57:47.804033Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:57:47.889521Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:57:47.934469Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:57:47.971526Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:57:47.979288Z WARNING (example) BackfillManager::bytesCheckAndRead(): buffer SIZE, buffer.bytesRead 20971776, bytes 1162, buffer.maxBytes 20972856
      2018-10-23T11:57:58.195722Z INFO (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.4.39->n_1@127.0.0.1:example - Notifying paused connection now that DcpProducer::BufferLog is no longer full; ackedBytes:80441900, bytesSent:8366112, maxBytes:10485760
      2018-10-23T11:57:58.289984Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:57:58.346384Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:57:58.401520Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:57:58.484053Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:57:58.593796Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:57:58.680420Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:57:58.756619Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:57:58.865010Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:57:58.961389Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:57:59.000384Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:57:59.044171Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:57:59.082776Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:57:59.119610Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:57:59.158290Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:57:59.213326Z WARNING (example) BackfillManager::bytesCheckAndRead(): buffer SIZE, buffer.bytesRead 20971776, bytes 1162, buffer.maxBytes 20972856
      2018-10-23T11:58:03.921199Z INFO (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.4.39->n_1@127.0.0.1:example - Notifying paused connection now that DcpProducer::BufferLog is no longer full; ackedBytes:135183740, bytesSent:8359488, maxBytes:10485760
      2018-10-23T11:58:03.994284Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:58:04.066900Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      ..
      <same pattern repeated multiple times>
      ..
      

      Note about log messages:

      • DCPBackfillDisk::run(): (vb:3) state 1. It marks a call to CouchKVStore::scan(). As I mentioned above, we run it multiple times.
      • BackfillManager::bytesCheckAndRead(): buffer SIZE. It is the reason why we interrupt the execution of CouchKVStore::scan(): the BackfillManager::buffer is full.
      • Notifying paused connection now that DcpProducer::BufferLog is no longer full. The BufferLog was full and the connection was paused. Now that the Consumer has acknowledged received bytes, the Producer can unpause the connection and resume sending mutations.

      So, we have many instances of:

      1. multiple and fast runs of CouchKVStore::scan()
      2. we hit BackfillManager::buffer = full
      3. we don't execute CouchKVStore::scan() for a long time (sleep windows up to 10 seconds in other runs)
      4. DcpProducer::BufferLog switches from full to non-full
      5. we resume executing CouchKVStore::scan() (the full cycle is modelled in the sketch below)
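
      A tiny standalone model makes the cycle clear. The sketch below is not the real KV-Engine code (sizes, names and the single-threaded flow are simplified assumptions); it just illustrates why a slowly-acking Consumer stalls the scan: the ~20MB BackfillManager buffer only un-fills when the mc:worker streams items out, and the mc:worker only streams while the 10MB DcpProducer::BufferLog has room.

      // Simplified standalone sketch, not the real KV-Engine classes.
      #include <cstddef>
      #include <cstdio>

      int main() {
          const size_t backfillMax = 20 * 1024 * 1024;  // ~BackfillManager buffer.maxBytes
          const size_t bufferLogMax = 10 * 1024 * 1024; // DcpProducer::BufferLog maxBytes
          const size_t itemSize = 1162;                 // per-item bytes, roughly as in the trace

          size_t backfillBytes = 0;      // BackfillManager buffer.bytesRead
          size_t unacked = bufferLogMax; // BufferLog already full: the Consumer is behind

          // 1. DCPBackfillDisk::run() / CouchKVStore::scan() fill the backfill
          //    buffer until it is full ("buffer SIZE" in the trace), then yield.
          while (backfillBytes + itemSize <= backfillMax) {
              backfillBytes += itemSize;
          }
          std::printf("scan paused: backfill buffer at %zu/%zu bytes\n",
                      backfillBytes, backfillMax);

          // 2. The connection is paused (BufferLog full), so the mc:worker sends
          //    nothing, BackfillManager::bytesSent() never runs, and the backfill
          //    buffer never drops below the 3/4 threshold that wakes the scan.
          const bool canResume = backfillBytes <= (backfillMax * 3) / 4;
          std::printf("BufferLog full (%zu unacked bytes) -> scan can resume: %s\n",
                      unacked, canResume ? "yes" : "no");

          // 3. Only after the Consumer acks ("no longer full" in the trace) does
          //    the mc:worker drain the backfill buffer via bytesSent(), and once
          //    it is below 3/4 of maxBytes the backfill task is woken again.
          unacked = 0;
          while (backfillBytes > (backfillMax * 3) / 4) {
              backfillBytes -= itemSize;
          }
          std::printf("after acks and draining: backfill buffer at %zu bytes -> scan resumes\n",
                      backfillBytes);
          return 0;
      }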

      Looking at the code (couchbase/vulcan) explains the pattern above.

      If the BackfillManager::buffer is full, then we don't even try to proceed with the Backfill (so we don't call CouchKVStore::scan()).

      backfill_status_t BackfillManager::backfill() {
          ..
          
          if (buffer.full) {
              // If the buffer is full check to make sure we don't have any backfills
              // that no longer have active streams and remove them. This prevents an
              // issue where we have dead backfills taking up buffer space.
              std::list<UniqueDCPBackfillPtr> toDelete;
              for (auto a_itr = activeBackfills.begin();
                   a_itr != activeBackfills.end();) {
                  if ((*a_itr)->isStreamDead()) {
                      (*a_itr)->cancel();
                      toDelete.push_back(std::move(*a_itr));
                      a_itr = activeBackfills.erase(a_itr);
                      engine.getDcpConnMap().decrNumActiveSnoozingBackfills();
                  } else {
                      ++a_itr;
                  }
              }
       
              lh.unlock();
              bool reschedule = !toDelete.empty();
              while (!toDelete.empty()) {
                  UniqueDCPBackfillPtr backfill = std::move(toDelete.front());
                  toDelete.pop_front();
              }
              return reschedule ? backfill_success : backfill_snooze;
          }
          
          ..
       
          backfill_status_t status = backfill->run();
       
          ..
      }
      

      The buffer is set to non-full in:

      BackfillManager::bytesSent
        DcpProducer::recordBackfillManagerBytesSent
          ActiveStream::backfillPhase
            ..
              DcpProducer::getNextItem
                DcpProducer::step

      So, we will set 'buffer.full = false' when memcached processes the connection in a mc:worker thread.

      void BackfillManager::bytesSent(size_t bytes) {
          LockHolder lh(lock);
          if (bytes > buffer.bytesRead) {
              throw std::invalid_argument("BackfillManager::bytesSent: bytes "
                      "(which is" + std::to_string(bytes) + ") is greater than "
                      "buffer.bytesRead (which is" + std::to_string(buffer.bytesRead) + ")");
          }
          buffer.bytesRead -= bytes;
       
          if (buffer.full) {
              /* We can have buffer.bytesRead > buffer.maxBytes */
              size_t unfilledBufferSize = (buffer.maxBytes > buffer.bytesRead)
                                                  ? buffer.maxBytes - buffer.bytesRead
                                                  : buffer.maxBytes;
       
              /* If buffer.bytesRead == 0 we want to fit the next read into the
                 backfill buffer irrespective of its size */
              bool canFitNext = (buffer.bytesRead == 0) ||
                                (unfilledBufferSize >= buffer.nextReadSize);
       
              /* <= implicitly takes care of the case where
                 buffer.bytesRead == (buffer.maxBytes * 3 / 4) == 0 */
              bool enoughCleared = buffer.bytesRead <= (buffer.maxBytes * 3 / 4);
              if (canFitNext && enoughCleared) {
                  buffer.nextReadSize = 0;
                  buffer.full = false;
                  if (managerTask) {
                      ExecutorPool::get()->wake(managerTask->getId());
                  }
              }
          }
      }
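
      Plugging the values from the trace above into that condition (buffer.maxBytes 20972856, buffer.bytesRead 20971776):

      enoughCleared requires buffer.bytesRead <= 3/4 * 20972856 = 15729642 bytes,
      i.e. 20971776 - 15729642 ≈ 5.2MB of backfilled data must be drained via
      bytesSent() before the scan is woken again; and bytesSent() only runs while
      the mc:worker is streaming, which it cannot do while the connection is
      paused on a full DcpProducer::BufferLog (10MB).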
      

      But we have seen that there are many (and long) time windows where the connection is paused because the DcpProducer::BufferLog is full. Looking at the timings in the logs, I see that we spend roughly 90% of the total Backfill time in this state.
      As soon as the Consumer acknowledges bytes, the connection is notified, memcached processes the connection and we set 'buffer.full = false'. At that point the Backfill resumes.

      Thus, in this test scenario the slow Rebalance is mostly caused by a slow Consumer.

      Note that the default size of the BufferLog is 10MB, so that is the value in the test above.
      As expected, if I relax the BufferLog constraint (max size ~ 50MB) I get:

      curl -X POST -u admin:admin1 http://localhost:9000/diag/eval -d 'ns_bucket:update_bucket_props("example", [{extra_config_string, "dcp_conn_buffer_size=49999999"}]).' && kill -9 $(pgrep memcached)
      

      paolo@mancouch:~/source/ns_server/logs$ grep -E "Sending disk|Backfill complete" n_0/memcached.log.00000*
      n_0/memcached.log.000001.txt:2018-10-22T17:49:31.132239Z INFO 147: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.4.39->n_1@127.0.0.1:example - (vb 3) Sending disk snapshot with start seqno 0 and end seqno 250000
      n_0/memcached.log.000001.txt:2018-10-22T17:49:51.662652Z INFO 147: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.4.39->n_1@127.0.0.1:example - (vb 3) Backfill complete, 236794 items read from disk, 13206 from memory, last seqno read: 250000, pendingBackfill : False
      ..
      

      paolo@mancouch:~/source/install/bin$ sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' && time ./couchbase-cli rebalance -c localhost:9000 -u admin -p admin1
      Rebalancing
      Bucket: 01/01 (example)                                           0 docs remaining
      [==========================================================================] 100.0%
      SUCCESS: Rebalance complete
       
      real  2m4.560s
      user  0m1.618s
      sys 0m0.148s
      

      So, a single Backfill runtime drops from ~40 to ~20 seconds and the total Rebalance runtime drops from ~180 to ~120 seconds.

      Obviously this is not a solution, but the faster Rebalance completion (with no OOM) suggests that our flow control is not optimal.
      So, I'm carrying on the investigation by inspecting the Consumer logs and assessing our buffer/flow control.

      Some final important points about SBB and BBB in the scenario described:

      1. contrary to what the first test-A/test-D comparison suggests, our SBB is not extremely slow compared to a pure SCP in terms of read-from-disk
      2. BBB embedded in the existing DCP fixture would not provide any relevant speed-up
      3. in particular about the read-from-disk phase, CouchKVStore::scan() shows good performance

      Note that I'm repeating the same tests at larger scale (it was just 1GB of data files so far).
