MB-31715: Preliminary performance assessment for Byte-based Backfill


Details

    • Task
    • Resolution: Done
    • Major
    • 6.5.0
    • master
    • couchbase-bucket, DCP

    Description

      As described in the Design Document for Byte-based Backfill (BBB), before starting the implementation I performed some tests to assess the current performance of our Seqno-based Backfill (SBB).

      I have logically split Backfill into 5 steps/components:

      1. Read from disk on source
      2. Software on source (this is our memcached+DCPProducer)
      3. Network
      4. Software on destination (this is our memcached+DCPConsumer)
      5. Write to disk on destination

      So I have run tests fixing steps 1, 3 and 5 (infinite-speed disks on both source and destination, approximated by SSD, and infinite bandwidth, approximated by the loopback interface) to assess the behaviour of our DCP fixture and find the bottlenecks.

      Tests (load fixed for all tests, measure runtimes):

      A) Full kv_engine -> ns_server -> kv_engine rebalance with SBB
      B) KV-Engine reads VBuckets data from disk and sends over the network with SBB (measures the performance of step (2) above)
      C) KV-Engine receives VBuckets data and writes to disk with SBB (measures the performance of step (4) above)
      D) VBucket file copy (SCP or equivalent) from Producer to Consumer (roughly simulates what will happen with BBB during the transfer of a VBucket file)

      Note: we want to test DGM scenarios, so a prerequisite for all tests is that our data files are not cached in the OS page cache; that ensures that we actually read from disk.

      Environment (mancouch):
      CPU Intel(R) Xeon(R) CPU E5-2420 0 @ 1.90GHz
      RAM 130GB
      Disk SSD/ext4
      OS Ubuntu 18.04

      I started with tests A and D.

      [A]

      • export LD_LIBRARY_PATH=/home/paolo/source/install/lib/ && export COUCHBASE_NUM_VBUCKETS=4 && ./cluster_run -n 2
      • ./couchbase-cli cluster-init --cluster=localhost:9000 --cluster-username=admin --cluster-password=admin1 --services=data --cluster-ramsize=256
      • ./couchbase-cli cluster-init --cluster=localhost:9001 --cluster-username=admin --cluster-password=admin1 --services=data --cluster-ramsize=256
      • ./couchbase-cli bucket-create -c localhost:9000 -u admin -p admin1 --bucket=example --bucket-type=couchbase --bucket-ramsize=100 --bucket-replica=1 --bucket-eviction-policy=fullEviction --enable-flush=1 --wait
      • cbc-pillowfight --spec="couchbase://127.0.0.1:12000/example" --username=admin --password=admin1 --batch-size=10 --num-items=1000000 --num-threads=4 --set-pct=100 --min-size=1024 --max-size=1024 --random-body --populate-only
      • ./couchbase-cli server-add --cluster=127.0.0.1:9000 --username=admin --password=admin1 --server-add=127.0.0.1:9001 --server-add-username=admin --server-add-password=admin1
      • sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' && time ./couchbase-cli rebalance -c localhost:9000 -u admin -p admin1

      [D] SCP the same 4 vbucket files created at test [A]

      • sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' && time sh -c 'scp paolo@localhost:~/source/ns_server/data/n_0/data/example/*.couch.* paolo@localhost:/ssd-ext4/paolo-data/ && sync'

      Detailed test results at https://docs.google.com/spreadsheets/d/1fuRE2qx05Fw59tr7Nm9ECU5idpYnzPOVePDVU9B6rd8/edit#gid=1321604598 .

      Note that we have 4 vbuckets and 1M items, so at rebalance the Producer sends 4 data files - 250k items per file/Backfill.
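      As a rough sanity check on the dataset size (1,024-byte values from pillowfight; ignoring keys, metadata and couchstore overhead):

      items       = 1,000,000
      value size  = 1,024 B
      raw data    ~ 1,000,000 x 1,024 B ~ 1 GB

      which matches the roughly 1GB of data files mentioned at the end of this description.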

      Results summary:
      A) runtime ~ 180 seconds, low resource (disk, network) utilization
      D) runtime ~ 15 seconds, high resource (disk, network) utilization

      So where do we spend those 180 seconds at Rebalance?

      Looking at logs on node n_0 (the Producer) I see:

      2018-09-24T15:05:34.385061Z INFO 147: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.4.39->n_1@127.0.0.1:example - (vb 3) Sending disk snapshot with start seqno 0 and end seqno 250000
      2018-09-24T15:06:10.283672Z INFO 147: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.4.39->n_1@127.0.0.1:example - (vb 3) Backfill complete, 229171 items read from disk, 20829 from memory, last seqno read: 250000, pendingBackfill : False
      2018-09-24T15:06:10.661918Z INFO 147: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.4.39->n_1@127.0.0.1:example - (vb 2) Sending disk snapshot with start seqno 0 and end seqno 250000
      2018-09-24T15:06:53.449453Z INFO 147: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.4.39->n_1@127.0.0.1:example - (vb 2) Backfill complete, 237628 items read from disk, 12372 from memory, last seqno read: 250000, pendingBackfill : False
      2018-09-24T15:06:53.702386Z INFO 147: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.4.39->n_1@127.0.0.1:example - (vb 1) Sending disk snapshot with start seqno 0 and end seqno 250000
      2018-09-24T15:07:36.596180Z INFO 147: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.4.39->n_1@127.0.0.1:example - (vb 1) Backfill complete, 246338 items read from disk, 3662 from memory, last seqno read: 250000, pendingBackfill : False
      2018-09-24T15:07:41.686645Z INFO 147: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.4.39->n_1@127.0.0.1:example - (vb 0) Sending disk snapshot with start seqno 0 and end seqno 250000
      2018-09-24T15:08:19.532608Z INFO 147: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.4.39->n_1@127.0.0.1:example - (vb 0) Backfill complete, 249413 items read from disk, 587 from memory, last seqno read: 250000, pendingBackfill : False
      

      If you do some maths on the timestamps above, you see that we spend roughly 160 seconds in Backfill (on average 160 / 4 = 40 seconds per Backfill).
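      In detail, per vbucket (durations derived from the "Sending disk snapshot" / "Backfill complete" timestamps above):

      vb 3: 15:05:34 -> 15:06:10   ~36 s
      vb 2: 15:06:10 -> 15:06:53   ~43 s
      vb 1: 15:06:53 -> 15:07:36   ~43 s
      vb 0: 15:07:41 -> 15:08:19   ~38 s
      total                        ~160 s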
      Ok, this is in line with our expectation: SBB is slow.
      But note that the timings above are KV-Engine timings, i.e. memcached + EPEngine/DCP. EPEngine reads from disk and makes data available for memcached to send to the Consumer.
      So where is the bottleneck?

      With SBB we essentially read from disk mutation by mutation (rather than just reading disk blocks). That is performed by Couchstore. From the EPEngine point of view, a Backfill is (in general) a series of executions of KVStore::scan(startSeqno, endSeqno). Usually it is not a single call to KVStore::scan() because we may split the scan into multiple phases depending on certain OOM conditions. So, maybe CouchKVStore::scan() is the bottleneck. Let's check.
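      As an aside, here is a minimal, self-contained sketch of that resumable-scan pattern (toy types and a toy scan() function; not the real KVStore/CouchKVStore API), just to illustrate how a single logical Backfill maps onto several scan() executions once a buffer limit is hit:

          // Illustrative sketch only: toy stand-ins, not the real KVStore API.
          #include <cstddef>
          #include <cstdint>
          #include <functional>
          #include <iostream>
          #include <vector>
      
          struct Mutation {
              uint64_t bySeqno;
              std::vector<uint8_t> value;
          };
      
          // Toy scan(): walk the by-seqno "index" over [startSeqno, endSeqno] and
          // invoke the callback per mutation; the callback returns false to stop early.
          void scan(const std::vector<Mutation>& index,
                    uint64_t startSeqno,
                    uint64_t endSeqno,
                    const std::function<bool(const Mutation&)>& cb) {
              for (const auto& m : index) {
                  if (m.bySeqno < startSeqno || m.bySeqno > endSeqno) {
                      continue;
                  }
                  if (!cb(m)) {
                      return; // yield; the backfill will be re-scheduled later
                  }
              }
          }
      
          int main() {
              // 10 mutations of 1KB each (the pillowfight load, scaled down).
              std::vector<Mutation> index;
              for (uint64_t s = 1; s <= 10; ++s) {
                  index.push_back({s, std::vector<uint8_t>(1024)});
              }
      
              uint64_t lastReadSeqno = 0;
              const std::size_t bufferMax = 4 * 1024; // tiny buffer, forces multiple runs
              int runs = 0;
      
              // One logical Backfill of seqnos [1, 10]: keep re-running the scan from
              // the last seqno read until the whole range has been streamed.
              while (lastReadSeqno < 10) {
                  ++runs;
                  std::size_t bufferedBytes = 0; // pretend the front-end drained the buffer
                  scan(index, lastReadSeqno + 1, 10, [&](const Mutation& m) {
                      if (bufferedBytes + m.value.size() > bufferMax) {
                          return false; // "buffer full": yield and resume later
                      }
                      bufferedBytes += m.value.size();
                      lastReadSeqno = m.bySeqno;
                      return true;
                  });
              }
              std::cout << "1 backfill -> " << runs << " scan() runs\n";
          }

      Roughly speaking, in the real code the "buffer full" decision is taken by the BackfillManager, as shown in the code excerpts further below.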

      I have implemented a simple test for assessing the performance of CouchKVStore::scan(). Note that this test is essentially test [B] above (KV-Engine reads VBuckets data from disk and sends over the network with SBB) at EPEngine level (so, I just read a VBucket from disk with SBB, no memcached/network involved).
      The test code is at http://review.couchbase.org/#/c/100874/. There I test the performance of a single Backfill of 250k items. It turns out that the runtime is < 3 seconds. This is unexpectedly good performance compared to the 40 seconds of the real-scenario Backfill in test [A].

      So, back to test [A].
      I repeated the Rebalance and profiled with `perf record`, but perf doesn't report any relevant slow code path.
      Now, this is what I see when I add some log tracing in relevant functions:

      2018-10-23T11:57:47.804033Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:57:47.889521Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:57:47.934469Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:57:47.971526Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:57:47.979288Z WARNING (example) BackfillManager::bytesCheckAndRead(): buffer SIZE, buffer.bytesRead 20971776, bytes 1162, buffer.maxBytes 20972856
      2018-10-23T11:57:58.195722Z INFO (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.4.39->n_1@127.0.0.1:example - Notifying paused connection now that DcpProducer::BufferLog is no longer full; ackedBytes:80441900, bytesSent:8366112, maxBytes:10485760
      2018-10-23T11:57:58.289984Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:57:58.346384Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:57:58.401520Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:57:58.484053Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:57:58.593796Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:57:58.680420Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:57:58.756619Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:57:58.865010Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:57:58.961389Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:57:59.000384Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:57:59.044171Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:57:59.082776Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:57:59.119610Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:57:59.158290Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:57:59.213326Z WARNING (example) BackfillManager::bytesCheckAndRead(): buffer SIZE, buffer.bytesRead 20971776, bytes 1162, buffer.maxBytes 20972856
      2018-10-23T11:58:03.921199Z INFO (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.4.39->n_1@127.0.0.1:example - Notifying paused connection now that DcpProducer::BufferLog is no longer full; ackedBytes:135183740, bytesSent:8359488, maxBytes:10485760
      2018-10-23T11:58:03.994284Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      2018-10-23T11:58:04.066900Z WARNING (example) DCPBackfillDisk::run(): (vb:3) state 1
      ..
      <same pattern repeated multiple times>
      ..
      

      Note about log messages:

      • DCPBackfillDisk::run(): (vb:3) state 1. It marks a call to CouchKVStore::scan(). As I mentioned above, we run it multiple times.
      • BackfillManager::bytesCheckAndRead(): buffer SIZE. It is the reason why we interrupt the execution of CouchKVStore::scan(): the BackfillManager::buffer is full.
      • Notifying paused connection now that DcpProducer::BufferLog is no longer full. The BufferLog was full and the connection was paused. Now the Consumer has acknowledged received bytes, so the Producer can unpause the connection and resume sending mutations.

      So, we have many instances of:

      1. multiple and fast runs of CouchKVStore::scan()
      2. we hit a full BackfillManager::buffer
      3. we don't execute CouchKVStore::scan() for a long time (sleep windows up to 10 seconds in other runs; see the gaps extracted from the trace just below)
      4. DcpProducer::BufferLog switches from full to non-full
      5. we resume executing CouchKVStore::scan()
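
      For example, from the trace above, the windows with no DCPBackfillDisk::run() executions are:

      buffer full (bytesCheckAndRead)        11:57:47.979
      BufferLog no longer full (Notifying)   11:57:58.196   -> ~10.2 s with no scan
      next DCPBackfillDisk::run()            11:57:58.290

      buffer full (bytesCheckAndRead)        11:57:59.213
      BufferLog no longer full (Notifying)   11:58:03.921   -> ~4.7 s with no scan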

      Looking at the code (couchbase/vulcan) explains the pattern above.

      If the BackfillManager::buffer is full, then we don't even try to proceed with the Backfill (so we don't call CouchKVStore::scan()).

      backfill_status_t BackfillManager::backfill() {
          ..
          
          if (buffer.full) {
              // If the buffer is full check to make sure we don't have any backfills
              // that no longer have active streams and remove them. This prevents an
              // issue where we have dead backfills taking up buffer space.
              std::list<UniqueDCPBackfillPtr> toDelete;
              for (auto a_itr = activeBackfills.begin();
                   a_itr != activeBackfills.end();) {
                  if ((*a_itr)->isStreamDead()) {
                      (*a_itr)->cancel();
                      toDelete.push_back(std::move(*a_itr));
                      a_itr = activeBackfills.erase(a_itr);
                      engine.getDcpConnMap().decrNumActiveSnoozingBackfills();
                  } else {
                      ++a_itr;
                  }
              }
       
              lh.unlock();
              bool reschedule = !toDelete.empty();
              while (!toDelete.empty()) {
                  UniqueDCPBackfillPtr backfill = std::move(toDelete.front());
                  toDelete.pop_front();
              }
              return reschedule ? backfill_success : backfill_snooze;
          }
          
          ..
       
          backfill_status_t status = backfill->run();
       
          ..
      }
      

      The buffer is set to non-full in:

      BackfillManager::bytesSent
        DcpProducer::recordBackfillManagerBytesSent
          ActiveStream::backfillPhase
            ..
              DcpProducer::getNextItem
                DcpProducer::step

      So, we will set 'buffer.full = false' when memcached processes the connection in a mc:worker thread.

      void BackfillManager::bytesSent(size_t bytes) {
          LockHolder lh(lock);
          if (bytes > buffer.bytesRead) {
              throw std::invalid_argument("BackfillManager::bytesSent: bytes "
                      "(which is" + std::to_string(bytes) + ") is greater than "
                      "buffer.bytesRead (which is" + std::to_string(buffer.bytesRead) + ")");
          }
          buffer.bytesRead -= bytes;
       
          if (buffer.full) {
              /* We can have buffer.bytesRead > buffer.maxBytes */
              size_t unfilledBufferSize = (buffer.maxBytes > buffer.bytesRead)
                                                  ? buffer.maxBytes - buffer.bytesRead
                                                  : buffer.maxBytes;
       
              /* If buffer.bytesRead == 0 we want to fit the next read into the
                 backfill buffer irrespective of its size */
              bool canFitNext = (buffer.bytesRead == 0) ||
                                (unfilledBufferSize >= buffer.nextReadSize);
       
              /* <= implicitly takes care of the case where
                 buffer.bytesRead == (buffer.maxBytes * 3 / 4) == 0 */
              bool enoughCleared = buffer.bytesRead <= (buffer.maxBytes * 3 / 4);
              if (canFitNext && enoughCleared) {
                  buffer.nextReadSize = 0;
                  buffer.full = false;
                  if (managerTask) {
                      ExecutorPool::get()->wake(managerTask->getId());
                  }
              }
          }
      }
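
      Putting numbers on the code above, using the values from the trace (buffer.maxBytes 20972856, buffer.bytesRead 20971776 when the buffer goes full, and ~1162 bytes per mutation in this workload):

      enoughCleared requires   buffer.bytesRead <= 3/4 * 20972856 ~ 15,729,642 B
      bytes to drain           20,971,776 - 15,729,642 ~ 5.2 MB
                               i.e. roughly 5.2 MB / 1162 B ~ 4,500 mutations must be
                               sent over the connection before 'buffer.full' is cleared
                               and the Backfill can resume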
      

      But we have seen that there are many (and long) time windows where the connection is paused because the DcpProducer::BufferLog is full. Looking at the timings in the logs, we spend roughly 90% of the total Backfill time in this state.
      As soon as the Consumer acknowledges bytes, the connection is notified, memcached processes the connection and we set 'buffer.full = false'. At that point the Backfill resumes.

      Thus, in this test scenario the slow Rebalance is mostly caused by a slow Consumer.

      Note that the default size of the BufferLog is 10MB, so that is the value in the test above.
      As expected, if I relax the BufferLog constraint (max size ~ 50MB) I get:

      curl -X POST -u admin:admin1 http://localhost:9000/diag/eval -d 'ns_bucket:update_bucket_props("example", [{extra_config_string, "dcp_conn_buffer_size=49999999"}]).' && kill -9 $(pgrep memcached)
      

      paolo@mancouch:~/source/ns_server/logs$ grep -E "Sending disk|Backfill complete" n_0/memcached.log.00000*
      n_0/memcached.log.000001.txt:2018-10-22T17:49:31.132239Z INFO 147: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.4.39->n_1@127.0.0.1:example - (vb 3) Sending disk snapshot with start seqno 0 and end seqno 250000
      n_0/memcached.log.000001.txt:2018-10-22T17:49:51.662652Z INFO 147: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.4.39->n_1@127.0.0.1:example - (vb 3) Backfill complete, 236794 items read from disk, 13206 from memory, last seqno read: 250000, pendingBackfill : False
      ..
      

      paolo@mancouch:~/source/install/bin$ sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' && time ./couchbase-cli rebalance -c localhost:9000 -u admin -p admin1
      Rebalancing                                                                                                                                                                                                                                                                              
      Bucket: 01/01 (example)                                                                                                                                                                                                                                                  0 docs remaining
      [===============================================================================================================================================================================================================================================================================] 100.0%
      SUCCESS: Rebalance complete
       
      real  2m4.560s
      user  0m1.618s
      sys 0m0.148s
      

      So, a single Backfill's runtime drops from ~40 to ~20 seconds and the total Rebalance runtime drops from ~180 to ~120 seconds.

      Obviously this is not a solution. But the faster Rebalance completion (with no OOM) suggests that our flow control is not optimal.
      So, I'm carrying on the investigation by inspecting the Consumer logs and assessing our buffer/flow control.

      Some final important points about SBB and BBB in the scenario described:

      1. contrary to what the first test-A/test-D comparison suggests, our SBB is not extremely slow compared to a pure SCP in terms of read-from-disk
      2. BBB embedded in the existing DCP fixture would not provide any relevant speed-up
      3. in particular about the read-from-disk phase, CouchKVStore::scan() shows good performance

      Note that I'm repeating the same tests at larger scale (it was just 1GB of data files so far).

            People

              Assignee: Paolo Cocchi
              Reporter: Paolo Cocchi
