Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-28470

CheckpointManager::getNumItemsForCursor is imprecise

    XMLWordPrintable

Details

    • Triaged
    • No

    Description

      During investigation of MB-28431, it was found that CheckpointManager::getNumItemsForCursor return 1 when in fact there are no items available for the cursor. That had the result of causing the flusher to spin forever on the same vBucket; as we would incorrectly believe there were still items to flush.

      The investigation under MB-28431 proved that in some as-yet unknown scenario, a cursor was positioned at the end of a single open checkpoint; yet the Cursor::offset was such that it was one less than CheckpointManager::numItems. This resulted in the value 1 being returned:

      size_t CheckpointManager::getNumItemsForCursor_UNLOCKED(
                                                      const std::string &name) const {
          size_t remains = 0;
          cursor_index::const_iterator it = connCursors.find(name);
          if (it != connCursors.end()) {
              size_t offset = it->second.offset + getNumOfMetaItemsFromCursor(it->second);
              if (numItems >= offset) {
                  remains = numItems - offset;
              } else {
                  LOG(EXTENSION_LOG_WARNING, ...
              }
          } else {
              LOG(EXTENSION_LOG_WARNING, ...
          }
          return remains;
      }
      

      (Suspect this also affects older versions as well; not yet confirmed).

      I've already tried to reproduce the problem (making it explicit by adding the following assert); but neither the unit tests, run-mats nor a manual local rebalance test demonstrate the issue, so the cause is not clear. (When investigating the live cluster, while the effect was seen the cause happened long previously).

      Assert to make the problem explicit when occurs:

      diff --git a/engines/ep/src/vbucket.cc b/engines/ep/src/vbucket.cc
      index aad9cba7f..c7cc25c8a 100644
      --- a/engines/ep/src/vbucket.cc
      +++ b/engines/ep/src/vbucket.cc
      @@ -343,6 +343,7 @@ VBucket::ItemsToFlush VBucket::getItemsToPersist(size_t approxLimit) {
           // cursor, if we haven't yet hit the limit.
           // Note that it is only valid to queue a complete checkpoint - this is where
           // the "approx" in the limit comes from.
      +    const auto itemsPreCkptMgr = result.items.size();
           const auto ckptMgrLimit = approxLimit - result.items.size();
           CheckpointManager::ItemsForCursor ckptItems;
           if (ckptMgrLimit > 0) {
      @@ -353,6 +354,14 @@ VBucket::ItemsToFlush VBucket::getItemsToPersist(size_t approxLimit) {
               stats.persistenceCursorGetItemsHisto.add(
                       std::chrono::duration_cast<std::chrono::microseconds>(
                               ProcessClock::now() - _begin_));
      +
      +        // We were permitted at least one item for this cursor - therefore
      +        // if zero items were returned then numItemsForCursor should return
      +        // zero.
      +        if (result.items.size() == itemsPreCkptMgr) {
      +            cb_assert(checkpointManager->getNumItemsForCursor(
      +                              CheckpointManager::pCursorName) == 0);
      +        }
           } else {
               // We haven't got sufficient remaining capacity to read items from
               // CheckpoitnManager, therefore we must assume that there /could/
      

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            People

              owend Daniel Owen
              drigby Dave Rigby (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty