  1. Couchbase Server
  2. MB-31495

KVBucket::getRandomKey will hang 0.1% of the time for a node with active, empty VBs

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • 5.1.3, 5.5.3, 6.0.0
    • master, 4.6.5, 5.1.2, 5.5.0, 5.5.1, 5.5.2
    • couchbase-bucket
    • None
    • Untriaged
    • No

    Description

      Summary

      There is a bug in the getRandomKey functionality where it is possible that we get stuck and never return a response to the consumer.
      The bug has existed since 2013; however, since the introduction of the query workbench, which uses the getRandomKey functionality, it is appearing more often.

      The bug only appears when the node has replica vbuckets and/or all of its active vbuckets are empty.
      Furthermore the starting vbucket to search for a random key must be zero.

      The main user-visible symptom is one (or more) memcached front-end threads spinning at 100% CPU utilisation.

      Details
      The actual bug is in KVBucket::getRandomKey:

          GetValue KVBucket::getRandomKey() {
              VBucketMap::id_type max = vbMap.getSize();

              const long start = random() % max;
              long curr = start;
              std::unique_ptr<Item> itm;

              while (itm == NULL) {
                  VBucketPtr vb = getVBucket(curr++);
                  while (!vb || vb->getState() != vbucket_state_active) {
                      if (curr == start) {
                          return GetValue(NULL, ENGINE_KEY_ENOENT);
                      }
                      if (curr == max) {
                          curr = 0;
                      }

                      vb = getVBucket(curr++);
                  }

                  if ((itm = vb->ht.getRandomKey(random()))) {
                      GetValue rv(std::move(itm), ENGINE_SUCCESS);
                      return rv;
                  }

                  if (curr == max) {
                      curr = 0;
                  }

                  if (curr == start) {
                      return GetValue(NULL, ENGINE_KEY_ENOENT);
                  }
                  // Search next vbucket
              }

              return GetValue(NULL, ENGINE_KEY_ENOENT);
          }
      

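      The bug is that the inner while loop checks for curr == start before checking if curr == max. When start == 0, curr is reset to 0 and then incremented past it before the next curr == start check, so the check never succeeds.

      The issue shows up when we have an active but empty vbucket and start == 0.
      Focusing on the inner while loop: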

                  VBucketPtr vb = getVBucket(curr++);
                  while (!vb || vb->getState() != vbucket_state_active) {
                      if (curr == start) {
                          return GetValue(NULL, ENGINE_KEY_ENOENT);
                      }
                      if (curr == max) {
                          curr = 0;
                      }

                      vb = getVBucket(curr++);
                  }
      

      Recall that start == 0 and the bucket is empty. The first time we enter
      the while loop, curr == 1. Since curr != start and curr != max, we call
      getVBucket and increment curr to 2. This repeats until curr == 1024 (max).
      On the next iteration curr != start, but curr == max, so we set curr = 0.

      We then call getVBucket and increment curr to 1. On the next iteration
      curr != start (curr is 1, start is 0), so the exit check never fires and
      we loop indefinitely.
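
      The walk-through above can be reproduced with a small standalone model.
      This is only a sketch under stated assumptions: State, Result, search,
      wrapFirst and maxSteps are hypothetical names, not ep-engine code. Each
      vbucket is reduced to a state flag, every active vbucket is treated as
      empty, and a step cap stands in for "loops forever". The wrapFirst
      variant shows one possible reordering of the two checks in the inner
      loop; it is not necessarily the fix that was committed.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the real vbucket state and return codes.
enum class State { Active, Replica };
enum class Result { NotFound, Hung };

// Models the scan in the listing when no vbucket can supply a key.
// wrapFirst == false: inner loop tests curr == start, then curr == max (as in the listing).
// wrapFirst == true:  inner loop wraps at max first, then tests curr == start.
// maxSteps is an artificial cap so that a non-terminating scan is reported as Hung.
Result search(const std::vector<State>& vbs, long start, bool wrapFirst, long maxSteps) {
    const long max = static_cast<long>(vbs.size());
    assert(start >= 0 && start < max);

    long curr = start;
    long steps = 0;
    while (true) {
        State s = vbs[curr++];
        while (s != State::Active) {
            if (++steps > maxSteps) {
                return Result::Hung;
            }
            if (wrapFirst) {
                if (curr == max) {
                    curr = 0;
                }
                if (curr == start) {
                    return Result::NotFound;
                }
            } else {
                if (curr == start) {
                    return Result::NotFound;
                }
                if (curr == max) {
                    curr = 0;
                }
            }
            s = vbs[curr++];
        }

        // The active vbucket is empty, so no key is found here.
        if (++steps > maxSteps) {
            return Result::Hung;
        }
        if (curr == max) {
            curr = 0;
        }
        if (curr == start) {
            return Result::NotFound;
        }
    }
}
```

      In this model, 1024 replica-only vbuckets with start == 0 and
      wrapFirst == false hit the step cap, while start == 5 returns NotFound.
      The same holds when vbuckets 0 to 1022 are active but empty and vbucket
      1023 is a replica. With wrapFirst == true, both cases return NotFound.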

            People

              owend Daniel Owen
              owend Daniel Owen