Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-42903

Intermittent failure in memcapable.WarmUpMemcachedTest

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Major
    • 7.0.0
    • Cheshire-Cat
    • test-execution
    • None
    • ThreadSanitizer
    • Triaged
    • 1
    • Unknown

    Description

      The testrunner test 'memcapable.WarmUpMemcachedTest' intermittently fails when running under ThreadSanitizer with the following error:

      b'[2020-11-23 16:16:44,229] - [data_helper:314] INFO - creating direct client 127.0.0.1:12000 default'
      b'[2020-11-23 16:16:44,370] - [data_helper:882] WARNING - unable to establish connection to 127.0.0.1. cleanup open connections'
      ERROR
      

      The following backtrace is printed:

      ======================================================================
      ERROR: do_warmup_100k (memcapable.WarmUpMemcachedTest)
      ----------------------------------------------------------------------
      Traceback (most recent call last):
        File "pytests/memcapable.py", line 812, in setUp
          self._create_default_bucket()
        File "pytests/memcapable.py", line 844, in _create_default_bucket
          ready = BucketOperationHelper.wait_for_memcached(master, name)
        File "lib/membase/helper/bucket_helper.py", line 346, in wait_for_memcached
          bucket, timeout_in_seconds, log_msg)
        File "lib/membase/helper/bucket_helper.py", line 270, in wait_for_vbuckets_ready_state
          obj = VBucketAwareMemcached(rest, bucket)
        File "lib/memcached/helper/data_helper.py", line 791, in __init__
          self.reset(rest)
        File "lib/memcached/helper/data_helper.py", line 798, in reset
          m, v, r = self.request_map(self.rest, self.bucket)
        File "lib/memcached/helper/data_helper.py", line 851, in request_map
          self.add_memcached(vBucket.master, memcacheds, rest, bucket)
        File "lib/memcached/helper/data_helper.py", line 884, in add_memcached
          raise ex
        File "lib/memcached/helper/data_helper.py", line 877, in add_memcached
          admin_pass=admin_pass)
        File "lib/memcached/helper/data_helper.py", line 344, in direct_client
          client.bucket_select(bucket)
        File "lib/mc_bin_client.py", line 831, in bucket_select
          return self._doCmd(memcacheConstants.CMD_SELECT_BUCKET, name, '')
        File "lib/mc_bin_client.py", line 240, in _doCmd
          return self._handleSingleResponse(opaque)
        File "lib/mc_bin_client.py", line 231, in _handleSingleResponse
          cmd, opaque, cas, keylen, extralen, data = self._handleKeyedResponse(myopaque)
        File "lib/mc_bin_client.py", line 227, in _handleKeyedResponse
          raise MemcachedError(errcode, msg)
      mc_bin_client.MemcachedError: Memcached error #1 'Not found':  b''
      

      Looking into the call-stack, what appears to be happening is that immediately following bucket creation, testrunner fails to select a given bucket on a new node due to receiving a "NOT FOUND (1)" status code.

      When running under TSan, memcached generally executes slower and hence I think this is exposing an existing timing issue - memcached simply hasn't completed initialisation when the connection attempt is made.

      If we examine the timestamps, we see the failing SELECT_BUCKET request was at:

      b'[2020-11-23 16:16:44,229] - [data_helper:314] INFO - creating direct client 127.0.0.1:12000 default'
      [2020-11-23 16:16:44,370] - [data_helper:882] WARNING - unable to establish connection to 127.0.0.1. cleanup open connections'
      

      We don't know exactly when the SELECT_BUCKET command was received by KV-Engine, but we do know it was somewhere between 16:16:44.229 and 16:16:44.370 - i.e. between when testrunner started to create the connection and when the error message was printed.

      The memcached.log for the node on port 12000 (n_0) shows the following messages around that time:

      2020-11-23T16:16:44.365282+00:00 INFO 85: Client {"ip":"127.0.0.1","port":59374} authenticated as <ud>cbadminbucket</ud>
      2020-11-23T16:16:44.375850+00:00 INFO (default) EP Engine: Initialization of persistent bucket complete
      2020-11-23T16:16:44.376029+00:00 INFO 77 - Bucket [default] created successfully
      

      I believe that fd:85 (username "cbadminbucket") is the testrunner connection (it is within the correct time window) - and that connects ~10ms before the bucket completes creation.

      I don't believe this problem is specific to TSan - it's just easier to hit it as it slows down memcached initialisation. It seems that testrunner should wait until bucket creation has completed or similar.

      Note: Similar instance available here: http://cv.jenkins.couchbase.com/view/Commit%20Validation/view/kv_engine/job/kv_engine-master-post-commit-TSan/3882/consoleFull#-747259684712a389b-c7bc-46ec-bcae-02346ef3ce1b

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          People

            drigby Dave Rigby (Inactive)
            drigby Dave Rigby (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty