Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-42903

Intermittent failure in memcapable.WarmUpMemcachedTest

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Cheshire-Cat
    • Fix Version/s: 7.0.0
    • Component/s: test-execution
    • Labels:
      None
    • Environment:
      ThreadSanitizer
    • Triage:
      Triaged
    • Story Points:
      1
    • Is this a Regression?:
      Unknown

      Description

      The testrunner test 'memcapable.WarmUpMemcachedTest' intermittently fails when running under ThreadSanitizer with the following error:

      b'[2020-11-23 16:16:44,229] - [data_helper:314] INFO - creating direct client 127.0.0.1:12000 default'
      b'[2020-11-23 16:16:44,370] - [data_helper:882] WARNING - unable to establish connection to 127.0.0.1. cleanup open connections'
      ERROR
      

      The following backtrace is printed:

      ======================================================================
      ERROR: do_warmup_100k (memcapable.WarmUpMemcachedTest)
      ----------------------------------------------------------------------
      Traceback (most recent call last):
        File "pytests/memcapable.py", line 812, in setUp
          self._create_default_bucket()
        File "pytests/memcapable.py", line 844, in _create_default_bucket
          ready = BucketOperationHelper.wait_for_memcached(master, name)
        File "lib/membase/helper/bucket_helper.py", line 346, in wait_for_memcached
          bucket, timeout_in_seconds, log_msg)
        File "lib/membase/helper/bucket_helper.py", line 270, in wait_for_vbuckets_ready_state
          obj = VBucketAwareMemcached(rest, bucket)
        File "lib/memcached/helper/data_helper.py", line 791, in __init__
          self.reset(rest)
        File "lib/memcached/helper/data_helper.py", line 798, in reset
          m, v, r = self.request_map(self.rest, self.bucket)
        File "lib/memcached/helper/data_helper.py", line 851, in request_map
          self.add_memcached(vBucket.master, memcacheds, rest, bucket)
        File "lib/memcached/helper/data_helper.py", line 884, in add_memcached
          raise ex
        File "lib/memcached/helper/data_helper.py", line 877, in add_memcached
          admin_pass=admin_pass)
        File "lib/memcached/helper/data_helper.py", line 344, in direct_client
          client.bucket_select(bucket)
        File "lib/mc_bin_client.py", line 831, in bucket_select
          return self._doCmd(memcacheConstants.CMD_SELECT_BUCKET, name, '')
        File "lib/mc_bin_client.py", line 240, in _doCmd
          return self._handleSingleResponse(opaque)
        File "lib/mc_bin_client.py", line 231, in _handleSingleResponse
          cmd, opaque, cas, keylen, extralen, data = self._handleKeyedResponse(myopaque)
        File "lib/mc_bin_client.py", line 227, in _handleKeyedResponse
          raise MemcachedError(errcode, msg)
      mc_bin_client.MemcachedError: Memcached error #1 'Not found':  b''
      

      Looking into the call-stack, what appears to be happening is that immediately following bucket creation, testrunner fails to select a given bucket on a new node due to receiving a "NOT FOUND (1)" status code.

      When running under TSan, memcached generally executes slower and hence I think this is exposing an existing timing issue - memcached simply hasn't completed initialisation when the connection attempt is made.

      If we examine the timestamps, we see the failing SELECT_BUCKET request was at:

      b'[2020-11-23 16:16:44,229] - [data_helper:314] INFO - creating direct client 127.0.0.1:12000 default'
      [2020-11-23 16:16:44,370] - [data_helper:882] WARNING - unable to establish connection to 127.0.0.1. cleanup open connections'
      

      We don't know exactly when the SELECT_BUCKET command was received by KV-Engine, but we do know it was somewhere between 16:16:44.229 and 16:16:44.370 - i.e. between when testrunner started to create the connection and when the error message was printed.

      The memcached.log for the node on port 12000 (n_0) shows the following messages around that time:

      2020-11-23T16:16:44.365282+00:00 INFO 85: Client {"ip":"127.0.0.1","port":59374} authenticated as <ud>cbadminbucket</ud>
      2020-11-23T16:16:44.375850+00:00 INFO (default) EP Engine: Initialization of persistent bucket complete
      2020-11-23T16:16:44.376029+00:00 INFO 77 - Bucket [default] created successfully
      

      I believe that fd:85 (username "cbadminbucket") is the testrunner connection (it is within the correct time window) - and that connects ~10ms before the bucket completes creation.

      I don't believe this problem is specific to TSan - it's just easier to hit it as it slows down memcached initialisation. It seems that testrunner should wait until bucket creation has completed or similar.

      Note: Similar instance available here: http://cv.jenkins.couchbase.com/view/Commit%20Validation/view/kv_engine/job/kv_engine-master-post-commit-TSan/3882/consoleFull#-747259684712a389b-c7bc-46ec-bcae-02346ef3ce1b

        Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

          Hide
          drigby Dave Rigby added a comment -

          Note this issue is preventing me from merging Cheshire-Cat patches (specifically MB-36956) as this issue shows up ~25% of the time with that patch merged, due to TSan slowing things down more.

          Show
          drigby Dave Rigby added a comment - Note this issue is preventing me from merging Cheshire-Cat patches (specifically MB-36956 ) as this issue shows up ~25% of the time with that patch merged, due to TSan slowing things down more.
          Hide
          ritam.sharma Ritam Sharma added a comment -

          Ashwin Govindarajulu- Can you please help here.

          Show
          ritam.sharma Ritam Sharma added a comment - Ashwin Govindarajulu - Can you please help here.
          Hide
          drigby Dave Rigby added a comment -

          Note if you need help in reproducing this to fix the issue, I suspect a well-placed "sleep" inside KV-Engine bucket creation would make it occur 100% of the time. Let me know if that would be helpful and I can prepare a patch.

          Show
          drigby Dave Rigby added a comment - Note if you need help in reproducing this to fix the issue, I suspect a well-placed "sleep" inside KV-Engine bucket creation would make it occur 100% of the time. Let me know if that would be helpful and I can prepare a patch.
          Hide
          ashwin.govindarajulu Ashwin Govindarajulu added a comment -

          Hi Dave Rigby ,

          Even I am thinking of the same fix, but just making sure to handle it in generic way so that other tests calling the same API won't fail in future.

          Show
          ashwin.govindarajulu Ashwin Govindarajulu added a comment - Hi Dave Rigby , Even I am thinking of the same fix, but just making sure to handle it in generic way so that other tests calling the same API won't fail in future.
          Hide
          ashwin.govindarajulu Ashwin Govindarajulu added a comment -

          Submitted the patch with sleep.

          Dave Rigby Can you please trigger a build and validate ?

          Show
          ashwin.govindarajulu Ashwin Govindarajulu added a comment - Submitted the patch with sleep. Dave Rigby Can you please trigger a build and validate ?
          Hide
          drigby Dave Rigby added a comment -
          Show
          drigby Dave Rigby added a comment - Not failures in WarmUpMemcachedTest in recent http://cv.jenkins.couchbase.com/view/Commit%20Validation/view/kv_engine/job/kv_engine-master-post-commit-TSan/ runs.
          Hide
          mihir.kamdar Mihir Kamdar added a comment -

          Closing as verified

          Show
          mihir.kamdar Mihir Kamdar added a comment - Closing as verified

            People

            Assignee:
            drigby Dave Rigby
            Reporter:
            drigby Dave Rigby
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

              Dates

              Created:
              Updated:
              Resolved:

                Gerrit Reviews

                There are no open Gerrit changes

                  PagerDuty