Details
-
Bug
-
Resolution: Fixed
-
Major
-
Cheshire-Cat
-
None
-
ThreadSanitizer
-
Triaged
-
1
-
Unknown
Description
The testrunner test 'memcapable.WarmUpMemcachedTest' intermittently fails when running under ThreadSanitizer with the following error:
b'[2020-11-23 16:16:44,229] - [data_helper:314] INFO - creating direct client 127.0.0.1:12000 default'
|
b'[2020-11-23 16:16:44,370] - [data_helper:882] WARNING - unable to establish connection to 127.0.0.1. cleanup open connections'
|
ERROR
|
The following backtrace is printed:
======================================================================
|
ERROR: do_warmup_100k (memcapable.WarmUpMemcachedTest)
|
----------------------------------------------------------------------
|
Traceback (most recent call last):
|
File "pytests/memcapable.py", line 812, in setUp
|
self._create_default_bucket()
|
File "pytests/memcapable.py", line 844, in _create_default_bucket
|
ready = BucketOperationHelper.wait_for_memcached(master, name)
|
File "lib/membase/helper/bucket_helper.py", line 346, in wait_for_memcached
|
bucket, timeout_in_seconds, log_msg)
|
File "lib/membase/helper/bucket_helper.py", line 270, in wait_for_vbuckets_ready_state
|
obj = VBucketAwareMemcached(rest, bucket)
|
File "lib/memcached/helper/data_helper.py", line 791, in __init__
|
self.reset(rest)
|
File "lib/memcached/helper/data_helper.py", line 798, in reset
|
m, v, r = self.request_map(self.rest, self.bucket)
|
File "lib/memcached/helper/data_helper.py", line 851, in request_map
|
self.add_memcached(vBucket.master, memcacheds, rest, bucket)
|
File "lib/memcached/helper/data_helper.py", line 884, in add_memcached
|
raise ex
|
File "lib/memcached/helper/data_helper.py", line 877, in add_memcached
|
admin_pass=admin_pass)
|
File "lib/memcached/helper/data_helper.py", line 344, in direct_client
|
client.bucket_select(bucket)
|
File "lib/mc_bin_client.py", line 831, in bucket_select
|
return self._doCmd(memcacheConstants.CMD_SELECT_BUCKET, name, '')
|
File "lib/mc_bin_client.py", line 240, in _doCmd
|
return self._handleSingleResponse(opaque)
|
File "lib/mc_bin_client.py", line 231, in _handleSingleResponse
|
cmd, opaque, cas, keylen, extralen, data = self._handleKeyedResponse(myopaque)
|
File "lib/mc_bin_client.py", line 227, in _handleKeyedResponse
|
raise MemcachedError(errcode, msg)
|
mc_bin_client.MemcachedError: Memcached error #1 'Not found': b''
|
Looking into the call-stack, what appears to be happening is that immediately following bucket creation, testrunner fails to select a given bucket on a new node due to receiving a "NOT FOUND (1)" status code.
When running under TSan, memcached generally executes slower and hence I think this is exposing an existing timing issue - memcached simply hasn't completed initialisation when the connection attempt is made.
If we examine the timestamps, we see the failing SELECT_BUCKET request was at:
b'[2020-11-23 16:16:44,229] - [data_helper:314] INFO - creating direct client 127.0.0.1:12000 default'
|
[2020-11-23 16:16:44,370] - [data_helper:882] WARNING - unable to establish connection to 127.0.0.1. cleanup open connections'
|
We don't know exactly when the SELECT_BUCKET command was received by KV-Engine, but we do know it was somewhere between 16:16:44.229 and 16:16:44.370 - i.e. between when testrunner started to create the connection and when the error message was printed.
The memcached.log for the node on port 12000 (n_0) shows the following messages around that time:
2020-11-23T16:16:44.365282+00:00 INFO 85: Client {"ip":"127.0.0.1","port":59374} authenticated as <ud>cbadminbucket</ud>
|
2020-11-23T16:16:44.375850+00:00 INFO (default) EP Engine: Initialization of persistent bucket complete
|
2020-11-23T16:16:44.376029+00:00 INFO 77 - Bucket [default] created successfully
|
I believe that fd:85 (username "cbadminbucket") is the testrunner connection (it is within the correct time window) - and that connects ~10ms before the bucket completes creation.
I don't believe this problem is specific to TSan - it's just easier to hit it as it slows down memcached initialisation. It seems that testrunner should wait until bucket creation has completed or similar.
Note: Similar instance available here: http://cv.jenkins.couchbase.com/view/Commit%20Validation/view/kv_engine/job/kv_engine-master-post-commit-TSan/3882/consoleFull#-747259684712a389b-c7bc-46ec-bcae-02346ef3ce1b