Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
- Fix Version/s: 2.4.0-beta
- Component/s: None
- Labels: None
Description
Scenario:
A simple test that sets the timeout threshold low, then does a bunch of ops in a tight loop, and then does more ops for 20 seconds.
Expected behavior:
ThresholdLoggingTracer would log something
Observed behavior:
Some slightly enhanced timeout logging, but no output from the threshold logging tracer.
Script and output attached. Invoked as:
$ LCB_LOGLEVEL=2 python timeoutme.py > run.out 2>&1
(note: assigning this to Sergey Avseyev since it is probably at the lcb level, but I wrote the test in Python; please move or redirect as appropriate)
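In outline, the script does something like the following. This is only a sketch (the real script is the attached timeoutme.py); the connection string, key names, and loop counts here are assumptions.
# Minimal repro sketch (assumed names; the actual script is the attached timeoutme.py)
import time
from couchbase.bucket import Bucket
from couchbase.exceptions import CouchbaseError

bucket = Bucket('couchbase://localhost/default')  # assumed connection string
bucket.timeout = 9e-4                             # very low timeout so many ops exceed it

# tight loop of ops
for i in range(10000):
    try:
        bucket.upsert('key', {'i': i})
        bucket.get('key')
    except CouchbaseError:
        pass  # timeouts are expected with such a low timeout

# keep issuing ops for ~20 seconds so the tracer has a chance to flush on its interval
end = time.time() + 20
while time.time() < end:
    try:
        bucket.get('key')
    except CouchbaseError:
        pass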
Attachments
- run.out (6 kB)
- run2.txt (19 kB)
- run3.txt (17 kB)
- timeoutme.php (0.7 kB)
- timeoutme.py (0.9 kB)
Activity
The thing is, the sdk-rfc calls for it to be written every 10s and to be on by default. By "write before destruction", do you mean before the process exits?
Sergey Avseyev what's the default logging interval at the moment?
Before the bucket is destroyed. I don't know; it took a lot to get the queue to flush: 400 rounds of multi-upserts with a 100-entry dictionary.
Not sure why it's so reluctant to flush... Will try Sergey's LCB update and see if that helps.
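Roughly, the load it took to force a flush looked like this. A sketch only: the key and document contents are assumptions, and bucket is the already-connected Bucket.
# Sketch of the load described above: 400 rounds of multi-upserts with a
# 100-entry dictionary (key/value names assumed)
docs = {'key_{}'.format(i): {'n': i} for i in range(100)}
for _ in range(400):
    bucket.upsert_multi(docs)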
Added the following tuneables…
bucket.timeout = 9e-4
bucket.TRACING_ORPHANED_QUEUE_FLUSH_INTERVAL = 10
bucket.TRACING_ORPHANED_QUEUE_SIZE = 10
bucket.TRACING_THRESHOLD_QUEUE_FLUSH_INTERVAL = 10
bucket.TRACING_THRESHOLD_QUEUE_SIZE = 10
And didn't get output. What output I did get is in run2.txt.
Updated to include the tunings suggested by Ellis Breen, and ran with the envvar suggested by Sergey Avseyev. Still not working.
script: timeoutme-tuned.py
run output: run3.txt
exec: LCB_OPTIONS="enable_tracing=true" LCB_LOGLEVEL=2 python timeoutme.py > run.out 2>&1
I tried to repro this with PHP and could not. In particular, the effective connstr came through differently and I had logging output indicating the threshold logging tracer. So now I wonder if there's a Python-level problem. Ellis Breen, can you have a look?
script: timeoutme.php
run output: timeoutme_php_output.txt
exec: LCB_LOGLEVEL=2 php timeoutme.php > timeoutme_php_output.txt 2>&1
Matt - the Orphaned responses in the last Python example are clearly visible here:
13860ms [Ia5744edb] {81507/a33} [WARN] (tracer - L:126) Orphan responses observed: {"count":6,"service":"kv","top":[{"last_local_address":"127.0.0.1:54909","last_local_id":"00000000a5744edb/9e327d4953bf9c29","last_remote_address":"localhost:11210","operation_id":"0xc896","operation_name":"get","server_us":0,"total_us":981},{"last_local_address":"127.0.0.1:54909","last_local_id":"00000000a5744edb/9e327d4953bf9c29","last_remote_address":"localhost:11210","operation_id":"0xc895","operation_name":"get","server_us":0,"total_us":1005},{"last_local_address":"127.0.0.1:54909","last_local_id":"00000000a5744edb/9e327d4953bf9c29","last_remote_address":"localhost:11210","operation_id":"0x940d","operation_name":"get","server_us":0,"total_us":1033},{"last_local_address":"127.0.0.1:54909","last_local_id":"00000000a5744edb/9e327d4953bf9c29","last_remote_address":"localhost:11210","operation_id":"0x7dc7","operation_name":"get","server_us":0,"total_us":1087},{"last_local_address":"127.0.0.1:54909","last_local_id":"00000000a5744edb/9e327d4953bf9c29","last_remote_address":"localhost:11210","operation_id":"0x7dc6","operation_name":"get","server_us":0,"total_us":1442},{"last_local_address":"127.0.0.1:54909","last_local_id":"00000000a5744edb/9e327d4953bf9c29","last_remote_address":"localhost:11210","operation_id":"0xa393","operation_name":"get","server_us":0,"total_us":1720}]} |
17305ms [Ia5744edb] {81507/a33} [WARN] (server - L:360) <localhost:11210> (CTX=0x7fef70510260,memcached,SRV=0x7fef7050cda0,IX=0) Server sent us reply for a timed-out command. (OP=0xb5, RC=0x0, SEQ=89334)
19443ms [Ia5744edb] {81507/a33} [WARN] (tracer - L:126) Orphan responses observed: {"count":6,"service":"kv","top":[{"last_local_address":"127.0.0.1:54909","last_local_id":"00000000a5744edb/9e327d4953bf9c29","last_remote_address":"localhost:11210","operation_id":"0x14ded","operation_name":"get","server_us":0,"total_us":957},{"last_local_address":"127.0.0.1:54909","last_local_id":"00000000a5744edb/9e327d4953bf9c29","last_remote_address":"localhost:11210","operation_id":"0x12932","operation_name":"get","server_us":0,"total_us":974},{"last_local_address":"127.0.0.1:54909","last_local_id":"00000000a5744edb/9e327d4953bf9c29","last_remote_address":"localhost:11210","operation_id":"0x15cf5","operation_name":"get","server_us":0,"total_us":981},{"last_local_address":"127.0.0.1:54909","last_local_id":"00000000a5744edb/9e327d4953bf9c29","last_remote_address":"localhost:11210","operation_id":"0x15cf4","operation_name":"get","server_us":0,"total_us":996},{"last_local_address":"127.0.0.1:54909","last_local_id":"00000000a5744edb/9e327d4953bf9c29","last_remote_address":"localhost:11210","operation_id":"0x1136b","operation_name":"get","server_us":0,"total_us":1042},{"last_local_address":"127.0.0.1:54909","last_local_id":"00000000a5744edb/9e327d4953bf9c29","last_remote_address":"localhost:11210","operation_id":"0x1136c","operation_name":"get","server_us":0,"total_us":1394}]} |
OperationResult<rc=0x0, key=u'key', cas=0x152f8fe62b8b0000, tracing_context=0, is_tracing_stub=False>
Looks fine to me, and roughly equivalent to the PHP output.
Ellis Breen I think Matt was trying to get Threshold Logging output, not orphaned responses.
Also, what is the current situation regarding flushing of the collected spans on a time interval?
I can't see any 'over-threshold' logs in the PHP output though. I managed to get the threshold warnings in the test I linked above, e.g.
[4141233127] Operations over threshold: {"count":99,"service":"kv","top":[{"last_local_address":"192.168.99.1:49414","last_local_id":"00000000f6d633e7/1816c785a1ee1ffe","last_remote_address":"192.168.99.1:49398","operation_id":"0x1ca4","operation_name":"upsert","server_us":19,"total_us":8898},{"last_local_address":"192.168.99.1:49414","last_local_id":"00000000f6d633e7/1816c785a1ee1ffe","last_remote_address":"192.168.99.1:49398","operation_id":"0x1ca7","operation_name":"upsert","server_us":8,"total_us":10334},{"last_local_address":"192.168.99.1:49414","last_local_id":"00000000f6d633e7/1816c785a1ee1ffe","last_remote_address":"192.168.99.1:49398","operation_id":"0x1caa","operation_name":"upsert","server_us":6,"total_us":11273},{"last_local_address":"192.168.99.1:49414","last_local_id":"00000000f6d633e7/1816c785a1ee1ffe","last_remote_address":"192.168.99.1:49398","operation_id":"0x1cb0","operation_name":"upsert","server_us":6,"total_us":11746},{"last_local_address":"192.168.99.1:49414","last_local_id":"00000000f6d633e7/1816c785a1ee1ffe","last_remote_address":"192.168.99.1:49398","operation_id":"0x1cb4","operation_name":"upsert","server_us":3,"total_us":12232},{"last_local_address":"192.168.99.1:49414","last_local_id":"00000000f6d633e7/1816c785a1ee1ffe","last_remote_address":"192.168.99.1:49398","operation_id":"0x1cb6","operation_name":"upsert","server_us":6,"total_us":12750},{"last_local_address":"192.168.99.1:49414","last_local_id":"00000000f6d633e7/1816c785a1ee1ffe","last_remote_address":"192.168.99.1:49398","operation_id":"0x1cb9","operation_name":"upsert","server_us":3,"total_us":13241},{"last_local_address":"192.168.99.1:49414","last_local_id":"00000000f6d633e7/1816c785a1ee1ffe","last_remote_address":"192.168.99.1:49398","operation_id":"0x1cbd","operation_name":"upsert","server_us":3,"total_us":13687},{"last_local_address":"192.168.99.1:49414","last_local_id":"00000000f6d633e7/1816c785a1ee1ffe","last_remote_address":"192.168.99.1:49398","operation_id":"0x1cbf","operation_name":"upsert","server_us":3,"total_us":14292},{"last_local_address":"192.168.99.1:49414","last_local_id":"00000000f6d633e7/1816c785a1ee1ffe","last_remote_address":"192.168.99.1:49398","operation_id":"0x1cc4","operation_name":"upsert","server_us":3,"total_us":14764},{"last_local_address":"192.168.99.1:49414","last_local_id":"00000000f6d633e7/1816c785a1ee1ffe","last_remote_address":"192.168.99.1:49398","operation_id":"0x1cc8","operation_name":"upsert","server_us":3,"total_us":15346},{"last_local_address":"192.168.99.1:49414","last_local_id":"00000000f6d633e7/1816c785a1ee1ffe","last_remote_address":"192.168.99.1:49398","operation_id":"0x1cc9","operation_name":"upsert","server_us":3,"total_us":16046},{"last_local_address":"192.168.99.1:49414","last_local_id":"00000000f6d633e7/1816c785a1ee1ffe","last_remote_address":"192.168.99.1:49398","operation_id":"0x1ccd","operation_name":"upsert","server_us":3,"total_us":16602},{"last_local_address":"192.168.99.1:49414","last_local_id":"00000000f6d633e7/1816c785a1ee1ffe","last_remote_address":"192.168.99.1:49398","operation_id":"0x1cd7","operation_name":"upsert","server_us":3,"total_us":17054},{"last_local_address":"192.168.99.1:49414","last_local_id":"00000000f6d633e7/1816c785a1ee1ffe","last_remote_address":"192.168.99.1:49398","operation_id":"0x1cdb","operation_name":"upsert","server_us":6,"total_us":17622},{"last_local_address":"192.168.99.1:49414","last_local_id":"00000000f6d633e7/1816c785a1ee1ffe","last_remote_address":"192.168.99.1:49398","operation_id":"0x1cdf","operation_name":"u
psert","server_us":3,"total_us":18123},{"last_local_address":"192.168.99.1:49414","last_local_id":"00000000f6d633e7/1816c785a1ee1ffe","last_remote_address":"192.168.99.1:49398","operation_id":"0x1ce8","operation_name":"upsert","server_us":3,"total_us":18531},{"last_local_address":"192.168.99.1:49414","last_local_id":"00000000f6d633e7/1816c785a1ee1ffe","last_remote_address":"192.168.99.1:49398","operation_id":"0x1cef","operation_name":"upsert","server_us":3,"total_us":18987},{"last_local_address":"192.168.99.1:49414","last_local_id":"00000000f6d633e7/1816c785a1ee1ffe","last_remote_address":"192.168.99.1:49398","operation_id":"0x1cf4","operation_name":"upsert","server_us":6,"total_us":19523},{"last_local_address":"192.168.99.1:49414","last_local_id":"00000000f6d633e7/1816c785a1ee1ffe","last_remote_address":"192.168.99.1:49398","operation_id":"0x1cfc","operation_name":"upsert","server_us":3,"total_us":19894},{"last_local_address":"192.168.99.1:49414","last_local_id":"00000000f6d633e7/1816c785a1ee1ffe","last_remote_address":"192.168.99.1:49398","operation_id":"0x1cfe","operation_name":"upsert","server_us":3,"total_us":20497},{"last_local_address":"192.168.99.1:49414","last_local_id":"00000000f6d633e7/1816c785a1ee1ffe","last_remote_address":"192.168.99.1:49398","operation_id":"0x1cff","operation_name":"upsert","server_us":3,"total_us":21137},{"last_local_address":"192.168.99.1:49414","last_local_id":"00000000f6d633e7/1816c785a1ee1ffe","last_remote_address":"192.168.99.1:49398","operation_id":"0x1d01","operation_name":"upsert","server_us":2,"total_us":21838},{"last_local_address":"192.168.99.1:49414", |
...
"total_us":69233},{"last_local_address":"192.168.99.1:49415","last_local_id":"00000000f6d633e7/1d49399a10bf3c19","last_remote_address":"192.168.99.1:49399","operation_id":"0x1cf7","operation_name":"upsert","server_us":6,"total_us":69619},{"last_local_address":"192.168.99.1:49415","last_local_id":"00000000f6d633e7/1d49399a10bf3c19","last_remote_address":"192.168.99.1:49399","operation_id":"0x1cfa","operation_name":"upsert","server_us":3,"total_us":70130},{"last_local_address":"192.168.99.1:49415","last_local_id":"00000000f6d633e7/1d49399a10bf3c19","last_remote_address":"192.168.99.1:49399","operation_id":"0x1cfb","operation_name":"upsert","server_us":3,"total_us":70714},{"last_local_address":"192.168.99.1:49415","last_local_id":"00000000f6d633e7/1d49399a10bf3c19","last_remote_address":"192.168.99.1:49399","operation_id":"0x1d03","operation_name":"upsert","server_us":3,"total_us":71103}]} (L:128) |
You can set the intervals as Matt has done above: https://issues.couchbase.com/browse/PYCBC-485?focusedCommentId=272810&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-272810
The default settings are in libcouchbase.h:
#define LCBTRACE_DEFAULT_ORPHANED_QUEUE_FLUSH_INTERVAL LCB_MS2US(1000)
#define LCBTRACE_DEFAULT_ORPHANED_QUEUE_SIZE 128
#define LCBTRACE_DEFAULT_THRESHOLD_QUEUE_FLUSH_INTERVAL LCB_MS2US(3000)
#define LCBTRACE_DEFAULT_THRESHOLD_QUEUE_SIZE 128
#define LCBTRACE_DEFAULT_THRESHOLD_KV LCB_MS2US(500)
#define LCBTRACE_DEFAULT_THRESHOLD_N1QL LCB_MS2US(1000)
#define LCBTRACE_DEFAULT_THRESHOLD_VIEW LCB_MS2US(1000)
#define LCBTRACE_DEFAULT_THRESHOLD_FTS LCB_MS2US(1000)
#define LCBTRACE_DEFAULT_THRESHOLD_ANALYTICS LCB_MS2US(1000)
So every 1 second for orphaned spans, every 3 seconds for threshold-exceeding spans.
Looks like Sergey has already issued a fix to flush the threshold logging tracer on destruction, so hopefully that will help. It did take a lot to get the buffer to flush before.
However, it may also be necessary to set the KV threshold and queue size/flush frequency on the Bucket as in the tests here:
https://github.com/couchbase/couchbase-python-client/blob/c6e25f42ecb5d280988e103389275e6ab2487e6e/couchbase/tests/base.py#L547
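For instance, a sketch of what that might look like, using the attribute naming pattern from the comments above. TRACING_THRESHOLD_KV and its units are assumptions here; check the linked tests for the exact spelling and usage.
# Sketch: lower the KV threshold and make the threshold queue flush aggressively
# (TRACING_THRESHOLD_KV and its units are assumed to follow the pattern above;
# see the linked base.py tests for the exact usage)
bucket.TRACING_THRESHOLD_KV = 0.0001                 # report KV ops slower than ~100us
bucket.TRACING_THRESHOLD_QUEUE_SIZE = 10             # flush once 10 spans are queued
bucket.TRACING_THRESHOLD_QUEUE_FLUSH_INTERVAL = 5    # ...or every 5 seconds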
I'm afraid I haven't got around to documenting these yet, but will do so ASAP - https://issues.couchbase.com/browse/PYCBC-486