Details
Type: Bug
Resolution: Fixed
Priority: Critical
Affects Version/s: 5.5.5, 6.0.0, 6.0.1, 6.0.2, 6.5.0
Triage: Untriaged
Is this a Regression?: Unknown
Sprint: KV-Engine MH 2nd Beta
Description
Summary
During testing of the fix for MB-35629, using pillowfight to drive a steady workload, I observed that the op/s for SETs with durability level=majority would drop from the normal rate of ~15,000 op/s to 0 for a brief (1-2s) period, then recover.
Steps to Reproduce
- Start a two-node cluster with a single replica:
./cluster_run --nodes=2
./cluster_connect -n2
- Start pillowfight performing level=majority SyncWrites:
cbc-pillowfight -U couchbase://127.0.0.1:12000/default -u Administrator -P asdasd -B1 -I 20000 -M 200 -d majority -t 16 -r 50
Expected Results
Op/s should be stable, with no significant deviation.
Actual Results
Op/s drops to zero at seemingly random intervals, around 1-5 times a minute.
From some initial analysis it appears that slow ops are correlated with the overall sync_write duration histogram recording large (31ms ... 2s) samples:
sync_write_commit_majority (310752779 total)
    0us - 34us    : (  0.0000%) 1
    ...
    31ms - 2031ms : (100.0000%) 12936    <<<<<< this count increments...
    Avg           : ( 2349us)
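One way to watch that tail bucket increment during the run is to poll the engine timings via cbstats. The following Python sketch is not part of this issue's tooling; it assumes cbstats is on the PATH, that node 0's KV port is 12000 as in the cluster_run setup above, the same Administrator/asdasd credentials, and that the timings output matches the excerpt above. It simply re-prints the sync_write_commit_majority section every 5 seconds.

import subprocess
import time

# Assumed invocation: cbstats <host:kv_port> timings, with -u/-p/-b for
# credentials and bucket (adjust to your environment).
CBSTATS_CMD = [
    "cbstats", "127.0.0.1:12000",
    "-u", "Administrator", "-p", "asdasd", "-b", "default",
    "timings",
]

def sync_write_section(output):
    # Collect lines from the sync_write_commit_majority header up to the next
    # histogram header (assumes headers contain "total)" as in the excerpt).
    section, in_section = [], False
    for line in output.splitlines():
        if "sync_write_commit_majority" in line:
            in_section = True
        elif in_section and "total)" in line:
            break
        if in_section:
            section.append(line)
    return "\n".join(section)

while True:
    result = subprocess.run(CBSTATS_CMD, capture_output=True, text=True)
    print(sync_write_section(result.stdout))
    print("-" * 60)
    time.sleep(5)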
Additionally, these ops manifest in the log file as "slow" ops - for example:
2019-08-22T10:25:18.198515+01:00 WARNING 162: Slow operation. {"cid":"d6b7fea79af216fa/8dc6666b7f312ffb/a6812000","duration":"2002 ms","trace":"request=3730548982010515:2002149 store=3730548982030238:37 sync_write.ack_local=3730548982065966:0 sync_write.ack_remote=3730550982026051:0 sync_write.prepare=3730548982058643:2001970 store=3730550984148587:1","command":"SET","peer":"127.0.0.1:59822","bucket":"default"}
Post-processing with the slow_ops_2_gtrace.py script produces a Google Trace chart (slow_ops.json), which suggests that the slowdown is caused by it taking ~2s to receive/process the seqno_ack from the replica. Interestingly, multiple mutations suffer from this problem at the same time.
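For reference, the post-processing is roughly of this shape. The sketch below is not the actual slow_ops_2_gtrace.py; it is a minimal illustration assuming the "trace" field is a whitespace-separated list of name=start:duration pairs with both values in microseconds, converted to complete ("ph":"X") events in the Chrome/Google Trace Event format.

import json
import sys

def slow_op_to_events(log_line):
    # Parse the JSON payload that follows "Slow operation." in the log line.
    payload = json.loads(log_line[log_line.index("{"):])
    events = []
    for span in payload["trace"].split():
        name, timing = span.split("=")
        start_us, dur_us = (int(x) for x in timing.split(":"))
        events.append({
            "name": name,
            "ph": "X",       # complete event (has both start and duration)
            "ts": start_us,  # Trace Event timestamps are in microseconds
            "dur": dur_us,
            "pid": 1,
            "tid": 1,
            "args": {"cid": payload.get("cid")},
        })
    return events

if __name__ == "__main__":
    # Usage: python slow_ops_sketch.py memcached.log > slow_ops.json
    events = []
    with open(sys.argv[1]) as f:
        for line in f:
            if "Slow operation" in line:
                events.extend(slow_op_to_events(line))
    json.dump({"traceEvents": events}, sys.stdout, indent=2)

Loading the resulting JSON in chrome://tracing lays the spans (store, sync_write.prepare, sync_write.ack_local/ack_remote) out on a common timeline; in the sample line above, sync_write.prepare covers essentially the whole ~2s, consistent with the commit waiting on the replica's seqno ack.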
Artefacts