Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Fix Version(s): 6.5.0, 6.0.3
- Triaged
- Unknown
- Sprint: KV-Engine Mad-Hatter GA
Description
Summary
During investigation of SyncWrite persistTo performance, it was observed that the op/s for a single-threaded persist_majority workload would drop to zero for a number of seconds, recovering after 10s.
Steps to Reproduce
- Start a two-node cluster using cluster_run:
./cluster_run -n2 --dont-rename
./cluster_connect -n2
- Drive with a SyncWrite workload (SyncWrites are not strictly necessary, but they make the bug very apparent, as a delay in the flusher causes op/s to drop to zero). See the attached sync_repl.py script:
./engines/ep/management/sync_repl.py localhost:12000 Administrator asdasd default loop_setD key value 100000 3
- Observe op/s in the UI.
Expected Results
Op/s should be broadly constant, with only small (sub-second) drops when compaction runs.
Actual Results
Op/s drops to zero for a number of seconds at a time.
During such "blips", "Slow operation" log messages were observed; each time, the slow operation took very close to 4 or 8 seconds:
$ grep "Slow operation" logs/n_0/memcached.log.00000*
logs/n_0/memcached.log.000000.txt:2019-10-08T09:51:14.753174+01:00 WARNING 48: Slow operation. {"cid":"[::1]:50067/b2034b5a","duration":"4003 ms","trace":"request=378423659125160:4003656 store=378423659139440:15 sync_write.ack_remote=378423660320419:0 sync_write.ack_local=378427662099819:0 sync_write.prepare=378423659147291:4002977 store=378427662773987:0","command":"SET","peer":"[::1]:50067","bucket":"default"}
logs/n_0/memcached.log.000000.txt:2019-10-08T09:51:23.697168+01:00 WARNING 48: Slow operation. {"cid":"[::1]:50067/63159010","duration":"4005 ms","trace":"request=378432601417239:4005461 store=378432601426289:20 sync_write.ack_remote=378432603242164:0 sync_write.ack_local=378436606766192:0 sync_write.prepare=378432601435464:4005363 store=378436606870788:1","command":"SET","peer":"[::1]:50067","bucket":"default"}
48: Slow operation. {"cid":"[::1]:50312/747df525","duration":"8005 ms","trace":"request=379458120957691:8005773 store=379458120963442:16 sync_write.ack_local=379458121688761:0 sync_write.ack_remote=379466126503925:0 sync_write.prepare=379458120968581:8005592 store=379466126719482:1","command":"SET","peer":"[::1]:50312","bucket":"default"}
Attachments
Issue Links
- relates to MB-36249: SyncWrite throughput for persistence-related levels is ~30% slower than OBSERVE-based durability (Closed)
For Gerrit Dashboard: MB-36380

| # | Subject | Branch | Project | Status | CR | V |
|---|---|---|---|---|---|---|
| 116083,4 | MB-36380 [1/2]: Regression test for missing Flusher wakeup | master | kv_engine | MERGED | +2 | +1 |
| 116084,5 | MB-36380 [2/2]: Fix missing Flusher wakeup | master | kv_engine | MERGED | +2 | +1 |
Note this is much more pronounced when operating on a single vBucket (as is the case with the above script).
The problem is a race between the Flusher completing a flush iteration and deciding whether it needs to snooze, and a front-end thread notifying it that there is a new mutation to be flushed:
Flusher::step(task) {
    ...
    flushVB();
    // *** notifyFlushEvent() called here by a front-end thread ***
    task->snooze(tosleep);    // tosleep previously computed by computeMinSleepTime()
    ...
}
Consider the scenario where computeMinSleepTime() has returned a non-zero value (i.e. there is currently no pending work), but immediately before task->snooze() is called, notifyFlushEvent() is called by a front-end thread. That wakes up the task; however, the snooze() immediately afterwards effectively cancels the wake-up, and the task is put to sleep anyway.
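To make the window concrete, below is a minimal, self-contained C++ sketch. The TaskModel name and its wake()/snooze() methods are hypothetical and merely model the scheduler semantics described above, where both calls simply rewrite the task's waketime: a wake() that lands just before a snooze() is silently overwritten, so the pending mutation sits unflushed until the snooze expires.

#include <chrono>
#include <iostream>
#include <mutex>

using Clock = std::chrono::steady_clock;

// Hypothetical model of a scheduled task: wake() and snooze() both just
// rewrite the task's waketime, which is what makes the race possible.
class TaskModel {
public:
    // Front-end thread: "run as soon as possible".
    void wake() {
        std::lock_guard<std::mutex> lh(mutex);
        waketime = Clock::now();
    }

    // Flusher thread: "no pending work seen, sleep for `secs` seconds".
    void snooze(double secs) {
        std::lock_guard<std::mutex> lh(mutex);
        waketime = Clock::now() +
                   std::chrono::duration_cast<Clock::duration>(
                           std::chrono::duration<double>(secs));
    }

    // Seconds until the scheduler would next run this task.
    double secondsUntilRun() const {
        std::lock_guard<std::mutex> lh(mutex);
        return std::chrono::duration<double>(waketime - Clock::now()).count();
    }

private:
    mutable std::mutex mutex;
    Clock::time_point waketime = Clock::now();
};

int main() {
    TaskModel flusherTask;

    // Flusher finished a flush batch; computeMinSleepTime() returned 10s.
    const double tosleep = 10.0;

    // Front-end thread queues a new mutation and notifies *now*...
    flusherTask.wake();

    // ...but the Flusher's snooze() runs next and overwrites that wake-up.
    flusherTask.snooze(tosleep);

    // The new mutation now waits ~10s before the Flusher runs again.
    std::cout << "Flusher runs again in ~" << flusherTask.secondsUntilRun()
              << "s despite pending work\n";
    return 0;
}

Running this prints a remaining wait of roughly 10 seconds, despite the wake() having been issued while there was work to do.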
If no additional notifyFlushEvent() calls occur, the Flusher does not run to flush the outstanding work until the snooze() expires, which can be up to 10s later.
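The merged change ("Fix missing Flusher wakeup", see the Gerrit patches above) closes this window; its exact mechanism isn't reproduced here, but a common pattern for this class of bug is sketched below under the same hypothetical naming: the notifier sets a flag before waking the task, and the Flusher re-checks that flag immediately after snoozing, cancelling the sleep if a notification raced in.

#include <atomic>

// Hypothetical helper: front-end threads record that new work arrived, and
// the Flusher consumes that record when deciding whether its snooze stands.
class FlushSignal {
public:
    // Front-end thread: mark pending work *before* waking the task, so the
    // Flusher's re-check below cannot miss it.
    void notify() {
        pending.store(true, std::memory_order_release);
    }

    // Flusher thread: returns true (and clears the flag) if a notification
    // arrived since the last check.
    bool consumePending() {
        return pending.exchange(false, std::memory_order_acq_rel);
    }

private:
    std::atomic<bool> pending{false};
};

// Sketch of how Flusher::step() could use it (pseudo-usage, not the actual
// kv_engine code):
//
//     flushVB();
//     task->snooze(tosleep);
//     if (signal.consumePending()) {
//         // A notifyFlushEvent() raced with the decision to sleep: wake the
//         // task again (however the scheduler exposes that) so the new
//         // mutation is flushed promptly instead of waiting out the snooze.
//     }

Re-checking after the snooze, rather than before, is what makes this pattern race-free: any notification the check misses must have arrived after the snooze, so its wake-up is no longer cancelled.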