Details
- Bug
- Resolution: Fixed
- Critical
- 5.0.0
- Untriaged
- Unknown
Description
Steps to Reproduce
- Cluster of 4 nodes with a reasonable spec (>=8 cores, SSD, fast network)
- Create one large (>10GB) bucket with a 10% resident ratio
- Run a client workload (e.g. pillowfight) performing ~20K writes/sec
- Rebalance a 5th node into the cluster
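The write workload in the steps above could be scripted roughly as follows. This is a minimal sketch, not the actual pillowfight tool: `client_set` is a hypothetical stand-in for whatever KV client's set/upsert call is available, and the rate limiter simply paces calls to the target ops/sec.

```python
import time

class RateLimiter:
    """Simple pacing limiter used to cap the workload at a target ops/sec."""
    def __init__(self, ops_per_sec):
        self.interval = 1.0 / ops_per_sec
        self.next_slot = time.monotonic()

    def acquire(self):
        # Block until the next scheduled slot; slots advance by a fixed interval.
        now = time.monotonic()
        wait = self.next_slot - now
        self.next_slot = max(self.next_slot, now) + self.interval
        if wait > 0:
            time.sleep(wait)

def run_workload(client_set, ops_per_sec=20_000, num_ops=100_000,
                 value=b"x" * 1024):
    """Drive ~ops_per_sec writes against the bucket via the supplied client.

    client_set(key, value) is a hypothetical client callable, not a real API.
    Keys wrap around a fixed set so the working set stays bounded.
    """
    limiter = RateLimiter(ops_per_sec)
    for i in range(num_ops):
        limiter.acquire()
        client_set(f"key-{i % 50_000}", value)
```

Running this against the 4-node cluster while rebalancing in the 5th node should reproduce the TMP_OOM spike described below, assuming the bucket is already populated to the 10% resident ratio.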
The above steps can result in the client receiving a significant spike in TMP_OOMs from the incoming node during the rebalance, degrading the application's performance. The suspected cause is that DCP replication streams from the existing nodes can quickly saturate the memory on the incoming node. The item pager is either not successfully invoked or cannot eject items quickly enough; the conjecture is that it may require several passes before it finds items with a sufficient LRU value to eject.
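Until the server side is fixed, clients typically survive a TMP_OOM spike by retrying the failed write with exponential backoff. A minimal sketch, where `TmpOomError` and `client_set` are hypothetical stand-ins for whatever temporary-failure error and set call the client library actually exposes:

```python
import random
import time

class TmpOomError(Exception):
    """Stand-in for the temporary-OOM failure a KV client surfaces."""

def set_with_retry(client_set, key, value, max_attempts=8,
                   base_delay=0.001, max_delay=0.5):
    """Retry a write on TMP_OOM with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return client_set(key, value)
        except TmpOomError:
            if attempt == max_attempts - 1:
                raise  # give up after max_attempts
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Jitter spreads retries out so clients don't retry in lockstep.
            time.sleep(delay * (0.5 + random.random() / 2))
```

This only masks the symptom at the client; the server-side suggestions below address the cause.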
The desired behaviour is that the client application is essentially unaffected by the rebalance. This could be achieved in a number of ways; the following are merely suggestions to get the ball rolling:
- Change the relative priority of the ItemPager and DCP Processor tasks (currently the processor has the higher priority).
- Run the item pager more aggressively - note that it is not currently triggered by SET_WITH_META (which the DCP consumer uses).
- Initialise the items on the incoming node with a different LRU value that allows them to be ejected on the first pass of the item pager.
- Incorporate a more sophisticated throttle/backoff on the DCP stream when the HWM is reached, so that frontend client ops have greater priority.
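The last suggestion could be sketched as a simple hysteresis gate on the DCP consumer: stop processing stream mutations once memory usage crosses the HWM, and resume only after the item pager has brought usage back below a lower threshold. The class and threshold names here are illustrative, not ep-engine's actual ones:

```python
class DcpThrottle:
    """Sketch of the suggested DCP backoff.

    Pauses stream processing while memory usage sits above the high-water
    mark (HWM) and resumes once usage drops below a lower resume threshold.
    The gap between the two thresholds (hysteresis) prevents the stream
    from rapidly oscillating between paused and running.
    """
    def __init__(self, quota_bytes, hwm_ratio=0.85, resume_ratio=0.75):
        self.hwm = quota_bytes * hwm_ratio
        self.resume_below = quota_bytes * resume_ratio
        self.paused = False

    def should_process(self, mem_used):
        """Return True if the consumer may apply the next DCP mutation."""
        if self.paused:
            if mem_used < self.resume_below:
                self.paused = False  # pager has freed enough memory
        elif mem_used > self.hwm:
            self.paused = True  # back off; let frontend ops and pager run
        return not self.paused
```

While paused, the consumer would stop acking DCP buffer bytes, causing the producer's flow control to naturally throttle the stream; frontend sets would still be admitted up to the bucket quota.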