Couchbase Server - MB-39422

Reduce front-end thread costs when we hit the high watermark (memoryCondition)


Details

    Description

      As observed when testing data load / rebalance workloads under high DGM, when KV-Engine hits the high watermark, a large amount of work is done on the front-end thread.

      Specifically, the attached profile shows that 60% of each active front-end thread's time is spent inside memoryCondition(), which just decides whether memory recovery should be attempted or not (it doesn't actually recover any memory).

      Note that every front-end thread is wasting 60% of its time this way; the aggregate cost across all front-end threads is correspondingly larger.

      Analysis of the profile highlights a number of issues:

      1. Excessive time spent in VBucketCountVisitor::visitBucket. 98% of all time in memoryCondition is spent in VBucketCountVisitor::visitBucket. This is called to calculate the number of resident items; if that count is non-zero, the ItemPager is woken up. However, VBucketCountVisitor::visitBucket actually accumulates ~50 or so stats, many of which are more expensive than the item counts, so ~48 stats are calculated and then simply ignored (see the first sketch after this list).
      2. Excessive calls to memoryCondition(). Every time a client operation fails because not enough memory is available (i.e. at/above the high watermark), memoryCondition is called and performs the above expensive checks (see the second sketch after this list). This is very wasteful because:
        1. The ItemPager could already be running, and it cannot be re-scheduled until it has finished anyway.
        2. Another client thread could already be running memoryCondition.
      3. memoryCondition is arguably over-complex in what it is trying to do.
        1. Firstly, it is essentially doing two things at once: determining whether we should return ETMPFAIL or NOMEM to the user, and attempting to recover memory if possible.
        2. Secondly, the memory recovery logic is complex / brittle - we attempt to predict ahead of time whether memory could be recovered, with two possible approaches: paging out items, or closing unreferenced Checkpoints. However, I suspect the prediction isn't always correct in determining whether any more memory can be freed, as it relies on indirect metrics (such as the number of resident items), which could already be zero while memory is in use elsewhere.
          See also MB-22523, which has some relevant commentary on why the current design is the way it is - at least partly a result of making minimal fixes for an issue late in the 5.0.0 development cycle.
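
      To illustrate point 1, a cut-down visitor could accumulate only the resident-item count that the wake-up decision actually needs. The following is a minimal C++ sketch; the VBucket accessors and the visitor shape are simplified stand-ins for the real ep-engine classes, not their actual API.

      // Sketch only: simplified stand-in for the real ep-engine VBucket.
      #include <cstddef>
      #include <vector>

      struct VBucket {
          // Hypothetical accessors; the real VBucket exposes similar counters.
          size_t getNumItems() const { return numItems; }
          size_t getNumNonResidentItems() const { return numNonResident; }
          size_t numItems = 0;
          size_t numNonResident = 0;
      };

      // Accumulates *only* the resident item count, instead of the ~50 stats
      // which VBucketCountVisitor gathers and memoryCondition then discards.
      class ResidentItemCountVisitor {
      public:
          void visitBucket(const VBucket& vb) {
              residentItems += vb.getNumItems() - vb.getNumNonResidentItems();
          }
          size_t getResidentItems() const { return residentItems; }

      private:
          size_t residentItems = 0;
      };

      // memoryCondition() would then only need this cheap check to decide
      // whether waking the ItemPager could plausibly free any memory.
      bool pagingCouldFreeMemory(const std::vector<VBucket>& vbuckets) {
          ResidentItemCountVisitor visitor;
          for (const auto& vb : vbuckets) {
              visitor.visitBucket(vb);
          }
          return visitor.getResidentItems() > 0;
      }

      The point is not the exact shape of the visitor, but that the wake-up decision only needs one counter, so roughly 48 of the stats currently computed per vBucket could be skipped on this path.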
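
      To illustrate point 2, the redundant invocations could be short-circuited so that at most one front-end thread performs the check at a time, and no check is performed while the ItemPager is already scheduled. Again this is only a sketch under assumed names (MemoryRecoveryTrigger, pagingCouldHelp, wakeItemPager); it is not the actual KV-Engine implementation.

      #include <atomic>

      class MemoryRecoveryTrigger {
      public:
          // Called from a front-end thread each time an operation fails with
          // a temporary-OOM condition (at/above the high watermark).
          void onTempOom() {
              // If the pager is already scheduled/running there is nothing
              // useful to do; it cannot be re-scheduled until it finishes.
              if (pagerActive.load(std::memory_order_acquire)) {
                  return;
              }
              // Let exactly one thread through to perform the (still
              // non-trivial) "could paging help?" check and wake-up.
              bool expected = false;
              if (!checkInProgress.compare_exchange_strong(
                          expected, true, std::memory_order_acq_rel)) {
                  return; // another front-end thread is already checking
              }
              if (pagingCouldHelp()) {
                  pagerActive.store(true, std::memory_order_release);
                  wakeItemPager();
              }
              checkInProgress.store(false, std::memory_order_release);
          }

          // Called by the ItemPager task when it completes.
          void onPagerComplete() {
              pagerActive.store(false, std::memory_order_release);
          }

      private:
          // Placeholders for the real "would paging free memory?" check and
          // the ItemPager task wake-up.
          bool pagingCouldHelp() { return true; }
          void wakeItemPager() {}

          std::atomic<bool> pagerActive{false};
          std::atomic<bool> checkInProgress{false};
      };

      With a guard like this, the expensive check runs at most once per pager cycle, regardless of how many front-end threads hit the high watermark concurrently.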
