  Couchbase Server / MB-25886

memcached performance can be improved by reducing access to shared statistics.


Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: 6.5.0
    • Affects Version/s: 5.0.0
    • Component/s: memcached
    • Environment: Intel Platinum 8180 Processor (Skylake) - 2 sockets, 56 threads per socket.

    Description

      On Intel's Purley platform (2 sockets, 56 hardware threads per socket), we installed a one-node Couchbase cluster with version 5.0.0-3358. The node runs only the data service.

      We populated a bucket using the cbc-pillowfight command. The bucket has 100M items, each with a 32-byte value.

      cbc-pillowfight --password password --batch-size 1000 --num-items 20000000 --num-threads 25 --min-size 32 --max-size 32 --spec couchbase://192.168.23.4/default --populate-only
      

      Then we ran a read-only test using the following cbc-pillowfight command with a batch size of 1000.

      cbc-pillowfight --password password --batch-size 1000 --num-items 20000000 --num-threads 5 --min-size 32 --max-size 32 --spec couchbase://192.168.23.4/default --set-pct 0 --num-cycles 200000 --no-population > /dev/null
      

      We can achieve 170K reads per second using one client thread (the corresponding memcached thread consumes about 80% of one core). As we increase the number of parallel threads in cbc-pillowfight, CPU usage increases linearly with the thread count, but throughput does not: we achieve 1.6M reads per second with 10 threads and 4M reads per second with 30 threads. Beyond that, the achievable throughput decreases; 60 threads (30 threads from each of 2 client nodes) can only achieve 3.4M reads per second.

      We captured stack snapshots with gstack and CPU profiles with the perf command, and identified the following bottlenecks:

      update_topkeys internally uses a mutex and introduces lock contention. We hacked the code and disabled update_topkeys by setting every entry of the topkey_commands array to false (we modified the get_mcbp_topkeys function to clear the update-topkeys flags). With this change, we can achieve a maximum throughput of 5M reads per second.
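
      For reference, the hack amounts to something like the following sketch. The names topkey_commands and get_mcbp_topkeys are taken from the description above; the real memcached source is organised differently, so this is illustrative only.

      #include <array>

      // Illustrative sketch: one flag per binary-protocol opcode controlling
      // whether update_topkeys() (and its mutex) is invoked for that command.
      static std::array<bool, 0x100> topkey_commands{};

      // The experiment: clear every flag so the topkeys mutex is never taken
      // on the hot read path.
      static void get_mcbp_topkeys() {
          topkey_commands.fill(false);
      }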

      We also increased the vbucket count from 1024 to 10240; this reduced some blocking but did not improve throughput much.

      By analyzing the perf profile of the memcached worker threads, we found that the CPU usage of the following call stacks grows substantially as the number of parallel client threads increases:

      all_buckets[bucketid].timings.collect, called from the mcbp_collect_timings function.

      fetch_add on the engine's stats.numItems, called from the ObjectRegistry::onDeleteItem function.

      fetch_add on the engine's stats.numItems, called from the ObjectRegistry::onCreateItem function.

      We commented out all three accesses to global objects; this increased the throughput from 5M reads per second to 7M reads per second.
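
      Conceptually, the second and third of these hotspots are a single process-wide counter that every worker thread updates on every item create/delete, so all threads contend on the same cache line. A rough illustration (not the actual memcached code):

      #include <atomic>
      #include <cstddef>

      // Illustrative only: one shared counter touched by every worker thread.
      struct EngineStats {
          std::atomic<std::size_t> numItems{0};
      };

      void onCreateItem(EngineStats& stats) {
          stats.numItems.fetch_add(1, std::memory_order_relaxed);
      }

      // onDeleteItem is symmetric; the direction of the update does not matter,
      // the cost comes from many cores doing an atomic RMW on one cache line.
      void onDeleteItem(EngineStats& stats) {
          stats.numItems.fetch_sub(1, std::memory_order_relaxed);
      }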

      The test above indicates that reducing access to global objects in memcached could increase memcached throughput in read-only scenarios.

      After the test, I created a sample program to benchmark the performance of an atomic variable on the Intel Purley platform with different thread counts. With this program (attached), I can only achieve 1B atomic fetch_add operations in 10 seconds when using 1 thread. The throughput drops to roughly half once 2 or more threads are used (with all threads running on cores of the same socket). When I force the threads to run on 2 different sockets, the throughput drops further, to 1B operations every 30 to 40 seconds. This means that, using an atomic variable on an Intel Purley system, the worst case is only about 25M to 40M atomic fetch_adds per second.
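
      The attached program is not reproduced here; a minimal sketch of an equivalent benchmark (assuming std::atomic and std::thread, with socket placement controlled externally, e.g. via numactl or taskset) could look like this:

      #include <atomic>
      #include <chrono>
      #include <cstdint>
      #include <cstdlib>
      #include <iostream>
      #include <thread>
      #include <vector>

      int main(int argc, char** argv) {
          const int num_threads = argc > 1 ? std::atoi(argv[1]) : 1;
          const std::uint64_t iterations = 1'000'000'000 / num_threads;  // ~1B fetch_adds in total

          std::atomic<std::uint64_t> counter{0};
          std::vector<std::thread> threads;

          const auto start = std::chrono::steady_clock::now();
          for (int t = 0; t < num_threads; ++t) {
              threads.emplace_back([&counter, iterations] {
                  for (std::uint64_t i = 0; i < iterations; ++i) {
                      counter.fetch_add(1, std::memory_order_relaxed);
                  }
              });
          }
          for (auto& th : threads) {
              th.join();
          }
          const double seconds = std::chrono::duration<double>(
                  std::chrono::steady_clock::now() - start).count();

          std::cout << counter.load() << " fetch_adds in " << seconds << " s\n";
          return 0;
      }

      Running it pinned to a single socket versus spread across both sockets (for example with numactl --cpunodebind) reproduces the cross-socket slowdown described above.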

      If we use thread-local variables instead of an atomic variable, the throughput is about 8 times higher in the single-thread case and increases linearly with the thread count.
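
      For illustration, here is a sketch of the thread-local approach (the function names mirror the description, not the eventual fix in memcached): each worker thread updates a private slot, and the infrequent stats reader sums the slots, so the hot path never bounces a shared cache line between cores.

      #include <atomic>
      #include <cstdint>
      #include <mutex>
      #include <vector>

      namespace sharded_num_items {

      struct Slot {
          // Written by one thread only, read occasionally by the stats reader.
          std::atomic<std::int64_t> value{0};
      };

      std::mutex registry_mutex;
      std::vector<Slot*> slots;  // one slot per worker thread that has used the counter

      Slot& localSlot() {
          thread_local Slot* slot = [] {
              auto* s = new Slot();  // never freed; acceptable for a sketch
              std::lock_guard<std::mutex> guard(registry_mutex);
              slots.push_back(s);
              return s;
          }();
          return *slot;
      }

      // Hot path: touches only the calling thread's slot, so no contention.
      void onCreateItem() { localSlot().value.fetch_add(1, std::memory_order_relaxed); }
      void onDeleteItem() { localSlot().value.fetch_sub(1, std::memory_order_relaxed); }

      // Called rarely, e.g. when a "stats" request arrives.
      std::int64_t numItems() {
          std::lock_guard<std::mutex> guard(registry_mutex);
          std::int64_t sum = 0;
          for (const auto* s : slots) {
              sum += s->value.load(std::memory_order_relaxed);
          }
          return sum;
      }

      }  // namespace sharded_num_items

      Each slot remains an atomic so the reader can load it without a data race, but since only one thread writes a given slot there is no cache-line ping-pong; padding each slot to its own cache line (for example with alignas(64)) would also avoid false sharing between neighbouring slots.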

       


          People

            Assignee: Unassigned
            Reporter: Hui Wang (Inactive)