Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Affects Version: 6.5.0
- Triage: Untriaged
- Environment: CentOS 64-bit
- Is this a Regression?: Yes
- Sprint: KV-Engine Mad-Hatter GA
Description
There is a ~15% throughput decrease in "Avg Throughput (ops/sec), Workload A, 3 nodes, 12 vCPU, replicateTo=1". The regression was introduced in build 4723: http://172.23.123.43:8000/getchangelog?product=couchbase-server&fromb=6.5.0-4722&tob=6.5.0-4723
4722:
http://perf.jenkins.couchbase.com/job/hebe/4973/ - 118370 ops/sec
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-4973/172.23.100.190.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-4973/172.23.100.191.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-4973/172.23.100.192.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-4973/172.23.100.193.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-4973/172.23.100.204.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-4973/172.23.100.205.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-4973/172.23.100.206.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-4973/172.23.100.207.zip
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=hebe_650-4722_access_5f2e
4723:
http://perf.jenkins.couchbase.com/job/hebe/4977/ - 102525 ops/sec
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-4977/172.23.100.190.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-4977/172.23.100.191.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-4977/172.23.100.192.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-4977/172.23.100.193.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-4977/172.23.100.204.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-4977/172.23.100.205.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-4977/172.23.100.206.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-4977/172.23.100.207.zip
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=hebe_650-4723_access_0f9c
Comparison:
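As a quick sanity check on the pair of runs above (a sketch; the throughput numbers are the ones from the Jenkins links, and this single pair of runs gives ~13.4%, in line with the ~15% reported across runs):

{code:python}
# Relative throughput drop between the two runs linked above.
before = 118370  # hebe/4973, build 6.5.0-4722 (ops/sec)
after = 102525   # hebe/4977, build 6.5.0-4723 (ops/sec)

drop = (before - after) / before
print(f"Throughput drop: {drop:.1%}")  # ~13.4% for this pair of runs
{code}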
Korrigan Clark: Thanks for isolating when the performance changes appear.
The primary drop was in build 4774, when we increased the number of shards from 4 to $NUM_CPS. I've confirmed from the logs of run #5362 (via stats.log) that we correctly set the number of shards to 12:
ep_workload:num_shards: 12
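For reference, a minimal sketch of how this check can be automated against a collected log bundle (the stats.log file name matches the file quoted above; the exact path inside a bundle is an assumption to adjust per environment):

{code:python}
# Minimal sketch: extract the shard count from a stats.log file, matching
# the "ep_workload:num_shards: 12" line quoted above.
import re

PATTERN = re.compile(r"ep_workload:num_shards:\s*(\d+)")

def num_shards(stats_log_path: str) -> int:
    with open(stats_log_path) as f:
        for line in f:
            m = PATTERN.search(line)
            if m:
                return int(m.group(1))
    raise ValueError("ep_workload:num_shards not found in " + stats_log_path)

print(num_shards("stats.log"))  # expected: 12 for run #5362
{code}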
Note that the previous build (4773) increased the number of reader and writer threads to 12; however, without increasing the shards those extra threads should be idle (there are only 4 Flusher tasks for them to execute).
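To illustrate that reasoning (an illustrative model, not ep-engine code): a writer thread can only be busy while it has a Flusher task to run, so the number of busy writer threads is bounded by the number of shards.

{code:python}
# Illustrative model (not ep-engine code): one Flusher task per shard, so
# busy writer threads are capped at min(threads, shards).
def busy_writer_threads(num_writer_threads: int, num_shards: int) -> int:
    return min(num_writer_threads, num_shards)

print(busy_writer_threads(12, 4))   # build 4773: 4 busy, 8 idle
print(busy_writer_threads(12, 12))  # build 4774: all 12 can be busy
{code}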
I don't think there's much value in trying toy builds with the following patches reverted:
- MB-36723: Set Writer threads to minimum priority - we know this reduces the impact of background writer threads on the frontend, and given 6.5.0-4908 has the same number of Writer threads as 6.0.3 (4 threads), I don't see why a lower priority would cause a regression here.
- MB-36249: Don't floor() write amplification stats - this only affects the output of cbstats; it doesn't change ep-engine performance at all.
It's possible, but unlikely, that the following patch has any negative impact:
- MB-36723: Optimize KVShard memory usage - it just packs the elements in the shard array tighter together to save memory.
I suggest first re-running 4908 with the same number of shards as we had in 6.0.3, i.e. 4: the additional splitting of the work into more tasks might be adding overhead. This can be done with a simple override value of bucket_extras.max_num_shards.4 (see the shard-mapping sketch below).
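For illustration, a minimal sketch of the vBucket-to-shard mapping this suggestion is about (1024 active vBuckets per bucket is the Couchbase default; the modulo mapping and the one-Flusher-per-shard pairing are assumptions of this model, shown for reasoning rather than taken from ep-engine source):

{code:python}
# Illustrative sketch: vBuckets are assumed to map to shards by
# vbid % num_shards, with one Flusher task per shard. More shards means
# more, smaller flush batches, which may add per-task scheduling overhead.
from collections import Counter

NUM_VBUCKETS = 1024  # Couchbase default (vBuckets per bucket)

def shard_sizes(num_shards: int) -> Counter:
    """Count how many vBuckets land on each shard."""
    return Counter(vbid % num_shards for vbid in range(NUM_VBUCKETS))

print(shard_sizes(4))   # 4 Flusher tasks of 256 vBuckets each (6.0.3 layout)
print(shard_sizes(12))  # 12 Flusher tasks of 85-86 vBuckets each (4774 onwards)
{code}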