Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Affects Version: 7.1.0
- Triage: Untriaged
- Operating System: Centos 64-bit
- 1
- Unknown
Description
There was a previous issue, https://issues.couchbase.com/browse/MB-49301, that was resolved as not a bug. More detail on the CPU throttle algorithm might help explain this result and how the feature is intended to work. When I set the throttle level to 0.95, query throughput drops by roughly 40%. This performance test normally pushes indexer CPU close to 100%, so if the throttle level is set to 0.95, why does CPU usage get throttled down to about 60% instead of to just below 95%? It seems a lot of CPU cycles are being wasted here.
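For reference, the throttle level in these runs is changed through the indexer's settings REST endpoint, typically exposed on port 9102 of the index node. The snippet below is only a sketch of how that might look; the setting key used here, "indexer.cpu.throttle.target", is an assumption and should be checked against the actual settings dump from the index node.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// Hypothetical setting key; 0.95 is the throttle level used in run 13027.
	body := []byte(`{"indexer.cpu.throttle.target": 0.95}`)

	// .45 is the index node in the throttled run.
	req, err := http.NewRequest("POST", "http://172.23.100.45:9102/settings", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.SetBasicAuth("Administrator", "password") // placeholder credentials
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("settings update:", resp.Status)
}
```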
Comparing these two tests on 7.1.0-1650:
http://showfast.sc.couchbase.com/#/timeline/Linux/n1ql/Q5_Q7/all
Avg. Query Throughput (queries/sec), CI6, Group By Query (1K matches), MOI, not_bounded, s=1 c=1 i=1
http://perf.jenkins.couchbase.com/job/iris-multi-client/12973/ - 15,280 queries/sec
Avg. Query Throughput (queries/sec), CI6, Group By Query (1K matches), MOI, not_bounded, Indexer CPU Throttle 0.95, s=1 c=1 i=1
http://perf.jenkins.couchbase.com/job/iris-multi-client/13027/ - 7,651 queries/sec
graph comparison: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=iris_710-1650_access_0436&snapshot=iris_710-1650_access_afdf
Indexer CPU drops from ~4800% to ~3000%.
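As a quick check on the numbers above (throughput from the two Jenkins runs, CPU from the cbmonitor comparison), the relative drops work out roughly as follows; this is just arithmetic on the figures already quoted:

```go
package main

import "fmt"

func main() {
	// Avg. query throughput from jobs 12973 (baseline) and 13027 (throttle 0.95).
	baselineQPS, throttledQPS := 15280.0, 7651.0
	// Indexer CPU from the cbmonitor comparison (percent across all cores).
	baselineCPU, throttledCPU := 4800.0, 3000.0

	fmt.Printf("throughput drop: %.0f%%\n", 100*(1-throttledQPS/baselineQPS))  // ~50%
	fmt.Printf("indexer CPU drop: %.0f%%\n", 100*(1-throttledCPU/baselineCPU)) // ~38%
}
```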
Logs from throttle run (.45 is the index node):
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-13027/172.23.100.45.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-13027/172.23.100.55.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-13027/172.23.100.70.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-13027/172.23.100.71.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-13027/172.23.100.72.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-13027/172.23.100.73.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-13027/172.23.100.9.zip
Is this really the intended behaviour? This seems like it could have negative side effects when upgrading a cluster if this is the default setting.
Korrigan Clark FYI: throttling can only react once every 1 second, because that is the frequency at which the sigar CPU statistics are updated. I will take a look at the logs, but most likely there is nothing wrong here; any performance test that intends to drive CPU usage above the CPU target is going to trigger heavy throttling, and given the inability to react faster than once per second, the outcome can be unexpected. Throttling backs off once actual CPU usage is <= the target. This has been heavily tested and is known to be working; however, if the workload changes at a higher rate than throttling's reaction frequency, such overshoots are entirely possible.
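To make the mechanism above concrete, here is a toy sketch (not the actual indexer code) of a once-per-second feedback loop of the kind described: usage is sampled at 1 Hz to match the sigar update rate, the throttle is raised while usage is above the target and relaxed once usage is at or below it, so every correction lands at least one second after the workload change that caused it. The gains and the simulated workload are made up purely for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
)

// sampleCPU stands in for a sigar-style CPU reading (fraction of the whole
// machine, 0.0-1.0+). The simulated workload always wants slightly more than
// 100% of the machine, and throttling removes a fraction of that work.
func sampleCPU(throttle float64) float64 {
	demand := 1.0 + 0.1*rand.Float64()
	return demand * (1.0 - throttle)
}

func main() {
	const target = 0.95 // throttle level used in the test above
	throttle := 0.0     // fraction of indexer work currently being held back

	// Each iteration represents one 1-second tick, the fastest rate at which
	// the controller can observe and react (the sigar update frequency).
	for tick := 0; tick < 10; tick++ {
		usage := sampleCPU(throttle)

		if usage > target {
			// Over target: tighten the throttle in proportion to the overshoot.
			throttle += 0.5 * (usage - target)
		} else {
			// At or below target: back off, but never below zero.
			throttle -= 0.1 * (target - usage)
			if throttle < 0 {
				throttle = 0
			}
		}
		fmt.Printf("t=%2ds usage=%.2f throttle=%.2f\n", tick, usage, throttle)
	}
}
```

Because the loop cannot see or correct anything between ticks, a workload that ramps faster than once per second forces large corrections, and usage can sit noticeably below the target until the controller relaxes again.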