MB-49501: Indexer CPU throttle cuts query throughput by 47%


Details

    • Untriaged
    • Centos 64-bit

    Description

      There was a previous issue, https://issues.couchbase.com/browse/MB-49301, which was resolved as not a bug. More details of the CPU throttle algorithm could help explain the result and how it is intended to work. When I set the throttle level to 0.95, query throughput drops by 40%. This performance test normally pushes indexer CPU close to 100%. If the throttle level is set to 0.95, why does CPU usage get throttled down to 60% instead of staying just below 95%? It seems there are a lot of wasted CPU cycles here.

      Comparing these two tests on 7.1.0-1650:

      http://showfast.sc.couchbase.com/#/timeline/Linux/n1ql/Q5_Q7/all

      Avg. Query Throughput (queries/sec), CI6, Group By Query (1K matches), MOI, not_bounded, s=1 c=1 i=1

      http://perf.jenkins.couchbase.com/job/iris-multi-client/12973/ - 15280

      Avg. Query Throughput (queries/sec), CI6, Group By Query (1K matches), MOI, not_bounded, Indexer CPU Throttle 0.95, s=1 c=1 i=1

      http://perf.jenkins.couchbase.com/job/iris-multi-client/13027/  - 7651

      graph comparison: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=iris_710-1650_access_0436&snapshot=iris_710-1650_access_afdf

      Indexer CPU drops from 4800% to 3000%:

      http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=iris_710-1650_access_0436&snapshot=iris_710-1650_access_afdf#0103718d11a8aa31d5c328c63302ff1d

       

      Logs from throttle run (.45 is the index node):

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-13027/172.23.100.45.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-13027/172.23.100.55.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-13027/172.23.100.70.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-13027/172.23.100.71.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-13027/172.23.100.72.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-13027/172.23.100.73.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-13027/172.23.100.9.zip

       

      Is this really the intended behaviour? This seems like it could have negative side effects when upgrading a cluster if this is the default setting.

       


          Activity

            kevin.cherkauer Kevin Cherkauer added a comment

            Korrigan Clark FYI throttling can only react once every 1 second, because that is the frequency at which the sigar CPU statistics are updated. I will take a look at the logs, but likely there is nothing wrong here, and any performance test that intends to drive CPU usage above the CPU target is going to trigger massive throttling. Given the inability to react faster than once per second, the outcomes of this can be unexpected. Throttling will back off once the actual CPU usage is <= the target. This has been heavily tested and is known to be working; however, if the workload changes faster than throttling's reaction frequency, such overshoots are entirely possible.

            korrigan.clark Korrigan Clark added a comment (edited)

            Kevin Cherkauer thanks for the analysis... this was really my concern: should this be enabled by default, especially given that it can cause such chaotic behavior? I have no doubt it's working as designed; the question is whether this behavior is desirable by default. There also seem to be alternative throttling approaches, or parameters that could be exposed, to make the throttling adjustable and more widely usable. Wouldn't a simple sleep every second of 1 - (throttle threshold) seconds guarantee that CPU utilization, on average, stays at or below the throttle threshold? In the case of this test, sleeping for 1 - 0.95 = 0.05 seconds means that even if CPU is at 100% outside the sleep time, the average over the 1-second interval would be 95% utilization. The problem then seems to be that the granularity of the throttling is too large. In perfrunner we have a throttling mechanism that works per thread and doesn't involve checking CPU utilization; rather it throttles based on requests per second, and a small sleep is applied after each request if needed. For us this mechanism resulted in much smoother throttling, and the overall request rate did not show any chaotic behavior. Previously we had a throttling mechanism that was invoked only after a batch of requests completed, with a larger sleep applied, and that resulted in chaotic request rates, spiking up then down, similar to what we see in the test for this ticket. Just some thoughts, but I would really like to see this throttle off by default if possible.

            From what I can tell, the main destabilizing issue is that the feedback loop is exponential instead of linear, and is not smoothly turned off. Reducing the throttle sleep to 0 seconds immediately after a large sleep obviously means the CPU will spike back up to the level seen a few seconds prior, since the workload probably won't change much over that interval. A gradual backoff seems like it would help as well, e.g. cutting the sleep in half each second after initially invoking a throttle (a sketch of this shape follows below).
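            A rough sketch of that gradual-backoff idea, purely illustrative: the function name and gain constant below are made up for this comment, not the indexer's actual code.

            package main

            import "fmt"

            // decayDelay grows the throttle delay proportionally to the overshoot while
            // CPU is above target, but halves it each second once CPU is back under
            // target, instead of dropping it straight to zero.
            func decayDelay(delayMs int, currCpu, targetCpu float64) int {
                const gainMsPerCpuUnit = 2000 // assumed gain: ms of delay per unit of CPU error
                if currCpu > targetCpu {
                    return delayMs + int((currCpu-targetCpu)*gainMsPerCpuUnit)
                }
                return delayMs / 2 // back off gradually rather than resetting to 0
            }

            func main() {
                // Made-up sequence of 1-second CPU readings around a 0.95 target.
                delay := 0
                for _, cpu := range []float64{0.98, 0.99, 0.60, 0.62, 0.97} {
                    delay = decayDelay(delay, cpu, 0.95)
                    fmt.Printf("cpu=%.2f delayMs=%d\n", cpu, delay)
                }
            }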


            kevin.cherkauer Kevin Cherkauer added a comment

            Korrigan Clark Thank you for the additional throttling ideas. FYI the current algorithm is not exponential: both throttling and backoff are linearly proportional to the distance between actual and target CPU, and the maximum adjustment amount and maximum sleep amount are both capped (at +/-1,000 ms and 10,000 ms, respectively). It is basically a PI controller (a subset of PID controllers, https://en.wikipedia.org/wiki/PID_controller).

            The fundamental problem is that there is far too long a time between CPU stats updates for this approach to work safely (a feedback loop with a long delay lends itself to pathological system behaviors), and if the adjustment and maximum caps are reduced enough to make it mostly safe, it won't be able to react fast enough to be effective. The team is currently debating whether to turn it off by default, which would be my preference.

            "Wouldn't a simple sleep every second for 1-(throttle threshold) seconds guarantee CPU utilization, on average, stays at or below the throttle threshold?"

            This would not work because the Index service is not a single thread.

            "In perfrunner we have a throttling mechanism that works per thread and doesn't involve checking cpu utilization - rather it throttles based on requests per second"

            There was some discussion originally of throttling by gating the number of concurrent scans and snapshots. This road was not taken because the information about these is not centralized, so it would require significantly more development work to implement.
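            For context, here is a minimal sketch of a capped, linearly proportional adjustment running on a 1-second stats cadence. The gain constant and the canned CPU readings are assumptions for illustration; the constant names only mirror the ones discussed in this ticket, and this is not the actual cpu_throttle.go implementation.

            package main

            import (
                "fmt"
                "time"
            )

            const (
                MAX_THROTTLE_ADJUST_MS = 1000  // cap on the change made per adjustment (original value)
                MAX_THROTTLE_DELAY_MS  = 10000 // cap on the total injected sleep (original value)
                gainMsPerCpuUnit       = 2000  // assumed proportional gain: ms of delay per unit of CPU error
            )

            func clampInt(v, lo, hi int) int {
                if v < lo {
                    return lo
                }
                if v > hi {
                    return hi
                }
                return v
            }

            // adjustThrottleDelay applies one proportional correction: the step is linear in
            // (currCpu - cpuTarget), the step itself is capped, and the resulting delay is
            // kept between 0 and MAX_THROTTLE_DELAY_MS.
            func adjustThrottleDelay(delayMs int, currCpu, cpuTarget float64) int {
                step := int((currCpu - cpuTarget) * gainMsPerCpuUnit)
                step = clampInt(step, -MAX_THROTTLE_ADJUST_MS, MAX_THROTTLE_ADJUST_MS)
                return clampInt(delayMs+step, 0, MAX_THROTTLE_DELAY_MS)
            }

            func main() {
                // Canned CPU readings stand in for the once-per-second sigar statistics.
                readings := []float64{0.99, 0.98, 0.61, 0.60, 0.97}
                delay := 0
                ticker := time.NewTicker(time.Second) // stats only refresh about once per second
                defer ticker.Stop()
                for _, cpu := range readings {
                    <-ticker.C
                    delay = adjustThrottleDelay(delay, cpu, 0.95)
                    fmt.Printf("currCpu=%.2f throttleDelayMs=%d\n", cpu, delay)
                }
            }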


            kevin.cherkauer Kevin Cherkauer added a comment

            Opened MB-49662 to work on fixes for CPU throttling. The current bug is only one aspect of that effort, which is still under discussion within the Index team. Thus I am resolving this one as a DUP so we don't keep two Neo bugs open for this topic.


            kevin.cherkauer Kevin Cherkauer added a comment

            I originally was confused and thought this MB was not assigned to GSI, so I opened MB-49662 to handle the code changes. Later I found they were both in GSI and DUPed this one to the other, but this one has all the detailed info in it, so I am reopening it and DUPing the other one.

            kevin.cherkauer Kevin Cherkauer added a comment (edited)

            Part 1 of this fix (done under DUP MB-49662) is to reduce two cpu_throttle.go constants, as the originals are much too large:

            • MAX_THROTTLE_ADJUST_MS from 1,000 to 100
            • MAX_THROTTLE_DELAY_MS from 10,000 to 100

            These proved still too large: the throttling still oscillates, just over a narrower range.

            Bo-Chun Wang did new runs at 1.00 and 0.95 CPU targets (FYI Korrigan Clark):

            • 1.00 does not do any throttling. No performance hit. (Link: http://perf.jenkins.couchbase.com/job/iris-multi-client/13289/console)
            • 0.95 attempts to keep CPU usage at or below 95%. Performance degradation of 29.5% (vs. 47.4% prior to the above changes). (Link: http://perf.jenkins.couchbase.com/job/iris-multi-client/13290/console)

            These runs had Autofailover set to 5 seconds, which Bo-Chun tells me is how they run all the performance tests.

            7812:2021-11-22T17:14:43.305-08:00 [Info] CpuThrottle::adjustThrottleDelay: Adjusted throttle. cpuTarget: 0.95, currCpu: 0.9822990420658059, throttleDelayMs (new, old, change): (65, 0, 65)
            7813:2021-11-22T17:14:44.305-08:00 [Info] CpuThrottle::adjustThrottleDelay: Adjusted throttle. cpuTarget: 0.95, currCpu: 0.6025667779632721, throttleDelayMs (new, old, change): (0, 65, -65)
            7814:2021-11-22T17:14:46.306-08:00 [Info] CpuThrottle::adjustThrottleDelay: Adjusted throttle. cpuTarget: 0.95, currCpu: 0.9715743440233237, throttleDelayMs (new, old, change): (43, 0, 43)
            7815:2021-11-22T17:14:47.305-08:00 [Info] CpuThrottle::adjustThrottleDelay: Adjusted throttle. cpuTarget: 0.95, currCpu: 0.609486000835771, throttleDelayMs (new, old, change): (0, 43, -43)
            7816:2021-11-22T17:14:49.307-08:00 [Info] CpuThrottle::adjustThrottleDelay: Adjusted throttle. cpuTarget: 0.95, currCpu: 0.9854136278391331, throttleDelayMs (new, old, change): (71, 0, 71)
            7817:2021-11-22T17:14:50.305-08:00 [Info] CpuThrottle::adjustThrottleDelay: Adjusted throttle. cpuTarget: 0.95, currCpu: 0.5872335979941496, throttleDelayMs (new, old, change): (0, 71, -71)
            7818:2021-11-22T17:14:52.305-08:00 [Info] CpuThrottle::adjustThrottleDelay: Adjusted throttle. cpuTarget: 0.95, currCpu: 0.9614422676115048, throttleDelayMs (new, old, change): (23, 0, 23)
            7827:2021-11-22T17:14:53.305-08:00 [Info] CpuThrottle::adjustThrottleDelay: Adjusted throttle. cpuTarget: 0.95, currCpu: 0.5891254435399708, throttleDelayMs (new, old, change): (0, 23, -23)
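            As a rough illustration only: the gain below is inferred from the log lines above (an overshoot of about 0.032 maps to 65 ms), and the load model is invented, but a toy loop like this reproduces the same sawtooth between roughly 98% and 60% CPU.

            package main

            import "fmt"

            // Toy model only: "cpu" is what the workload would use, reduced by the injected
            // delay; the controller is the same capped proportional rule, with the Part 1
            // caps of 100 ms. The gain and the load model are assumptions, not indexer code.
            func main() {
                const (
                    demand = 0.98   // CPU fraction the workload wants
                    target = 0.95   // cpuTarget
                    gain   = 2000.0 // assumed ms of delay per unit of CPU error
                    maxAdj = 100.0  // MAX_THROTTLE_ADJUST_MS after Part 1
                    maxDel = 100.0  // MAX_THROTTLE_DELAY_MS after Part 1
                )
                clamp := func(v, lo, hi float64) float64 {
                    if v < lo {
                        return lo
                    }
                    if v > hi {
                        return hi
                    }
                    return v
                }
                delay := 0.0
                for tick := 0; tick < 6; tick++ {
                    cpu := demand * 1000.0 / (1000.0 + 10.0*delay) // crude effect of the delay on CPU usage
                    step := clamp((cpu-target)*gain, -maxAdj, maxAdj)
                    delay = clamp(delay+step, 0, maxDel)
                    fmt.Printf("tick=%d observedCpu=%.2f newDelayMs=%.0f\n", tick, cpu, delay)
                }
            }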
            

            There is also no evidence from either of these runs that throttling is needed.

            1. IsSafe() API was never called in either run. This is the API ns_server calls before initiating an Autofailover. Thus no Autofailover was ever attempted.

            2. HealthCheck() never took a long time to respond. This is the "heartbeat" API called by ns_server to check whether Index service is healthy to decide whether an Autofailover attempt should be made. Index logs any calls that take longer than 1 second from entry to return (a timing sketch follows below), but none were logged. This means no heartbeats were missed.

            3. CPU is saturated on Index node (172.23.100.45) during the entire test when throttling is off, yet no heartbeats missed: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=iris_710-1759_access_f1a6#683b767e810c43cf6ec58cc3afd78a4a
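            For reference, the "slow call" logging mentioned in point 2 amounts to timing each call from entry to return and logging anything above a threshold. A minimal sketch with hypothetical names, not the actual Index service code:

            package main

            import (
                "log"
                "time"
            )

            // logIfSlow wraps a handler and logs any call whose entry-to-return time
            // exceeds the threshold (1 second in the runs above).
            func logIfSlow(name string, threshold time.Duration, handler func()) {
                start := time.Now()
                handler()
                if elapsed := time.Since(start); elapsed > threshold {
                    log.Printf("%s slow call: took %v", name, elapsed)
                }
            }

            func main() {
                logIfSlow("HealthCheck", time.Second, func() {
                    time.Sleep(1200 * time.Millisecond) // simulate a slow heartbeat response
                })
            }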


            kevin.cherkauer Kevin Cherkauer added a comment

            Part 2 of this fix:

            Further reduce
            1. MAX_THROTTLE_ADJUST_MS from 100 to 5
            2. MAX_THROTTLE_DELAY_MS from 100 to 10

            Change default indexer.cpu.throttle.target from 0.95 to 0.98.
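            For anyone who wants a non-default value on their own cluster, a setting like this can typically be overridden per index node. The sketch below assumes the indexer's settings REST endpoint on port 9102 and admin credentials; the endpoint, port, path, and credentials are assumptions, and only the setting name comes from this ticket.

            package main

            import (
                "bytes"
                "fmt"
                "net/http"
            )

            func main() {
                // Assumed endpoint: the indexer's internal settings API on its admin port (9102).
                body := bytes.NewBufferString(`{"indexer.cpu.throttle.target": 0.98}`)
                req, err := http.NewRequest(http.MethodPost, "http://127.0.0.1:9102/settings", body)
                if err != nil {
                    panic(err)
                }
                req.SetBasicAuth("Administrator", "password") // cluster admin credentials (placeholder)
                req.Header.Set("Content-Type", "application/json")
                resp, err := http.DefaultClient.Do(req)
                if err != nil {
                    panic(err)
                }
                defer resp.Body.Close()
                fmt.Println("settings update status:", resp.Status)
            }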


            kevin.cherkauer Kevin Cherkauer added a comment

            Part 3 of this fix:

            If cpuTarget == 1.00, shut the CPU throttling goroutine down. Previously it stayed running, collected new CPU stats, and did the full throttle-adjustment calculation every 1 second, but never adjusted the throttle anyway, so this was all wasted work.
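            A minimal sketch of the shape of that change, with illustrative names rather than the actual indexer code: when the target is 1.00 the adjustment goroutine is simply never started.

            package main

            import "time"

            // startCpuThrottle starts the once-per-second adjustment loop only when a real
            // target is configured; with cpuTarget >= 1.00 there is nothing to throttle, so
            // no stats collection or adjustment work is done at all.
            func startCpuThrottle(cpuTarget float64, stop <-chan struct{}) {
                if cpuTarget >= 1.0 {
                    return // throttling disabled: skip the goroutine entirely
                }
                go func() {
                    ticker := time.NewTicker(time.Second)
                    defer ticker.Stop()
                    for {
                        select {
                        case <-ticker.C:
                            // read CPU stats and adjust the throttle delay here
                        case <-stop:
                            return
                        }
                    }
                }()
            }

            func main() {
                stop := make(chan struct{})
                startCpuThrottle(1.00, stop) // no goroutine is started in this case
                startCpuThrottle(0.98, stop) // adjustment loop runs until stop is closed
                time.Sleep(3 * time.Second)
                close(stop)
            }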


            build-team Couchbase Build Team added a comment

            Build couchbase-server-7.1.0-1779 contains indexing commit fa06f68 with commit message:
            MB-49501 Part 3 (7.1.0 1695): Autofailover: Make CPU throttling safer


            build-team Couchbase Build Team added a comment

            Build couchbase-server-7.1.0-1779 contains indexing commit e8c2cc3 with commit message:
            MB-49501 Part 2 (7.1.0 1695): Autofailover: Make CPU throttling safer

            kevin.cherkauer Kevin Cherkauer added a comment (edited)

            Bo-Chun Wang agreed to run the test again with the latest tweaks from Part 2 and Part 3 on a Neo build >= 1779, where these changes first appear, at three different cpuTarget settings for throttling:

            1. 1.00 – no throttling (also CPU stats and throttling computation goroutine will no longer be running in this case)
            2. 0.98 – new default
            3. 0.95 – old default

            Summarizing the caps on throttle adjustment and total throttling (per-adjustment cap, total-delay cap), these were:

            1. (1,000 ms, 10,000 ms) – original runs by Korrigan Clark
            2. (100 ms, 100 ms) – first runs by Bo-Chun
            3. (5 ms, 10 ms) – this coming second set of runs by Bo-Chun (which also adds 0.98 as a cpuTarget option)
            bo-chun.wang Bo-Chun Wang added a comment -

            1.00 – no throttling

            http://perf.jenkins.couchbase.com/job/iris-multi-client/13347/ 

            query throughput: 15772.0

            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-13347/172.23.100.45.zip

            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-13347/172.23.100.55.zip

            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-13347/172.23.100.70.zip

            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-13347/172.23.100.71.zip

            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-13347/172.23.100.72.zip

            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-13347/172.23.100.73.zip

            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-13347/172.23.100.9.zip

             

            0.98 – new default

            http://perf.jenkins.couchbase.com/job/iris-multi-client/13348/

            query throughput: 15649.0

            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-13348/172.23.100.45.zip

            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-13348/172.23.100.55.zip

            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-13348/172.23.100.70.zip

            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-13348/172.23.100.71.zip

            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-13348/172.23.100.72.zip

            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-13348/172.23.100.73.zip

            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-13348/172.23.100.9.zip

             

            0.95 – old default

            http://perf.jenkins.couchbase.com/job/iris-multi-client/13346/

            query throughput: 13216.0

            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-13346/172.23.100.45.zip

            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-13346/172.23.100.55.zip

            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-13346/172.23.100.70.zip

            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-13346/172.23.100.71.zip

            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-13346/172.23.100.72.zip

            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-13346/172.23.100.73.zip

            https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-13346/172.23.100.9.zip
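            For rough comparison, computed from the throughputs above: relative to the 1.00 (no throttling) run, the 0.98 run is about 0.8% lower (15649 / 15772 ≈ 0.992) and the 0.95 run is about 16% lower (13216 / 15772 ≈ 0.838), versus the roughly 50% drop seen in the original 0.95 run with the old caps (7651 vs. 15280).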

            kevin.cherkauer Kevin Cherkauer added a comment (edited)

            Part 4 (merged to unstable, awaiting merge to master) sets the CPU target to 1.00, which disables throttling. (The most recent previous default was 0.98, which does throttle.) I discussed this with Deepkaran Salooja. This change will allow a cycle of performance test runs with throttling off, which will tell us two things:

            1. Do any of them get false Autofailovers? (All perf tests are run with the Autofailover threshold set to 5 seconds, which is the lowest possible.)
            2. How much is the performance impact of the toned down throttling?

            My experiments on the desert cluster indicate there needs to be somewhere around 100x as many CPU-bound goroutines as cores to push the Go scheduler period up to 2 seconds, the point at which Autofailover IsSafe heartbeats will start getting missed. New info from Deep today is that Query already restricts the number of queries to 4 per core (or 8/core if they are request_plus consistency), so these should be far below the danger limit. The other major CPU user is flushers for snapshot creation, which can create a lot of goroutines, but they run for only a very short time because they only have a small number of ms worth of mutations to flush.

            Thus it is possible we do not really need throttling to avoid false Autofailovers and won't need to turn this back on, but the feature will still be present for the field ("In case of emergency, break glass"). We also do not have evidence that the kind of throttling we have will help in very many cases.
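            A rough, illustrative version of the desert-cluster experiment mentioned above is below. The 100x figure and the ~2-second danger point come from the comment; everything in the code (the goroutine count, using a 1-second sleep as a stand-in for a heartbeat interval) is an assumption for illustration.

            package main

            import (
                "fmt"
                "runtime"
                "time"
            )

            func main() {
                perCore := 100 // try 1, 10, 100 and watch the lateness grow
                n := runtime.NumCPU() * perCore
                for i := 0; i < n; i++ {
                    go func() {
                        x := 0
                        for { // CPU-bound loop with no blocking calls
                            x++
                            if x == 1<<24 {
                                x = 0
                            }
                        }
                    }()
                }
                for i := 0; i < 10; i++ {
                    start := time.Now()
                    time.Sleep(time.Second) // stand-in for a 1 s heartbeat interval
                    late := time.Since(start) - time.Second
                    fmt.Printf("tick %d fired %v late\n", i, late)
                }
            }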


            build-team Couchbase Build Team added a comment

            Build couchbase-server-7.1.0-1867 contains indexing commit 58027c0 with commit message:
            MB-49501 Part 4 (7.1.0 1861): Temp disable CPU throttling for perf tests


            build-team Couchbase Build Team added a comment

            Build couchbase-server-7.1.0-1883 contains indexing commit 9cc3bff with commit message:
            MB-49501 Part 5 (7.1.0 1861): Log long heartbeats; reduce dump frequency


            kevin.cherkauer Kevin Cherkauer added a comment

            Resolving this now because:

            1. Throttling parameters changed so throttling is much lighter.
            2. Throttling is now disabled by default.
            3. If we reenable it, it will be at CPU target 0.98 (98%), which with current parameters had only a small (~0.5%) performance impact on the test for which this MB was opened.
            korrigan.clark Korrigan Clark added a comment

            http://perf.jenkins.couchbase.com/job/iris-multi-client/14318/

            build-team Couchbase Build Team added a comment

            Build couchbase-server-7.1.0-2049 contains indexing commit c9aeea0 with commit message:
            MB-49501 Part 6 (7.1.0 2037): Shorten HealthCheck "Slow call" threshold


            People

              Assignee: Korrigan Clark
              Reporter: Korrigan Clark
              Votes: 0
              Watchers: 10
