Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-49301

Index monitor slowing down index scans

    XMLWordPrintable

Details

    • Untriaged
    • Centos 64-bit
    • 1
    • Unknown

    Description

      Avg. Query Throughput (queries/sec), CI19, Composite Filters (1000 -> 1), Covered, MOI, not_bounded

      1593 - https://perf.jenkins.couchbase.com/job/iris-multi-client/12799/ - 59501.0
      1594 - https://perf.jenkins.couchbase.com/job/iris-multi-client/12802/ - 37752.0

      http://changelog.build.couchbase.com/?product=couchbase-server&fromVersion=7.1.0&fromBuild=1593&toVersion=7.1.0&toBuild=1594&f_goxdcr=on&f_ns_server=on&f_product-metadata=on

      http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=iris_710-1593_access_4fff&snapshot=iris_710-1594_access_5f5e

      http://showfast.sc.couchbase.com/#/timeline/Linux/n1ql/API2/MOI

      Many tests that push indexer cpu utilization to the max have regressed in 1594. It seems the only related commits are from the introduction of index monitor and indexer autofailover. It looks like default timeout for the monitor request is 2 seconds which may be too low for heavy indexer usage. In the logs you can see lots of these monitor requests failing

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-12802/172.23.100.45.zip <---indexer
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-12802/172.23.100.55.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-12802/172.23.100.70.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-12802/172.23.100.71.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-12802/172.23.100.72.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-12802/172.23.100.73.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-multi-client-12802/172.23.100.9.zip

      If there is a way to increase the timeout, I can try that to see if there is less slow down. Also would like to disable indexer autofailover if possible. Can you provide the endpoints and requests to do so?

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            Korrigan Clark I will take a look.

            Request from GSI team: Can you please also try a run with:

            • Autofailover on and set to failover after 5 seconds
            • cpuTarget set to 1.00

            and report whether this run triggered Autofailover to happen. This is to get a data point as to whether CPU throttling is actually needed to prevent false failovers in this case, where the index nodes are heavily loaded. Thank you!

            kevin.cherkauer Kevin Cherkauer added a comment - Korrigan Clark I will take a look. Request from GSI team: Can you please also try a run with: Autofailover on and set to failover after 5 seconds cpuTarget set to 1.00 and report whether this run triggered Autofailover to happen. This is to get a data point as to whether CPU throttling is actually needed to prevent false failovers in this case, where the index nodes are heavily loaded. Thank you!

            I just realized i set it to 100 not 1.00... this may be why... let me verify and get you the other requested data...

            korrigan.clark Korrigan Clark added a comment - I just realized i set it to 100 not 1.00... this may be why... let me verify and get you the other requested data...
            kevin.cherkauer Kevin Cherkauer added a comment - - edited

            Korrigan Clark I was going to check for that in the logs. It will ignore any value < 0.50 or > 1.00, so setting to 100 does not change it and the run would have been with the default of 0.95 in that case.

            You can grep in indexer.log for "CpuThrottle::" messages like

            2021-11-02T08:31:10.880-07:00 [Info] CpuThrottle::runThrottling: Starting. cpuTarget: 0.95, throttleDelayMs: 0
             
            2021-11-02T08:36:38.881-07:00 [Info] CpuThrottle::adjustThrottleDelay: Adjusted throttle. cpuTarget: 0.95, currCpu: 0.9653052719316524, throttleDelayMs (new, old, change): (306, 0, 306)
            

            The first would be "cpuTarget: 1" if correctly set (Go will drop insignificant 0's after the decimal point), [Correction: Actually it will always come up with 0.95 *unless* you set it to 1.00 before ns_server makes it's first HealthCheck() call (HealthCheck() is what launches CPU throttling), and changes to cpuTarget are not currently logged, though I am planning to log them in a future delivery. *Or* you can set cpuTarget to 1.00 and turn Autofailover off for at least 10 seconds, which will shut down CPU throttling, then turn Autofailover back on, and you should see the first message logged after the first HealthCheck() call showing cpuTarget: 1] and there should be no "adjustThrottleDelay" messages after that point because the throttle would never be adjusted in this case. [This second is the more important thing to look for – adjustThrottleDelay should not log any adjustments if cpuTarget == 1.00. If cpuTarget is something different then it will log adjustments if CPU usage gets above the target. Those messages also include the current cpuTarget value.]

            kevin.cherkauer Kevin Cherkauer added a comment - - edited Korrigan Clark I was going to check for that in the logs. It will ignore any value < 0.50 or > 1.00, so setting to 100 does not change it and the run would have been with the default of 0.95 in that case. You can grep in indexer.log for "CpuThrottle::" messages like 2021-11-02T08:31:10.880-07:00 [Info] CpuThrottle::runThrottling: Starting. cpuTarget: 0.95, throttleDelayMs: 0   2021-11-02T08:36:38.881-07:00 [Info] CpuThrottle::adjustThrottleDelay: Adjusted throttle. cpuTarget: 0.95, currCpu: 0.9653052719316524, throttleDelayMs (new, old, change): (306, 0, 306) The first would be "cpuTarget: 1" if correctly set (Go will drop insignificant 0's after the decimal point), [Correction: Actually it will always come up with 0.95 *unless* you set it to 1.00 before ns_server makes it's first HealthCheck() call (HealthCheck() is what launches CPU throttling), and changes to cpuTarget are not currently logged, though I am planning to log them in a future delivery. *Or* you can set cpuTarget to 1.00 and turn Autofailover off for at least 10 seconds, which will shut down CPU throttling, then turn Autofailover back on, and you should see the first message logged after the first HealthCheck() call showing cpuTarget: 1] and there should be no "adjustThrottleDelay" messages after that point because the throttle would never be adjusted in this case. [This second is the more important thing to look for – adjustThrottleDelay should not log any adjustments if cpuTarget == 1.00. If cpuTarget is something different then it will log adjustments if CPU usage gets above the target. Those messages also include the current cpuTarget value.]

            Korrigan Clark What am I supposed to be looking at in these links:

            1590 - http://perf.jenkins.couchbase.com/job/arke-dgm/1106/ - 18.2

            1626 with autofailover on and cpu throttle 100 - http://perf.jenkins.couchbase.com/job/arke-dgm/1104/ - 23.9 

            1626 with autofailover off and cpu throttle 100 - http://perf.jenkins.couchbase.com/job/arke-dgm/1111/ - 18.2

             

            They are just a bunch of build logs to me. Is there performance test run data visible somewhere under these?

            kevin.cherkauer Kevin Cherkauer added a comment - Korrigan Clark What am I supposed to be looking at in these links: 1590 - http://perf.jenkins.couchbase.com/job/arke-dgm/1106/  - 18.2 1626 with autofailover on and cpu throttle 100 - http://perf.jenkins.couchbase.com/job/arke-dgm/1104/ - 23.9   1626 with autofailover off and cpu throttle 100 - http://perf.jenkins.couchbase.com/job/arke-dgm/1111/  - 18.2   They are just a bunch of build logs to me. Is there performance test run data visible somewhere under these?

            Since the new run with Autofailover enabled had invalid "indexer.cpu.throttle.target" value passed to it of 100 instead of 1.00, the run would actually run with the default setting of 0.95 and thus is the same as the original run that found slowness. Re-closing as not a bug.

            kevin.cherkauer Kevin Cherkauer added a comment - Since the new run with Autofailover enabled had invalid "indexer.cpu.throttle.target" value passed to it of 100 instead of 1.00, the run would actually run with the default setting of 0.95 and thus is the same as the original run that found slowness. Re-closing as not a bug.

            People

              kevin.cherkauer Kevin Cherkauer
              korrigan.clark Korrigan Clark
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty