  1. Couchbase Server
  2. MB-36515

Throughput decrease for plasma DGM Q2 singleton lookup


Details

    • Untriaged
    • Yes

    Description

      We are seeing a ~20-30% decrease in throughput for the Q2 singleton lookup test with Plasma and DGM. Five secondary index commits landed in build 4488 that look like they could be causing the issue:

      http://172.23.123.43:8000/getchangelog?product=couchbase-server&fromb=6.5.0-4487&tob=6.5.0-4488

       

      4558 - http://perf.jenkins.couchbase.com/job/iris/16676/ - 15798.0

      4558 - http://perf.jenkins.couchbase.com/job/iris/16677/ - 17045.0

      4537 - http://perf.jenkins.couchbase.com/job/iris/16682/ - 17338.0

      4515 - http://perf.jenkins.couchbase.com/job/iris/16683/ - 15296.0

      4493 - http://perf.jenkins.couchbase.com/job/iris/16684/ - 13245.0

      4492 - http://perf.jenkins.couchbase.com/job/iris/16691/ - 14440.0

      4491 - http://perf.jenkins.couchbase.com/job/iris/16690/ - 15242.0

      4490 - http://perf.jenkins.couchbase.com/job/iris/16689/ - 15059.0

      4489 - http://perf.jenkins.couchbase.com/job/iris/16692/ - 15841.0

      4488 - http://perf.jenkins.couchbase.com/job/iris/16693/ - 13940.0

      4487 - http://perf.jenkins.couchbase.com/job/iris/16687/ - 19802.0

      4485 - http://perf.jenkins.couchbase.com/job/iris/16688/ - 20905.0

      4482 - http://perf.jenkins.couchbase.com/job/iris/16686/ - 20909.0

      4471 - http://perf.jenkins.couchbase.com/job/iris/16685/ - 20731.0
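To make the regression point explicit, the per-build numbers above can be compared run to run. The following is a minimal sketch, not part of the perf tooling: the `runs` list is copied by hand from the list above (oldest build first, both 4558 runs kept), and the helper names `percent_change` and `largest_drop` are hypothetical.

```python
# Throughput per build, copied from the ticket (oldest first).
# Build 4558 was run twice; both results are kept as separate entries.
runs = [
    (4471, 20731.0), (4482, 20909.0), (4485, 20905.0), (4487, 19802.0),
    (4488, 13940.0), (4489, 15841.0), (4490, 15059.0), (4491, 15242.0),
    (4492, 14440.0), (4493, 13245.0), (4515, 15296.0), (4537, 17338.0),
    (4558, 17045.0), (4558, 15798.0),
]


def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


def largest_drop(runs):
    """Return (prev_build, build, pct) for the steepest decrease
    between two consecutive entries of `runs`."""
    steps = [
        (b0, b1, percent_change(t0, t1))
        for (b0, t0), (b1, t1) in zip(runs, runs[1:])
    ]
    return min(steps, key=lambda step: step[2])


prev_build, build, pct = largest_drop(runs)
print(f"{prev_build} -> {build}: {pct:.1f}%")
```

With these numbers the steepest step is 4487 to 4488, a drop of about 29.6%, which matches the build range named in the description.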

       

      Builds:

      4488 - http://perf.jenkins.couchbase.com/job/iris/16693/ - 13940.0

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-16693/172.23.100.45.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-16693/172.23.100.55.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-16693/172.23.100.70.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-16693/172.23.100.71.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-16693/172.23.100.72.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-16693/172.23.100.73.zip

      4487 - http://perf.jenkins.couchbase.com/job/iris/16687/ - 19802.0

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-16687/172.23.100.45.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-16687/172.23.100.55.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-16687/172.23.100.70.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-16687/172.23.100.71.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-16687/172.23.100.72.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-iris-16687/172.23.100.73.zip

       

      Graphs:

      4488: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=iris_650-4488_access_c0a4

      4487: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=iris_650-4487_access_ac04

      Attachments

        1. DGM.png (54 kB)
        2. NonDGM.png (55 kB)

        Activity

          korrigan.clark Korrigan Clark (Inactive) added a comment -

          Yes, this is request_plus, John Liang.

          prathibha Prathibha Bisarahalli (Inactive) added a comment -

          There seems to be a clear difference in throughput from build 4487 to 4488. Of the 4 commits that went into 4488, one of the two commits below could be causing the regression:

          https://github.com/couchbase/indexing/commit/6f76529b64d5f154f27c3c38b291abccaae57e34 (most likely the cause)

          https://github.com/couchbase/indexing/commit/4b27ae4f0db424494381651f6c2756972861bbfe

          Requesting Deep to take a look at the above commits.
          deepkaran.salooja Deepkaran Salooja added a comment - - edited

          The Q2 singleton lookup request_plus test is run for both Plasma DGM and Non-DGM. Only the DGM test shows lower throughput.

          DGM: graph, see attachment DGM.png

          Non-DGM: graph, see attachment NonDGM.png

          We can get more information about the scan request init latencies from the stats, but the test seems to be rebooting all the nodes before doing cbcollect, losing that information.

          11:41:37 2019-10-06T11:41:37 [INFO] Rebooting the node
          11:41:37 2019-10-06T11:41:37 [INFO] Waiting for all servers to be available
          11:41:49 2019-10-06T11:41:49 [INFO] Waiting for all servers to be available
          11:43:00 2019-10-06T11:43:00 [INFO] Waiting for all servers to be available
          11:46:32 2019-10-06T11:46:32 [INFO] Running cbcollect_info with redaction
          11:46:32 2019-10-06T11:46:32 [INFO] Running cbcollect_info with redaction
          11:46:32 2019-10-06T11:46:32 [INFO] Running cbcollect_info with redaction
          

          Korrigan Clark, would you be able to turn off the reboot before cbcollect so we have more stats to look at?
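The ordering problem described above can be checked mechanically against a job's console output. This is only a sketch under assumptions: the marker strings are taken from the log excerpt in this comment, the function name `reboot_precedes_collect` is hypothetical, and the real perf framework may log these steps differently in other jobs.

```python
# Console excerpt copied from the comment above.
LOG = """\
11:41:37 2019-10-06T11:41:37 [INFO] Rebooting the node
11:41:37 2019-10-06T11:41:37 [INFO] Waiting for all servers to be available
11:41:49 2019-10-06T11:41:49 [INFO] Waiting for all servers to be available
11:43:00 2019-10-06T11:43:00 [INFO] Waiting for all servers to be available
11:46:32 2019-10-06T11:46:32 [INFO] Running cbcollect_info with redaction
"""


def reboot_precedes_collect(log_text):
    """Return True if a reboot line appears before the first
    cbcollect_info line, i.e. in-memory stats were reset before collection."""
    reboot_seen = False
    for line in log_text.splitlines():
        if "Rebooting the node" in line:
            reboot_seen = True
        elif "Running cbcollect_info" in line:
            return reboot_seen
    return False  # no cbcollect_info line found at all


print(reboot_precedes_collect(LOG))
```

For the excerpt above the check returns True, which is the situation that loses the scan request init latency stats.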

          lynn.straus Lynn Straus added a comment -

          Added due date field (preset to Nov 15). Please update the due date to the current ETA for a fix. Thanks.


          deepkaran.salooja Deepkaran Salooja added a comment -

          The root cause looks the same as MB-36074.

          The resident ratio has come down because of a higher purge ratio. The indexer is keeping more disk snapshots, which leads to this.

          4487

          "resident_ratio":       0.87829,
          "mvcc_purge_ratio":     1.73306,
          "memory_size":          16434416727,
          "memory_size_index":    72677203,
          

          4488

          "resident_ratio":       0.57663,
          "mvcc_purge_ratio":     2.13168,
          "memory_size":          13277885866,
          "memory_size_index":    91204157,
          
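The size of the shift between the two builds is easier to see as relative changes. The sketch below only does arithmetic on the four stats quoted in this comment; the dictionary layout and the helper name `relative_change` are illustrative assumptions, not the format in which the indexer exposes its stats.

```python
# Stats quoted in the comment above, keyed by build number.
stats = {
    4487: {
        "resident_ratio": 0.87829,
        "mvcc_purge_ratio": 1.73306,
        "memory_size": 16434416727,
        "memory_size_index": 72677203,
    },
    4488: {
        "resident_ratio": 0.57663,
        "mvcc_purge_ratio": 2.13168,
        "memory_size": 13277885866,
        "memory_size_index": 91204157,
    },
}


def relative_change(old, new):
    """Percent change of every stat in `old` relative to its value in `new`'s
    predecessor, i.e. (new - old) / old * 100 per key."""
    return {name: (new[name] - old[name]) / old[name] * 100.0 for name in old}


changes = relative_change(stats[4487], stats[4488])
for name, pct in changes.items():
    print(f"{name:18s} {pct:+.1f}%")
```

On these values the resident ratio falls by roughly a third while the MVCC purge ratio rises by about 23%, consistent with the explanation given above.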

          deepkaran.salooja Deepkaran Salooja added a comment -

          Closing this as a duplicate of MB-36074. Both tests are basically the same: one measures latency, the other measures throughput.

          raju Raju Suravarjjala added a comment -

          Bulk closing all invalid, duplicate, and won't-fix bugs. Please feel free to reopen.

          People

            korrigan.clark Korrigan Clark (Inactive)
            korrigan.clark Korrigan Clark (Inactive)