Couchbase Server / MB-44840

running time of load phase in magma 1000-collection tests increased by at least 50%


Details

    • Untriaged
    • 1
    • Yes
    • Magma 2021-Mar1

    Description

      I did several 1-collection and 1000-collection runs on build 7.0.0-4554 and noticed that the running time of the 1000-collection runs is much longer. I compared them with old runs on build 7.0.0-3874 and found that the load phase takes longer in the recent runs. Note that this issue did not happen in the 1-collection runs.

       

      Running time of load phase (minutes)

       

      Test | Baseline (3874) | Compare (4554)
      Avg Throughput (ops/sec), 4 nodes, 1 bucket x 1B x 1KB, 20/80 R/W, s=1 c=1000, Uniform distribution, 2% Resident Ratio, Magma | 180 | 280
      99.9th percentile GET latency (ms), 4 nodes, 1 bucket x 1B x 1KB, 15K ops/sec (5/90/5 C/R/W), s=1 c=1000, Uniform distribution, 2% Resident Ratio, Magma | 180 | 630

       

      Attachments


        Activity

          bo-chun.wang Bo-Chun Wang added a comment

          Sarath Lakshman

          The issue happened between 7.0.0-4291 and 7.0.0-4342. There are 3 magma changes between these two weekly builds.

           

          • Commit: dca855c16a0e8a89967e7dce757a24c3b89d2dc0 in build: couchbase-server-7.0.0-4330
            MB-41252 magma: Add additional debugging to status in case of missing files
             
          • Commit: e4937f7e4768ddb4dcfb68d8fec6a665fdf99def in build: couchbase-server-7.0.0-4324
            CBSS-591 magma: Avoid using coroutines when queue depth is less than 2
            When the queue depth is disabled or 1, the extra overhead of allocating the coroutine stack can be avoided (see the first sketch after this list).
             
          • Commit: 4866266d433da7a9d6c48cc3bb001bebc6a08ea2 in build: couchbase-server-7.0.0-4306
            MB-43864 memory: Fix bug on CentOS with memory accounting
            It appears that static local pointers are allocated at runtime on CentOS and freed at process end. This is slightly different from Ubuntu, even though both OSes use the same compiler (gcc 7.3.0). The fix is simply to move the envGuard above the static local (see the second sketch after this list).
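
          For reference, a minimal sketch of the CBSS-591 idea. The names are hypothetical and C++20 coroutines are used purely for illustration (Magma's own coroutine machinery is not shown in this ticket); the point is only the dispatch branch that skips the coroutine-frame allocation when the queue depth is below 2:

          #include <coroutine>
          #include <cstdio>
          #include <functional>

          // Minimal fire-and-forget coroutine type (illustrative only, not Magma's API).
          struct Task {
              struct promise_type {
                  Task get_return_object() { return {}; }
                  std::suspend_never initial_suspend() noexcept { return {}; }
                  std::suspend_never final_suspend() noexcept { return {}; }
                  void return_void() {}
                  void unhandled_exception() {}
              };
          };

          // Stand-in for an I/O request handler.
          void handleRequest(int id) { std::printf("handled request %d\n", id); }

          // Runs the handler inside a newly allocated coroutine frame.
          Task runOnCoroutine(std::function<void()> fn) {
              fn();
              co_return;
          }

          // When the queue depth is 0 or 1 there is no concurrency to exploit,
          // so the coroutine-frame allocation is skipped and the work runs inline.
          void dispatch(int queueDepth, int requestId) {
              if (queueDepth < 2) {
                  handleRequest(requestId);                           // inline, no frame
              } else {
                  runOnCoroutine([=] { handleRequest(requestId); });  // allocate frame
              }
          }

          int main() {
              dispatch(/*queueDepth=*/1, 1);  // inline path
              dispatch(/*queueDepth=*/8, 2);  // coroutine path
          }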
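
          Similarly, a minimal sketch of the MB-43864 ordering fix. EnvGuard is a hypothetical stand-in for the envGuard mentioned above (the real type is not shown in this ticket), and its effect on accounting is only illustrated with prints; the sketch only shows the reordering itself:

          #include <cstdio>

          // Hypothetical guard that suspends memory accounting for the current scope.
          struct EnvGuard {
              EnvGuard()  { std::puts("accounting guard on"); }
              ~EnvGuard() { std::puts("accounting guard off"); }
          };

          int* makeBuffer() { return new int[16](); }

          // Buggy ordering as described for CentOS: the static local is initialised
          // on first call, before the guard is constructed, so its allocation
          // happens without the guard active.
          int* getBufferBuggy() {
              static int* buf = makeBuffer();  // runs before the guard takes effect
              EnvGuard guard;
              return buf;
          }

          // Fixed ordering: construct the guard first, then let the static local
          // allocate while the guard is active.
          int* getBufferFixed() {
              EnvGuard guard;                  // moved above the static local
              static int* buf = makeBuffer();
              return buf;
          }

          int main() {
              getBufferBuggy();
              getBufferFixed();
          }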
          sarath Sarath Lakshman added a comment (edited)

          A comparison of two runs does not indicate a server-side problem.

          http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=rhea_700-4554_custom_load_2360&snapshot=rhea_700-4226_custom_load_ebae

          The write queue is almost empty and the ops rate is low, indicating that the incoming rate is low. I suspect it is an issue with the document loader on the client side.

          The ops rate drops to a very low level after 5000 s, and the load phase continues for a long duration.


          bo-chun.wang Bo-Chun Wang added a comment

          I compared the load phase between the default collection, 1 non-default collection, and 1000 collections. The issue happens in the 1000-collection runs only. Perfrunner doesn't limit the ops rate during the load phase. The loading time has become much longer in recent builds. We haven't made changes to the document loader recently, and we don't see this issue with old builds.

          There were regressions between builds 7.0.0-4226 and 7.0.0-4342 because of memory issues. Moreover, according to Scott's finding in MB-41876, Magma has worse performance under an extreme memory shortage with 1000 collections. That's why I suspect this issue is related to memory.

           

          All runs were using build 7.0.0-4554. 

          default collection: http://perf.jenkins.couchbase.com/job/rhea-5node2/899/ 

          1 non-default collection: http://perf.jenkins.couchbase.com/job/rhea-5node2/916/ 

          1000 collections: http://perf.jenkins.couchbase.com/job/rhea-5node2/917/ 

          http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=rhea_700-4554_custom_load_d36e&label=defaultcollection&snapshot=rhea_700-4554_custom_load_cb38&label=1collection&snapshot=rhea_700-4554_custom_load_1651&label=1000collection

           


          sarath Sarath Lakshman added a comment

          There is no sign of memory accounting issues in this test. The incoming load appears to drop to as low as 500 items/sec after some time. I do not see any OOM errors in the stats to indicate that the server issues OOMs to throttle the incoming load.

           


          sarath Sarath Lakshman added a comment

          I no longer see the test taking 14 hours; it now completes within 4 hours 32 minutes.


          People

            Assignee: bo-chun.wang Bo-Chun Wang
            Reporter: bo-chun.wang Bo-Chun Wang
            Votes: 0
            Watchers: 3


              Gerrit Reviews

                There are no open Gerrit changes
