  Couchbase Server / MB-49399

View query latency regression in 1070


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version: 7.1.0
    • Fix Version: 7.1.0
    • Component: performance
    • Triage: Untriaged
    • Environment: Centos 64-bit
    • 1
    • Unknown
    • Sprint: KV 2021-Dec

    Description

      Looks like in build 1070 there is a regression in some of the view queries:

      http://showfast.sc.couchbase.com/#/timeline/Linux/views/by_type/all

      80th percentile query latency (ms), 1 node, 20M docs, 3 views, 100 updates/sec, 100 queries/sec, stale=false, basic queries, default collection

      1069 - http://perf.jenkins.couchbase.com/job/leto/21592/console - 220

      1070 - http://perf.jenkins.couchbase.com/job/leto/21588/console - 248

      http://changelog.build.couchbase.com/?product=couchbase-server&fromVersion=7.1.0&fromBuild=1069&toVersion=7.1.0&toBuild=1070&f_couchbase-cli=off&f_eventing=off&f_kv_engine=on&f_magma=off&f_product-metadata=off&f_tlm=on

      Looks like there are 3 kv_engine commits related to memory allocation. From the graphs it appears as though there is more free memory in the 1070 run: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=leto_710-1069_access_0477&snapshot=leto_710-1070_access_963d#6799c0925e8393e50adc1e168ec87cbd

      Attachments

        Issue Links


          Activity

            korrigan.clark Korrigan Clark (Inactive) added a comment - rerunning to grab logs: https://perf.jenkins.couchbase.com/job/leto/21594/ https://perf.jenkins.couchbase.com/job/leto/21595/  
            owend Daniel Owen added a comment -

            Currently blocked by CBPS-982

            owend Daniel Owen added a comment - - edited

            The 3 KV patches introduced in build 1070 are related to MB-39745

            * Commit: 97dc63a2520f8ae760b25d6f31cc0f83f19acfa3 in build: couchbase-server-7.1.0-1070
            MB-39745: Don't over-allocate memory for vbAbortCompaction
             
            We only need to allocate memory for the vBuckets that a shard will
            deal with. Allocate num vBuckets / num shards entries in the vector
            rather than num vBuckets and use the result of integer division to
            access the correct element.
             
            * Commit: 37d812aa073750a3f95839c37467eb18b3deb8ce in build: couchbase-server-7.1.0-1070
            MB-39745: Don't over-allocate memory for kvstoreRevList
             
            We only need to allocate memory for the vBuckets that a shard will
            deal with. Allocate num vBuckets / num shards entries in the vector
            rather than num vBuckets and use the result of integer division to
            access the correct element.
             
            * Commit: 7644ff5e4d3cadf7592133cb6c97973c4cedd453 in build: couchbase-server-7.1.0-1070
            MB-39745: Don't over-allocate memory for vBucketStates
             
            We only need to allocate memory for the vBuckets that a shard will
            deal with. Allocate num vBuckets / num shards entries in the vector
            rather than num vBuckets and use the result of integer division to
            access the correct element.
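
            For reference, a minimal sketch of the allocation pattern those commit messages describe, assuming illustrative names (ShardLocalState, numVBuckets, numShards, vbid) rather than the actual kv_engine identifiers:

                #include <cstddef>
                #include <vector>

                // Per-shard container: a shard only services numVBuckets / numShards
                // vBuckets, so size the vector for that count instead of for all
                // numVBuckets (the over-allocation the patches remove).
                struct ShardLocalState {
                    ShardLocalState(std::size_t numVBuckets, std::size_t numShards)
                        : entries(numVBuckets / numShards) {}

                    // vBuckets are distributed across shards, so integer division
                    // (vbid / numShards) maps a vbid to its slot within this
                    // shard-local vector.
                    int& forVBucket(std::size_t vbid, std::size_t numShards) {
                        return entries.at(vbid / numShards);
                    }

                    std::vector<int> entries;
                };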
            

            Looking at the patches I'd be surprised if they were responsible for the slowdown.

            Looking at the test (80th percentile query latency (ms), 1 node, 20M docs, 3 views, 100 updates/sec, 100 queries/sec, stale=false, basic queries, default collection), it seems to have quite a bit of variance.

            Hi Korrigan Clark, could you run the test a minimum of 3 times on builds 1069 and 1070 to increase confidence in the regression?

            thanks

            korrigan.clark Korrigan Clark (Inactive) added a comment - - edited

            Daniel Owen, the variance is due to the cluster OS and kernels being upgraded recently; I have been rerunning the tests on some builds after the upgrade, hence the variance. Anyway, I have queued up 3 reruns of each build:

            1070 - 

            https://perf.jenkins.couchbase.com/job/leto/21648/ - 245

            https://perf.jenkins.couchbase.com/job/leto/21649/ - 244

            https://perf.jenkins.couchbase.com/job/leto/21650/ - 245

            1069 - 

            https://perf.jenkins.couchbase.com/job/leto/21651/ - 217

            https://perf.jenkins.couchbase.com/job/leto/21652/ - 220

            https://perf.jenkins.couchbase.com/job/leto/21653/  - 213
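
            For reference, averaging the reruns above: 1070 gives (245 + 244 + 245) / 3 ≈ 244.7 ms and 1069 gives (217 + 220 + 213) / 3 ≈ 216.7 ms, roughly a 13% increase, consistent across runs.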

            owend Daniel Owen added a comment -

            Thanks Korrigan Clark for setting up the runs and getting the results.
            I have changed the component to Performance.


            korrigan.clark Korrigan Clark (Inactive) added a comment -

            Daniel Owen, why the component change? Shouldn't it be the views component or KV?
            owend Daniel Owen added a comment - - edited

            Apologies Korrigan Clark, I forgot about the regression! (I was focused on the performance variation issue and therefore thought it was resolved.)
            Assigning to the views team to investigate the performance degradation.

            I took a look at the changelog and nothing went into views, therefore KV should investigate further.

            thanks

            paolo.cocchi Paolo Cocchi added a comment - - edited

            Note: comparing build 1069 from the existing run (https://perf.jenkins.couchbase.com/job/leto/21653/) against build 1070 from a new run (http://perf.jenkins.couchbase.com/job/leto/22033), as the existing run for 1070 didn't have logs collected.

            cbmonitor shows a correlation between latency_query and avg_disk_update_time:


            The test is doing queries with stale=false (https://docs.couchbase.com/server/current/learn/views/views-operation.html):

            The index is updated before you execute the query, making sure that any documents updated and persisted to disk are included in the view. The client will wait until the index has been updated before the query has executed and, therefore, the response will be delayed until the updated index is available.

            .. which would explain the correlation.
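
            As a point of reference, a stale=false view query is just an HTTP request against the view engine on port 8092, roughly like the following (the bucket, design-document and view names are placeholders, not the actual test artifacts):

                curl -u Administrator:password \
                  'http://<node>:8092/bucket-1/_design/ddoc1/_view/view0?stale=false&limit=10'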

            We have checked the ViewEngine behaviour with Ankit Prabhu, and it seems that the "stale" param behaviour is inconsistent with what is described in the docs.
            In particular, it seems that any incremental update to the index is done by an in-memory DCP stream, which isn't expected to be delayed by any degradation of disk writes in KV.

            Regardless of the ongoing investigation on Views, a degradation in avg_disk_update_time is a more general problem that might affect other latencies, e.g. PersistToMajority SyncWrites. That's what I'm addressing now.

            UPDATE
            There is no correlation with disk_update; some runs don't show it, e.g. http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=leto_710-1069_access_0477&snapshot=leto_710-1070_access_963d#6799c0925e8393e50adc1e168ec87cbd.

            paolo.cocchi Paolo Cocchi added a comment - - edited

            I've run the test on a toy build (https://github.com/couchbase/manifest/blob/master/toy/toy-paolo-2.xml) that is essentially 1070 (i.e., the regressed build), except for KV, which is pinned to 1069:

            Custom manifest
            Baseline: couchbase-server-7.1.0-1070-manifest.xml
            kv_engine -> 7.1.0-1069 (ee80242f0bb75df9f7ab0ce06e1a5f2483420503)
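
            In repo-manifest terms that amounts to pinning the kv_engine project to the 1069 SHA, roughly as below (other attributes such as remote/path/groups are omitted here and may differ in the actual toy manifest):

                <project name="kv_engine" revision="ee80242f0bb75df9f7ab0ce06e1a5f2483420503"/>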

            Build / run:

            http://server.jenkins.couchbase.com/view/Toys/job/toy-unix/8870/console
            http://latestbuilds.service.couchbase.com/builds/latestbuilds/couchbase-server/toybuilds/202112150/
            http://perf.jenkins.couchbase.com/job/leto/22208/console

            Result is that we do observe the regression:

            "metric": "query_lat_20M_basic_default_leto",
            "value": 245

            This means that the regression in 1070 isn't introduced by any KV change.


            paolo.cocchi Paolo Cocchi added a comment -

            Hi Korrigan Clark, I'm assigning this back to you.

            We have excluded any KV relation here, but what's causing the regression is still unclear.
            The [1069, 1070] changelog doesn't contain anything that seems related to this regression. But I've repeated the test on pure 1069 and that confirms better numbers there (http://perf.jenkins.couchbase.com/job/leto/22206/console):

            "metric": "query_lat_20M_basic_default_leto",
            "value": 217

            Changelog: http://changelog.build.couchbase.com/?product=couchbase-server&fromVersion=7.1.0&fromBuild=1069&toVersion=7.1.0&toBuild=1070&f_couchbase-cli=on&f_eventing=on&f_kv_engine=on&f_magma=on&f_product-metadata=on&f_tlm=on
            Maybe something in tlm has unexpected side effects?
            Probably the quickest way to proceed is to keep making toy builds that "revert" projects one by one to 1069 and repeat the test.

            Also note that (as discussed with Ankit) queries here are non-trivial. A "stale=false" query is a complex operation, so what specific part is regressing? Answering that question might help in identifying the problem.

            Daniel Owen, I guess the "couchbase-bucket" component can be removed?

            Thanks

            owend Daniel Owen added a comment -

            Paolo Cocchi, thanks for ruling out KV - yes, I think couchbase-bucket can be removed.
            Setting the component to Performance whilst the correct component is identified.

            ritam.sharma Ritam Sharma added a comment -

            Korrigan Clark and Wayne Siu - please assign the ticket appropriately.

            drigby Dave Rigby added a comment -

            Note that Eventing is also seeing a performance regression in build 1070 (MB-47908), and a common thread between the View engine and Eventing is the use of the V8 JS runtime. V8 was rebuilt in build 1070 (see comments on MB-47908), so I suspect that may be the root cause of this issue.

            wayne Wayne Siu added a comment - Created a toy build including the earlier v8 build:  http://latestbuilds.service.couchbase.com/builds/latestbuilds/couchbase-server/toybuilds/14341/
            korrigan.clark Korrigan Clark (Inactive) added a comment - queued toy: http://perf.jenkins.couchbase.com/job/leto/22907/

            korrigan.clark Korrigan Clark (Inactive) added a comment -

            225 from the toy build... looks back to normal.

            raju Raju Suravarjjala added a comment -

            Korrigan Clark, can this bug be closed now?

            korrigan.clark Korrigan Clark (Inactive) added a comment -

            The toy build fixed the issue, but it doesn't seem like the fix has gone into an official build yet.
            wayne Wayne Siu added a comment -

            The fix (V8) for MB-47908 also fixes this view regression. Tested with build 7.1.0-2223.

            Build couchbase-server-7.1.0-2182 contains tlm commit f4c48aa with commit message:
            MB-47908: Revert to earlier v8 build on x86_64 platforms

            Closing this ticket.

            People

              Assignee: wayne Wayne Siu
              Reporter: korrigan.clark Korrigan Clark (Inactive)
              Votes: 0
              Watchers: 9

