Couchbase Server / MB-49399

View query latency regression in 1070


Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version: Neo
    • Fix Version: Neo
    • Component: performance
    • Triage: Untriaged
    • Environment: Centos 64-bit
    • 1
    • Unknown
    • Sprint: KV 2021-Dec

    Description

      It looks like build 1070 introduces a regression in some of the view queries:

      http://showfast.sc.couchbase.com/#/timeline/Linux/views/by_type/all

      80th percentile query latency (ms), 1 node, 20M docs, 3 views, 100 updates/sec, 100 queries/sec, stale=false, basic queries, default collection

      1069 - http://perf.jenkins.couchbase.com/job/leto/21592/console - 220

      1070 - http://perf.jenkins.couchbase.com/job/leto/21588/console - 248
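      For reference, a minimal illustrative sketch (not the actual perfrunner harness) of how a stale=false view query can be issued against the view REST API on port 8092 and an 80th-percentile latency computed. The host, credentials, bucket, design document and view names are placeholders:

        import time
        import requests  # assumes the 'requests' package is installed

        HOST = "http://127.0.0.1:8092"        # view/CAPI port on one node (placeholder)
        AUTH = ("Administrator", "password")  # placeholder credentials
        BUCKET, DDOC, VIEW = "bucket-1", "ddoc1", "view1"  # placeholder names

        def p80_query_latency_ms(samples=100):
            url = f"{HOST}/{BUCKET}/_design/{DDOC}/_view/{VIEW}"
            latencies = []
            for _ in range(samples):
                start = time.monotonic()
                resp = requests.get(url, params={"stale": "false", "limit": 20}, auth=AUTH)
                resp.raise_for_status()
                latencies.append((time.monotonic() - start) * 1000)
            latencies.sort()
            return latencies[int(0.8 * len(latencies)) - 1]  # 80th percentile

        print(f"p80 view query latency: {p80_query_latency_ms():.0f} ms")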

      http://changelog.build.couchbase.com/?product=couchbase-server&fromVersion=7.1.0&fromBuild=1069&toVersion=7.1.0&toBuild=1070&f_couchbase-cli=off&f_eventing=off&f_kv_engine=on&f_magma=off&f_product-metadata=off&f_tlm=on

      It looks like there are 3 kv_engine commits related to memory allocation. From the graphs it appears that there is more free memory in the 1070 run: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=leto_710-1069_access_0477&snapshot=leto_710-1070_access_963d#6799c0925e8393e50adc1e168ec87cbd

      Attachments

        Issue Links


          Activity

            Paolo Cocchi added a comment - edited

            Note: comparing build 1069 from the existing run (https://perf.jenkins.couchbase.com/job/leto/21653/) against build 1070 from a new run (http://perf.jenkins.couchbase.com/job/leto/22033), as the existing run for 1070 didn't have logs collected.

            cbmonitor shows a correlation between latency_query and avg_disk_update_time:


            The test is doing queries with stale=false (https://docs.couchbase.com/server/current/learn/views/views-operation.html):

            The index is updated before you execute the query, making sure that any documents updated and persisted to disk are included in the view. The client will wait until the index has been updated before the query has executed and, therefore, the response will be delayed until the updated index is available.

            .. which would explain the correlation.
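            As a rough back-of-envelope model (illustrative only; the component timings below are made up, not taken from cbmonitor): with stale=false the observed query latency is roughly the sum of the wait for outstanding mutations to persist, the incremental index update, and the view read itself, so an increase in disk update time would show up directly in latency_query:

              def stale_false_latency_ms(disk_update_ms, index_update_ms, view_read_ms):
                  # stale=false: the client effectively waits for persistence + index update
                  # before the view is read, so the three phases add up.
                  return disk_update_ms + index_update_ms + view_read_ms

              # Purely illustrative numbers: a ~30 ms slowdown in disk updates alone
              # would account for a ~220 ms -> ~250 ms shift in the p80 query latency.
              print(stale_false_latency_ms(40, 150, 30))   # ~220
              print(stale_false_latency_ms(70, 150, 30))   # ~250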

            We have checked the ViewEngine behaviour with Ankit Prabhu, and it seems that the behaviour of the "stale" param is inconsistent with what is described in the docs.
            In particular, it seems that incremental updates to the index are done via an in-memory DCP stream, which isn't expected to be delayed by any degradation of disk writes in KV.

            Regardless of the ongoing investigation on Views, a degradation in avg_disk_update_time is a more general problem that might affect other latencies, e.g. PersistToMajority SyncWrites. That's what I'm addressing now.

            UPDATE
            There is no correlation with disk_update after all; some runs don't show it, e.g. http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=leto_710-1069_access_0477&snapshot=leto_710-1070_access_963d#6799c0925e8393e50adc1e168ec87cbd.

            Paolo Cocchi added a comment - edited

            I've run the test on a toy build (https://github.com/couchbase/manifest/blob/master/toy/toy-paolo-2.xml) that is essentially 1070 (i.e., the regressed build), except that KV is pinned to 1069:

            Custom manifest
            Baseline: couchbase-server-7.1.0-1070-manifest.xml
            kv_engine -> 7.1.0-1069 (ee80242f0bb75df9f7ab0ce06e1a5f2483420503)

            Build / run:
            http://server.jenkins.couchbase.com/view/Toys/job/toy-unix/8870/console
            http://latestbuilds.service.couchbase.com/builds/latestbuilds/couchbase-server/toybuilds/202112150/
            http://perf.jenkins.couchbase.com/job/leto/22208/console

            The result is that we still observe the regression:

            "metric": "query_lat_20M_basic_default_leto",
            "value": 245

            This means that the regression in 1070 isn't introduced by any KV change.


            Hi Korrigan Clark, I'm assigning this back to you.

            We have ruled out any KV involvement here, but what's causing the regression is still unclear.
            The [1069, 1070] changelog doesn't contain anything that looks related to this regression. However, I've repeated the test on a pure 1069 build, and that confirms the better numbers there (http://perf.jenkins.couchbase.com/job/leto/22206/console):

            "metric": "query_lat_20M_basic_default_leto",
            "value": 217

            Changelog: http://changelog.build.couchbase.com/?product=couchbase-server&fromVersion=7.1.0&fromBuild=1069&toVersion=7.1.0&toBuild=1070&f_couchbase-cli=on&f_eventing=on&f_kv_engine=on&f_magma=on&f_product-metadata=on&f_tlm=on
            Maybe something in tlm has unexpected side effects?
            Probably the quickest way to proceed is to keep making toy builds that exclude projects one by one (i.e., revert each to its 1069 revision) and repeat the test, as sketched below.
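            A hedged sketch of that bisection loop; the project names and revisions are placeholders (the real list comes from the 1069 -> 1070 changelog and manifests), and the build/test launch step is deliberately left unimplemented:

              # Placeholder map of projects that changed between 1069 and 1070 and
              # their 1069 revisions (to be filled in from the changelog / manifests).
              CHANGED_PROJECTS = {
                  "tlm":      "rev-at-1069",
                  "eventing": "rev-at-1069",
                  "magma":    "rev-at-1069",
              }

              def make_toy_manifest(baseline: str, project: str, revision: str) -> str:
                  # In practice this would emit a repo-manifest annotation pinning the
                  # project's revision; represented here as a plain description string.
                  return f"{baseline} + pin {project} @ {revision}"

              def run_view_query_test(manifest: str) -> float:
                  # Placeholder: kick off the leto perf job for this toy build and
                  # return the query_lat_20M_basic_default_leto value.
                  raise NotImplementedError

              for project, rev in CHANGED_PROJECTS.items():
                  toy = make_toy_manifest("couchbase-server-7.1.0-1070-manifest.xml", project, rev)
                  print("would test:", toy)
                  # latency = run_view_query_test(toy)
                  # if latency drops back to ~220 ms, 'project' is the likely culprit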

            Also note that (as discussed with Ankit) queries here are non-trivial. A "stale=false" query is a complex operation, so what specific part is regressing? Answering that question might help in identifying the problem.

            Daniel Owen, I guess that the "couchbase-bucket" component can be removed?

            Thanks

            Daniel Owen added a comment -

            Paolo Cocchi, thanks for ruling out KV - yes, I think couchbase-bucket can be removed.
            Setting the component to performance while the correct one is identified.

            Ritam Sharma added a comment -

            Korrigan Clark and Wayne Siu - please assign the ticket appropriately.


            People

              Assignee: Korrigan Clark
              Reporter: Korrigan Clark
              Votes: 0
              Watchers: 5

                Gerrit Reviews

                  There are no open Gerrit changes
