Couchbase Server / MB-34180

Unstable performance results run-to-run on DCP & Query throughput tests


    Details

    • Triage:
      Untriaged
    • Is this a Regression?:
      Yes

      Description

      We are observing a decrease in DCP throughput and in aggregate query throughput. Both regressions appear to be caused by the same commit, so I am filing them together.

      The following tests both show a regression:

      • DCP Throughput. 1 node. 250M x 1KB. Java client. compression enabled
        • 68612 to 55816 (~19% drop)
      • Avg. Query Throughput (queries/sec). AG8. Aggregation query. Plasma. not_bounded
        • 295 to 133 (~55% drop)

      -----------------------------------------

      Test

      Avg. Query Throughput (queries/sec). AG8. Aggregation query. Plasma. not_bounded

      Index

      CREATE INDEX ag8 ON `bucket-1`(customer.county, DATE_PART_STR(sold_date, 'year'), DATE_PART_STR(sold_date, 'month'), customer.gender, customer.marital_status, customer.education_status, customer.purchase_estimate, customer.credit_rating, customer.dep_count, customer.dep_employed_count, customer.dep_college_count);

      Query

      SELECT customer.gender, customer.marital_status, customer.education_status, customer.purchase_estimate, customer.credit_rating, customer.dep_count, customer.dep_employed_count, customer.dep_college_count, COUNT(1) AS cnt FROM `bucket-1` WHERE customer.county IN $1 AND DATE_PART_STR(sold_date, 'year') = $2 AND DATE_PART_STR(sold_date, 'month') BETWEEN $3 AND $3 + 3 GROUP BY customer.gender, customer.marital_status, customer.education_status, customer.purchase_estimate, customer.credit_rating, customer.dep_count, customer.dep_employed_count, customer.dep_college_count ORDER BY customer.gender, customer.marital_status, customer.education_status, customer.purchase_estimate, customer.credit_rating, customer.dep_count, customer.dep_employed_count, customer.dep_college_count LIMIT 100;
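
      For reference, one way to issue the query with its positional parameters ($1-$3) is through the query service REST endpoint. The sketch below is illustrative only: the node address, credentials, and argument values are placeholders rather than the harness configuration; the statement is the AG8 query shown above.

          import requests

          # Assumed query node address and credentials -- substitute the real ones.
          QUERY_URL = "http://172.23.97.20:8093/query/service"
          AUTH = ("Administrator", "password")

          AG8_STATEMENT = """
          SELECT customer.gender, customer.marital_status, customer.education_status,
                 customer.purchase_estimate, customer.credit_rating, customer.dep_count,
                 customer.dep_employed_count, customer.dep_college_count, COUNT(1) AS cnt
          FROM `bucket-1`
          WHERE customer.county IN $1
            AND DATE_PART_STR(sold_date, 'year') = $2
            AND DATE_PART_STR(sold_date, 'month') BETWEEN $3 AND $3 + 3
          GROUP BY customer.gender, customer.marital_status, customer.education_status,
                   customer.purchase_estimate, customer.credit_rating, customer.dep_count,
                   customer.dep_employed_count, customer.dep_college_count
          ORDER BY customer.gender, customer.marital_status, customer.education_status,
                   customer.purchase_estimate, customer.credit_rating, customer.dep_count,
                   customer.dep_employed_count, customer.dep_college_count
          LIMIT 100
          """

          response = requests.post(
              QUERY_URL,
              auth=AUTH,
              json={
                  "statement": AG8_STATEMENT,
                  # Positional args map to $1, $2, $3; these sample values are made up.
                  "args": [["Walker County", "Ziebach County"], 1998, 1],
                  "scan_consistency": "not_bounded",  # matches the test configuration
              },
              timeout=75,
          )
          response.raise_for_status()
          print(response.json()["metrics"])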

      Results

      6.5.0-3166: 285.0 q/s - http://perf.jenkins.couchbase.com/job/arke/1212/

      6.5.0-3167: 208.0 q/s - http://perf.jenkins.couchbase.com/job/arke/1213

      Report

      http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=arke_650-3166_access_6759&snapshot=arke_650-3167_access_1b71

      Logs

      6.5.0-3166

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-1212/172.23.97.12.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-1212/172.23.97.13.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-1212/172.23.97.14.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-1212/172.23.97.15.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-1212/172.23.97.18.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-1212/172.23.97.19.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-1212/172.23.97.20.zip

      6.5.0-3167

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-1213/172.23.97.12.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-1213/172.23.97.13.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-1213/172.23.97.14.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-1213/172.23.97.15.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-1213/172.23.97.18.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-1213/172.23.97.19.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-1213/172.23.97.20.zip

      Changelog

      http://172.23.123.43:8000/getchangelog?product=couchbase-server&fromb=6.5.0-3166&tob=6.5.0-3167

      PProf

      6.5.0-3166

      6.5.0-3167

      Comment

      The changelog from 3166 to 3167 contains only couchdb, couchbase-cli, and kv_engine commits. There are also open indexer bugs that affect query performance.

      -----------------------------------------

      Test

      DCP Throughput. 1 node. 250M x 1KB. Java client. compression enabled

       

      Results

      6.5.0-3166: 66124 w/s - http://perf.jenkins.couchbase.com/job/ares-dcp/379/

      6.5.0-3167: 59007 w/s - http://perf.jenkins.couchbase.com/job/ares-dcp/380/

      Report

      http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=ares_650-3166_access_356a&snapshot=ares_650-3167_access_9b10

      Logs

      6.5.0-3166

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-ares-dcp-379/172.23.133.11.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-ares-dcp-379/172.23.133.12.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-ares-dcp-379/172.23.133.13.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-ares-dcp-379/172.23.133.14.zip

      6.5.0-3167

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-ares-dcp-380/172.23.133.11.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-ares-dcp-380/172.23.133.12.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-ares-dcp-380/172.23.133.13.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-ares-dcp-380/172.23.133.14.zip

      Changelog

      http://172.23.123.43:8000/getchangelog?product=couchbase-server&fromb=6.5.0-3166&tob=6.5.0-3167

       

      Comment

      Both of these tests appear to show a performance degradation caused by the same commit.

       

        Attachments


          Activity

          drigby Dave Rigby added a comment (edited)

          Regarding the DCP throughput test, I'm seeing pretty significant build-to-build variation in that test anyway (from http://showfast.sc.couchbase.com/#/timeline/Linux/kv/dcp/all):

          Seeing highs of 68k & lows of 51k across 6.5.0. You see this even more in the daily run (different test parameters but otherwise same underlying test): http://showfast.sc.couchbase.com/daily/#/history/KV%7CDCP,%20250M%20x%201KB%20items,%20DGM%7CAvg%20Throughput%20(items/sec)

          To summarise, I'm not convinced there's any regression due to a KV-Engine code change between builds 3166 and 3167.
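
          For context, a rough comparison of that historical spread against the delta reported here, using only the numbers quoted in this ticket:

              # Build-to-build spread across 6.5.0 vs. the 3166 -> 3167 delta,
              # using the figures quoted in this ticket.
              high, low = 68_000, 51_000       # historical highs/lows across 6.5.0 builds
              b3166, b3167 = 66_124, 59_007    # DCP results from the two runs above
              spread = (high - low) / high     # ~25% run-to-run spread
              delta = (b3166 - b3167) / b3166  # ~11% drop between 3166 and 3167
              print(f"spread={spread:.0%}, delta={delta:.0%}")

          The observed drop is within the historical run-to-run spread.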

          drigby Dave Rigby added a comment

          Another couple of observations about the N1QL test.

          Firstly, there's a pretty high number of errors (both in 3166 and 3167 runs):

          These appear to correlate temporally with the previously mentioned gaps in the latency_query graph:

          Secondly, the queries which are handled have pretty consistent latency percentiles:

          While latency and throughput are somewhat independent, this suggests to me that N1QL is performing consistently for the queries that do succeed.

          Korrigan Clark, I think we should investigate the queries that are failing. I can't see any cumulative numbers for failed queries, but I suspect the difference in throughput is at least partly due to requests simply failing, and hence the overall number of queries issued in the time period being different. Note that perfrunner defines the overall test metric as simply total_requests / test_time:

              def _avg_n1ql_throughput(self) -> int:
                  test_time = self.test_config.access_settings.time
           
                  query_node = self.cluster_spec.servers_by_role('n1ql')[0]
                  vitals = self.test.rest.get_query_stats(query_node)
                  total_requests = vitals['requests.count']
           
                  throughput = total_requests / test_time
                  return round(throughput, throughput < 1 and 1 or 0)
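
          If failed requests need to be excluded from the metric, one option is a variant that subtracts an error counter before computing the rate. This is only a sketch: the method name is hypothetical, and it assumes the same stats payload also exposes an 'errors.count' counter alongside 'requests.count', which would need to be verified against the endpoint perfrunner queries.

              def _avg_successful_n1ql_throughput(self) -> int:
                  # Hypothetical variant of _avg_n1ql_throughput above: count only
                  # requests that did not fail. Assumes 'errors.count' is present
                  # in the same stats payload as 'requests.count'.
                  test_time = self.test_config.access_settings.time

                  query_node = self.cluster_spec.servers_by_role('n1ql')[0]
                  vitals = self.test.rest.get_query_stats(query_node)
                  successful = vitals['requests.count'] - vitals.get('errors.count', 0)

                  throughput = successful / test_time
                  return round(throughput, 1 if throughput < 1 else 0)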
          

          korrigan.clark Korrigan Clark added a comment

          Dave Rigby

          Thanks Dave. It looks like there is an issue where the indexer is crashing, which I think might be the cause of this. I agree that we need to add logging of failed queries and of the throughput of successful queries. I will take a look at this.

          ritam.sharma Ritam Sharma added a comment

          Korrigan Clark - any update on the ticket?

          korrigan.clark Korrigan Clark added a comment

          Ritam Sharma, these tests are pretty volatile from run to run, so this is not really a bug. The current numbers look to be in line with normal performance. Closing it out.


            People

            Assignee:
            korrigan.clark Korrigan Clark
            Reporter:
            korrigan.clark Korrigan Clark
            Votes:
            0
            Watchers:
            6

