Couchbase Server / MB-36765

15% throughput regression in YCSB Workload A 12vCPU


Details

    • Untriaged
    • Centos 64-bit
    • Yes
    • KV-Engine Mad-Hatter GA

    Description

      There is a ~15% throughput decrease in "Avg Throughput (ops/sec), Workload A, 3 nodes, 12 vCPU, replicateTo=1". The regression appears in build 6.5.0-4723: http://172.23.123.43:8000/getchangelog?product=couchbase-server&fromb=6.5.0-4722&tob=6.5.0-4723

       

      4722: 

      http://perf.jenkins.couchbase.com/job/hebe/4973/ - 118370 ops/sec

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-4973/172.23.100.190.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-4973/172.23.100.191.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-4973/172.23.100.192.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-4973/172.23.100.193.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-4973/172.23.100.204.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-4973/172.23.100.205.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-4973/172.23.100.206.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-4973/172.23.100.207.zip

      http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=hebe_650-4722_access_5f2e

      4723:

      http://perf.jenkins.couchbase.com/job/hebe/4977/ - 102525 ops/sec

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-4977/172.23.100.190.zip
       https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-4977/172.23.100.191.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-4977/172.23.100.192.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-4977/172.23.100.193.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-4977/172.23.100.204.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-4977/172.23.100.205.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-4977/172.23.100.206.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-4977/172.23.100.207.zip

      http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=hebe_650-4723_access_0f9c

       

      Comparison:

      http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=hebe_650-4722_access_5f2e&snapshot=hebe_650-4723_access_0f9c
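
      For reference, a minimal sketch (Python) of the regression arithmetic implied by the two runs above, using the reported ops/sec figures:

        # Quick check of the drop between the two runs linked above.
        baseline = 118370    # hebe/4973, build 6.5.0-4722 (ops/sec)
        candidate = 102525   # hebe/4977, build 6.5.0-4723 (ops/sec)
        drop_pct = (baseline - candidate) / baseline * 100
        print("drop: %.1f%%" % drop_pct)   # ~13.4% for this pair, consistent with the ~15% reported across runs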


          Activity

            drigby Dave Rigby added a comment -

            Korrigan Clark Thanks for isolating when the performance changes are seen.

            The primary drop was in build 4774, when we increased the number of shards from 4 to $NUM_CPUS. I've confirmed from the logs of run #5362 (via stats.log) that we correctly set the number of shards to 12:

             ep_workload:num_shards:               12
            

            Note that the previous build (4773) increased the number of reader and writer threads to 12; however, without increasing the shards those extra threads should be idle (there exist only 4 Flusher tasks for them to execute).
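
            A tiny illustration of that point (numbers are the 6.0.3 shard count and the new thread count; this is just the reasoning, not product code):

              # One Flusher task per shard, so writer threads beyond the shard count
              # have no flusher work to pick up.
              num_shards = 4           # shards (and hence Flusher tasks) before build 4774
              num_writer_threads = 12  # writer threads after build 4773
              print(min(num_writer_threads, num_shards), "writers can flush concurrently; the rest stay idle")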

            I don't think there's much value in trying toy builds with the following patches reverted:

            • MB-36723: Set Writer threads to minimum priority - we know this reduces the impact of background writer threads on the frontend; and given there's the same number of Writer threads in 6.5.0-4908 as in 6.0.3 (4 threads), I don't see any reason why a lower priority would cause a regression here.
            • MB-36249: Don't floor() write amplification stats - this only affects the output of cbstats; it doesn't change anything in ep-engine performance.

            It's possible but unlikely the following patch has any negative impact:

            • MB-36723: Optimize KVShard memory usage - it just packs the elements in the shard array tighter together to save memory.

            I suggest first re-running 4908 with the same number of shards as we had in 6.0.3 - i.e. 4. The additional splitting of the work into more tasks might be adding overhead. This can be done with a simple override value of bucket_extras.max_num_shards.4.
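
            If it helps to confirm the override took effect on a live node, here is a rough sketch (Python) that shells out to cbstats and prints the shard count; it mirrors the cbstats workload invocation used later in this ticket, and the binary path and credentials are assumptions for this test rig:

              # Sketch only: print ep_workload:num_shards from the workload stat group.
              import subprocess

              out = subprocess.run(
                  ["/opt/couchbase/bin/cbstats", "-a", "127.0.0.1:11209", "workload",
                   "-u", "Administrator", "-p", "password"],
                  capture_output=True, text=True, check=True).stdout

              for line in out.splitlines():
                  if "num_shards" in line:
                      print(line.strip())   # expect "ep_workload:num_shards: 4" after the override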


            korrigan.clark Korrigan Clark added a comment -

            4908 shards=4
            http://perf.jenkins.couchbase.com/job/hebe/5367/ - 94902
            http://perf.jenkins.couchbase.com/job/hebe/5368/ - 94388
            http://perf.jenkins.couchbase.com/job/hebe/5369/ - 94623

             

            Also, the 6.0.3 runs seem to be stable at around 120k ops/sec.

            I ran the 4908 runs with bucket_extras.max_num_shards.4, but when I tried to verify that this took hold I could not find the value anywhere in the logs. I checked stats.log as you mentioned, but all I see is errors like this:

            ==============================================================================
            memcached stats all
            cbstats -a 127.0.0.1:11209 all -u @ns_server
            ==============================================================================
            ******************************************************************************
            Traceback (most recent call last):
              File "/opt/couchbase/lib/python/cbstats", line 927, in <module>
                main()
              File "/opt/couchbase/lib/python/cbstats", line 924, in main
                c.execute()
              File "/opt/couchbase/lib/python/clitool.py", line 71, in execute
                f[0](mc, *args[2:], **opts.__dict__)
              File "/opt/couchbase/lib/python/cbstats", line 38, in g
                f(*args, **kwargs)
              File "/opt/couchbase/lib/python/cli_auth_utils.py", line 88, in g
                mc.bucket_select(bucket)
              File "/opt/couchbase/lib/python/mc_bin_client.py", line 666, in bucket_select
                return self._doCmd(memcacheConstants.CMD_SELECT_BUCKET, name, '')
              File "/opt/couchbase/lib/python/mc_bin_client.py", line 291, in _doCmd
                return self._handleSingleResponse(opaque)
              File "/opt/couchbase/lib/python/mc_bin_client.py", line 284, in _handleSingleResponse
                cmd, opaque, cas, keylen, extralen, data = self._handleKeyedResponse(myopaque)
              File "/opt/couchbase/lib/python/mc_bin_client.py", line 280, in _handleKeyedResponse
                raise MemcachedError(errcode,  msg)
            mc_bin_client.ErrorKeyEnoent: Memcached error #1:  KEY_ENOENT : Not Found : 
            

            I also checked ns_server.stats.log and found no entries that mention shard numbers.
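
            For what it's worth, a rough sketch (Python) of scanning the downloaded per-node archives for the shard stat; the local directory layout and the stats.log member path inside each archive are assumptions, and when cbstats fails as above the stat may simply be absent:

              # Sketch only: grep the collected zips for ep_workload:num_shards.
              import glob
              import zipfile

              for archive in sorted(glob.glob("artifacts/*.zip")):   # wherever the zips were saved
                  with zipfile.ZipFile(archive) as zf:
                      for member in zf.namelist():
                          if member.endswith("stats.log"):
                              with zf.open(member) as fh:
                                  for raw in fh:
                                      if b"ep_workload:num_shards" in raw:
                                          print(archive, raw.decode().strip())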

             

            If the shard change really took hold via the override parameter, it would indicate that the number of shards in 4908 does not have a performance impact, since values of 4 and 12 give the same throughput. Dave Rigby

            korrigan.clark Korrigan Clark added a comment - edited

            I was able to verify num shards with the following:

            [root@172-23-100-204 ~]# /opt/couchbase/bin/cbstats -a 127.0.0.1:11209 workload -u Administrator -p password

            ******************************************************************************
            bucket-1

             ep_workload:LowPrioQ_AuxIO:InQsize:   1
             ep_workload:LowPrioQ_AuxIO:OutQsize:  0
             ep_workload:LowPrioQ_NonIO:InQsize:   12
             ep_workload:LowPrioQ_NonIO:OutQsize:  0
             ep_workload:LowPrioQ_Reader:InQsize:  4
             ep_workload:LowPrioQ_Reader:OutQsize: 0
             ep_workload:LowPrioQ_Writer:InQsize:  1
             ep_workload:LowPrioQ_Writer:OutQsize: 0
             ep_workload:max_auxio:                2
             ep_workload:max_nonio:                3
             ep_workload:max_readers:              12
             ep_workload:max_writers:              4
             ep_workload:num_auxio:                2
             ep_workload:num_nonio:                3
             ep_workload:num_readers:              12
             ep_workload:num_shards:               4
             ep_workload:num_sleepers:             13
             ep_workload:num_writers:              4
             ep_workload:ready_tasks:              0

            drigby Dave Rigby added a comment -

            4908 shards=4
            http://perf.jenkins.couchbase.com/job/hebe/5367/ - 94902
            http://perf.jenkins.couchbase.com/job/hebe/5368/ - 94388
            http://perf.jenkins.couchbase.com/job/hebe/5369/ - 94623

            Also, the 6.0.3 runs seem to be stable at around 120k ops/sec.

            OK, that's helpful - we can rule out the shard-count and thread-count changes, given that with both back to their 6.0.3 values we still see the regression.

            I started looking at the writer thread priority patch and was going to suggest we re-run the test with the priorities manually set back to the default of 0 (via the renice command). However, in the process I discovered a bug: we are incorrectly setting the NonIO and AuxIO thread priorities to lowest - see MB-37144.

            Once that bug is resolved we should re-run this test.
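
            For the record, a rough sketch (Python) of the manual workaround mentioned above - resetting memcached thread nice values to 0 with renice. It assumes a single memcached process per data node and needs root; on Linux, renice -p accepts a thread ID, so this acts per thread:

              # Sketch only: reset any memcached threads with a non-default nice value.
              import subprocess

              pid = subprocess.run(["pgrep", "-x", "memcached"],
                                   capture_output=True, text=True, check=True).stdout.split()[0]
              threads = subprocess.run(["ps", "-Lo", "tid,ni,comm", "-p", pid],
                                       capture_output=True, text=True, check=True).stdout

              for line in threads.splitlines()[1:]:   # skip the ps header row
                  tid, nice, _name = line.split(None, 2)
                  if nice != "0":
                      subprocess.run(["renice", "-n", "0", "-p", tid], check=True)
                      print("reset tid", tid, "from nice", nice)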

            drigby Dave Rigby added a comment -

            Ran build 6.5.0-4923 which includes the fix for MB-37144 (incorrect priorities for NonIO and AuxIO threads):
            http://perf.jenkins.couchbase.com/job/hebe/5375/

            Results in 135,367 ops/sec - greater than the 6.0.3 numbers. Marking as resolved.


            People

              korrigan.clark Korrigan Clark
              korrigan.clark Korrigan Clark
              Votes: 0
              Watchers: 5
