Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-50520

Eventing throughput test regressed in build 7.1.0-1264

    XMLWordPrintable

Details

    • Untriaged
    • 1
    • Yes

    Attachments

      1. MB-50520_b1263_ops.png
        MB-50520_b1263_ops.png
        62 kB
      2. MB-50520_b2182_ops.png
        MB-50520_b2182_ops.png
        60 kB
      3. MB-50520_del.png
        MB-50520_del.png
        62 kB
      4. MB-50520_get.png
        MB-50520_get.png
        63 kB
      5. MB-50520_rr.png
        MB-50520_rr.png
        49 kB
      6. Screenshot 2022-01-21 at 2.33.49 PM.png
        Screenshot 2022-01-21 at 2.33.49 PM.png
        135 kB
      7. Screenshot 2022-01-21 at 2.45.38 PM.png
        Screenshot 2022-01-21 at 2.45.38 PM.png
        139 kB
      8. Screenshot 2022-01-21 at 2.46.43 PM.png
        Screenshot 2022-01-21 at 2.46.43 PM.png
        379 kB
      9. Screenshot 2022-01-21 at 2.47.45 PM.png
        Screenshot 2022-01-21 at 2.47.45 PM.png
        127 kB

      Issue Links

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          vikas.chaudhary Vikas Chaudhary created issue -
          vikas.chaudhary Vikas Chaudhary made changes -
          Field Original Value New Value
          Assignee Jeelan Poola [ jeelan.poola ] Ankit Prabhu [ ankit.prabhu ]
          vikas.chaudhary Vikas Chaudhary made changes -
          vikas.chaudhary Vikas Chaudhary made changes -
          vikas.chaudhary Vikas Chaudhary made changes -

          another tests who regressed 

          vikas.chaudhary Vikas Chaudhary added a comment - another tests who regressed 
          vikas.chaudhary Vikas Chaudhary made changes -
          Summary Eventing throughput test regressed between 7.0.3 and 7.1.0 Eventing throughput test regressed
          vikas.chaudhary Vikas Chaudhary made changes -
          Labels performance
          ankit.prabhu Ankit Prabhu added a comment -

          Resolving it as duplicate of https://issues.couchbase.com/browse/MB-47908

          ankit.prabhu Ankit Prabhu added a comment - Resolving it as duplicate of https://issues.couchbase.com/browse/MB-47908
          ankit.prabhu Ankit Prabhu made changes -
          Resolution Duplicate [ 3 ]
          Status Open [ 1 ] Resolved [ 5 ]
          wayne Wayne Siu made changes -
          Link This issue duplicates MB-47908 [ MB-47908 ]
          vikas.chaudhary Vikas Chaudhary added a comment - - edited Ankit Prabhu culprit build is 7.1.0-1264 which is different from   MB-47908 . Can you take a look again? 7.1.0-1263 7.1.0-1264 70,577 59,684 http://perf.jenkins.couchbase.com/job/themis/15000/console   http://perf.jenkins.couchbase.com/job/themis/15131/     Cbmonitor: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=themis_710-1264_process_timer_events_6012&snapshot=themis_710-1263_process_timer_events_ae4b  
          vikas.chaudhary Vikas Chaudhary made changes -
          Resolution Duplicate [ 3 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          ankit.prabhu Ankit Prabhu added a comment -
          ankit.prabhu Ankit Prabhu added a comment - There is no changes in eventing between 1263 and 1264. http://changelog.build.couchbase.com/?product=couchbase-server&fromVersion=7.1.0&fromBuild=1263&toVersion=7.1.0&toBuild=1264&f_asterixdb=off&f_backup=off&f_build=on&f_cbas-core=off&f_cbbs=on&f_cbgt=off&f_chronicle=on&f_couchbase-cli=off&f_couchdb=off&f_eventing=on&f_eventing-ee=off&f_goxdcr=off&f_indexing=off&f_kv_engine=on&f_libcouchbase=off&f_magma=on&f_n1fty=on&f_ns_server=off&f_product-metadata=off&f_query=off&f_testrunner=off&f_tlm=on&f_voltron=on&f_forestdb=off Isolating it further, it looks like performance degraded after https://github.com/couchbase/kv_engine/commit/680d905e8d74a123fcc5d911a1b1a42e99c76fb0 kv at https://github.com/couchbase/kv_engine/commit/680d905e8d74a123fcc5d911a1b1a42e99c76fb0 build 7.1.0-14356 http://perf.jenkins.couchbase.com/job/themis-temp/288/console kv at https://github.com/couchbase/kv_engine/commit/0f72689e53098ad58413dae3233f01327f808efe build 7.1.0-14363 http://perf.jenkins.couchbase.com/job/themis-temp/287/console There is a decrease in vb_active_resident_items_ratio in the bad build. http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=themis_710-14363_process_timer_events_d71a&snapshot=themis_710-14356_process_timer_events_ede7#d056096c18cf84643dc7317148ce4c77 Eventing is doing set, get and delete operation on "eventing._default._default" collection and reading from dcp on "bucket-1._default._default". Assigning to kv team for further investigation.
          ankit.prabhu Ankit Prabhu made changes -
          Assignee Ankit Prabhu [ ankit.prabhu ] Daniel Owen [ owend ]
          owend Daniel Owen made changes -
          Summary Eventing throughput test regressed Eventing throughput test regressed in build 7.1.0-1264
          owend Daniel Owen made changes -
          Component/s couchbase-bucket [ 10173 ]
          paolo.cocchi Paolo Cocchi made changes -
          Assignee Daniel Owen [ owend ] Paolo Cocchi [ paolo.cocchi ]
          owend Daniel Owen made changes -
          Sprint KV 2022-Feb [ 2002 ]
          owend Daniel Owen made changes -
          Rank Ranked higher

          Investigating the KV changes introduced in 1264.
          Note that more recent Neo builds have stabilized at ~ 64000 (7.1.0-2182).

          paolo.cocchi Paolo Cocchi added a comment - 7.1.0-1263: 70577 7.1.0-1264: 59684 toy-build (baseline 1264 + KV->1263, http://perf.jenkins.couchbase.com/job/themis/15236): 70425 Investigating the KV changes introduced in 1264. Note that more recent Neo builds have stabilized at ~ 64000 (7.1.0-2182).
          paolo.cocchi Paolo Cocchi made changes -
          Attachment MB-50520_rr.png [ 176809 ]
          paolo.cocchi Paolo Cocchi made changes -
          Attachment MB-50520_get.png [ 176810 ]
          paolo.cocchi Paolo Cocchi added a comment - - edited

          Ankit Prabhu
          Thanks for the details.

          Eventing is doing set, get and delete operation on "eventing._default._default" collection

          So I assume that in that test you're measuring the performance of those operations. I also assume that set/get/delete ops are interleaved and issued on the same MCBP connections to memcached. Is that correct? Could you confirm on that please?

          If that's confirmed, then the lower Resident Ratio at 1264 can explain the throughput degradation, as some operation's latency (eg get/delete) might increase due to higher cache-miss likelihood.

          Patch https://github.com/couchbase/kv_engine/commit/680d905e8d74a123fcc5d911a1b1a42e99c76fb0 is expected to lower the Resident Ratio indeed.
          Point is, that patch is in the middle of Neo development where the Checkpoint Mem Recovery behaviour has drastically changed, so we should really look at the latest Neo builds (ie, dev completed) with regard to that.

          7.1.0-2182 is a good example. The new Checkpoint Quota (30% of the bucket quota by default) is expected to bring improvements in the RR, and in fact (http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=themis_710-2182_process_timer_events_e3e4&snapshot=themis_710-1263_process_timer_events_ae4b&snapshot=themis_710-1264_process_timer_events_6012#d056096c18cf84643dc7317148ce4c77):

          So now the next question. If RR is the cause of the temporary degradation in 1264, why the RR improvements in 2182 don't push the performance at least back to the 1263 levels?
          Not sure yet, but there seems to be a test behaviour change on the latest runs, eg CMD_GET:

          • 1263 -> Flat line at ~ 14000, that seems to be the Eventing target
          • 1264 -> Eventing seems to try to push at 14000 but we have drops (probably due to lower RR)
          • 2182 -> Flat line at ~ 12000 - Eventing seems to have lowered the test target here

           Same pattern for CMD_DEL:

          paolo.cocchi Paolo Cocchi added a comment - - edited Ankit Prabhu Thanks for the details. Eventing is doing set, get and delete operation on "eventing._default._default" collection So I assume that in that test you're measuring the performance of those operations. I also assume that set/get/delete ops are interleaved and issued on the same MCBP connections to memcached. Is that correct? Could you confirm on that please? If that's confirmed, then the lower Resident Ratio at 1264 can explain the throughput degradation, as some operation's latency (eg get/delete) might increase due to higher cache-miss likelihood. Patch https://github.com/couchbase/kv_engine/commit/680d905e8d74a123fcc5d911a1b1a42e99c76fb0 is expected to lower the Resident Ratio indeed. Point is, that patch is in the middle of Neo development where the Checkpoint Mem Recovery behaviour has drastically changed, so we should really look at the latest Neo builds (ie, dev completed) with regard to that. 7.1.0-2182 is a good example. The new Checkpoint Quota (30% of the bucket quota by default) is expected to bring improvements in the RR, and in fact ( http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=themis_710-2182_process_timer_events_e3e4&snapshot=themis_710-1263_process_timer_events_ae4b&snapshot=themis_710-1264_process_timer_events_6012#d056096c18cf84643dc7317148ce4c77): So now the next question. If RR is the cause of the temporary degradation in 1264, why the RR improvements in 2182 don't push the performance at least back to the 1263 levels? Not sure yet, but there seems to be a test behaviour change on the latest runs, eg CMD_GET: 1263 -> Flat line at ~ 14000, that seems to be the Eventing target 1264 -> Eventing seems to try to push at 14000 but we have drops (probably due to lower RR) 2182 -> Flat line at ~ 12000 - Eventing seems to have lowered the test target here  Same pattern for CMD_DEL:
          paolo.cocchi Paolo Cocchi made changes -
          Attachment MB-50520_del.png [ 176811 ]
          ankit.prabhu Ankit Prabhu added a comment -

          So I assume that in that test you're measuring the performance of those operations. I also assume that set/get/delete ops are interleaved and issued on the same MCBP connections to memcached. Is that correct? Could you confirm on that please?

          Yes, Eventing performs set/get/delete operation from the same connection.
          Since this is a timer operation all these operations are done on "eventing._default._default" collection and dcp streams are opened on "bucket-1._default._default" collection.

          ankit.prabhu Ankit Prabhu added a comment - So I assume that in that test you're measuring the performance of those operations. I also assume that set/get/delete ops are interleaved and issued on the same MCBP connections to memcached. Is that correct? Could you confirm on that please? Yes, Eventing performs set/get/delete operation from the same connection. Since this is a timer operation all these operations are done on "eventing._default._default" collection and dcp streams are opened on "bucket-1._default._default" collection.
          paolo.cocchi Paolo Cocchi added a comment - - edited

          Ankit Prabhu, Vikas Chaudhary
          Could you check out what's the test target in the current env?
          As per my last comment, Eventing seems to be pushing KV only up to ~ 12000ops/s on recent runs. For comparing with older runs, we should reset the target to to 14000ops/s.
          Thanks

          paolo.cocchi Paolo Cocchi added a comment - - edited Ankit Prabhu , Vikas Chaudhary Could you check out what's the test target in the current env? As per my last comment, Eventing seems to be pushing KV only up to ~ 12000ops/s on recent runs. For comparing with older runs, we should reset the target to to 14000ops/s. Thanks
          paolo.cocchi Paolo Cocchi made changes -
          Assignee Paolo Cocchi [ paolo.cocchi ] Vikas Chaudhary [ vikas.chaudhary ]
          owend Daniel Owen made changes -
          Sprint KV 2022-Feb [ 2002 ]
          owend Daniel Owen made changes -
          Rank Ranked higher

          Paolo Cocchi We run load with maximum throughput for KV to measure eventing DCP consumption throughput. We are not controlling KV throughput in two build. It could KV which is restricting throughput

          vikas.chaudhary Vikas Chaudhary added a comment - Paolo Cocchi We run load with maximum throughput for KV to measure eventing DCP consumption throughput. We are not controlling KV throughput in two build. It could KV which is restricting throughput
          vikas.chaudhary Vikas Chaudhary made changes -
          Assignee Vikas Chaudhary [ vikas.chaudhary ] Ankit Prabhu [ ankit.prabhu ]
          paolo.cocchi Paolo Cocchi added a comment - - edited

          Vikas Chaudhary

          KV stats show that we are literally seeing less operations on the frontend in latest runs.
          These are stats for bucket "eventing" that is the one that you load with SET/GET/DEL:

          Old build 1263 run (http://perf.jenkins.couchbase.com/job/themis/15000/):

           cas_hits:                                              27547
           cas_misses:                                            0
           decr_hits:                                             0
           decr_misses:                                           0
           delete_hits:                                           29335115
           delete_misses:                                         0
           get_hits:                                              29417254
           get_misses:                                            59718
           incr_hits:                                             24951666
           incr_misses:                                           0
          

          New build 2182 run (http://perf.jenkins.couchbase.com/job/themis/15132/):

           cas_hits:                                                  27076
           cas_misses:                                                0
           decr_hits:                                                 0
           decr_misses:                                               0
           delete_hits:                                               26601871
           delete_misses:                                             0
           get_hits:                                                  26683508
           get_misses:                                                59509
           incr_hits:                                                 25121389
           incr_misses:                                               0
          

          Note: get_misses have slightly lowered, get_hits/delete_hits have lowered too.

          Ops rate shows the same on charts, note no operation rejected by TempOOM/OOM:

          1263

          2182

          Plus, histograms for latencies for SET/GET/DEL show approximately the same distribution on the 2 runs.

          Vikas Chaudhary
          I do see that there's no test configuration parameter for setting the test target frontend throughput, and as you say you "load KV with maximum throughput". I would suggest to check out your changes between 1263 and 2182, there might be some change that lowers the frontend throughput that comes into KV.

          Thanks

          paolo.cocchi Paolo Cocchi added a comment - - edited Vikas Chaudhary KV stats show that we are literally seeing less operations on the frontend in latest runs. These are stats for bucket "eventing" that is the one that you load with SET/GET/DEL: Old build 1263 run ( http://perf.jenkins.couchbase.com/job/themis/15000/): cas_hits: 27547 cas_misses: 0 decr_hits: 0 decr_misses: 0 delete_hits: 29335115 delete_misses: 0 get_hits: 29417254 get_misses: 59718 incr_hits: 24951666 incr_misses: 0 New build 2182 run ( http://perf.jenkins.couchbase.com/job/themis/15132/): cas_hits: 27076 cas_misses: 0 decr_hits: 0 decr_misses: 0 delete_hits: 26601871 delete_misses: 0 get_hits: 26683508 get_misses: 59509 incr_hits: 25121389 incr_misses: 0 Note: get_misses have slightly lowered, get_hits/delete_hits have lowered too. Ops rate shows the same on charts, note no operation rejected by TempOOM/OOM: 1263 2182 Plus, histograms for latencies for SET/GET/DEL show approximately the same distribution on the 2 runs. Vikas Chaudhary I do see that there's no test configuration parameter for setting the test target frontend throughput, and as you say you "load KV with maximum throughput". I would suggest to check out your changes between 1263 and 2182, there might be some change that lowers the frontend throughput that comes into KV. Thanks
          paolo.cocchi Paolo Cocchi made changes -
          Attachment MB-50520_b2182_ops.png [ 177296 ]
          Attachment MB-50520_b1263_ops.png [ 177295 ]
          paolo.cocchi Paolo Cocchi made changes -
          Component/s couchbase-bucket [ 10173 ]
          vikas.chaudhary Vikas Chaudhary added a comment - - edited

          Paolo Cocchi  Thank you for all the analysis. We ran these tests with the same perfrunner code and ran them on the same environment. The only change is build.

          I re-ran the test on 1264 with more quota to the metadata bucket, throughput came back to the original number and ops are 14K. 

          Test Run: http://perf.jenkins.couchbase.com/job/themis/15265/console 

          cbmonitor : http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=themis_710-1264_process_timer_events_8c81 

           

          The same test on 2223 (with increased quota), seeing regression again. Another interesting part is bucket CPU utilisation is very low as compared to 1264.

          http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=themis_710-1264_process_timer_events_8c81&snapshot=themis_710-2223_process_timer_events_5953#a12f6b20223be1c7cc5431f85ddf1618 

          1264 -> http://perf.jenkins.couchbase.com/job/themis/15270/console 

          We are using following SDK for data loading 

          libcouchbase = 2.9.3
          python_client = 2.5.0

          vikas.chaudhary Vikas Chaudhary added a comment - - edited Paolo Cocchi   Thank you for all the analysis. We ran these tests with the same perfrunner code and ran them on the same environment. The only change is build. I re-ran the test on 1264 with more quota to the metadata bucket, throughput came back to the original number and ops are 14K.  Test Run: http://perf.jenkins.couchbase.com/job/themis/15265/console   cbmonitor : http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=themis_710-1264_process_timer_events_8c81     The same test on 2223 (with increased quota), seeing regression again. Another interesting part is bucket CPU utilisation is very low as compared to 1264. http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=themis_710-1264_process_timer_events_8c81&snapshot=themis_710-2223_process_timer_events_5953#a12f6b20223be1c7cc5431f85ddf1618   1264 -> http://perf.jenkins.couchbase.com/job/themis/15270/console   We are using following SDK for data loading  libcouchbase = 2.9.3 python_client = 2.5.0
          paolo.cocchi Paolo Cocchi added a comment -

          As per offline discussion:

          Looks like eventing not pushing documents to "eventing" collection since as you mentioned set operation is less compared to older versions

          Please let me know if you need anything else from the KV side, thanks.

          paolo.cocchi Paolo Cocchi added a comment - As per offline discussion: Looks like eventing not pushing documents to "eventing" collection since as you mentioned set operation is less compared to older versions Please let me know if you need anything else from the KV side, thanks.
          chanabasappa.ghali Chanabasappa Ghali made changes -
          Due Date 01/Mar/22
          ankit.prabhu Ankit Prabhu added a comment -

          Firing a timer depends on 'fuzz' variable which can spread across 500 seconds from starting the test between runs. 

          function OnUpdate(doc, meta) {\n    var fuzz = Math.floor(Math.random() * 500);\n    var fireAt = new Date(1646032251000 + (fuzz * 1000));\n    createTimer(timerCallback, fireAt, meta.id);\n}\n\nfunction timerCallback() {\n}\n

          Vikas Chaudhary, Could you please make this deterministic so that it can be easy to analyse?

          ankit.prabhu Ankit Prabhu added a comment - Firing a timer depends on 'fuzz' variable which can spread across 500 seconds from starting the test between runs.  function OnUpdate(doc, meta) {\n var fuzz = Math.floor(Math.random() * 500);\n var fireAt = new Date(1646032251000 + (fuzz * 1000));\n createTimer(timerCallback, fireAt, meta.id);\n}\n\nfunction timerCallback() {\n}\n Vikas Chaudhary , Could you please make this deterministic so that it can be easy to analyse?
          ankit.prabhu Ankit Prabhu made changes -
          Assignee Ankit Prabhu [ ankit.prabhu ] Vikas Chaudhary [ vikas.chaudhary ]

          Ankit Prabhu that will be a new test / we will not be able to compare 1:1

          vikas.chaudhary Vikas Chaudhary added a comment - Ankit Prabhu that will be a new test / we will not be able to compare 1:1
          vikas.chaudhary Vikas Chaudhary added a comment - - edited

          Ankit Prabhu  Please find the run with 60 sec of fuzz 

          http://perf.jenkins.couchbase.com/job/themis/15806/console - 64366.0

          Run on good build 

          http://perf.jenkins.couchbase.com/job/themis/15807/console – 71627.0

          vikas.chaudhary Vikas Chaudhary added a comment - - edited Ankit Prabhu   Please find the run with 60 sec of fuzz  http://perf.jenkins.couchbase.com/job/themis/15806/console - 64366.0 Run on good build  http://perf.jenkins.couchbase.com/job/themis/15807/console – 71627.0
          vikas.chaudhary Vikas Chaudhary made changes -
          Assignee Vikas Chaudhary [ vikas.chaudhary ] Ankit Prabhu [ ankit.prabhu ]
          jeelan.poola Jeelan Poola made changes -
          Labels performance hard_bug performance
          jeelan.poola Jeelan Poola added a comment -

          Vikas Chaudhary Could you please also run with a constant 30secs for timer on both good & suspect builds?

          jeelan.poola Jeelan Poola added a comment - Vikas Chaudhary Could you please also run with a constant 30secs for timer on both good & suspect builds?
          vikas.chaudhary Vikas Chaudhary added a comment - - edited Jeelan Poola Please find the runs http://perf.jenkins.couchbase.com/job/themis/15819/console  - 64650.0 http://perf.jenkins.couchbase.com/job/themis/15818/console  - 71446.0
          jeelan.poola Jeelan Poola made changes -
          Due Date 01/Mar/22 08/Mar/22

          Discussed during scrum today, moving it to 7.1.1

          chanabasappa.ghali Chanabasappa Ghali added a comment - Discussed during scrum today, moving it to 7.1.1
          chanabasappa.ghali Chanabasappa Ghali made changes -
          Fix Version/s Neo [ 17615 ]
          Fix Version/s 7.1.1 [ 18320 ]
          jeelan.poola Jeelan Poola made changes -
          Fix Version/s 7.1.1 [ 18320 ]
          Fix Version/s Morpheus [ 17651 ]
          jeelan.poola Jeelan Poola made changes -
          Fix Version/s 7.1.1 [ 18320 ]
          jeelan.poola Jeelan Poola made changes -
          Link This issue is cloned by MB-52380 [ MB-52380 ]
          jeelan.poola Jeelan Poola made changes -
          Fix Version/s 7.1.1 [ 18320 ]

          People

            ankit.prabhu Ankit Prabhu
            vikas.chaudhary Vikas Chaudhary
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

              Created:
              Updated:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty