Couchbase Python Client Library
PYCBC-1037

SDK3 (patch set) kv and query latency ~30% higher than SDK2 with external libcouchbase installed


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version: 3.0.1
    • Sprint: SDK48: FTS Score/Incl, Docs., SDK51: FLEpt1, Txns Test Compl, SDK2: FLEpt2, TxnsTest, SDK4: 3.1GA,Ruby3.1,TxnTst,FLE

    Description

      Comparing SDK3 to SDK2 kv get/set latency, we see that across all latency percentiles, SDK3 is ~20-30% higher than SDK2. Here are the two tests for comparison:

      SDK 2.5.0 - 24 workers:

      http://perf.jenkins.couchbase.com/job/ares/17051/

      SDK 3 - 24 workers:

      http://perf.jenkins.couchbase.com/job/ares/17049/ 

      In both tests, kv ops are against default scope/collection.

      SDK3 is this patch: git+http://review.couchbase.org/couchbase-python-client@refs/changes/91/135991/3

      Here is a comparison graph:

      http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=ares_700-3090_access_c015&label=sdk2(24workers)&snapshot=ares_700-3090_access_ae94&label=sdk3(24workers)

      The only differences in these tests, other than the SDK version, are cur_connections to the bucket and ESTABLISHED connections to the master node 172.23.133.13. Perhaps these "extra" connections are causing the slowdown: the number of "extra" connections is around 30% higher than with SDK2, which corresponds to the increase in latency.
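
      For reference, here is a minimal sketch of the kind of kv get/set timing these runs measure, using the SDK3 API against the default collection. This is not the perf harness used above; the endpoint, bucket, and credentials are placeholders, and import paths can vary slightly between SDK 3 minor versions.

          # Minimal sketch, not the actual perf runner: time upsert+get pairs against
          # the default collection and print a few latency percentiles.
          import time

          from couchbase.cluster import Cluster, ClusterOptions
          from couchbase.auth import PasswordAuthenticator  # lives in couchbase_core.cluster on some 3.0.x builds

          cluster = Cluster("couchbase://127.0.0.1",  # placeholder host
                            ClusterOptions(PasswordAuthenticator("Administrator", "password")))
          collection = cluster.bucket("bucket-1").default_collection()  # placeholder bucket

          samples_ms = []
          for i in range(10000):
              key = "doc-%d" % i
              t0 = time.perf_counter()
              collection.upsert(key, {"field": "value"})
              collection.get(key)
              samples_ms.append((time.perf_counter() - t0) * 1000.0)

          samples_ms.sort()
          for p in (50, 90, 99, 99.9):
              idx = min(len(samples_ms) - 1, int(len(samples_ms) * p / 100.0))
              print("p%s: %.2f ms" % (p, samples_ms[idx]))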

      Workaround

      Remove any externally installed libcouchbase releases from the system. When run with the built-in libcouchbase, the higher latency is not observed.
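
      Before re-running, a quick environment sanity check (a sketch, not part of the ticket's tooling) is to print the installed couchbase package version and look for an external libcouchbase on the default library search path, which is the condition this workaround removes. The find_library call is only a heuristic for spotting a system-wide libcouchbase install.

          import ctypes.util
          from importlib.metadata import version  # Python 3.8+

          print("couchbase package:", version("couchbase"))

          external_lcb = ctypes.util.find_library("couchbase")
          if external_lcb:
              print("external libcouchbase found on the search path:", external_lcb)
          else:
              print("no external libcouchbase found; the SDK should fall back to its bundled build")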

      Attachments

        1. profile003.gif (185 kB)
        2. sdk2.kv.reads.repeated.perf.data (31.04 MB)
        3. sdk2.perf.data (34.09 MB)
        4. sdk3.kv.reads.repeated.3.perf.data (36.92 MB)
        5. sdk3.perf.data (32.77 MB)
        6. sdk3withlcb293.data (425 kB)
        7. sdk3withlcb306.data (150 kB)

        Activity

          korrigan.clark Korrigan Clark (Inactive) added a comment (edited)

          Raju Suravarjjala, David Kelly: I ran kv and n1ql tests with sdk 3.0.8 and no lcb preinstalled. Here are the results, with green being the old runs and orange being the new runs:

          90th percentile query latency (ms), Q1, Key-Value Lookup, 10K queries/sec, default collection

          Old: 1.2

          New: 1.2

          http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=iris_700-4226_access_cd44&snapshot=iris_700-4226_access_0cab

          90th percentile query latency (ms), Q1, Key-Value Lookup, 10K queries/sec, s=1 c=1

          Old: 1.2

          New: 1.2

          http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=iris_700-4226_access_534e&snapshot=iris_700-4226_access_38dd

          90th percentile query latency (ms), Q1, Key-Value Lookup, 10K queries/sec, s=1 c=1000

          Old: 1.2

          New: 1.2

          http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=iris_700-4226_access_cf7e&snapshot=iris_700-4226_access_90e9

          Avg. Query Throughput (queries/sec), Q1, Key-Value Lookup, s=1 c=1

          Old: 127,774

          New: 129,814

          http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=iris_700-4226_access_a7b4&snapshot=iris_700-4226_access_a29f

          Avg. Query Throughput (queries/sec), Q1, Key-Value Lookup, default collection

          Old: 127,943

          New: 130,978

          http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=iris_700-4226_access_8313&snapshot=iris_700-4226_access_16f4

          99.9th percentile GET/SET latency (ms), 4 nodes, 1 bucket x 20M x 1KB, 10K ops/sec, default scope/collection

          Old: Get - 0.85, Set - 0.98

          New: Get - 1.0, Set - 1.12

          http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=ares_700-4226_access_f0db&snapshot=ares_700-4226_access_e4da

          99.9th percentile GET/SET latency (ms), 4 nodes, 1 bucket x 20M x 1KB, 10K ops/sec, s=1 c=1

          Old: Get - 0.96, Set - 1.18

          New: Get - 0.91, Set - 1.12

          http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=ares_700-4226_access_2b50&snapshot=ares_700-4226_access_d04e

           

          There are some things to note:

          1 - n1ql latency remains about the same as before. Default, 1 collection, and 1000 collections all have the same latency. The main difference in the n1ql test is the number of connections in the TIME_WAIT state, which is significantly lower in the new runs. N1ql also sees a mild bump in throughput. The ops and CPU utilization are also smoother and more consistent throughout the tests.

          2 - KV default collection latency seems to have increased, for both get and set. KV 1 collection seems to have decreased latency, for both get and set. We do not see any difference in the number of connections in the TIME_WAIT state in the kv tests.

           

          Comparing SDK2 to SDK3 default collection:

          There is still a ~30% difference between sdk2 and sdk3 default-collection latency tests. However, the n1ql throughput numbers for sdk2 and sdk3 are now roughly the same. This could possibly be an effect of the new libevent IO in the server.
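
          For clarity, the two kv addressing modes compared here (default scope/collection vs. s=1 c=1) differ only in how the collection handle is obtained in the SDK3 API. A minimal sketch with placeholder connection details and names:

              from couchbase.cluster import Cluster, ClusterOptions
              from couchbase.auth import PasswordAuthenticator  # lives in couchbase_core.cluster on some 3.0.x builds

              cluster = Cluster("couchbase://127.0.0.1",  # placeholder host
                                ClusterOptions(PasswordAuthenticator("Administrator", "password")))
              bucket = cluster.bucket("bucket-1")  # placeholder bucket

              # default scope/collection case
              default_coll = bucket.default_collection()
              default_coll.upsert("k1", {"v": 1})

              # s=1 c=1 case: one named scope containing one named collection (placeholder names)
              named_coll = bucket.scope("scope-1").collection("collection-1")
              named_coll.upsert("k1", {"v": 1})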


          raju Raju Suravarjjala added a comment

          Thanks Korrigan Clark. Did you publish all the numbers into showfast? Are there any other bugs/issues you are tracking?

          korrigan.clark Korrigan Clark (Inactive) added a comment

          Raju Suravarjjala, yes, these results should be reflected on showfast now for the latest build, and going forward the n1ql and kv latency tests will all be clean with the 3.0.8 sdk version. I am in the process of doing some retuning for these tests, as the performance has changed slightly, so there may be some quirks for a test or two. I am in the process of analyzing the rest of the tests and will open a bug if I see any issues. The latency increase is still present, but there appears to be increased throughput as well, so this issue may come down to tradeoffs.

          ingenthr Matt Ingenthron added a comment

          Nice work Korrigan Clark and David Kelly.

          One thing I would note from looking at the results you linked to: through the 99th percentile, the results are exactly the same. For gets/sets, at least as I read it, above the 99th percentile it's not consistently one test that has the lower latency between sets and gets. It's just a guess, but multiple runs might show things flipping in these upper percentiles. Looking at the scatterplot seems to confirm that. If you were to update that chart to plot only the points beyond 1σ, we could probably see whether it's noise or an actual difference. Then again, we'd expect higher latency as throughput increases, and the fact that we don't see that seems to mean we've made things better.
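
          A small sketch of the check suggested above, using placeholder samples rather than the exported perf data: compare the upper percentiles of two runs and count the points beyond one standard deviation to get a feel for whether the tail difference is noise.

              import numpy as np

              def summarize(samples_ms):
                  samples_ms = np.asarray(samples_ms, dtype=float)
                  cutoff = samples_ms.mean() + samples_ms.std()  # one sigma above the mean
                  tail = samples_ms[samples_ms > cutoff]
                  pcts = {p: np.percentile(samples_ms, p) for p in (99.0, 99.9, 99.99)}
                  return pcts, tail.size

              # Placeholder latency samples; in practice these would come from the attached perf.data exports.
              rng = np.random.default_rng(0)
              sdk2_ms = rng.lognormal(mean=-0.2, sigma=0.4, size=100000)
              sdk3_ms = rng.lognormal(mean=-0.2, sigma=0.4, size=100000)

              for name, run in (("sdk2", sdk2_ms), ("sdk3", sdk3_ms)):
                  pcts, n_tail = summarize(run)
                  print(name,
                        {("p%g" % p): round(float(v), 3) for p, v in pcts.items()},
                        "points beyond 1 sigma:", n_tail)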

          korrigan.clark Korrigan Clark (Inactive) added a comment

          Matt Ingenthron, David Kelly: from our side, it looks like this bug is resolved. If you have anything else in mind, let me know and we can keep it open; otherwise I think we should close. Will leave it up to David.

          People

            david.kelly David Kelly
            korrigan.clark Korrigan Clark (Inactive)
            Votes: 0
            Watchers: 9


              Gerrit Reviews

                There are no open Gerrit changes
