Java Couchbase JVM Core
JVMCBC-674

need feedback on durability performance


Details

    Description

      Hey Michael,

      Can you take a look at the performance matrix I compiled for durability? Here is the sheet: https://docs.google.com/spreadsheets/d/1B8v4OZneOeGxJwUj226zA3YDr0Y0gjRSVLwy0IAP9qw/edit?usp=sharing . The SDK2 and SDK3 columns are for the old durability params (replicateTo and persistTo), and SDK3 New is for the new durability levels. There are two results that confuse me, and I need some input to make sure I did the testing correctly.

      First: for SDK3 New, all durability levels except durabilityLevel=None have the same performance. It does not make sense to me why majority and persistMajority would perform the same. The performance impact is also severe, dropping from 387k to 1k ops/s going from None to majority, a >99% drop.

      Second: SDK3 with replicateTo=1, persistTo=0 performs significantly slower than replicateTo=1, persistTo=1 and replicateTo=1, persistTo=2, which implies that adding persistTo increases performance; that doesn't really make sense.
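      (For reference, the two styles being compared map roughly onto the following SDK 3 calls. This is a minimal sketch rather than the YCSB code itself; the connection details are placeholders, and the exact option/enum names may differ slightly between the alpha builds used here and later releases, e.g. durabilityLevel(...) vs durability(...).)

          import java.time.Duration;
          import com.couchbase.client.core.msg.kv.DurabilityLevel;
          import com.couchbase.client.java.Cluster;
          import com.couchbase.client.java.Collection;
          import com.couchbase.client.java.kv.InsertOptions;
          import com.couchbase.client.java.kv.PersistTo;
          import com.couchbase.client.java.kv.ReplicateTo;

          public class DurabilityStyles {
            public static void main(String[] args) {
              // Placeholder connection details for illustration only.
              Cluster cluster = Cluster.connect("127.0.0.1", "Administrator", "password");
              Collection collection = cluster.bucket("default").defaultCollection();

              // Old-style ("observe"-based) durability: the SDK polls replicas after the mutation.
              collection.insert("key-old", "foobar", InsertOptions.insertOptions()
                  .timeout(Duration.ofSeconds(10))
                  .durability(PersistTo.NONE, ReplicateTo.ONE));

              // New-style durability levels: the server acknowledges the mutation only once
              // the requested level (e.g. majority, persistMajority) has been met.
              collection.insert("key-new", "foobar", InsertOptions.insertOptions()
                  .timeout(Duration.ofSeconds(10))
                  .durability(DurabilityLevel.MAJORITY));
            }
          }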

      Here is the YCSB code I am using for the tests; I created a branch called couchbase3-new-durability based on the couchbase3 branch: https://github.com/couchbaselabs/YCSB/blob/couchbase3-new-durability/couchbase3/src/main/java/com/yahoo/ycsb/db/couchbase3/Couchbase3Client.java

       

      Here is the set of test files I am using: https://github.com/couchbase/perfrunner/tree/master/tests/durability

       


          Activity

            korrigan.clark Korrigan Clark created issue -

            daschl Michael Nitschinger added a comment -

            Hi,

            some observations:

            • It's great to see that SDK 3 using the old observe approach is actually outperforming SDK 2 across the board. The gap is even bigger for the non-durability case.
            • I think you need to rerun SDK 3 DL1, since it does not make sense that this one is slower than all the others, given that even replicate WITH persistence is faster. Likely an environmental issue? Keep in mind the SDK does not do anything differently between replicateTo and persistTo polling.
            • SDK 3 "new" durability: the slow performance is very likely not because of the SDK. The "new" approach just sends different headers, so all of the slowdown is likely coming from the server side. I do have a theory why it is SO much slower: head-of-line blocking. With the old polling we can use the sockets continuously, since polling does not interfere with regular ops. Because we do not have support for "out of order" operations, AND the client is configured to use only one socket, it is pretty much a serialized system.

            So one thing I'd ask you to try is kvEndpoints 2, 4, 8, and 16 and see how the numbers change. If they go up significantly (or linearly with the number of sockets) this is very likely the head of line blocking issue since we do not have async ops on the kv layer yet.
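            (As a sketch, raising the number of KV sockets per node in the alpha builds looks roughly like the environment setup Michael posts further down in this ticket; only the endpoints(n) value changes per run. Host and credentials here are placeholders.)

                    ClusterEnvironment env = ClusterEnvironment.builder(
                        "10.143.193.101", "Administrator", "password")
                        // e.g. 2, 4, 8 or 16 KV sockets per node, as suggested above
                        .serviceConfig(ServiceConfig.keyValueServiceConfig(KeyValueServiceConfig.endpoints(4)))
                        .build();
                    Cluster cluster = Cluster.connect(env);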

            daschl Michael Nitschinger made changes -
            Field Original Value New Value
            Assignee Michael Nitschinger [ daschl ] Korrigan Clark [ korrigan.clark ]
            korrigan.clark Korrigan Clark made changes -
            Link This issue relates to MB-34261 [ MB-34261 ]
            korrigan.clark Korrigan Clark added a comment - Michael Nitschinger I tried your suggestion, but it looks like it hasn't been implemented yet in SDK 3: https://github.com/couchbase/couchbase-jvm-clients/blob/master/core-io/src/main/java/com/couchbase/client/core/service/strategy/PartitionSelectionStrategy.java#L36

            daschl Michael Nitschinger added a comment - Korrigan Clark I've implemented the change and it is currently under review. Once merged we can work out a strategy on how you can use the new jars.

            korrigan.clark Korrigan Clark added a comment - Michael Nitschinger Sounds good, let me know when it's ready. Thanks.
            ritam.sharma Ritam Sharma made changes -
            Labels 6.5mustpass

            daschl Michael Nitschinger added a comment - Korrigan Clark uploaded 3.0.0-alpha.4, which should unblock the kv endpoints testing!
            daschl Michael Nitschinger made changes -
            Status New [ 10003 ] Open [ 1 ]
            daschl Michael Nitschinger made changes -
            Fix Version/s 2.0.0-alpha.5 [ 16203 ]

            daschl Michael Nitschinger added a comment - Assigned to the next alpha just to properly track it.

            korrigan.clark Korrigan Clark added a comment - Michael Nitschinger thanks, will run some tests today. I'll keep you updated.

            daschl Michael Nitschinger added a comment - Korrigan Clark can you explain the numbers a little more please? I don't understand how d=0 (no durability at all) only has 199 ops/s here. In the Google sheet we are talking about 380k ops/s for no durability, and the others are also in the 1k range, not the 100 range. (I would've expected the kvEndpoints=1 numbers to be in the same ballpark as the numbers from the sheet?)

            korrigan.clark Korrigan Clark added a comment - Michael Nitschinger that is because I ran a different test...  However, let me run a different test as I think I might have messed up.

            daschl Michael Nitschinger added a comment - Korrigan Clark thanks! Not to be overly scientific but let's only change one variable at a time to get a good handle on the deltas.
            korrigan.clark Korrigan Clark made changes -
            Comment [ [~daschl], I ran some tests... Looks like kvEndpoints doesn't have the effect we thought it might. The results show that increasing KV endpoints does not increase throughput by a corresponding amount:

            kvendpoints=1
            d=0 [OVERALL], Throughput(ops/sec), 107.38831615120274
            d=1 [OVERALL], Throughput(ops/sec), 87.04659604286175
            d=2 [OVERALL], Throughput(ops/sec), 91.9303535641398
            d=3 [OVERALL], Throughput(ops/sec), 96.49901571003976

            kvendpoints=2
            d=0 [OVERALL], Throughput(ops/sec), 118.2941976696043
            d=1 [OVERALL], Throughput(ops/sec), 100.47423840527289
            d=2 [OVERALL], Throughput(ops/sec), 91.69096477233134
            d=3 [OVERALL], Throughput(ops/sec), 87.89199831247363

            kvendpoints=16
            d=0 [OVERALL], Throughput(ops/sec), 119.98752129778504
            d=1 [OVERALL], Throughput(ops/sec), 96.25380202518
            d=2 [OVERALL], Throughput(ops/sec), 81.82971236856102
            d=3 [OVERALL], Throughput(ops/sec), 95.53835865099838 ]

            korrigan.clark Korrigan Clark added a comment - Michael Nitschinger OK, I deleted the results from yesterday and updated the table with preliminary results for kvEndpoints=2... at this level it looks like performance is unchanged. I have queued up the full matrix of tests for kvEndpoints=2,4,16. Will update the numbers tomorrow morning.

            daschl Michael Nitschinger added a comment -

            I've run local experiments, and ReplicateTo.ONE, PersistTo.NONE is twice as fast as ReplicateTo.ONE, PersistTo.ONE in my vagrant setup, so I could not replicate your YCSB finding there.

            • With alpha.4, are you running against a cluster with developer mode enabled (if so, please disable)?

            To further see what's going on, I think you should try to replicate this in a local setup against the cluster. For example, run code like so:

                    Cluster cluster = Cluster.connect(ClusterEnvironment.builder(
                        "10.143.193.101", "Administrator", "password")
                        .serviceConfig(ServiceConfig.keyValueServiceConfig(KeyValueServiceConfig.endpoints(4)))
                        .build());
                    Bucket bucket = cluster.bucket("default");
                    Collection collection = bucket.defaultCollection();

                    while (true) {
                        for (int i = 0; i < Integer.MAX_VALUE; i++) {
                            collection.insert("key-" + i, "foobar", InsertOptions
                                .insertOptions()
                                .timeout(Duration.ofSeconds(10))
                                .durability(PersistTo.NONE, ReplicateTo.ONE)
                                // .durabilityLevel(DurabilityLevel.PERSIST_TO_MAJORITY)
                            );
                        }
                    }

            and see if the behavior you see is the same.

            Also, I would strongly recommend testing the same with golang or libcouchbase-based variants to double check it's actually something on the client and not the server.


            korrigan.clark Korrigan Clark added a comment - Michael Nitschinger, I can reproduce your findings only when I use 1 YCSB client with a single thread. However, all of the perf tests use 4 YCSB clients with 25 threads each.

            daschl Michael Nitschinger added a comment - Korrigan Clark can you try with 1 YCSB client and 25 threads? I'm curious at which point it switches... is it the number of YCSB clients or the number of threads?
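            (A quick way to approximate the "1 client, N threads" case outside of YCSB is to share a single Cluster/Collection across a thread pool. The sketch below reuses the collection and insert options from Michael's snippet above; the thread count and iteration count are placeholders, and the java.util.concurrent imports are omitted to match the style of that snippet.)

                    ExecutorService pool = Executors.newFixedThreadPool(25);
                    for (int t = 0; t < 25; t++) {
                        final int threadId = t;
                        pool.submit(() -> {
                            for (int i = 0; i < 100_000; i++) {
                                // Same insert as above, but issued from 25 threads sharing one SDK instance.
                                collection.insert("key-" + threadId + "-" + i, "foobar", InsertOptions
                                    .insertOptions()
                                    .timeout(Duration.ofSeconds(10))
                                    .durability(PersistTo.NONE, ReplicateTo.ONE));
                            }
                        });
                    }
                    pool.shutdown();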

            korrigan.clark Korrigan Clark added a comment -

            Michael Nitschinger we just talked, but for bookkeeping, here is the spreadsheet with the data relating to the number of threads: https://docs.google.com/spreadsheets/d/1B8v4OZneOeGxJwUj226zA3YDr0Y0gjRSVLwy0IAP9qw/edit?usp=sharing

            If you go to the threads tab at the bottom you should see the different runs.

            wayne Wayne Siu made changes -
            Link This issue relates to MB-34631 [ MB-34631 ]
            dfinlay Dave Finlay made changes -
            Labels 6.5mustpass 6.5mustpass durability

            daschl Michael Nitschinger added a comment - Korrigan Clark can we close this out?
            daschl Michael Nitschinger made changes -
            Fix Version/s 2.0.0-alpha.6 [ 16233 ]
            Fix Version/s 2.0.0-alpha.5 [ 16203 ]
            korrigan.clark Korrigan Clark made changes -
            Resolution Fixed [ 1 ]
            Status Open [ 1 ] Resolved [ 5 ]
            korrigan.clark Korrigan Clark made changes -
            Actual End 2019-07-08 11:24 (issue has been resolved)
            korrigan.clark Korrigan Clark made changes -
            Status Resolved [ 5 ] Closed [ 6 ]

            People

              Assignee: korrigan.clark Korrigan Clark
              Reporter: korrigan.clark Korrigan Clark
              Votes: 0
              Watchers: 7
