Couchbase Server / MB-49612

Performance: Timer undeployment with TLS vs Non-TLS shows ~25% difference


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not a Bug
    • Neo
    • None
    • Component: couchbase-bucket

    Description

      This Jira is to investigate TLS vs non-TLS undeployment time. The non-TLS time has decreased over recent builds, but TLS undeployment has not improved correspondingly.

                  7.1.0-1650   7.1.0-1695
      TLS               21.6         21.6
      Non-TLS           16.4         16.9

      Cbmonitor TLS/Non-TLS : http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=themis_710-1695_load_and_wait_for_timers_6f3f&snapshot=themis_710-1695_load_and_wait_for_timers_36fc 

      Attachments


        Activity

          abhishek.jindal Abhishek Jindal added a comment -

          To analyse the difference in performance, let's take the latest timer undeploy performance test at: http://showfast.sc.couchbase.com/#/timeline/Linux/eventing/lat/Timer

          where TLS undeploy time = around 20 minutes 30 seconds
          and non-TLS undeploy time = around 17 minutes
          hence the difference is around 200 seconds

          The cleanup routine during undeploy sends DELETE operations via gocb to the KV nodes. Given that this test was run with 50M timers, there are 50M * 2 = 100M timer documents to delete from the metadata collection, and with 4 KV nodes, each KV node will process around 25M deletions.
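          The arithmetic above can be sanity-checked with a few lines of Python (numbers taken straight from the test description; the script is illustrative and not part of the ticket):

```python
# Back-of-the-envelope check of the deletion workload during undeploy.
timers = 50_000_000          # timers created by the test
docs_per_timer = 2           # metadata documents stored per timer
kv_nodes = 4                 # KV nodes in the cluster

total_deletes = timers * docs_per_timer       # total DELETE operations
deletes_per_node = total_deletes // kv_nodes  # assuming an even spread

print(f"total deletes:    {total_deletes:,}")
print(f"deletes per node: {deletes_per_node:,}")
```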

          Uploaded logs collected from the non-TLS undeploy and TLS undeploy runs to supportal.
          TLS run : https://supportal.couchbase.com/snapshot/2d2e90f0a58abdee019030eb1f8011c9%3A%3A0
          Non-TLS run: https://supportal.couchbase.com/snapshot/8767bb72a03853f5f2fecf55e2b0bec3%3A%3A0

          Checking the stat kv_cmd_duration_seconds_sum~DELETE for the bucket "eventing", which represents the cumulative time it took one KV node to delete 25M documents, we see that:

          • It takes around 480 seconds for KV to process 25M DELETE requests when N2N encryption is strict

          • It takes around 300 seconds for KV to process 25M DELETE requests when N2N encryption is disabled

          Hence, 480 - 300 = 180 seconds (3 minutes) of the difference comes from KV DELETE times in the TLS vs non-TLS case.
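          Dividing the cumulative kv_cmd_duration_seconds_sum values by the per-node operation count gives the implied mean service time per DELETE, which lines up with the mctimings histograms that follow (a quick illustrative check under the 25M-ops-per-node assumption, not from the original analysis):

```python
# Implied mean DELETE service time per node from the cumulative duration stat.
ops = 25_000_000     # DELETE ops handled by one KV node

tls_sum_s = 480      # kv_cmd_duration_seconds_sum~DELETE, N2N strict (TLS)
plain_sum_s = 300    # same stat with N2N disabled

tls_mean_us = tls_sum_s / ops * 1e6      # mean us per DELETE, TLS
plain_mean_us = plain_sum_s / ops * 1e6  # mean us per DELETE, non-TLS

print(f"TLS mean:     {tls_mean_us:.1f} us/op")    # 19.2 us/op
print(f"non-TLS mean: {plain_mean_us:.1f} us/op")  # 12.0 us/op
```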

          This is also evident from the mctimings published from KV.

          Following is the delete timings distribution up to the 90th percentile from node 96.16 for the TLS case (if you sum up the cumulative timings you'll get close to 450 seconds):

           [  0.00 -   5.00]us (0.0000%)          2|
           [  5.00 -  12.00]us (10.0000%)   4044381| ##########################################
           [ 12.00 -  13.00]us (20.0000%)   1553437| ################
           [ 13.00 -  15.00]us (30.0000%)   4204567| ############################################
           [ 15.00 -  16.00]us (40.0000%)   2455198| #########################
           [ 16.00 -  17.00]us (50.0000%)   2350165| ########################
           [ 17.00 -  17.00]us (55.0000%)         0|
           [ 17.00 -  18.00]us (60.0000%)   2096812| #####################
           [ 18.00 -  18.00]us (65.0000%)         0|
           [ 18.00 -  19.00]us (70.0000%)   1811568| ##################
           [ 19.00 -  20.00]us (75.0000%)   1488855| ###############
           [ 20.00 -  20.00]us (77.5000%)         0|
           [ 20.00 -  21.00]us (80.0000%)   1208497| ############
           [ 21.00 -  21.00]us (82.5000%)         0|
           [ 21.00 -  22.00]us (85.0000%)    904237| #########
           [ 22.00 -  22.00]us (87.5000%)         0|
           [ 22.00 -  23.00]us (88.7500%)    745775| #######
           [ 23.00 -  23.00]us (90.0000%)         0|
          

          And following are the timings from node 96.16 for the non-TLS case. We notice here that the timings are shifted towards lower values:

           [  0.00 -   2.00]us (0.0000%)          6|
           [  2.00 -   6.00]us (10.0000%)   8032923| ############################################
           [  6.00 -   6.00]us (20.0000%)         0|
           [  6.00 -   6.00]us (30.0000%)         0|
           [  6.00 -   7.00]us (40.0000%)   3076203| ################
           [  7.00 -   9.00]us (50.0000%)   2245601| ############
           [  9.00 -  10.00]us (55.0000%)   1363119| #######
           [ 10.00 -  11.00]us (60.0000%)   2547767| #############
           [ 11.00 -  11.00]us (65.0000%)         0|
           [ 11.00 -  12.00]us (70.0000%)   2734889| ##############
           [ 12.00 -  12.00]us (75.0000%)         0|
           [ 12.00 -  12.00]us (77.5000%)         0|
           [ 12.00 -  13.00]us (80.0000%)   1059013| #####
           [ 13.00 -  13.00]us (82.5000%)         0|
           [ 13.00 -  14.00]us (85.0000%)    618964| ###
           [ 14.00 -  15.00]us (87.5000%)    448439| ##
           [ 15.00 -  16.00]us (88.7500%)    418511| ##
           [ 16.00 -  16.00]us (90.0000%)         0|
          

          Additionally, comparing 100% write throughput for TLS vs non-TLS on KV (test case: 4 nodes, 0/100 R/W, 512B JSON items, batch size = 1, durability None), there is a ~6X difference in throughput: non-TLS reaches 630,000 ops/sec whereas TLS doesn't go beyond 110,000.
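          The quoted throughput numbers work out to roughly a 5.7x gap (rounded to ~6X above):

```python
# Throughput ratio for the 100% write KV workload.
non_tls_ops = 630_000  # ops/sec, non-TLS
tls_ops = 110_000      # ops/sec, TLS

ratio = non_tls_ops / tls_ops
print(f"non-TLS / TLS throughput: {ratio:.1f}x")  # 5.7x
```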

          In summary, eventing undeploy performance depends mainly on KV DELETE performance.

          Assigning to KV to verify whether the above comparison of DELETE timings for TLS vs non-TLS is correct and, if so, whether this difference in DELETE performance is expected.

          drigby Dave Rigby added a comment -

          Thanks Abhishek Jindal for the analysis.

          It is indeed expected that TLS is slower than non-TLS, given the extra encryption / decryption work needed. As such, if this phase of the workload is bottlenecked on processing deletes, those deletes will be more costly (and hence take longer with the same resources) on TLS than non-TLS.

          One thing I would highlight, however, is that TLS 1.2 is generally faster than TLS 1.0/1.1 for KV-Engine - see the graphs at: http://showfast.sc.couchbase.com/#/timeline/Linux/kv/max_ops_ssl/all. I would suggest checking which TLS version eventing is using here.

          drigby Dave Rigby added a comment -

          Resolving as "Not a Bug" - the difference in performance for TLS vs non-TLS connections is expected.


          vikas.chaudhary Vikas Chaudhary added a comment -

          Abhishek Jindal, can we confirm the TLS version? If it's < 1.2, can we move to 1.2?

          abhishek.jindal Abhishek Jindal added a comment -

          Vikas Chaudhary: ns_server / cbauth provides all golang services TLS 1.2 as mintlsversion. Hence this is used in eventing for all communications - gocb, lcb, http server, go-couchbase (dcp client).
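          For reference, pinning a TLS 1.2 floor on the client side can be sketched with Python's ssl module; this is an illustration of the mintlsversion concept, not the actual cbauth/gocb code (which is Go):

```python
import ssl

# Create a client context that refuses anything older than TLS 1.2,
# mirroring the mintlsversion floor ns_server/cbauth hands to services.
ctx = ssl.create_default_context()
ctx.minimum_version = ssl.TLSVersion.TLSv1_2

print(ctx.minimum_version.name)  # TLSv1_2
```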

          vikas.chaudhary Vikas Chaudhary added a comment -

          Closing as expected.

          People

            Assignee: vikas.chaudhary Vikas Chaudhary
            Reporter: vikas.chaudhary Vikas Chaudhary

