Distributed Transactions Java
TXNJ-99

Lesser GET & SET ops with high CPU usage (at server side) while running with transactions as compared to regular KV loadtest


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not a Bug
    • Affects Version/s: 1.0.0-beta.1
    • None
    • None

    Description

      Observing a lower number of GET & SET operations while running transactions, compared to a regular workloadA load test.

      Also observing high CPU utilization.

      Here is a comparison of two load tests with durability set to None.

       

      |*stats*|*Transaction Test*|*KV Load Test*|
      |OPS|~11000 ops/sec (7748 trans per sec)|~328880 ops/sec|
      |cmd_get|~37000|~164440|
      |cmd_set|~70000|~164440|
      |Throughput|7748 trans/sec|~328880 ops/sec|
      |server side cpu utilization (%)|~90 %|~90 %|
      |workload|1 Transaction = 4 READ + 3 UPDATE|1 OPS = 1 READ or 1 UPDATE|
      |workload Distribution|100% transactions|50:50 READ:UPDATE|

       

      Cluster Config: 4 Nodes, 2 Replicas, 12 vCPU, 64 GB RAM

      Test Config: 10M Items, 1KB docSize

      Client Info: YCSB, transactions 1.0.0-beta.1, SDK 3.0.0-alpha.6, Uniform request distribution, 480 concurrent workers

      WORKLOADTA: Number of ops in a single Transaction 4, 4 READ, 3 UPDATE, Durability 0

       

      *Table updated with most recent numbers & test config.

       

      Attachments


        Activity

          sharath.sulochana Sharath Sulochana (Inactive) created issue -
          sharath.sulochana Sharath Sulochana (Inactive) made changes -
          Field Original Value New Value
          Description Observing lesser number of GET & SET operations while running transactions as compared to regular workloadA load test 

          Here is a comparison of two load tests with durability set to None .

           
          |*stats*|*Transaction Test* |*KV Load Test*|
          |OPS|80000|175000|
          |cmd_get|30000|87000|
          |cmd_set|50000|87000|
          |Throughput |5680 trans/sec|175k ops/sec|
          |cpu utilization (%) |80|35|
          Observing lesser number of GET & SET operations while running transactions as compared to regular workloadA load test .

          Also high CPU utilization 

          Here is a comparison of two load tests with durability set to None .

           
          |*stats*|*Transaction Test*|*KV Load Test*|
          |OPS|80000|175000|
          |cmd_get|30000|87000|
          |cmd_set|50000|87000|
          |Throughput |5680 trans/sec|175k ops/sec|
          |cpu utilization (%) |80|35|
          |workload|1 Transaction = 3 READ + 1 UPDATE| 1 OPS = 1 READ or 1 UPDATE|
          |workload Distribution |100% transactions|50:50 READ:UPDATE|

           

           
          sharath.sulochana Sharath Sulochana (Inactive) made changes -
          Description Observing lesser number of GET & SET operations while running transactions as compared to regular workloadA load test .

          Also high CPU utilization 

          Here is a comparison of two load tests with durability set to None .

           
          |*stats*|*Transaction Test*|*KV Load Test*|
          |OPS|80000|175000|
          |cmd_get|30000|87000|
          |cmd_set|50000|87000|
          |Throughput |5680 trans/sec|175k ops/sec|
          |cpu utilization (%) |80|35|
          |workload|1 Transaction = 3 READ + 1 UPDATE| 1 OPS = 1 READ or 1 UPDATE|
          |workload Distribution |100% transactions|50:50 READ:UPDATE|

           

           
          Observing lesser number of GET & SET operations while running transactions as compared to regular workloadA load test .

          Also high CPU utilization 

          Here is a comparison of two load tests with durability set to None .

           
          |*stats*|*Transaction Test*|*KV Load Test*|
          |OPS|80000|175000|
          |cmd_get|~30000|~87000|
          |cmd_set|~50000|~87000|
          |Throughput |5680 trans/sec|175k ops/sec|
          |cpu utilization (%) |80|35|
          |workload|1 Transaction = 3 READ + 1 UPDATE| 1 OPS = 1 READ or 1 UPDATE|
          |workload Distribution |100% transactions|50:50 READ:UPDATE|

           

           
          sharath.sulochana Sharath Sulochana (Inactive) made changes -
          Description Observing lesser number of GET & SET operations while running transactions as compared to regular workloadA load test .

          Also high CPU utilization 

          Here is a comparison of two load tests with durability set to None .

           
          |*stats*|*Transaction Test*|*KV Load Test*|
          |OPS|80000|175000|
          |cmd_get|~30000|~87000|
          |cmd_set|~50000|~87000|
          |Throughput |5680 trans/sec|175k ops/sec|
          |cpu utilization (%) |80|35|
          |workload|1 Transaction = 3 READ + 1 UPDATE| 1 OPS = 1 READ or 1 UPDATE|
          |workload Distribution |100% transactions|50:50 READ:UPDATE|

           

           
          Observing lesser number of GET & SET operations while running transactions as compared to regular workloadA load test .

          Also high CPU utilization 

          Here is a comparison of two load tests with durability set to None .

           
          |*stats*|*Transaction Test*|*KV Load Test*|
          |OPS|~80000|~200000|
          |cmd_get|~30000|~100000|
          |cmd_set|~50000|~100000|
          |Throughput |5680 trans/sec|175k ops/sec|
          |cpu utilization (%) |80|35|
          |workload|1 Transaction = 3 READ + 1 UPDATE| 1 OPS = 1 READ or 1 UPDATE|
          |workload Distribution |100% transactions|50:50 READ:UPDATE|

           

           
          sharath.sulochana Sharath Sulochana (Inactive) made changes -
          Description Observing lesser number of GET & SET operations while running transactions as compared to regular workloadA load test .

          Also high CPU utilization 

          Here is a comparison of two load tests with durability set to None .

           
          |*stats*|*Transaction Test*|*KV Load Test*|
          |OPS|~80000|~200000|
          |cmd_get|~30000|~100000|
          |cmd_set|~50000|~100000|
          |Throughput |5680 trans/sec|175k ops/sec|
          |cpu utilization (%) |80|35|
          |workload|1 Transaction = 3 READ + 1 UPDATE| 1 OPS = 1 READ or 1 UPDATE|
          |workload Distribution |100% transactions|50:50 READ:UPDATE|

           

           
          Observing lesser number of GET & SET operations while running transactions as compared to regular workloadA load test .

          Also high CPU utilization 

          Here is a comparison of two load tests with durability set to None .

           
          |*stats*|*Transaction Test*|*KV Load Test*|
          |OPS|~80000|~175000|
          |cmd_get|~30000|~87000|
          |cmd_set|~50000|~187000|
          |Throughput |5680 trans/sec|175k ops/sec|
          |cpu utilization (%) |80|35|
          |workload|1 Transaction = 3 READ + 1 UPDATE| 1 OPS = 1 READ or 1 UPDATE|
          |workload Distribution |100% transactions|50:50 READ:UPDATE|

           

           

          ingenthr Matt Ingenthron added a comment -

          I think to a large degree this is expected, as many of the underlying operations implementing transactions are not cmd_get/cmd_set, but rather use the subdocument API with xattrs.

          Sharath Sulochana: do you have the full set of (non-zero) KV statistics?
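          For illustration, here is a minimal sketch of what those Sub-Document xattr operations look like through the Java SDK 3 API. This is not the transactions library's actual internals; the document id and xattr path are made up.

{code:java}
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.Collection;
import com.couchbase.client.java.json.JsonObject;
import com.couchbase.client.java.kv.LookupInSpec;
import com.couchbase.client.java.kv.MutateInSpec;

import java.util.Arrays;

public class SubdocXattrSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.connect("127.0.0.1", "Administrator", "password");
        Collection collection = cluster.bucket("default").defaultCollection();
        collection.upsert("doc-1", JsonObject.create().put("balance", 100));

        // Staging transactional metadata in an extended attribute (xattr) reaches the
        // server as a Sub-Document mutation, so it is not counted like a plain SET.
        collection.mutateIn("doc-1", Arrays.asList(
                MutateInSpec.upsert("txn.id", JsonObject.create().put("attempt", "a1"))
                        .xattr()
                        .createPath()));

        // Reading the metadata back is likewise a Sub-Document lookup on the xattr path.
        collection.lookupIn("doc-1", Arrays.asList(
                LookupInSpec.get("txn.id").xattr()));

        cluster.disconnect();
    }
}
{code}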

          sharath.sulochana Sharath Sulochana (Inactive) made changes -
          Description Observing lesser number of GET & SET operations while running transactions as compared to regular workloadA load test .

          Also high CPU utilization 

          Here is a comparison of two load tests with durability set to None .

           
          |*stats*|*Transaction Test*|*KV Load Test*|
          |OPS|~80000|~175000|
          |cmd_get|~30000|~87000|
          |cmd_set|~50000|~187000|
          |Throughput |5680 trans/sec|175k ops/sec|
          |cpu utilization (%) |80|35|
          |workload|1 Transaction = 3 READ + 1 UPDATE| 1 OPS = 1 READ or 1 UPDATE|
          |workload Distribution |100% transactions|50:50 READ:UPDATE|

           

           
          Observing lesser number of GET & SET operations while running transactions as compared to regular workloadA load test .

          Also high CPU utilization 

          Here is a comparison of two load tests with durability set to None .

           
          |*stats*|*Transaction Test*|*KV Load Test*|
          |OPS|~80000|~400000|
          |cmd_get|~30000|~200000|
          |cmd_set|~50000|~200000|
          |Throughput |5680 trans/sec|400k ops/sec|
          |cpu utilization (%) |80|38|
          |workload|1 Transaction = 3 READ + 1 UPDATE| 1 OPS = 1 READ or 1 UPDATE|
          |workload Distribution |100% transactions|50:50 READ:UPDATE|

           

           
          sharath.sulochana Sharath Sulochana (Inactive) made changes -
          Affects Version/s 1.0.0-alpha.5 [ 16172 ]
          sharath.sulochana Sharath Sulochana (Inactive) made changes -
          Fix Version/s 1.0.0-alpha.5 [ 16172 ]
          sharath.sulochana Sharath Sulochana (Inactive) made changes -
          Labels mad-hatter performance

          sharath.sulochana Sharath Sulochana (Inactive) added a comment -

          Matt Ingenthron - Thanks for looking into the ticket. I had this assigned to me as I was still planning to add some more details.

          Here is a link to one of the KV tests, as you requested:

          http://perf.jenkins.couchbase.com/job/hebe-dura-txn/227/console

          http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=hebe_650-3883_access_a4b1

          I will share additional details for the transaction test with the same config.

          graham.pople Graham Pople made changes -
          Status New [ 10003 ] Open [ 1 ]
          graham.pople Graham Pople added a comment - - edited

          100% of txn operations are done with Sub-Document, so I suspect those cmd_gets and cmd_sets do include the Sub-Document ops.

          Thanks for these early figures Sharath.

          They do look on the low side, both in terms of the reduced overall number of operations, and the transactions/sec which is lower than I'd expect with those ops/sec figures - but it's not completely unexpected.  This initial release of transactions is focussed on stability and as yet hasn't been subject to extensive profiling.  There's doubtless some low-hanging fruit, and I look forward to having time after the beta release to investigate and improve this area.  I'll be surprised if we don't see some big improvements to these numbers in the coming weeks.

          In particular transactions is mostly just doing KV ops, so the high CPU usage is very interesting.  I suspect the logging needs some iteration.

          graham.pople Graham Pople made changes -
          Fix Version/s 1.0.0-beta.1 [ 16171 ]
          Fix Version/s 1.0.0-alpha.5 [ 16172 ]
          Affects Version/s 1.0.0-beta.1 [ 16171 ]
          Affects Version/s 1.0.0-alpha.5 [ 16172 ]
          graham.pople Graham Pople made changes -
          Fix Version/s future [ 16170 ]
          Fix Version/s 1.0.0-beta.1 [ 16171 ]

          ingenthr Matt Ingenthron added a comment -

          Note from the performance meeting on August 7: the CPU usage concern is actually on the server side. Sharath Sulochana is going to get more information into the ticket, as it was raised that CPU was a larger concern, but that didn't come through in the initial description.

          It may be worth microbenchmarking the subdoc operations being used at some point.
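          For what it's worth, a very rough timing-loop sketch of such a microbenchmark is below (a real benchmark would use JMH, warmup, and multiple threads); the document id, xattr path, and iteration count are made up.

{code:java}
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.Collection;
import com.couchbase.client.java.json.JsonObject;
import com.couchbase.client.java.kv.MutateInSpec;

import java.util.Arrays;

public class SubdocMicrobenchSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.connect("127.0.0.1", "Administrator", "password");
        Collection collection = cluster.bucket("default").defaultCollection();
        collection.upsert("bench-doc", JsonObject.create());

        int iterations = 10_000;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            // One xattr Sub-Document mutation per iteration, similar in shape to the
            // per-document staging writes a transaction performs.
            collection.mutateIn("bench-doc", Arrays.asList(
                    MutateInSpec.upsert("txn.staged", JsonObject.create().put("i", i))
                            .xattr()
                            .createPath()));
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.printf("%d subdoc xattr mutations in %d ms (~%.0f ops/sec, single-threaded)%n",
                iterations, elapsedMs, iterations * 1000.0 / elapsedMs);

        cluster.disconnect();
    }
}
{code}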

          wayne Wayne Siu made changes -
          Description Observing lesser number of GET & SET operations while running transactions as compared to regular workloadA load test .

          Also high CPU utilization 

          Here is a comparison of two load tests with durability set to None .

           
          |*stats*|*Transaction Test*|*KV Load Test*|
          |OPS|~80000|~400000|
          |cmd_get|~30000|~200000|
          |cmd_set|~50000|~200000|
          |Throughput |5680 trans/sec|400k ops/sec|
          |cpu utilization (%) |80|38|
          |workload|1 Transaction = 3 READ + 1 UPDATE| 1 OPS = 1 READ or 1 UPDATE|
          |workload Distribution |100% transactions|50:50 READ:UPDATE|

           

           
          Observing lesser number of GET & SET operations while running transactions as compared to regular workloadA load test .

          Also high CPU utilization 

          Here is a comparison of two load tests with durability set to None .

           
          |*stats*|*Transaction Test*|*KV Load Test*|
          |OPS|~80000|~400000|
          |cmd_get|~30000|~200000|
          |cmd_set|~50000|~200000|
          |Throughput |5680 trans/sec|400k ops/sec|
          |server side cpu utilization (%) |80|38|
          |workload|1 Transaction = 3 READ + 1 UPDATE| 1 OPS = 1 READ or 1 UPDATE|
          |workload Distribution |100% transactions|50:50 READ:UPDATE|

           

           
          wayne Wayne Siu made changes -
          Summary Lesser GET & SET ops with high CPU usage while running with transactions as compared to regular KV loadtest Lesser GET & SET ops with high CPU usage (at server side) while running with transactions as compared to regular KV loadtest

          ingenthr Matt Ingenthron added a comment -

          Sharath Sulochana: The CPU utilization in the description, is that Irix-style accounting (where 16 cores is 1600%) or Solaris-style accounting (where use of 4 of 16 cores would be 25%)? This came up as a point of confusion in another perf test not too long ago.


          sharath.sulochana Sharath Sulochana (Inactive) added a comment - - edited

          Matt Ingenthron

          CPU captured in perfrunner is based on the ns_server stats API via {host}:{port}/pools/default.

          I believe it uses the top command internally (where 16 cores is 1600%) and then averages it. Need to confirm it with the ns_server team.
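          For reference, a minimal sketch of pulling that figure directly from the same endpoint; the JSON field names used below (nodes / systemStats / cpu_utilization_rate) and the credentials are assumptions, so verify them against an actual /pools/default response.

{code:java}
import com.couchbase.client.java.json.JsonArray;
import com.couchbase.client.java.json.JsonObject;

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Scanner;

public class ClusterCpuSketch {
    public static void main(String[] args) throws Exception {
        // Same endpoint perfrunner polls: {host}:{port}/pools/default
        URL url = new URL("http://127.0.0.1:8091/pools/default");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        String auth = Base64.getEncoder()
                .encodeToString("Administrator:password".getBytes(StandardCharsets.UTF_8));
        conn.setRequestProperty("Authorization", "Basic " + auth);

        String body;
        try (InputStream in = conn.getInputStream();
             Scanner scanner = new Scanner(in, StandardCharsets.UTF_8.name()).useDelimiter("\\A")) {
            body = scanner.hasNext() ? scanner.next() : "";
        }

        // Per-node CPU as reported by ns_server (assumed field names).
        JsonArray nodes = JsonObject.fromJson(body).getArray("nodes");
        for (int i = 0; i < nodes.size(); i++) {
            JsonObject node = nodes.getObject(i);
            System.out.printf("%s cpu_utilization_rate=%.1f%%%n",
                    node.getString("hostname"),
                    node.getObject("systemStats").getDouble("cpu_utilization_rate"));
        }
    }
}
{code}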


          ingenthr Matt Ingenthron added a comment -

          Thanks. If that's the case, then going from 38% to 80% doesn't seem unreasonable, as subdocument ops are a little more expensive and there are 4 times as many.

          To know where to take this next, it will be great when you can pass along the full set of KV stats and how CPU is being accounted for, Sharath Sulochana.


          ingenthr Matt Ingenthron added a comment -

          Along with a description of the SUT as well please, Sharath Sulochana. How many CPUs, etc.


          sharath.sulochana Sharath Sulochana (Inactive) added a comment -

          Matt Ingenthron: I dug a little deeper on these tests. Looks like I had compared two wrong tests. Basically we have three sets of tests: the current YCSB KV, new durability, and transactions tests. I had inherited configs from the KV tests (with lower cores). I will be closing this as not a bug.

          sharath.sulochana Sharath Sulochana (Inactive) made changes -
          Resolution Not a Bug [ 10200 ]
          Status Open [ 1 ] Closed [ 6 ]
          sharath.sulochana Sharath Sulochana (Inactive) made changes -
          Actual End 2019-08-08 14:55 (issue has been closed)

          sharath.sulochana Sharath Sulochana (Inactive) added a comment -

          Reopening it after reassessing it based on new test results.

          sharath.sulochana Sharath Sulochana (Inactive) made changes -
          Resolution Not a Bug [ 10200 ]
          Status Closed [ 6 ] Reopened [ 4 ]
          sharath.sulochana Sharath Sulochana (Inactive) made changes -
          Description Observing lesser number of GET & SET operations while running transactions as compared to regular workloadA load test .

          Also high CPU utilization 

          Here is a comparison of two load tests with durability set to None .

           
          |*stats*|*Transaction Test*|*KV Load Test*|
          |OPS|~80000|~400000|
          |cmd_get|~30000|~200000|
          |cmd_set|~50000|~200000|
          |Throughput |5680 trans/sec|400k ops/sec|
          |server side cpu utilization (%) |80|38|
          |workload|1 Transaction = 3 READ + 1 UPDATE| 1 OPS = 1 READ or 1 UPDATE|
          |workload Distribution |100% transactions|50:50 READ:UPDATE|

           

           
          Observing lesser number of GET & SET operations while running transactions as compared to regular workloadA load test .

          Also high CPU utilization 

          Here is a comparison of two load tests with durability set to None .

           
          |*stats*|*Transaction Test*|*KV Load Test*|
          |OPS|~11000 ops/sec (7748 trans per sec)|~328880 ops /sec|
          |cmd_get|~37000|~164440|
          |cmd_set|~50000|~164440|
          |Throughput |7748 trans/sec|~328880  ops/sec|
          |server side cpu utilization (%) |~ 90 %|~90 %|
          |workload|1 Transaction = 4 READ + 3 UPDATE| 1 OPS = 1 READ or 1 UPDATE|
          |workload Distribution |100% transactions|50:50 READ:UPDATE|

           

          *Cluster Config :*  _4 Nodes, 2 Replicas , 12 vCPU, 64 GB RAM_

          *Test Config :* _10M_ Items , _1KB_ docSize

          *Client Info :* _YCSB_ , _1.0.0-beta.1_ _3.0.0-alpha.6_ , _Uniform_ requestdistribution, _480_ concurrent workers

          *WORKLOADTA* : Number of ops in Single Transaction _4_ , _4 READS, 3 UPDATE, Durability 0_

           

          *Table updated with most recent numbers & test config . 

           
          graham.pople Graham Pople added a comment - - edited

          The numbers do point towards some bottlenecks, a few theories on where they are:

          1. ATR contention.  There are 1024 ATRs and each transaction requires 4 writes to one of them.  8k txns/sec are going to generate 32k ops/sec on just 1024 docs, and while there should be no contention on the content itself (as they're writing to different paths in the doc), under-the-hood there's a great deal of document contention.   With Durability=None the contention is resolved quickly server-side in a CAS loop (fetch the full doc, apply the subdoc mutations, try to write the doc, retry on CAS fail), otherwise it is resolved by DurabilitySyncWriteInProgress errors being sent back to client and the client retrying.

          I'm very curious what will happen if we change the number of ATRs, and I'm adding TXNJ-112 to let YCSB configure it. There will be a linear increase in the work required for the background cleanup, which polls each ATR every minute, but it's pretty minimal (17 reads/sec currently, so it can easily increase by 10x).  Once that's in, Sharath will run tests with various numbers of ATRs (suggest 1024*1, 5, 10 & 20) and we'll see what drops out.  (A rough arithmetic sketch follows after this list.)

          2. High cluster CPU (90%+).  Though I'm making small changes to the ATRs using subdoc, with durability the entire ATR doc is sent in the DCP Prepare.  I wonder if these docs are getting pretty big, and the server is spending a bunch of time a) reading the full doc to apply the subdoc change, (which is always going to need to happen) and b) parsing large DCP Prepares on the replicas (when theoretically instead the Prepare could contain just the subdoc mutation rather than the full doc - though I briefly chatted with KV about this and get the impression it's non-trivial).  This perhaps accounts for the high CPU seen on the cluster.

          I'm not sure changing the number of ATRs will have any impact here - it will be processing more, smaller docs, but overall the same amount of data.

          Under TXNJ-110 I'll add summary diagnostic events on the size of ATRs, which hopefully Sharath can also integrate into YCSB.  This may give us something to go on, though this sub-issue is probably better investigated by KV team.

          3. Client-side CPU & GC churn.  Sharath is going to add client-side CPU monitoring.  This is a new library, it hasn't gone through profiling yet, it logs heavily, and it's entirely possible there's some low-hanging fruit to address.

          Bonus. Poor YCSB distribution.  I've seen evidence, both in my transactions logging and in pcaps, that though YCSB is spinning up many clients, only a handful (possibly just 1 or 2) in each worker are doing any real work, which will likely impact throughput.  This is somewhat contested, as Sharath has investigated and believes YCSB is distributing the workload just fine.  Nonetheless, I'd like to find time to spin up YCSB locally and investigate this further.
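          To make theory 1 concrete, here is a rough back-of-envelope sketch using the figures above (8k txns/sec, 4 ATR writes per transaction). This is purely illustrative arithmetic under an even-distribution assumption, not measured data.

{code:java}
public class AtrContentionSketch {
    public static void main(String[] args) {
        double txnsPerSec = 8_000;   // roughly the transaction rate seen in this test
        int atrWritesPerTxn = 4;     // each transaction writes to its ATR 4 times

        // Assuming transactions hash evenly across the ATR documents, the write rate
        // (and hence CAS-loop / SyncWriteInProgress retry pressure) on any single ATR
        // falls linearly as the number of ATRs grows.
        for (int numAtrs : new int[]{1_024, 2_048, 4_096, 10_240, 20_480}) {
            double totalAtrWritesPerSec = txnsPerSec * atrWritesPerTxn;
            double writesPerAtrPerSec = totalAtrWritesPerSec / numAtrs;
            System.out.printf("numAtrs=%6d -> ~%.0f ATR writes/sec total, ~%.1f writes/sec per ATR%n",
                    numAtrs, totalAtrWritesPerSec, writesPerAtrPerSec);
        }
    }
}
{code}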


          graham.pople Graham Pople added a comment -

          4. Unexpected reads.  While debugging with Sharath we also noted that there should be around 32k reads/sec (~8k txns/sec, each doing 4 reads), but we're seeing ~37k.  That missing 5k needs to be accounted for, possibly using some advanced wireshark-fu.

          This shouldn't have much performance impact though, as in this test the cluster was capable of ~320k ops/sec.  I'm just curious about it.

          sharath.sulochana Sharath Sulochana (Inactive) made changes -
          Description Observing lesser number of GET & SET operations while running transactions as compared to regular workloadA load test .

          Also high CPU utilization 

          Here is a comparison of two load tests with durability set to None .

           
          |*stats*|*Transaction Test*|*KV Load Test*|
          |OPS|~11000 ops/sec (7748 trans per sec)|~328880 ops /sec|
          |cmd_get|~37000|~164440|
          |cmd_set|~50000|~164440|
          |Throughput |7748 trans/sec|~328880  ops/sec|
          |server side cpu utilization (%) |~ 90 %|~90 %|
          |workload|1 Transaction = 4 READ + 3 UPDATE| 1 OPS = 1 READ or 1 UPDATE|
          |workload Distribution |100% transactions|50:50 READ:UPDATE|

           

          *Cluster Config :*  _4 Nodes, 2 Replicas , 12 vCPU, 64 GB RAM_

          *Test Config :* _10M_ Items , _1KB_ docSize

          *Client Info :* _YCSB_ , _1.0.0-beta.1_ _3.0.0-alpha.6_ , _Uniform_ requestdistribution, _480_ concurrent workers

          *WORKLOADTA* : Number of ops in Single Transaction _4_ , _4 READS, 3 UPDATE, Durability 0_

           

          *Table updated with most recent numbers & test config . 

           
          Observing lesser number of GET & SET operations while running transactions as compared to regular workloadA load test .

          Also high CPU utilization 

          Here is a comparison of two load tests with durability set to None .

           
          |*stats*|*Transaction Test*|*KV Load Test*|
          |OPS|~11000 ops/sec (7748 trans per sec)|~328880 ops /sec|
          |cmd_get|~37000|~164440|
          |cmd_set|~70000|~164440|
          |Throughput |7748 trans/sec|~328880  ops/sec|
          |server side cpu utilization (%) |~ 90 %|~90 %|
          |workload|1 Transaction = 4 READ + 3 UPDATE| 1 OPS = 1 READ or 1 UPDATE|
          |workload Distribution |100% transactions|50:50 READ:UPDATE|

           

          *Cluster Config :*  _4 Nodes, 2 Replicas , 12 vCPU, 64 GB RAM_

          *Test Config :* _10M_ Items , _1KB_ docSize

          *Client Info :* _YCSB_ , _1.0.0-beta.1_ _3.0.0-alpha.6_ , _Uniform_ requestdistribution, _480_ concurrent workers

          *WORKLOADTA* : Number of ops in Single Transaction _4_ , _4 READS, 3 UPDATE, Durability 0_

           

          *Table updated with most recent numbers & test config . 

           
          sharath.sulochana Sharath Sulochana (Inactive) made changes -
          Assignee Sharath Sulochana [ sharath.sulochana ] Graham Pople [ graham.pople ]
          graham.pople Graham Pople made changes -
          Attachment image-2019-08-23-14-05-22-430.png [ 72655 ]
          graham.pople Graham Pople added a comment - - edited

          Focussing on 1. ATR Contention here.

          Under TXNJ-112 I made the number of ATRs a configuration option and have been experimenting with it today, with promising results from YCSB (see the attached chart, image-2019-08-23-14-05-22-430.png):

          So we see a big initial leap in performance going from the default 1024 ATRs up to 2048 (10% improvement) then 4096 (22%), and then a steady improvement all the way up to the current configurable max of 20,480 ATRs (40% improvement). 

          Performance setup: 3 AWS m4.large nodes; 100 YCSB threads; Thinkpad Carbon X1 laptop; 200k records; durability=1; transactions beta.2-SNAPSHOT; run for 5 minutes each
          Other results: server CPU pegged at ~95% even from initial 1024 test; client CPU varied 45-65%; ops ~800*3 at 1024 ATRs, ~950-1350*3 (quite peaky) at 20480
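          For context, a sketch of how the new option might be set from client code. The package names, the numATRs method, and the builder shape below are assumptions based on this ticket (TXNJ-112), so check them against the released TransactionConfigBuilder API.

{code:java}
import com.couchbase.client.java.Cluster;
import com.couchbase.transactions.Transactions;
import com.couchbase.transactions.config.TransactionConfigBuilder;

public class NumAtrsConfigSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.connect("127.0.0.1", "Administrator", "password");

        // numATRs(...) is the knob discussed here; the exact name/signature is an assumption.
        Transactions transactions = Transactions.create(cluster,
                TransactionConfigBuilder.create()
                        .numATRs(20_480)
                        .build());

        // ... run transactions.run(ctx -> { ... }) workloads as usual ...

        transactions.close();
        cluster.disconnect();
    }
}
{code}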

          graham.pople Graham Pople added a comment -

          MB-35649 contains an important performance fix for sync rep that is going out in 4133, and will likely have an impact on txns performance too.

          graham.pople Graham Pople made changes -
          Comment [ [~sharath.sulochana] it looks like there may be just one KV endpoint shared between all YCSB threads, which will have a severe impact on throughput especially with durability.  Can you check my working here?
           * Each YCSB worker node runs a single ycsb process.
           * Each process spins up X threads (configurable)
           * Each thread creates a single Couchbase3Client. 
           * Couchbase3Client has some singleton logic so there will only be a single ClusterEnvironment, Cluster, Collection and Transactions object created for the whole ycsb process.  E.g. they're shared between all Couchbase3Clients/threads.
           * The ycsb process takes a couchbase.kvEndpoints param that defaults to 1.  I've looked through some of the perf cluster jobs and as far as I can tell none of them are setting this param.
           * So, we end up with just one KV endpoint for the ycsb process.

          Is that all correct? ]
          graham.pople Graham Pople added a comment -

          Sharath Sulochana, I'm drilling into the YCSB code and the couchbase3-transactions branch is missing some crucial changes that Michael Nitschinger made in the couchbase3-new-durability branch.  Specifically, they share one Java client ClusterEnvironment between all threads (Couchbase3Clients really) in the ycsb process, leading to much more efficient use of resources.

          Can you please merge these changes in?  Note that it will be imperative to set the kvEndpoints parameter after.  I'd suggest setting it to double the number of threads.  That will let each Couchbase3Client have its own kvEndpoint effectively, plus some wiggle room.
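          As a rough illustration of the intended setup (not the actual YCSB binding code): one ClusterEnvironment shared by the whole process, with multiple KV connections per node, which is roughly what the couchbase.kvEndpoints parameter maps onto. The connection count here is made up.

{code:java}
import com.couchbase.client.core.env.IoConfig;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.ClusterOptions;
import com.couchbase.client.java.env.ClusterEnvironment;

public class SharedEnvSketch {
    // One environment for the whole process, shared by every worker thread,
    // instead of one environment per Couchbase3Client instance.
    private static final ClusterEnvironment ENV = ClusterEnvironment.builder()
            .ioConfig(IoConfig.numKvConnections(8))   // analogous to couchbase.kvEndpoints
            .build();

    public static void main(String[] args) {
        Cluster cluster = Cluster.connect("127.0.0.1",
                ClusterOptions.clusterOptions("Administrator", "password").environment(ENV));

        // ... all worker threads share this Cluster and its environment ...

        cluster.disconnect();
        ENV.shutdown();
    }
}
{code}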


          daschl Michael Nitschinger added a comment -

          Also, Korrigan Clark is able to provide input on the params he is using for the durability testing, Sharath Sulochana, so it makes sense for both of you to sync up on this.


          sharath.sulochana Sharath Sulochana (Inactive) added a comment - - edited

          Graham Pople - Here are the numbers from the latest tests with numATRs set to 20480 (build 6.5.0-4081).  There is a significant increase in throughput when the durability level is NONE (almost 100%+), but the other durability levels improve by only ~10%.

          Note - Thanks for the kvEndpoints suggestion. I will go ahead and implement the changes.

           

          |*Test (YCSB workloadTA)*|*Transaction (6.5.0-3939) SDK3 (Beta)*|*Transaction (6.5.0-4081) SDK3 (Beta.2-SNAPSHOT)*|*KV Throughput (6.5.0-3939) SDK3 alpha 6*|
          |Test #1 Durability Level: NONE|7,788 trans/sec (ops ~110k)|16,859 trans/sec (ops ~235k)|328,880 ops/sec|
          |Test #2 Durability Level: MAJORITY|1,284 trans/sec|1,394 trans/sec|21,315 ops/sec|
          |Test #3 Durability Level: MAJORITY_AND_PERSIST_ON_MASTER|600 trans/sec|648 trans/sec|9,438 ops/sec|
          |Test #4 Durability Level: PERSIST_TO_MAJORITY|565 trans/sec|611 trans/sec|9,031 ops/sec|

           
           

          graham.pople Graham Pople added a comment -

          Thanks Sharath Sulochana for the latest testing.  Don't suppose you have the figures saved from testing with other values of numAtrs?  I want to increase it but will need to make a judgement call balancing the best performance improvement vs "woah this seems like a lot of metadata documents to create in the customer's cluster".   Also it will guide me on whether it's worth increasing the max limit even beyond the current 20k-ish.

          Good to see the big performance jump, at least with durability disabled.  For Majority+ I suspect we've hit some other bottleneck that is preventing the numATRs change from shining, and my guess would be the high cluster-side CPU.  Daniel Owen can we possibly request a bit of time from your team on profiling the CPU usage (presuming it's from kvengine, Sharath Sulochana can you confirm)?  Transactions does some unusual things (lots of churn in big xattrs, for example), so maybe there's some low-hanging performance fruit in KV here?  Or perhaps we'll need to take another approach for txns entirely - either way, would be very helpful to get some expert help at this point.

          owend Daniel Owen added a comment -

          Hi Graham Pople. Yeah, that's fine. Probably easiest is, when the performance team does a run, for them to use linux perf to capture a profile for us to analyse.

          graham.pople Graham Pople added a comment - - edited

          Though the transaction throughput figures look low, it's important to remember how much each transaction is doing: not only updating multiple documents, but also updating the ATRs so the txn can be cleaned up if anything goes wrong.  I'm crunching the numbers, and for Majority+, the transactions throughput is actually very close to the current theoretical maximum *.

          Each transaction tested is doing 14 operations total (4 reads, 4 writes to the ATR, and 3 updates which involve a stage and a commit each, so 6 writes there).  So e.g. at Majority we have 1,394 txns/sec, which is 1,394 * 14 = 19,516 operations per sec - within spitting distance of the regular KV throughput you've achieved of 21,315 ops/sec. 

          Here's the full table:

           

          |*Test (YCSB workloadTA)*|*Transaction (6.5.0-4081) SDK3 (Beta.2-SNAPSHOT) numAtrs=20480*|*Ops/sec (14 * trans/sec)*|*KV Throughput (6.5.0-3939) SDK3 alpha 6*|*Txns throughput as % of maximum possible*|
          |Test #1 Durability Level: NONE|16,859 trans/sec|236,026 ops/sec|328,880 ops/sec|71.8 %|
          |Test #2 Durability Level: MAJORITY|1,394 trans/sec|19,516 ops/sec|21,315 ops/sec|91.6 %|
          |Test #3 Durability Level: MAJORITY_AND_PERSIST_ON_MASTER|648 trans/sec|9,072 ops/sec|9,438 ops/sec|96.1 %|
          |Test #4 Durability Level: PERSIST_TO_MAJORITY|611 trans/sec|8,554 ops/sec|9,031 ops/sec|94.7 %|

           
          So, we're actually doing pretty well.

          (* Note the current theoretical maximum can be improved slightly, as we can probably get rid of one of the ATR writes.)

           


          shivani.gupta Shivani Gupta added a comment -

          Graham Pople, question for you:

          Today do you create the 1024 ATRs upfront, or the first time a transaction hits its first document on a given vbucket?

          Because when I run small experimental tests I do not see 1024 ATRs right away (which is a relief actually!).

          I agree with your concern that if we go with 20*1024 ATRs, they will flood the user's bucket. Until we have a way of filtering out ATRs into a special System Collection I would not create tons of them.

          Also, given that with Durability level = majority the bottleneck is elsewhere rather than ATR contention, I am less convinced this change is necessary.

          graham.pople Graham Pople added a comment - - edited

          Shivani Gupta it's the latter - the ATRs are created on-demand.

          Admittedly, the numbers above do indicate that the number of ATRs isn't much of a factor when Durability is enabled.  Though countering that is issue MB-35359, where transactions were actually expiring due to congestion on the ATRs.  It's been worked around, as newer Java clients now automatically retry on that error for up to 2.5 seconds, so the issue is closed - but it remains indicative that there's heavy congestion going on behind the scenes.


          graham.pople Graham Pople added a comment -

          Hi Sharath Sulochana, do you think we still need this ticket open?


          sharath.sulochana Sharath Sulochana (Inactive) added a comment -

          Graham Pople - I guess we can close this ticket.

           

          graham.pople Graham Pople added a comment -

          Ok, closing it out.  It turned into quite a sprawling ticket with various ideas and testing of performance improvements, but the core takeaway is that transactions performance is reasonably close to the maximum currently possible, considering durability performance and the current protocol.  There are at least two ways the protocol can be improved in the future, performance-wise, which have been logged separately.

          graham.pople Graham Pople made changes -
          Fix Version/s .future [ 16170 ]
          Resolution Not a Bug [ 10200 ]
          Status Reopened [ 4 ] Closed [ 6 ]

          People

            graham.pople Graham Pople
            sharath.sulochana Sharath Sulochana (Inactive)
             Votes: 0
             Watchers: 9

            Dates

              Created:
              Updated:
              Resolved:
