The numbers do point towards some bottlenecks. A few theories on where they are:
1. ATR contention. There are 1024 ATRs and each transaction requires 4 writes to one of them. 8k txns/sec will generate 32k ops/sec across just 1024 docs, and while there should be no contention on the content itself (each transaction writes to a different path in the doc), under the hood there's a great deal of document contention. With Durability=None the contention is resolved quickly server-side in a CAS loop (fetch the full doc, apply the subdoc mutations, try to write the doc, retry on CAS failure); otherwise it is resolved by DurabilitySyncWriteInProgress errors being sent back to the client, which then retries.
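To make the contention pattern concrete, here's a minimal sketch of that fetch/apply/CAS-retry loop, using an in-memory `AtomicReference` as a stand-in for the document store. All names here are illustrative, not the actual KV-engine code; the point is that two writers touching *different* paths of the same ATR doc still collide at the document level.

```java
import java.util.Map;
import java.util.HashMap;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the Durability=None resolution path described above: fetch the
// full doc, apply the subdoc mutation, attempt a CAS write, retry on mismatch.
public class CasLoopSketch {
    // Stand-in for one ATR document: its paths plus a CAS value.
    record Doc(Map<String, String> paths, long cas) {}

    static final AtomicReference<Doc> atr =
        new AtomicReference<>(new Doc(Map.of(), 0L));

    // Apply one subdoc mutation (write a single path) with CAS retry.
    static void subdocUpsert(String path, String value) {
        while (true) {
            Doc current = atr.get();                       // fetch full doc + CAS
            Map<String, String> updated = new HashMap<>(current.paths());
            updated.put(path, value);                      // apply subdoc mutation
            Doc next = new Doc(updated, current.cas() + 1);
            if (atr.compareAndSet(current, next)) return;  // CAS write succeeded
            // CAS mismatch: another writer got in first; loop and re-fetch
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Two "transactions" writing to different paths of the same ATR doc:
        // no content conflict, but still document-level CAS contention.
        Thread a = new Thread(() -> subdocUpsert("txn/1", "PENDING"));
        Thread b = new Thread(() -> subdocUpsert("txn/2", "PENDING"));
        a.start(); b.start();
        a.join(); b.join();
        System.out.println(atr.get().paths().size() + " entries, cas=" + atr.get().cas());
        // -> 2 entries, cas=2 (each successful write bumps the CAS exactly once)
    }
}
```

Whichever writer loses the CAS race simply re-fetches and retries, which is cheap per-attempt but multiplies the ops/sec load on those 1024 docs.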
I'm very curious what will happen if we change the number of ATRs, and I'm adding TXNJ-112 to let YCSB configure it. There will be a linear increase in the reads required for the background cleanup, which polls each ATR every minute, but that load is pretty minimal (17 reads/sec currently, so it can easily increase by 10x). Once that's in, Sharath will run tests with various numbers of ATRs (I suggest 1024 × 1, 5, 10 and 20) and we'll see what drops out.
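The back-of-envelope for the cleanup load is just one poll per ATR per minute, so reads/sec scales linearly with the ATR count. A quick check of the proposed configurations (the helper name is mine, nothing from the library):

```java
// Cleanup polls each ATR once per minute, so reads/sec = numAtrs / 60.
// 1024 ATRs -> ~17 reads/sec, matching the figure quoted above.
public class CleanupLoad {
    static long readsPerSec(long numAtrs) {
        return numAtrs / 60;  // one poll per ATR per minute
    }

    public static void main(String[] args) {
        for (int mult : new int[] {1, 5, 10, 20}) {
            long atrs = 1024L * mult;
            System.out.println(atrs + " ATRs -> ~" + readsPerSec(atrs) + " reads/sec");
        }
        // 1024 -> ~17, 5120 -> ~85, 10240 -> ~170, 20480 -> ~341 reads/sec
    }
}
```

Even the 20x case stays in the low hundreds of reads/sec, which supports treating the cleanup overhead as negligible for these tests.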
2. High cluster CPU (90%+). Though I'm making small changes to the ATRs using subdoc, with durability the entire ATR doc is sent in the DCP Prepare. I wonder if these docs are getting pretty big, and the server is spending a lot of time a) reading the full doc to apply the subdoc change (which is always going to need to happen), and b) parsing large DCP Prepares on the replicas (when theoretically the Prepare could contain just the subdoc mutation rather than the full doc - though I briefly chatted with KV about this and got the impression it's non-trivial). This perhaps accounts for the high CPU seen on the cluster.
I'm not sure changing the number of ATRs will have any impact here - the cluster will be processing more, smaller docs, but overall the same amount of data.
Under TXNJ-110 I'll add summary diagnostic events on the size of the ATRs, which hopefully Sharath can also integrate into YCSB. That may give us something to go on, though this sub-issue is probably better investigated by the KV team.
3. Client-side CPU & GC churn. Sharath is going to add client-side CPU monitoring. This is a new library that hasn't been through profiling yet, and it logs heavily, so it's entirely possible there's some low-hanging fruit to address.
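For the client-side numbers, one cheap option (assuming we just want a first read from inside the JVM workload process, before reaching for a real profiler) is the standard JMX beans - nothing YCSB- or transactions-specific here:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Capture client-side CPU and GC churn from inside the workload process,
// using only standard JMX beans. Sampling these periodically (e.g. once a
// second) gives a rough CPU/GC profile without attaching a profiler.
public class ClientStats {
    public static void main(String[] args) {
        com.sun.management.OperatingSystemMXBean os =
            (com.sun.management.OperatingSystemMXBean)
                ManagementFactory.getOperatingSystemMXBean();

        // Fraction of CPU used by this JVM process, 0.0-1.0 (-1 if unavailable)
        System.out.println("process CPU load: " + os.getProcessCpuLoad());

        long gcCount = 0, gcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            gcCount += gc.getCollectionCount();
            gcMillis += gc.getCollectionTime();
        }
        // Deltas between samples of these two counters show GC churn over time
        System.out.println("gc collections: " + gcCount + ", gc time ms: " + gcMillis);
    }
}
```

If the GC counters climb steeply under load, that would point at allocation churn in the library (or its logging) as the low-hanging fruit.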
Bonus: poor YCSB distribution. I've seen evidence, both in my transactions logging and in pcaps, that though YCSB is spinning up many clients, only a handful (possibly just 1 or 2) in each worker are doing any real work, which will likely impact throughput. This point is somewhat contested: Sharath has investigated and believes YCSB is distributing the workload just fine. Nonetheless, I'd like to find time to spin up YCSB locally and investigate further.