  Distributed Transactions Java / TXNJ-61

YCSB: DocumentAlreadyInTransaction exceptions while running concurrent transactions with Durability Level None


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not a Bug
    • Affects Version/s: 1.0.0-alpha.4
    • Fix Version/s: None
    • Component/s: None

    Description

      Getting "com.couchbase.transactions.error.attempts.DocumentAlreadyInTransaction " exceptions while running concurrent transactions with durability level NONE 

      Detailed log attached - worker_172.23.97.251.log_dur_level_none.zip

      Error Snippet :

      Transaction logger:24/Thread-3/03f38a13-e728-4989-8dee-aa2115a16f7c/345d0f15-7aa2-4b88-9c80-ef41aa57737f caught exception 'com.couchbase.transactions.error.attempts.DocumentAlreadyInTransaction: Document usertable:user4441073806199749893 is already in a transaction' in asyncInternal, rethrowing to rollback

      Transaction logger:24/Thread-3/03f38a13-e728-4989-8dee-aa2115a16f7c/345d0f15-7aa2-4b88-9c80-ef41aa57737f com.couchbase.transactions.AttemptContextReactive.checkAndHandleBlockingTxn(AttemptContextReactive.java:849)

      com.couchbase.transactions.AttemptContextReactive.lambda$null$17(AttemptContextReactive.java:401)

      reactor.core.publisher.MonoDefer.subscribe(MonoDefer.java:44)

      reactor.core.publisher.MonoPeek.subscribe(MonoPeek.java:71)

      reactor.core.publisher.MonoDoFinally.subscribe(MonoDoFinally.java:47)

      reactor.core.publisher.MonoDefer.subscribe(MonoDefer.java:52)

      reactor.core.publisher.Mono.block(Mono.java:1473)

      com.couchbase.transactions.AttemptContext.replace(AttemptContext.java:104)

      com.yahoo.ycsb.db.couchbase3.Couchbase3Client.lambda$transactionContext$0(Couchbase3Client.java:194)

      com.couchbase.transactions.TransactionsReactive.lambda$null$17(TransactionsReactive.java:356)

      reactor.core.publisher.MonoRunnable.subscribe(MonoRunnable.java:40)

      reactor.core.publisher.MonoOnErrorResume.subscribe(MonoOnErrorResume.java:44)

      reactor.core.publisher.MonoPeek.subscribe(MonoPeek.java:71)

      reactor.core.publisher.MonoPeek.subscribe(MonoPeek.java:71)

      reactor.core.publisher.Mono.subscribe(Mono.java:3589)

      reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.drain(MonoIgnoreThen.java:172)

      reactor.core.publisher.MonoIgnoreThen.subscribe(MonoIgnoreThen.java:56)

      reactor.core.publisher.MonoDefer.subscribe(MonoDefer.java:52)

      reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.drain(MonoIgnoreThen.java:153)

      reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.ignoreDone(MonoIgnoreThen.java:190)

      reactor.core.publisher.MonoIgnoreThen$ThenIgnoreInner.onComplete(MonoIgnoreThen.java:240)

      reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1478)

      reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:144)

      reactor.core.publisher.FluxPeek$PeekSubscriber.onNext(FluxPeek.java:192)

      reactor.core.publisher.FluxSubscribeOnValue$ScheduledScalar.run(FluxSubscribeOnValue.java:178)

      reactor.core.scheduler.ElasticScheduler$DirectScheduleTask.run(ElasticScheduler.java:292)

      reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:50)

      reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:27)

      java.util.concurrent.FutureTask.run(FutureTask.java:266)

      java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)

      java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)

      java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)

      java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

      java.lang.Thread.run(Thread.java:748)
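
      For context, the Couchbase3Client.lambda$transactionContext$0 and AttemptContext.replace frames above correspond to a blocking get-then-replace inside a transaction lambda. A minimal sketch of that calling pattern follows; it assumes the alpha-era blocking API, and the signatures are approximate rather than copied from the YCSB source:

      import com.couchbase.client.java.Cluster;
      import com.couchbase.client.java.Collection;
      import com.couchbase.client.java.json.JsonObject;
      import com.couchbase.transactions.TransactionDurabilityLevel;
      import com.couchbase.transactions.TransactionJsonDocument;
      import com.couchbase.transactions.Transactions;
      import com.couchbase.transactions.config.TransactionConfigBuilder;

      public class TxnReplaceSketch {
          public static void main(String[] args) {
              Cluster cluster = Cluster.connect("127.0.0.1", "Administrator", "password");
              Collection usertable = cluster.bucket("bucket-1").defaultCollection();

              Transactions transactions = Transactions.create(cluster,
                      TransactionConfigBuilder.create()
                              // Matches the reported workload: durability disabled.
                              .durabilityLevel(TransactionDurabilityLevel.NONE)
                              .build());

              transactions.run(ctx -> {
                  // ctx.replace() is the call visible in the trace above
                  // (AttemptContext.replace -> checkAndHandleBlockingTxn).
                  TransactionJsonDocument doc = ctx.get(usertable, "usertable:user4441073806199749893");
                  ctx.replace(doc, JsonObject.create().put("field0", "updated"));
              });

              transactions.close();
              cluster.disconnect();
          }
      }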

      Attachments

        Issue Links


          Activity

            sharath.sulochana Sharath Sulochana (Inactive) created issue -
            sharath.sulochana Sharath Sulochana (Inactive) made changes -
            Field Original Value New Value
            Link This issue finishes with TXNJ-55 [ TXNJ-55 ]
            graham.pople Graham Pople added a comment -

            Looking at this now.  DocumentAlreadyInTransaction errors shouldn't make it back to the app, the txn should automatically retry.

            graham.pople Graham Pople added a comment -

            Looks like the 'lost txns' cleanup process is malfunctioning.  There are a couple of existing tickets indicating something's going wrong with that process, e.g. TXNJ-59.

            In detail:

            • The txn gets document usertable:user4441073806199749893
            • It finds it's already in a txn (e.g. it has metadata set)
            • It retries several times over a 1 second period
            • Then gets suspicious of that blocking txn, which is starting to look like a lost txn, and fetches its ATR entry
            • The logic here is: if the ATR entry is removed, then the 'lost txns' cleanup process has found it, and it's safe to continue overwriting the doc.  Else, it leaves it alone
            • But, it finds the ATR entry, and stays blocked

            It would be very helpful if you could follow the guidance on TXNJ-59 on obtaining cleanup logging.  That'll give me some hints why the cleanup process is failing here.
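
            To make the proceed-versus-stay-blocked decision above concrete, here is a self-contained toy model; the names are hypothetical and this is not the library's internal code (the real logic lives in AttemptContextReactive.checkAndHandleBlockingTxn):

            import java.util.Optional;

            public class BlockingTxnCheck {
                enum Decision { PROCEED, STAY_BLOCKED }

                // atrEntry is empty when the 'lost txns' cleanup process has removed it.
                static Decision check(boolean docHasTxnMetadata, Optional<String> atrEntry) {
                    if (!docHasTxnMetadata) return Decision.PROCEED;    // doc not in a txn
                    if (!atrEntry.isPresent()) return Decision.PROCEED; // blocker was cleaned up
                    return Decision.STAY_BLOCKED;                       // ATR entry found: leave the doc alone
                }

                public static void main(String[] args) {
                    // The failing case in this ticket: metadata set, ATR entry still present.
                    System.out.println(check(true, Optional.of("atr-63-#9a"))); // STAY_BLOCKED
                    System.out.println(check(true, Optional.empty()));          // PROCEED
                }
            }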

            graham.pople Graham Pople made changes -
            Status New [ 10003 ] Open [ 1 ]
            graham.pople Graham Pople added a comment -

            Under TXNJ-62 I've added a ton of tests for cleanup, which I hoped was going to find some bug that would account for the cleanup process failing here - but no joy, they all pass.

            So, I'll need logging to be able to progress this one.  Could you please follow the guidance on TXNJ-59?

            graham.pople Graham Pople made changes -
            Assignee Graham Pople [ graham.pople ] Sharath Sulochana [ sharath.sulochana ]
            graham.pople Graham Pople added a comment -

            I've:

            a) made cleanup a lot more robust in the face of transient errors.  This will very likely fix the issue here.

            b) added an events mechanism that notifies you about successful and failed cleanup attempts, along with all logs related to them.  So if your issue isn't solved, hopefully this will give some clues as to why not.

            c) started publishing transactions snapshot jars to a central repo so you can get these changes right now.

            Please see TXNJ-59 for details.


            sharath.sulochana Sharath Sulochana (Inactive) added a comment - edited

            Graham Pople -

            are these changes available in couchbase-transactions-1.0.0-alpha.4.jar ?

            graham.pople Graham Pople added a comment -

            (For the record, I am working with Sharath online to get them up and running with txns alpha.5-SNAPSHOT in YCSB.)

            sharath.sulochana Sharath Sulochana (Inactive) made changes -
            Attachment ycsb_alpha5_snapshot_error.log [ 70043 ]

            sharath.sulochana Sharath Sulochana (Inactive) added a comment -

            Graham Pople -

            I have changed the ycsb pom.xml file to the latest transactions snapshot (1.0.0-alpha.5-SNAPSHOT) and the client to 3.0.0-alpha.4.

            I still see "DocumentAlreadyInTransaction" errors. Please find the attached log for more details.

            ycsb_alpha5_snapshot_error.log

            sharath.sulochana Sharath Sulochana (Inactive) made changes -
            Assignee Sharath Sulochana [ sharath.sulochana ] Graham Pople [ graham.pople ]
            ritam.sharma Ritam Sharma added a comment -

            Sharath Sulochana - Can you try with the latest snapshot? The issue seems to be fixed in functional testing.

            graham.pople Graham Pople added a comment -

            I checked the trace and it looks like the issue has not changed, e.g. the lost txns cleanup process has not cleaned up the blocking expired txn. Under TXNJ-59 I've been doing a lot of work in this area, improving the robustness of the process, adding an event bus logging mechanism to give the app more insight, and improving the cleanup logging. Could you please follow the info on that ticket to 1. enable cleanup logging and 2. add event bus logging? That should give me some clues.

            graham.pople Graham Pople added a comment -

            Also this issue looks really similar to TXNJ-59 and is very likely to be fixed when I address that.

            graham.pople Graham Pople added a comment - - edited

            I've done further work under TXNJ-47 today to improve the robustness of cleaning up lost txns.  There are now no known issues with cleanup and I've done a fair bit of work in this area, so it would be great if you could retest to see if this is resolved.  All fixes are in transactions 1.0.0-alpha.5-SNAPSHOT, so if you're using that already you just need 'gradle --refresh-dependencies' to pick up the latest and greatest.

            graham.pople Graham Pople made changes -
            Assignee Graham Pople [ graham.pople ] Sharath Sulochana [ sharath.sulochana ]

            sharath.sulochana Sharath Sulochana (Inactive) added a comment -

            Graham Pople - YCSB uses Maven. I believe it should pick up the latest?


            ingenthr Matt Ingenthron added a comment -

            Sharath Sulochana: Maven doesn't have a concept of latest directly; it uses specific versions in canonical usage. It has the concept of a -SNAPSHOT in many implementations, but there is some caching, and you often need to prompt the build tool (Maven, in YCSB's case) to refresh to the latest snapshot during development. See https://stackoverflow.com/questions/5901378/what-exactly-is-a-maven-snapshot-and-why-do-we-need-it

            graham.pople Graham Pople added a comment -

            Just to add to what Matt says, "mvn clean install -U" is your pal for forcing maven to grab the latest snapshot.
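
            For reference, here is a sketch of the relevant ycsb pom.xml wiring for picking up a snapshot build; the repository URL is a placeholder (use the snapshot repo referenced in TXNJ-59), and the artifact coordinates are assumed rather than copied from the YCSB tree:

            <!-- Snapshot repository; the URL below is a placeholder. -->
            <repositories>
              <repository>
                <id>couchbase-snapshots</id>
                <url>https://REPLACE-WITH-SNAPSHOT-REPO/</url>
                <snapshots><enabled>true</enabled></snapshots>
              </repository>
            </repositories>

            <!-- Assumed coordinates for the transactions library. -->
            <dependency>
              <groupId>com.couchbase.client</groupId>
              <artifactId>couchbase-transactions</artifactId>
              <version>1.0.0-alpha.5-SNAPSHOT</version>
            </dependency>

            Running "mvn clean install -U" then forces Maven to re-check the remote repository for a newer snapshot instead of trusting its local cache.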


            sharath.sulochana Sharath Sulochana (Inactive) added a comment - edited

            Sure. I will take a look at it today.

            sharath.sulochana Sharath Sulochana (Inactive) made changes -
            Attachment worker_172.23.97.253.log.zip [ 70937 ]

            sharath.sulochana Sharath Sulochana (Inactive) added a comment -

            Graham Pople -

            I was able to run it with the latest transactions snapshot (1.0.0-alpha.5-SNAPSHOT) & java-client (3.0.0-alpha.5).

            I see the log messages below (which eventually end in a DocumentAlreadyInTransaction exception).

            I am not sure if these are expected messages. Can you take a look? worker_172.23.97.253.log.zip

            Log snippet:

            Transaction logger:14:55:01.867 32/Thread-11/e1adb/fc09d getting doc usertable:user4441073806199749893

            Transaction logger:14:55:01.868 32/Thread-11/e1adb/fc09d completed get of TransactionJsonDocument{id=usertable:user4441073806199749893,cas=1563314101869740032,status=NORMAL,bucket=bucket-1,coll=_default,links=TransactionLinks{atr=none,atrBkt=none,atrColl=none,ver=none}} in 0ms

            Transaction logger:14:55:01.868 32/Thread-11/e1adb/fc09d replace doc TransactionJsonDocument{id=usertable:user4441073806199749893,cas=1563314101869740032,status=NORMAL,bucket=bucket-1,coll=_default,links=TransactionLinks{atr=none,atrBkt=none,atrColl=none,ver=none}}

            Transaction logger:14:55:01.868 32/Thread-11/e1adb/fc09d First mutated doc in txn is 'usertable:user4441073806199749893' on vbucket 63, so using atr atr-63-#9a

            Transaction logger:14:55:01.868 32/Thread-11/e1adb/fc09d about to replace doc usertable:user4441073806199749893 with cas 1563314101869740032

            Transaction logger:14:55:01.869 32/Thread-11/e1adb/fc09d caught exception 'com.couchbase.client.core.error.CASMismatchException: Document with key [usertable:user4441073806199749893] has been concurrently modified' in runBlocking, rethrowing

            Transaction logger:14:55:01.869 32/Thread-11/e1adb/<> hit error TransactionWrappedExceptionNeedsRetry(com.couchbase.client.core.error.CASMismatchException: Document with key [usertable:user4441073806199749893] has been concurrently modified)

            Transaction logger:14:55:01.875 32/Thread-11/e1adb/<> retrying on error 'TransactionWrappedExceptionNeedsRetry(com.couchbase.client.core.error.CASMismatchException: Document with key [usertable:user4441073806199749893] has been concurrently modified)'

             

            graham.pople Graham Pople added a comment -

            Sharath Sulochana you must have something else modifying usertable:user4441073806199749893 concurrently to get a CAS mismatch.  If it's another transaction, that's fine: the two transactions will detect the write-write conflict (you'll see it as either DocumentAlreadyInTransaction or CASMismatchException in the logs), and one will retry.  Both transactions should ultimately succeed.

            If you're doing a regular write (e.g. a KV replace or a N1QL DML update) to that doc while also transactionally updating, that is not permitted and must be prevented by the application.

            It's worth implementing the code in TXNJ-83.  This will log any events from transactions, and one of them will alert you if you're doing regular writes together with transactional ones.
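
            A minimal sketch of hooking those events, assuming the java-client 3.x event bus API (cluster.environment().eventBus()); the exact transaction event classes from TXNJ-83 may differ in the alpha builds, so this simply filters on the event category seen in the logs above:

            import com.couchbase.client.core.cnc.Event;
            import com.couchbase.client.java.Cluster;

            public class TxnEventLogger {
                public static void main(String[] args) {
                    Cluster cluster = Cluster.connect("127.0.0.1", "Administrator", "password");

                    // Print every event the transactions library publishes, e.g. cleanup
                    // attempts, or warnings about non-transactional writes to in-txn docs.
                    cluster.environment().eventBus().subscribe(event -> {
                        if (event.category().startsWith("com.couchbase.transactions")) {
                            System.out.println(event.category() + ": " + event.description());
                        }
                    });
                }
            }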

            graham.pople Graham Pople made changes -
            Assignee Sharath Sulochana [ sharath.sulochana ] Graham Pople [ graham.pople ]

            graham.pople Graham Pople added a comment -

            Sharath requested I take another look at this one, and on going through the full logs rather than the log snippet, I found at least one txn that was blocked by a document already being involved in another txn.  My suspicion now is that the cleanup process is not finding or cleaning up these blocking txns, but I need more information to diagnose it fully.

            Sharath will send me over instructions on how I can run this YCSB workload locally, hopefully I'll be able to replicate the problem and investigate further.  If not I may need additional cleanup logging.


            sharath.sulochana Sharath Sulochana (Inactive) added a comment - edited

            Graham Pople -

            I continued with some more runs using different concurrent client thread/worker counts. While the above issue still exists at higher concurrency (>60 concurrent client requests), I have some performance numbers for runs that completed without failures or partial failures at lower concurrency.

             

            Test1:  workload_Instance 1, workers 20, transaction distribution - 4 docs, 75% Read, 25% Update

            Failures - 0

            Throughput : 608 transactions/sec

            ops : ~10K-12K ops/sec

            CPU utilization: ~55 % 

            Stats : http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=hebe_650-3784_access_4dde

            JobID : http://perf.jenkins.couchbase.com/job/hebe-txn/6/

             

            Test2:  workload_Instance 1, workers 30, transaction distribution - 4 docs, 75% Read, 25% Update

            Failures - 35

            Throughput : 521 transactions/sec

            ops : ~10K - 12K ops/sec

            CPU utilization: ~55% 

            Stats: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=hebe_650-3784_access_b03d

            JobID : http://perf.jenkins.couchbase.com/job/hebe-txn/9

             

            Test3:  workload_Instance 2, workers 20, transaction distribution - 4 docs, 75% Read, 25% Update

            Failures - 38

            Throughput : 472 transactions/sec

            ops : ~10K-12k ops/sec

            CPU utilization: ~55% 

            Stats: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=hebe_650-3784_access_6b6f

            JobID : http://perf.jenkins.couchbase.com/job/hebe-txn/10

             

            CPU comparison graph for above three tests :

            http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=hebe_650-3784_access_b03d&label=test2&snapshot=hebe_650-3784_access_4dde&label=test1&snapshot=hebe_650-3784_access_6b6f&label=test3

            • This info is not crucial for triaging the issue mentioned earlier; it is to give an idea of the performance of transactions under a smaller concurrent load.
            • But analyzing the worker logs in test2 & test3 would be useful in eventually getting to the blocking issue (suggestion).
            graham.pople Graham Pople added a comment -

            Sharath Sulochana thanks for sharing those performance figures.  Some additional context for others is that there are 4 documents in each txn (3 reads, 1 read+replace), and durability is disabled.  The initial focus for transactions is a stable first release, so there are actually quite a few performance improvements that can be made later to punch those figures up.

            On the issue, Sharath has helped me get YCSB running locally, and I can reproduce the issue.  I'll aim to have at least some insight and hopefully a fix either today or tomorrow.

            graham.pople Graham Pople added a comment -

            Have found the bug: there seems to be a code-path that causes the lost transactions thread to bail out prematurely:

            Jul 18, 2019 5:09:47 PM com.couchbase.client.core.cnc.LoggingEventConsumer$JdkLogger warn
            WARNING: [com.couchbase.transactions.cleanup][TransactionLogEvent] default/_default/clientId=63ed5 Got error 'com.couchbase.client.core.error.TemporaryFailureException: TEMPORARY_FAILURE' while getting ATR default/_default/atr-606-#d/
            Jul 18, 2019 5:09:47 PM com.couchbase.client.core.cnc.LoggingEventConsumer$JdkLogger info
            INFO: [com.couchbase.transactions.cleanup][TransactionLogEvent] Client 63ed5 default/_default lost transactions thread has ended
            

            graham.pople Graham Pople added a comment -

            Seeing several things while investigating this issue:

            1. Sharath Sulochana you're running one Transactions object per thread.  Each Transactions object is doing an expensive background cleanup polling operation, so this isn't good; please refactor the app to use a single Transactions object (see the sketch after this comment).  (Side note: hopefully this will also improve performance.)
            2. Starting up multiple Transactions objects at once like this unveiled a race where the lost transactions thread can bail out early.  I've logged TXNJ-90 for this, but it's relatively minor as I think it's much less likely to happen in production (it would need apps to start at precisely the same time).
            3. I think my assessment yesterday is incorrect.  I tested, and the lost transactions thread should and does bounce back from that error at that place.  I think the "lost transactions thread has ended" message is just coincidentally at the same time, and is actually caused by (2) above.

            I'm still trying to figure out the actual problem(s) here.  There's some evidence of TXNJ-88 - that txns are being cleaned up by the lost txns process rather than the regular cleanup process, which shouldn't be happening.  There also seem to be some other weirdnesses going on.  Investigation continues.
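
            As a sketch of point 1, assuming the alpha-era API (Transactions.create plus TransactionConfigBuilder), one process-wide instance might look like this:

            import com.couchbase.client.java.Cluster;
            import com.couchbase.transactions.TransactionDurabilityLevel;
            import com.couchbase.transactions.Transactions;
            import com.couchbase.transactions.config.TransactionConfigBuilder;

            public class SharedTransactions {
                private static final Cluster CLUSTER =
                        Cluster.connect("127.0.0.1", "Administrator", "password");

                // One Transactions object for the whole process: each instance runs its
                // own background cleanup polling, so per-thread instances multiply that cost.
                public static final Transactions TRANSACTIONS = Transactions.create(CLUSTER,
                        TransactionConfigBuilder.create()
                                .durabilityLevel(TransactionDurabilityLevel.NONE)
                                .build());
            }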

            graham.pople Graham Pople made changes -
            Summary DocumentAlreadyInTransaction exceptions while running concurrent transactions with Durability Level None YCSB: DocumentAlreadyInTransaction exceptions while running concurrent transactions with Durability Level None

            sharath.sulochana Sharath Sulochana (Inactive) added a comment -

            Graham Pople Thanks for the summary. You are right, we are using the zipfian requestdistribution for all (most) performance tests.  The main reason for using the zipfian/random requestdistribution in transactions is to test the edge-case scenario where a single doc is part of multiple transactions in a concurrent system. I think this is still a highly realistic real-world use case. I am not sure if we are going to handle this behaviour in this release or in a future one (I will leave the PM team to decide on this).

            For now we can run it with the uniform requestdistribution. This will most probably eliminate the blocking issues from perf tests. We may need to open a separate bug for the next release.


            shivani.gupta Shivani Gupta added a comment -

            Sharath Sulochana, as you mention, using the zipfian distribution for transactions represents an edge case where the same document is modified concurrently in a large number of transactions. This is not a typical real-world scenario. Some transactions will time out and fail - that is to be expected.

            For performance testing, I would recommend using uniform distribution as that is more representative of the real world.

            Graham Pople I agree that you should not embark on non-trivial changes to address an unrealistic performance test - not now, and maybe not for the foreseeable future either, because there are always trade-offs to be made.


            ingenthr Matt Ingenthron added a comment -

            Chiming in here, we certainly discussed (cc Shivani Gupta, Dave Finlay, John Liang, Ravi Mayuram) during the design approach that the target here is one in which there is not a lot of contention on the same document involved in a transaction. Even in a traditional RDBMS (or for that matter OO software design), breaking up this kind of contention is normal if you want to scale.

            I do know that YCSB uses the zipfian distribution and there are some tunables there. I also know from other testing that Michael Nitschinger and I did back in the early days of YCSB at Couchbase that it's problematic[1]. Michael had a good writeup based on profiling about zipfian.

            In any case, since the design target is one in which we don't have a lot of overlapping documents, if the OOTB YCSB is showing, as I seem to see above, 31,000 ops on a single document in the set, we need to relax zipfian or just use an even distribution. That's the design target for the feature.

            1. Slide 18 in this GDrive doc


            sharath.sulochana Sharath Sulochana (Inactive) added a comment -

            Shivani Gupta Matt Ingenthron - Thanks for the clarification.


            sharath.sulochana Sharath Sulochana (Inactive) added a comment -

            Graham Pople

            Recent runs with the uniform requestdistribution look pretty clean, without any errors in the performance cluster. We can actually close this ticket, unless you want to keep it open as a reference for TXNJ-90 or any other investigation.

            Throughput achieved was ~5375 transactions/sec with ~90% CPU utilization while the durability level is set to none. I will run some additional tests before sharing the perf numbers.

            Test:  workload_Instance 6, workers 40, transaction distribution - 4 docs, 75% Read, 25% Update, DurabilityLevel None, TestDur 1200 secs

            Failures - 0

            Throughput : 5375 transactions/sec

            CB CPU utilization: ~90 % 

            http://perf.jenkins.couchbase.com/job/hebe-txn/16/

            graham.pople Graham Pople added a comment -

            Sharath Sulochana that's excellent news, thanks for sharing.  So we've gone from ~500 transactions per sec to >5000?  Fantastic.  I'll look forward to the full figures including with durability enabled.

            One thing I'm very curious about is the current amount of contention in your tests - e.g. how many transactions are retrying due to write-write conflicts, and how many retries they are doing.  You could check this by looking at how many attempts there are in the 'attempts' field of the returned 'TransactionResult'.  Perhaps the % of contention could be expressed as (number of retry attempts / number of total attempts); see the sketch below.
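
            A minimal sketch of that calculation, assuming TransactionResult.attempts() exposes the per-transaction attempt list; every attempt beyond the first counts as a retry:

            import java.util.List;
            import com.couchbase.transactions.TransactionResult;

            public class ContentionStats {
                // % contention = retry attempts / total attempts, per the suggestion above.
                static double contentionPercent(List<TransactionResult> results) {
                    long totalAttempts = results.stream().mapToLong(r -> r.attempts().size()).sum();
                    long retries = totalAttempts - results.size(); // each txn's first attempt is not a retry
                    return totalAttempts == 0 ? 0.0 : 100.0 * retries / totalAttempts;
                }
            }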

            I'll close this ticket out since we've addressed the issue here.  Is there a more generic 'transactions performance testing' ticket we can use for further discussion?

            graham.pople Graham Pople made changes -
            Resolution Not a Bug [ 10200 ]
            Status Open [ 1 ] Resolved [ 5 ]
            graham.pople Graham Pople made changes -
            Actual End 2019-07-22 03:34 (issue has been resolved)

            sharath.sulochana Sharath Sulochana (Inactive) added a comment -

            Closing this issue

            sharath.sulochana Sharath Sulochana (Inactive) made changes -
            Status Resolved [ 5 ] Closed [ 6 ]

            People

              graham.pople Graham Pople
              sharath.sulochana Sharath Sulochana (Inactive)
              Votes: 0
              Watchers: 8

