Couchbase Server: MB-35359

Temporary failure exception is seen on a steady-state cluster for ephemeral SyncWrites


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 6.5.0
    • Fix Version/s: 6.5.0
    • Component/s: couchbase-bucket
    • Labels: None

    Description

      1. Create 2 ephemeral buckets.
      2. Create a transaction with durability level MAJORITY.
      3. Load 1000 documents through the transaction.
      4. Do a non-transactional update on a few documents.
      5. Delete a few documents through a transaction.

      The last delete fails with TemporaryFailureException, and the transaction retries continuously until it expires.

      Attachments

        1. temp_fail.pcapng (47.97 MB)
        2. test.log (8.83 MB)
        3. tmpfailureexception.pcapng (7.98 MB)

        Activity

          Dave Rigby added a comment:

          > Dave Rigby, one thing I'm confused by - if I understand you correctly, SyncWriteInProgress errors on subdoc will be automatically retried by KV (up to 100 times, then TempFail). I.e. I shouldn't expect to see SyncWriteInProgress errors returned from the server on subdoc writes - but I do.

          In the most recent pcap (tmpfailureexception.pcapng) there are zero SyncWriteInProgress status codes.
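          The server-side behaviour described in the quoted comment (KV retrying a subdoc write that hits SyncWriteInProgress, up to 100 times, before giving up with TempFail) can be modelled roughly as below. This is a hypothetical Java sketch, not the server's actual implementation: `Status`, `SubdocRetryModel` and `writeWithRetry` are illustrative names, and the real server may count or pace its retries differently.

```java
import java.util.function.Supplier;

/** Hypothetical status codes standing in for the KV responses discussed in this ticket. */
enum Status { SUCCESS, SYNC_WRITE_IN_PROGRESS, TEMP_FAIL }

final class SubdocRetryModel {
    // Retry budget quoted in the comment; assumed, not taken from server source.
    static final int MAX_RETRIES = 100;

    /**
     * Re-attempts a subdoc write while the key still has a SyncWrite in flight.
     * One initial attempt plus MAX_RETRIES retries; if every attempt reports
     * SYNC_WRITE_IN_PROGRESS, the caller sees TEMP_FAIL instead.
     */
    static Status writeWithRetry(Supplier<Status> attempt) {
        for (int i = 0; i <= MAX_RETRIES; i++) {
            Status s = attempt.get();
            if (s != Status.SYNC_WRITE_IN_PROGRESS) {
                return s; // success or any other error is passed straight through
            }
        }
        return Status.TEMP_FAIL; // budget exhausted
    }
}
```

          Under this model a client would only ever see TempFail, never SyncWriteInProgress, for a write that goes through the retry loop, which is why seeing SyncWriteInProgress on the wire was surprising.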

          Graham Pople added a comment:

          > In the most recent pcap (tmpfailureexception.pcapng) there's zero SyncWriteInProgress status codes.

          (For others' reference: in an offline chat we established that SyncWriteInProgress has been seen in other transaction tickets, and that it happens when a CAS has been specified for the subdoc write. Transactions does a mix of CAS and non-CAS subdoc writes.)

          Dave Rigby added a comment:

          Based on discussions with Graham Pople I think the ETmpFail exceptions should be auto-retried on newer TX / Java SDK versions.

          Note also that there are some sizing considerations here: the number of concurrent transactions is bounded by the number of ATRs in use. At some point clients won't be able to write to an ATR, as it will constantly be in "SyncWriteInProgress".

          Anitha Kuberan Can you please re-run with beta.2 or newer and see if this issue still exists?

          Graham Pople added a comment:

          Indeed, in beta.2+ the TempFail is retried immediately (as opposed to restarting the txn), which will probably help. I'm looking at changing the default number of ATRs in TXNJ-122.
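          The difference Graham describes, retrying the failed operation in place instead of restarting the whole transaction, can be sketched as follows. This is a hypothetical model, not the Couchbase transactions library API: `OpResult`, `TxnStepRetry` and `runStep` are illustrative names, and the fixed `maxAttempts` budget stands in for the transaction's real expiry time.

```java
import java.util.function.Supplier;

/** Hypothetical outcome of a single transactional operation. */
enum OpResult { OK, TEMP_FAIL }

final class TxnStepRetry {
    /**
     * Retries one operation in place on TEMP_FAIL, keeping the work the
     * transaction has already done, until it succeeds or the budget runs out.
     * Returns false when the budget is exhausted (the caller would treat this
     * like transaction expiry).
     */
    static boolean runStep(Supplier<OpResult> op, int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            if (op.get() == OpResult.OK) {
                return true;
            }
            // TEMP_FAIL: retry just this operation; no rollback or restart here.
        }
        return false;
    }
}
```

          Retrying only the failing step avoids redoing every earlier write in the transaction, which in turn reduces how long its ATR entry stays busy.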



          Anitha Kuberan added a comment:

          This issue is no longer reproducible with transactions version beta.3, so I am closing it. I will reopen it if we hit the issue again.

          People

            Assignee: Anitha Kuberan
            Reporter: Anitha Kuberan
