Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-40126

Subdoc CreateAsDeleted without any value path can result in non-empty value

    XMLWordPrintable

Details

    • Untriaged
    • Centos 64-bit
    • 1
    • No
    • KV Sprint 2020-June

    Description

      Build: 6.6.0-7834

      Test case:

      ./testrunner -i node.ini threads_to_use=8,rerun=False,skip_cleanup=True,get-cbcollect-info=False,skip_collections_cleanup=True -t Atomicity.doc_isolation.IsolationDocTest.test_staged_doc_read,nodes_init=2,num_items=1,replicas=1,transaction_timeout=30,transaction_commit=True,doc_op=create,operation=beforeAtrPending,GROUP=P0

      Cbcollect logs:
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/transaction_crash/collectinfo-2020-06-24T195940-ns_1%4010.112.191.105.zip
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/transaction_crash/collectinfo-2020-06-24T195940-ns_1%4010.112.191.106.zip
      Scenario:

      • Two node cluster, Couchbase bucket with replica=1
      • Start transaction with expiry of 30 seconds. Forcefully fail the transaction during "beforeAtrPending" and retry

      Expected behavior:

      Failed transaction should succeed without any issues

      Service 'memcached' exited with status 134. Restarting. Messages:
      2020-06-24T12:53:01.819552-07:00 CRITICAL /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f0c401a8000+0x8f213]
      2020-06-24T12:53:01.819559-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f0c3aca6000+0xcccd0]
      2020-06-24T12:53:01.819564-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f0c3aca6000+0xc811a]
      2020-06-24T12:53:01.819570-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f0c3aca6000+0xca523]
      2020-06-24T12:53:01.819576-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f0c3aca6000+0x18f490]
      2020-06-24T12:53:01.819581-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f0c3aca6000+0xcfa4d]
      2020-06-24T12:53:01.819586-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f0c3aca6000+0x12b764]
      2020-06-24T12:53:01.819590-07:00 CRITICAL /opt/couchbase/bin/../lib/libplatform_so.so.0.1.0() [0x7f0c42257000+0x8f17]
      2020-06-24T12:53:01.819594-07:00 CRITICAL /lib64/libpthread.so.0() [0x7f0c3fa73000+0x7df3]
      2020-06-24T12:53:01.819621-07:00 CRITICAL /lib64/libc.so.6(clone+0x6d) [0x7f0c3f6b2000+0xf61ad]

      Attaching test.log and  trans_expired_with_d_ambiguous.pcapng for further debugging.

      Attachments

        1. _7845_client-op.png
          _7845_client-op.png
          390 kB
        2. _7845_invalid-prepare.png
          _7845_invalid-prepare.png
          381 kB
        3. _dcp-prepare-png.png
          _dcp-prepare-png.png
          346 kB
        4. _no non-subdoc fronend op.png
          _no non-subdoc fronend op.png
          33 kB
        5. _subdoc ops only against 105.png
          _subdoc ops only against 105.png
          212 kB
        6. 6.6.0-7845_client_run_1.pcapng
          1.74 MB
        7. 6.6.0-7845_node_run_1.pcap
          4.17 MB
        8. test.log
          33 kB
        9. test.pcap
          20.11 MB
        10. trans_expired_with_d_ambiguous_new.pcapng
          2.06 MB
        11. trans_expired_with_d_ambiguous.pcapng
          2.09 MB

        Issue Links

          For Gerrit Dashboard: MB-40126
          # Subject Branch Project Status CR V

          Activity

            ashwin.govindarajulu Ashwin Govindarajulu created issue -
            owend Daniel Owen added a comment -

            On node 106 see:

            2020-06-24T12:20:41.950253-07:00 ERROR 52: (default) DCP (Producer) eq_dcpq:replication:ns_1@10.112.191.106->ns_1@10.112.191.105:default - DcpProducer::handleResponse disconnecting, received unexpected response:{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"keylen":0,"magic":"ClientResponse","opaque":462,"opcode":"DCP_PREPARE","status":"Invalid arguments"}
            

            then later on see:

            2020-06-24T12:20:46.478481-07:00 ERROR (default) VBucket::abort (vb:971) - active failed as no HashTableitem found with key:<ud>cid:0x0:test_docs-0</ud>
            2020-06-24T12:20:46.478566-07:00 CRITICAL *** Fatal error encountered during exception handling ***
            2020-06-24T12:20:46.478596-07:00 CRITICAL Caught unhandled std::exception-derived exception. what(): ActiveDurabilityMonitor::abort vb:971 failed with status:1
            

            owend Daniel Owen added a comment - On node 106 see: 2020-06-24T12:20:41.950253-07:00 ERROR 52: (default) DCP (Producer) eq_dcpq:replication:ns_1@10.112.191.106->ns_1@10.112.191.105:default - DcpProducer::handleResponse disconnecting, received unexpected response:{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"keylen":0,"magic":"ClientResponse","opaque":462,"opcode":"DCP_PREPARE","status":"Invalid arguments"} then later on see: 2020-06-24T12:20:46.478481-07:00 ERROR (default) VBucket::abort (vb:971) - active failed as no HashTableitem found with key:<ud>cid:0x0:test_docs-0</ud> 2020-06-24T12:20:46.478566-07:00 CRITICAL *** Fatal error encountered during exception handling *** 2020-06-24T12:20:46.478596-07:00 CRITICAL Caught unhandled std::exception-derived exception. what(): ActiveDurabilityMonitor::abort vb:971 failed with status:1
            graham.pople Graham Pople added a comment -

            I don't think this has anything to do with transactions.  Looking at the transaction logs, I see a couple of TempFailExceptions which I think is Ashwin doing some fault injecting - e.g. there has not been any cluster interaction so far. 

            These TempFails cause the transaction to retry a couple of times and then on the third attempt it tries to stage an insert (which is a KV subdoc op using the new CreateAsDeleted flag), and gets back a DurabilityAmbiguousException error from the server.  Ashwin's preliminary investigation indicates this is due to a memcached node crash.

            graham.pople Graham Pople added a comment - I don't think this has anything to do with transactions.  Looking at the transaction logs, I see a couple of TempFailExceptions which I think is Ashwin doing some fault injecting - e.g. there has not been any cluster interaction so far.  These TempFails cause the transaction to retry a couple of times and then on the third attempt it tries to stage an insert (which is a KV subdoc op using the new CreateAsDeleted flag), and gets back a DurabilityAmbiguousException error from the server.  Ashwin's preliminary investigation indicates this is due to a memcached node crash.
            owend Daniel Owen added a comment -

            On 105 we see:

             WARNING 51: (default) DCP (Consumer) eq_dcpq:replication:ns_1@10.112.191.106->ns_1@10.112.191.105:default - DcpConsumer::prepare: (vb:971) Value cannot contain a body for SyncDelete
            

            Hey Paolo Cocchi Could you take a look? - seems to be replicated to the transparent transaction changes.

            owend Daniel Owen added a comment - On 105 we see: WARNING 51: (default) DCP (Consumer) eq_dcpq:replication:ns_1@10.112.191.106->ns_1@10.112.191.105:default - DcpConsumer::prepare: (vb:971) Value cannot contain a body for SyncDelete Hey Paolo Cocchi Could you take a look? - seems to be replicated to the transparent transaction changes.
            owend Daniel Owen made changes -
            Field Original Value New Value
            Assignee Daniel Owen [ owend ] Paolo Cocchi [ paolo.cocchi ]
            paolo.cocchi Paolo Cocchi added a comment -

            Hi Ashwin Govindarajulu,
            the 98715_trans_expired_with_d_ambiguous.pcapng dump doesn't seem to cover the test entire test window, as no message shows up by filtering by 'couchbase.opcode == 0x60' (ie, DcpPrepare).
            Could check the dump please? Could you provide a dump where the Prepare is visible please?
            Thank you, Paolo

            paolo.cocchi Paolo Cocchi added a comment - Hi Ashwin Govindarajulu , the 98715_trans_expired_with_d_ambiguous.pcapng dump doesn't seem to cover the test entire test window, as no message shows up by filtering by 'couchbase.opcode == 0x60' (ie, DcpPrepare). Could check the dump please? Could you provide a dump where the Prepare is visible please? Thank you, Paolo
            ashwin.govindarajulu Ashwin Govindarajulu made changes -
            ashwin.govindarajulu Ashwin Govindarajulu added a comment - - edited Hi Paolo Cocchi , Please check this file -> test.pcap and respective server logs: https://cb-jira.s3.us-east-2.amazonaws.com/logs/mc_crash/collectinfo-2020-06-25T110606-ns_1%4010.112.191.105.zip https://cb-jira.s3.us-east-2.amazonaws.com/logs/mc_crash/collectinfo-2020-06-25T110606-ns_1%4010.112.191.106.zip  
            ashwin.govindarajulu Ashwin Govindarajulu made changes -
            Attachment test.pcap [ 98846 ]
            paolo.cocchi Paolo Cocchi made changes -
            Attachment _dcp-prepare-png.png [ 98851 ]
            Attachment _no non-subdoc fronend op.png [ 98852 ]
            Attachment _subdoc ops only against 105.png [ 98853 ]
            ashwin.govindarajulu Ashwin Govindarajulu made changes -
            Attachment 6.6.0-7845_client_run_1.pcapng [ 98856 ]
            Attachment 6.6.0-7845_node_run_1.pcap [ 98857 ]
            ashwin.govindarajulu Ashwin Govindarajulu added a comment - Paolo Cocchi I've rerun the same scenario using 6.6.0-7845 and following are the logs: 6.6.0-7845_node_run_1.pcap 6.6.0-7845_client_run_1.pcapng Cbcollect-logs: https://cb-jira.s3.us-east-2.amazonaws.com/logs/mc_crash_run_1/collectinfo-2020-06-25T135816-ns_1%4010.112.191.105.zip https://cb-jira.s3.us-east-2.amazonaws.com/logs/mc_crash_run_1/collectinfo-2020-06-25T135816-ns_1%4010.112.191.106.zip
            paolo.cocchi Paolo Cocchi added a comment - - edited

            Analysis on log set:
            test.pcap
            https://cb-jira.s3.us-east-2.amazonaws.com/logs/mc_crash/collectinfo-2020-06-25T110606-ns_1%4010.112.191.105.zip
            https://cb-jira.s3.us-east-2.amazonaws.com/logs/mc_crash/collectinfo-2020-06-25T110606-ns_1%4010.112.191.106.zip

             

            memcached logs:

            106/Producer

            2020-06-25T03:34:50.601436-07:00 ERROR 52: (default) DCP (Producer) eq_dcpq:replication:ns_1@10.112.191.106->ns_1@10.112.191.105:default - DcpProducer::handleResponse disconnecting, received unexpected response:{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"keylen":0,"magic":"ClientResponse","opaque":462,"opcode":"DCP_PREPARE","status":"Invalid arguments"}
            2020-06-25T03:34:59.693855-07:00 ERROR (default) VBucket::abort (vb:971) - active failed as no HashTableitem found with key:<ud>cid:0x0:test_docs-0</ud>
            

            106/Consumer

            2020-06-25T03:34:50.599888-07:00 WARNING 51: (default) DCP (Consumer) eq_dcpq:replication:ns_1@10.112.191.106->ns_1@10.112.191.105:default - DcpConsumer::prepare: (vb:971) Value cannot contain a body for SyncDelete
            

             

            The latest pcap also contains the DCP flow between the 2 nodes of the cluster (105 and 106).

            I see that all the frontend operations are Subdoc ops:

            Also, the dump shows frontend ops only against node 105:

            We do see the invalid DcpPrepare:

             

            Comments on the latest image:

            • the invalid (with body) DcpPrepare is sent from 106 to 105. That means that we should see at least one SyncWrite against node 106, which is the SW that generates the invalid DcpPrepare
            • no SyncWrite against any node before the DcpPrepare

            That suggests that the pcap dump is still incomplete.

            Ashwin Govindarajulu could you please try to capture a complete pcap where we see all the traffic across the client (IP .1) and all nodes (IPs .105, .106)? That would help considerably for understanding the origin of the invalid DcpPrepare. New log set already uploaded.

             

            Focus on the DcpPrepare payload now:

            • deleted:1 - it is a SyncDelete
            • exptime>0 - it seems that a pending SyncDelete was issued with an expiry!=0 and that it has expired before completion
            • value: why that value fails validation at Consumer is still under investigation
            • durabiliy-timeout: 372 seconds

             

            The following sequence of events may explain the errors in memcached logs:

            1. A client issues a CreateAsDeleted/SyncWrite with DurabilityTimeout=372 and Expiry!=0 against node 106
            2. At this point I would expect a first DcpPrepare to be sent from 106 to 105, but I don't see that on dumps <-- needs to be checked, as seen above the pcap seems incomplete, that may be the reason
            3. Expiry triggers before the SyncWrite is completed (Committed or Aborted)
            4. Expiry marks the Prepare in the HT as deleted and another Prepare item is queued in the CheckpointManager
            5. Another DcpPrepare is sent with exptime!=0 and invalid payload - this is the DcpPrepare that we see in the pcap
            6. The Consumer validates the the payload and returns EINVALID - ERROR: DcpConsumer::prepare: (vb:971) Value cannot contain a body for SyncDelete
            7. The Producer receives EINVALID and disconnects - ERROR: DcpProducer::handleResponse disconnecting
            8. The flusher persists the SyncDelete and the StoredValue is removed from the HashTable <-- This should never happen. Not clear yet if we may end up with that, under investigation
            9. the DurabilityTimeout triggers -> Abort -> ERROR: VBucket::abort (vb:971) - active failed as no HashTable item..

             

            Notes
            It seems that we have 2 problems here:

            • Something is generating an invalid payload for a (sync) delete. Not clear yet what it is, if a client or KV internally (maybe via the expiry path?)
            • VBucket::abort doesn't find a pending Prepare in the HashTable, which should never happen

            Very likely the 2 issues are related but I still don't see the clear link.

            paolo.cocchi Paolo Cocchi added a comment - - edited Analysis on log set: test.pcap https://cb-jira.s3.us-east-2.amazonaws.com/logs/mc_crash/collectinfo-2020-06-25T110606-ns_1%4010.112.191.105.zip https://cb-jira.s3.us-east-2.amazonaws.com/logs/mc_crash/collectinfo-2020-06-25T110606-ns_1%4010.112.191.106.zip   memcached logs: 106/Producer 2020-06-25T03:34:50.601436-07:00 ERROR 52: (default) DCP (Producer) eq_dcpq:replication:ns_1@10.112.191.106->ns_1@10.112.191.105:default - DcpProducer::handleResponse disconnecting, received unexpected response:{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"keylen":0,"magic":"ClientResponse","opaque":462,"opcode":"DCP_PREPARE","status":"Invalid arguments"} 2020-06-25T03:34:59.693855-07:00 ERROR (default) VBucket::abort (vb:971) - active failed as no HashTableitem found with key:<ud>cid:0x0:test_docs-0</ud> 106/Consumer 2020-06-25T03:34:50.599888-07:00 WARNING 51: (default) DCP (Consumer) eq_dcpq:replication:ns_1@10.112.191.106->ns_1@10.112.191.105:default - DcpConsumer::prepare: (vb:971) Value cannot contain a body for SyncDelete   The latest pcap also contains the DCP flow between the 2 nodes of the cluster (105 and 106). I see that all the frontend operations are Subdoc ops: Also, the dump shows frontend ops only against node 105: We do see the invalid DcpPrepare:   Comments on the latest image : the invalid (with body) DcpPrepare is sent from 106 to 105. That means that we should see at least one SyncWrite against node 106, which is the SW that generates the invalid DcpPrepare no SyncWrite against any node before the DcpPrepare That suggests that the pcap dump is still incomplete. Ashwin Govindarajulu could you please try to capture a complete pcap where we see all the traffic across the client (IP .1) and all nodes (IPs .105, .106)? That would help considerably for understanding the origin of the invalid DcpPrepare.  New log set already uploaded.   Focus on the DcpPrepare payload now : deleted:1 - it is a SyncDelete exptime>0 - it seems that a pending SyncDelete was issued with an expiry!=0 and that it has expired before completion value: why that value fails validation at Consumer is still under investigation durabiliy-timeout: 372 seconds   The following sequence of events may explain the errors in memcached logs: A client issues a CreateAsDeleted/SyncWrite with DurabilityTimeout=372 and Expiry!=0 against node 106 At this point I would expect a first DcpPrepare to be sent from 106 to 105, but I don't see that on dumps <-- needs to be checked, as seen above the pcap seems incomplete, that may be the reason Expiry triggers before the SyncWrite is completed (Committed or Aborted) Expiry marks the Prepare in the HT as deleted and another Prepare item is queued in the CheckpointManager Another DcpPrepare is sent with exptime!=0 and invalid payload - this is the DcpPrepare that we see in the pcap The Consumer validates the the payload and returns EINVALID - ERROR: DcpConsumer::prepare: (vb:971) Value cannot contain a body for SyncDelete The Producer receives EINVALID and disconnects - ERROR: DcpProducer::handleResponse disconnecting The flusher persists the SyncDelete and the StoredValue is removed from the HashTable <-- This should never happen. Not clear yet if we may end up with that, under investigation the DurabilityTimeout triggers -> Abort -> ERROR: VBucket::abort (vb:971) - active failed as no HashTable item..   Notes It seems that we have 2 problems here: Something is generating an invalid payload for a (sync) delete. Not clear yet what it is, if a client or KV internally (maybe via the expiry path?) VBucket::abort doesn't find a pending Prepare in the HashTable, which should never happen Very likely the 2 issues are related but I still don't see the clear link.
            graham.pople Graham Pople added a comment - - edited

            > A client issues a CreateAsDeleted/SyncWrite with DurabilityTimeout=372 and Expiry!=0 against node 106

            That... shouldn't be happening.  Transactions does not currently support or set expiry anywhere.

            There's also a complaint from Wireshark in that screenshot that the extras are encoded incorrectly.  Is there a bug with the DCP prepare packet maybe?

            graham.pople Graham Pople added a comment - - edited > A client issues a CreateAsDeleted/SyncWrite with DurabilityTimeout=372 and Expiry!=0 against node 106 That... shouldn't be happening.  Transactions does not currently support or set expiry anywhere. There's also a complaint from Wireshark in that screenshot that the extras are encoded incorrectly.  Is there a bug with the DCP prepare packet maybe?
            drigby Dave Rigby added a comment -

            There's also a complaint from Wireshark in that screenshot that the extras are encoded incorrectly. Is there a bug with the DCP prepare packet maybe?

            ... or that Wireshark hasn't been updated for the recent DCP_PREPARE encoding changes

            drigby Dave Rigby added a comment - There's also a complaint from Wireshark in that screenshot that the extras are encoded incorrectly. Is there a bug with the DCP prepare packet maybe? ... or that Wireshark hasn't been updated for the recent DCP_PREPARE encoding changes
            paolo.cocchi Paolo Cocchi added a comment - - edited

            Hey Ashwin Govindarajulu, has this test ever succeeded? Maybe against an older build?
            These runs are all against some of the latest builds. Could you try to run the test against 6.6.0-7766 please?
            I'm currently checking the latest log set.
            Thanks

            paolo.cocchi Paolo Cocchi added a comment - - edited Hey Ashwin Govindarajulu , has this test ever succeeded? Maybe against an older build? These runs are all against some of the latest builds. Could you try to run the test against 6.6.0-7766 please? I'm currently checking the latest log set. Thanks
            graham.pople Graham Pople added a comment -

            > ... or that Wireshark hasn't been updated for the recent DCP_PREPARE encoding changes 
             
            Sure, but I mean either way, I wouldn't take Wireshark's claim of a non zero expiry field at face value.  I'm definitely not sending that from transactions (well unless there's an SDK encoding bug).

            graham.pople Graham Pople added a comment - > ... or that Wireshark hasn't been updated for the recent DCP_PREPARE encoding changes    Sure, but I mean either way, I wouldn't take Wireshark's claim of a non zero expiry field at face value.  I'm definitely not sending that from transactions (well unless there's an SDK encoding bug).
            paolo.cocchi Paolo Cocchi made changes -
            Attachment _7845_client-op.png [ 98868 ]
            Attachment _7845_invalid-prepare.png [ 98869 ]
            owend Daniel Owen made changes -
            Summary [Transaction DocIsolation]: Memcached crahsed with status 134 resulting in AmbiguousTimeoutException during SubdocMutateRequest [Transaction DocIsolation]: Memcached crashed with status 134 resulting in AmbiguousTimeoutException during SubdocMutateRequest

            We are hitting the same issue. 

            We have a live cluster where we are hitting the issue. Please let me know if you need access to it.

            praneeth.bokka Praneeth Bokka (Inactive) added a comment - We are hitting the same issue.  We have a live cluster where we are hitting the issue. Please let me know if you need access to it.
            ashwin.govindarajulu Ashwin Govindarajulu added a comment - - edited

            Hey Ashwin Govindarajulu, has this test ever succeeded? Maybe against an older build?
            These runs are all against some of the latest builds. Could you try to run the test against 6.6.0-7766 please?

            Hey Paolo Cocchi, Hitting the same issue with build 6.6.0-7766. And the same test works fine if we switch back to Java Transx 1.0.1-SNAPSHOT.jar (Where doc_isolation feature is not supported) on top the same server build 6.6.0-7845.

            ashwin.govindarajulu Ashwin Govindarajulu added a comment - - edited Hey Ashwin Govindarajulu , has this test ever succeeded? Maybe against an older build? These runs are all against some of the latest builds. Could you try to run the test against 6.6.0-7766 please? Hey Paolo Cocchi , Hitting the same issue with build 6.6.0-7766. And the same test works fine if we switch back to Java Transx 1.0.1-SNAPSHOT.jar (Where doc_isolation feature is not supported) on top the same server build 6.6.0-7845.
            drigby Dave Rigby added a comment -

            Ashwin Govindarajulu can you add the necessary steps to Run the test which hits this bug?

            Ideally with info on how to run against cluster_run

            drigby Dave Rigby added a comment - Ashwin Govindarajulu can you add the necessary steps to Run the test which hits this bug? Ideally with info on how to run against cluster_run
            drigby Dave Rigby added a comment -

            Ashwin Govindarajulu apologies, I missed it in the description above! Stand easy

            drigby Dave Rigby added a comment - Ashwin Govindarajulu apologies, I missed it in the description above! Stand easy

            Hey Dave Rigby , the test case is still as a patch set. Please get it from http://review.couchbase.org/c/TAF/+/129957

            ashwin.govindarajulu Ashwin Govindarajulu added a comment - Hey Dave Rigby , the test case is still as a patch set. Please get it from http://review.couchbase.org/c/TAF/+/129957
            drigby Dave Rigby made changes -
            Assignee Paolo Cocchi [ paolo.cocchi ] Dave Rigby [ drigby ]
            drigby Dave Rigby made changes -
            Sprint KV Sprint 2020-June [ 1106 ]
            till Till Westmann made changes -
            Link This issue blocks MB-38724 [ MB-38724 ]
            till Till Westmann made changes -
            Labels Transactions durability functional-test Transactions approved-for-6.6.0 durability functional-test
            ritam.sharma Ritam Sharma made changes -
            Priority Critical [ 2 ] Blocker [ 1 ]
            drigby Dave Rigby made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            drigby Dave Rigby added a comment -

            Focus on the DcpPrepare payload now:

            • deleted:1 - it is a SyncDelete
            • exptime>0 - it seems that a pending SyncDelete was issued with an expiry!=0 and that it has expired before completion

            Note the exptime field in DcpPrepare serves double-duty - for SyncDeletes it is the deletetime (similar to DCP_DELETE).

            drigby Dave Rigby added a comment - Focus on the DcpPrepare payload now: deleted:1 - it is a SyncDelete exptime>0 - it seems that a pending SyncDelete was issued with an expiry!=0 and that it has expired before completion Note the exptime field in DcpPrepare serves double-duty - for SyncDeletes it is the deletetime (similar to DCP_DELETE).
            drigby Dave Rigby made changes -
            Comment [ Issue is caused by a bug in the {{subdoc_fetch}} code when allowing execution to continue even if no existing doc is found (say when using CreateAsDeleted), _if_ the check to see if a tombstone is present needs to go to disk (i.e. the Bloom filter failed to avoid the bgFetch). This explains why this only occurs intermittently.

            In that situation, the subdoc state machine sequence for the multi-mutation from test.pcap looks like:

            # subdoc_fetch -> bucket_get(AliveOrDeleted) -> EventuallyPersistentEngine::get() -> not found in HT and bloom filter cannot determine -> EWOULDBLOCK, schedule bgFetch.
            # subdoc_fetch returns back to runloop with EWOULDBLOCK, waiting for notify_io_complete
            # bgFetch completes, returns KEY_ENOENT (no tombstone on disk) -> notify_IO_complete(fd, KEY_ENOENT).
            # subdoc_fetch called again, checks AsyncIO status code KEY_ENOENT. CreateAsDeleted specified, ok, setup empty document and return true (to continue execution).
            #* *However, the status code {{ret}}) is _not_ modified, is still KEY_ENOENT*.
            # subdoc_execute called, performs requested operations on newly-formed empty doc.
            # subdoc_update called, checks {{ret}}. Upon finding non-successful result returns early with that same status code.

            Fix is to ensure that after subdoc_fetch() succeeds (i.e. is happy with no tombstone being present), then {{ret}} is set to SUCCESS.

            There's also a tangental issue to understand why this was 100% reproducible on Windows - sounds like some bloomfilter bug.
            ]
            owend Daniel Owen made changes -
            Link This issue relates to MB-37374 [ MB-37374 ]
            drigby Dave Rigby added a comment - - edited

            Updated Wireshark to correctly decode XATTRs within a DCP_PREPARE opcode. That gives the following decode of the DCP_PREPARE which was rejected by the DCP Consumer :

            Couchbase Protocol, DCP Prepare Request, Opcode: 0x60, VBucket: 0x3cb
                Magic: Request (0x80)
                Opcode: DCP Prepare (0x60)
                Key Length: 16
                Extras Length: 31
                Data Type: 0x05, JSON, XATTR
                VBucket: 971 (0x03cb)
                [Value Length: 161]
                Total Body Length: 208
                Opaque: 0x000001ce
                CAS: 0x161bc2fe46ec0000
                Extras
                Key: _txn:atr-971-#4b
                XATTR Length: 155
                XATTRs
                    XATTR Pair Length: 151
                        Key: attempts
                        Value: {"82534e38-fb98-475f-a831-a6b9f62d52f1":{"tid":"3135008a-ff17-43fc-a9a8-70ac2363983a","st":"PENDING","tst":"0x0000ec46fec21b16","exp":30000}}
                Value: {}
            

            I suspect the problem here is the "empty but not empty" value - i.e. the value "{}" represents an empty JSON document, but the value from a mcbp pov is not empty.

            drigby Dave Rigby added a comment - - edited Updated Wireshark to correctly decode XATTRs within a DCP_PREPARE opcode. That gives the following decode of the DCP_PREPARE which was rejected by the DCP Consumer : Couchbase Protocol, DCP Prepare Request, Opcode: 0x60, VBucket: 0x3cb Magic: Request (0x80) Opcode: DCP Prepare (0x60) Key Length: 16 Extras Length: 31 Data Type: 0x05, JSON, XATTR VBucket: 971 (0x03cb) [Value Length: 161] Total Body Length: 208 Opaque: 0x000001ce CAS: 0x161bc2fe46ec0000 Extras Key: _txn:atr-971-#4b XATTR Length: 155 XATTRs XATTR Pair Length: 151 Key: attempts Value: {"82534e38-fb98-475f-a831-a6b9f62d52f1":{"tid":"3135008a-ff17-43fc-a9a8-70ac2363983a","st":"PENDING","tst":"0x0000ec46fec21b16","exp":30000}} Value: {} I suspect the problem here is the "empty but not empty" value - i.e. the value "{}" represents an empty JSON document, but the value from a mcbp pov is not empty.
            drigby Dave Rigby made changes -
            Summary [Transaction DocIsolation]: Memcached crashed with status 134 resulting in AmbiguousTimeoutException during SubdocMutateRequest [Transaction DocIsolation]: DcpConsumer::prepare: (vb:971) Value cannot contain a body for SyncDelete
            drigby Dave Rigby made changes -
            Link This issue is duplicated by MB-40167 [ MB-40167 ]
            drigby Dave Rigby made changes -
            Assignee Dave Rigby [ drigby ] Ashwin Govindarajulu [ ashwin.govindarajulu ]
            Resolution Fixed [ 1 ]
            Status In Progress [ 3 ] Resolved [ 5 ]
            drigby Dave Rigby made changes -
            Summary [Transaction DocIsolation]: DcpConsumer::prepare: (vb:971) Value cannot contain a body for SyncDelete Subdoc CreateAsDeleted without any value path can result in non-empty value

            Build couchbase-server-6.6.0-7856 contains kv_engine commit c55541f with commit message:
            MB-40126: subdoc CreateAsDeleted: User value shouldn't be '{}'

            build-team Couchbase Build Team added a comment - Build couchbase-server-6.6.0-7856 contains kv_engine commit c55541f with commit message: MB-40126 : subdoc CreateAsDeleted: User value shouldn't be '{}'

            We used to hit this error before on 7846 but we are not hitting this anymore on 7856 using our FIT performer for transaction testing.

            praneeth.bokka Praneeth Bokka (Inactive) added a comment - We used to hit this error before on 7846 but we are not hitting this anymore on 7856 using our FIT performer for transaction testing.
            drigby Dave Rigby made changes -
            Link This issue causes MB-40262 [ MB-40262 ]

            We are not hitting this issue anymore for TAF tests as well:

            09:11:10.197 1/MainThread/7f825/f7f08 overall commit completed

            09:11:10.197 1/MainThread/7f825/f7f08 added attempt TransactionAttempt{id=f7f08,state=COMPLETED,atrColl=com.couchbase.client.java.ReactiveCollection@47fb595c/_txn:atr-971-#4b} after success

            09:11:10.197 1/MainThread/7f825/f7f08 Adding cleanup request for _default/_txn:atr-971-#4b to run in 30000 msecs

            09:11:10.197 1/MainThread/7f825 finished txn in 134ms

            2020-07-02 09:11:10,436 | test  | INFO    | MainThread | [basetestcase:check_coredump_exist:543] Initializing core dump check on all the nodes

            2020-07-02 09:11:29,283 | test  | INFO    | MainThread | [rest_client:monitorRebalance:1423] Rebalance done. Taken 10.1079998016 seconds to complete

            2020-07-02 09:11:29,285 | test  | INFO    | MainThread | [rest_client:monitorRebalance:1425] Sleep for 5 seconds after rebalance

            ok

             

            ----------------------------------------------------------------------

            Ran 1 test in 197.275s

             

            OK

            During the test, Remote Connections: 27, Disconnections: 27

            SDK Connections: 2, Disconnections: 2

            summary so far suite Atomicity.doc_isolation.IsolationDocTest , pass 1 , fail 0

            testrunner logs, diags and results are available under /Users/praneethbokka/Frameworks/services/TAF/logs/testrunner-20-Jul-02_09-08-54/test_1

             

            Deprecated Gradle features were used in this build, making it incompatible with Gradle 7.0.

            Use '--warning-mode all' to show the individual deprecation warnings.

            See https://docs.gradle.org/6.0/userguide/command_line_interface.html#sec:command_line_warnings

             

             

            However we are running into other issue:

            https://issues.couchbase.com/browse/MB-40262

            praneeth.bokka Praneeth Bokka (Inactive) added a comment - We are not hitting this issue anymore for TAF tests as well: 09:11:10.197 1/MainThread/7f825/f7f08 overall commit completed 09:11:10.197 1/MainThread/7f825/f7f08 added attempt TransactionAttempt{id=f7f08,state=COMPLETED,atrColl=com.couchbase.client.java.ReactiveCollection@47fb595c/_txn:atr-971-#4b} after success 09:11:10.197 1/MainThread/7f825/f7f08 Adding cleanup request for _default/_txn:atr-971-#4b to run in 30000 msecs 09:11:10.197 1/MainThread/7f825 finished txn in 134ms 2020-07-02 09:11:10,436 | test  | INFO    | MainThread | [basetestcase:check_coredump_exist:543] Initializing core dump check on all the nodes 2020-07-02 09:11:29,283 | test  | INFO    | MainThread | [rest_client:monitorRebalance:1423] Rebalance done. Taken 10.1079998016 seconds to complete 2020-07-02 09:11:29,285 | test  | INFO    | MainThread | [rest_client:monitorRebalance:1425] Sleep for 5 seconds after rebalance ok   ---------------------------------------------------------------------- Ran 1 test in 197.275s   OK During the test, Remote Connections: 27, Disconnections: 27 SDK Connections: 2, Disconnections: 2 summary so far suite Atomicity.doc_isolation.IsolationDocTest , pass 1 , fail 0 testrunner logs, diags and results are available under /Users/praneethbokka/Frameworks/services/TAF/logs/testrunner-20-Jul-02_09-08-54/test_1   Deprecated Gradle features were used in this build, making it incompatible with Gradle 7.0. Use '--warning-mode all' to show the individual deprecation warnings. See https://docs.gradle.org/6.0/userguide/command_line_interface.html#sec:command_line_warnings     However we are running into other issue: https://issues.couchbase.com/browse/MB-40262

            Build couchbase-server-7.0.0-2526 contains kv_engine commit c55541f with commit message:
            MB-40126: subdoc CreateAsDeleted: User value shouldn't be '{}'

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.0-2526 contains kv_engine commit c55541f with commit message: MB-40126 : subdoc CreateAsDeleted: User value shouldn't be '{}'
            arunkumar Arunkumar Senthilnathan (Inactive) made changes -
            Status Resolved [ 5 ] Closed [ 6 ]
            arunkumar Arunkumar Senthilnathan (Inactive) made changes -
            Labels Transactions approved-for-6.6.0 durability functional-test Transactions approved-for-6.6.0 durability functional-test txn_testing
            owend Daniel Owen made changes -
            Link This issue relates to MB-40366 [ MB-40366 ]
            drigby Dave Rigby made changes -
            Link This issue relates to MB-40366 [ MB-40366 ]

            People

              ashwin.govindarajulu Ashwin Govindarajulu
              ashwin.govindarajulu Ashwin Govindarajulu
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty