Couchbase Server / MB-40526

[Transaction] Seeing keys missing in doc_loading during swap_rebalance


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: User Error
    • Affects Version/s: 6.6.0
    • Fix Version/s: 6.6.0
    • Component/s: test-execution
    • Environment:
      6.6.0-7883 Enterprise Edition
      couchbase-transactions-1.1.0-SNAPSHOT
      java-client-3.0.6-SNAPSHOT

      Description

      Build: 6.6.0-7883

      Scenario:

      1. 4-node cluster, 2 Couchbase buckets (replicas=2)

        +----------------+----------------------+-----------------+------------+------------+----------------+-------------------+-----------------------+
        | Node           | Services             | CPU_utilization | Mem_total  | Mem_free   | Swap_mem_used  | Active / Replica  | Version               |
        +----------------+----------------------+-----------------+------------+------------+----------------+-------------------+-----------------------+
        | 172.23.105.205 | kv                   | 54.025974026    | 4201840640 | 3443564544 | 0 / 3758092288 | 5167 / 10324      | 6.6.0-7883-enterprise |
        | 172.23.105.155 | fts, index, kv, n1ql | 69.6335078534   | 4201840640 | 3388350464 | 0 / 3758092288 | 4938 / 9838       | 6.6.0-7883-enterprise |
        | 172.23.105.206 | kv                   | 57.1428571429   | 4201840640 | 3453943808 | 0 / 3758092288 | 4806 / 9584       | 6.6.0-7883-enterprise |
        | 172.23.105.159 | kv                   | 56.9553805774   | 4201840640 | 3429101568 | 0 / 3758092288 | 4801 / 9680       | 6.6.0-7883-enterprise |
        +----------------+----------------------+-----------------+------------+------------+----------------+-------------------+-----------------------+

        +----------+---------+----------+------------+-----+-------+------------+----------+-----------+
        | Bucket   | Type    | Replicas | Durability | TTL | Items | RAM Quota  | RAM Used | Disk Used |
        +----------+---------+----------+------------+-----+-------+------------+----------+-----------+
        | bucket-0 | membase | 2        | none       | 0   | 9483  | 2986344448 | 93145992 | 439373036 |
        | bucket-1 | membase | 2        | none       | 0   | 10229 | 2986344448 | 97605584 | 540662594 |
        +----------+---------+----------+------------+-----+-------+------------+----------+-----------+
        

      2. Load a few docs into the buckets using transactions
      3. Perform a swap_rebalance of 2 nodes (172.23.105.155, 172.23.105.206)
      4. Start parallel doc_loading using transactions during the rebalance phase and validate

      Observation:

      At the end of the rebalance, data validation reports keys missing from the bucket.

      Pcap files: https://cb-jira.s3.us-east-2.amazonaws.com/logs/key_missing/pcaps.zip

      Test case to run:

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i nodes.ini -t rebalance_new.swaprebalancetests.SwapRebalanceBasicTests.do_test,nodes_init=4,replicas=2,standard_buckets=2,num-swap=2,swap-orchestrator=True,num_items=10000,doc_size=512,transaction_timeout=150,durability=PERSIST_TO_MAJORITY,atomicity=True -p rerun=False,get-cbcollect-info=False -m rest'

        Attachments

        1. key_missing.test
          144 kB
        2. KeyMissing Wireshark.png
          218 kB
        3. test.log
          43 kB

          Activity

          ashwin.govindarajulu Ashwin Govindarajulu created issue -
          ashwin.govindarajulu Ashwin Govindarajulu made changes -
          Description edited (pcap files link added)
          drigby Dave Rigby made changes -
          Assignee Dave Rigby [ drigby ] Daniel Owen [ owend ]
          owend Daniel Owen made changes -
          Assignee Daniel Owen [ owend ] Paolo Cocchi [ paolo.cocchi ]
          owend Daniel Owen made changes -
          Due Date 24/Jul/20
          owend Daniel Owen made changes -
          Due Date 24/Jul/20 22/Jul/20
          paolo.cocchi Paolo Cocchi made changes -
          Assignee Paolo Cocchi [ paolo.cocchi ] Ashwin Govindarajulu [ ashwin.govindarajulu ]
          ashwin.govindarajulu Ashwin Govindarajulu made changes -
          Environment 6.6.0-6883 Enterprise Edition → 6.6.0-7883 Enterprise Edition (couchbase-transactions-1.1.0-SNAPSHOT, java-client-3.0.6-SNAPSHOT unchanged)
          ashwin.govindarajulu Ashwin Govindarajulu made changes -
          Description edited (build number corrected to 6.6.0-7883)
          ritam.sharma Ritam Sharma made changes -
          Assignee Ashwin Govindarajulu [ ashwin.govindarajulu ] Paolo Cocchi [ paolo.cocchi ]
          ashwin.govindarajulu Ashwin Govindarajulu made changes -
          Description edited (test case command added)
          lynn.straus Lynn Straus made changes -
          Priority Major [ 3 ] Critical [ 2 ]
          graham.pople Graham Pople added a comment - edited

          Ashwin Govindarajulu do you have a full transactions log? I'd like to do a quick check that there's no bug in the transactions library causing this.
          (Unless the previous investigation has already confirmed that this is definitely a memcached issue.)
          ashwin.govindarajulu Ashwin Govindarajulu added a comment -

          Graham Pople I'll update the logs with the transaction log enabled now.
          paolo.cocchi Paolo Cocchi added a comment -

          Graham Pople We have not confirmed any memcached issue yet, so it would be great if you could rule out any SDK-related issue. Thanks.
          paolo.cocchi Paolo Cocchi made changes -
          Assignee Paolo Cocchi [ paolo.cocchi ] Ashwin Govindarajulu [ ashwin.govindarajulu ]
          paolo.cocchi Paolo Cocchi made changes -
          Resolution User Error [ 10100 ]
          Status Open [ 1 ] Resolved [ 5 ]
          paolo.cocchi Paolo Cocchi made changes -
          Component/s couchbase-bucket [ 10173 ]
          Component/s test-execution [ 10231 ]
          ashwin.govindarajulu Ashwin Govindarajulu made changes -
          Attachment KeyMissing Wireshark.png [ 101804 ]
          ashwin.govindarajulu Ashwin Govindarajulu made changes -
          Attachment key_missing.test [ 101805 ]
          ashwin.govindarajulu Ashwin Govindarajulu added a comment - edited

          Able to reproduce the issue. Test log: key_missing.test and pcap (https://cb-jira.s3.us-east-2.amazonaws.com/logs/test/test.pcap) for the same.

          Looks like a timing issue between the actual doc creation and the time at which we validate the results.

          The test reported "test_docs-00028945" as missing, and the following are the relevant frames from test.pcap:

          Key: test_docs-00028945 (collection id: 0x00000074)
          vBucket: 882 (0x372)
           
          Frame 312805: Subdoc mutation request - 0xd1 (Opaque: 0x3b570200)
          Frame 312814: Response Success (But on deleted doc - 0x00cd)
           
          Frame 312988: Add request - 0x2 (Opaque: 0x5c570200)
          Frame 313008: Add response (D_Impossible - 0x00a1)
           
          Frame 313004: Add request - 0x2 (Opaque: 0x5c570200)
          Frame 313008: Add response (D_Impossible - 0x00a1)
           
          Frame 405843: Get request (opaque: 0x9ce50200) <— TAF verification read
          Frame 405872: Get response with Key not Found <— TAF Verification response (Where we report key missing)
           
          Frame 428494: Subdoc lookup request - 0xd0 (Opaque: 0xb00d0300)
          Frame 428495: Subdoc lookup success
           
          Frame 428498: Add request - 0x2 (Opaque: 0xb10d0300)
          Frame 428499: Add response SUCCESS

          As per TAF, we are calling RunTransaction from Jython_tasks/task.py#L3863, and this call is supposed to commit before returning to the caller, as per SimpleTransaction.java#L489. The transaction has not reported any error either, which would have been printed on stdout in that case (SimpleTransaction.java#L496). We still see d_impossible in the tests, but this is handled and retried well before the validation part, within the transaction_load() call (Jython_tasks/task.py#L3972).
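          (For context, a minimal sketch of the contract the test relies on, assuming the couchbase-transactions 1.x Java API. This is illustrative only, not TAF's actual SimpleTransaction code: run() drives the lambda through the commit before returning, or throws on failure.)

          import com.couchbase.client.java.Cluster;
          import com.couchbase.client.java.Collection;
          import com.couchbase.client.java.json.JsonObject;
          import com.couchbase.transactions.TransactionResult;
          import com.couchbase.transactions.Transactions;

          public class TransactionLoadSketch {
              // Hypothetical stand-in for the RunTransaction call made from
              // Jython_tasks/task.py: run() only returns after the lambda has
              // reached and attempted the commit.
              static TransactionResult insertDoc(Cluster cluster, Collection collection,
                                                 String key, JsonObject content) {
                  Transactions transactions = Transactions.create(cluster);
                  try {
                      return transactions.run(ctx -> {
                          ctx.insert(collection, key, content);
                          ctx.commit(); // explicit commit before run() returns
                      });
                  } finally {
                      transactions.close();
                  }
              }
          }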

          The point is that the transaction context returned from the ctx.commit() call, and only then did we go ahead and validate the committed docs. But the pcap suggests the commit actually happened only after the TAF verification.

          Run 2 - Update

          Run log: http://qa.sc.couchbase.com/job/test_suite_executor-TAF/45743/consoleText

          This looks like a test race condition. After adding extra debug prints in the transaction application code, the keys reported missing turn out not to have actually been committed, as per the transaction app.

          Will debug and fix it in TAF today.

          graham.pople Graham Pople added a comment -

          TL;DR for this long comment: (probably) found the issue; it's not a memcached bug, it's a TAF race caused by recent deliberate changes in transactions 1.0.1. A TAF change is needed.

           

          Ashwin Govindarajulu thanks for presenting the frames like that, that's very helpful and I think I've seen the issue. I think this is a TAF race - one between TAF validation and transaction cleanup. Here's what I suspect the packets are doing:

           

          Transaction stages an insert
          Frame 312805: Subdoc mutation request - 0xd1 (Opaque: 0x3b570200)
          Frame 312814: Response Success (But on deleted doc - 0x00cd)
           
          Transaction has reached the commit point. It has written COMMIT and all doc-ids for the attempt to the attempt's ATR entry.
          Here it is trying to commit the insert.  But it fails.
          Frame 312988: Add request - 0x2 (Opaque: 0x5c570200)
          Frame 313008: Add response (D_Impossible - 0x00a1)
           
          It has a second go at committing the insert, also fails.
          It now calls it a day.  It's going to return success to the application, because the COMMIT point was reached.  
          But TransactionResult.unstagingComplete() will be false to indicate that not all documents could be unstaged (committed) right now.
          Crucially, the async cleanup process is expected to finish the job.
          Frame 313004: Add request - 0x2 (Opaque: 0x5c570200)
          Frame 313008: Add response (D_Impossible - 0x00a1)
           
          You do the read here.  At this stage the insert is still a tombstone so you don't see it.
          Frame 405843: Get request (opaque: 0x9ce50200) <— TAF verification read
          Frame 405872: Get response with Key not Found <— TAF Verification response (Where we report key missing)
           
          This, I think, is the async transaction cleanup process, which has found the now-expired transaction and is cleaning it up by finishing the commit.
          Here it gets the document.
          Frame 428494: Subdoc lookup request - 0xd0 (Opaque: 0xb00d0300)
          Frame 428495: Subdoc lookup success
           
          And now it commits the document.  If TAF were to validate now, it should see the doc.
          Frame 428498: Add request - 0x2 (Opaque: 0xb10d0300)
          Frame 428499: Add response SUCCESS
          

          Does all that make sense?

          So this comes down to changes made in the error handling in 1.0.1.  Many errors during the commit phase will now trigger success to be returned, rather than that operation being retried indefinitely.

          This actually makes us more robust against rebalances, node failures, etc.: rather than those causing hundreds of transactions to simultaneously timeout, instead those transactions if they are mid-commit will simply return success in the face of durability impossibility errors and the like.  Unless the application actually needs all the docs to be committed right now (e.g. if they are doing a N1QL AT_PLUS query or - crucially - if you are TAF and want to immediately validate that all docs are in final expected committed state), then failures during commit aren't really failures as the commit can and will be completed a little later by the cleanup process.

          I thought we had discussed these changes before?  Maybe I'm mistaken, or maybe I hadn't stressed enough the likelihood that TAF was going to need to make some changes to adjust.

          I hope that explains everything. Long story short, I suggest TAF use TransactionResult.unstagingComplete() to determine whether it is safe to do immediate document validation, or whether it should wait until cleanup has finished off the commit first. (I think I've already shown how to wait for cleanup events to determine if cleanup is complete.)
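          (A hedged sketch of that suggestion, assuming the couchbase-transactions 1.x Java API. unstagingComplete() is the library accessor named above; waitForCleanupEvents() and validateDoc() are hypothetical TAF-side helpers.)

          import com.couchbase.client.java.Collection;
          import com.couchbase.client.java.json.JsonObject;
          import com.couchbase.transactions.TransactionResult;
          import com.couchbase.transactions.Transactions;

          public class ValidateAfterTransaction {
              static void insertAndValidate(Transactions transactions, Collection collection,
                                            String key, JsonObject content) {
                  TransactionResult result = transactions.run(ctx -> {
                      ctx.insert(collection, key, content);
                  });

                  if (result.unstagingComplete()) {
                      // All staged mutations were unstaged (committed) before run()
                      // returned, so an immediate KV read should see the document.
                      validateDoc(collection, key);
                  } else {
                      // COMMIT point was reached but some docs are still staged; the
                      // async cleanup process will finish the commit later.
                      waitForCleanupEvents(); // hypothetical: block until cleanup completes
                      validateDoc(collection, key);
                  }
              }

              static void validateDoc(Collection collection, String key) {
                  collection.get(key); // throws DocumentNotFoundException if still missing
              }

              static void waitForCleanupEvents() {
                  // Placeholder for however TAF waits on transaction cleanup events.
              }
          }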

           

          ashwin.govindarajulu Ashwin Govindarajulu added a comment -

          Graham Pople Thanks for the analysis. Updated the code as per your input.

          Patch ref: http://review.couchbase.org/c/TAF/+/132949
          ritam.sharma Ritam Sharma made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            Assignee:
            ashwin.govindarajulu Ashwin Govindarajulu
            Reporter:
            ashwin.govindarajulu Ashwin Govindarajulu
             Votes:
             0
             Watchers:
             8

