  1. Couchbase Server
  2. MB-36634

[Documentation]: Successfully acknowledged sync-write is missing from the bucket when rebalance failure is simulated via memcached kill.

Details

    Description

      1. Create a 2 node cluster:

      +----------------+----------+--------------+
      | Nodes          | Services | Status       |
      +----------------+----------+--------------+
      | 10.112.180.101 | [u'kv']  | Cluster node |
      | 10.112.180.102 | None     | <--- IN ---  |
      +----------------+----------+--------------+
      

      2. Create bucket:

      http://10.112.180.101:8091/pools/default/buckets with param: replicaIndex=1&maxTTL=0&flushEnabled=1&compressionMode=off&bucketType=membase&name=default&replicaNumber=1&ramQuotaMB=654&threadsNumber=3&evictionPolicy=valueOnly
      

      3. Load 100k docs (test_docs-0 to test_docs-99999) with durability=majority.
      4. Change the bucket replica count to 2, add 10.112.180.103, remove 10.112.180.102, and start a rebalance. Load another 100k docs (test_docs-100000 to test_docs-199999) in parallel.
      5. Kill memcached on 10.112.180.101 when the rebalance reaches ~40%. The rebalance fails (intentionally).
      Data loading is still in progress, with expected exceptions.
      6. Restart the rebalance and wait for it to finish. It completes successfully.
      7. Wait for data loading to finish. Retries of all the caught exceptions succeed.
      8. Validate the data.
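
Steps 5 to 7 rely on the loader catching every failed write and retrying it. A minimal sketch of that pattern follows, assuming a hypothetical `write(key)` callable and a hypothetical `TransientWriteError`; neither is part of the actual test framework or SDK, and the in-memory store only simulates a bucket.

```python
class TransientWriteError(Exception):
    """Hypothetical stand-in for a timeout or similar failure seen during the memcached kill."""


def load_with_retry(write, keys, max_attempts=5):
    """Call write(key) for each key, retrying failures; return keys never acknowledged."""
    pending = list(keys)
    for _ in range(max_attempts):
        failed = []
        for key in pending:
            try:
                write(key)
            except TransientWriteError:
                failed.append(key)
        pending = failed
        if not pending:
            break
    return pending


# Simulated bucket: every key fails on its first attempt, then succeeds.
store = {}
seen = set()


def flaky_write(key):
    if key not in seen:
        seen.add(key)
        raise TransientWriteError(key)
    store[key] = True


keys = ["test_docs-%d" % i for i in range(100000, 100010)]
unacked = load_with_retry(flaky_write, keys)
print(unacked, len(store))
```

A loop of this shape only re-sends writes that raised an exception. A write that was acknowledged and later lost never enters the retry list, so only the final validation can detect it.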

      Actual result:
      Data validation failed because a few keys are missing, even though their sync-writes were acknowledged as successful.
      Missing keys: ['test_docs-130287', 'test_docs-130289', 'test_docs-130282', 'test_docs-130294', 'test_docs-130291']

      Expected Result:
      All the data should be present, since all exceptions were caught and the failed writes were retried.
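
The validation step can be thought of as a set difference between the keys that were acknowledged and the keys that can still be read. The sketch below is illustrative only: `fetch` is a hypothetical lookup returning `None` for an absent key, and the dictionary stands in for the bucket, seeded so that the five keys reported above are absent.

```python
def missing_keys(expected_keys, fetch):
    """Return the expected keys for which fetch(key) returns None, sorted."""
    return sorted(k for k in expected_keys if fetch(k) is None)


# Keys reported missing in this ticket.
lost = {
    "test_docs-130287",
    "test_docs-130289",
    "test_docs-130282",
    "test_docs-130294",
    "test_docs-130291",
}

# Simulated bucket covering a small key range around the reported losses.
expected = ["test_docs-%d" % i for i in range(130280, 130296)]
bucket = {k: {"id": k} for k in expected if k not in lost}

missing = missing_keys(expected, bucket.get)
print(missing)
```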

      In the attached pcap, apply the filter couchbase.opaque == 0xe3080000 and look at packet number 619311, which is the insert request for key test_docs-130287. Packet number 619329 is the success response for it.

      But the key is missing from the bucket.

      Note: The pcap is quite large, so please apply the filter. I tried to save only the filtered packets from Wireshark, but ran into an issue while doing so.
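
The opaque filter works because a response echoes the opaque value of its request in the 24-byte memcached binary protocol header. The sketch below reads the opaque from bytes 12 to 15 and groups packets by it. The two packets are synthetic, not bytes from the attached capture, and treating the displayed value 0xe3080000 as a big-endian integer is an assumption about how Wireshark renders the field. Sync-write requests use a flexible-framing header variant that changes bytes 2 and 3, but the opaque stays at the same offset.

```python
import struct

# magic, opcode, key length, extras length, data type,
# vbucket id (request) or status (response), total body length, opaque, cas
HEADER = struct.Struct(">BBHBBHIIQ")  # 24 bytes, network byte order


def opaque_of(packet):
    """Extract the opaque field from a packet that starts with a binary header."""
    return HEADER.unpack_from(packet)[7]


def group_by_opaque(packets):
    """Map each opaque value to the packets carrying it, in capture order."""
    groups = {}
    for packet in packets:
        groups.setdefault(opaque_of(packet), []).append(packet)
    return groups


OPAQUE = 0xE3080000
key = b"test_docs-130287"
extras = bytes(8)  # flags and expiry for an ADD request

# Synthetic ADD (opcode 0x02) request and a matching success response.
request = HEADER.pack(0x80, 0x02, len(key), len(extras), 0, 0,
                      len(extras) + len(key), OPAQUE, 0) + extras + key
response = HEADER.pack(0x81, 0x02, 0, 0, 0, 0, 0, OPAQUE, 1)

groups = group_by_opaque([request, response])
print(hex(opaque_of(request)), len(groups[OPAQUE]))
```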

      QE Note:

      -t rebalance_new.swaprebalancetests.SwapRebalanceFailedTests.test_failed_swap_rebalance,nodes_init=2,replicas=1,standard_buckets=1,num-swap=1,new_replica=2,percentage_progress=40,GROUP=P0;durability,durability=MAJORITY,skip_cleanup=True -p infra_log_level=debug,log_level=debug -m rest
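
For readers unfamiliar with the `-t` argument, it is a test path followed by comma-separated key=value parameters. The helper below is a hypothetical illustration of that structure, not the test runner's own parser.

```python
def parse_test_spec(spec):
    """Split 'module.Class.test,key=value,...' into (test_path, params)."""
    test_path, *pairs = spec.split(",")
    params = dict(pair.split("=", 1) for pair in pairs)
    return test_path, params


spec = (
    "rebalance_new.swaprebalancetests.SwapRebalanceFailedTests.test_failed_swap_rebalance,"
    "nodes_init=2,replicas=1,standard_buckets=1,num-swap=1,new_replica=2,"
    "percentage_progress=40,GROUP=P0;durability,durability=MAJORITY,skip_cleanup=True"
)
test_path, params = parse_test_spec(spec)
print(test_path)
print(params["replicas"], params["new_replica"], params["percentage_progress"])
```

Read this way, the parameters line up with the steps above: `replicas=1` and `new_replica=2` correspond to the replica change in step 4, and `percentage_progress=40` to the kill point in step 5.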
      


          Activity

            dfinlay Dave Finlay added a comment -

            Ritesh Agarwal: basically yes. Our competitors offer this kind of trade-off between performance and failure modes too. Obviously we do need to be clear with users in our docs about this trade-off.

            CC: Shivani Gupta

            ritesh.agarwal Ritesh Agarwal added a comment - - edited

            Dave Finlay/Dave Rigby: What is expected in the case of an ephemeral bucket, given that auto-reprovisioning is enabled?

            Scenario:
            The active responds with success to the client for a given key and is then killed immediately. The Prepare on the replica should still be processed, but I am seeing all those keys getting lost. For ephemeral buckets there is no rollback involved in this case, so there should not be any data loss for acked keys.

            Replica should commit all the Prepares it has acknowledged.

            CC: Ritam Sharma

            drigby Dave Rigby added a comment -

            Ritesh Agarwal I believe that for an Ephemeral bucket, any SyncWrites which returned success to the client should not be lost if the active crashes. This is because auto-reprovisioning should promote one of the replicas (and the old active if/when it comes back will become a replica).

            It wasn't clear from your comment if you are seeing this behaviour, or if you're seeing something different - if so please raise a separate MB on Ephemeral and we can investigate.

            ritam.sharma Ritam Sharma added a comment -

            Dave Rigby - We are seeing a different behaviour. Will log a new ticket with logs. Thank you Dave Rigby


            raju Raju Suravarjjala added a comment -

            Bulk closing invalid, won't-fix and duplicate bugs

            People

              shivani.gupta Shivani Gupta
              ritesh.agarwal Ritesh Agarwal