Couchbase Server / MB-29127

[CX] Rollback on purges of deletes is not supported


Details

    • CX Sprint 99, CX Sprint 100, CX Sprint 112, CX Sprint 113

    Description

      In certain scenarios, KV purges delete operations. A DCP consumer that tries to resume a DCP stream from a sequence number smaller than the purge sequence number is asked to roll back to 0, because KV no longer has the deletes the consumer needs to reach eventual consistency.
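
A minimal sketch of the check described above, written as a toy model rather than KV's actual code. The names `StreamResponse` and `request_stream` are hypothetical, and the model assumes that a stream starting at 0 is always accepted and ignores every other rollback cause (for example failover log mismatches).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreamResponse:
    ok: bool
    rollback_to: Optional[int] = None


def request_stream(start_seqno: int, purge_seqno: int) -> StreamResponse:
    """Toy model of the producer-side purge check (assumed behaviour)."""
    # Starting from 0 replays everything that still exists, so purged
    # deletes cannot leave this consumer with stale documents.
    if start_seqno == 0:
        return StreamResponse(ok=True)
    # Resuming below the purge point: deletes between start_seqno and
    # purge_seqno may be gone, so the consumer is sent back to 0.
    if start_seqno < purge_seqno:
        return StreamResponse(ok=False, rollback_to=0)
    return StreamResponse(ok=True)
```

Under this model, a consumer at sequence number 500 facing a purge sequence number of 1000 gets a rollback to 0, while a consumer at 1000 or above resumes normally.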

      Analytics ignores this rollback response and keeps trying to reconnect: it finds no branch point in the failover log and is not aware of any other reason to roll back.

       

      There are currently a few problems:

      Problem 1

      Analytics uses a single stream to feed multiple datasets. When a new dataset is created, Analytics opens a stream from sequence number 0. The new dataset stores all mutations, while existing datasets ignore mutations until the stream reaches the sequence number in their DCP state.

      This can lead to inconsistent data in the following scenario.

      1. Dataset1 stopped at sequence number 500.
      2. DCP stream is disconnected.
      3. Dataset2 is created.
      4. DCP producer purges up to sequence number 1000.
      5. Analytics will open the stream from sequence number 0 (because of Dataset2).
      6. Once the stream reaches 500, Dataset1 will start applying mutations, not knowing that deletes have been purged.
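
The scenario above can be replayed with a small simulation. This is an illustrative sketch only: the event tuples, the `replay` helper, and the purge model (a purged delete leaves no trace of its key in the stream) are assumptions made for the example, and the shared stream is replayed once per dataset for simplicity.

```python
def replay(stream, start_after, store):
    """Apply (seqno, op, key) events with seqno > start_after to store."""
    for seqno, op, key in stream:
        if seqno <= start_after:
            continue  # already covered by this dataset's DCP state
        if op == "set":
            store[key] = seqno
        elif op == "delete":
            store.pop(key, None)


# Producer history: document "a" is created and later deleted.
history = [(100, "set", "a"), (400, "set", "b"),
           (700, "delete", "a"), (1200, "set", "c")]

# Steps 1-2: Dataset1 stopped at 500, so it holds "a" and "b".
dataset1 = {}
replay([e for e in history if e[0] <= 500], 0, dataset1)

# Step 4: the producer purges deletes up to 1000. Assumed model: keys
# whose delete was purged no longer appear in the stream at all.
purge_seqno = 1000
purged_keys = {key for seqno, op, key in history
               if op == "delete" and seqno <= purge_seqno}
purged_stream = [e for e in history if e[2] not in purged_keys]

# Steps 5-6: the stream restarts from 0 because of Dataset2;
# Dataset1 only applies what comes after 500.
dataset2 = {}
replay(purged_stream, 0, dataset2)
replay(purged_stream, 500, dataset1)

print(sorted(dataset1))  # Dataset1 still contains the deleted "a"
print(sorted(dataset2))  # Dataset2 never saw "a"
```

Dataset1 ends up holding a document that no longer exists in KV, while Dataset2 is correct. That is the inconsistency the scenario describes.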

      Proposed solution

      When two datasets have inconsistent DCP states, catch up, close the stream, and re-open it. In the example above, at step 5, we open the stream from 0 to 500. Once that stream ends, we open a new stream from 501 to infinity. When we request the new stream, KV responds with a rollback to sequence number 0.
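
One way to picture the catch-up plan is as a list of stream ranges, one per distinct DCP state. The helper below is a hypothetical sketch, not Analytics code; it follows the ticket's informal convention that a dataset at sequence number N resumes from N + 1.

```python
import math


def plan_segments(dataset_seqnos):
    """Return (start, end) stream ranges, one per distinct DCP state."""
    points = sorted(set(dataset_seqnos))
    segments = []
    for i, point in enumerate(points):
        # A dataset at 0 starts from 0; any other resumes just past its state.
        start = 0 if point == 0 else point + 1
        # Each range ends where the next dataset's state begins.
        end = points[i + 1] if i + 1 < len(points) else math.inf
        segments.append((start, end))
    return segments


# Dataset2 is new (0), Dataset1 stopped at 500.
print(plan_segments([500, 0]))
```

For the example in this ticket the plan is a first stream from 0 to 500 followed by a second from 501 onward. The second request is the one KV would answer with a rollback if deletes at or above 500 have been purged.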

      There is quite a lot of wasted effort here. To mitigate it, before we start the stream from 0, we will ask KV for purge sequence numbers and perform rollbacks proactively.
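
A rough sketch of the proactive check, assuming Analytics can obtain a purge sequence number per vbucket from KV before opening the stream. How that number is fetched is not specified in this ticket, so the function takes it as plain input; `find_proactive_rollbacks` and both dictionary shapes are hypothetical.

```python
def find_proactive_rollbacks(dataset_states, purge_seqnos):
    """Find datasets whose saved state falls below a purge point.

    dataset_states: {dataset_name: {vbucket_id: seqno}}
    purge_seqnos:   {vbucket_id: purge_seqno}
    Returns {dataset_name: [vbucket_id, ...]} for stale states only.
    """
    rollbacks = {}
    for dataset, vb_states in dataset_states.items():
        stale = sorted(
            vb for vb, seqno in vb_states.items()
            # A state of 0 is never stale: it replays everything anyway.
            if 0 < seqno < purge_seqnos.get(vb, 0)
        )
        if stale:
            rollbacks[dataset] = stale
    return rollbacks


states = {"Dataset1": {0: 500, 1: 2000}, "Dataset2": {0: 0, 1: 0}}
purges = {0: 1000, 1: 1500}
print(find_proactive_rollbacks(states, purges))
```

Running the check first means Dataset1 can be rolled back before the stream from 0 is opened, instead of streaming to 500 and only then learning about the rollback.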

       

      Problem 2

      When the purge sequence number is large enough, the DCP consumer is expected to stream from 0 all the way to the purge sequence number without a connection drop. See https://issues.couchbase.com/browse/MB-27800

      If a connection drops, then the DCP producer will ask the consumer to start over. Why is this especially bad for Analytics?

      When Analytics streams a vbucket, it hashes the vbucket's mutations across multiple nodes. Each of these nodes receives mutations from all vbuckets and puts them together in an LSM index, creating components when memory is full.
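
The fan-out can be illustrated with a simple key-hash partitioner. This is only a sketch: the hash function (`zlib.crc32`), the node count, and the `route` helper are assumptions for illustration and say nothing about how Analytics actually partitions data.

```python
import zlib
from collections import defaultdict


def node_for(key: str, num_nodes: int) -> int:
    """Pick a node from the document key alone (assumed scheme)."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def route(mutations, num_nodes):
    """Group (vbucket, key) mutations by destination node."""
    per_node = defaultdict(list)
    for vbucket, key in mutations:
        # The vbucket plays no part in routing, so each node's index
        # ends up mixing mutations from many vbuckets.
        per_node[node_for(key, num_nodes)].append((vbucket, key))
    return per_node


mutations = [(vb, f"vb{vb}-doc{i}") for vb in range(4) for i in range(50)]
per_node = route(mutations, num_nodes=3)
for node, items in sorted(per_node.items()):
    print(node, len(items), sorted({vb for vb, _ in items}))
```

Because every node's index can contain data from any vbucket, rolling back a single vbucket touches every node.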

      When Analytics detects the need for a rollback, it removes components from the end of the index until it has passed the rollback point, and then re-opens the stream.
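
A simplified sketch of that rollback, assuming each component can be summarised by the highest sequence number it contains. Real components mix many vbuckets, each with its own sequence numbers, so the single `max_seqno` field and the `roll_back` helper are hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass
class Component:
    min_seqno: int
    max_seqno: int


def roll_back(components, rollback_seqno):
    """Drop newest components until none holds data past rollback_seqno.

    Returns (kept_components, resume_seqno).
    """
    kept = list(components)  # ordered oldest to newest
    while kept and kept[-1].max_seqno > rollback_seqno:
        kept.pop()  # whole components are discarded, never trimmed
    resume_seqno = kept[-1].max_seqno if kept else 0
    return kept, resume_seqno


index = [Component(1, 300), Component(301, 600), Component(601, 900)]
print(roll_back(index, 500))  # keeps only the first component
print(roll_back(index, 0))    # rollback to 0 discards everything
```

Because components are removed whole, a rollback usually lands before the requested point, and a rollback to 0 empties the index.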

      In this scenario, we roll all vbuckets back to 0 and restart the operation for all of them. So to get out of this situation, we need to successfully stream all 1024 vbuckets from 0 to the purge sequence number.

      There is no proposed solution for this situation yet. Many solutions exist but none is particularly attractive.

       

      Problem 3

      Testing purge scenarios is not easy. As of now, the only way for us to test is to set a low purge period (the minimum is 1 hour), wait for that time to pass, and then perform a manual compaction.

      This is not ideal. Is there a better way?

       

       

            People

              tanzeem.ahmed Tanzeem Ahmed (Inactive)
              Abdullah.Alamoudi Abdullah Alamoudi [X] (Inactive)