  1. Couchbase Server
  2. MB-61323

XDCR - Unnecessary trigger of backfill pipeline, backfill with P2P needs revisiting

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • Morpheus
    • 7.6.0, 7.0.0, 7.0.4, 7.1.4, 7.0.5, 7.1.0, 7.1.1, 7.1.2, 7.2.0, 7.1.3, 7.2.1, 7.1.5, 7.2.4, 7.2.2, 7.1.6, 7.2.3, 7.2.5, 7.6.1
    • XDCR
    • None
    • Untriaged
    • 0
    • Unknown

    Description

      While testing filterBinary for Capella, Ayush Nayyar found the following issue (AV-75604), which is a server bug in 7.2.4 and is unrelated to filterBinary in Capella. These are roughly the steps performed:

      1. Both the source and target buckets have two collections: test.test and _default._default.
      2. Create a unidirectional replication with filterBinary turned on and binary docs loaded onto the source bucket. Configure only the explicit collection mapping _default._default to _default._default.
      3. Once changes_left is 0, turn on another explicit mapping, test.test to test.test, to trigger the backfill pipeline.
      4. Wait for changes_left to reach 0; all the binary documents should have been filtered.
      5. Turn off filterBinary.
      6. Pause replication.
      7. Resume replication.

      When the replication was resumed, binary documents that had originally been filtered were replicated. This is because a backfill pipeline ran from seqno 0 (in conjunction with the main pipeline run caused by the replication resume), and it also replicated the binary documents that were originally filtered.

      Source cluster logs:

      s3://cb-customers-secure/259dec4e-bd11-422f-9964-ea1c4c9dd1cb/2024-03-26/collectinfo-2024-03-26t092802-ns_1@svc-dqis-node-001.znfv7b7mrg-ma0u.sandbox.nonprod-project-avengers.com-redacted-a2319c5945f93c5f.zip
      s3://cb-customers-secure/259dec4e-bd11-422f-9964-ea1c4c9dd1cb/2024-03-26/collectinfo-2024-03-26t092802-ns_1@svc-dqis-node-002.znfv7b7mrg-ma0u.sandbox.nonprod-project-avengers.com-redacted-77e4686eef00f0bd.zip
      s3://cb-customers-secure/259dec4e-bd11-422f-9964-ea1c4c9dd1cb/2024-03-26/collectinfo-2024-03-26t092802-ns_1@svc-dqis-node-003.znfv7b7mrg-ma0u.sandbox.nonprod-project-avengers.com-redacted-13f91773b87b62f0.zip

       

      The RCA is the following:

      During the lifetime of a backfill replication, if a node (say node X) observes a peer push of tasks for vbs that it doesn't own, it stores them in its backfill spec (in case a node in the cluster fails over and node X becomes the master of those vbs). When the backfill is done for the vbs that node X owns, XDCR deems the backfill done but doesn't delete the backfill spec, because of the tasks for vbs that node X is not the master of. XDCR then tries to initiate a new backfill pipeline, soon realises that the remaining tasks are for vbs that it doesn't own, and bails out without ever creating or running a new backfill pipeline.
      Now say the replication is paused, which stops the pipelines.
      Then say the replication is resumed, which starts the pipelines again.
      As part of the main pipeline start, we do a P2P pull of checkpoints and backfill tasks. A different node Y in the cluster (a peer of node X from above; node X has replica vbs of vbs owned by Y) pulls the backfill tasks from node X (X got them from Y in the P2P push described above). Since Y is the master of the vbs of the tasks it pulled, it starts a new backfill pipeline, even though it was not needed. So the whole backfill setup with P2P in the picture needs to be revisited for this case.
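
      To make the sequence above easier to follow, here is a minimal Go sketch of it. It is only a toy model, not goxdcr code: the Node type, its methods, the Scenario function and the vb numbers are all hypothetical, and real backfill specs carry far more than a set of vbs. It assumes two nodes, X owning vbs 0 and 1 and Y owning vbs 2 and 3, with Y pushing its tasks to X.

```go
package main

import "sort"

// All names here are hypothetical and exist only for this illustration;
// they are not goxdcr types or functions.

// Node models one XDCR node: the vbs it masters and the backfill tasks
// recorded in its local backfill spec, keyed by vb number.
type Node struct {
	Name  string
	Owned map[int]bool
	Spec  map[int]bool
}

func NewNode(name string, owned ...int) *Node {
	n := &Node{Name: name, Owned: map[int]bool{}, Spec: map[int]bool{}}
	for _, vb := range owned {
		n.Owned[vb] = true
	}
	return n
}

// StoreTasks records tasks in the spec. It is used both for the node's own
// tasks and for a peer push, which is kept even for vbs the node does not
// own so that the tasks survive a failover.
func (n *Node) StoreTasks(vbs ...int) {
	for _, vb := range vbs {
		n.Spec[vb] = true
	}
}

// FinishOwnedBackfill clears the tasks of owned vbs only and reports
// whether the spec is now empty and could therefore be deleted.
func (n *Node) FinishOwnedBackfill() bool {
	for vb := range n.Spec {
		if n.Owned[vb] {
			delete(n.Spec, vb)
		}
	}
	return len(n.Spec) == 0
}

// RunnableTasks lists recorded tasks whose vb this node masters. An empty
// result stands for the "bail out without running a pipeline" case.
func (n *Node) RunnableTasks() []int {
	var vbs []int
	for vb := range n.Spec {
		if n.Owned[vb] {
			vbs = append(vbs, vb)
		}
	}
	sort.Ints(vbs)
	return vbs
}

// PullFromPeer models the P2P pull on main pipeline start: the node merges
// the peer's recorded tasks for the vbs that it masters.
func (n *Node) PullFromPeer(peer *Node) {
	for vb := range peer.Spec {
		if n.Owned[vb] {
			n.Spec[vb] = true
		}
	}
}

// Scenario replays the RCA: backfill, then pause and resume with a P2P pull.
func Scenario() (xSpecDeleted bool, xRunnable, yRunnable []int) {
	x := NewNode("X", 0, 1)
	y := NewNode("Y", 2, 3)

	x.StoreTasks(0, 1) // X's own backfill tasks
	y.StoreTasks(2, 3) // Y's own backfill tasks
	x.StoreTasks(2, 3) // peer push from Y, stored although X owns neither vb

	xSpecDeleted = x.FinishOwnedBackfill() // false: Y's tasks remain on X
	y.FinishOwnedBackfill()                // Y is done and its spec is empty
	xRunnable = x.RunnableTasks()          // empty: X bails out

	// Pause, then resume: Y pulls from X as part of main pipeline start.
	y.PullFromPeer(x)
	yRunnable = y.RunnableTasks() // tasks Y already finished come back
	return
}
```

      In this model Scenario() reports that X's spec was not deleted, that X has nothing runnable, and that Y ends up with tasks for vbs 2 and 3 again after the pull, which corresponds to the unnecessary backfill pipeline described above.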

      One possible solution is to do a one-time checkpointing and to not delete the checkpoints when the previous backfill is done for all the vbs that node X owns (but there are still tasks for vbs that X doesn't own).

          People

            neil.huang Neil Huang
            sumukh.bhat Sumukh Bhat