Couchbase Server / MB-44683

XDCR - target manifests can be lost when XMEM is stuck

Details

    • Untriaged
    • 1
    • No

    Description

      I see many occurrences of this error in the logs:

      2021-02-27T18:41:18.792-08:00 ERRO GOXDCR.GenericSupervisor: PipelineSupervisor_2c3bc0c5670030c5aed22087051b502d/bucket2/bucket2 Received error report : Collections Router 2c3bc0c5670030c5aed22087051b502d/bucket2/bucket2 error - unable to find last known target manifest version 3386 from collectionsManifestSvc - err: Unable to find target manifest for version 3386

      This may need investigating.

      Update:
      It is possible that when a pipeline resumes, all the VBs are able to resume a DCP stream. This leads XDCR to declare the pipeline "ready for checkpoint" (checkpoint manager's isCheckpointAllowed()).

      However, if XMEM is stuck or requests are timing out, certain VBs in ThroughSeqnoTracker may not have any data flow. Then, the next time the checkpoint manager performs a checkpoint, it may retrieve 0 for the target manifest ID and commit that into the checkpoint. If this happens often enough, every checkpoint ends up with 0 as its target manifest ID. The collections manifest service then throws away all manifests for the target, which would explain the "unable to find last known target manifest version" error above.
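
      The sequence above can be illustrated with a minimal Go sketch. This is not goxdcr code: Checkpoint, takeCheckpoint and pruneManifests are hypothetical names, and the pruning rule (keep only manifests that some checkpoint still references) is an assumption based on the description above. It only shows how a VB with no data flow yields a target manifest ID of 0, and how a set of such checkpoints leaves no manifest referenced.

```go
package main

import "fmt"

// Checkpoint is a simplified, hypothetical stand-in for an XDCR checkpoint record.
type Checkpoint struct {
	VB               uint16
	TargetManifestID uint64 // 0 means no target manifest was seen for this VB
}

// takeCheckpoint records, per VB, the last target manifest ID that was seen.
// A VB with no data flow has no entry, so the map lookup yields the zero value 0.
func takeCheckpoint(vbs []uint16, lastSeen map[uint16]uint64) []Checkpoint {
	ckpts := make([]Checkpoint, 0, len(vbs))
	for _, vb := range vbs {
		ckpts = append(ckpts, Checkpoint{VB: vb, TargetManifestID: lastSeen[vb]})
	}
	return ckpts
}

// pruneManifests keeps only the cached manifests that some checkpoint still
// references (assumed pruning rule, for illustration only).
func pruneManifests(cache map[uint64]string, ckpts []Checkpoint) map[uint64]string {
	kept := make(map[uint64]string)
	for _, c := range ckpts {
		if m, ok := cache[c.TargetManifestID]; ok {
			kept[c.TargetManifestID] = m
		}
	}
	return kept
}

func main() {
	vbs := []uint16{0, 1}
	cache := map[uint64]string{3386: "target manifest v3386"}

	// Healthy pipeline: data flows, so every VB has seen manifest 3386.
	healthy := takeCheckpoint(vbs, map[uint16]uint64{0: 3386, 1: 3386})
	fmt.Println("manifests kept (healthy):", len(pruneManifests(cache, healthy)))

	// Stuck XMEM: DCP streams are open but no data flows, so nothing was seen.
	stuck := takeCheckpoint(vbs, map[uint16]uint64{})
	fmt.Println("manifests kept (stuck):", len(pruneManifests(cache, stuck)))
}
```

      In this sketch the stuck case checkpoints 0 for every VB, so manifest 3386 is no longer referenced and is dropped from the cache.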


            People

              pavithra.mahamani Pavithra Mahamani (Inactive)
              neil.huang Neil Huang