Couchbase Server: MB-31141

DelWithMetas from XDCR 4.5.1 -> 5.x creates corrupt tombstones

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: 5.5.2, 6.0.0
    • Affects Version/s: 5.5.1
    • Component/s: couchbase-bucket, XDCR
    • Security Level: Public
    • None
    • Untriaged
    • Unknown

    Description

      XDCR from Couchbase Server 4.5.1 (other source versions have not been tested yet) to 5.5.x appears to corrupt deleted documents on the target. This makes rebalance impossible and risks data loss (if replication streams have to reconnect and there is a failover).

      Steps To Reproduce

      1. Set up a single-node 4.5.1 cluster and a single-node 5.5.1 cluster
      2. Create a bucket on each cluster
      3. Set up XDCR from 4.5.1 to 5.5.1 on this bucket
      4. Create a document
      5. Delete that document

      After step 5, review the document on the source and target clusters:

      Source (4.5.1)

      Doc seq: 4
           id: test1
           rev: 4
           content_meta: 3
           size (on disk): 0
           cas: 1535966803550732288, expiry: 1535966802, flags: 0, datatype: 0, conflict_resolution_mode: 0
           doc deleted
           could not read document body: document not found
      

      Target (5.5.1)

      Doc seq: 4
           id: test1
           rev: 4
           content_meta: 131
           size (on disk): 15
           cas: 1535966803550732288, expiry: 1535966803, flags: 0, datatype: 0x00 (raw)
           doc deleted
           size: 5
           data: (snappy)
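
The difference between the two dumps can be checked mechanically. The following is a minimal sketch, not part of the original report, that parses dump output in the format shown above and flags tombstones that still carry a body. The function names are hypothetical, and the reading of bit 0x80 in content_meta as "body is compressed" is an assumption (it is consistent with the target's 131 = 0x80 | 3 against the source's 3).

```python
import re

# Assumption: bit 0x80 of content_meta marks a compressed body on disk.
# The target's 131 is 0x80 | 3; the source's 3 has the bit clear.
COMPRESSED_BIT = 0x80


def parse_dump(text):
    """Split dump output into one dict per 'Doc seq:' record."""
    docs = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Doc seq:"):
            docs.append({"seq": int(line.split(":", 1)[1]), "deleted": False})
        elif not docs:
            continue  # ignore anything before the first record
        elif line == "doc deleted":
            docs[-1]["deleted"] = True
        else:
            m = re.match(r"(id|content_meta|size \(on disk\)): (.+)$", line)
            if m:
                key, value = m.groups()
                docs[-1][key] = value if key == "id" else int(value)
    return docs


def corrupt_tombstones(docs):
    """Return ids of deleted docs that still have a body or the compressed bit."""
    return [
        d["id"]
        for d in docs
        if d["deleted"]
        and (d.get("size (on disk)", 0) > 0
             or d.get("content_meta", 0) & COMPRESSED_BIT)
    ]


# Records abbreviated from the dumps in this ticket.
SOURCE = """\
Doc seq: 4
     id: test1
     content_meta: 3
     size (on disk): 0
     doc deleted
"""
TARGET = """\
Doc seq: 4
     id: test1
     content_meta: 131
     size (on disk): 15
     doc deleted
"""

print("source:", corrupt_tombstones(parse_dump(SOURCE)))
print("target:", corrupt_tombstones(parse_dump(TARGET)))
```

Only the target record is flagged, because it is deleted yet has a non-zero size on disk.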
      

      Attached a pcap showing the DelWithMeta requests being sent by XDCR.

      This appears to be caused by 5.x not correctly handling the format of the DelWithMeta packet sent by 4.5.1.

      In theory this issue has no impact (as the docs are deleted), but in practice it completely breaks rebalance in Couchbase Server 5.5.x onwards.
      This is because the value on disk is now snappy compressed (instead of being empty), so the datatype when reading the document off of disk is set to SNAPPY (0x2).
      This then means that all subsequent rebalances and internal replications (which backfill) fail for that document with the following error:

      2018-08-31T15:29:54.141273Z WARNING 185: Invalid format specified for DCP_DELETION - 4 - closing connection packet:mcbp::header: magic:0x80, opcode:0x58, keylen:23, extlen:21, datatype:0x2, specific:806, bodylen:51, opaque:0x21, rawextras:0000007c7e00000060f15b895e190
      

      The error above occurs because SNAPPY is not a valid datatype for a DCP_DELETION (this situation should never arise).
      Once documents are corrupted in this way, you cannot rebalance, and you also risk data loss upon failover (if your replication streams don't stay completely in memory).
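
To make the rejected packet concrete, here is a minimal sketch that rebuilds the 24-byte memcached binary protocol header from the fields in the log line and checks for the SNAPPY bit on a DCP_DELETION. It illustrates the condition described above and is not the server's actual validator. The helper names are hypothetical, treating the logged `specific` field as the vBucket id is an assumption, and the CAS is not logged, so 0 is used as a placeholder.

```python
import struct

# Memcached binary protocol request header: 24 bytes, network byte order.
# magic, opcode, keylen, extlen, datatype, vbucket, bodylen, opaque, cas
HEADER = struct.Struct(">BBHBBHIIQ")

DCP_DELETION = 0x58      # opcode shown in the log line
DATATYPE_SNAPPY = 0x02   # SNAPPY datatype bit, as stated in this ticket


def parse_header(raw):
    """Unpack the fixed-size header into named fields."""
    names = ("magic", "opcode", "keylen", "extlen", "datatype",
             "vbucket", "bodylen", "opaque", "cas")
    return dict(zip(names, HEADER.unpack(raw[:HEADER.size])))


def is_snappy_deletion(header):
    """True for a DCP_DELETION whose datatype has the SNAPPY bit set."""
    return (header["opcode"] == DCP_DELETION
            and bool(header["datatype"] & DATATYPE_SNAPPY))


# Values taken from the log line; cas=0 is a placeholder (not logged).
raw = HEADER.pack(0x80, 0x58, 23, 21, 0x2, 806, 51, 0x21, 0)
header = parse_header(raw)
print(header)
print("snappy deletion:", is_snappy_deletion(header))
```

The same check with datatype 0x0, which is what an empty tombstone would carry, returns False.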

          People

            pavithra.mahamani Pavithra Mahamani (Inactive)
            matt.carabine Matt Carabine (Inactive)