Details
- Bug
- Resolution: Fixed
- Critical
- 5.5.1
- Security Level: Public
- None
- Untriaged
- Unknown
Description
XDCR from 4.5.1 (other versions have not been tested yet) to Couchbase Server 5.5.x appears to corrupt deleted documents on the target. This makes it impossible to rebalance and can lead to data loss (if replication streams have to reconnect and there is a failover).
Steps To Reproduce
- Set up a single-node 4.5.1 cluster and a single-node 5.5.1 cluster
- Create a bucket on each cluster
- Set up XDCR between 4.5.1 and 5.5.1 on this bucket
- Create a document
- Delete that document
After step 5, review the document on the source and target clusters:
Source (4.5.1)
Doc seq: 4
     id: test1
     rev: 4
     content_meta: 3
     size (on disk): 0
     cas: 1535966803550732288, expiry: 1535966802, flags: 0, datatype: 0, conflict_resolution_mode: 0
     doc deleted
     could not read document body: document not found
Target (5.5.1)
Doc seq: 4
     id: test1
     rev: 4
     content_meta: 131
     size (on disk): 15
     cas: 1535966803550732288, expiry: 1535966803, flags: 0, datatype: 0x00 (raw)
     doc deleted
     size: 5
     data: (snappy)
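The comparison above can be automated when many documents need checking. The following is a minimal sketch, not part of the original report: it parses dump text in the layout shown above and flags deleted documents that still carry an on-disk body or a compressed marker. The function names are hypothetical, and treating bit 0x80 of content_meta as the "compressed" flag is an assumption based on the values seen here (131 = 0x80 | 3 on the target versus 3 on the source). A flagged tombstone is only a candidate for inspection, since deleted documents can legitimately keep a body (for example extended attributes).

```python
import re

# Assumption: bit 0x80 of content_meta marks a compressed body on disk.
COMPRESSED_FLAG = 0x80

# Fields this sketch extracts from each record; other lines are ignored.
FIELD_RE = re.compile(r"(id|content_meta|size \(on disk\)): (.+)")


def parse_dump(text):
    """Split dump text into one dict per 'Doc seq:' record."""
    records = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("Doc seq:"):
            records.append({"seq": int(line.split(":", 1)[1]), "deleted": False})
        elif not records:
            continue  # skip anything before the first record
        elif line == "doc deleted":
            records[-1]["deleted"] = True
        else:
            match = FIELD_RE.match(line)
            if match:
                key, value = match.groups()
                records[-1][key] = value if key == "id" else int(value)
    return records


def suspicious_tombstones(records):
    """Return deleted records that have a non-empty or compressed body."""
    flagged = []
    for record in records:
        has_body = record.get("size (on disk)", 0) > 0
        compressed = bool(record.get("content_meta", 0) & COMPRESSED_FLAG)
        if record["deleted"] and (has_body or compressed):
            flagged.append(record)
    return flagged
```

Run against the two dumps above, this would flag the target record (content_meta 131, 15 bytes on disk) and leave the source record (content_meta 3, 0 bytes on disk) alone.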
I have attached a pcap showing the DelWithMeta requests being sent by XDCR.
This seems to be related to the format of the packet sent by 4.5.1 not being handled correctly by 5.x.
In theory this issue has no impact (as the docs are deleted), but in practice it completely breaks rebalance in Couchbase Server 5.5.x onwards.
This is because the value on disk is now snappy compressed (instead of being empty), so the datatype is set to SNAPPY (0x2) when the document is read from disk.
As a result, all subsequent rebalances and internal replications (which backfill) fail for that document with the following error:
2018-08-31T15:29:54.141273Z WARNING 185: Invalid format specified for DCP_DELETION - 4 - closing connection packet:mcbp::header: magic:0x80, opcode:0x58, keylen:23, extlen:21, datatype:0x2, specific:806, bodylen:51, opaque:0x21, rawextras:0000007c7e00000060f15b895e190
The error above occurs because SNAPPY is not a valid datatype for a DCP_DELETION (this situation should never happen).
This means that once you have corrupted documents in this state, you are unable to rebalance and also risk data loss upon failover (if your replication streams don't stay completely in-memory).
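To check whether other log lines match this failure, the header fields can be pulled out of the warning and tested for the SNAPPY bit. This is a minimal sketch with hypothetical helper names. It assumes the `name:value` header layout seen in the log line above and the standard memcached binary protocol layout in which the body holds extras, key and value. It encodes only the rule stated in this ticket (SNAPPY is not valid on a DCP_DELETION), not the server's full validation logic.

```python
import re

DCP_DELETION = 0x58      # opcode shown in the log line above
DATATYPE_SNAPPY = 0x02   # SNAPPY datatype bit, as described in this ticket

# Header fields to extract; values are logged as decimal or 0x-prefixed hex.
HEADER_FIELDS = ("magic", "opcode", "keylen", "extlen",
                 "datatype", "bodylen", "opaque")


def parse_header(log_line):
    """Extract the numeric packet header fields from a warning line."""
    header = {}
    for name in HEADER_FIELDS:
        match = re.search(rf"\b{name}:(0x[0-9a-fA-F]+|\d+)", log_line)
        if match:
            header[name] = int(match.group(1), 0)  # base 0 handles 0x prefix
    return header


def value_length(header):
    """Value bytes = total body minus key and extras."""
    return header["bodylen"] - header["keylen"] - header["extlen"]


def deletion_datatype_ok(datatype):
    """Simplified check: a deletion must not have the SNAPPY bit set."""
    return not (datatype & DATATYPE_SNAPPY)
```

For the packet in the log line above, this gives opcode 0x58, datatype 0x2 and a value length of 51 - 23 - 21 = 7 bytes, so the deletion carries a non-empty value marked as SNAPPY and fails the check.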