Details
-
Bug
-
Resolution: Fixed
-
Critical
-
6.5.0
-
Untriaged
-
Yes
Description
Steps to Reproduce
- Create a single node cluster of Couchbase Server 5.0.1
- Add a Couchbase Server 6.5.0 node and rebalance
- Create a bucket, my example uses test
- Create a document that resides on the 6.5.0 node, this command works if you followed my steps :
/opt/couchbase/bin/cbc create -U "couchbase://localhost/test" -u Administrator -P password testing1234 -V test
- Delete the document that was just created:
/opt/couchbase/bin/cbc rm -U "couchbase://localhost/test" -u Administrator -P password testing1234
- Add a new 6.5.0 node and mark the other 6.5.0 node for removal (so that you guarantee the document will move from one 6.5.x node to the other).
- Rebalance
Expected Result
Rebalance succeeds
Actual Result
Rebalance fails due to the error:
Rebalance exited with reason {mover_crashed,
|
{unexpected_exit,
|
{'EXIT',<0.21330.0>,
|
{{bulk_set_vbucket_state_failed,
|
[{'ns_1@10.143.171.103',
|
{'EXIT',
|
{{{{{child_interrupted,
|
{'EXIT',<24454.7397.0>,
|
socket_closed}},
|
All subsequent attempts to rebalance all fail.
Analysis
This is because DCP deletion packets between the two 6.5.0 nodes are being incorrectly handled:
2020-03-22T18:26:15.302067-07:00 WARNING 58: Invalid format specified for "DCP_DELETION" - Status: "Invalid arguments" - Closing connection. Packet:[{"bodylen":32,"cas":1584926732193038336,"datatype":"raw","extlen":21,"keylen":11,"magic":"ClientRequest","opaque":13,"opcode":"DCP_DELETION","vbucket":739}] Reason:"Request must include extras of length 18"
|
This then closes the DCP connection and ultimately leads to the rebalance failure.
The invalid DCP deletion packet is for the following doc:
Doc seq: 2
|
id: (collection:0x0:default) testing1234
|
rev: 2
|
content_meta: 3
|
size (on disk): 0
|
cas: 1584926732193038336, expiry: 1584926732, flags: 0, datatype: 0x00 (raw)
|
doc deleted (explicit)
|
DELETE_TIMES has not been negotiated as part of the DCP stream:
2020-03-22T18:26:15.108609-07:00 INFO (test) EventuallyPersistentEngine::dcpOpen: opening new DCP Consumer handler - stream name:replication:ns_1@10.143.171.102->ns_1@10.143.171.103:test, opaque:0, seqno:0, flags:0b000000000000000000000000000
|
00100 value:null
|
2020-03-22T18:26:15.108615-07:00 INFO 58: DCP connection opened successfully. INCLUDE_XATTRS [ 127.0.0.1:36102 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
|
This is not negotiated because there is still a 5.0.x node in the cluster that is not DELETE_TIMES aware, however from a packet capture (MB-38390.pcap) it seems that the 6.5 nodes are sending DCP_DELETION_V2 packets, despite it not being negotiated:
Couchbase Protocol, DCP (Key) Deletion Request, Opcode: 0x58, VBucket: 0x2e3
|
Magic: Request (0x80)
|
Opcode: DCP (Key) Deletion (0x58)
|
Key Length: 11
|
Extras Length: 21
|
Data Type: 0x00
|
VBucket: 739 (0x02e3)
|
[Value Length: 0]
|
Total Body Length: 32
|
Opaque: 0x0000002a
|
CAS: 0x15fe19cdc8b10000
|
Extras
|
Key: testing1234
|
This is obviously rejected by the consumer as it is expected an extras length of 18, rather than 21, as covered in https://github.com/couchbase/kv_engine/blob/master/docs/dcp/documentation/commands/deletion.md.
Note: Haven't attached cbcollect_infos as the issue is readily reproducible 100% of the time by following the steps, if needed I can go ahead and generate these though.
Workaround
This has the net effect that for any cluster where there are deleted documents, it is impossible to directly upgrade from Couchbase Server <5.5.0 to Couchbase Server 6.5.0.
The workaround is to upgrade to an intermediate version before moving to 6.5.0, such as Couchbase Server 6.0.4, which through testing does not show the same issue.