Details
- Improvement
- Resolution: Unresolved
- Critical
- 4.1.1
- Release Note
Description
When the vbucket files for a given bucket are wiped from a node and the node is restarted, it warms up, finds no vbuckets on disk, and creates new, empty ones.
This is fine and expected behaviour; however, it also has the knock-on effect of removing any data that previously existed on the replicas. The replicas should not have been affected by the data deletion, so in theory you could have failed over to them.
This causes unintended data loss, as the data ends up completely removed from both the active and the replica nodes.
The example used to reproduce this below is very contrived, but the problem has been experienced out in the field when the data directory was accidentally unmounted and lost.
I guess this happens because the recreated vbuckets now have sequence numbers of 0, so the replication streams then do a rollback to 0.
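The suspected mechanism can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not ep-engine code: the names `ToyVBucket`, `stream_request` and `reconnect_replica` are hypothetical, the document keys are made up, and the real DCP handshake (which compares vbucket UUIDs and failover logs) is reduced here to a single UUID and a high sequence number.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVBucket:
    """Hypothetical stand-in for a vbucket: one UUID plus items keyed by seqno."""
    uuid: int
    items: dict = field(default_factory=dict)  # seqno -> document key

    @property
    def high_seqno(self) -> int:
        return max(self.items, default=0)


def stream_request(active: ToyVBucket, replica_uuid: int, start_seqno: int) -> int:
    """Return the seqno the replica must roll back to before streaming.

    Assumption: the active accepts a resume point only when the history
    (UUID) matches and it actually holds that seqno; otherwise it asks
    for a rollback to 0.
    """
    if replica_uuid == active.uuid and start_seqno <= active.high_seqno:
        return start_seqno
    return 0


def reconnect_replica(active: ToyVBucket, replica: ToyVBucket) -> None:
    """Apply the requested rollback, then re-stream from the active."""
    rollback_to = stream_request(active, replica.uuid, replica.high_seqno)
    # Discard everything above the rollback point.
    replica.items = {s: k for s, k in replica.items.items() if s <= rollback_to}
    replica.uuid = active.uuid
    # Copy whatever the active holds beyond the rollback point.
    replica.items.update({s: k for s, k in active.items.items() if s > rollback_to})


# The replica still holds three items received from the old active.
replica = ToyVBucket(uuid=111, items={1: "doc_a", 2: "doc_b", 3: "doc_c"})
# The active's files were deleted, so warmup recreated an empty vbucket.
recreated_active = ToyVBucket(uuid=222)

reconnect_replica(recreated_active, replica)
print(len(replica.items))  # prints 0: the replica copy is gone as well
```

In this model the replica only keeps its data if the active can vouch for the replica's resume point, which an empty, freshly created vbucket never can.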
Edit - confirmed that the rollback is the reason for the replica data removal:
2016-07-26T16:23:54.167980Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 529) Received rollback request to rollback seq no. 0
2016-07-26T16:23:54.171120Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 529) Attempting to reconnect stream with opaque 22, start seq no 0, end seq no 18446744073709551615, snap start seqno 0, and snap end seqno 0
2016-07-26T16:23:54.171873Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 530) Received rollback request to rollback seq no. 0
2016-07-26T16:23:54.172193Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 530) Attempting to reconnect stream with opaque 23, start seq no 0, end seq no 18446744073709551615, snap start seqno 0, and snap end seqno 0
2016-07-26T16:23:54.173403Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 531) Received rollback request to rollback seq no. 0
2016-07-26T16:23:54.173505Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 531) Attempting to reconnect stream with opaque 24, start seq no 0, end seq no 18446744073709551615, snap start seqno 0, and snap end seqno 0
2016-07-26T16:23:54.174479Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 532) Received rollback request to rollback seq no. 0
2016-07-26T16:23:54.179006Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 532) Attempting to reconnect stream with opaque 25, start seq no 0, end seq no 18446744073709551615, snap start seqno 0, and snap end seqno 0
2016-07-26T16:23:54.179577Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 533) Received rollback request to rollback seq no. 0
2016-07-26T16:23:54.179878Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 533) Attempting to reconnect stream with opaque 26, start seq no 0, end seq no 18446744073709551615, snap start seqno 0, and snap end seqno 0
2016-07-26T16:23:54.180694Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 534) Received rollback request to rollback seq no. 0
2016-07-26T16:23:54.181424Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 534) Attempting to reconnect stream with opaque 27, start seq no 0, end seq no 18446744073709551615, snap start seqno 0, and snap end seqno 0
2016-07-26T16:23:54.181900Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 535) Received rollback request to rollback seq no. 0
2016-07-26T16:23:54.182250Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 535) Attempting to reconnect stream with opaque 28, start seq no 0, end seq no 18446744073709551615, snap start seqno 0, and snap end seqno 0
2016-07-26T16:23:54.183228Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 536) Received rollback request to rollback seq no. 0
2016-07-26T16:23:54.183589Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 536) Attempting to reconnect stream with opaque 29, start seq no 0, end seq no 18446744073709551615, snap start seqno 0, and snap end seqno 0
2016-07-26T16:23:54.184036Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 537) Received rollback request to rollback seq no. 0
2016-07-26T16:23:54.187127Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 537) Attempting to reconnect stream with opaque 30, start seq no 0, end seq no 18446744073709551615, snap start seqno 0, and snap end seqno 0
2016-07-26T16:23:54.187794Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 538) Received rollback request to rollback seq no. 0
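To check how many vbuckets were hit in a complete log, the rollback warnings can be collected with a short script. This is a minimal sketch that assumes the message format of the excerpt above; the function name `rollbacks_to_zero` is illustrative, and the two sample lines are copied from that excerpt.

```python
import re

# Matches the consumer-side warning shown in the excerpt above.
ROLLBACK_RE = re.compile(
    r"\((?P<bucket>[^)]+)\) DCP \(Consumer\) \S+ - "
    r"\(vb (?P<vb>\d+)\) Received rollback request to rollback seq no\. (?P<seqno>\d+)"
)


def rollbacks_to_zero(lines):
    """Return the sorted vbucket ids whose replica stream was told to roll back to 0."""
    vbuckets = set()
    for line in lines:
        match = ROLLBACK_RE.search(line)
        if match and int(match.group("seqno")) == 0:
            vbuckets.add(int(match.group("vb")))
    return sorted(vbuckets)


sample = [
    "2016-07-26T16:23:54.167980Z WARNING (travel-sample) DCP (Consumer) "
    "eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - "
    "(vb 529) Received rollback request to rollback seq no. 0",
    "2016-07-26T16:23:54.171120Z WARNING (travel-sample) DCP (Consumer) "
    "eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - "
    "(vb 529) Attempting to reconnect stream with opaque 22, start seq no 0, "
    "end seq no 18446744073709551615, snap start seqno 0, and snap end seqno 0",
]
print(rollbacks_to_zero(sample))  # prints [529]
```

Because the function accepts any iterable of lines, an open log file can be passed to it directly.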
Logs:
- https://s3.amazonaws.com/cb-engineering/MB-20294/collectinfo-2016-07-26T163345-ns_1%4010.142.111.101.zip
- https://s3.amazonaws.com/cb-engineering/MB-20294/collectinfo-2016-07-26T163345-ns_1%4010.142.111.102.zip
- https://s3.amazonaws.com/cb-engineering/MB-20294/collectinfo-2016-07-26T163345-ns_1%4010.142.111.103.zip
Steps to reproduce:
1. Create a 3-node cluster with 1 replica and load the travel-sample bucket onto the nodes
2. Ensure auto-failover is disabled
3. Stop the couchbase-server service on one of the nodes
4. Delete all of the vbucket files in the travel-sample data directory on the offline node
5. Start the couchbase-server service up again
6. Observe how the number of replica items on the other 2 nodes drops from ~10.4K to a much lower value once the node warms up; these items are now 'missing' from the replicas
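The observation in step 6 could also be scripted rather than read from the UI. The sketch below assumes the per-node bucket statistics REST endpoint on port 8091 and the `vb_replica_curr_items` stat; the function names and the credentials in the usage comment are hypothetical, and the exact response layout should be checked against the server version in use.

```python
import base64
import json
import urllib.parse
import urllib.request


def fetch_node_bucket_stats(node, bucket, user, password, port=8091):
    """Fetch one node's stats for a bucket (assumed per-node stats endpoint)."""
    node_id = urllib.parse.quote(f"{node}:{port}", safe="")
    url = f"http://{node}:{port}/pools/default/buckets/{bucket}/nodes/{node_id}/stats"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def latest_replica_items(stats):
    """Return the most recent vb_replica_curr_items sample from a stats response."""
    return stats["op"]["samples"]["vb_replica_curr_items"][-1]


def replica_items_dropped(before, after):
    """True when a node holds fewer replica items after the restart than before."""
    return after < before


# Usage (hypothetical credentials): sample once before step 3 and once after step 5.
# before = latest_replica_items(
#     fetch_node_bucket_stats("10.142.111.102", "travel-sample", "Administrator", "password"))
# ... stop the node, delete the vbucket files, start it again, wait for warmup ...
# after = latest_replica_items(
#     fetch_node_bucket_stats("10.142.111.102", "travel-sample", "Administrator", "password"))
# print(replica_items_dropped(before, after))
```

Sampling each of the two untouched nodes separately keeps the comparison aligned with step 6, which is about the replicas held by those nodes rather than a cluster-wide total.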
Issue Links
- blocks: MB-41257 [Jepsen] Linearizability failure during disk failure test (Closed)