Couchbase Server / MB-20294

Loss of data on replicas when active vbucket files are deleted and warmup completes

    Description

      When the vbucket files for a given bucket are wiped on a node and the node is restarted, it warms up, finds no vbuckets on disk, and creates new, empty ones.

      This is fine and expected behaviour. However, it also has the knock-on effect of removing the data that previously existed on the replicas. The replicas should not have been affected by the deletion, so in theory you could have failed over to them.
      The result is unintended data loss: the data ends up completely removed from both the active and the replica nodes.

      The example used to reproduce below is very contrived, but this has been experienced out in the field when the data directory was accidentally unmounted and lost.

      I guess this happens because the recreated vbuckets now have sequence numbers of 0, so the replication streams roll the replicas back to match.
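
The suspected mechanism can be illustrated with a toy model. This is a minimal sketch and not the actual DCP implementation: the class names, the `uuid` field standing in for a vbucket's history identifier, and the rollback rules are all simplifying assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ActiveVBucket:
    """Producer side (hypothetical model of the active copy)."""
    uuid: int        # identifies this branch of history (assumed field)
    high_seqno: int  # highest sequence number the active copy holds


@dataclass
class ReplicaVBucket:
    """Consumer side (hypothetical model of the replica copy)."""
    uuid: int
    high_seqno: int
    items: dict = field(default_factory=dict)  # key -> seqno of last mutation


def stream_request(active, replica):
    """Return None if the stream can resume, else the seqno to roll back to."""
    if replica.high_seqno == 0:
        return None  # an empty replica has nothing to roll back
    if replica.uuid != active.uuid:
        return 0  # the active copy does not recognise this history
    if replica.high_seqno > active.high_seqno:
        return active.high_seqno  # replica is ahead of the active copy
    return None


def connect(active, replica):
    """Open the stream, applying a rollback on the replica if one is requested."""
    rollback_to = stream_request(active, replica)
    if rollback_to is not None:
        # Discard every item newer than the rollback point.
        replica.items = {k: s for k, s in replica.items.items() if s <= rollback_to}
        replica.high_seqno = rollback_to
        replica.uuid = active.uuid
    return rollback_to
```

In this model, a replica holding data that connects to a freshly recreated active vbucket (new `uuid`, `high_seqno` of 0) is told to roll back to 0 and ends up empty, which matches the behaviour described in this ticket.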

      Edit - confirmed that is the reason for the replica data removal:

      2016-07-26T16:23:54.167980Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 529) Received rollback request to rollback seq no. 0
      2016-07-26T16:23:54.171120Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 529) Attempting to reconnect stream with opaque 22, start seq no 0, end seq no 18446744073709551615, snap start seqno 0, and snap end seqno 0
      2016-07-26T16:23:54.171873Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 530) Received rollback request to rollback seq no. 0
      2016-07-26T16:23:54.172193Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 530) Attempting to reconnect stream with opaque 23, start seq no 0, end seq no 18446744073709551615, snap start seqno 0, and snap end seqno 0
      2016-07-26T16:23:54.173403Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 531) Received rollback request to rollback seq no. 0
      2016-07-26T16:23:54.173505Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 531) Attempting to reconnect stream with opaque 24, start seq no 0, end seq no 18446744073709551615, snap start seqno 0, and snap end seqno 0
      2016-07-26T16:23:54.174479Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 532) Received rollback request to rollback seq no. 0
      2016-07-26T16:23:54.179006Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 532) Attempting to reconnect stream with opaque 25, start seq no 0, end seq no 18446744073709551615, snap start seqno 0, and snap end seqno 0
      2016-07-26T16:23:54.179577Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 533) Received rollback request to rollback seq no. 0
      2016-07-26T16:23:54.179878Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 533) Attempting to reconnect stream with opaque 26, start seq no 0, end seq no 18446744073709551615, snap start seqno 0, and snap end seqno 0
      2016-07-26T16:23:54.180694Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 534) Received rollback request to rollback seq no. 0
      2016-07-26T16:23:54.181424Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 534) Attempting to reconnect stream with opaque 27, start seq no 0, end seq no 18446744073709551615, snap start seqno 0, and snap end seqno 0
      2016-07-26T16:23:54.181900Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 535) Received rollback request to rollback seq no. 0
      2016-07-26T16:23:54.182250Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 535) Attempting to reconnect stream with opaque 28, start seq no 0, end seq no 18446744073709551615, snap start seqno 0, and snap end seqno 0
      2016-07-26T16:23:54.183228Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 536) Received rollback request to rollback seq no. 0
      2016-07-26T16:23:54.183589Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 536) Attempting to reconnect stream with opaque 29, start seq no 0, end seq no 18446744073709551615, snap start seqno 0, and snap end seqno 0
      2016-07-26T16:23:54.184036Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 537) Received rollback request to rollback seq no. 0
      2016-07-26T16:23:54.187127Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 537) Attempting to reconnect stream with opaque 30, start seq no 0, end seq no 18446744073709551615, snap start seqno 0, and snap end seqno 0
      2016-07-26T16:23:54.187794Z WARNING (travel-sample) DCP (Consumer) eq_dcpq:replication:ns_1@10.142.111.102->ns_1@10.142.111.103:travel-sample - (vb 538) Received rollback request to rollback seq no. 0
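
To gauge how many vbuckets were affected, log lines like the ones above can be scanned for rollback requests. The following is a rough sketch: the regular expression is derived only from the line format shown in this ticket and may not match other versions, and the function names are made up for illustration.

```python
import re

# Pattern based solely on the "Received rollback request" lines quoted above.
ROLLBACK_RE = re.compile(
    r"eq_dcpq:replication:(?P<src>\S+)->(?P<dst>[^:\s]+):(?P<bucket>\S+) - "
    r"\(vb (?P<vb>\d+)\) Received rollback request to rollback seq no\. (?P<seqno>\d+)"
)


def parse_rollbacks(lines):
    """Return (source node, destination node, bucket, vbucket id, seqno) tuples."""
    found = []
    for line in lines:
        match = ROLLBACK_RE.search(line)
        if match:
            found.append((
                match.group("src"),
                match.group("dst"),
                match.group("bucket"),
                int(match.group("vb")),
                int(match.group("seqno")),
            ))
    return found


def vbuckets_rolled_back_to_zero(lines):
    """Return the sorted vbucket ids that were asked to roll back to seqno 0."""
    return sorted({vb for _, _, _, vb, seqno in parse_rollbacks(lines) if seqno == 0})
```

Passing the lines of the relevant log file to `vbuckets_rolled_back_to_zero` gives the list of replica vbuckets that were reset.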
      

      Logs:

      Steps to reproduce:

      1. Create a 3-node cluster with 1 replica and load the travel-sample bucket onto the nodes
      2. Ensure auto-failover is disabled
      3. Stop the couchbase-server service on one of the nodes
      4. Delete all of the vbucket files in the travel-sample data directory on the offline node
      5. Start the couchbase-server service up again
      6. Observe how the number of replica items on the other 2 nodes drops from ~10.4K to a much lower value once the node warms up. These items are now 'missing' from the replicas
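
The check in step 6 can be made mechanical by comparing per-node replica item counts taken before step 3 and after step 5. This is a small sketch under assumptions: the function name and the 50% threshold are arbitrary, and how the counts are collected is left to the operator (for example from the UI, or from a statistic such as `vb_replica_curr_items` if it is available on the version in use).

```python
def replica_drops(before, after, threshold=0.5):
    """Return nodes whose replica item count fell by more than `threshold`.

    `before` and `after` map a node name to its replica item count.
    `threshold` is a fraction of the original count (0.5 means a 50% drop).
    """
    dropped = {}
    for node, old in before.items():
        new = after.get(node, 0)  # a node missing from `after` counts as 0
        if old > 0 and (old - new) / old > threshold:
            dropped[node] = (old, new)
    return dropped
```

Any node returned by `replica_drops` has lost a large share of its replica items across the restart and is worth checking against the rollback messages in the logs.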

            People

              Abhijeeth.Nuthan Abhijeeth Nuthan
              matt.carabine Matt Carabine (Inactive)