  1. Couchbase Server
  2. MB-48583

XDCR - xmem connection repair does not re-check VBUUID

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • Morpheus
    • 6.5.1, 6.6.0, 6.6.1, 6.6.2, 6.5.2, 6.5.0, 6.6.3, 7.0.0, 7.0.1, 7.0.2, 7.1.0
    • XDCR
    • None
    • Untriaged
    • 1
    • No

    Description

      When resuming a pipeline, XDCR picks a valid checkpoint by verifying that the checkpoint's VBUUID is accepted by both the source and the target. XDCR resumes the pipeline only if both source and target agree that the VBUUID is valid.
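
The checkpoint selection described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the goxdcr implementation: the `Checkpoint` type, the `pickCheckpoint` function, and the two validation callbacks are hypothetical names, and the checkpoint records are assumed to be ordered newest first.

```go
package main

import "fmt"

// Checkpoint is a simplified, hypothetical checkpoint record for one vBucket.
type Checkpoint struct {
	Seqno        uint64
	SourceVBUUID uint64
	TargetVBUUID uint64
}

// pickCheckpoint returns the first checkpoint (records are assumed to be
// ordered newest first) whose VBUUIDs are accepted by both sides.
// sourceOK and targetOK are stand-ins for the real validation calls.
func pickCheckpoint(ckpts []Checkpoint, sourceOK, targetOK func(vbuuid uint64) bool) (Checkpoint, bool) {
	for _, c := range ckpts {
		if sourceOK(c.SourceVBUUID) && targetOK(c.TargetVBUUID) {
			return c, true
		}
	}
	// No checkpoint is acceptable to both sides.
	return Checkpoint{}, false
}

func main() {
	ckpts := []Checkpoint{
		{Seqno: 900, SourceVBUUID: 11, TargetVBUUID: 22}, // rejected by the target in this example
		{Seqno: 500, SourceVBUUID: 11, TargetVBUUID: 21},
	}
	sourceOK := func(v uint64) bool { return v == 11 }
	targetOK := func(v uint64) bool { return v == 21 }

	c, ok := pickCheckpoint(ckpts, sourceOK, targetOK)
	fmt.Println(c.Seqno, ok)
}
```

In this sketch the newest checkpoint is skipped because the target does not accept its VBUUID, so the pipeline would resume from the older one.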

      XDCR's XMEM (outgoing) nozzle contains a repair feature. When XDCR receives an EOF error on a connection to a target node, it does not restart the pipeline. Instead, it closes the connection and opens a new one to the same target node. This logic has always assumed that an EOF received while reading from a network connection indicates a network issue.

      However, an EOF can also mean that the target node has crashed or restarted ungracefully. In that case, when the crashed node restarts and recovers from the unclean shutdown, a new failover log entry is generated.
      (See https://github.com/couchbase/kv_engine/blob/master/docs/dcp/documentation/failure-scenarios.md)
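
To make the effect on the VBUUID concrete, here is a minimal model of a failover log. It assumes the log is a list of (VBUUID, seqno) entries with the newest entry first, and that recovery from an unclean shutdown adds a new entry with a fresh VBUUID. The type and function names are hypothetical and the values are made up for illustration.

```go
package main

import "fmt"

// FailoverEntry is one (VBUUID, seqno) pair in a vBucket's failover log.
type FailoverEntry struct {
	VBUUID uint64
	Seqno  uint64
}

// FailoverLog is assumed to be ordered newest entry first.
type FailoverLog []FailoverEntry

// CurrentVBUUID returns the VBUUID of the newest entry, or false if the
// log is empty.
func (l FailoverLog) CurrentVBUUID() (uint64, bool) {
	if len(l) == 0 {
		return 0, false
	}
	return l[0].VBUUID, true
}

// recoverFromUncleanShutdown models the new entry that is created when a
// node comes back after a crash. newVBUUID and persistedSeqno are supplied
// by the caller here; a real server would generate and look them up itself.
func recoverFromUncleanShutdown(log FailoverLog, newVBUUID, persistedSeqno uint64) FailoverLog {
	return append(FailoverLog{{VBUUID: newVBUUID, Seqno: persistedSeqno}}, log...)
}

func main() {
	log := FailoverLog{{VBUUID: 100, Seqno: 0}}
	before, _ := log.CurrentVBUUID()

	log = recoverFromUncleanShutdown(log, 200, 750)
	after, _ := log.CurrentVBUUID()

	fmt.Println("VBUUID before crash:", before)
	fmt.Println("VBUUID after recovery:", after)
}
```

A client that cached the VBUUID from before the crash would now hold a value that no longer matches the head of the target's failover log.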

      When the source XDCR's XMEM nozzle repairs the connection to the restarted node, it does not know that the target has changed its VBUUID, and it keeps replicating data.

      Currently, XDCR detects the changed VBUUID only when a regularly scheduled checkpoint takes place. During the checkpointing operation, it gets the failover log and compares it with the last known checkpoint. Only then does it realize that a failover occurred and restart the pipeline, which falls back to an earlier checkpoint and re-replicates the data that was lost during the target node crash.

      The correct behavior is for XMEM to check the failover log during the repair operation, so that it can restart the pipeline as soon as it realizes that the connection error is failover related rather than network related. This would also shrink the window for missing data.

            People

              neil.huang Neil Huang
              neil.huang Neil Huang