Couchbase Server / MB-7823

System Test: Rebooting a destination node causes "Target database out of sync. Try to increase max_dbs_open at the target's server" 2

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 2.0.1
    • Fix Version/s: 2.1.0
    • Component/s: XDCR
    • Security Level: Public
    • Labels:
      None
    • Environment:
      2.0.1-160 Linux

      Description

      Hi Junyi,

      On rebooting a node in the destination cluster, do we expect replication to that node to restart?
      Whether XDCR re-replicates depends on whether there are any open checkpoints left to replicate. If the cluster has no incoming mutations and no incoming or outgoing XDCR traffic, we infer that it is in steady state.
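      The steady-state inference above amounts to a simple predicate. A minimal sketch, assuming hypothetical stat names (these are not actual Couchbase stat keys):

      ```python
      # Hypothetical sketch of the steady-state inference described above;
      # the stat names below are illustrative assumptions, not real
      # Couchbase stat keys.
      def cluster_in_steady_state(stats):
          """Infer steady state: no incoming mutations and no XDCR traffic."""
          return (stats.get("incoming_mutations", 0) == 0
                  and stats.get("xdcr_ops_in", 0) == 0
                  and stats.get("xdcr_ops_out", 0) == 0)
      ```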

      We are seeing some strange behavior:

      Source: sending out data in bursts
      Destination: doing mainly gets

      This is a unidirectional replication from Source->Destination. The cluster had no incoming load after replicating 33M items. Both source and destination clusters were in steady state.
      Rebooted one node on destination.

      The source replication is failing with the error "Target database out of sync. Try to increase max_dbs_open at the target's server."

      What is the expected behaviour when a node on either the source or destination is rebooted? Why do we see these bursts of XDCR traffic on the source?

      Adding screenshots from the source and destination cluster.

      Links:
      Source: http://ec2-107-22-40-124.compute-1.amazonaws.com:8091/
      Destination: http://ec2-54-235-229-199.compute-1.amazonaws.com:8091/


        Activity

        junyi Junyi Xie (Inactive) added a comment -

        Ketaki,

        1) Reboot a source node: all replication originating from that node will shut down instantly, and after the node restarts, each replicator on that node will resume from the LAST successful checkpoint, which in the worst case means rescanning all mutations replicated in the past 30 minutes.

        2) Reboot a target node: replicators in the source cluster may find they are unable to talk to the rebooted node; each such replicator will crash, and a new one will restart 30 seconds later from the last successful checkpoint. If the replicator is performing a checkpoint at the time, it will fail with the error

        "Target database out of sync. Try to increase max_dbs_open at the target's server"

        and the replicator will shut itself down, restart 30 seconds later, and rescan from the last successful checkpoint.

        In either case, we will see a burst of "mutations to replicate"; this is the number of mutations since the last checkpoint. But it may drop down very quickly, since we only need to rescan the data without actually replicating it.
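        The restart-from-checkpoint behaviour described above can be sketched as follows. This is a minimal illustration; the Replicator class, the integer-seqno model, and the method names are assumptions for the sketch, not Couchbase XDCR internals:

        ```python
        # Minimal sketch of restart-from-checkpoint, as described above.
        # All names here (Replicator, run_once, seqnos as integers) are
        # illustrative assumptions, not Couchbase XDCR internals.

        RETRY_DELAY_SECS = 30  # a crashed replicator restarts 30 seconds later

        class Replicator:
            def __init__(self, mutations, target):
                self.mutations = mutations      # ordered seqnos of source mutations
                self.target = target            # seqnos already present on the target
                self.checkpoint_seqno = 0       # last successful checkpoint

            def run_once(self, fail_at=None):
                """Rescan from the last checkpoint; return how many mutations
                were actually re-sent (already-replicated ones are skipped)."""
                sent = 0
                for seqno in self.mutations:
                    if seqno <= self.checkpoint_seqno:
                        continue                # covered by the checkpoint: skip
                    if fail_at is not None and seqno == fail_at:
                        # e.g. the target node rebooted mid-replication; the
                        # checkpoint is NOT advanced, so the next run rescans
                        raise ConnectionError("target node unreachable")
                    if seqno not in self.target:
                        self.target.add(seqno)  # only missing mutations are sent
                        sent += 1
                self.checkpoint_seqno = self.mutations[-1]
                return sent
        ```

        This is also why the "mutations to replicate" counter spikes to the number of mutations since the last checkpoint but falls quickly: the rescan skips everything the target already has.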

        maria Maria McDuff (Inactive) added a comment -

        Ketaki, is this still an issue?


          People

          • Assignee: junyi Junyi Xie (Inactive)
          • Reporter: ketaki Ketaki Gangal
          • Votes: 0
          • Watchers: 4

            Dates

            • Created:
              Updated:
              Resolved:

              Gerrit Reviews

              There are no open Gerrit changes