Couchbase Server / MB-20337

XDCR does not send all of the mutations when the target cluster is rebalancing


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Critical
    • Resolution: Won't Fix
    • Affects Version: 4.5.1
    • Fix Version: 7.0.0
    • Component: XDCR
    • Labels: None

    Description

      Two clusters, each with 2 nodes, with unidirectional XDCR.

      When the target cluster is being rebalanced, the mutations do not get sent:

      chanages_left_during_rebalance.png

      Three to four minutes after the rebalance finishes there is a massive spike in XDCR operations, at which point replication seems to start working again.

      The bucket has roughly 200K items, and during the rebalance only a small number of operations is being sent to the cluster:

       

      cbc-pillowfight -U couchbase://10.111.151.101/default -I 2000 -m 1024 -M 1024 -p xdcr --rate-limit 10
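
      For reference, the same workload with the flags spelled out, based on the standard cbc-pillowfight options (the duplicated -m in the originally pasted command is read here as -M, the maximum value size):

      # Load generator used for the test, with the flags annotated
      # (values copied from the command above):
      #   -U            connection string for the source cluster and bucket
      #   -I 2000       number of distinct items to cycle through
      #   -m/-M 1024    minimum/maximum value size in bytes, i.e. fixed 1 KB documents
      #   -p xdcr       key prefix for the generated document IDs
      #   --rate-limit  cap the workload at roughly 10 operations per second
      cbc-pillowfight -U couchbase://10.111.151.101/default -I 2000 \
        -m 1024 -M 1024 -p xdcr --rate-limit 10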

       

       

      Attachments


        Activity

          pvarley Patrick Varley added a comment - I was using build 4.5.1-2785 to do this test.

          pvarley Patrick Varley added a comment - The same behaviour is seen when the source node is rebalanced too.
          yu Yu Sui (Inactive) added a comment - edited

          This is the expected behavior.

          As of now, when XDCR detects topology changes on either the source or the target side, it continues running for a pre-defined period, which is 5 minutes by default, performs a checkpoint, and then restarts. The rationale for this approach is:
          1. Replication currently does not have an easy way to handle topology changes without restarting.
          2. Replication restart is a resource-heavy operation – a checkpoint needs to be done, DCP connections and memcached connections need to be closed and then re-established, etc. If we restarted right away after detecting topology changes, we might restart so frequently that few resources would be left for actual data replication.
          The time window in which replication continues running after detecting topology changes is intentionally put in to ensure that we get a reasonable amount of data replication done each time before we restart. Replication behavior in this window is less than optimal, though, since replication is still using the source and target topology that it was started with, which has become stale:
          1. If a vbucket has been moved around in the target cluster, replication will continue to send the mutations in that vbucket to the original owner of the vbucket. This will result in NOT_MY_VBUCKET errors, and mutations in that vbucket will not get replicated. This is the behavior seen in this MB, MB-20337. The vbuckets that get moved around in the time window are typically a very small portion of all vbuckets, and replication on vbuckets that have not been moved around can still proceed. This is why we think keeping replication running is a worthwhile effort.
          2. XDCR runtime statistics can become inaccurate when the source topology changes, since statistics collection is based on the stale source topology. This is the behavior seen in MB-20338.
          Both of these issues will disappear after replication gets restarted. They may re-surface if the topology change is still ongoing and replication needs to be restarted again. After topology changes have stopped and replication gets restarted for the last time, things will be consistent and there will be no data loss. This is what we observed in our tests.
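
          A minimal way to watch this window from the source side is sketched below, on the assumption that the aggregate replication_changes_left stat is exposed in the source bucket's stats REST output (host, bucket, and credentials are placeholders):

          # Poll the outbound XDCR backlog on the source cluster every 10 seconds.
          # The backlog should plateau while the target rebalance is in progress and
          # then drain in a burst once replication restarts after the delay window.
          while true; do
            curl -s -u Administrator:welcome \
              http://127.0.0.1:8091/pools/default/buckets/default/stats |
              grep -o '"replication_changes_left":\[[^]]*\]' | tail -c 120
            sleep 10
          done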

          We understand that such behavior may seem counter-intuitive and problematic to some customers. We have made several parameters configurable to allow customers to tune this behavior if that helps. The most useful parameters are:
          TopologyChangeCheckInterval – the interval at which we check for source and target topology changes. The default setting is 10 seconds.
          MaxTopologyChangeCountBeforeRestart – the number of topology changes seen before replication is restarted. The default setting is 30.
          With the default settings, after we see a source or target topology change, we will wait for 30 * 10 = 300 seconds = 5 minutes before we restart replication. This is the reason for the "three to four minutes after the rebalance finishes there is a massive spike in XDCR operations where it seems to start working again" observation in this MB. If needed, the second parameter, and even the first, can be tuned down to reduce the length of the time window in which we delay the replication restart. Just bear in mind that this can have a negative impact on replication throughput due to the overhead of replication restarts.

          An example command to change the parameters is as follows:
          curl -X POST -u Administrator:welcome http://127.0.0.1:13000/xdcr/internalSettings -d TopologyChangeCheckInterval=5 -d MaxTopologyChangeCountBeforeRestart=10

          Note that the goxdcr process will be restarted automatically after such a command is issued. The command should be issued before replication is started, to avoid interference.
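
          To confirm the change, the current values can be read back from the same endpoint (assuming the internal-settings endpoint also answers GET; host, port, and credentials are reused from the example above). With the values shown, the restart delay drops from 30 * 10 = 300 seconds to 10 * 5 = 50 seconds:

          # Read back the goxdcr internal settings to verify the new values
          # (same endpoint, host/port, and credentials as the POST example above).
          curl -u Administrator:welcome http://127.0.0.1:13000/xdcr/internalSettings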


          pvarley Patrick Varley added a comment - Given that this has come up a few times in support and the behaviour here is not what customers expect, this information needs to be put into the release notes and our documentation. Please work with the Docs team to get this done.

          yu Yu Sui (Inactive) added a comment - Anil, please check if the goxdcr rebalance behavior and the explanation need to be included in the release notes and documentation.

          anil Anil Kumar (Inactive) added a comment - Created DOC-1730 ticket to release note it and update our documentation. Assigning the issue back to you.
          jliang John Liang added a comment - Move to spock.next for now. Depending on resources, we could move this up to spock.

          People

            Assignee: lilei.chen Lilei Chen
            Reporter: pvarley Patrick Varley
            Votes: 0
            Watchers: 6

