Couchbase Server / MB-6939

XDC queue grows and checkpoint commit failures in bi-directional XDCR with front-end workload

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.0
    • Fix Version/s: 2.0
    • Component/s: XDCR
    • Security Level: Public
    • Labels:
      None
    • Environment:
      2.0-1856
      Bidirectional replication
      1024 vbuckets
      EC2 centos

      Description

      • Set up bidirectional replication between two 8:8 clusters on bucket b1.
      • Set up a small front-end load on cluster1 and cluster2: 4K ops/sec and 6K ops/sec respectively.
        [Load contains creates, updates, and deletes.]
      • For the first 40M items, replication works as expected and the replication lag is small.
      • Delete the replication from cluster2 to cluster1, then recreate it.
        [Expected behaviour: stop/start replication.]

      We expect that XDCR will stop/start replication with the above step.
      The last committed checkpoint will be checked and replication will continue from the last committed checkpoint.
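
      As a rough illustration of the expected resume behaviour, here is a minimal sketch (the helpers read_last_checkpoint and replicate_from are hypothetical names for illustration, not actual XDCR code):

        def resume_replication(vbuckets, read_last_checkpoint, replicate_from):
            """Sketch of the expected stop/start behaviour: after a replication is
            deleted and recreated, each vbucket should pick up from its last
            committed checkpoint instead of re-replicating everything."""
            for vb in vbuckets:
                ckpt = read_last_checkpoint(vb)      # last committed checkpoint, or None
                start_seq = ckpt.seq if ckpt else 0  # no checkpoint: start from the beginning
                replicate_from(vb, start_seq)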

      We are noticing a huge number of gets (~30K ops/sec) and far fewer sets (2-3K ops/sec) on the other cluster.

      • The XDC queue is continuously growing, from under 500K to nearly 7M items over a period of 2-3 hours.

      • Seeing continuous checkpoint_failures on both the XDC queues.

      The disk write queue on cluster1 is high, ~2-3M items, while the drain rate is fairly small, ~30K.

      Items are not being drained fast enough, so the disk write queue keeps filling up.
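
      For a rough sense of why the checkpoints keep failing here, a back-of-the-envelope calculation with the numbers above (a sketch only; it assumes the open checkpoint can only be persisted after the items already queued ahead of it are flushed, which matches the explanation later in this ticket):

        # Rough back-of-the-envelope using the numbers reported above.
        disk_write_queue = 2_500_000      # ~2-3M items backed up on cluster1
        drain_rate = 30_000               # ~30K items/sec drained to disk

        seconds_to_flush_backlog = disk_write_queue / drain_rate
        print(f"~{seconds_to_flush_backlog:.0f}s to flush the current backlog")  # ~83s

        # The XDCR checkpoint timeout is only 10 seconds (see the defaults below), so a
        # checkpoint issued behind a backlog of this size cannot persist in time, and any
        # incoming write rate above the drain rate only makes the backlog worse.
        XDCR_CAPI_CHECKPOINT_TIMEOUT = 10
        print(seconds_to_flush_backlog > XDCR_CAPI_CHECKPOINT_TIMEOUT)           # True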

      Adding screenshots from both clusters.

      The default values currently are -
      XDCR_CHECKPOINT_INTERVAL:300
      XDCR_CAPI_CHECKPOINT_TIMEOUT:10

      @Junyi: I've stopped the front-end load on both clusters now and I have passed on the cluster access.
      Let me know if you need additional information.


        Activity

        ketaki Ketaki Gangal created issue -
        ketaki Ketaki Gangal added a comment -

        Cluster2: http://ec2-54-245-1-10.us-west-2.compute.amazonaws.com:8091/
        Cluster1: http://ec2-50-18-16-89.us-west-1.compute.amazonaws.com/
        junyi Junyi Xie (Inactive) added a comment -

        The title is misleading. It is NOT the deleting/restarting of XDCR that caused the checkpoint commit failures (see my explanation below). I am modifying the title.
        junyi Junyi Xie (Inactive) made changes -
        Summary: "Delete/Recreate replication on Bidirectional setup, causes continously growing XDC queue and checkpoint commit failures." → "observe growing XDC queue and checkpoint commit failures in bi-directional XDCR with front-end workload"
        junyi Junyi Xie (Inactive) added a comment - edited

        Look at your clusters. In your test case, both clusters have a front-end workload on top of ongoing bidirectional XDCR on the same bucket, and the XDCR workload is much heavier than the front-end workload. Also, there is little chance to de-duplicate in ep_engine since the front-end workloads have different key sets. The write queue on both clusters is constantly around 100K-500K items per node, while the disk drain rate is only 2-4K/sec per node. Today the XDCR checkpointing timeout is 10 seconds: if ep_engine is unable to persist the open checkpoint issued by XDCR within 10 seconds, we give up, skip this checkpoint, raise a "target commit failure" error, and move on without a checkpoint. In your test case we are apparently unable to persist the XDCR checkpoint within 10 seconds in most cases. This is why you see checkpoint failures on the UI on both sides. The root cause is that the drain rate is unable to keep up with your workload (both XDCR and front-end).
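
        For readers less familiar with the mechanism, here is a simplified sketch of the give-up-and-skip behaviour described above (Python pseudocode with hypothetical helper names; the real implementation is Erlang in ns_server):

          import time

          CHECKPOINT_TIMEOUT = 10  # seconds, i.e. XDCR_CAPI_CHECKPOINT_TIMEOUT

          def try_commit_checkpoint(vb, is_persisted, poll_interval=0.5):
              """Wait up to CHECKPOINT_TIMEOUT for ep_engine to persist the open
              checkpoint for this vbucket; on timeout, skip the checkpoint and surface
              a 'target commit failure', but keep replicating."""
              deadline = time.time() + CHECKPOINT_TIMEOUT
              while time.time() < deadline:
                  if is_persisted(vb):          # hypothetical persistence-stat poll
                      return "checkpoint_ok"
                  time.sleep(poll_interval)
              # Give up: no checkpoint is recorded for this interval, so a later restart
              # replays more work, but replication itself continues.
              print(f"target commit failure for vbucket {vb}")
              return "checkpoint_commit_failure"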

        Per Chiyoung, ep_engine now has a priority checkpoint command, which is used in rebalance. However, the XDCR use case is a bit different, since we need to issue 32 concurrent checkpoints at the same time per node to ep_engine. If XDCR issues priority checkpoints directly, the performance impact on ep_engine as well as on rebalance is unknown to both Chiyoung and me. Given the risk, it seems better to us to postpone this issue to post-2.0. Again, the root cause is that the disk drain rate is not fast enough compared with the workload in this test case.
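
        To illustrate the concurrency in question, a toy sketch of capping how many vbucket checkpoint requests are outstanding at once (the cap of 32 matches the per-node figure above; request_checkpoint is a hypothetical async call, not a real API):

          import asyncio

          MAX_CONCURRENT_CHECKPOINTS = 32   # per-node figure discussed above

          async def checkpoint_all(vbuckets, request_checkpoint):
              """Issue checkpoint requests for many vbuckets, but keep at most
              MAX_CONCURRENT_CHECKPOINTS outstanding against ep_engine at any time."""
              sem = asyncio.Semaphore(MAX_CONCURRENT_CHECKPOINTS)

              async def one(vb):
                  async with sem:
                      return await request_checkpoint(vb)

              return await asyncio.gather(*(one(vb) for vb in vbuckets))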

        At this time, I think what you can do is to increase the checkpoint interval and timeout. Say,

        XDCR_CHECKPOINT_INTERVAL = 1800
        XDCR_CAPI_CHECKPOINT_TIMEOUT = 60

        That means we wait up to 60 seconds for a checkpoint every 30 minutes (1800 seconds). The benefits are:

        1) We increase the chance of getting a successful checkpoint. It does not make sense to keep issuing checkpoints that always fail.

        2) The aggregate waiting time is still the same fraction as before (60 secs per 30 min vs. 10 secs per 5 min), so there is no increase in the time overhead of XDCR; see the quick check below.
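
        A quick check of that overhead claim (plain arithmetic, nothing Couchbase-specific):

          old = 10 / 300      # 10s timeout every 5 minutes
          new = 60 / 1800     # 60s timeout every 30 minutes
          print(f"old: {old:.1%}, new: {new:.1%}")   # both ~3.3%, so no added XDCR overhead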

        BTW, this should not be a blocker. The system is doing what it is supposed to do; we have simply reached the limits of the current concurrency design in your test case.

        junyi Junyi Xie (Inactive) made changes -
        Priority: Blocker [ 1 ] → Critical [ 2 ]
        farshid Farshid Ghods (Inactive) added a comment -

        Junyi, Chiyoung, Damien, and Ketaki had a discussion about this earlier.

        More from Chiyoung:

        Pavel, Ketaki,

        The toy build that uses the vbucket flushing prioritization for XDCR checkpointing is now available:

        http://builds.hq.northscale.net/latestbuilds/couchbase-server-community_toy-chiyoung-x86_64_2.0.0-1501-toy.rpm

        As we discussed, please vary the number of concurrent vbucket checkpointing processes between 8, 16, and 32. In addition, we may need to vary the checkpointing interval, because its default of 5 minutes might be too frequent and expensive.

        Thanks,
        Chiyoung

        steve Steve Yen added a comment -

        Hi Pavel, just so you have awareness of this.

        Please also assign to Ketaki / work together on this.

        Thanks,
        Steve

        steve Steve Yen made changes -
        Assignee: Junyi Xie [ junyi ] → Pavel Paulau [ pavelpaulau ]
        steve Steve Yen added a comment -

        latest news/update...

        @pavelpaulau

        Need a bi-directional performance test, 4-node to 4-node, of the above toy build with its default setting of 32 concurrent vbucket checkpoint processes (MAX_CONCURRENT_REPS_PER_DOC), with a front-end workload of 16K ops/sec per cluster, 50% mutations, and no views.

        steve Steve Yen made changes -
        Sprint Priority 2.5
        pavelpaulau Pavel Paulau made changes -
        Assignee: Pavel Paulau [ pavelpaulau ] → Ketaki Gangal [ ketaki ]
        junyi Junyi Xie (Inactive) made changes -
        Assignee: Ketaki Gangal [ ketaki ] → Junyi Xie [ junyi ]
        steve Steve Yen added a comment - edited

        See the Yammer conversation and graphs here on the results of Pavel's experiment...

        https://www.yammer.com/couchbase.com/#/Threads/show?threadId=224449346

        steve Steve Yen added a comment -

        More explanation on the toy-build change from Junyi...

        The only difference between the toy build and build 1858 is that we use a priority checkpoint to persist the XDCR checkpoint in the toy build, while in 1858 we use a normal checkpoint.

        The issue with the normal checkpoint is that, given the current drain rate, it is almost impossible to persist a checkpoint within 10 seconds in a large-scale test. The negative results are: 1) it causes a bunch of "target commit error"s at the source; 2) it makes XDCR lose most checkpoints, paying 10 seconds each for nothing; 3) even worse, it delays replication by at least 30 seconds per vb replicator, since the source side needs to restart the vb replicator, which in turn may increase the replication backlog at the source.

        From the logs on Pavel's clusters, with the new priority checkpoint (which is also used to improve rebalance with consistent views), I see XDCR is able to persist around 82% of all checkpoints issued, removing some of the delay and errors seen in the normal build (where most checkpoints failed). Apparently XDCR itself can benefit from the priority checkpoint without any concern. That is the reason we see a smaller XDCR backlog.

        The question to be answered now is how this may impact other components like rebalance and normal vb checkpoints if XDCR has to issue 32 priority checkpoints every 5 minutes. It is not good to benefit XDCR at the cost of others. That is the reason Chiyoung suggested using a longer checkpoint interval and smaller concurrency. Personally I am OK with the former, but I would prefer not to reduce the parallelism because that may impact XDCR performance significantly. However, the fundamental solution is to improve the drain rate; that won't be easy, and the storage team will probably work on it post-2.0.
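
        To make the cost concrete, a small sketch of the expected per-attempt penalty implied by the numbers above (the 10% success rate for the normal build is an assumption standing in for "most checkpoints failed", and the same simple penalty model is applied to both builds purely for comparison):

          def expected_penalty(success_rate, timeout=10, restart_delay=30):
              """Expected extra seconds per checkpoint attempt: a failed attempt wastes
              the full timeout plus the vb replicator restart delay described above."""
              return (1 - success_rate) * (timeout + restart_delay)

          print(expected_penalty(0.10))   # normal build, ~10% success: ~36s wasted per attempt
          print(expected_penalty(0.82))   # toy build, ~82% success:    ~7.2s wasted per attempt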

        farshid Farshid Ghods (Inactive) added a comment -

        The results from the performance tests are very promising. Before we run system tests, we need to find the best value for how often we want to persist checkpoints.

        Ketaki,

        What is your take on increasing the interval at which we persist checkpoints from 5 minutes to 30 or 60 minutes? The idea behind persisting checkpoints is that if replication is stopped or deleted by the user and then restarted, we don't have to restart everything.

        ketaki Ketaki Gangal added a comment -

        If bi-directional XDCR without the stop/start of replication does not show the huge growing queue, then yes, it would be preferable to adjust the XDCR-specific checkpoint intervals rather than affect other components.

        Can we have perf results with these parameter changes, to see if this solves the issue as well?
        XDCR_CHECKPOINT_INTERVAL: 300 * 6
        XDCR_CAPI_CHECKPOINT_TIMEOUT: 10 * 6

        Based on the previous discussions of these values, it is possible that this may or may not resolve the issue here.

        steve Steve Yen added a comment -

        Removing "observe" from the summary so I don't confuse this issue with "observe".

        steve Steve Yen made changes -
        Summary: "observe growing XDC queue and checkpoint commit failures in bi-directional XDCR with front-end workload" → "XDC queue grows and checkpoint commit failures in bi-directional XDCR with front-end workload"
        junyi Junyi Xie (Inactive) added a comment -

        Ketaki,

        Yes, we can test the parameters as you suggested to see how they work. Please note to use a normal build instead of the toy build.

        On the toy build, XDCR_CAPI_CHECKPOINT_TIMEOUT is no longer valid and has no impact, since we switched to the priority checkpoint.

        junyi Junyi Xie (Inactive) added a comment -

        I still see a bunch of checkpoint_commit_failure errors even after we increase XDCR_CAPI_CHECKPOINT_TIMEOUT to 60 seconds. This is because, as long as the drain rate is unable to catch up with the workload, we will eventually build up a big disk write queue, and thus XDCR will fail to persist any checkpoint even within 60 seconds.

        I think we need to 1) merge the priority checkpoint commit, and 2) increase the checkpoint interval if it is a concern for ep_engine.

        junyi Junyi Xie (Inactive) added a comment -

        The fix is to use a priority checkpoint instead of a regular checkpoint.

        Chiyoung and I created 3 commits to address the issue.

        On the ns_server side:

        http://review.couchbase.org/#/c/21730/
        http://review.couchbase.org/#/c/21799/

        On the ep_engine side, it is tracked under a different bug:

        http://review.couchbase.org/#/c/21857/

        junyi Junyi Xie (Inactive) made changes -
        Status: Open [ 1 ] → Resolved [ 5 ]
        Resolution: Fixed [ 1 ]
        ketaki Ketaki Gangal added a comment -

        Works as expected on build 1893.

        ketaki Ketaki Gangal made changes -
        Status: Resolved [ 5 ] → Closed [ 6 ]
        kzeller kzeller added a comment -

        Added to RN: The XDCR checkpoint interval has been increased from 5 minutes to 30 minutes. This helps increase the chance that a checkpoint will complete successfully rather than fail, and it also reduces the overhead of frequently checking whether a checkpoint completed.


          People

          • Assignee:
            junyi Junyi Xie (Inactive)
            Reporter:
            ketaki Ketaki Gangal
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Gerrit Reviews

              There are no open Gerrit changes