Couchbase Server / MB-7657

[RN 2.0.1] XDCR "docs to replicate for bucket" queue is draining unevenly.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Cannot Reproduce
    • Affects Version/s: 2.0.1
    • Fix Version/s: 3.0
    • Component/s: couchbase-bucket, XDCR
    • Security Level: Public
    • Labels:
    • Environment:
      couchbase-2.0.1-144-rel on Centos 5.4 on both source and destination
    • Flagged:
      Release Note

      Description

      linux cluster
      22-node-ec2-10GB-Linux

      source Cluster ( 15 Nodes)
      create default bucket -3.5 GB per node
      sasl bucket - 2.0 GB per node

      destination Cluster ( 7 Nodes)
      create default bucket - 3.5 GB per node
      sasl bucket - 2.0 GB per node

      **Loading phase
      Source cluster:
      Default bucket
define a workload that loads 40M JSON items, key size 128-512 bytes, to the default bucket.
The workload should push the system into light DGM (active resident ratio at 90 percent).

      **Access Phase:
      Source cluster:
      Default bucket
create:5%, update:10%, get:80%, delete:5% with cache miss:5%, opsPerSec:30000, running for 8 hours

      Then start replication:
      set a bi-directional XDCR on default bucket
      Source cluster:
      default bucket
create:5%, update:10%, get:80%, delete:5% with cache miss:5%, opsPerSec:30000, running for 2 hours

      Destination cluster:
      default bucket
define a workload that loads 30M JSON items (non-conflicting key sets), key size 128-512 bytes, to the default bucket.

After all the loads on both clusters stop, wait for "XDCR docs to replicate for bucket" to drop. It does drop, but really slowly.
Saw the "XDCR docs to replicate for bucket" queue draining unevenly on the default bucket on the source.
The screenshot is from the source cluster.
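
For reference, a minimal sketch of how the per-node backlog behind this graph could be sampled from the REST API, so the unevenness can be compared across source nodes numerically rather than from screenshots. The per-node stats path and the stat key (replication_changes_left) are assumptions based on the 2.0.x REST API and UI and should be verified against the build in use:

#!/usr/bin/env python3
# Sketch: sample the per-node "docs to replicate" backlog so queue drain
# unevenness can be compared across source nodes. The endpoint paths and the
# stat key ("replication_changes_left") are assumptions; verify them against
# the build before trusting the numbers.
import base64
import json
import urllib.request

HOST = "http://ec2-23-20-140-198.compute-1.amazonaws.com:8091"  # a source node
BUCKET = "default"
AUTH = base64.b64encode(b"Administrator:password").decode()     # cluster credentials

def get_json(path):
    req = urllib.request.Request(HOST + path,
                                 headers={"Authorization": "Basic " + AUTH})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Enumerate the nodes serving the bucket, then read each node's minute-scale
# stat series and print the latest sample.
for node in get_json("/pools/default/buckets/%s/nodes" % BUCKET)["servers"]:
    hostname = node["hostname"]
    stats = get_json("/pools/default/buckets/%s/nodes/%s/stats?zoom=minute"
                     % (BUCKET, hostname))
    samples = stats["op"]["samples"].get("replication_changes_left", [])
    print("%-45s docs to replicate: %s"
          % (hostname, samples[-1] if samples else "n/a"))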


        Activity

Chisheng Hong (Inactive) added a comment -

Diags links: https://s3.amazonaws.com/bugdb/jira/MB-6799/ns-diag-20130131231808.txt.zip

Chisheng Hong (Inactive) added a comment -

Forget about the bug number in the diags link path. It doesn't matter. The above link gives you the right diags.

Chisheng Hong (Inactive) added a comment -

        The destination cluster diags (the previous link is for the source):
        https://s3.amazonaws.com/bugdb/jira/MB-6799/ns-diag-20130201000242.txt.zip

Frank Weigel (Inactive) added a comment -

How commonly does this happen, and how closely is it tied to the level of load on the source cluster (i.e. the 30k ops)?

Jin Lim (Inactive) added a comment -

Bumping this to "critical" for prompt engineering investigation. It may turn out to be a common case (i.e. users of XDCR can easily run into this) and thus negatively impact the overall level of load on the source cluster.

Jin Lim (Inactive) added a comment -

Sync up with tomorrow's system test and confirm whether this is a regression or an existing issue, please.

Jin Lim (Inactive) added a comment -

Junyi and Chisheng, please communicate to see if this is a real regression.

Ketaki Gangal added a comment -

        Will post results based on System test on build 153.

Ketaki Gangal added a comment -

Adding results from the 2.0.1-153-rel build (screenshot for docs-to-replicate).

Still seeing unevenness in replication from source nodes (11-node -> 5-node cluster, bidirectional replication).

Clusters are accessible here:
Source: http://ec2-23-20-140-198.compute-1.amazonaws.com:8091/index.html
Destination: http://ec2-23-20-108-25.compute-1.amazonaws.com:8091/

Jin Lim (Inactive) added a comment -

Thanks Ketaki. If you can, could you also provide some input on Frank's questions above? We really want to understand how likely (easily) users can run into this issue.

Ketaki Gangal added a comment -

Seeing this fairly commonly on large-scale runs for XDCR. With higher front-end ops/sec (30k+ ops/sec for an 11-node cluster), we see uneven distribution of the docs-to-replicate queue.

The tests reduce the front-end load in the successive phases of the runs, and all the XDCR data is synced as expected.

We typically see this unevenness on initial replication and in between phases.

Ketaki Gangal added a comment -

Logs from some source nodes: https://s3.amazonaws.com/bugdb/MB-7657/bug.tar

Junyi Xie (Inactive) added a comment (edited) -

        Ketaki,

        Thanks a lot for your test and logs.

I looked at the two clusters on EC2 just now (12:29 PM EST, Feb 13). Your bi-directional XDCR test has been running for a bit short of 20 hours, and in general it looks healthy to me.

From the per-node "docs to replicate" stats on the two clusters, at day scale, I do not see unevenness. Please look at the 1st and 2nd screenshots I uploaded.

I also looked at the log from the source that you uploaded. Around 2013-02-12T23:08 and 2013-02-12T23:21 there are several timeout errors which caused the source replicator to crash; in all of these errors the destination is ec2-23-20-217-189. That means that for a short time period, that node was not able to respond to the source nodes in XDCR in a timely manner. But the source nodes are able to recover under this circumstance. Other than that, I do not see any other errors in the log.

It looks like around 2013-02-12T23:08 and 2013-02-12T23:21 is the peak time where you have the highest ops/sec and the most docs to replicate. In particular, at that time you have around 30k ops/sec on the source and >30k ops/sec on the destination. On the source the update rate is >10k ops/sec (see the 3rd and 4th screenshots I uploaded). I cannot judge any small-scale unevenness from day-scale stats.

In short, from your clusters and logs, there is no sustained unevenness throughout the test.

The timeout errors may explain why you see unevenness over a short period of time; other factors like compaction and the environment may also contribute to that. But I am not sure it is serious.

Ketaki Gangal added a comment -

        Do we know why we see these timeouts under heavier load btw?

Junyi Xie (Inactive) added a comment -

Yes, basically the revs_diff and bulk_docs requests are unable to finish at the destination cluster within the connection timeout. I remember that back in December we increased the timeout to 180 seconds.
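
For reference, a minimal sketch of raising that timeout, assuming the 2.0.x internal XDCR setting name xdcrConnectionTimeout (in seconds) and the /internalSettings REST endpoint; both names should be verified against the build in use, and note that internal settings apply cluster-wide:

#!/usr/bin/env python3
# Sketch: raise the XDCR connection timeout via the cluster REST API.
# The endpoint (/internalSettings) and setting name (xdcrConnectionTimeout,
# in seconds) are assumptions based on the 2.0.x internal XDCR settings.
import base64
import urllib.parse
import urllib.request

HOST = "http://ec2-23-20-140-198.compute-1.amazonaws.com:8091"  # any source node
AUTH = base64.b64encode(b"Administrator:password").decode()

body = urllib.parse.urlencode({"xdcrConnectionTimeout": 240}).encode()  # default 180
req = urllib.request.Request(
    HOST + "/internalSettings", data=body,
    headers={"Authorization": "Basic " + AUTH,
             "Content-Type": "application/x-www-form-urlencoded"})
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())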

Jin Lim (Inactive) added a comment -

        Based on discussions so far, we find that:

• users can run into this issue pretty easily in a large-scale environment
• restarting replicators after the described timeout crash is fairly cheap as far as Erlang process management is concerned
• eventually, as load pressure on the destination cluster eases, the unevenness is smoothed out by faster replication turnaround
• the obvious negative impact is that restarted replicators start from the last checkpoint that previously completed (thus replication falls behind by one checkpoint interval, approx. 5 minutes)

        Action plan:

• will move this to 2.0.2
• XDCR team, please continue investigating it. We cannot afford to just blame the environment or other plausible factors unless they are proven with solid evidence.
• QE (Ketaki) is currently running the same tests with 1/4 of the replicators, validating whether this reduces the chance of running into the symptom
• if a lower number of replicators helps, we will update the documentation with a clear description of the symptom and the workaround of reducing the number of replicators.

        Thanks,
        Jin

Jin Lim (Inactive) added a comment -

Hi Ketaki, please review the above comment and update this bug with your findings after running with a lower number of replicators. Afterward, please assign it back to Junyi or Jin for the next action.

Junyi Xie (Inactive) added a comment -

Jin,

Can you please explain a bit what you mean by "users can run into this issue pretty easily with large scale environment"? From Ketaki's test, it looks to me like that is not the case (please see the screenshots from her clusters; where is the unevenness?).

Also, if it is due to the timeouts (and the timeouts were very limited in Ketaki's test), then it has nothing to do with the size of the cluster.

Jin Lim (Inactive) added a comment -

Hi Junyi, the summary was based on previous conversations among us. Do you suggest that there is no unevenness issue based on the screenshots? I am not sure that unevenness (a spike or drop) can be easily identified from normalized day-scale graphs. Playing with the number of replicators is a workaround to try, to see if it reduces the number of timeouts and thus alleviates the unevenness.

Junyi Xie (Inactive) added a comment -

        Hi Jin,

Probably there is some miscommunication. The spike/drop in Ketaki's test is NOT unevenness. Here, unevenness means that different nodes drain the queue at very different speeds, as demonstrated by Chisheng's original test (look at Chisheng's 1st screenshot, please: some nodes have 0 mutations to replicate, some have millions). The spike/drop in Ketaki's test comes from the significant change in front-end workload, which is expected. Also, the spike/drop in Ketaki's test is in "mutations to replicate", NOT the replication speed, so it is not the case that the bucket is draining unevenly.

        From the screenshot I uploaded, overall each node was working at the same pace and I do not see significant unevenness. Therefore I am not sure if we are able to easily reproduce what Chisheng observed.

        That is why I am not sure about your notes that "users can run into this issue pretty easily with large scale environment".

        Thanks.

Jin Lim (Inactive) added a comment -

Hi Junyi, thanks for your detailed explanation. I will sync up with QE and get their latest update and input on your comment. We should then be able to move this to 2.0.2. Will discuss more during the XDCR-KV test meeting this afternoon. The action items below are still valid though:

        Action plan:

• will move this to 2.0.2
• XDCR team, please continue investigating it (unless there are 2.0.1 hot issues)
• QE (Ketaki) is currently running the same tests with 1/4 the number of replicators
• if a lower number of replicators helps, we will update the documentation with a clear description of the symptom and the workaround of reducing the number of replicators.

Junyi Xie (Inactive) added a comment -

As discussed in the meeting, if a customer hits this issue, then in addition to the regular diag logs we need them to provide:

1. stat screenshots, at minute scale or finer, for all nodes, from which we can tell where the unevenness happens
2. a description of the workload: 1) how long the test has been running, 2) the front-end ops/sec on both source and destination clusters, how many of the operations are updates of existing items, how many are new items (inserts), etc.

Reducing the number of replicators and increasing the timeout may possibly help.
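
A minimal sketch of the replicator-count workaround mentioned above, assuming the 2.0.x internal setting xdcrMaxConcurrentReps (default 32, so 1/4 would be 8) exposed through /internalSettings; verify the setting name against the build before applying it:

#!/usr/bin/env python3
# Sketch: reduce the number of concurrent XDCR replicators per bucket to 1/4
# of the default. The setting name (xdcrMaxConcurrentReps, default 32) and the
# /internalSettings endpoint are assumptions based on 2.0.x; verify first.
import base64
import urllib.parse
import urllib.request

HOST = "http://ec2-23-20-140-198.compute-1.amazonaws.com:8091"  # any source node
AUTH = base64.b64encode(b"Administrator:password").decode()

body = urllib.parse.urlencode({"xdcrMaxConcurrentReps": 8}).encode()
req = urllib.request.Request(
    HOST + "/internalSettings", data=body,
    headers={"Authorization": "Basic " + AUTH,
             "Content-Type": "application/x-www-form-urlencoded"})
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())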

Jin Lim (Inactive) added a comment -

Discussed with QE, and it appears that users can still easily run into the symptom when running in a large-scale environment. Junyi, do you concur with this assessment?

Jin Lim (Inactive) added a comment -

Documentation team, please review the comments starting from Jin's 2/13 comment for information regarding this bug. Thanks.

Junyi Xie (Inactive) added a comment -

I can tolerate some unevenness due to the environment and a skewed workload, but if the unevenness is due to a destination node crashing or going down (as we have seen in another bug, where 1-2 of 4 nodes at the destination were not reachable, causing some source nodes to get stuck and unable to replicate), that is a different story.

Junyi Xie (Inactive) added a comment -

        I looked at Ketaki's 10:5 bi-directional xdcr test at

        cluster 1
        http://ec2-23-20-140-198.compute-1.amazonaws.com:8091/index.html#sec=buckets
        cluster 2
        http://ec2-23-20-108-25.compute-1.amazonaws.com:8091/index.html#sec=buckets

        so far they look healthy and no unevenness is found.

From a bug-fix perspective, I cannot exclude the possibility of unevenness at this time, therefore I will add some defensive code which I hope can prevent unevenness across nodes from happening, or recover from it in case it happens.

Jin,

You can determine whether you want to merge it into 2.0.1 or 2.0.2. Thanks.

Junyi Xie (Inactive) added a comment -

Fix on Gerrit: http://review.couchbase.org/#/c/24645/

Junyi Xie (Inactive) added a comment (edited) -

I managed to reproduce the uneven "mutations to replicate" stats across nodes.

It looks to me like the root cause (at least in my test case) is that the disk queues at different nodes drain at VERY different rates. My test case is a simple 2->2 uni-directional setup with a tiny front-end insert workload (on average 250 inserts/sec/node, no get ops at all). Over time, the disk write queue sizes at the two source nodes become quite skewed: one is almost empty, while the other holds 150k-200k items.

Consequently, the mutations to replicate at these two nodes are also quite different: the node with the empty write queue has 150K more items to replicate than the other, because the ep_engine on that node has dumped all items to disk.

The other node, with a 150K-item backlog in its disk write queue, has almost 0 "mutations to replicate" because on that node XDCR is faster than the write-queue drain rate.

This is consistent with Chisheng's original test. Please see the screenshots of the disk write queue that I grabbed at different times during the test.

        ep_engine team, please advise. Thanks.

Junyi Xie (Inactive) added a comment -

Assigning to the ep_engine team for comment.

Mike Wiederhold added a comment -

If the root cause is what Junyi mentioned above, then this seems like expected behavior: if one node in the cluster is slow to drain items to disk, we will see this unevenness. What likely caused the disk write queue to be high on one node is compaction. In our team's testing we have found that disk persistence gets very slow while compaction is happening, and if compaction only happens on one of the nodes I would expect this situation. I can take a look at the logs later if anyone wants me to look into this more deeply; just let me know.

        Also note that Aaron is doing some work on the compactor logic in couchstore to try to speed up persistence when compaction is taking place.

Jin Lim (Inactive) added a comment (edited) -

Thanks for the updates. As we previously discussed, we will be moving this to 2.0.2 for further investigation and retesting with the mentioned optimizations. There is a toy build currently waiting for testing that should help alleviate the unevenness from the XDCR side, and QE is also planning to test with a workaround (a lower number of replicators).

        Ketaki, please update any latest test result from either the workaround or the toy build (http://review.couchbase.org/#/c/24645/). Thanks much!

Ketaki Gangal added a comment (edited) -

Tested this on the 2.0.1-160-rel build and discussed it with Junyi:

• Seeing some nodes (1 node in particular) lagging behind at the start of XDCR replication.
  The drain rate and disk write queue on this node are very uneven; screenshot attached.
• Uneven CPU across the cluster: some nodes show an average of 80 percent, some an average of 50 percent, and some less than 20 percent on average over a period of time. Screenshot attached.

We've compared the compaction timing with the drain/disk timings and they are not in exact sync.

There is some resource contention (I/O or otherwise) on this node, causing its disk/drain behavior to vary and producing this intermittent behaviour.

On another note: bidirectional replication on this cluster causes the clusters to become unresponsive, with several timeout errors on the stats display.

Adding logs from nodes on the source (11-node cluster).

Ketaki Gangal added a comment -

Source: http://ec2-107-22-40-124.compute-1.amazonaws.com:8091/
Destination: http://ec2-54-242-239-237.compute-1.amazonaws.com:8091/

Junyi Xie (Inactive) added a comment -

        Thanks Ketaki for the notes.

One more thing: the node with 20% CPU usage is exactly the one lagging behind in XDCR, and it is also the one which demonstrated quite different flusher behavior (disk write queue, drain rate, etc.) from the other nodes.

I am not claiming the flusher is the culprit; on the contrary, I think both the flusher and XDCR are victims of some I/O-heavy job (possibly compaction, needs verification): the flusher was unable to flush quickly, and XDCR was unable to read and replicate docs at a rate similar to the other nodes.

Given the low CPU usage on that node, it looks to me like that node is pretty I/O bound.

Ketaki Gangal added a comment -

https://s3.amazonaws.com/bugdb/bug0/bug.tar

Jin Lim (Inactive) added a comment -

Thanks for the many updates. Per the bug scrub this morning, we want to try the same test with compaction disabled and see if it alleviates the issue. Also, it may be worth a try to just increase the compaction threshold from the default 30% to something higher like 75-80%. I will leave it up to QE to decide how to test it with compaction.
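
For reference, a minimal sketch of raising the auto-compaction trigger cluster-wide as suggested above, assuming the /controller/setAutoCompaction endpoint and the databaseFragmentationThreshold[percentage] parameter from the 2.0.x REST API; verify the parameter names (and consider per-bucket settings) before use:

#!/usr/bin/env python3
# Sketch: raise the cluster-wide database fragmentation threshold that triggers
# auto-compaction from the default 30% to 75%. Endpoint and parameter names are
# assumptions based on the 2.0.x REST API; verify them before running.
import base64
import urllib.parse
import urllib.request

HOST = "http://ec2-23-20-140-198.compute-1.amazonaws.com:8091"
AUTH = base64.b64encode(b"Administrator:password").decode()

params = {
    "databaseFragmentationThreshold[percentage]": 75,  # default is 30
    "parallelDBAndViewCompaction": "false",
}
req = urllib.request.Request(
    HOST + "/controller/setAutoCompaction",
    data=urllib.parse.urlencode(params).encode(),
    headers={"Authorization": "Basic " + AUTH,
             "Content-Type": "application/x-www-form-urlencoded"})
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())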

Junyi Xie (Inactive) added a comment (edited) -

        Thanks Jin.

We can repeat the test without compaction, but the question yet to be answered is why only 1 node demonstrated such different behavior from the other nodes
(2 other nodes also behaved a bit differently, but not that badly). IMHO compaction should behave more or less similarly on all nodes in the cluster, since we have a randomly distributed workload.

I am not sure whether it is a bug or not, but since it is fairly easy to reproduce, we should keep an eye on it.

Other than the test with compaction turned on/off, I am also interested in a simple KV test: turn off all XDCR/view/query/rebalance activity completely, just put some pressure on one 10-node cluster with a non-trivial front-end workload for several hours, and observe how the flusher behaves across all nodes. If QE has such experience, please share.

Farshid Ghods (Inactive) added a comment (edited) -

        Junyi,

Thanks for the further investigation. QE should be able to extract such data from the k/v test case and provide you access to such a cluster.

        Ketaki,

can we extract the information that Junyi has asked for from the k/v cluster?

Junyi Xie (Inactive) added a comment (edited) -

        Farshid,

Thanks. Please share the data or point me to the live cluster (if possible) in the KV system test.

It will be interesting to see how our CB server behaves and adapts when we put different pressure on the source at different times (for example, we can apply small, moderate, heavy, spike, and burst workloads at different stages of the KV test).

Ketaki Gangal added a comment -

Sure, adding Tommie to this discussion.

@Tommie - Do we currently have information on overall-cluster vs. individual-node behaviour for the KV use case?

Ketaki Gangal added a comment -

        Hi Junyi,

For this setup, would the drain rate on the source node affect the XDCR replication?

• All the items are persisted to disk in the load phase.
• In the access phase, 5 percent of the items are sets; the remaining 90-plus percent are reads.

So even if this particular node is not draining evenly... how would that impact XDCR's ability to read from disk and replicate forward?

        -Ketaki

Junyi Xie (Inactive) added a comment -

As I mentioned, the uneven drain rate may not itself be what makes it harder for XDCR to read from disk; instead, both the flusher and XDCR can be victims of some other activity (probably the compactor, but I am not 100% sure) that was competing for I/O with them, making it harder for the flusher to flush and harder for XDCR to read docs.

That is the reason we saw correlated abnormal activity from the flusher and XDCR.

Jin Lim (Inactive) added a comment -
• Still waiting for the test results from Tommie for the KV-only system.
• Disabling compaction didn't help much on Ketaki's rerun.

Tommie McAfee added a comment -

As far as draining goes in the KV-only use case, it looks like all nodes are draining at the same rate (screenshot 201_kvonly_drain).

Jin Lim (Inactive) added a comment -

Tommie's result is in: unevenness is NOT seen in his KV-only test. As discussed in the bug scrub, the engineering team will continue to investigate this in 2.0.2, as it appears to be intermittent behavior without major timeouts or performance degradation.

This bug has taken far too much QE time and too many cycles to prove 1) that the symptom is reproducible, 2) that none of the suggested workarounds (compaction settings, number of replicators, etc.) worked, and 3) to gather every single data point for engineering debugging. Many thanks for the hard work on this, QE team!

Junyi Xie (Inactive) added a comment (edited) -

Please look at the screenshot Tommie posted more carefully. The scale is 2-3M, not 10k, so even a very small bump will cause a 50-100K difference in the disk queue. Node 10.3.3.215 obviously shows quite BIG unevenness during the first 60-70% of the test, before increasing. If you convert the figure to a logarithmic scale, it is much easier to see. The disk write queue unevenness in our XDCR test is only 10K, but over time it can accumulate quite a large backlog. Also, please note that Tommie's test is without XDCR, so there is less I/O competition between the compactor and ep_engine, and we still see a several-hundred-K difference in the disk write queue.

kzeller added a comment -

        Added to 2.0.1 RN:

        You may experience an even rate of replication for nodes in a cluster. This is
        intermittent behavior that is due to different disk drain rates at different
        nodes and will subside by itself over time. There is no need to change any
        settings on your cluster to try to resolve this behavior.

Junyi Xie (Inactive) added a comment -

        Hi Karen,

        Is it "even rate" or "uneven rate"? I think users may experience uneven replication rate.

kzeller added a comment -

You're right, it should be "uneven".

Maria McDuff (Inactive) added a comment -

Will re-visit once multiple reader/writer is in. This is a 2.0.2 must-fix.

Maria McDuff (Inactive) added a comment -

Note to QE (Maria): once multiple reader/writer is in, QE to run the test.

Mike Wiederhold added a comment -

        Moving to 2.1 since we need to retest with the multi-reader/writer to see if that solves the problem. We can move this back to 2.0.2 if the ep-engine team has time to look at it.

Jin Lim (Inactive) added a comment -

QE will retest this once the large-scale tests start for 2.0.2, and we will determine whether the improvements made in 2.0.2 (MRW, I/O batching, etc.) address the issue. Thanks.


          People

• Assignee:
  Ketaki Gangal
• Reporter:
  Chisheng Hong (Inactive)
• Votes:
  0
• Watchers:
  10


              Gerrit Reviews

              There are no open Gerrit changes