Couchbase Server / MB-48764

Avg. initial XDCR rate (items/sec), 1 -> 2 changes to 2 -> 2, dropped from 200K to 162K on build 7.1.0-1250


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 7.1.0
    • Fix Version: 7.1.0
    • Component: XDCR
    • Triage: Untriaged
    • 1
    • Yes

    Description

      Build        Avg. initial XDCR rate (items/sec)   Job
      7.1.0-1085   200,490                              http://perf.jenkins.couchbase.com/job/titan/11564/
      7.1.0-1250   162,544                              http://perf.jenkins.couchbase.com/job/titan/12048/

      Attachments


        Activity

          bo-chun.wang Bo-Chun Wang created issue -
          neil.huang Neil Huang made changes -
          Field Original Value New Value
          Assignee John Liang [ jliang ] Neil Huang [ neil.huang ]
          neil.huang Neil Huang added a comment -

          Bo-Chun Wang - can you please remind me: does the clock start for the measurement as soon as a replication is created?
          Bo-Chun Wang, is it possible to run the test with the replication setting "preReplicateVBMasterCheck=false" and see what the number is?

          bo-chun.wang Bo-Chun Wang added a comment -

          Neil Huang

          The clock starts when the remote cluster is added.

          neil.huang Neil Huang added a comment -

          Thanks, Bo-Chun Wang.
          There have been recent changes such that XDCR needs to do some more bookkeeping work once a replication is created, before the actual data flow starts. If the clock starts as soon as the RemClusterRef is created, the measurement may not be accurate.

          That is why I'm asking if you can re-run with the setting I mentioned turned off.


          bo-chun.wang Bo-Chun Wang added a comment -

          Neil Huang

          I did a run, but it failed: the replication couldn't finish within 8 hours, so the run was aborted. I will re-run it and collect logs.

          http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/53/

          2021-10-05T19:17:05 [INFO] replication_changes_left = 21,289,070

          ...

          2021-10-06T02:35:15 [INFO] replication_changes_left = 21,289,070

          Build timed out (after 480 minutes). Marking the build as aborted.

          bo-chun.wang Bo-Chun Wang added a comment -

          The second run didn't hit the issue. Its initial XDCR rate is higher (188,929), but still below 200K.

          http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/54/
          neil.huang Neil Huang added a comment - - edited

          IIRC, this test depends on the speed at which ns_server rebalances a node in. In other words, XDCR replication cannot complete until all VBs have finished rebalancing. So if an ns_server rebalance operation has a regression, this test will also show a regression.
          I'm not saying it is an ns_server regression, but the test has a lot of variables, I believe.

          If the test is run with the setting turned off, which puts XDCR back to legacy behavior, the performance drop is (200-189)/200 = 5.5%, which is at the borderline of what we classify as a real regression.

          A few asks:

          1) Can you do one more test run with the setting off and one more with it on, to make sure the drop is reproducible?
          1a) Can you take golang profiles during these runs so I can look inside and see if something looks off?
          2) Is there a test that measures the time it takes for a rebalance (1->2) to complete without the use of replication? Sharath mentioned this test was missing, but he left before he had a chance to add it. We need a baseline test (without XDCR) that measures rebalance time; then this (1->2) test would have a better baseline to compare against.
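The rounded 5.5% figure can be checked against the exact run numbers reported earlier (200,490 baseline, 188,929 with the setting off, 162,544 with it on). A quick sketch of the arithmetic, using only the rates from this ticket:

```python
# Percentage drop relative to the 7.1.0-1085 baseline (200,490 items/sec).
baseline = 200_490
with_setting_off = 188_929   # preReplicateVBMasterCheck=false (legacy behavior)
with_setting_on = 162_544    # default behavior on build 7.1.0-1250

def drop_pct(new, old=baseline):
    """Drop relative to the healthy baseline, as a percentage."""
    return (old - new) / old * 100

print(f"setting off: {drop_pct(with_setting_off):.1f}% drop")  # ~5.8% (Neil's rounded 5.5%)
print(f"setting on:  {drop_pct(with_setting_on):.1f}% drop")   # ~18.9%
```

So even in legacy mode the drop is borderline, while the default (P2P) path accounts for most of the regression.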

          bo-chun.wang Bo-Chun Wang added a comment -

          I finished two runs. The drop is reproducible. Both runs captured profiles.

          1. without "preReplicateVBMasterCheck=false"

          http://perf.jenkins.couchbase.com/job/titan/12212/

          initial XDCR rate: 162550

          2. with "preReplicateVBMasterCheck=false"

          http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/56/

          initial XDCR rate: 188454


          For 2): No, there is no XDCR rebalance test without the use of replication.


          neil.huang Neil Huang added a comment -

          I took a look at the CPU and heap profile between the two runs:

          With P2P on:
          http://perf.jenkins.couchbase.com/job/titan/12212/console
          With P2P off:
          http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/56/console
          Both show pretty much the same CPU utilization as well as the same memory utilization, which means that XDCR replication isn't showing signs of regression.

          Then I took a look at the item count decreasing from the runs.
          It seems that there are 200 million * 1KB == 0.2 TB of data, and it took about 16 minutes to replicate.

          I looked at the graphs of data_replicated and changes_left, and they aren’t that different.
          Then I realized that the checkpoint interval is set to the default of 10 minutes.

          The test takes about 20 minutes to run, and it restarts the pipeline about 3 times in each run:

          ~/Downloads/Perf/12212/cbcollect$ zipgrepCbCollectXdcr "PipelineMgr" *105.zip | grep "Try to"
          cbcollect_info_ns_1@172.23.96.105_20211006-212854/ns_server.goxdcr.log:2021-10-06T14:07:56.517-07:00 INFO GOXDCR.PipelineMgr: Try to start/restart Pipeline ea264189b301063b10e4b4dfc597f928/bucket-1/bucket-1.
          cbcollect_info_ns_1@172.23.96.105_20211006-212854/ns_server.goxdcr.log:2021-10-06T14:07:56.517-07:00 INFO GOXDCR.PipelineMgr: Try to stop pipeline ea264189b301063b10e4b4dfc597f928/bucket-1/bucket-1
          cbcollect_info_ns_1@172.23.96.105_20211006-212854/ns_server.goxdcr.log:2021-10-06T14:12:56.336-07:00 INFO GOXDCR.PipelineMgr: Try to fix Pipeline ea264189b301063b10e4b4dfc597f928/bucket-1/bucket-1. Current error(s)=r.update_err_ch : TopoChangeDet : Restarting pipeline due to source topology change...
          cbcollect_info_ns_1@172.23.96.105_20211006-212854/ns_server.goxdcr.log:2021-10-06T14:12:56.336-07:00 INFO GOXDCR.PipelineMgr: Try to stop pipeline ea264189b301063b10e4b4dfc597f928/bucket-1/bucket-1
          cbcollect_info_ns_1@172.23.96.105_20211006-212854/ns_server.goxdcr.log:2021-10-06T14:17:56.307-07:00 INFO GOXDCR.PipelineMgr: Try to fix Pipeline ea264189b301063b10e4b4dfc597f928/bucket-1/bucket-1. Current error(s)=r.update_err_ch : TopoChangeDet : Restarting pipeline due to source topology change...
          cbcollect_info_ns_1@172.23.96.105_20211006-212854/ns_server.goxdcr.log:2021-10-06T14:17:56.307-07:00 INFO GOXDCR.PipelineMgr: Try to stop pipeline ea264189b301063b10e4b4dfc597f928/bucket-1/bucket-1
          cbcollect_info_ns_1@172.23.96.105_20211006-212854/ns_server.goxdcr.log:2021-10-06T14:22:56.313-07:00 INFO GOXDCR.PipelineMgr: Try to fix Pipeline ea264189b301063b10e4b4dfc597f928/bucket-1/bucket-1. Current error(s)=r.update_err_ch : TopoChangeDet : Restarting pipeline due to source topology change...
          cbcollect_info_ns_1@172.23.96.105_20211006-212854/ns_server.goxdcr.log:2021-10-06T14:22:56.313-07:00 INFO GOXDCR.PipelineMgr: Try to stop pipeline ea264189b301063b10e4b4dfc597f928/bucket-1/bucket-1
          

          This means that XDCR is doing the extra work of peer-to-peer communication without getting the benefit of the checkpoints.
          This is evident because the new node that was rebalanced in (.106) started with Seqno=0 even though .105 had done the work.

          Bo-Chun Wang, can you do another experiment with the following 2 run parameters:
          1. Run with "preReplicateVBMasterCheck=false" and "checkpointInterval=180" (no peer-to-peer, and checkpoint every 3 minutes instead of 10 minutes)
          2. Run with "checkpointInterval=180" (peer-to-peer, and checkpoint every 3 minutes)

          As of now, the regression seems to be the extra work XDCR is doing with the peer-to-peer checkpoint pull without actually instantiating checkpoints that can be used. When XDCR does P2P work, it's not replicating data, so it introduces delay and causes throughput to go down. These 2 test parameters will give us a baseline, and also let us make sure that the time spent doing P2P communication instead of replicating isn't wasted.
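As a sanity check of the sizing figures in the comment above (assuming the stated ~200 million items of ~1 KB each and a ~16-minute replication window):

```python
# Sanity-check the replication arithmetic: data volume and implied item rate.
items = 200_000_000     # ~200 million items (from the comment above)
item_size_kb = 1        # ~1 KB per item (assumed, per the comment)
minutes = 16            # observed replication time

total_tb = items * item_size_kb / 1e9   # KB -> TB (decimal units)
rate = items / (minutes * 60)           # implied items/sec

print(f"data replicated: ~{total_tb:.1f} TB")        # ~0.2 TB
print(f"implied rate: ~{rate:,.0f} items/sec")       # ~208,333 items/sec
```

The implied ~208K items/sec is in the same ballpark as the ~200K healthy baseline in the Details table, so the 16-minute estimate is self-consistent.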


          bo-chun.wang Bo-Chun Wang added a comment -

          http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/57/

          'preReplicateVBMasterCheck': False, 'checkpointInterval': 180

          Avg. initial XDCR rate (items/sec): 190198

           

          http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/58/

          'checkpointInterval': 180

          Avg. initial XDCR rate (items/sec): 161277
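The shorter checkpoint interval did not close the gap between the two configurations. Comparing the two runs above (just arithmetic on the reported rates):

```python
# Gap between the two checkpointInterval=180 runs reported above.
p2p_off = 190_198   # preReplicateVBMasterCheck=false, checkpointInterval=180
p2p_on = 161_277    # checkpointInterval=180 only (P2P enabled)

gap_pct = (p2p_off - p2p_on) / p2p_off * 100
print(f"P2P still costs ~{gap_pct:.1f}% even with 3-minute checkpoints")  # ~15.2%
```

This supports the diagnosis that the regression is the P2P checkpoint-pull work itself, not the checkpoint frequency.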

          jliang John Liang made changes -
          Assignee Neil Huang [ neil.huang ] Lilei Chen [ lilei.chen ]
          wayne Wayne Siu added a comment -

          Bo-Chun Wang 

          Looks like the performance has recovered since late Oct. (build 7.1.0-1601+).  Can you confirm?  Thanks.

          bo-chun.wang Bo-Chun Wang made changes -
          bo-chun.wang Bo-Chun Wang added a comment -

          The performance is back. I'm closing this ticket.
          http://showfast.sc.couchbase.com/#/timeline/Linux/xdcr/reb/all

          bo-chun.wang Bo-Chun Wang made changes -
          Resolution Fixed [ 1 ]
          Status Open [ 1 ] Resolved [ 5 ]
          bo-chun.wang Bo-Chun Wang made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

          People

            Assignee: lilei.chen Lilei Chen
            Reporter: bo-chun.wang Bo-Chun Wang
            Votes: 0
            Watchers: 3


              Gerrit Reviews

                There are no open Gerrit changes
