Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-48764

Avg. initial XDCR rate (items/sec), 1 -> 2 changes to 2 -> 2, dropped from 200K to 162K on build 7.1.0-1250

    XMLWordPrintable

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 7.1.0
    • 7.1.0
    • XDCR
    • Untriaged
    • 1
    • Yes

    Description

      Build Avg. initial XDCR rate Job
      7.1.0-1085 200,490 http://perf.jenkins.couchbase.com/job/titan/11564/
      7.1.0-1250 162,544 http://perf.jenkins.couchbase.com/job/titan/12048/

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          bo-chun.wang Bo-Chun Wang added a comment -

          I finish two runs. The drop is reproducible. Both runs have captured the profile.

          1. without "preReplicateVBMasterCheck=false"

          http://perf.jenkins.couchbase.com/job/titan/12212/

          initial XDCR rate: 162550

          2. with "preReplicateVBMasterCheck=false"

          http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/56/

          initial XDCR rate: 188454

           

          For 2): No, there is not a xdcr rebalance test without the use of replication.

           

          bo-chun.wang Bo-Chun Wang added a comment - I finish two runs. The drop is reproducible. Both runs have captured the profile. 1. without "preReplicateVBMasterCheck=false" http://perf.jenkins.couchbase.com/job/titan/12212/ initial XDCR rate: 162550 2. with "preReplicateVBMasterCheck=false" http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/56/ initial XDCR rate: 188454   For 2): No, there is not a xdcr rebalance test without the use of replication.  
          neil.huang Neil Huang added a comment -

          I took a look at the CPU and heap profile between the two runs:

          With P2P on:
          http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/56/console
          With P2P off:
          http://perf.jenkins.couchbase.com/job/titan/12212/console
          Both show pretty much the same CPU utilization as well as the same memory utilization, which means that XDCR replication isn’t showing signs of regression.

          Then I took a look at the item count decreasing from the runs.
          It seems that there are 200 million * 1KB == 0.2 TB of data, and it took about 16 minutes to replicate.

          I looked at the graphs of data_replicated and changes_left, and they aren’t that different.
          Then I realized that checkpoint interval is set to default by 10 minutes.

          The test takes about 20 minutes to run, and it restarts pipeline perhaps 3 times in each run:

          ~/Downloads/Perf/12212/cbcollect$ zipgrepCbCollectXdcr "PipelineMgr" *105.zip | grep "Try to"
          cbcollect_info_ns_1@172.23.96.105_20211006-212854/ns_server.goxdcr.log:2021-10-06T14:07:56.517-07:00 INFO GOXDCR.PipelineMgr: Try to start/restart Pipeline ea264189b301063b10e4b4dfc597f928/bucket-1/bucket-1.
          cbcollect_info_ns_1@172.23.96.105_20211006-212854/ns_server.goxdcr.log:2021-10-06T14:07:56.517-07:00 INFO GOXDCR.PipelineMgr: Try to stop pipeline ea264189b301063b10e4b4dfc597f928/bucket-1/bucket-1
          cbcollect_info_ns_1@172.23.96.105_20211006-212854/ns_server.goxdcr.log:2021-10-06T14:12:56.336-07:00 INFO GOXDCR.PipelineMgr: Try to fix Pipeline ea264189b301063b10e4b4dfc597f928/bucket-1/bucket-1. Current error(s)=r.update_err_ch : TopoChangeDet : Restarting pipeline due to source topology change...
          cbcollect_info_ns_1@172.23.96.105_20211006-212854/ns_server.goxdcr.log:2021-10-06T14:12:56.336-07:00 INFO GOXDCR.PipelineMgr: Try to stop pipeline ea264189b301063b10e4b4dfc597f928/bucket-1/bucket-1
          cbcollect_info_ns_1@172.23.96.105_20211006-212854/ns_server.goxdcr.log:2021-10-06T14:17:56.307-07:00 INFO GOXDCR.PipelineMgr: Try to fix Pipeline ea264189b301063b10e4b4dfc597f928/bucket-1/bucket-1. Current error(s)=r.update_err_ch : TopoChangeDet : Restarting pipeline due to source topology change...
          cbcollect_info_ns_1@172.23.96.105_20211006-212854/ns_server.goxdcr.log:2021-10-06T14:17:56.307-07:00 INFO GOXDCR.PipelineMgr: Try to stop pipeline ea264189b301063b10e4b4dfc597f928/bucket-1/bucket-1
          cbcollect_info_ns_1@172.23.96.105_20211006-212854/ns_server.goxdcr.log:2021-10-06T14:22:56.313-07:00 INFO GOXDCR.PipelineMgr: Try to fix Pipeline ea264189b301063b10e4b4dfc597f928/bucket-1/bucket-1. Current error(s)=r.update_err_ch : TopoChangeDet : Restarting pipeline due to source topology change...
          cbcollect_info_ns_1@172.23.96.105_20211006-212854/ns_server.goxdcr.log:2021-10-06T14:22:56.313-07:00 INFO GOXDCR.PipelineMgr: Try to stop pipeline ea264189b301063b10e4b4dfc597f928/bucket-1/bucket-1
          

          This means that XDCR is doing extra work of peer-to-peer communication without the benefit of the checkpoints.
          This is evident because the new node that rebalanced in (106) started with Seqno=0 even though 105 has done the work.

          Bo-Chun Wang, can you do another experiment with the following 2 runs parameters:
          1. Run with “preReplicateVBMasterCheck=false” and “checkpointInterval=180” (no peer to peer, and checkpoint every 3 minutes instead of 10 minutes)
          2. Run with “checkpointInterval=180” (peer-to-peer, and checkpoint every 3 minutes)

          As of now, the regression seems to be the extra work XDCR is doing with Peer-To-Peer checkpoint pull without actually instantiating checkpoints to be able to be used. When XDCR does P2P work, it's not replicating data and so it introduces delay and cause the throughput to go down. Having these 2 test parameters will allow us to have a baseline, and secondly to make sure that the time spent doing P2P communication instead of replicating isn’t wasted.

          neil.huang Neil Huang added a comment - I took a look at the CPU and heap profile between the two runs: With P2P on: http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/56/console With P2P off: http://perf.jenkins.couchbase.com/job/titan/12212/console Both show pretty much the same CPU utilization as well as the same memory utilization, which means that XDCR replication isn’t showing signs of regression. Then I took a look at the item count decreasing from the runs. It seems that there are 200 million * 1KB == 0.2 TB of data, and it took about 16 minutes to replicate. I looked at the graphs of data_replicated and changes_left, and they aren’t that different. Then I realized that checkpoint interval is set to default by 10 minutes. The test takes about 20 minutes to run, and it restarts pipeline perhaps 3 times in each run: ~/Downloads/Perf/12212/cbcollect$ zipgrepCbCollectXdcr "PipelineMgr" *105.zip | grep "Try to" cbcollect_info_ns_1@172.23.96.105_20211006-212854/ns_server.goxdcr.log:2021-10-06T14:07:56.517-07:00 INFO GOXDCR.PipelineMgr: Try to start/restart Pipeline ea264189b301063b10e4b4dfc597f928/bucket-1/bucket-1. cbcollect_info_ns_1@172.23.96.105_20211006-212854/ns_server.goxdcr.log:2021-10-06T14:07:56.517-07:00 INFO GOXDCR.PipelineMgr: Try to stop pipeline ea264189b301063b10e4b4dfc597f928/bucket-1/bucket-1 cbcollect_info_ns_1@172.23.96.105_20211006-212854/ns_server.goxdcr.log:2021-10-06T14:12:56.336-07:00 INFO GOXDCR.PipelineMgr: Try to fix Pipeline ea264189b301063b10e4b4dfc597f928/bucket-1/bucket-1. Current error(s)=r.update_err_ch : TopoChangeDet : Restarting pipeline due to source topology change... cbcollect_info_ns_1@172.23.96.105_20211006-212854/ns_server.goxdcr.log:2021-10-06T14:12:56.336-07:00 INFO GOXDCR.PipelineMgr: Try to stop pipeline ea264189b301063b10e4b4dfc597f928/bucket-1/bucket-1 cbcollect_info_ns_1@172.23.96.105_20211006-212854/ns_server.goxdcr.log:2021-10-06T14:17:56.307-07:00 INFO GOXDCR.PipelineMgr: Try to fix Pipeline ea264189b301063b10e4b4dfc597f928/bucket-1/bucket-1. Current error(s)=r.update_err_ch : TopoChangeDet : Restarting pipeline due to source topology change... cbcollect_info_ns_1@172.23.96.105_20211006-212854/ns_server.goxdcr.log:2021-10-06T14:17:56.307-07:00 INFO GOXDCR.PipelineMgr: Try to stop pipeline ea264189b301063b10e4b4dfc597f928/bucket-1/bucket-1 cbcollect_info_ns_1@172.23.96.105_20211006-212854/ns_server.goxdcr.log:2021-10-06T14:22:56.313-07:00 INFO GOXDCR.PipelineMgr: Try to fix Pipeline ea264189b301063b10e4b4dfc597f928/bucket-1/bucket-1. Current error(s)=r.update_err_ch : TopoChangeDet : Restarting pipeline due to source topology change... cbcollect_info_ns_1@172.23.96.105_20211006-212854/ns_server.goxdcr.log:2021-10-06T14:22:56.313-07:00 INFO GOXDCR.PipelineMgr: Try to stop pipeline ea264189b301063b10e4b4dfc597f928/bucket-1/bucket-1 This means that XDCR is doing extra work of peer-to-peer communication without the benefit of the checkpoints. This is evident because the new node that rebalanced in (106) started with Seqno=0 even though 105 has done the work. Bo-Chun Wang , can you do another experiment with the following 2 runs parameters: 1. Run with “preReplicateVBMasterCheck=false” and “checkpointInterval=180” (no peer to peer, and checkpoint every 3 minutes instead of 10 minutes) 2. Run with “checkpointInterval=180” (peer-to-peer, and checkpoint every 3 minutes) As of now, the regression seems to be the extra work XDCR is doing with Peer-To-Peer checkpoint pull without actually instantiating checkpoints to be able to be used. When XDCR does P2P work, it's not replicating data and so it introduces delay and cause the throughput to go down. Having these 2 test parameters will allow us to have a baseline, and secondly to make sure that the time spent doing P2P communication instead of replicating isn’t wasted.

          http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/57/ 

          preReplicateVBMasterCheck': False, 'checkpointInterval': 180

          Avg. initial XDCR rate (items/sec): 190198

           

          http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/58/

          'checkpointInterval': 180

          Avg. initial XDCR rate (items/sec): 161277

          bo-chun.wang Bo-Chun Wang added a comment - http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/57/   preReplicateVBMasterCheck': False, 'checkpointInterval': 180 Avg. initial XDCR rate (items/sec): 190198   http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/58/ 'checkpointInterval': 180 Avg. initial XDCR rate (items/sec): 161277
          wayne Wayne Siu added a comment -

          Bo-Chun Wang 

          Looks like the performance has recovered since late Oct. (build 7.1.0-1601+).  Can you confirm?  Thanks.

          wayne Wayne Siu added a comment - Bo-Chun Wang   Looks like the performance has recovered since late Oct. (build 7.1.0-1601+).  Can you confirm?  Thanks.
          bo-chun.wang Bo-Chun Wang added a comment -

          The performance is back. I close this ticket.
          http://showfast.sc.couchbase.com/#/timeline/Linux/xdcr/reb/all

          bo-chun.wang Bo-Chun Wang added a comment - The performance is back. I close this ticket. http://showfast.sc.couchbase.com/#/timeline/Linux/xdcr/reb/all

          People

            lilei.chen Lilei Chen
            bo-chun.wang Bo-Chun Wang
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty