Couchbase Server / MB-50016

Avg. initial XDCR rate dropped from 500K to 350K on build 7.1.0-1787


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 7.1.0
    • Fix Version: 7.1.0
    • Component: XDCR
    • Triage: Untriaged
    • 1
    • Yes

    Description

      Avg. initial XDCR rate (items/sec), 5 -> 5 (2 source nozzles, 4 target nozzles), 1 bucket x 250M x 1KB

       


          Activity

            neil.huang Neil Huang added a comment -

            Here are the differences between the two builds (goxdcr changes only): http://changelog.build.couchbase.com/?product=couchbase-server&fromVersion=7.1.0&fromBuild=1745&toVersion=7.1.0&toBuild=1787&f_analytics-dcp-client=off&f_asterixdb=off&f_backup=off&f_cbas=off&f_cbas-core=off&f_cbas-ui=off&f_cbauth=off&f_cbbs=off&f_cbft=off&f_cbgt=off&f_chronicle=off&f_eventing=off&f_eventing-ee=off&f_gocb=off&f_goutils=off&f_goxdcr=on&f_indexing=off&f_kv_engine=off&f_logstats=off&f_magma=off&f_n1fty=off&f_ns_server=off&f_phosphor=off&f_plasma=off&f_platform=off&f_product-metadata=off&f_query=off&f_query-ee=off&f_query-ui=off&f_retriever=off&f_sigar=off&f_testrunner=off&f_tlm=off&f_vbmap=off

            I'm thinking that https://github.com/couchbase/goxdcr/commit/a05036d264b863275f59af43206fefa2599b45df causes the test to start a bit later. Bo-Chun Wang, can you retry the test with preReplicateVBMasterCheck set to false?
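            For reference, goxdcr internal settings like this are usually toggled through the XDCR admin REST interface rather than the UI. A minimal sketch, assuming the default goxdcr admin port (9998) and that the setting is exposed under /xdcr/internalSettings with this exact name and casing; verify both against the GET output on the build under test before relying on it:

            ```shell
            # Inspect current internal settings (assumption: goxdcr admin REST on port 9998).
            curl -s -u Administrator:password http://localhost:9998/xdcr/internalSettings

            # Disable the pre-replicate VB master check for the test run.
            # Setting name and casing are assumptions; confirm against the GET output above.
            curl -s -u Administrator:password -X POST \
                 http://localhost:9998/xdcr/internalSettings \
                 -d 'PreReplicateVBMasterCheck=false'
            ```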
            neil.huang Neil Huang added a comment -

            build 1745

            2021-11-25T01:19:46.214-08:00 WARN GOXDCR.XDCRFactory: Error with Peer-To-Peer checkpoint pull, starting with local ckpts only: 172.23.96.107:8091 : Execution timed out - did not hear back from node after 51s. Could be due to peer node being busy to respond in time or this XDCR being too busy to handle incoming requests
            172.23.96.108:8091 : Execution timed out - did not hear back from node after 51s. Could be due to peer node being busy to respond in time or this XDCR being too busy to handle incoming requests
            2021-11-25T01:19:46.214-08:00 WARN GOXDCR.GenericPipeline: P2P PreReplicate for da2de46b2669e6a426bf8ecdb811fe03/bucket-1/bucket-1 Ckpt Pull and merge had errors but will continue to replicate: map[genericPipeline.vbMasterCheckFunc:172.23.96.107:8091 : Execution timed out - did not hear back from node after 51s. Could be due to peer node being busy to respond in time or this XDCR being too busy to handle incoming requests
            172.23.96.108:8091 : Execution timed out - did not hear back from node after 51s. Could be due to peer node being busy to respond in time or this XDCR being too busy to handle incoming requests]
            

            It looks like the pipeline waited 51 seconds for P2P before starting. In later builds the wait can be much longer if a peer node is unresponsive, which would explain the slowdown.
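            To make the effect concrete: a fixed P2P startup wait is amortized over the whole run, so the reported average rate drops even though steady-state throughput is unchanged. A rough sketch in Python (the 250M-item workload is from the test description; the 500K/sec steady-state rate and the wait durations beyond the observed 51s are illustrative assumptions):

            ```python
            # Average rate as the harness reports it = total items / (startup wait + transfer time).
            ITEMS = 250_000_000            # 1 bucket x 250M x 1KB, from the test description
            STEADY_RATE = 500_000          # assumed items/sec once replication is running
            TRANSFER_SECS = ITEMS / STEADY_RATE  # 500 seconds of actual replication

            def avg_rate(startup_wait_secs: float) -> float:
                """Average initial XDCR rate, including the P2P startup wait."""
                return ITEMS / (startup_wait_secs + TRANSFER_SECS)

            print(round(avg_rate(0)))      # no P2P wait → 500000
            print(round(avg_rate(51)))     # build 1745's observed 51s timeout → 453721
            print(round(avg_rate(214)))    # a longer, illustrative wait → 350140
            ```

            Even a one-minute wait knocks roughly 10% off the reported average; a few minutes is enough to explain a drop from 500K to 350K.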

            The test (all multi-node tests, actually) should be updated so P2P is not run, to keep the metrics the same and to properly detect regressions.

            bo-chun.wang Bo-Chun Wang added a comment -

            I finished a run with preReplicateVBMasterCheck set to false, and the performance regression is gone.

            http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/61/

            Neil Huang 

            I have two questions.

            1. Should we set preReplicateVBMasterCheck to false in all XDCR runs so we can keep the metrics the same?
            2. How do we track P2P performance?
            neil.huang Neil Huang added a comment -

            Bo-Chun Wang Before answering those questions, I would like to diagnose a bit further. Ideally, P2P should not impact this test at all because the systems are clean.
            I would like to see whether we can minimize the impact of P2P in cases where it should not affect the runtime numbers.

            I will work with you offline with toy builds to make sure there are no uncaught P2P issues before declaring that turning off P2P globally is the solution.

            neil.huang Neil Huang added a comment -

            http://perf.jenkins.couchbase.com/job/titan/12959/console is a dry run with the fix for https://issues.couchbase.com/browse/MB-50095 applied. The regression number looks better: the rate went back up to 527K.

            The MB-50095 fix is checked in now, so the next build should be good to re-run for a more official number.


            bo-chun.wang Bo-Chun Wang added a comment -

            The rate is back to 558K in the run with build 7.1.0-1970. I'm closing this issue.

            http://perf.jenkins.couchbase.com/job/titan/12970/

            People

              Assignee: bo-chun.wang Bo-Chun Wang
              Reporter: bo-chun.wang Bo-Chun Wang
