Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-48614

XDCR - P2P - adjust time allowed to run P2P protocol before timing out

    XMLWordPrintable

Details

    • Task
    • Resolution: Fixed
    • Major
    • 7.1.0
    • 7.1.0
    • XDCR
    • None
    • 1

    Description

      The timeout for running p2p protocol before each pipeline start is 60 seconds. That may not be enough for a large running system with pegged CPUs:

      Looking at the first instance of P2P timeout:

      172.23.108.103_20210927-000521/ns_server.goxdcr.log

      2021-09-26T13:04:55.079-07:00 INFO GOXDCR.GenericPipeline: Executing Action timed out
      2021-09-26T13:04:55.079-07:00 INFO GOXDCR.GenericPipeline: ****************************
      2021-09-26T13:04:55.079-07:00 ERRO GOXDCR.GenericPipeline: 0d53a5e6437fbb31d3ffddd9e55c21c6/bucket9/bucket9-955799311 failed to start, err=map[genericPipeline.RunP2PProtocol:Execution timed out]
      2021-09-26T13:04:55.079-07:00 ERRO GOXDCR.PipelineMgr: Failed to start the pipeline 0d53a5e6437fbb31d3ffddd9e55c21c6/bucket9/bucket9-955799311
      

      At the time, CPU starts getting busy:

      It's a correlation that as the CPU gets pegged, XDCR will need more time to perform P2P.
      It is also such that the first instance of a timeout failure is when CPU starts getting pegged, but no such timeout occurred beforehand.

      Moreover, looking at peer nodes, if a single node is slow to respond to a request, then it eats into the whole 1 minute process.

      One peer node was particular busy and slow to respond to general requests, at times taking up to 1 minute:

      FILE collectinfo-2021-09-27T000520-ns_1@172.23.97.121.zip -----------------------

      cbcollect_info_ns_1@172.23.97.121_20210927-000520/ns_server.goxdcr.log:2021-09-26T13:06:52.592-07:00 WARN GOXDCR.Utils: VBMasterCheckHandler(172.23.96.148:8091,0d53a5e6437fbb31d3ffddd9e55c21c6/default/remote) took 12.963134659s
      cbcollect_info_ns_1@172.23.97.121_20210927-000520/ns_server.goxdcr.log:2021-09-26T13:07:20.477-07:00 WARN GOXDCR.Utils: VBMasterCheckHandler(172.23.97.119:8091,0d53a5e6437fbb31d3ffddd9e55c21c6/bucket9/bucket9) took 27.881782721s
      cbcollect_info_ns_1@172.23.97.121_20210927-000520/ns_server.goxdcr.log:2021-09-26T13:07:47.493-07:00 WARN GOXDCR.Utils: VBMasterCheckHandler(172.23.99.25:8091,0d53a5e6437fbb31d3ffddd9e55c21c6/default/remote) took 12.462504646s
      cbcollect_info_ns_1@172.23.97.121_20210927-000520/ns_server.goxdcr.log:2021-09-26T13:08:14.514-07:00 WARN GOXDCR.Utils: VBMasterCheckHandler(172.23.99.21:8091,0d53a5e6437fbb31d3ffddd9e55c21c6/default/remote) took 26.808784537s
      cbcollect_info_ns_1@172.23.97.121_20210927-000520/ns_server.goxdcr.log:2021-09-26T13:09:08.824-07:00 WARN GOXDCR.Utils: VBMasterCheckHandler(172.23.99.20:8091,0d53a5e6437fbb31d3ffddd9e55c21c6/default/remote) took 29.423710138s
      cbcollect_info_ns_1@172.23.97.121_20210927-000520/ns_server.goxdcr.log:2021-09-26T13:10:02.887-07:00 WARN GOXDCR.Utils: VBMasterCheckHandler(172.23.106.100:8091,0d53a5e6437fbb31d3ffddd9e55c21c6/bucket9/bucket9) took 20.611870969s
      cbcollect_info_ns_1@172.23.97.121_20210927-000520/ns_server.goxdcr.log:2021-09-26T13:10:29.897-07:00 WARN GOXDCR.Utils: VBMasterCheckHandler(172.23.99.25:8091,0d53a5e6437fbb31d3ffddd9e55c21c6/default/remote) took 26.946682172s
      cbcollect_info_ns_1@172.23.97.121_20210927-000520/ns_server.goxdcr.log:2021-09-26T13:10:56.966-07:00 WARN GOXDCR.Utils: VBMasterCheckHandler(172.23.99.25:8091,0d53a5e6437fbb31d3ffddd9e55c21c6/default/remote) took 16.934218116s
      cbcollect_info_ns_1@172.23.97.121_20210927-000520/ns_server.goxdcr.log:2021-09-26T13:11:52.042-07:00 WARN GOXDCR.Utils: VBMasterCheckHandler(172.23.98.135:8091,0d53a5e6437fbb31d3ffddd9e55c21c6/bucket8/bucket8) took 9.276670459s
      cbcollect_info_ns_1@172.23.97.121_20210927-000520/ns_server.goxdcr.log:2021-09-26T13:12:19.056-07:00 WARN GOXDCR.Utils: VBMasterCheckHandler(172.23.99.25:8091,0d53a5e6437fbb31d3ffddd9e55c21c6/bucket9/bucket9) took 26.786388958s
      ...
      cbcollect_info_ns_1@172.23.97.121_20210927-000520/ns_server.goxdcr.log:2021-09-26T13:36:48.986-07:00 WARN GOXDCR.Utils: VBMasterCheckHandler(172.23.97.119:8091,0d53a5e6437fbb31d3ffddd9e55c21c6/bucket4/bucket4) took 1m13.958987404s
      

      Once the pull is done, it'll take time to merge all the ckpts and backfill replication in sequence... as it'll take time to do a RMW on all checkpoints.

      It makes sense to give the p2p protocol a bigger and longer breathing room as XDCR gets taxed more heavily per each pipeline start now given the P2P in place.

      We may also need to think about a more dynamic way to determine how long a P2P should run based on the number of replications in place. We have seen customer situations where they are running 60+ replications... we'll need to account for that scenario as well in a busy system so that pipeline has enough time to run P2P protocol before starting.

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            People

              neil.huang Neil Huang
              neil.huang Neil Huang
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty