Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-50542

[System Test][XDCR] stopPart timeout observed in longevity

    XMLWordPrintable

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 7.1.0
    • Neo.next
    • XDCR
    • Untriaged
    • 1
    • Unknown

    Description

      7.1.0-2097

      Test:
      -test tests/integration/neo/test_neo_couchstore_milestone4.yml -scope tests/integration/neo/scope_couchstore.yml
      Scale 3
      Iteration 3

      stopPart timeout investigated in MB-48745 is observed in latest longevity run:

      2022-01-19T19:53:44.791-08:00 INFO GOXDCR.GenericPipeline: Executing Action timed out
      2022-01-19T19:53:44.791-08:00 INFO GOXDCR.GenericPipeline: ****************************
      2022-01-19T19:53:44.791-08:00 WARN GOXDCR.GenericPipeline: backfill_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4-964860287 error stopping pipeline parts.%!(EXTRA *errors.errorString=Execution timed out)
      2022-01-19T19:53:44.791-08:00 INFO GOXDCR.GenericPipeline: backfill_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4-964860287 trying to stop connector Router_dcp_backfill_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4_172.23.108.103:11210_0
      2022-01-19T19:53:44.791-08:00 INFO GOXDCR.XDCRRouter: Router Router_dcp_backfill_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4_172.23.108.103:11210_0 stopping...
      2022-01-19T19:53:44.791-08:00 INFO GOXDCR.XDCRRouter: CollectionsRouter Router_dcp_backfill_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4_172.23.108.103:11210_0 is now stopping
      2022-01-19T19:53:44.791-08:00 INFO GOXDCR.XDCRRouter: CollectionsRouter Router_dcp_backfill_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4_172.23.108.103:11210_0 is now stopping
      2022-01-19T19:53:44.791-08:00 INFO GOXDCR.XDCRRouter: Router Router_dcp_backfill_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4_172.23.108.103:11210_0 stopped
      2022-01-19T19:53:44.791-08:00 INFO GOXDCR.GenericPipeline: backfill_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4-964860287 pipeline connectors have stopped successfully
      2022-01-19T19:53:44.791-08:00 INFO GOXDCR.GenericPipeline: BackfillPipeline backfill_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4-964860287 has been stopped
       errMap=map[genericPipeline.StopParts:Execution timed out]
      2022-01-19T19:53:44.791-08:00 ERRO GOXDCR.PipelineMgr: Received error(s) when stopping backfill pipeline b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4 - genericPipeline.StopParts : Execution timed out
      2022-01-19T19:53:44.791-08:00 INFO GOXDCR.PipelineMgr: Replication Status=name={b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4}, status={Replicating}, errors={[]}, oldProgress={All incoming nozzles have been opened}, progress={Pipeline is running}, oldBackfillProgress={Source nozzles have been closed}, backfillProgress={Pipeline has been stopped}
      2022-01-19T19:53:44.791-08:00 INFO GOXDCR.PipelineMgr: b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4 calling one backfill callback...
      2022-01-19T19:53:45.137-08:00 INFO GOXDCR.TopoChangeDet: TopologyChangeDetectorSvc for pipeline b9d9658285c849a3e59d3640ed11f6ba/default/remote handleTargetTopologyChange completed
      2022-01-19T19:53:45.156-08:00 INFO GOXDCR.StatsMgr: b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4 total_docs=1343242, docs_processed=1343242, changes_left=0
      2022-01-19T19:53:45.177-08:00 INFO GOXDCR.StatsMgr: b9d9658285c849a3e59d3640ed11f6ba/bucket9/bucket9 total_docs=98952, docs_processed=98084, changes_left=868
      2022-01-19T19:53:45.229-08:00 INFO GOXDCR.StatsMgr: b9d9658285c849a3e59d3640ed11f6ba/default/remote total_docs=38815111, docs_processed=38138302, changes_left=676809
      2022-01-19T19:53:45.232-08:00 INFO GOXDCR.StatsMgr: b9d9658285c849a3e59d3640ed11f6ba/bucket8/bucket8 total_docs=101232, docs_processed=100344, changes_left=888
      2022-01-19T19:53:45.318-08:00 INFO GOXDCR.TopoChangeDet: Source topology has changed for pipeline b9d9658285c849a3e59d3640ed11f6ba/default/remote
      2022-01-19T19:53:45.318-08:00 INFO GOXDCR.TopoChangeDet: Pipeline b9d9658285c849a3e59d3640ed11f6ba/default/remote - vblist_supposed=[0 9 10 11 12 13 14 15 16 17 3
      

      Logs:
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1642657419/collectinfo-2022-01-20T054341-ns_1%40172.23.104.137.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1642657419/collectinfo-2022-01-20T054341-ns_1%40172.23.104.155.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1642657419/collectinfo-2022-01-20T054341-ns_1%40172.23.104.157.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1642657419/collectinfo-2022-01-20T054341-ns_1%40172.23.104.5.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1642657419/collectinfo-2022-01-20T054341-ns_1%40172.23.104.67.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1642657419/collectinfo-2022-01-20T054341-ns_1%40172.23.104.69.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1642657419/collectinfo-2022-01-20T054341-ns_1%40172.23.104.70.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1642657419/collectinfo-2022-01-20T054341-ns_1%40172.23.105.107.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1642657419/collectinfo-2022-01-20T054341-ns_1%40172.23.105.111.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1642657419/collectinfo-2022-01-20T054341-ns_1%40172.23.106.100.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1642657419/collectinfo-2022-01-20T054341-ns_1%40172.23.106.188.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1642657419/collectinfo-2022-01-20T054341-ns_1%40172.23.108.103.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1642657419/collectinfo-2022-01-20T054341-ns_1%40172.23.120.107.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1642657419/collectinfo-2022-01-20T054341-ns_1%40172.23.120.245.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1642657419/collectinfo-2022-01-20T054341-ns_1%40172.23.121.117.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1642657419/collectinfo-2022-01-20T054341-ns_1%40172.23.123.28.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1642657419/collectinfo-2022-01-20T054341-ns_1%40172.23.96.148.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1642657419/collectinfo-2022-01-20T054341-ns_1%40172.23.96.251.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1642657419/collectinfo-2022-01-20T054341-ns_1%40172.23.96.252.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1642657419/collectinfo-2022-01-20T054341-ns_1%40172.23.96.253.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1642657419/collectinfo-2022-01-20T054341-ns_1%40172.23.97.119.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1642657419/collectinfo-2022-01-20T054341-ns_1%40172.23.97.121.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1642657419/collectinfo-2022-01-20T054341-ns_1%40172.23.97.122.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1642657419/collectinfo-2022-01-20T054341-ns_1%40172.23.97.239.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1642657419/collectinfo-2022-01-20T054341-ns_1%40172.23.98.135.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1642657419/collectinfo-2022-01-20T054341-ns_1%40172.23.99.11.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1642657419/collectinfo-2022-01-20T054341-ns_1%40172.23.99.21.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1642657419/collectinfo-2022-01-20T054341-ns_1%40172.23.99.25.zip

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          james.lee James Lee added a comment - - edited

          Assigning to XDCR and removing tools, I don't believe this is a tools issue, please feel free to re-assign if I'm wrong.

          james.lee James Lee added a comment - - edited Assigning to XDCR and removing tools, I don't believe this is a tools issue, please feel free to re-assign if I'm wrong.

          Thanks James Lee, my bad! thanks for fixing the component..

          arunkumar Arunkumar Senthilnathan (Inactive) added a comment - Thanks James Lee , my bad! thanks for fixing the component..
          neil.huang Neil Huang added a comment -

          Traces of one node:

          cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:24.772-08:00 INFO GOXDCR.GenericPipeline: Stopping BackfillPipeline backfill_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4-404239169
          cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:24.774-08:00 INFO GOXDCR.RuntimeCtx: backfill_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4 Pipeline context is stopping...
          cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:24.774-08:00 INFO GOXDCR.GenericSupervisor: Stopping supervisor PipelineSupervisor_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4.
          cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:24.774-08:00 INFO GOXDCR.GenericSupervisor: monitorPipelineHealth routine is exiting because parent supervisor PipelineSupervisor_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4 ha
          s been stopped (originally started? true)
          cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:24.774-08:00 INFO GOXDCR.GenericSupervisor: checkAndLogFilterErrors routine is exiting because parent supervisor PipelineSupervisor_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4
          has been stopped
          cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:24.774-08:00 INFO GOXDCR.GenericSupervisor: Stopped supervisor PipelineSupervisor_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4.
          cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:24.775-08:00 INFO GOXDCR.XmemNozzle: Stopping xmem_backfill_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4_172.23.123.24:11210_0
          cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:24.775-08:00 INFO GOXDCR.XmemNozzle: xmem_backfill_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4_172.23.123.24:11210_0 has been stopped
          cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:24.775-08:00 INFO GOXDCR.XmemNozzle: Stopping xmem_backfill_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4_172.23.96.145:11210_0
          cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:24.775-08:00 INFO GOXDCR.XmemNozzle: xmem_backfill_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4_172.23.96.145:11210_0 has been stopped
          cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:24.775-08:00 INFO GOXDCR.DcpNozzle: dcp_backfill_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4_172.23.108.103:11210_0 is stopping...
          cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:24.778-08:00 INFO GOXDCR.DcpNozzle: dcp_backfill_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4_172.23.108.103:11210_0 checkInactiveUprStreams routine is exiting because dcp nozzl
          e has been stopped
          cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:34.660-08:00 INFO GOXDCR.PipelineMgr: Replication Status = map[b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4:name={b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4}, status={Repl
          icating}, errors={[]}, oldProgress={All incoming nozzles have been opened}, progress={Pipeline is running}, oldBackfillProgress={Runtime context has been stopped}, backfillProgress={Source nozzles have been closed} b9d9658285c849a3e59d3640ed11f6ba/bucket8/bucket8:name={
          b9d9658285c849a3e59d3640ed11f6ba/bucket8/bucket8}, status={Replicating}, errors={[]}, oldProgress={All incoming nozzles have been opened}, progress={Pipeline is running}, oldBackfillProgress={Source nozzles have been closed}, backfillProgress={Pipeline has been stopped}
           b9d9658285c849a3e59d3640ed11f6ba/bucket9/bucket9:name={b9d9658285c849a3e59d3640ed11f6ba/bucket9/bucket9}, status={Replicating}, errors={[]}, oldProgress={All incoming nozzles have been opened}, progress={Pipeline is running}, oldBackfillProgress={Source nozzles have be
          en closed}, backfillProgress={Pipeline has been stopped} b9d9658285c849a3e59d3640ed11f6ba/default/remote:name={b9d9658285c849a3e59d3640ed11f6ba/default/remote}, status={Replicating}, errors={[]}, oldProgress={All incoming nozzles have been opened}, progress={Pipeline is
           running}]
          cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:34.779-08:00 WARN GOXDCR.GenericPipeline: backfill_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4-404239169 error stopping pipeline parts.%!(EXTRA *errors.errorString=Execution timed out)
          cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:34.779-08:00 INFO GOXDCR.GenericPipeline: BackfillPipeline backfill_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4-404239169 has been stopped
          cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:34.779-08:00 ERRO GOXDCR.PipelineMgr: Received error(s) when stopping backfill pipeline b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4 - genericPipeline.StopParts : Execution timed out
          cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:34.779-08:00 INFO GOXDCR.PipelineMgr: Replication Status=name={b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4}, status={Replicating}, errors={[]}, oldProgress={All incoming nozzles have been opened}, progress={Pipeline is running}, oldBackfillProgress={Source nozzles have been closed}, backfillProgress={Pipeline has been stopped}
          cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:35.055-08:00 INFO GOXDCR.PipelineMgr: Replication b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4 backfill stop experienced error(s): genericPipeline.StopParts : Execution timed out. Will let it die
          cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:35.055-08:00 INFO GOXDCR.PipelineMgr: Replication status received startBackfill, current status=name={b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4}, status={Replicating}, errors={[]}, oldProgress={All incoming nozzles have been opened}, progress={Pipeline is running}, oldBackfillProgress={Source nozzles have been closed}, backfillProgress={Pipeline has been stopped}
          cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:35.109-08:00 INFO GOXDCR.DcpNozzle: dcp_backfill_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4_172.23.108.103:11210_0 has been stopped
          

          The timeout happens at 19:25:34 and the DCP nozzle stopped just a second afterwards.
          It seems like DCP nozzle is potentially causing stopPart timeouts.

          Looking at the logs, I think a few things are happening:

          1. I see backfill pipelines restarting without letting the system breathe as it processes backfill tasks
          2. The backfill pipeline start time and stop time are close. Probably due to the many small backfill tasks that are in place.
          3. DCP Nozzle has locking that can cause start/stop contention when it is started and stopped in close proximity.
          neil.huang Neil Huang added a comment - Traces of one node: cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:24.772-08:00 INFO GOXDCR.GenericPipeline: Stopping BackfillPipeline backfill_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4-404239169 cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:24.774-08:00 INFO GOXDCR.RuntimeCtx: backfill_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4 Pipeline context is stopping... cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:24.774-08:00 INFO GOXDCR.GenericSupervisor: Stopping supervisor PipelineSupervisor_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4. cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:24.774-08:00 INFO GOXDCR.GenericSupervisor: monitorPipelineHealth routine is exiting because parent supervisor PipelineSupervisor_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4 ha s been stopped (originally started? true) cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:24.774-08:00 INFO GOXDCR.GenericSupervisor: checkAndLogFilterErrors routine is exiting because parent supervisor PipelineSupervisor_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4 has been stopped cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:24.774-08:00 INFO GOXDCR.GenericSupervisor: Stopped supervisor PipelineSupervisor_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4. cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:24.775-08:00 INFO GOXDCR.XmemNozzle: Stopping xmem_backfill_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4_172.23.123.24:11210_0 cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:24.775-08:00 INFO GOXDCR.XmemNozzle: xmem_backfill_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4_172.23.123.24:11210_0 has been stopped cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:24.775-08:00 INFO GOXDCR.XmemNozzle: Stopping xmem_backfill_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4_172.23.96.145:11210_0 cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:24.775-08:00 INFO GOXDCR.XmemNozzle: xmem_backfill_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4_172.23.96.145:11210_0 has been stopped cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:24.775-08:00 INFO GOXDCR.DcpNozzle: dcp_backfill_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4_172.23.108.103:11210_0 is stopping... cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:24.778-08:00 INFO GOXDCR.DcpNozzle: dcp_backfill_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4_172.23.108.103:11210_0 checkInactiveUprStreams routine is exiting because dcp nozzl e has been stopped cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:34.660-08:00 INFO GOXDCR.PipelineMgr: Replication Status = map[b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4:name={b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4}, status={Repl icating}, errors={[]}, oldProgress={All incoming nozzles have been opened}, progress={Pipeline is running}, oldBackfillProgress={Runtime context has been stopped}, backfillProgress={Source nozzles have been closed} b9d9658285c849a3e59d3640ed11f6ba/bucket8/bucket8:name={ b9d9658285c849a3e59d3640ed11f6ba/bucket8/bucket8}, status={Replicating}, errors={[]}, oldProgress={All incoming nozzles have been opened}, progress={Pipeline is running}, oldBackfillProgress={Source nozzles have been closed}, backfillProgress={Pipeline has been stopped} b9d9658285c849a3e59d3640ed11f6ba/bucket9/bucket9:name={b9d9658285c849a3e59d3640ed11f6ba/bucket9/bucket9}, status={Replicating}, errors={[]}, oldProgress={All incoming nozzles have been opened}, progress={Pipeline is running}, oldBackfillProgress={Source nozzles have be en closed}, backfillProgress={Pipeline has been stopped} b9d9658285c849a3e59d3640ed11f6ba/default/remote:name={b9d9658285c849a3e59d3640ed11f6ba/default/remote}, status={Replicating}, errors={[]}, oldProgress={All incoming nozzles have been opened}, progress={Pipeline is running}] cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:34.779-08:00 WARN GOXDCR.GenericPipeline: backfill_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4-404239169 error stopping pipeline parts.%!(EXTRA *errors.errorString=Execution timed out) cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:34.779-08:00 INFO GOXDCR.GenericPipeline: BackfillPipeline backfill_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4-404239169 has been stopped cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:34.779-08:00 ERRO GOXDCR.PipelineMgr: Received error(s) when stopping backfill pipeline b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4 - genericPipeline.StopParts : Execution timed out cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:34.779-08:00 INFO GOXDCR.PipelineMgr: Replication Status=name={b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4}, status={Replicating}, errors={[]}, oldProgress={All incoming nozzles have been opened}, progress={Pipeline is running}, oldBackfillProgress={Source nozzles have been closed}, backfillProgress={Pipeline has been stopped} cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:35.055-08:00 INFO GOXDCR.PipelineMgr: Replication b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4 backfill stop experienced error(s): genericPipeline.StopParts : Execution timed out. Will let it die cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:35.055-08:00 INFO GOXDCR.PipelineMgr: Replication status received startBackfill, current status=name={b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4}, status={Replicating}, errors={[]}, oldProgress={All incoming nozzles have been opened}, progress={Pipeline is running}, oldBackfillProgress={Source nozzles have been closed}, backfillProgress={Pipeline has been stopped} cbcollect_info_ns_1@172.23.108.103_20220120-054341/ns_server.goxdcr.log:2022-01-19T19:25:35.109-08:00 INFO GOXDCR.DcpNozzle: dcp_backfill_b9d9658285c849a3e59d3640ed11f6ba/bucket4/bucket4_172.23.108.103:11210_0 has been stopped The timeout happens at 19:25:34 and the DCP nozzle stopped just a second afterwards. It seems like DCP nozzle is potentially causing stopPart timeouts. Looking at the logs, I think a few things are happening: I see backfill pipelines restarting without letting the system breathe as it processes backfill tasks The backfill pipeline start time and stop time are close. Probably due to the many small backfill tasks that are in place. DCP Nozzle has locking that can cause start/stop contention when it is started and stopped in close proximity.
          neil.huang Neil Huang added a comment -

          Just for the record, the stopPart timeout shown doesn't seem to be urgent errors.

          neil.huang Neil Huang added a comment - Just for the record, the stopPart timeout shown doesn't seem to be urgent errors.

          People

            neil.huang Neil Huang
            arunkumar Arunkumar Senthilnathan (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty