Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-57130

[BP 7.2.1] - XDCR - CheckpointMgr Initialize error will lead to memory leak

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • 7.2.1
    • 7.0.0, 7.0.1, 7.0.2, 7.0.3, 7.0.4, 7.0.5, 7.1.0, 7.1.1, 7.1.2, 7.2.0, 7.1.3, 7.2.1
    • XDCR
    • Untriaged
    • 0
    • No

    Description

      There are two issues. The first is a backfill pipeline-type memory leak. The second one is a main pipeline memory leak that is found while attempting to perform theretical exercise from the first investigation. See below.

      Backfill Pipeline Memory leak

      1
      If the backfill pipeline started, it should have been registered
      https://src.couchbase.org/source/xref/7.1.3/goproj/src/github.com/couchbase/goxdcr/pipeline_manager/pipeline_manager.go#1015-1036
      And during stopping it should have stopped it.
      https://src.couchbase.org/source/xref/7.1.3/goproj/src/github.com/couchbase/goxdcr/pipeline_manager/pipeline_manager.go#1058-1062

      Why didn’t it?

      The reason is because https://src.couchbase.org/source/xref/7.1.3/goproj/src/github.com/couchbase/goxdcr/pipeline_manager/pipeline_manager.go#1007-1013 returns early before setting it to replication status, which will cause a pipeline memory leak as the StopBackfillPipeline() won’t find said pipeline because it can’t find it from the repStatus (https://src.couchbase.org/source/xref/7.1.3/goproj/src/github.com/couchbase/goxdcr/pipeline_manager/pipeline_manager.go#1016) is not called.

      Even though the pipeline Start() was never called, the async listeners were created and because pipelineStop was never called, the async listeners stayed around bloating up memory.

      I was able to reproduce it locally by forcing that error path in code, manually forcing a backfill to start, and watch the restart happening.

      From 5 count to
      heap profile: 247: 9215104 [3256: 34060560] @ heap/1048576
      12: 983040 [12: 983040] @ 0x100539205 0x100b45fea 0x100b45fc0 0x100b448d2 0x100b42ec6 0x10065da33 0x1006623a5 0x10008ddc2 0x100661ae7 0x100661ab5 0x100072541
      #	0x100539204	github.com/couchbase/goxdcr/component.NewAsyncComponentEventListenerImpl+0x44			/Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/component/async_listener.go:38
      #	0x100b45fe9	github.com/couchbase/goxdcr/component.NewDefaultAsyncComponentEventListenerImpl+0x369		/Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/component/async_listener.go:48
      #	0x100b45fbf	github.com/couchbase/goxdcr/factory.(*XDCRFactory).registerAsyncListenersOnTargets+0x33f	/Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/factory/xdcr_factory.go:426
      #	0x100b448d1	github.com/couchbase/goxdcr/factory.(*XDCRFactory).newPipelineCommon+0x18b1			/Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/factory/xdcr_factory.go:330
      #	0x100b42ec5	github.com/couchbase/goxdcr/factory.(*XDCRFactory).NewSecondaryPipeline+0x145			/Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/factory/xdcr_factory.go:160
      #	0x10065da32	github.com/couchbase/goxdcr/pipeline_manager.(*PipelineManager).StartBackfillPipeline+0x6f2	/Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/pipeline_manager/pipeline_manager.go:1014
      #	0x1006623a4	github.com/couchbase/goxdcr/pipeline_manager.(*PipelineUpdater).run.func1+0x884			/Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/pipeline_manager/pipeline_manager.go:1656
      #	0x10008ddc1	sync.(*Once).doSlow+0xc1
      to
      14: 1146880 [14: 1146880] @ 0x100539205 0x100b457fa 0x100b457d3 0x100b448ae 0x100b42ec6 0x10065da33 0x1006623a5 0x10008ddc2 0x100661ae7 0x100661ab5 0x100072541
      #	0x100539204	github.com/couchbase/goxdcr/component.NewAsyncComponentEventListenerImpl+0x44			/Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/component/async_listener.go:38
      #	0x100b457f9	github.com/couchbase/goxdcr/component.NewDefaultAsyncComponentEventListenerImpl+0x4f9		/Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/component/async_listener.go:48
      #	0x100b457d2	github.com/couchbase/goxdcr/factory.(*XDCRFactory).registerAsyncListenersOnSources+0x4d2	/Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/factory/xdcr_factory.go:383
      #	0x100b448ad	github.com/couchbase/goxdcr/factory.(*XDCRFactory).newPipelineCommon+0x188d			/Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/factory/xdcr_factory.go:329
      #	0x100b42ec5	github.com/couchbase/goxdcr/factory.(*XDCRFactory).NewSecondaryPipeline+0x145			/Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/factory/xdcr_factory.go:160
      #	0x10065da32	github.com/couchbase/goxdcr/pipeline_manager.(*PipelineManager).StartBackfillPipeline+0x6f2	/Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/pipeline_manager/pipeline_manager.go:1014
      #	0x1006623a4	github.com/couchbase/goxdcr/pipeline_manager.(*PipelineUpdater).run.func1+0x884			/Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/pipeline_manager/pipeline_manager.go:1656
      #	0x10008ddc1	sync.(*Once).doSlow+0xc1									/Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.20/go/src/sync/once.go:74
      to
      heap profile: 342: 12294312 [4692: 79199096] @ heap/1048576
      15: 1228800 [15: 1228800] @ 0x100539205 0x100b46185 0x100b46156 0x100b448d2 0x100b42ec6 0x10065da33 0x1006623a5 0x10008ddc2 0x100661ae7 0x100661ab5 0x100072541
      #	0x100539204	github.com/couchbase/goxdcr/component.NewAsyncComponentEventListenerImpl+0x44			/Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/component/async_listener.go:38
      #	0x100b46184	github.com/couchbase/goxdcr/component.NewDefaultAsyncComponentEventListenerImpl+0x504		/Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/component/async_listener.go:48
      #	0x100b46155	github.com/couchbase/goxdcr/factory.(*XDCRFactory).registerAsyncListenersOnTargets+0x4d5	/Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/factory/xdcr_factory.go:435
      #	0x100b448d1	github.com/couchbase/goxdcr/factory.(*XDCRFactory).newPipelineCommon+0x18b1			/Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/factory/xdcr_factory.go:330
      #	0x100b42ec5	github.com/couchbase/goxdcr/factory.(*XDCRFactory).NewSecondaryPipeline+0x145			/Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/factory/xdcr_factory.go:160
      #	0x10065da32	github.com/couchbase/goxdcr/pipeline_manager.(*PipelineManager).StartBackfillPipeline+0x6f2	/Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/pipeline_manager/pipeline_manager.go:1014
      #	0x1006623a4	github.com/couchbase/goxdcr/pipeline_manager.(*PipelineUpdater).run.func1+0x884			/Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/pipeline_manager/pipeline_manager.go:1656
      #	0x10008ddc1	sync.(*Once).doSlow+0xc1									/Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.20/go/src/sync/once.go:74
       
       
      heap profile: 450: 16564672 [6213: 138609792] @ heap/1048576
      20: 1638400 [20: 1638400] @ 0x100539205 0x100b46285 0x100b4625e 0x100b448d2 0x100b42ec6 0x10065da33 0x1006623a5 0x10008ddc2 0x100661ae7 0x100661ab5 0x100072541
      #	0x100539204	github.com/couchbase/goxdcr/component.NewAsyncComponentEventListenerImpl+0x44			/Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/component/async_listener.go:38
      #	0x100b46284	github.com/couchbase/goxdcr/component.NewDefaultAsyncComponentEventListenerImpl+0x604		/Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/component/async_listener.go:48
      #	0x100b4625d	github.com/couchbase/goxdcr/factory.(*XDCRFactory).registerAsyncListenersOnTargets+0x5dd	/Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/factory/xdcr_factory.go:441
      #	0x100b448d1	github.com/couchbase/goxdcr/factory.(*XDCRFactory).newPipelineCommon+0x18b1			/Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/factory/xdcr_factory.go:330
      #	0x100b42ec5	github.com/couchbase/goxdcr/factory.(*XDCRFactory).NewSecondaryPipeline+0x145			/Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/factory/xdcr_factory.go:160
      #	0x10065da32	github.com/couchbase/goxdcr/pipeline_manager.(*PipelineManager).StartBackfillPipeline+0x6f2	/Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/pipeline_manager/pipeline_manager.go:1014
      #	0x1006623a4	github.com/couchbase/goxdcr/pipeline_manager.(*PipelineUpdater).run.func1+0x884			/Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/pipeline_manager/pipeline_manager.go:1656
      #	0x10008ddc1	sync.(*Once).doSlow+0xc1									/Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.20/go/src/sync/once.go:74
      

      as it grows overtime.

      2
      if there is no backfill task then why did the backfill pipeline keep restarting? https://src.couchbase.org/source/xref/7.1.3/goproj/src/github.com/couchbase/goxdcr/pipeline_manager/pipeline_manager.go#990-993 is supposed to prevent this from happening… yet the code went all the way into https://src.couchbase.org/source/xref/7.1.3/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#460-463

      With the complicated backfill task merging and all, there is a potential faulty defensive check where if a backfill task has an empty starting seqno, it could still get merged.
      https://src.couchbase.org/source/xref/7.1.3/goproj/src/github.com/couchbase/goxdcr/metadata/backfill_replication.go#1374

      Theoretically it is possible to get into this situation, but I cannot find a way to reproduce it. The defensive coding mentioned above here, though, is faulty, and theoretically could cause the situation we’re seeing here.

      In conclusion, there are two issues.
      The first issue is that somehow a backfill task with an invalid starting timestamp is merged and saved into the backfill spec. This triggers the error message from checkpoint manager (point 2). There is no log to pinpoint what happened as we don’t log that level of verbose detail for backfill calculations. However, a defensive programming bug was found, where if the defensive programming was done correctly, this issue would not have happened.

      The second issue is that once the invalid backfill task has been embedded, it causes Checkpoint manager to exit before the pipeline manager actually starts the pipeline and remembers the pipeline to be stopped. Then, when the pipeline manager attempts to restart the backfill pipeline, it couldn’t find that instance of the backfill pipeline to stop (due to the early error code exit from checkpoint manager, caused by the first issue). And thus, each restart will leak more async listeners lying around, unstopped.

      And since each async listener contains static alloc of channel (i.e. malloc chunk of memory), and it isn’t stopped, the GC won’t clear it and thus XDCR memory usage goes up.

      Main Pipeline Memory leak

      From investigating the first and testing out the fix, another similar theoretical error could happen to the main pipeline when starting checkpoint manager.
      https://src.couchbase.org/source/xref/7.1.3/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#406-409
      The error here could happen if it has problem connecting or parsing target bucket information.

      Similarly, when the Attach() method is called, the pipeline isn't in a starting state. Any further attempts to stop it would be ignored.

      I tested via a manual injected error at this point, and we see the same in the logs:

      2023-05-26T22:07:21.747-07:00 INFO GOXDCR.PipelineMgr: Try to fix Pipeline 18175263f0df590e9d431e4c21936470/B1/B2. Current error(s)=pipelineMgr.pipeline_factory.NewPipeline(18175263f0df590e9d431e4c21936470/B1/B2) : NEIL DEBUG fake main error
      2023-05-26T22:07:21.747-07:00 INFO GOXDCR.PipelineMgr: Try to stop pipeline 18175263f0df590e9d431e4c21936470/B1/B2
      2023-05-26T22:07:21.747-07:00 INFO GOXDCR.PipelineMgr: Replication status is updated now, current status=name={18175263f0df590e9d431e4c21936470/B1/B2}, status={Pending}, errors={[{"time":"2023-05-26T22:07:11.747386-07:00","errMsg":"pipelineMgr.pipeline_factory.NewPipeline(18175263f0df590e9d431e4c21936470/B1/B2):NEIL DEBUG fake main error"},{"time":"2023-05-26T22:07:01.707807-...07:00","errMsg":"pipelineMgr.pipeline_factory.NewPipeline(18175263f0df590e9d431e4c21936470/B1/B2):NEIL DEBUG fake main error"}]}, oldProgress={2 target nozzles have been constructed}, progress={Source nozzles have been wired to target nozzles}
      2023-05-26T22:07:21.747-07:00 INFO GOXDCR.PipelineMgr: Trying to stop the pipeline 18175263f0df590e9d431e4c21936470/B1/B2
      2023-05-26T22:07:21.747-07:00 INFO GOXDCR.PipelineMgr: Pipeline 18175263f0df590e9d431e4c21936470/B1/B2 is not running
      2023-05-26T22:07:21.747-07:00 INFO GOXDCR.PipelineMgr: Starting the pipeline 18175263f0df590e9d431e4c21936470/B1/B2
      

      Similarly we see memory leak in the heap profile:

      heap profile: 341: 16765864 [2900: 80881656] @ heap/1048576
      16: 1310720 [16: 1310720] @ 0x100536be5 0x100b42d4a 0x100b42d20 0x100b41632 0x100b3f994 0x1006561e8 0x100660ae5 0x10065ee29 0x10008c762 0x10065e827 0x10065e7f5 0x100070ee1
      #	0x100536be4	github.com/couchbase/goxdcr/component.NewAsyncComponentEventListenerImpl+0x44			/Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/component/async_listener.go:38
      #	0x100b42d49	github.com/couchbase/goxdcr/component.NewDefaultAsyncComponentEventListenerImpl+0x369		/Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/component/async_listener.go:48
      #	0x100b42d1f	github.com/couchbase/goxdcr/factory.(*XDCRFactory).registerAsyncListenersOnTargets+0x33f	/Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/factory/xdcr_factory.go:426
      #	0x100b41631	github.com/couchbase/goxdcr/factory.(*XDCRFactory).newPipelineCommon+0x18b1			/Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/factory/xdcr_factory.go:330
      #	0x100b3f993	github.com/couchbase/goxdcr/factory.(*XDCRFactory).NewPipeline+0x153				/Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/factory/xdcr_factory.go:134
      #	0x1006561e7	github.com/couchbase/goxdcr/pipeline_manager.(*PipelineManager).StartPipeline+0x2a7		/Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/pipeline_manager/pipeline_manager.go:455
      #	0x100660ae4	github.com/couchbase/goxdcr/pipeline_manager.(*PipelineUpdater).update+0x424			/Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/pipeline_manager/pipeline_manager.go:1935
      #	0x10065ee28	github.com/couchbase/goxdcr/pipeline_manager.(*PipelineUpdater).run.func1+0x5c8			/Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/pipeline_manager/pipeline_manager.go:1644
      

      The fix should apply to both.

      For the record, the main pipeline memory leak could have dated back even pre-collections (7.0).

       

      Issue Resolution
      A Checkpoint Manager Initialization error caused two memory leak types. These were a backfill pipeline and a main pipeline memory leak. The Pipeline Manager and backfill pipeline have both been modified to prevent the memory leaks.

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            People

              ayush.nayyar Ayush Nayyar
              neil.huang Neil Huang
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty