Details
-
Bug
-
Resolution: Fixed
-
Critical
-
7.0.0, 7.0.1, 7.0.2, 7.0.3, 7.0.4, 7.0.5, 7.1.0, 7.1.1, 7.1.2, 7.2.0, 7.1.3, 7.2.1
-
Untriaged
-
0
-
No
Description
There are two issues. The first is a backfill pipeline-type memory leak. The second one is a main pipeline memory leak that is found while attempting to perform theretical exercise from the first investigation. See below.
Backfill Pipeline Memory leak
1
If the backfill pipeline started, it should have been registered
https://src.couchbase.org/source/xref/7.1.3/goproj/src/github.com/couchbase/goxdcr/pipeline_manager/pipeline_manager.go#1015-1036
And during stopping it should have stopped it.
https://src.couchbase.org/source/xref/7.1.3/goproj/src/github.com/couchbase/goxdcr/pipeline_manager/pipeline_manager.go#1058-1062
Why didn’t it?
The reason is because https://src.couchbase.org/source/xref/7.1.3/goproj/src/github.com/couchbase/goxdcr/pipeline_manager/pipeline_manager.go#1007-1013 returns early before setting it to replication status, which will cause a pipeline memory leak as the StopBackfillPipeline() won’t find said pipeline because it can’t find it from the repStatus (https://src.couchbase.org/source/xref/7.1.3/goproj/src/github.com/couchbase/goxdcr/pipeline_manager/pipeline_manager.go#1016) is not called.
Even though the pipeline Start() was never called, the async listeners were created and because pipelineStop was never called, the async listeners stayed around bloating up memory.
I was able to reproduce it locally by forcing that error path in code, manually forcing a backfill to start, and watch the restart happening.
From 5 count to
|
…
|
heap profile: 247: 9215104 [3256: 34060560] @ heap/1048576
|
12: 983040 [12: 983040] @ 0x100539205 0x100b45fea 0x100b45fc0 0x100b448d2 0x100b42ec6 0x10065da33 0x1006623a5 0x10008ddc2 0x100661ae7 0x100661ab5 0x100072541
|
# 0x100539204 github.com/couchbase/goxdcr/component.NewAsyncComponentEventListenerImpl+0x44 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/component/async_listener.go:38
|
# 0x100b45fe9 github.com/couchbase/goxdcr/component.NewDefaultAsyncComponentEventListenerImpl+0x369 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/component/async_listener.go:48
|
# 0x100b45fbf github.com/couchbase/goxdcr/factory.(*XDCRFactory).registerAsyncListenersOnTargets+0x33f /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/factory/xdcr_factory.go:426
|
# 0x100b448d1 github.com/couchbase/goxdcr/factory.(*XDCRFactory).newPipelineCommon+0x18b1 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/factory/xdcr_factory.go:330
|
# 0x100b42ec5 github.com/couchbase/goxdcr/factory.(*XDCRFactory).NewSecondaryPipeline+0x145 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/factory/xdcr_factory.go:160
|
# 0x10065da32 github.com/couchbase/goxdcr/pipeline_manager.(*PipelineManager).StartBackfillPipeline+0x6f2 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/pipeline_manager/pipeline_manager.go:1014
|
# 0x1006623a4 github.com/couchbase/goxdcr/pipeline_manager.(*PipelineUpdater).run.func1+0x884 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/pipeline_manager/pipeline_manager.go:1656
|
# 0x10008ddc1 sync.(*Once).doSlow+0xc1
|
…
|
to
|
14: 1146880 [14: 1146880] @ 0x100539205 0x100b457fa 0x100b457d3 0x100b448ae 0x100b42ec6 0x10065da33 0x1006623a5 0x10008ddc2 0x100661ae7 0x100661ab5 0x100072541
|
# 0x100539204 github.com/couchbase/goxdcr/component.NewAsyncComponentEventListenerImpl+0x44 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/component/async_listener.go:38
|
# 0x100b457f9 github.com/couchbase/goxdcr/component.NewDefaultAsyncComponentEventListenerImpl+0x4f9 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/component/async_listener.go:48
|
# 0x100b457d2 github.com/couchbase/goxdcr/factory.(*XDCRFactory).registerAsyncListenersOnSources+0x4d2 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/factory/xdcr_factory.go:383
|
# 0x100b448ad github.com/couchbase/goxdcr/factory.(*XDCRFactory).newPipelineCommon+0x188d /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/factory/xdcr_factory.go:329
|
# 0x100b42ec5 github.com/couchbase/goxdcr/factory.(*XDCRFactory).NewSecondaryPipeline+0x145 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/factory/xdcr_factory.go:160
|
# 0x10065da32 github.com/couchbase/goxdcr/pipeline_manager.(*PipelineManager).StartBackfillPipeline+0x6f2 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/pipeline_manager/pipeline_manager.go:1014
|
# 0x1006623a4 github.com/couchbase/goxdcr/pipeline_manager.(*PipelineUpdater).run.func1+0x884 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/pipeline_manager/pipeline_manager.go:1656
|
# 0x10008ddc1 sync.(*Once).doSlow+0xc1 /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.20/go/src/sync/once.go:74
|
…
|
to
|
heap profile: 342: 12294312 [4692: 79199096] @ heap/1048576
|
15: 1228800 [15: 1228800] @ 0x100539205 0x100b46185 0x100b46156 0x100b448d2 0x100b42ec6 0x10065da33 0x1006623a5 0x10008ddc2 0x100661ae7 0x100661ab5 0x100072541
|
# 0x100539204 github.com/couchbase/goxdcr/component.NewAsyncComponentEventListenerImpl+0x44 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/component/async_listener.go:38
|
# 0x100b46184 github.com/couchbase/goxdcr/component.NewDefaultAsyncComponentEventListenerImpl+0x504 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/component/async_listener.go:48
|
# 0x100b46155 github.com/couchbase/goxdcr/factory.(*XDCRFactory).registerAsyncListenersOnTargets+0x4d5 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/factory/xdcr_factory.go:435
|
# 0x100b448d1 github.com/couchbase/goxdcr/factory.(*XDCRFactory).newPipelineCommon+0x18b1 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/factory/xdcr_factory.go:330
|
# 0x100b42ec5 github.com/couchbase/goxdcr/factory.(*XDCRFactory).NewSecondaryPipeline+0x145 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/factory/xdcr_factory.go:160
|
# 0x10065da32 github.com/couchbase/goxdcr/pipeline_manager.(*PipelineManager).StartBackfillPipeline+0x6f2 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/pipeline_manager/pipeline_manager.go:1014
|
# 0x1006623a4 github.com/couchbase/goxdcr/pipeline_manager.(*PipelineUpdater).run.func1+0x884 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/pipeline_manager/pipeline_manager.go:1656
|
# 0x10008ddc1 sync.(*Once).doSlow+0xc1 /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.20/go/src/sync/once.go:74
|
|
…
|
|
heap profile: 450: 16564672 [6213: 138609792] @ heap/1048576
|
20: 1638400 [20: 1638400] @ 0x100539205 0x100b46285 0x100b4625e 0x100b448d2 0x100b42ec6 0x10065da33 0x1006623a5 0x10008ddc2 0x100661ae7 0x100661ab5 0x100072541
|
# 0x100539204 github.com/couchbase/goxdcr/component.NewAsyncComponentEventListenerImpl+0x44 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/component/async_listener.go:38
|
# 0x100b46284 github.com/couchbase/goxdcr/component.NewDefaultAsyncComponentEventListenerImpl+0x604 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/component/async_listener.go:48
|
# 0x100b4625d github.com/couchbase/goxdcr/factory.(*XDCRFactory).registerAsyncListenersOnTargets+0x5dd /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/factory/xdcr_factory.go:441
|
# 0x100b448d1 github.com/couchbase/goxdcr/factory.(*XDCRFactory).newPipelineCommon+0x18b1 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/factory/xdcr_factory.go:330
|
# 0x100b42ec5 github.com/couchbase/goxdcr/factory.(*XDCRFactory).NewSecondaryPipeline+0x145 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/factory/xdcr_factory.go:160
|
# 0x10065da32 github.com/couchbase/goxdcr/pipeline_manager.(*PipelineManager).StartBackfillPipeline+0x6f2 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/pipeline_manager/pipeline_manager.go:1014
|
# 0x1006623a4 github.com/couchbase/goxdcr/pipeline_manager.(*PipelineUpdater).run.func1+0x884 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/pipeline_manager/pipeline_manager.go:1656
|
# 0x10008ddc1 sync.(*Once).doSlow+0xc1 /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.20/go/src/sync/once.go:74
|
as it grows overtime.
2
if there is no backfill task then why did the backfill pipeline keep restarting? https://src.couchbase.org/source/xref/7.1.3/goproj/src/github.com/couchbase/goxdcr/pipeline_manager/pipeline_manager.go#990-993 is supposed to prevent this from happening… yet the code went all the way into https://src.couchbase.org/source/xref/7.1.3/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#460-463
With the complicated backfill task merging and all, there is a potential faulty defensive check where if a backfill task has an empty starting seqno, it could still get merged.
https://src.couchbase.org/source/xref/7.1.3/goproj/src/github.com/couchbase/goxdcr/metadata/backfill_replication.go#1374
Theoretically it is possible to get into this situation, but I cannot find a way to reproduce it. The defensive coding mentioned above here, though, is faulty, and theoretically could cause the situation we’re seeing here.
In conclusion, there are two issues.
The first issue is that somehow a backfill task with an invalid starting timestamp is merged and saved into the backfill spec. This triggers the error message from checkpoint manager (point 2). There is no log to pinpoint what happened as we don’t log that level of verbose detail for backfill calculations. However, a defensive programming bug was found, where if the defensive programming was done correctly, this issue would not have happened.
The second issue is that once the invalid backfill task has been embedded, it causes Checkpoint manager to exit before the pipeline manager actually starts the pipeline and remembers the pipeline to be stopped. Then, when the pipeline manager attempts to restart the backfill pipeline, it couldn’t find that instance of the backfill pipeline to stop (due to the early error code exit from checkpoint manager, caused by the first issue). And thus, each restart will leak more async listeners lying around, unstopped.
And since each async listener contains static alloc of channel (i.e. malloc chunk of memory), and it isn’t stopped, the GC won’t clear it and thus XDCR memory usage goes up.
Main Pipeline Memory leak
From investigating the first and testing out the fix, another similar theoretical error could happen to the main pipeline when starting checkpoint manager.
https://src.couchbase.org/source/xref/7.1.3/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#406-409
The error here could happen if it has problem connecting or parsing target bucket information.
Similarly, when the Attach() method is called, the pipeline isn't in a starting state. Any further attempts to stop it would be ignored.
I tested via a manual injected error at this point, and we see the same in the logs:
2023-05-26T22:07:21.747-07:00 INFO GOXDCR.PipelineMgr: Try to fix Pipeline 18175263f0df590e9d431e4c21936470/B1/B2. Current error(s)=pipelineMgr.pipeline_factory.NewPipeline(18175263f0df590e9d431e4c21936470/B1/B2) : NEIL DEBUG fake main error
|
2023-05-26T22:07:21.747-07:00 INFO GOXDCR.PipelineMgr: Try to stop pipeline 18175263f0df590e9d431e4c21936470/B1/B2
|
2023-05-26T22:07:21.747-07:00 INFO GOXDCR.PipelineMgr: Replication status is updated now, current status=name={18175263f0df590e9d431e4c21936470/B1/B2}, status={Pending}, errors={[{"time":"2023-05-26T22:07:11.747386-07:00","errMsg":"pipelineMgr.pipeline_factory.NewPipeline(18175263f0df590e9d431e4c21936470/B1/B2):NEIL DEBUG fake main error"},{"time":"2023-05-26T22:07:01.707807-...07:00","errMsg":"pipelineMgr.pipeline_factory.NewPipeline(18175263f0df590e9d431e4c21936470/B1/B2):NEIL DEBUG fake main error"}]}, oldProgress={2 target nozzles have been constructed}, progress={Source nozzles have been wired to target nozzles}
|
2023-05-26T22:07:21.747-07:00 INFO GOXDCR.PipelineMgr: Trying to stop the pipeline 18175263f0df590e9d431e4c21936470/B1/B2
|
2023-05-26T22:07:21.747-07:00 INFO GOXDCR.PipelineMgr: Pipeline 18175263f0df590e9d431e4c21936470/B1/B2 is not running
|
2023-05-26T22:07:21.747-07:00 INFO GOXDCR.PipelineMgr: Starting the pipeline 18175263f0df590e9d431e4c21936470/B1/B2
|
Similarly we see memory leak in the heap profile:
heap profile: 341: 16765864 [2900: 80881656] @ heap/1048576
|
16: 1310720 [16: 1310720] @ 0x100536be5 0x100b42d4a 0x100b42d20 0x100b41632 0x100b3f994 0x1006561e8 0x100660ae5 0x10065ee29 0x10008c762 0x10065e827 0x10065e7f5 0x100070ee1
|
# 0x100536be4 github.com/couchbase/goxdcr/component.NewAsyncComponentEventListenerImpl+0x44 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/component/async_listener.go:38
|
# 0x100b42d49 github.com/couchbase/goxdcr/component.NewDefaultAsyncComponentEventListenerImpl+0x369 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/component/async_listener.go:48
|
# 0x100b42d1f github.com/couchbase/goxdcr/factory.(*XDCRFactory).registerAsyncListenersOnTargets+0x33f /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/factory/xdcr_factory.go:426
|
# 0x100b41631 github.com/couchbase/goxdcr/factory.(*XDCRFactory).newPipelineCommon+0x18b1 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/factory/xdcr_factory.go:330
|
# 0x100b3f993 github.com/couchbase/goxdcr/factory.(*XDCRFactory).NewPipeline+0x153 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/factory/xdcr_factory.go:134
|
# 0x1006561e7 github.com/couchbase/goxdcr/pipeline_manager.(*PipelineManager).StartPipeline+0x2a7 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/pipeline_manager/pipeline_manager.go:455
|
# 0x100660ae4 github.com/couchbase/goxdcr/pipeline_manager.(*PipelineUpdater).update+0x424 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/pipeline_manager/pipeline_manager.go:1935
|
# 0x10065ee28 github.com/couchbase/goxdcr/pipeline_manager.(*PipelineUpdater).run.func1+0x5c8 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/pipeline_manager/pipeline_manager.go:1644
|
The fix should apply to both.
For the record, the main pipeline memory leak could have dated back even pre-collections (7.0).
Issue | Resolution |
A Checkpoint Manager Initialization error caused two memory leak types. These were a backfill pipeline and a main pipeline memory leak. | The Pipeline Manager and backfill pipeline have both been modified to prevent the memory leaks. |
Attachments
Issue Links
- is a backport of
-
MB-57129 XDCR - CheckpointMgr Initialize error will lead to memory leak
- Resolved