Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Affects Version/s: 6.5.0, 6.5.1, 6.6.0, 6.6.1, 6.6.2, 6.6.3, 6.6.4, 6.6.5
- Triage: Untriaged
- Is this a Regression?: No
Description
Looking a bit deeper into the code, the final cleanup happens here: http://src.couchbase.org/source/xref/6.6.2/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go#1203-1204
It waits for processData_sendbatch to exit. But that goroutine is blocked here: http://src.couchbase.org/source/xref/6.6.2/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go#1266-1267
The call to enSlot is stuck here, with no way out: http://src.couchbase.org/source/xref/6.6.2/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go#325
This can happen in an environment where the target cluster is consistently out of memory (ENOMEM) or busy (TMPFAIL), leaving one or more XMEM nozzles of a replication unable to replicate.
XMEM eventually fills up the buffer, and there is no way to drain it: the only exit path is the buffer returning an empty slot on a channel. When the pipelines are stopped, the goroutines stay stuck and the buffers are never released. This shows up as the steadily increasing number of rogue goroutines in the profiles below.
This seems to occur when we pause and resume pipelines while no document can be written at all. Each time we pause, we leak memory, because the cleanup process gets stuck and the buffer is never freed as a result.
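To make the shape of the hang concrete, below is a minimal, self-contained sketch of the pattern (illustrative names only, not the actual goxdcr code): a worker goroutine blocks on a channel receive with no accompanying finish/close signal, so the cleanup path's WaitGroup.Wait() can never return.

package main

import (
	"fmt"
	"sync"
	"time"
)

// requestBufferSketch mimics the empty-slot channel that enSlot blocks on.
type requestBufferSketch struct {
	emptySlots chan int
}

// enSlot blocks until an empty slot index arrives. Because there is no
// select on a finish/close channel here, a full buffer means no exit path.
func (b *requestBufferSketch) enSlot() int {
	return <-b.emptySlots
}

func main() {
	buf := &requestBufferSketch{emptySlots: make(chan int, 1)}
	buf.emptySlots <- 0 // exactly one free slot

	var childrenWaitGrp sync.WaitGroup
	childrenWaitGrp.Add(1)
	go func() { // stands in for processData_sendbatch
		defer childrenWaitGrp.Done()
		buf.enSlot() // takes the only slot
		buf.enSlot() // no empty slots remain (buffer is full): blocks forever
	}()

	cleanupDone := make(chan struct{})
	go func() { // stands in for finalCleanup
		childrenWaitGrp.Wait() // never returns: the worker is stuck in enSlot
		close(cleanupDone)
	}()

	select {
	case <-cleanupDone:
		fmt.Println("cleanup finished")
	case <-time.After(2 * time.Second):
		// This branch fires: both goroutines are leaked, which is
		// exactly the pattern seen in the goroutine profiles below.
		fmt.Println("cleanup stuck; goroutines leaked")
	}
}

One natural fix direction, visible in the sketch, is for enSlot to select on a finish channel alongside the empty-slot channel, so that a stopping pipeline can unblock it.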
Reproduction
To reproduce this locally, I had to make a few changes to my local build:
- As soon as XMEM receives the response for a SET_META, the code overrides the response status to ENOMEM (see the sketch after this list).
- I disabled the “Received temporary error”… warning, so the CPU doesn’t waste time printing messages to the log.
- I disabled flowControl to shorten the time it takes to fill up the buffer. (Note for the XDCR team: even with flow control, buffer fill-up isn’t prevented, because in a customer’s environment the trickle of resendWithReset and resendWithTimeout would have filled the buffer anyway; leaving it enabled just makes the issue harder to reproduce.)
- I disabled the backoffFactor, so that XMEM keeps filling the buffer even amidst a flurry of (falsely injected) ENOMEM responses instead of backing off and slowing its send rate.
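For illustration, the first change amounts to something like the following. injectENOMEM is a made-up helper name; the real change was a local edit to the response-handling path in xmem_nozzle.go.

package main

import (
	"fmt"

	mc "github.com/couchbase/gomemcached"
)

// injectENOMEM is a hypothetical hook: wherever the nozzle reads a SET_META
// response, overwrite its status so every write looks like it failed with
// out-of-memory, keeping the request parked in the buffer for retry.
func injectENOMEM(resp *mc.MCResponse) {
	resp.Status = mc.ENOMEM // force the out-of-memory status
}

func main() {
	resp := &mc.MCResponse{Status: mc.SUCCESS}
	injectENOMEM(resp)
	fmt.Println(resp.Status) // now ENOMEM, so XMEM's retry path kicks in
}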
Then I created 5 replications, let them run for 15 seconds, paused them for 15 seconds, and repeated.
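The pause/resume cycle can be driven through the documented XDCR REST endpoint (POST /settings/replications/<replication_id> with pauseRequested=true/false). A rough sketch of such a driver, where the host, credentials, and replication IDs are placeholders:

package main

import (
	"net/http"
	"net/url"
	"strconv"
	"strings"
	"time"
)

// setPaused flips pauseRequested on one replication via the XDCR REST API.
func setPaused(c *http.Client, replID string, paused bool) error {
	form := url.Values{"pauseRequested": {strconv.FormatBool(paused)}}
	req, err := http.NewRequest(http.MethodPost,
		"http://127.0.0.1:8091/settings/replications/"+url.PathEscape(replID),
		strings.NewReader(form.Encode()))
	if err != nil {
		return err
	}
	req.SetBasicAuth("Administrator", "password") // assumed credentials
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	resp, err := c.Do(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

func main() {
	client := &http.Client{Timeout: 10 * time.Second}
	replIDs := []string{"repl-1", "repl-2", "repl-3", "repl-4", "repl-5"} // placeholders
	for {
		for _, id := range replIDs {
			_ = setPaused(client, id, false) // run for 15 seconds
		}
		time.Sleep(15 * time.Second)
		for _, id := range replIDs {
			_ = setPaused(client, id, true) // pause for 15 seconds
		}
		time.Sleep(15 * time.Second)
	}
}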
As expected, enSlot gets stuck and goroutines leak (only the relevant parts are shown):
xdcr.20:42:08
goroutine profile: total 230

26 @ 0x403bbf6 0x400a9cc 0x400a3f8 0x450912b 0x450ec85 0x4510248 0x450e098 0x450dcac 0x406bbe1
# 0x450912a github.com/couchbase/goxdcr/parts.(*requestBuffer).enSlot+0x8a /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go:361
# 0x450ec84 github.com/couchbase/goxdcr/parts.(*XmemNozzle).batchSetMetaWithRetry+0x6e4 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go:1193
# 0x4510247 github.com/couchbase/goxdcr/parts.(*XmemNozzle).sendSetMeta_internal+0x47 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go:1436
# 0x450e097 github.com/couchbase/goxdcr/parts.(*XmemNozzle).processBatch+0x37 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go:1106
# 0x450dcab github.com/couchbase/goxdcr/parts.(*XmemNozzle).processData_sendbatch+0x48b /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go:1061

26 @ 0x403bbf6 0x404c5f3 0x404c5cd 0x4067645 0x4083d32 0x450e25f 0x406bbe1
# 0x4067644 sync.runtime_Semacquire+0x24 /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.18.1/go/src/runtime/sema.go:56
# 0x4083d31 sync.(*WaitGroup).Wait+0x51 /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.18.1/go/src/sync/waitgroup.go:136
# 0x450e25e github.com/couchbase/goxdcr/parts.(*XmemNozzle).finalCleanup+0x3e /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go:1122
5 minutes later:
xdcr.20:47:38
goroutine profile: total 248

36 @ 0x403bbf6 0x400a9cc 0x400a3f8 0x450912b 0x450ec85 0x4510248 0x450e098 0x450dcac 0x406bbe1
# 0x450912a github.com/couchbase/goxdcr/parts.(*requestBuffer).enSlot+0x8a /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go:361
# 0x450ec84 github.com/couchbase/goxdcr/parts.(*XmemNozzle).batchSetMetaWithRetry+0x6e4 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go:1193
# 0x4510247 github.com/couchbase/goxdcr/parts.(*XmemNozzle).sendSetMeta_internal+0x47 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go:1436
# 0x450e097 github.com/couchbase/goxdcr/parts.(*XmemNozzle).processBatch+0x37 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go:1106
# 0x450dcab github.com/couchbase/goxdcr/parts.(*XmemNozzle).processData_sendbatch+0x48b /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go:1061

36 @ 0x403bbf6 0x404c5f3 0x404c5cd 0x4067645 0x4083d32 0x450e25f 0x406bbe1
# 0x4067644 sync.runtime_Semacquire+0x24 /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.18.1/go/src/runtime/sema.go:56
# 0x4083d31 sync.(*WaitGroup).Wait+0x51 /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.18.1/go/src/sync/waitgroup.go:136
# 0x450e25e github.com/couchbase/goxdcr/parts.(*XmemNozzle).finalCleanup+0x3e /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go:1122
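For reference, dumps in this format (a total count followed by grouped, symbolized stacks) are what Go's runtime/pprof goroutine profile produces at debug level 1. A generic way to emit one, independent of however goxdcr schedules these dumps into its log:

package main

import (
	"os"
	"runtime/pprof"
)

func main() {
	// debug=1 prints "goroutine profile: total N" followed by grouped,
	// symbolized stacks: the exact format shown in the log excerpts above.
	pprof.Lookup("goroutine").WriteTo(os.Stdout, 1)
}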
Running the parseMemStats parser script I wrote, we see an evident memory leak:
neil.huang@NeilsMacbookPro:~/source/couchbase/ns_server$ cat logs/n_0/goxdcr.log | parseMemStats
----------------------
Date: 2022-07-21T20:39:07.508
HeapInUse: 207405056 (197 MB) Increased 197 MB
HeapIdle: 89047040 (84 MB) Increased 84 MB
HeapReleased: 3670016 (3 MB) Increased 3 MB
SumHeap: 292782080 (279 MB) Increased 279 MB
NextGCTarget: 277958976 (265 MB) Increased 265 MB
Allocs: 4328427792
----------------------
Date: 2022-07-21T20:41:07.541
HeapInUse: 301580288 (287 MB) Increased 89 MB
HeapIdle: 195706880 (186 MB) Increased 101 MB
HeapReleased: 37617664 (35 MB) Increased 32 MB
SumHeap: 459669504 (438 MB) Increased 159 MB
NextGCTarget: 476099664 (454 MB) Increased 188 MB
Allocs: 15179370856
----------------------
Date: 2022-07-21T20:43:07.542
HeapInUse: 465944576 (444 MB) Increased 156 MB
HeapIdle: 246169600 (234 MB) Increased 48 MB
HeapReleased: 524288 (0 MB) Decreased 35 MB
SumHeap: 711589888 (678 MB) Increased 240 MB
NextGCTarget: 683744976 (652 MB) Increased 198 MB
Allocs: 28068110944
----------------------
Date: 2022-07-21T20:45:07.542
HeapInUse: 488390656 (465 MB) Increased 21 MB
HeapIdle: 325730304 (310 MB) Increased 75 MB
HeapReleased: 51052544 (48 MB) Increased 48 MB
SumHeap: 763068416 (727 MB) Increased 49 MB
NextGCTarget: 708899072 (676 MB) Increased 23 MB
Allocs: 38806413256
----------------------
Date: 2022-07-21T20:47:07.544
HeapInUse: 848830464 (809 MB) Increased 343 MB
HeapIdle: 32432128 (30 MB) Decreased 279 MB
HeapReleased: 1572864 (1 MB) Decreased 47 MB
SumHeap: 879689728 (838 MB) Increased 111 MB
NextGCTarget: 851613248 (812 MB) Increased 136 MB
Allocs: 49594392688
----------------------
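For context on the fields: judging by the output, the script reports runtime.MemStats values, and SumHeap equals HeapInUse + HeapIdle - HeapReleased (every sample above satisfies that identity). Here is a sketch that samples the same quantities directly in-process rather than by parsing goxdcr.log; the script itself is not included in this ticket, so treat the field mapping (e.g. Allocs as TotalAlloc) as an assumption:

package main

import (
	"fmt"
	"runtime"
	"time"
)

func mb(b uint64) uint64 { return b / (1024 * 1024) }

// delta formats the change against the previous sample, matching the
// "Increased/Decreased N MB" annotations in the output above.
func delta(cur, prev uint64) string {
	if cur >= prev {
		return fmt.Sprintf("Increased %d MB", mb(cur-prev))
	}
	return fmt.Sprintf("Decreased %d MB", mb(prev-cur))
}

func main() {
	var prev runtime.MemStats
	for {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		sum := m.HeapInuse + m.HeapIdle - m.HeapReleased
		prevSum := prev.HeapInuse + prev.HeapIdle - prev.HeapReleased
		fmt.Println("----------------------")
		fmt.Println("Date:", time.Now().Format("2006-01-02T15:04:05.000"))
		fmt.Printf("HeapInUse: %d (%d MB) %s\n", m.HeapInuse, mb(m.HeapInuse), delta(m.HeapInuse, prev.HeapInuse))
		fmt.Printf("HeapIdle: %d (%d MB) %s\n", m.HeapIdle, mb(m.HeapIdle), delta(m.HeapIdle, prev.HeapIdle))
		fmt.Printf("HeapReleased: %d (%d MB) %s\n", m.HeapReleased, mb(m.HeapReleased), delta(m.HeapReleased, prev.HeapReleased))
		fmt.Printf("SumHeap: %d (%d MB) %s\n", sum, mb(sum), delta(sum, prevSum))
		fmt.Printf("NextGCTarget: %d (%d MB) %s\n", m.NextGC, mb(m.NextGC), delta(m.NextGC, prev.NextGC))
		fmt.Printf("Allocs: %d\n", m.TotalAlloc) // cumulative bytes allocated (assumed mapping)
		prev = m
		time.Sleep(2 * time.Minute)
	}
}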