Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-53106

[BP 6.6.x] - XMEM will leak memory if pipeline is paused with full buffer

    XMLWordPrintable

Details

    • Untriaged
    • 1
    • No

    Description

      If we look at the code a bit deeper, we have final clean up here: http://src.couchbase.org/source/xref/6.6.2/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go#1203-1204
      It is waiting for the processData_sendbatch to exit. But, that go-routine is blocked here: http://src.couchbase.org/source/xref/6.6.2/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go#1266-1267
      The enSlot is stuck here: http://src.couchbase.org/source/xref/6.6.2/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go#325 with no way out.

      In an environment where the target cluster is consistently out of memory (ENOMEM) or busy (TMPFAIL), and potentially one or more XMEM of a replication is unable to replicate, this can happen.
      So XMEM will eventually fill up the buffer, with no way to drain it as there is no exit path unless the buffer returns an empty slot on a channel. When the pipelines are stopped, the go-routines are stuck and the buffer are never released. This is shown via the increasing number of the rogue go-routines shown here.

      This seems to be occurring if we pause and resume pipelines when no document can be written at all. Each time we pause, we leak memory as the clean up process gets stuck and buffer is not cleaned as a result.

       
      Reproduction
      To reproduce this locally, I have to made a few changes to my local build. This includes:

      1. As soon as XMEM receives the response for a SET_META, the code overrides the response code to ENOMEM
      2. I have to disable the Warning of “Received temporary error”… to ensure the CPU doesn’t waste time printing messages to the log.
      3. I have to disable the flowControl to shorten the amount of time it takes to fill up the buffer (Note for XDCR team: even with flow control, it won’t prevent buffer fill up because in a customer’s environment, the trickling of resendWithReset and resendWithTimeout would have filled up the buffer anyway. Not disabling it will just make it harder to reproduce)
      4. I have to disable the backoffFactor, so that XMEM will fill the buffer up even amidst a flurry of (falsely injected) ENOMEM and not backoff/slowdown the send rate.

      Then, I created 5 replications. Let them run for 15 seconds, pause them for 15 seconds, and repeat.

      As expected, we see enSlot being stuck and leaking goroutines (only relevant parts shown):

      xdcr.20:42:08

      goroutine profile: total 230
      26 @ 0x403bbf6 0x400a9cc 0x400a3f8 0x450912b 0x450ec85 0x4510248 0x450e098 0x450dcac 0x406bbe1
      #       0x450912a       github.com/couchbase/goxdcr/parts.(*requestBuffer).enSlot+0x8a                  /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go:361
      #       0x450ec84       github.com/couchbase/goxdcr/parts.(*XmemNozzle).batchSetMetaWithRetry+0x6e4     /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go:1193
      #       0x4510247       github.com/couchbase/goxdcr/parts.(*XmemNozzle).sendSetMeta_internal+0x47       /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go:1436
      #       0x450e097       github.com/couchbase/goxdcr/parts.(*XmemNozzle).processBatch+0x37               /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go:1106
      #       0x450dcab       github.com/couchbase/goxdcr/parts.(*XmemNozzle).processData_sendbatch+0x48b     /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go:1061
       
      26 @ 0x403bbf6 0x404c5f3 0x404c5cd 0x4067645 0x4083d32 0x450e25f 0x406bbe1
      #       0x4067644       sync.runtime_Semacquire+0x24                                            /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.18.1/go/src/runtime/sema.go:56
      #       0x4083d31       sync.(*WaitGroup).Wait+0x51                                             /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.18.1/go/src/sync/waitgroup.go:136
      #       0x450e25e       github.com/couchbase/goxdcr/parts.(*XmemNozzle).finalCleanup+0x3e       /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go:1122
      

      5 Minutes later:

      xdcr.20:47:38

      goroutine profile: total 248
      36 @ 0x403bbf6 0x400a9cc 0x400a3f8 0x450912b 0x450ec85 0x4510248 0x450e098 0x450dcac 0x406bbe1
      #       0x450912a       github.com/couchbase/goxdcr/parts.(*requestBuffer).enSlot+0x8a                  /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go:361
      #       0x450ec84       github.com/couchbase/goxdcr/parts.(*XmemNozzle).batchSetMetaWithRetry+0x6e4     /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go:1193
      #       0x4510247       github.com/couchbase/goxdcr/parts.(*XmemNozzle).sendSetMeta_internal+0x47       /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go:1436
      #       0x450e097       github.com/couchbase/goxdcr/parts.(*XmemNozzle).processBatch+0x37               /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go:1106
      #       0x450dcab       github.com/couchbase/goxdcr/parts.(*XmemNozzle).processData_sendbatch+0x48b     /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go:1061
       
      36 @ 0x403bbf6 0x404c5f3 0x404c5cd 0x4067645 0x4083d32 0x450e25f 0x406bbe1
      #       0x4067644       sync.runtime_Semacquire+0x24                                            /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.18.1/go/src/runtime/sema.go:56
      #       0x4083d31       sync.(*WaitGroup).Wait+0x51                                             /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.18.1/go/src/sync/waitgroup.go:136
      #       0x450e25e       github.com/couchbase/goxdcr/parts.(*XmemNozzle).finalCleanup+0x3e       /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go:1122
      

      Running the parseMemStat parser script I wrote, we see evident memory leak:

      neil.huang@NeilsMacbookPro:~/source/couchbase/ns_server$ cat logs/n_0/goxdcr.log | parseMemStats
      ----------------------
      Date: 2022-07-21T20:39:07.508
      HeapInUse: 207405056 (197 MB) Increased 197 MB
      HeapIdle: 89047040 (84 MB) Increased 84 MB
      HeapReleased: 3670016 (3 MB) Increased 3 MB
      SumHeap: 292782080 (279 MB) Increased 279 MB
      NextGCTarget: 277958976 (265 MB) Increased 265 MB
      Allocs: 4328427792
      ----------------------
      Date: 2022-07-21T20:41:07.541
      HeapInUse: 301580288 (287 MB) Increased 89 MB
      HeapIdle: 195706880 (186 MB) Increased 101 MB
      HeapReleased: 37617664 (35 MB) Increased 32 MB
      SumHeap: 459669504 (438 MB) Increased 159 MB
      NextGCTarget: 476099664 (454 MB) Increased 188 MB
      Allocs: 15179370856
      ----------------------
      Date: 2022-07-21T20:43:07.542
      HeapInUse: 465944576 (444 MB) Increased 156 MB
      HeapIdle: 246169600 (234 MB) Increased 48 MB
      HeapReleased: 524288 (0 MB) Decreased 35 MB
      SumHeap: 711589888 (678 MB) Increased 240 MB
      NextGCTarget: 683744976 (652 MB) Increased 198 MB
      Allocs: 28068110944
      ----------------------
      Date: 2022-07-21T20:45:07.542
      HeapInUse: 488390656 (465 MB) Increased 21 MB
      HeapIdle: 325730304 (310 MB) Increased 75 MB
      HeapReleased: 51052544 (48 MB) Increased 48 MB
      SumHeap: 763068416 (727 MB) Increased 49 MB
      NextGCTarget: 708899072 (676 MB) Increased 23 MB
      Allocs: 38806413256
      ----------------------
      Date: 2022-07-21T20:47:07.544
      HeapInUse: 848830464 (809 MB) Increased 343 MB
      HeapIdle: 32432128 (30 MB) Decreased 279 MB
      HeapReleased: 1572864 (1 MB) Decreased 47 MB
      SumHeap: 879689728 (838 MB) Increased 111 MB
      NextGCTarget: 851613248 (812 MB) Increased 136 MB
      Allocs: 49594392688
      ----------------------
      

       

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            Build couchbase-server-6.6.5-10101 contains goxdcr commit 57208a2 with commit message:
            MB-53106: ensure enSlot() has a exit path if blocked

            build-team Couchbase Build Team added a comment - Build couchbase-server-6.6.5-10101 contains goxdcr commit 57208a2 with commit message: MB-53106 : ensure enSlot() has a exit path if blocked
            ritam.sharma Ritam Sharma added a comment -

            Pavithra Mahamani - Please validate the issue.

            ritam.sharma Ritam Sharma added a comment - Pavithra Mahamani - Please validate the issue.

            Verified on 6.6.5-10101

            pavithra.mahamani Pavithra Mahamani (Inactive) added a comment - Verified on 6.6.5-10101

            People

              pavithra.mahamani Pavithra Mahamani (Inactive)
              neil.huang Neil Huang
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty