Couchbase Server / MB-47816

Prometheus leaks memory and gets OOM killed when decimation is active


Details


    Description

      On a test installation in AWS, Prometheus consumed ~25 GB of memory and was OOM killed:

      level=info ts=2021-07-15T10:18:46.826Z caller=compact.go:494 component=tsdb msg="write block" mint=1626307200000 maxt=1626328800000 ulid=01FAMTT0VNVMV2149RSNW0M4GW duration=821.47751ms
      level=info ts=2021-07-15T10:18:47.407Z caller=compact.go:494 component=tsdb msg="write block" mint=1626328800000 maxt=1626336000000 ulid=01FAMTT1NAT3G7FPGE729R1PBR duration=580.965225ms
      level=info ts=2021-07-15T10:18:47.416Z caller=db.go:1152 component=tsdb msg="Deleting obsolete block" block=01FAMTR51NKY7WPP56W2201MMV
      level=info ts=2021-07-15T10:18:47.420Z caller=db.go:1152 component=tsdb msg="Deleting obsolete block" block=01FAMTR5VBSYVN4G5XEWFRXZPZ
      level=info ts=2021-07-15T10:18:47.436Z caller=db.go:1152 component=tsdb msg="Deleting obsolete block" block=01FAMTR3GW7WWJGKNF37FDXK60
      fatal error: runtime: out of memory
      runtime stack:
      runtime.throw(0x28d6767, 0x16)
          /home/couchbase/jenkins/workspace/cbdeps-platform-build/deps/go1.14.2/src/runtime/panic.go:1116 +0x72
      runtime.sysMap(0xc6cc000000, 0x4000000, 0x45aa2d8)
          /home/couchbase/jenkins/workspace/cbdeps-platform-build/deps/go1.14.2/src/runtime/mem_linux.go:169 +0xc5
      runtime.(*mheap).sysAlloc(0x45954a0, 0x400000, 0x45954a8, 0xb9)
          /home/couchbase/jenkins/workspace/cbdeps-platform-build/deps/go1.14.2/src/runtime/malloc.go:715 +0x1cd
      runtime.(*mheap).grow(0x45954a0, 0xb9, 0x0)
          /home/couchbase/jenkins/workspace/cbdeps-platform-build/deps/go1.14.2/src/runtime/mheap.go:1286 +0x11c
      runtime.(*mheap).allocSpan(0x45954a0, 0xb9, 0xfc10100, 0x45aa2e8, 0xc004f6bf28)
      

      Logs: https://s3.amazonaws.com/cb-engineering/stevewatanabe-19JUL21-AWS/collectinfo-2021-07-19T165101-ns_1%40127.0.0.1.zip
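      For reference, here is a minimal sketch of the kind of pruning/decimation operation involved, assuming the stats decimator drives Prometheus's standard TSDB Admin API (delete_series followed by clean_tombstones, available when Prometheus runs with --web.enable-admin-api). The actual ns_server code path is not shown in this ticket, so the selector, time window, and endpoint usage below are illustrative only.

// Illustrative only: prune a time window of samples from Prometheus via the
// TSDB Admin API, then clean the resulting tombstones. The endpoints are the
// standard ones exposed with --web.enable-admin-api; the selector and window
// are made-up examples, not the ns_server decimation settings.
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/url"
	"time"
)

func pruneWindow(base, matcher string, start, end time.Time) error {
	// Mark matching series in [start, end] as deleted (writes tombstones).
	q := url.Values{}
	q.Set("match[]", matcher)
	q.Set("start", start.UTC().Format(time.RFC3339))
	q.Set("end", end.UTC().Format(time.RFC3339))
	resp, err := http.Post(base+"/api/v1/admin/tsdb/delete_series?"+q.Encode(), "", nil)
	if err != nil {
		return err
	}
	resp.Body.Close()
	if resp.StatusCode != http.StatusNoContent {
		return fmt.Errorf("delete_series: unexpected status %s", resp.Status)
	}

	// Rewrite blocks to physically drop the tombstoned data. This is the step
	// the MB-47816 commit makes configurable ("Make cleaning stats tombstones
	// configurable"), since it forces block rewrites inside Prometheus.
	resp, err = http.Post(base+"/api/v1/admin/tsdb/clean_tombstones", "", nil)
	if err != nil {
		return err
	}
	resp.Body.Close()
	if resp.StatusCode != http.StatusNoContent {
		return fmt.Errorf("clean_tombstones: unexpected status %s", resp.Status)
	}
	return nil
}

func main() {
	// Hypothetical local Prometheus instance and selector.
	err := pruneWindow("http://127.0.0.1:9090", `{__name__=~"kv_.*"}`,
		time.Now().Add(-8*24*time.Hour), time.Now().Add(-7*24*time.Hour))
	if err != nil {
		log.Fatal(err)
	}
}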


          Activity

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.2-6561 contains ns_server commit 4320c90 with commit message:
            MB-47816: Make cleaning stats tombstones configurable

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-1182 contains ns_server commit 4320c90 with commit message:
            MB-47816: Make cleaning stats tombstones configurable

            wayne Wayne Siu added a comment -

            Timofey Barmin
            Is this ticket done, or are more changes expected? Thanks.


            meni.hillel Meni Hillel (Inactive) added a comment - Moving ticket to Neo for now. Here is the current state of things as it relates to this issue:

            1. Pruning is disabled, just as in 7.0.1. That's the safest option we have right now.
            2. With the KV stats optimization, we see that we can hold about a week's worth of stats, which is not all that bad.
            3. Patches for Prometheus to address resident memory growth look promising, but they have not been merged. We expect this to be finalized in the 7.1 timeframe.
            4. We've observed CPU spikes during decimation. We think they existed earlier and are not a regression introduced by the above patches. It is likely the same logic that needs to scan all measurements in blocks and selectively remove the measurements we want to decimate or prune. That logic probably behaved the same way earlier, but we might not have noticed the CPU spikes (we can test on 7.0.0 to be conclusive). We still need an investigation here to choose the right interval between pruning runs to minimize CPU impact. With 1 min we see lower CPU spikes (40%), but they are obviously more frequent. With 15 min we see higher spikes (120%), but they can be longer (did not measure). There is no real penalty if we have a longer delay between pruning, so we need to experiment to see what may be acceptable. We may even revisit the pruning logic to introduce some "artificial sleeps" to see if that can smooth out CPU consumption (see the sketch below).
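            As a rough illustration of the last point above, here is a hypothetical sketch of a throttled pruning loop with a configurable interval and "artificial sleeps" between chunks of work. The tunables, chunking, and function names are assumptions for illustration, not the actual ns_server implementation.

// Hypothetical sketch: spread pruning work out over time so the CPU cost of
// scanning and rewriting blocks arrives as many small spikes rather than one
// large one. The interval, chunk list, and sleep duration are made-up tunables.
package main

import (
	"log"
	"time"
)

type pruner struct {
	interval   time.Duration // how often a pruning pass runs (e.g. 1 min vs 15 min)
	chunkSleep time.Duration // "artificial sleep" between chunks of work
}

// prunableChunks would enumerate the measurement groups to decimate or prune;
// here it just returns placeholder names.
func prunableChunks() []string {
	return []string{"kv_stats", "index_stats", "query_stats"}
}

// pruneChunk stands in for the real work of deleting series and cleaning
// tombstones for one group of measurements.
func pruneChunk(name string) error {
	log.Printf("pruning %s", name)
	return nil
}

func (p *pruner) run() {
	for {
		start := time.Now()
		for _, chunk := range prunableChunks() {
			if err := pruneChunk(chunk); err != nil {
				log.Printf("prune %s failed: %v", chunk, err)
			}
			// Yield between chunks so a single pass doesn't pin the CPU.
			time.Sleep(p.chunkSleep)
		}
		log.Printf("pruning pass took %s", time.Since(start))
		time.Sleep(p.interval)
	}
}

func main() {
	// A short interval means each pass does less work (lower, more frequent
	// spikes); a long interval batches more work per pass (higher spikes).
	p := &pruner{interval: 15 * time.Minute, chunkSleep: 2 * time.Second}
	p.run()
}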

            People

              dfinlay Dave Finlay
              timofey.barmin Timofey Barmin
              Votes: 0
              Watchers: 5

