Couchbase Server
MB-47502

Disable stats decimation as a 7.0.1 workaround for a memory leak in Prometheus


    Description

      On a test installation in AWS, Prometheus consumed ~25 GB of memory and was OOM-killed:

      level=info ts=2021-07-15T10:18:46.826Z caller=compact.go:494 component=tsdb msg="write block" mint=1626307200000 maxt=1626328800000 ulid=01FAMTT0VNVMV2149RSNW0M4GW duration=821.47751ms
      level=info ts=2021-07-15T10:18:47.407Z caller=compact.go:494 component=tsdb msg="write block" mint=1626328800000 maxt=1626336000000 ulid=01FAMTT1NAT3G7FPGE729R1PBR duration=580.965225ms
      level=info ts=2021-07-15T10:18:47.416Z caller=db.go:1152 component=tsdb msg="Deleting obsolete block" block=01FAMTR51NKY7WPP56W2201MMV
      level=info ts=2021-07-15T10:18:47.420Z caller=db.go:1152 component=tsdb msg="Deleting obsolete block" block=01FAMTR5VBSYVN4G5XEWFRXZPZ
      level=info ts=2021-07-15T10:18:47.436Z caller=db.go:1152 component=tsdb msg="Deleting obsolete block" block=01FAMTR3GW7WWJGKNF37FDXK60
      fatal error: runtime: out of memory
      runtime stack:
      runtime.throw(0x28d6767, 0x16)
          /home/couchbase/jenkins/workspace/cbdeps-platform-build/deps/go1.14.2/src/runtime/panic.go:1116 +0x72
      runtime.sysMap(0xc6cc000000, 0x4000000, 0x45aa2d8)
          /home/couchbase/jenkins/workspace/cbdeps-platform-build/deps/go1.14.2/src/runtime/mem_linux.go:169 +0xc5
      runtime.(*mheap).sysAlloc(0x45954a0, 0x400000, 0x45954a8, 0xb9)
          /home/couchbase/jenkins/workspace/cbdeps-platform-build/deps/go1.14.2/src/runtime/malloc.go:715 +0x1cd
      runtime.(*mheap).grow(0x45954a0, 0xb9, 0x0)
          /home/couchbase/jenkins/workspace/cbdeps-platform-build/deps/go1.14.2/src/runtime/mheap.go:1286 +0x11c
      runtime.(*mheap).allocSpan(0x45954a0, 0xb9, 0xfc10100, 0x45aa2e8, 0xc004f6bf28)
      

      Logs: https://s3.amazonaws.com/cb-engineering/stevewatanabe-19JUL21-AWS/collectinfo-2021-07-19T165101-ns_1%40127.0.0.1.zip
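      For reference, the size of the allocation that finally failed is visible in the trace: in the Go 1.14 runtime, the second argument to runtime.sysMap is the number of bytes being requested from the OS. A quick decode of the value from the trace above (the interpretation of the argument order is based on the Go 1.14 runtime source):

      ```python
      # The failing frame was: runtime.sysMap(0xc6cc000000, 0x4000000, ...)
      # In Go 1.14, sysMap(v, n, sysStat) maps n bytes at address v, so the
      # runtime was asking the OS for one more 64 MiB heap arena when the
      # process, already at ~25 GB, was killed.
      request_bytes = 0x4000000
      print(request_bytes // (1 << 20), "MiB")  # 64 MiB
      ```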


          Activity

            Considering the originally reported issue, where memory grew to very high levels (~25 GB) after a short duration, disabling pruning appears to have contributed to a significant improvement. Customers upgrading from 6.6, or doing an initial evaluation of 7.x, should see much better resource consumption relative to 6.6, where stats caused very high memory consumption.

            This is not to say that we are satisfied with the current state. We'll continue to dig in and eliminate Prometheus's continual memory growth to ensure it is capped. At this point the issue has been clearly identified and acknowledged by Prometheus engineering; they have been working on it and have provided a few patches, but have not merged them. We have experimented with these patches and determined they are still insufficient.

            Given the above status, and per the maintenance meeting, we recommend not introducing any delay to 7.0.1 and releasing as scheduled (first week of September).

            meni.hillel Meni Hillel (Inactive) added a comment

            There is a question about virtual memory consumption on GitHub: https://github.com/prometheus/prometheus/issues/5295
            I think we should look at resident memory instead. It seems like the retention policy started working today, so I hope it will plateau soon.

            timofey.barmin Timofey Barmin added a comment

            Actually, I was wrong: retention.size has not started working there yet. The total stats_data size is 1.3 GB and blocks are ~100 MB each, so it should already have removed 3 blocks, but that is not happening. On another server we see similar behavior with a 2 GB total size. There is a bug about that, and it seems this is expected behavior. I remember seeing it work before, though. I think we can only wait; it seems it should start removing old blocks.

            timofey.barmin Timofey Barmin added a comment
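            The "3 blocks" arithmetic above can be sketched as follows. Prometheus's size-based retention deletes blocks oldest-first until the TSDB fits under --storage.tsdb.retention.size. The 1 GiB limit below is an assumption (the actual configured value isn't stated in this ticket); it is chosen because it makes the expected 3-block deletion work out against the ~1.3 GB / ~100 MB figures in the comment:

            ```python
            # Sketch of size-based retention: drop oldest blocks until the
            # total fits under the retention limit. Sizes in MB, oldest first.
            def blocks_to_delete(block_sizes_mb, retention_limit_mb):
                """Return how many of the oldest blocks must go to fit the limit."""
                total = sum(block_sizes_mb)
                deleted = 0
                for size in block_sizes_mb:  # iterate oldest first
                    if total <= retention_limit_mb:
                        break
                    total -= size
                    deleted += 1
                return deleted

            blocks = [100] * 13  # ~1.3 GB of ~100 MB blocks, as in the comment
            print(blocks_to_delete(blocks, 1024))  # 3, matching "should have removed 3 blocks"
            ```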

            We've reviewed the current state of resident memory and we are satisfied with it. Virtual memory still grows because the TSDB files are mapped into the address space, and that should be OK. Resident memory is the one we are most concerned about, as high consumption would end with Prometheus being OOM-killed, or cause the OS to refuse memory allocations to other processes. We are good to go for 7.0.1.

            meni.hillel Meni Hillel (Inactive) added a comment
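            The resident-vs-virtual distinction the comment relies on can be read directly from /proc/&lt;pid&gt;/status on Linux: VmSize is the virtual address space (inflated by mmap'd TSDB blocks), while VmRSS is what actually occupies RAM and what the OOM killer cares about. A minimal parsing sketch; the field names are real, but the sample values are illustrative, not from this ticket's logs:

            ```python
            # Parse VmSize/VmRSS (in kB) out of /proc/<pid>/status-style text.
            def parse_status(text):
                fields = {}
                for line in text.splitlines():
                    key, _, rest = line.partition(":")
                    if key in ("VmSize", "VmRSS"):
                        fields[key] = int(rest.split()[0])  # value is in kB
                return fields

            # Illustrative sample: huge virtual size from mmap'd blocks,
            # bounded resident set.
            sample = """Name:\tprometheus
            VmSize:\t26214400 kB
            VmRSS:\t1048576 kB"""

            mem = parse_status(sample)
            print(mem["VmSize"] // mem["VmRSS"])  # 25: virtual is 25x resident here
            ```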

            Closing this based on the above comments.

            Balakumaran.Gopal Balakumaran Gopal added a comment

            People

              ritam.sharma Ritam Sharma
              timofey.barmin Timofey Barmin