Couchbase Server / MB-50076

[bug] Prometheus metrics using excess RAM from node

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Not a Bug
    • Affects versions: 7.0.0, 7.0.1, 7.0.2
    • None
    • Component: ns_server
    • Environment: OS - Debian GNU/Linux 10
      Couchbase Server 7.0.0-5302 (CE)
    • Ubuntu 64-bit
    • Impediment
    • 1
    • Unknown

    Description

      We recently noticed excessive resource consumption by some subsystems within Couchbase.

      As the attached images show, the Prometheus process is consuming a lot of RAM and CPU; it is the process that consumes the most resources on the virtual machine.

      After some research, I found that the Prometheus bundled with Couchbase 7.0 is version 2.22.0 (branch: HEAD, revision: a6239a377d49104ac7253a99aef8feb8dee0a7c2).

      There are bug reports describing high resource consumption and limit parameters not being respected, for example: https://github.com/prometheus/prometheus/issues/9744

      Our first approach, as that issue suggests, was to update to version 2.22.1, where the bug is fixed. However, Couchbase ships a custom build of Prometheus, and the Couchbase parent process starts it with a custom flag. Replacing the binary with a different Prometheus version produces the following error:

      Error parsing commandline arguments: unknown long flag '--storage.tsdb.no-lockfile'
      prometheus: error: unknown long flag '--storage.tsdb.no-lockfile'

      The Prometheus build shipped with Couchbase differs from the release in the official Prometheus repository:

      prometheus, version 2.22.0 (branch: HEAD, revision: a6239a377d49104ac7253a99aef8feb8dee0a7c2) is the custom Couchbase build.

      prometheus, version 2.22.0 (branch: HEAD, revision: 0a7fdd3b76960808c3a91d92267c3d815c1bc354) is the upstream Prometheus release of the same version, without the custom flags.
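
The two version lines above carry the same version number but different revisions. As a minimal sketch of how such lines could be compared programmatically, the snippet below parses the version, branch, and revision from each line. The names `BuildInfo`, `parse_build_info`, and `same_build` are hypothetical helpers introduced for illustration; only the two input strings come from this report.

```python
import re
from typing import NamedTuple, Optional


class BuildInfo(NamedTuple):
    """Fields extracted from a Prometheus version line (illustrative helper)."""
    version: str
    branch: str
    revision: str


# Matches lines of the form quoted in this report:
#   prometheus, version X (branch: Y, revision: Z)
_VERSION_RE = re.compile(
    r"prometheus, version (?P<version>\S+) "
    r"\(branch: (?P<branch>[^,]+), revision: (?P<revision>[0-9a-f]+)\)"
)


def parse_build_info(line: str) -> Optional[BuildInfo]:
    """Return the parsed build info, or None if the line does not match."""
    m = _VERSION_RE.search(line)
    if m is None:
        return None
    return BuildInfo(m["version"], m["branch"], m["revision"])


def same_build(a: str, b: str) -> bool:
    """True only if both lines parse and agree on version, branch, and revision."""
    info_a, info_b = parse_build_info(a), parse_build_info(b)
    return info_a is not None and info_a == info_b


# The two version lines quoted in the description.
couchbase_line = (
    "prometheus, version 2.22.0 (branch: HEAD, "
    "revision: a6239a377d49104ac7253a99aef8feb8dee0a7c2)"
)
upstream_line = (
    "prometheus, version 2.22.0 (branch: HEAD, "
    "revision: 0a7fdd3b76960808c3a91d92267c3d815c1bc354)"
)

if __name__ == "__main__":
    print(parse_build_info(couchbase_line))
    print(parse_build_info(upstream_line))
    print("same build:", same_build(couchbase_line, upstream_line))
```

Comparing the revision rather than only the version number is what distinguishes the two builds here, since both report 2.22.0.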

      The workaround we found is to remove the Prometheus binary and restart the child process. The Prometheus binary then no longer loads and no longer overloads the cluster, but we lose all visibility into the query, index, and cluster activity that is important for operations.

      Also attached is a screenshot of the process monitors with the high levels of RAM and CPU that the process consumes over time, causing unavailability in our environment.

        Activity

          meni.hillel Meni Hillel (Inactive) added a comment:

          Diego Frazatto Pedroso, can you provide the specific numbers along with logs? Once provided, if you also happen to have the VM accessible to us, please provide credentials and we'll investigate.
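
One way to gather the kind of specific memory numbers requested above, sketched under the assumption of a Linux host where `/proc/<pid>/status` is readable: the snippet extracts the resident set size (the `VmRSS` field) for a given process ID. The helper names and the sample text are hypothetical and are not taken from the reporter's system.

```python
import re
from pathlib import Path
from typing import Optional


def parse_vm_rss_kib(status_text: str) -> Optional[int]:
    """Extract the VmRSS value (in kB, as reported by the kernel) from
    the contents of /proc/<pid>/status. Returns None if the field is absent."""
    m = re.search(r"^VmRSS:\s+(\d+)\s+kB$", status_text, re.MULTILINE)
    return int(m.group(1)) if m else None


def rss_kib_for_pid(pid: int) -> Optional[int]:
    """Read /proc/<pid>/status (Linux only) and return the process RSS in kB."""
    return parse_vm_rss_kib(Path(f"/proc/{pid}/status").read_text())


# Made-up sample in the /proc status format, for illustration only.
sample_status = (
    "Name:\tprometheus\n"
    "VmPeak:\t 2000000 kB\n"
    "VmRSS:\t 1536000 kB\n"
    "Threads:\t12\n"
)

if __name__ == "__main__":
    print("sample RSS (kB):", parse_vm_rss_kib(sample_status))
```

Calling `rss_kib_for_pid` periodically with the Prometheus process ID and logging the result with a timestamp would give a memory-over-time series to attach alongside the logs.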

          pedrosodiego Diego Frazatto Pedroso added a comment:

          Hello Meni Hillel, thank you for the reply.

          Our cluster runs in a private VPC and cannot be accessed from the Internet. I think this behavior has been identified before; there are many issues on GitHub describing the same behavior.

          I note that the fix has already been merged into the master branch on GitHub in the Couchbase Community repo.

          We found a workaround that keeps Prometheus from consuming so much RAM and CPU, and we'll wait for the new release (CE 7.0.3).

          Best Regards

          Diego

          meni.hillel Meni Hillel (Inactive) added a comment:

          Diego Frazatto Pedroso, yes, we've upgraded the Prometheus version in master, which addresses memory and storage leaks. At the same time, the memory leak is triggered by our "decimation logic", which basically removes older measurements. Prometheus does not cope with this very well, so we disabled the "decimation logic" in 7.0.2 and higher. We did not observe memory leaks after disabling it. If you are still seeing what seems to be a leak, please attach the latest logs from the 7.0.2 system you are using.

          pedrosodiego Diego Frazatto Pedroso added a comment:

          Hi Meni Hillel, thank you for the reply. Yesterday we upgraded to the latest CE version, and Prometheus resource usage decreased considerably.

          We'll keep monitoring all Couchbase processes, but I believe the issue is fixed.

          Thanks a lot

          Best Regards!

          ashwin.govindarajulu Ashwin Govindarajulu added a comment:

          Closing since this is not a bug.

          People

            guilherme.saueressig Guilherme Saueressig
            pedrosodiego Diego Frazatto Pedroso

              Time Tracking

                Original Estimate: 32h
                Remaining Estimate: 32h
                Time Spent: Not Specified
