  1. Couchbase Server
  2. MB-44510

Prometheus doesn't handle suspend-resume correctly

Details

    • Untriaged
    • 1
    • Unknown

    Description

      It looks like when the system time changes suddenly under Prometheus (for example after an OS suspend-resume or a laptop sleep-wake cycle), it starts reporting "out of bounds" errors for all targets:

      level=debug ts=2021-02-17T16:58:00.252Z caller=scrape.go:1127 component="scrape manager" scrape_pool=general target=http://127.0.0.1:8094/_prometheusMetrics msg="Append failed" err="out of bounds"
      level=warn ts=2021-02-17T16:58:00.252Z caller=scrape.go:1133 component="scrape manager" scrape_pool=general target=http://127.0.0.1:8094/_prometheusMetrics msg="Append failed" err="out of bounds"
      level=warn ts=2021-02-17T16:58:00.252Z caller=scrape.go:1082 component="scrape manager" scrape_pool=general target=http://127.0.0.1:8094/_prometheusMetrics msg="Appending scrape report failed" err="out of bounds"
      level=debug ts=2021-02-17T16:58:01.779Z caller=scrape.go:1412 component="scrape manager" scrape_pool=general target=http://127.0.0.1:8091/_prometheusMetrics msg="Out of bounds metric" series="audit_queue_length{category=\"audit\"}"
      level=debug ts=2021-02-17T16:58:01.779Z caller=scrape.go:1412 component="scrape manager" scrape_pool=general target=http://127.0.0.1:8091/_prometheusMetrics msg="Out of bounds metric" series="audit_unsuccessful_retries{category=\"audit\"}"

      In my case, not every scrape fails; roughly every other one does.

      This definitely leaves gaps in the stats data and possibly increases CPU load (the latter is my guess).

      There is a GitHub issue for this: https://github.com/prometheus/prometheus/issues/8243

      The Prometheus team is very unlikely to fix it, so we should probably handle it ourselves.

      The most obvious fix is to detect time changes and restart the Prometheus process. That should at least help with forward time jumps.
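
A minimal sketch of the detection idea, in Python purely for illustration (it is not the actual fix and not Couchbase code). It assumes that a time jump can be spotted by comparing how far the wall clock moved against how far the monotonic clock moved over the same interval, and that the monotonic clock does not advance during suspend (true for CLOCK_MONOTONIC on Linux; other platforms may differ). The names clock_drift, is_time_jump, and watch, the 5-second threshold, and the on_jump callback are all hypothetical; restarting Prometheus is left to the caller.

```python
import time
from typing import Callable, Optional


def clock_drift(prev_wall: float, prev_mono: float,
                now_wall: float, now_mono: float) -> float:
    """Seconds the wall clock moved beyond what the monotonic clock saw.

    Positive means a forward jump (e.g. resume after suspend),
    negative means the wall clock was set backwards.
    """
    return (now_wall - prev_wall) - (now_mono - prev_mono)


def is_time_jump(drift: float, threshold: float = 5.0) -> bool:
    """True when the drift is too large to be ordinary clock adjustment.

    The 5-second default is an arbitrary assumption for this sketch.
    """
    return abs(drift) > threshold


def watch(on_jump: Callable[[float], None],
          interval: float = 1.0,
          threshold: float = 5.0,
          max_iterations: Optional[int] = None) -> None:
    """Poll both clocks and call on_jump(drift) when they disagree."""
    prev_wall, prev_mono = time.time(), time.monotonic()
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        time.sleep(interval)
        now_wall, now_mono = time.time(), time.monotonic()
        drift = clock_drift(prev_wall, prev_mono, now_wall, now_mono)
        if is_time_jump(drift, threshold):
            # Hypothetical hook: the caller would restart Prometheus here.
            on_jump(drift)
        prev_wall, prev_mono = now_wall, now_mono
        iterations += 1
```

A caller could run watch(restart_prometheus) in a background thread, where restart_prometheus is whatever the supervising process uses to bounce Prometheus. Checking the absolute drift means backward jumps are detected too, although the description above only expects a restart to help with forward ones.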

       

      Possibly related: https://github.com/golang/go/issues/35012

            People

              timofey.barmin Timofey Barmin