  1. Couchbase Monitoring and Observability Stack
  2. CMOS-17

Consolidated set of Grafana Dashboards


Details

    • Epic
    • Status: Done
    • Critical
    • Resolution: Done
    • None
    • 0.1
    • cmos
    • None
    • Dashboards

    Description

      We need to provide customers with an out-of-box dashboard for Grafana.

      Priority | High Level Requirements for Out-Of-Box Grafana Dashboard | Comments
      P0 We should first add official out-of-box dashboard support for the Couchbase Server 7.x series, then gather feedback, review, and update.
      P1 We need to support the Couchbase Server 6.x series until 6.x reaches EOL. Provide separate repositories for 6.x with official sample dashboards. Going forward, we should only support the Prometheus endpoint shipped with Couchbase Server 7 and above.
      P1 The primary endpoint for the official sample dashboard will be the Prometheus endpoint shipped with Couchbase Server 7.0; however, for backward compatibility with Couchbase Server 6.x, we should leverage the Operator-led and PS-led exporters.
      Comments: Official dashboards
      – Couchbase Server 6.x
      – Couchbase Server 7.x
      P0 Dashboard should be organized in hierarchical order:
      • Universe of Couchbase - Provides a high-level snapshot of health across multiple Couchbase clusters deployed (configured)
        • Cluster names and their health status
      • Couchbase cluster - Provides a high-level snapshot of entire cluster-level health
        • Individual node health
        • Individual services health (Data, Query, Index, Search, Analytics, Eventing, XDCR, Backup)
      • System health (CPU, Memory, Disk, Network)
        • System Information
        • Breakdown of CPU % by Couchbase Process
        • Breakdown of Memory by Couchbase Server Process
        • Network / Disk Utilization by Couchbase Server Process
      • Couchbase services
        • Data - Bucket overview/ops, clients connected, users connected, response times/mctiming, resource configuration info, and [7.0 and above] scopes overview and count of collections
        • Query - Overall query performance, slowest queries, most common queries, most impactful queries, prepared statements and their performance
        • Index - Overall index performance, scan times, number of requests/sec, avg. item size, item count, indexes never scanned, index disk size
        • Search - Overall search performance, scan times, number of requests/sec, avg. item size, item count, indexes never scanned, index disk size
        • Eventing - Overall eventing performance, processing metrics, error counts/rates
        • XDCR - Aggregated metrics across all replications, source, and destination replication names (not UUID), individual replication with filter, % complete, replication state, docs replicated, replication backlog, network bandwidth used, changes left
      Some system metrics, such as the breakdown of CPU % by process and network utilization, will require node-exporter to be installed to gather them.
      P0  Core Health Screening:
      • The Universe of Couchbase - Display a high-level aggregated view of each Couchbase cluster's health screening result ('healthy' or 'unhealthy').
      • Display a panel view listing health checks with appropriate descriptions and 'pass or fail' indicators at each level of the hierarchy: Couchbase cluster -> node -> Couchbase services
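
The roll-up described in the health screening requirement can be sketched as follows. This is a minimal illustration only, assuming a cluster counts as 'healthy' only when every check beneath it passes (the ticket does not yet define healthy/unhealthy). The `CheckResult` type, the `rollup` helper, and the cluster, node, and check names are all hypothetical and do not come from CMOS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    """One health check outcome. All field names are illustrative."""
    cluster: str
    node: str
    service: str
    name: str
    passed: bool


def rollup(results, level):
    """Group results by hierarchy level ("cluster", "node" or "service").

    A group is reported as "healthy" only if every check in it passed.
    This all-must-pass rule is an assumption, not a CMOS definition.
    """
    key_funcs = {
        "cluster": lambda r: (r.cluster,),
        "node": lambda r: (r.cluster, r.node),
        "service": lambda r: (r.cluster, r.node, r.service),
    }
    key_func = key_funcs[level]
    status = {}
    for result in results:
        key = key_func(result)
        status[key] = status.get(key, True) and result.passed
    return {k: "healthy" if ok else "unhealthy" for k, ok in status.items()}


# Hypothetical sample data for two clusters.
results = [
    CheckResult("prod-east", "node1", "data", "example_check_a", True),
    CheckResult("prod-east", "node2", "index", "example_check_b", False),
    CheckResult("prod-west", "node1", "data", "example_check_a", True),
]

universe_view = rollup(results, "cluster")  # one entry per cluster
node_view = rollup(results, "node")         # one entry per (cluster, node)
```

The same function serves the Universe, cluster, and node panels by changing `level`, which keeps the pass/fail semantics identical at every tier.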
       
      P1 Dashboard variables: 
      • Support variables for each dashboard; it is important to define consistent variable names across all dashboards. Variables should align with metric labels and allow for filtering of results.
      • Variables can also allow for multiple selections, as well as “all.”
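
One way to keep variable names consistent is to generate the `templating` section of every dashboard from a single shared list. The sketch below is an illustration under assumptions: the label names (`cluster`, `node`, `service`) are placeholders for whatever labels the metrics actually carry, and `up` is used as the source metric only because Prometheus records it for every scrape target. The `make_variable` and `templating_block` helpers are hypothetical.

```python
import json

# Assumed label names; the real ones must match the labels on the metrics.
VARIABLE_LABELS = ["cluster", "node", "service"]


def make_variable(label, metric="up"):
    """Build one Grafana query variable that lists the values of `label`."""
    return {
        "name": label,
        "label": label.capitalize(),
        "type": "query",
        # label_values() is the Prometheus data source's variable query.
        "query": f"label_values({metric}, {label})",
        "multi": True,       # allow multiple selections
        "includeAll": True,  # offer the "All" option
    }


def templating_block(labels=VARIABLE_LABELS):
    """Return the `templating` section to merge into a dashboard's JSON."""
    return {"templating": {"list": [make_variable(name) for name in labels]}}


print(json.dumps(templating_block(), indent=2))
```

Panels would then filter on the selections with a matcher such as `{cluster=~"$cluster"}`.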
       
      P1 Labeling:
      • Metrics should be labeled uniformly and indicate their source, including Cluster, Pod, Node, Service, etc. For example, all the data services metrics should get labeled with prefix data_.
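
A small lint check can enforce a labeling convention like this before dashboards depend on it. The sketch below is hypothetical: only the `data_` prefix comes from the requirement above; the other prefixes, the required label set, and the `check_metric` helper are assumptions made for illustration.

```python
# Only "data_" is taken from the requirement; the other prefixes are assumed.
SERVICE_PREFIXES = {
    "data": "data_",
    "query": "query_",
    "index": "index_",
}

# Assumed minimum set of source labels every metric should carry.
REQUIRED_LABELS = {"cluster", "node", "service"}


def check_metric(name, labels):
    """Return a list of convention problems for one metric (empty if none)."""
    problems = []
    missing = REQUIRED_LABELS - set(labels)
    if missing:
        problems.append(f"missing labels: {sorted(missing)}")
    prefix = SERVICE_PREFIXES.get(labels.get("service"))
    if prefix and not name.startswith(prefix):
        problems.append(f"expected prefix {prefix!r} on {name!r}")
    return problems


# A conforming metric and a non-conforming one (names are made up).
ok = check_metric("data_ops", {"cluster": "c1", "node": "n1", "service": "data"})
bad = check_metric("ops_total", {"service": "data"})
```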
       

       


          Activity

            ingenthr Matt Ingenthron added a comment:

            Hey Patrick Stephens, can you review the requirements just updated above? Then I would suggest we reassign this or break it down into tasks, targeting 7.x first. I know that earlier releases are also required, but as a first-order task, try to get a 7.x sample and iterate on that, then work back to CNPE (possibly configured for 7.x-style metrics, or just using its current approach), then PSPE.

            Other approaches can work too, but what I like about this one is that it gets the destination established and iterated on first, with the other pieces filled in as steps toward it as 7.x adoption increases.

            Of course, please bring up any questions with Anil Kumar.

            A couple of questions I have:

            • This calls out healthy/unhealthy specifically. It seems okay to start developing a dashboard with this, but per previous discussions some indicators are binary and some are subjective. Anil Kumar: do we have a definition of healthy/unhealthy? Any thoughts on how to get to that?
            • It's not quite clear to me where the health check output (available in CMOS as a Prometheus endpoint, but it needs to be configurable and visualized somewhere) would be. Anil Kumar, any thoughts on where this would fit in your tables?
            • This may be somewhere I'm not familiar with, but is there already a repository of screenshots or Grafana JSON definitions to use as previous examples?

            dhaikney David Haikney added a comment:

            Matt Ingenthron

            "is there a repository of screenshots or Grafana JSON definitions somewhere already as previous examples?"

            The linked requirements doc includes a prior-art section that links to the (many) varied JSON dashboard definitions we have.

            "It's not quite clear to me where the health check output would be"

            This can just feature as a regular panel on each of the hierarchical dashboards. For example, the "single cluster view" could have a simple aggregated visualisation that conveys "are all of the health checks passing?", with an appropriate "details" panel that highlights individual failures. The Universe dashboard would again contain an aggregation of these by cluster. There are some nice hexagonal visualisations that neatly convey whether a node / cluster is passing all of the necessary checks.

            anil Anil Kumar (Inactive) added a comment:

            Thanks David Haikney

            I have updated the table and added a couple of rows to clarify the above questions. Thanks!

            aaron.benton Aaron Benton (Inactive) added a comment:

            I know the requirements doc mentions all of the other exporters that are out there. I would think it should initially be focused on 7.0, as Matt Ingenthron mentioned. My concern with trying to make the pre-7.0 exporters work with CMOS is that those exporters produce point-in-time metrics rather than aggregate metrics like the majority of the metrics in 7.0, and all of the metric names have changed completely.

            Additionally, I have not done complete testing with the cbprometheus_python exporters, but I know that, at a minimum, the index metrics we were gathering from the /pools/default/buckets/@index-{BUCKET}/nodes/{NODE}:8091/stats endpoint in 6.6 and earlier now return a "200 OK" with an empty payload in 7.0, and at this time there are no plans to add support for that exporter in 7.0.

            If we're going to add support for any of the previous exporters in CMOS, I would say we should only "officially" support the Couchbase-exporter built by engineering for CAO; that is just my opinion. But I think it will be very difficult to support these older exporters, as all of the PromQL queries will be completely different.

            patrick.stephens Patrick Stephens (Inactive) added a comment:

            I think this is on you to review

            People

              anil Anil Kumar (Inactive)
              dhaikney David Haikney