  1. Couchbase Monitoring and Observability Stack
  2. CMOS-17

Consolidated set of Grafana Dashboards

Details

    • Epic
    • Resolution: Done
    • Critical
    • 0.1
    • None
    • cmos
    • None
    • Dashboards

    Description

      We need to provide customers with an out-of-box dashboard for Grafana.

      Priority | High-Level Requirements for Out-of-Box Grafana Dashboard | Comments
      P0 Add official out-of-box dashboard support for the Couchbase Server 7.x series first, then gather feedback, review, and update.
      P1 We need to support the Couchbase Server 6.x series until 6.x reaches EOL. Provide separate repositories for 6.x with official sample dashboards. Going forward, we should only support the Prometheus endpoint shipped with Couchbase Server 7 and above.
      P1 The primary endpoint for the official sample dashboards will be the Prometheus endpoint shipped with Couchbase Server 7.0. For backward compatibility with Couchbase Server 6.x, we should leverage the Operator-led and PS-led exporters.
      Comments: Official dashboards
      – Couchbase Server 6.x
      – Couchbase Server 7.x
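
The version split described above can be sketched as a small selection function. This is only an illustration, assuming version strings of the form "7.0.2"; the function name and the returned source identifiers are hypothetical and not part of any Couchbase tooling.

```python
def metrics_source(server_version: str) -> str:
    """Pick a metrics source for a Couchbase Server version string such as '7.0.2'.

    The returned identifiers are hypothetical labels used for illustration.
    """
    major = int(server_version.split(".")[0])
    if major >= 7:
        # 7.0 and above: use the Prometheus endpoint shipped with the server.
        return "builtin-prometheus-endpoint"
    if major == 6:
        # 6.x: fall back to an exporter for backward compatibility.
        return "exporter"
    raise ValueError(f"unsupported Couchbase Server version: {server_version}")
```

A dashboard provisioning script could use such a function to decide which of the two official dashboard sets (6.x or 7.x) to load for a given cluster.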
      P0 Dashboards should be organized hierarchically:
      • Universe of Couchbase - Provides a high-level snapshot of health across all deployed (configured) Couchbase clusters
        • Cluster names and their health status
      • Couchbase cluster - Provides a high-level snapshot of entire cluster-level health
        • Individual node health
        • Individual services health (Data, Query, Index, Search, Analytics, Eventing, XDCR, Backup)
      • System health (CPU, Memory, Disk, Network)
        • System Information
        • Breakdown of CPU % by Couchbase Process
        • Breakdown of Memory by Couchbase Server Process
        • Network / Disk Utilization by Couchbase Server Process
      • Couchbase services
        • Data - Bucket overview/ops, clients connected, users connected, response times/mctiming, resource configuration info, [7.0 and above] scopes overview, and count of collections
        • Query - Overall query performance, slowest queries, most common queries, most impactful queries, prepared statements and their performance
        • Index - Overall index performance, scan times, number of requests/sec, avg. item size, item count, indexes never scanned, index disk size
        • Search - Overall search performance, scan times, number of requests/sec, avg. item size, item count, indexes never scanned, index disk size
        • Eventing - Overall eventing performance, processing metrics, error counts/rates
        • XDCR - Aggregated metrics across all replications, source and destination replication names (not UUIDs), individual replications with a filter, % complete, replication state, docs replicated, replication backlog, network bandwidth used, changes left
      Some system metrics, such as the breakdown of CPU % by process and network utilization, will require node-exporter to be installed to gather them.
      P0 Core Health Screening:
      • Universe of Couchbase - Display a high-level aggregated view of each Couchbase cluster's health screening result: 'healthy' or 'unhealthy'.
      • Display a panel listing the health checks, with appropriate descriptions and 'pass' or 'fail' indicators, at each level of the hierarchy: Couchbase cluster -> node -> Couchbase services
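
One way to read the health screening requirement is as a roll-up over the hierarchy: a level is healthy only if all of its own checks pass and all of its children are healthy. The sketch below illustrates that reading. The class names, fields, and check names are hypothetical, and the roll-up rule itself is an assumption, since the ticket does not define how results are aggregated.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Check:
    """A single health check with a description and a pass/fail result."""
    name: str
    description: str
    passed: bool


@dataclass
class Level:
    """One level of the hierarchy: a cluster, a node, or a service."""
    name: str
    checks: List[Check] = field(default_factory=list)
    children: List["Level"] = field(default_factory=list)

    def healthy(self) -> bool:
        # Assumed rule: every own check passes and every child is healthy.
        return all(c.passed for c in self.checks) and all(
            child.healthy() for child in self.children
        )


def status(level: Level) -> str:
    """Return the label shown on the aggregated view."""
    return "healthy" if level.healthy() else "unhealthy"
```

With this rule, a single failing check on one service marks its node and cluster as 'unhealthy' in the Universe of Couchbase view, and the per-level panel can list each `Check` with its description and result.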
       
      P1 Dashboard variables: 
      • Support variables for each dashboard; it is important to define consistent variable names across all dashboards. Variables should align with metric labels and allow for filtering of results.
      • Variables should also allow multiple selections, as well as an "All" option.
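
To keep variable names consistent, the shared variables could be generated from one definition and reused in every dashboard. The sketch below assumes Grafana's dashboard JSON model, with a `templating.list` of query variables and `label_values(...)` queries against a Prometheus data source. The variable names, label names, and metric name are placeholders, and the field names should be verified against the Grafana version in use.

```python
import json
from typing import Dict, List, Tuple

# (variable name, metric label) pairs shared by every dashboard.
# These particular names are assumptions chosen for illustration.
SHARED_VARIABLES: List[Tuple[str, str]] = [
    ("cluster", "cluster"),
    ("node", "node"),
    ("bucket", "bucket"),
]


def make_variable(name: str, label: str, metric: str) -> Dict:
    """Build one query variable that supports multi-select and 'All'."""
    return {
        "name": name,
        "type": "query",
        "query": f"label_values({metric}, {label})",
        "multi": True,
        "includeAll": True,
    }


def templating_block(metric: str) -> Dict:
    """Build the templating section to merge into a dashboard's JSON."""
    variables = [make_variable(n, lbl, metric) for n, lbl in SHARED_VARIABLES]
    return {"templating": {"list": variables}}


if __name__ == "__main__":
    # "example_metric" is a placeholder, not a real Couchbase metric name.
    print(json.dumps(templating_block("example_metric"), indent=2))
```

Because each variable name matches a metric label, panel queries can filter with the same names everywhere.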
       
      P1 Labeling:
      • Metrics should be labeled uniformly and indicate their source, including Cluster, Pod, Node, Service, etc. For example, all Data service metrics should be labeled with the prefix data_.
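
The labeling rule could be enforced with a simple check run against scraped metrics. The sketch below is an illustration only. The required label set is an assumption drawn from the examples above, the `<service>_` prefix generalizes the single data_ example in the ticket, and the metric names in the usage note are made up.

```python
from typing import Dict, List

# Assumed minimum label set; the ticket lists Cluster, Pod, Node, Service, "etc."
REQUIRED_LABELS = {"cluster", "node", "service"}


def labeling_problems(metric_name: str, labels: Dict[str, str], service: str) -> List[str]:
    """Return human-readable problems for one metric (empty list if none)."""
    problems = []
    # Assumption: every service follows the "<service>_" pattern shown for data_.
    prefix = f"{service}_"
    if not metric_name.startswith(prefix):
        problems.append(f"{metric_name}: expected prefix {prefix!r}")
    missing = sorted(REQUIRED_LABELS - set(labels))
    if missing:
        problems.append(f"{metric_name}: missing labels {missing}")
    return problems
```

For example, `labeling_problems("ops", {"cluster": "a"}, "data")` reports both a missing `data_` prefix and the missing `node` and `service` labels, while a metric named `data_ops` carrying all three labels yields an empty list.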
       

       

            People

              anil Anil Kumar (Inactive)
              dhaikney David Haikney (Inactive)