  1. Couchbase Server
  2. MB-62097

[BP 7.2.6] - XDCR - newPipeline type errors need to be reflected on prometheus stats

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: 7.2.6
    • Affects Versions: 7.6.0, Morpheus, 7.2.0, 7.2.1, 7.2.4, 7.2.2, 7.2.3, 7.2.5, 7.6.2, 7.2.6, 7.6.1, 7.6.4
    • Component: XDCR
    • Untriaged
    • 0
    • No

    Description

      When XDCR fails to create a pipeline (in this case, potentially because of a networking issue), the current Prometheus stats are insufficient, because there is no pipeline from which to extract the error status.

      I forced a reproduction of the problem; the screenshots are attached.

      Prometheus then reports the pipeline as paused (the pipeline could not be created, so there is nothing from which to extract an error status) and reports no errors (because there is no pipeline).

      curl -sX GET -u Administrator:wewewe localhost:9000/metrics | grep xdcr | grep errors
      # HELP xdcr_target_eaccess_total The total number of EACCESS errors returned from the target node.
      # HELP xdcr_pipeline_errors The number of currently present errors for a specific Replication Pipeline.
      # TYPE xdcr_pipeline_errors gauge
      xdcr_pipeline_errors {targetClusterUUID="a392925e6421998d14d00fa865bdf7d7", sourceBucketName="B1", targetBucketName="B2", pipelineType="Main"} 0
      xdcr_pipeline_errors {targetClusterUUID="a392925e6421998d14d00fa865bdf7d7", sourceBucketName="B1", targetBucketName="B2", pipelineType="Backfill"} 0
      # HELP xdcr_target_tmpfail_total The total number of TMPFAIL errors returned from the target node.
      curl -sX GET -u Administrator:wewewe localhost:9000/metrics | grep xdcr | grep status
      # HELP xdcr_pipeline_status The pipeline status for a specific pipeline, where it could be paused, running or, error.
      # TYPE xdcr_pipeline_status gauge
      xdcr_pipeline_status {targetClusterUUID="a392925e6421998d14d00fa865bdf7d7", sourceBucketName="B1", targetBucketName="B2", pipelineType="Main", status="Paused"} 1
      xdcr_pipeline_status {targetClusterUUID="a392925e6421998d14d00fa865bdf7d7", sourceBucketName="B1", targetBucketName="B2", pipelineType="Main", status="Running"} 0
      xdcr_pipeline_status {targetClusterUUID="a392925e6421998d14d00fa865bdf7d7", sourceBucketName="B1", targetBucketName="B2", pipelineType="Main", status="Error"} 0
      xdcr_pipeline_status {targetClusterUUID="a392925e6421998d14d00fa865bdf7d7", sourceBucketName="B1", targetBucketName="B2", pipelineType="Backfill", status="Paused"} 1
      xdcr_pipeline_status {targetClusterUUID="a392925e6421998d14d00fa865bdf7d7", sourceBucketName="B1", targetBucketName="B2", pipelineType="Backfill", status="Running"} 0
      xdcr_pipeline_status {targetClusterUUID="a392925e6421998d14d00fa865bdf7d7", sourceBucketName="B1", targetBucketName="B2", pipelineType="Backfill", status="Error"} 0
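
The ambiguity in the output above can be surfaced programmatically. The following is a minimal Python sketch, not part of XDCR: it parses the metric lines shown above and lists every pipeline whose active status is Paused while its error gauge is 0. The function names (`parse_samples`, `pipeline_key`, `paused_without_errors`) are hypothetical, and the parser only handles the simple line shape seen in this ticket. Note that the sketch cannot tell a deliberately paused replication from one whose pipeline failed to be created, which is exactly the gap this issue describes.

```python
import re

# Metric lines copied from the output in this ticket (comments trimmed).
_LABELS = 'targetClusterUUID="a392925e6421998d14d00fa865bdf7d7", sourceBucketName="B1", targetBucketName="B2"'
SAMPLE = "\n".join(
    [f'xdcr_pipeline_errors {{{_LABELS}, pipelineType="{t}"}} 0' for t in ("Main", "Backfill")]
    + [
        f'xdcr_pipeline_status {{{_LABELS}, pipelineType="{t}", status="{s}"}} {1 if s == "Paused" else 0}'
        for t in ("Main", "Backfill")
        for s in ("Paused", "Running", "Error")
    ]
)

# name, optional whitespace, {labels}, value (the shape seen in the ticket).
LINE_RE = re.compile(r'^(\w+)\s*\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank, HELP and TYPE lines
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        samples.append((name, dict(LABEL_RE.findall(raw_labels)), float(value)))
    return samples


def pipeline_key(labels):
    """Identify a pipeline by the labels shared by both metrics."""
    return (
        labels["targetClusterUUID"],
        labels["sourceBucketName"],
        labels["targetBucketName"],
        labels["pipelineType"],
    )


def paused_without_errors(samples):
    """Pipelines whose active status is Paused and whose error gauge is 0."""
    errors, active_status = {}, {}
    for name, labels, value in samples:
        if name == "xdcr_pipeline_errors":
            errors[pipeline_key(labels)] = value
        elif name == "xdcr_pipeline_status" and value == 1:
            active_status[pipeline_key(labels)] = labels["status"]
    return sorted(
        key for key, status in active_status.items()
        if status == "Paused" and errors.get(key, 0) == 0
    )


if __name__ == "__main__":
    for key in paused_without_errors(parse_samples(SAMPLE)):
        print("ambiguous (paused, no errors):", key)
```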
      

      Two things need to be addressed from the Prometheus perspective:
      1. The pipeline error counter needs to count pipeline creation failures.
      2. The pipeline status needs to be set to Error if pipeline creation encounters errors.
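
The two requested behaviours can be modelled in a few lines. The following is only an illustrative Python sketch of the intended metric semantics, not the actual XDCR code or the fix that was merged. It assumes metric state is tracked per replication rather than per live pipeline, so it survives a failed creation. All names (`ReplicationMetrics`, `record_creation_failure`, `record_started`, `pipeline_labels`) are hypothetical, and resetting the error gauge on a successful start is an assumption rather than something stated in this ticket.

```python
from dataclasses import dataclass, field

STATUSES = ("Paused", "Running", "Error")


def pipeline_labels(source_bucket, target_bucket, pipeline_type):
    """Build the label text used as a key (hypothetical helper)."""
    return (
        f'sourceBucketName="{source_bucket}", '
        f'targetBucketName="{target_bucket}", '
        f'pipelineType="{pipeline_type}"'
    )


@dataclass
class ReplicationState:
    status: str = "Paused"
    errors: int = 0


@dataclass
class ReplicationMetrics:
    """Metric state keyed by replication, independent of any live pipeline."""
    states: dict = field(default_factory=dict)

    def record_creation_failure(self, key):
        # Item 1: count the failure even though no pipeline exists.
        # Item 2: report the Error status instead of Paused.
        state = self.states.setdefault(key, ReplicationState())
        state.errors += 1
        state.status = "Error"

    def record_started(self, key):
        # Assumption: a successful start clears the outstanding errors.
        state = self.states.setdefault(key, ReplicationState())
        state.errors = 0
        state.status = "Running"

    def render(self):
        """Return metric lines in the same shape as the output in this ticket."""
        lines = []
        for key, state in sorted(self.states.items()):
            lines.append(f'xdcr_pipeline_errors {{{key}}} {state.errors}')
            for status in STATUSES:
                flag = 1 if status == state.status else 0
                lines.append(f'xdcr_pipeline_status {{{key}, status="{status}"}} {flag}')
        return lines


if __name__ == "__main__":
    metrics = ReplicationMetrics()
    metrics.record_creation_failure(pipeline_labels("B1", "B2", "Main"))
    print("\n".join(metrics.render()))
```

With this model, a replication whose pipeline never comes up is exported with a non-zero error gauge and `status="Error"` set to 1, rather than appearing as a paused pipeline with no errors.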

            People

              ayush.nayyar Ayush Nayyar
              neil.huang Neil Huang