Details
-
Bug
-
Resolution: Fixed
-
Major
-
7.6.0, Morpheus, 7.2.0, 7.2.1, 7.2.4, 7.2.2, 7.2.3, 7.2.5, 7.6.2, 7.2.6, 7.6.1, 7.6.4
-
Untriaged
-
0
-
No
Description
When XDCR has trouble creating a pipeline, as in this case where there is potentially a networking issue, the current Prometheus stats do not suffice, because there is no pipeline from which to extract the error status.
I forced a reproduction of the problem; screenshots are attached.
From Prometheus, we then see that the pipeline is reported as paused (it could not be created, so there is nothing from which to extract an error status) and that there is no error count (because there is no pipeline).
$ curl -sX GET -u Administrator:wewewe localhost:9000/metrics | grep xdcr | grep errors
# HELP xdcr_target_eaccess_total The total number of EACCESS errors returned from the target node.
# HELP xdcr_pipeline_errors The number of currently present errors for a specific Replication Pipeline.
# TYPE xdcr_pipeline_errors gauge
xdcr_pipeline_errors {targetClusterUUID="a392925e6421998d14d00fa865bdf7d7", sourceBucketName="B1", targetBucketName="B2", pipelineType="Main"} 0
xdcr_pipeline_errors {targetClusterUUID="a392925e6421998d14d00fa865bdf7d7", sourceBucketName="B1", targetBucketName="B2", pipelineType="Backfill"} 0
# HELP xdcr_target_tmpfail_total The total number of TMPFAIL errors returned from the target node.

$ curl -sX GET -u Administrator:wewewe localhost:9000/metrics | grep xdcr | grep status
# HELP xdcr_pipeline_status The pipeline status for a specific pipeline, where it could be paused, running or, error.
# TYPE xdcr_pipeline_status gauge
xdcr_pipeline_status {targetClusterUUID="a392925e6421998d14d00fa865bdf7d7", sourceBucketName="B1", targetBucketName="B2", pipelineType="Main", status="Paused"} 1
xdcr_pipeline_status {targetClusterUUID="a392925e6421998d14d00fa865bdf7d7", sourceBucketName="B1", targetBucketName="B2", pipelineType="Main", status="Running"} 0
xdcr_pipeline_status {targetClusterUUID="a392925e6421998d14d00fa865bdf7d7", sourceBucketName="B1", targetBucketName="B2", pipelineType="Main", status="Error"} 0
xdcr_pipeline_status {targetClusterUUID="a392925e6421998d14d00fa865bdf7d7", sourceBucketName="B1", targetBucketName="B2", pipelineType="Backfill", status="Paused"} 1
xdcr_pipeline_status {targetClusterUUID="a392925e6421998d14d00fa865bdf7d7", sourceBucketName="B1", targetBucketName="B2", pipelineType="Backfill", status="Running"} 0
xdcr_pipeline_status {targetClusterUUID="a392925e6421998d14d00fa865bdf7d7", sourceBucketName="B1", targetBucketName="B2", pipelineType="Backfill", status="Error"} 0
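To make the symptom easier to spot than by reading the raw scrape, the output above can be reduced to one row per pipeline. The following is only a sketch: `parse_samples` and `summarize` are hypothetical helper names, not part of goxdcr or any Prometheus library, and the regular expression assumes the simple label syntax seen in the output above (including the space before `{`).

```python
import re
from collections import defaultdict

# Matches one labelled sample line, e.g.
#   xdcr_pipeline_status {a="b", c="d"} 1
# The optional whitespace before "{" mirrors the pasted output.
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\s*\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Yield (metric_name, labels, value) for each labelled sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything that is not a labelled sample
        labels = dict(LABEL_RE.findall(match.group("labels")))
        yield match.group("name"), labels, float(match.group("value"))


def summarize(text):
    """Map (source, target, pipelineType) to its active status and error gauge."""
    summary = defaultdict(lambda: {"status": None, "errors": None})
    for name, labels, value in parse_samples(text):
        key = (
            labels.get("sourceBucketName"),
            labels.get("targetBucketName"),
            labels.get("pipelineType"),
        )
        if name == "xdcr_pipeline_status" and value == 1:
            # The status whose gauge is 1 is the one currently reported.
            summary[key]["status"] = labels.get("status")
        elif name == "xdcr_pipeline_errors":
            summary[key]["errors"] = value
    return dict(summary)
```

Fed the two scrapes above, this would report both the Main and Backfill pipelines as `Paused` with an error gauge of 0, which is the misleading combination this ticket describes.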
Two things need to be addressed from a Prometheus perspective:
1. The pipeline error counter needs to count the number of pipeline creation failures.
2. The pipeline status needs to be updated to Error if pipeline creation encounters errors.
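The two requested changes can be modelled in a few lines. This is an illustrative sketch only, not the goxdcr implementation: `PipelineStats` and its methods are hypothetical names, and resetting the error gauge on a successful start is an assumption based on the HELP text ("currently present errors"). The point it shows is that the stats have to live outside the pipeline object, so they can still be updated when no pipeline was ever created.

```python
from dataclasses import dataclass

# Status label values as they appear in xdcr_pipeline_status.
STATUSES = ("Paused", "Running", "Error")


@dataclass
class PipelineStats:
    """Hypothetical per-replication stats that exist even without a pipeline."""

    status: str = "Paused"
    errors: int = 0

    def record_creation_failure(self):
        # Item 1: count the failed creation attempt.
        self.errors += 1
        # Item 2: surface the failure in the status gauge.
        self.status = "Error"

    def record_started(self):
        # Assumption: a successful start clears the outstanding errors.
        self.errors = 0
        self.status = "Running"

    def status_gauge(self):
        # One 0/1 sample per status label, as in the scrape output.
        return {name: int(name == self.status) for name in STATUSES}
```

With this model, two failed creation attempts would leave `errors` at 2 and the `Error` sample at 1, rather than the `Paused` = 1 and 0 errors seen in the reproduction.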
For Gerrit Dashboard: MB-62097

# | Subject | Branch | Project | Status | CR | V |
---|---|---|---|---|---|---|
210676,2 | MB-62097: pipeline failing to start will update prometheus appropriately | neo | goxdcr | MERGED | +2 | +1 |