Details
- Bug
- Resolution: Fixed
- Major
- 7.6.0, Morpheus, 7.2.0, 7.2.1, 7.2.4, 7.2.2, 7.2.3, 7.2.5, 7.6.2, 7.2.6, 7.6.1
- Untriaged
- 0
- No
Description
When XDCR has trouble creating a pipeline, as in this case where there is potentially a networking issue, the current Prometheus stats do not suffice, because there is no pipeline from which to extract the error status.
I forced a reproduction of the problem; see the attached screenshots.
From Prometheus, we then see that the pipeline is reported as paused (because no pipeline could be created from which to extract the error status), and the error count stays at 0 (because there is no pipeline).
$ curl -sX GET -u Administrator:wewewe localhost:9000/metrics | grep xdcr | grep errors
# HELP xdcr_target_eaccess_total The total number of EACCESS errors returned from the target node.
# HELP xdcr_pipeline_errors The number of currently present errors for a specific Replication Pipeline.
# TYPE xdcr_pipeline_errors gauge
xdcr_pipeline_errors {targetClusterUUID="a392925e6421998d14d00fa865bdf7d7", sourceBucketName="B1", targetBucketName="B2", pipelineType="Main"} 0
xdcr_pipeline_errors {targetClusterUUID="a392925e6421998d14d00fa865bdf7d7", sourceBucketName="B1", targetBucketName="B2", pipelineType="Backfill"} 0
# HELP xdcr_target_tmpfail_total The total number of TMPFAIL errors returned from the target node.

$ curl -sX GET -u Administrator:wewewe localhost:9000/metrics | grep xdcr | grep status
# HELP xdcr_pipeline_status The pipeline status for a specific pipeline, where it could be paused, running or, error.
# TYPE xdcr_pipeline_status gauge
xdcr_pipeline_status {targetClusterUUID="a392925e6421998d14d00fa865bdf7d7", sourceBucketName="B1", targetBucketName="B2", pipelineType="Main", status="Paused"} 1
xdcr_pipeline_status {targetClusterUUID="a392925e6421998d14d00fa865bdf7d7", sourceBucketName="B1", targetBucketName="B2", pipelineType="Main", status="Running"} 0
xdcr_pipeline_status {targetClusterUUID="a392925e6421998d14d00fa865bdf7d7", sourceBucketName="B1", targetBucketName="B2", pipelineType="Main", status="Error"} 0
xdcr_pipeline_status {targetClusterUUID="a392925e6421998d14d00fa865bdf7d7", sourceBucketName="B1", targetBucketName="B2", pipelineType="Backfill", status="Paused"} 1
xdcr_pipeline_status {targetClusterUUID="a392925e6421998d14d00fa865bdf7d7", sourceBucketName="B1", targetBucketName="B2", pipelineType="Backfill", status="Running"} 0
xdcr_pipeline_status {targetClusterUUID="a392925e6421998d14d00fa865bdf7d7", sourceBucketName="B1", targetBucketName="B2", pipelineType="Backfill", status="Error"} 0
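The misleading state in this output can also be checked programmatically. What follows is a minimal Python sketch, not part of XDCR: the helper names `parse_samples`, `active_statuses`, and `total_errors` are hypothetical, the sample text is an excerpt of the output quoted in this ticket, and the regular expression is deliberately relaxed to tolerate the space before `{` seen in that output.

```python
import re

# Excerpt of the scraped output quoted in this ticket.
LABELS = ('targetClusterUUID="a392925e6421998d14d00fa865bdf7d7", '
          'sourceBucketName="B1", targetBucketName="B2"')
SAMPLE = "\n".join([
    "# TYPE xdcr_pipeline_errors gauge",
    'xdcr_pipeline_errors {%s, pipelineType="Main"} 0' % LABELS,
    'xdcr_pipeline_errors {%s, pipelineType="Backfill"} 0' % LABELS,
    "# TYPE xdcr_pipeline_status gauge",
    'xdcr_pipeline_status {%s, pipelineType="Main", status="Paused"} 1' % LABELS,
    'xdcr_pipeline_status {%s, pipelineType="Main", status="Running"} 0' % LABELS,
    'xdcr_pipeline_status {%s, pipelineType="Main", status="Error"} 0' % LABELS,
    'xdcr_pipeline_status {%s, pipelineType="Backfill", status="Paused"} 1' % LABELS,
    'xdcr_pipeline_status {%s, pipelineType="Backfill", status="Running"} 0' % LABELS,
    'xdcr_pipeline_status {%s, pipelineType="Backfill", status="Error"} 0' % LABELS,
])

# Metric name, optional whitespace, {labels}, then the sample value.
SAMPLE_LINE = re.compile(r'^(\w+)\s*\{(.*)\}\s+(\S+)$')
LABEL_PAIR = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return (name, labels, value) tuples, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        samples.append((name, dict(LABEL_PAIR.findall(raw_labels)), float(value)))
    return samples


def active_statuses(samples):
    """Map each pipelineType to the status whose gauge is currently 1."""
    active = {}
    for name, labels, value in samples:
        if name == "xdcr_pipeline_status" and value == 1:
            active[labels["pipelineType"]] = labels["status"]
    return active


def total_errors(samples):
    """Sum the xdcr_pipeline_errors gauge over all pipelines."""
    return sum(v for name, _, v in samples if name == "xdcr_pipeline_errors")


samples = parse_samples(SAMPLE)
statuses = active_statuses(samples)
errors = total_errors(samples)
print(statuses, errors)
```

With the output from this reproduction, both pipelines come back as `Paused` with an error total of zero, so an alert keyed on either metric would not fire even though pipeline creation is failing.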
Two things need to be addressed from the Prometheus perspective:
1. The pipeline error count needs to include pipeline creation failures.
2. The pipeline status needs to be set to error if pipeline creation encounters errors.
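The two requested behaviours can be modelled in a few lines. This is only an illustrative Python sketch under stated assumptions, not the XDCR implementation: the `PipelineMetrics` class and its methods are hypothetical, the label set is shortened for readability, and clearing the error count on a successful creation is an assumption that the ticket does not specify.

```python
from dataclasses import dataclass

STATUSES = ("Paused", "Running", "Error")


@dataclass
class PipelineMetrics:
    """Hypothetical per-replication metrics holder that exists even when no pipeline does."""
    labels: str
    status: str = "Paused"
    errors: int = 0

    def record_creation_failure(self):
        # Point 1: a failed creation attempt is counted as an error.
        self.errors += 1
        # Point 2: the exported status switches to Error.
        self.status = "Error"

    def record_creation_success(self):
        # Assumption: a successful creation clears the outstanding errors.
        self.errors = 0
        self.status = "Running"

    def render(self):
        """Return exposition-style lines mirroring the layout quoted in this ticket."""
        lines = [f"xdcr_pipeline_errors {{{self.labels}}} {self.errors}"]
        for status in STATUSES:
            flag = 1 if status == self.status else 0
            lines.append(
                f'xdcr_pipeline_status {{{self.labels}, status="{status}"}} {flag}'
            )
        return lines


metrics = PipelineMetrics('sourceBucketName="B1", targetBucketName="B2", pipelineType="Main"')
metrics.record_creation_failure()
metrics.record_creation_failure()
for line in metrics.render():
    print(line)
```

The point of the sketch is that the metrics holder is keyed by the replication rather than by a live pipeline object, so creation failures have somewhere to be recorded.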