Details
Bug
Resolution: Fixed
Critical
7.6.2
Untriaged
0
Unknown
Description
Problem:
For cursor_aware functions, when checkpointing fails with any error other than CAS_MISMATCH, the function handler silently skips to the next mutation.
A user currently has no way to:
- Know whether a failure is ongoing and mutations are being skipped because of checkpointing failures.
- Find the document IDs of the mutations for which checkpointing failed.
- Control the maximum time that checkpointing is allowed to run.
Solutions:
- Expose dcp_mutation_checkpoint_failure and dcp_deletion_checkpoint_failure via Prometheus endpoints so that they can be reported as failures in eventing stats and, by ns_server, in the on-prem UI.
- Push the document IDs of failed mutations to the app log.
- Add a checkpoint_timeout (similar to the script timeout) to limit the time spent on checkpointing.
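
For illustration only, the three proposals could fit together roughly as in the Python sketch below. It is not the eventing implementation. The counter names and checkpoint_timeout come from this ticket; everything else (CheckpointStats, checkpoint, render_metrics, the retry loop, the retry_delay parameter and the logger name) is hypothetical. In particular, retrying until the timeout expires is an assumption about what the timeout would bound, and the real metric names may carry a prefix.

```python
import logging
import time
from dataclasses import dataclass

# Hypothetical stand-in for the function's app log.
app_log = logging.getLogger("app_log_sketch")


class CasMismatch(Exception):
    """Stand-in for CAS_MISMATCH, which the ticket excludes from this failure path."""


@dataclass
class CheckpointStats:
    # Counter names come from the ticket; this container is hypothetical.
    dcp_mutation_checkpoint_failure: int = 0
    dcp_deletion_checkpoint_failure: int = 0


def checkpoint(doc_id, is_deletion, write_checkpoint, stats,
               checkpoint_timeout, retry_delay=0.01):
    """Try to checkpoint one mutation; return True on success, False on give-up."""
    deadline = time.monotonic() + checkpoint_timeout
    while True:
        try:
            write_checkpoint(doc_id)
            return True
        except CasMismatch:
            raise  # not covered by this sketch; left to the caller
        except Exception as exc:
            if time.monotonic() < deadline:
                # Assumed behaviour: keep retrying while time budget remains.
                time.sleep(retry_delay)
                continue
            # Budget exhausted: count the failure instead of skipping silently.
            if is_deletion:
                stats.dcp_deletion_checkpoint_failure += 1
            else:
                stats.dcp_mutation_checkpoint_failure += 1
            # Record the document ID so the user can find the affected mutation.
            app_log.warning("checkpoint failed for doc id %s: %s", doc_id, exc)
            return False


def render_metrics(stats):
    """Render the counters as Prometheus text-format sample lines."""
    return (
        f"dcp_mutation_checkpoint_failure {stats.dcp_mutation_checkpoint_failure}\n"
        f"dcp_deletion_checkpoint_failure {stats.dcp_deletion_checkpoint_failure}\n"
    )
```

With this shape, a skipped mutation leaves two traces: a counter that a stats scrape can pick up, and an app log line naming the document ID. The handler still moves on to the next mutation, but no longer silently.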