With the new error logging code, we now display the "recent 10 errors". I have added a screenshot at the end of this email.
At any point, the last 10 errors are displayed for the replication. These 10 errors may or may not still be valid, depending on the current time.
This issue needs to be addressed at two levels -
1. Level of error logging - Currently too much information is displayed, which also gives a misleading idea of the state of replication.
2. Classification of errors vs. warnings.
Having lower-level information in the ns_logs can help with troubleshooting, but having all of that information on the web console might just confuse and overwhelm the end user, IMO.
XDCR can have an error at any of the following levels:
- xdc vbucket replicators - timing out, checkpoint failures, db_not_found
- xdc replication manager
- ns_server level - where it is unable to talk to the remote cluster, and so on.
In some recent trials with the new code, we see a lot of errors at the level of the bucket replicators, for example vbucket XXX commit_checkpoint_failure.
But the replication is continuing as expected. Replication has not failed; it continues despite the checkpoint failure above.
It might be nicer to classify errors vs. warnings.
Errors - When XDCR has finally stopped working. No more data is being sent over to the destination.
Replication will be attempted X number of times, and is then finally given up?
Warnings - When there are timeouts, but it is a recoverable situation.
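
To make the proposed split concrete, here is a minimal sketch in Python (not the actual XDCR code, which lives in ns_server). Everything in it is an assumption for illustration: the names `Severity`, `ReplicationEvent`, `RECOVERABLE_KINDS` and `classify` are hypothetical, the event kinds are only modeled on the failures mentioned above, and `max_attempts` stands in for the "X number of times" that is still an open question.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    WARNING = "warning"  # recoverable; replication is still making progress
    ERROR = "error"      # replication has stopped sending data


# Hypothetical event kinds treated as recoverable (assumption, not the real list).
RECOVERABLE_KINDS = {"timeout", "commit_checkpoint_failure"}


@dataclass
class ReplicationEvent:
    kind: str      # e.g. "timeout" (hypothetical label)
    attempts: int  # attempts made so far, including the one that just failed


def classify(event: ReplicationEvent, max_attempts: int = 5) -> Severity:
    """Return WARNING while a recoverable failure can still be retried.

    Anything else, including an unknown kind or a recoverable failure that
    has used up all of its attempts, is reported as an ERROR.
    """
    if event.kind in RECOVERABLE_KINDS and event.attempts < max_attempts:
        return Severity.WARNING
    return Severity.ERROR


def console_errors(events, max_attempts: int = 5):
    """Keep only the events that would be shown as errors on the web console."""
    return [e for e in events if classify(e, max_attempts) is Severity.ERROR]
```

With this kind of split, a checkpoint failure on its first attempt would be logged as a warning only, while the same failure after `max_attempts` tries would surface on the web console as an error. Treating unknown kinds as errors is a conservative choice made for the sketch, not something decided above.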