Details
- Improvement
- Resolution: Won't Do
- Major
- None
- 7.1.0
- None
- 1
Description
What is the issue?
As far as I know, the majority of DCP clients will try to restart a DCP connection that was unexpectedly closed, for example due to a failover. This is not the case for cbbackupmgr and the Backup service.
Currently, the Backup service doesn't support any kind of recovery or continuation of a backup that failed because one of the nodes it was streaming data from failed over. This results in user-facing errors that look like this:
"error": "exit status 1: failed to execute cluster operations: failed to execute bucket operation for bucket 'bucket6': failed to transfer bucket data for bucket 'bucket6': failed to transfer key value data: failed to transfer key value data: EOF"
This is not very informative, but it is probably the best high-level error we can produce, because EOF is the only error we get from the Data service. This is what we have in the cbbackupmgr logs:
2021-12-16T02:02:29.575-08:00 WARN: (DCP) (bucket6) (vb 261) Stream closed due to unexpected error 'EOF' | {"uuid":214011845766233,"snap_start":0,"snap_end":16242,"last_seqno":8760,"retries":0} -- couchbase.(*DCPAsyncWorker).End() at dcp_async_worker.go:538
2021-12-16T02:02:29.575-08:00 WARN: (DCP) (bucket6) (vb 261) Received an unexpected error whilst streaming, beginning teardown: EOF -- couchbase.(*DCPAsyncWorker).handleDCPError() at dcp_async_worker.go:615
What is the suggested improvement?
The suggestion is for the Data service to return a more informative error when a DCP stream is closed. It does not need to be extremely specific (I understand that the Data service might not know that the stream was closed because of a failover, as this is not even reflected in memcached.log), but any error we could use to infer a possible set of reasons for the backup failure and convey them to the end user would be very welcome.
As a side note, this could also benefit other services, which could, for example, use different timeout strategies depending on the reason the stream was closed.
Issue Links
- relates to
- MB-50135 [System Test][CBM] backup task failed with error - failed to transfer key value data: failed to transfer key value data: EOF (Closed)