  1. Couchbase Server
  2. MB-50171

Data service should return a more informative error for closed backup DCP streams

Details

    • Improvement
    • Resolution: Won't Do
    • Major
    • None
    • 7.1.0
    • couchbase-bucket
    • None
    • 1

    Description

      What is the issue?
      As far as I know, most DCP clients try to re-establish a DCP connection that was closed unexpectedly, for example because of a failover. This is not the case for cbbackupmgr and the Backup service.

      Currently, the Backup service does not support any kind of recovery or continuation of a backup that failed because one of the nodes it was streaming data from failed over. This results in end-user-facing errors that look like this:

      "error": "exit status 1: failed to execute cluster operations: failed to execute bucket operation for bucket 'bucket6': failed to transfer bucket data for bucket 'bucket6': failed to transfer key value data: failed to transfer key value data: EOF"
      

      This is not very informative, but it is probably the best high-level error we can produce, because EOF is the only error we get from the Data service. This is what we have in the cbbackupmgr logs:

      2021-12-16T02:02:29.575-08:00 WARN: (DCP) (bucket6) (vb 261) Stream closed due to unexpected error 'EOF' | {"uuid":214011845766233,"snap_start":0,"snap_end":16242,"last_seqno":8760,"retries":0} -- couchbase.(*DCPAsyncWorker).End() at dcp_async_worker.go:538
      2021-12-16T02:02:29.575-08:00 WARN: (DCP) (bucket6) (vb 261) Received an unexpected error whilst streaming, beginning teardown: EOF -- couchbase.(*DCPAsyncWorker).handleDCPError() at dcp_async_worker.go:615
      

      What is the suggested improvement?
      Add a way of getting a more informative error from the Data service when a DCP stream is closed. It does not need to be extremely specific (I understand that the Data service might not know that the stream was closed because of a failover, as this is not even reflected in memcached.log), but any error that we can use to infer a possible set of reasons for a backup failure and convey them to the end user would be very welcome.

      As a side note, this could also benefit other services, as they could, for example, use different timeout strategies based on the reason the stream was closed.
       

            People

              maks.januska Maksimiljans Januska
              maks.januska Maksimiljans Januska