Description
What's the issue?
Since MB-41372, we don't propagate errors from the restore worker pool for cbimport as we do for cbbackupmgr. James Lee says this was for a couple of reasons:
- Some errors can't be detected ahead of time, e.g. the imported data producing an invalid packet when it is sent.
- The import should be resilient to failures on individual items (e.g. a document that is too big, or isn't valid JSON).
As a result, some scenarios - such as a full ephemeral bucket, or the bucket being deleted from under us - take an extremely long time to fail/error out, because each document must exhaust its 5 minutes of retries.
What's the fix?
We could try to handle these issues individually, e.g. by detecting that the bucket has been deleted via /pools/default. However, this feels like it could become a game of whack-a-mole. A better approach might be to revisit the reasons for not propagating errors and address them, so that we can propagate errors once again. For example, we already validate that each document is valid JSON/UTF-8.