When we resume a backup, we open new connections to the cluster to re-stream (or top up) the data that we have stored for any vBuckets that were "completed" before the backup ended prematurely.
Once we have completed a multipart upload, the ship has sailed; that object (as far as cbbackupmgr is concerned) is complete, it's WORM. Currently, during a backup, when vBuckets are "completed" the respective multipart upload is completed; this means that if we need to resume the backup (because it failed somewhere else down the line) we can't "top-up" that vBucket.
We need to change the logic so that upon "committing" the vBucket, we simply upload all the parts which remain in memory. At the end of the backup, i.e. once all the vBuckets have completed streaming (all the DCP connections are closed) and we are about to upload the archive metadata to the cloud, we complete all the multipart uploads.
"Well yes, but actually no" - Although this does solve this problem, it creates another. Let's propose a scenario: we have completed our backup (all DCP streams have closed), we begin to complete all of the multipart uploads but for some reason (likely network) we failed to complete all of the multipart uploads (most importantly, some have been completed). Now we try to '--resume' the backup again, and we are once again in the same situation; we try to re-open streams to top-up our already completed multipart uploads.
We need to make the fetching/putting of archive metadata (including the committing of multipart uploads) into a step in the plan; this means that when resuming, we can skip the DCP stage and continue on to complete the incomplete multipart uploads.
Below are some steps which will cause invalid data with the above solution:
1) Load the beer sample bucket
2) Run a backup
3) <ctrl-c> the backup whilst it's in progress
4) Add a document using the Web UI which is in a vBucket which is already "complete", i.e. one that will be topped up.
5) Resume the backup
6) The backup will fail to complete the multipart upload for the vBucket which contained the extra document.
This is an issue solely because of the hard 5MB minimum part size for multipart uploads. When we first start the backup and complete the vBucket we plan to add to, its final part is uploaded (which is very likely to be less than 5MB, which is fine since it's the last part... but it might not be). When we add to that vBucket and restart the backup, another part is created for that vBucket and is uploaded. This is great, we know where all the parts are and which order they need to be completed in; however, we now have a single rogue part in the middle which is smaller than the minimum 5MB. This means we can't complete this multipart upload.
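The constraint can be expressed as a small check. This is only a sketch: 'firstUndersizedPart' is a hypothetical helper, and the 5MB limit is taken to mean 5 MiB.

```go
package main

// minPartSize is the minimum size of every part except the last one,
// assuming the 5MB limit above means 5 MiB.
const minPartSize = 5 * 1024 * 1024

// firstUndersizedPart returns the index of the first part, other than the
// last, which is smaller than minPartSize. It returns -1 if the upload can
// be completed as far as part sizes are concerned.
func firstUndersizedPart(sizes []int64) int {
	for i := 0; i < len(sizes)-1; i++ {
		if sizes[i] < minPartSize {
			return i
		}
	}
	return -1
}
```

For the scenario above, the part sizes before the resume might be 8 MiB followed by 2 MiB (fine, the small part is last); after the top-up adds a third part, the 2 MiB part is no longer last and the check reports index 1.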
This issue can be solved but requires some finagling to get around our snapshot requirement (we must always be ahead of or equal to our internal snapshot markers). When we "top-up" a vBucket with its final part being smaller than the minimum part size of 5MB we:
1) Download the contents of the final part (which will be less than 5MB)
2) We append to this part locally, as we normally would
3) Once we hit the upload threshold/or finished the "top-up" we upload the part with the same part-number (this overwrites that part)
4) We complete the multipart upload as per usual
This isn't an ideal solution but is the cleanest way I can think of (which allows us to remain completely stateless).
It's not possible to download a single part from an incomplete multipart upload; this rules out the solution above.
Another possible solution, which would work to a certain degree but would require changes to the restore skip logic, would be to:
1) Remove the snapshot marker mutation function which is utilized by VBucketBackupWriter to update our progress internally
2) This means that we can change how the snapshot markers are persisted to avoid having the 'lastSeqNo' be anything contained in a part which is smaller than 5MB.
As explained above this method causes issues with the archive skip logic since the snapshot start/end seqno will default to zero for vBuckets that don't contain any parts which are larger than 5MB e.g. there is only a single part.
Another solution, which may be cleaner than the ones I have described above, would be to always pad the final part so that it meets the 5MB minimum part size; this wouldn't cause any issues with restore since Rift can already handle "dead-space" in its data stores. This would allow us to resume backups without having to worry about rolling back the final part because it was smaller than the minimum part size. This solution is extremely easy to implement; however, after some testing, it has severe consequences for the performance of backups which are smaller than 5GB overall.
This solution might be the cleanest one. When we call 'Upload' on a multipart uploader for which the final part is smaller than 5MB we:
1) Complete that multipart upload
2) Start a new multipart upload with the same 'key'
3) Immediately perform an 'UploadPartCopy' for that entire object (which makes that entire object a single 'part' although that object will still exist)
4) We "top-up" the multipart upload.
5) Data transfer completes
6) We complete the multipart upload (at this point, the object is replaced with the new multipart upload since it has the same key; we are in a roundabout way appending to the multipart upload)
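The steps above can be sketched in Go as follows. The 'multipartStore' interface is a hypothetical, cut-down view of an S3-like client (the method names mirror the S3 operations but the signatures are invented), 'rollOver' is a hypothetical helper covering steps 1 to 3, and 'memStore' is an in-memory stand-in which exists only so that the sketch can be exercised without a real object store.

```go
package main

import "fmt"

// multipartStore is a hypothetical, minimal view of an S3-like client. The
// method names mirror S3 operations; the signatures are invented for brevity.
type multipartStore interface {
	CreateMultipartUpload(key string) (uploadID string, err error)
	UploadPart(key, uploadID string, number int, data []byte) error
	UploadPartCopy(key, uploadID string, number int, sourceKey string) error
	CompleteMultipartUpload(key, uploadID string) error
}

// rollOver completes the current upload for key, then opens a new upload
// under the same key whose part 1 is a server-side copy of the object that
// was just completed. It returns the new upload ID and the next part number.
func rollOver(s multipartStore, key, uploadID string) (string, int, error) {
	if err := s.CompleteMultipartUpload(key, uploadID); err != nil {
		return "", 0, fmt.Errorf("completing upload: %w", err)
	}
	newID, err := s.CreateMultipartUpload(key)
	if err != nil {
		return "", 0, fmt.Errorf("creating upload: %w", err)
	}
	if err := s.UploadPartCopy(key, newID, 1, key); err != nil {
		return "", 0, fmt.Errorf("copying existing object: %w", err)
	}
	return newID, 2, nil
}

// memStore is an in-memory stand-in used only to exercise the sketch. It
// enforces a (configurable) minimum size on every part except the last.
type memStore struct {
	minPart int
	objects map[string][]byte   // completed objects by key
	uploads map[string][][]byte // in-progress parts by upload ID
	nextID  int
}

func newMemStore(minPart int) *memStore {
	return &memStore{minPart: minPart, objects: map[string][]byte{}, uploads: map[string][][]byte{}}
}

func (m *memStore) CreateMultipartUpload(key string) (string, error) {
	m.nextID++
	id := fmt.Sprintf("upload-%d", m.nextID)
	m.uploads[id] = [][]byte{}
	return id, nil
}

func (m *memStore) UploadPart(key, id string, number int, data []byte) error {
	parts, ok := m.uploads[id]
	if !ok {
		return fmt.Errorf("unknown upload %q", id)
	}
	if number != len(parts)+1 {
		return fmt.Errorf("part %d uploaded out of order", number) // simplification
	}
	m.uploads[id] = append(parts, append([]byte(nil), data...))
	return nil
}

func (m *memStore) UploadPartCopy(key, id string, number int, sourceKey string) error {
	src, ok := m.objects[sourceKey]
	if !ok {
		return fmt.Errorf("no object %q to copy", sourceKey)
	}
	return m.UploadPart(key, id, number, src)
}

func (m *memStore) CompleteMultipartUpload(key, id string) error {
	parts, ok := m.uploads[id]
	if !ok {
		return fmt.Errorf("unknown upload %q", id)
	}
	var object []byte
	for i, p := range parts {
		if i < len(parts)-1 && len(p) < m.minPart {
			return fmt.Errorf("part %d is smaller than the minimum part size", i+1)
		}
		object = append(object, p...)
	}
	m.objects[key] = object
	delete(m.uploads, id)
	return nil
}
```

After 'rollOver' returns, the "top-up" continues by uploading parts from the returned part number, and the final complete replaces the object under the same key. A real implementation would also need to respect the maximum part size when copying, for example by copying a large object in several ranges.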
This solution is probably the cleanest since it doesn't require adding any additional data (which users will pay to store) and it doesn't require downloading any data (which is expensive); it only requires making a few additional requests (which are cheap). However, this solution does not entirely solve the problem; it only works if the total size of all the previous parts is greater than 5MB. This is an issue we can work around, so before we resume a cloud backup we:
1) Verify that all the vBuckets' multipart uploads are greater than 5MB
2) For vBuckets that do not contain enough data we:
2a) Abort the multipart upload for that vBucket
2b) Remove the local snapshot files for that vBucket
3) Continue the resume as normal
This means that, at a vBucket level, we resume streaming data if we have already streamed over 5MB of data; otherwise, we open a new stream with a start seqno of zero and start streaming data as if we had just started the backup.
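As a rough sketch, the per-vBucket decision could be written like this in Go. The 'resumeAction' type and the 'planResume' function are hypothetical names, the 5MB limit is taken to mean 5 MiB, and the input is assumed to be the total number of bytes already uploaded for each vBucket.

```go
package main

// minPartSize assumes the 5MB limit above means 5 MiB.
const minPartSize = 5 * 1024 * 1024

// resumeAction is what the resume should do for a single vBucket.
type resumeAction int

const (
	// resumeStream keeps the multipart upload and continues streaming.
	resumeStream resumeAction = iota
	// restartFromZero means: abort the multipart upload, remove the local
	// snapshot files and open a new stream with a start seqno of zero.
	restartFromZero
)

// planResume maps each vBucket ID to an action, given the total number of
// bytes already uploaded for that vBucket. The strict comparison mirrors the
// "greater than 5MB" wording above.
func planResume(uploaded map[uint16]int64) map[uint16]resumeAction {
	actions := make(map[uint16]resumeAction, len(uploaded))
	for vb, size := range uploaded {
		if size > minPartSize {
			actions[vb] = resumeStream
		} else {
			actions[vb] = restartFromZero
		}
	}
	return actions
}
```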
To see why this is an issue, let's first look at how we create (and keep track of) a multipart upload:
1) Create multipart upload (AWS returns an upload id)
2) We persist this upload id to disk as soon as we get it
There is an issue here: what happens if we panic in another thread (or the user sends SIGINT or, worse, SIGKILL) after we have created the upload, but before we have persisted the upload manifest? In this situation we end up with an orphaned multipart upload. Users are charged for all the parts of a multipart upload even if it is never completed; this isn't a huge issue here since there aren't any uploaded parts, so they won't be charged storage costs. We should, however, (as we already do) try our best to clean up after ourselves.
Since AWS is kind enough to keep track of multipart uploads for us, we can take advantage of the AWS SDK's ListMultipartUploads function. We can use this to list all the multipart uploads where the key is prefixed with the current backup, i.e. at the end of the backup we:
1) Complete all the known multipart uploads (the ones that exist in the upload manifests)
2) List all the multipart uploads that exist with the key prefix '$archive/$repo/$backup' which will be unique to this backup
3) Abort all the listed (orphaned) multipart uploads
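Steps 2 and 3 could be sketched in Go as below. The 'uploadCleaner' interface is a hypothetical, cut-down view of an S3-like client (its list method is assumed to deal with pagination itself), 'abortOrphans' is a hypothetical helper, and 'memCleaner' is an in-memory stand-in used only to exercise the sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// pendingUpload identifies one in-progress multipart upload.
type pendingUpload struct {
	Key      string
	UploadID string
}

// uploadCleaner is a hypothetical, minimal view of an S3-like client. List is
// assumed to follow pagination internally and return every matching upload.
type uploadCleaner interface {
	ListMultipartUploads(prefix string) ([]pendingUpload, error)
	AbortMultipartUpload(key, uploadID string) error
}

// abortOrphans aborts every in-progress upload under prefix and returns how
// many were aborted. It must only run after all uploads recorded in the upload
// manifests have been completed, so that anything still listed is an orphan.
func abortOrphans(c uploadCleaner, prefix string) (int, error) {
	uploads, err := c.ListMultipartUploads(prefix)
	if err != nil {
		return 0, fmt.Errorf("listing uploads under %q: %w", prefix, err)
	}
	for i, u := range uploads {
		if err := c.AbortMultipartUpload(u.Key, u.UploadID); err != nil {
			return i, fmt.Errorf("aborting upload %q for %q: %w", u.UploadID, u.Key, err)
		}
	}
	return len(uploads), nil
}

// memCleaner is an in-memory stand-in used only to exercise the sketch.
type memCleaner struct {
	pending []pendingUpload
}

func (m *memCleaner) ListMultipartUploads(prefix string) ([]pendingUpload, error) {
	var out []pendingUpload
	for _, u := range m.pending {
		if strings.HasPrefix(u.Key, prefix) {
			out = append(out, u)
		}
	}
	return out, nil
}

func (m *memCleaner) AbortMultipartUpload(key, uploadID string) error {
	for i, u := range m.pending {
		if u.Key == key && u.UploadID == uploadID {
			m.pending = append(m.pending[:i], m.pending[i+1:]...)
			return nil
		}
	}
	return fmt.Errorf("no upload %q for %q", uploadID, key)
}
```

Passing the prefix with a trailing separator (for example 'archive/repo/backup/') avoids matching a different backup whose name merely starts with the same characters.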