Details
- Type: Bug
- Resolution: Done
- Priority: Major
- Fix Version: 2.2.0
- Sprint: 4 - Operation Krack-down, 5 - Kraken Bug Fixes/Docs
- Story Points: 1
Description
Based on performance testing of 2.1.0-239 operator, it appears that operator backup/restore is very slow compared to bare-metal: https://hub.internal.couchbase.com/confluence/display/QA/Couchbase+Operator+Performance
There are three reasons for this:
1 - Kubernetes network/compute overhead
2 - EBS volume performance
3 - backup/restore pod thread settings
The first bottleneck to address is 3: the pods run with a default of 1 thread. The tools team made an improvement to auto-select the number of threads: https://issues.couchbase.com/browse/MB-35618. The flag is --auto-select-threads; however, it is not passed by the backup script used by the operator backup pod: https://github.com/couchbaselabs/couchbase-operator-backup/blob/master/backup.py#L899
That flag could be added in the pod's backup script, or a setting could be exposed via the CouchbaseCluster object to control the number of threads.
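A minimal sketch of the first option, adding the flag where the backup script builds the cbbackupmgr invocation. The function and variable names here are illustrative, not the actual identifiers in backup.py; only the --auto-select-threads flag itself comes from MB-35618.

```python
def build_backup_command(archive, repo, cluster, username, password,
                         auto_select_threads=True):
    """Build a cbbackupmgr backup invocation as an argument list.

    Hypothetical helper: backup.py structures this differently, but the
    change amounts to appending one flag to the command it already builds.
    """
    cmd = [
        "cbbackupmgr", "backup",
        "--archive", archive,
        "--repo", repo,
        "--cluster", cluster,
        "--username", username,
        "--password", password,
    ]
    if auto_select_threads:
        # MB-35618: let cbbackupmgr pick a thread count from the host's
        # CPU count instead of defaulting to a single thread.
        cmd.append("--auto-select-threads")
    return cmd
```

Exposing this as a boolean (or an explicit thread-count integer) on the cluster object would let users tune it per deployment instead of hard-coding it in the pod.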
Adding more threads should push throughput up to the EBS volume bottleneck. To push past that, operator backups could parallelize across multiple EBS volumes (PVCs). Otherwise, there is a hard cap of 250 MB/sec for operator backups, roughly 50% of bare-metal throughput.
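The arithmetic behind the multi-volume suggestion can be made explicit. The only figures assumed here are the ones above: a ~250 MB/sec per-volume EBS ceiling, with bare-metal at roughly twice that.

```python
# Per-volume throughput ceiling stated in the description (MB/sec).
EBS_VOLUME_CAP_MB_S = 250

def aggregate_cap(num_volumes):
    """Aggregate backup throughput ceiling when the backup is striped
    across num_volumes independent EBS-backed PVCs."""
    return num_volumes * EBS_VOLUME_CAP_MB_S
```

With one PVC the cap is 250 MB/sec (~50% of bare-metal); striping across two volumes would in principle match the bare-metal figure, assuming the pod's network and CPU are no longer the bottleneck.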