Details
Type: Task
Resolution: Unresolved
Priority: Major
Fix Version: Morpheus
Labels: None
Description
What's the issue?
We've tested our PiTR feature as a prototype, and we understand how it works and how it solves the problem. However, we should come up with some test cases which validate that it'll work as expected in production, and which determine the impact it has on performance (whether that's for backup/restore or KV).
As an example, we'd like to know:
1) In the worst cases, how much overhead do we incur from couchstore?
2) How much overhead do we have from simply storing duplicate documents?
3) How does our backup performance stand up? Can we cope with backups in the worst case?
4) When we restore these PiTR backups, how long is it going to take?
5) How does it impact read/write latency?
6) How does it affect intra-node communication e.g. rebalance/replication?
A good place to start with this type of testing might be:
1) Generating a standard size dataset, performing control tests so we have baselines
2) Generating a "sensible" PiTR dataset, performing the same tests
3) Generating an extreme PiTR dataset, performing the same tests
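To make the three dataset tiers concrete, it may help to parameterise them up front and estimate the worst-case on-disk footprint before generating anything. The sketch below is illustrative only: the doc counts, doc sizes, and retained-version counts are placeholder assumptions (not agreed test parameters), and `estimated_on_disk_bytes` naively assumes each retained PiTR version stores a full duplicate of the document, ignoring couchstore B-tree overhead.

```python
# Hypothetical scenario parameters for the one-pager; all numbers are
# placeholders to be agreed during test planning, not measured values.
SCENARIOS = {
    "baseline": {"docs": 10_000_000, "doc_bytes": 1024, "pitr_versions": 1},
    "sensible": {"docs": 10_000_000, "doc_bytes": 1024, "pitr_versions": 10},
    "extreme":  {"docs": 10_000_000, "doc_bytes": 1024, "pitr_versions": 100},
}

def estimated_on_disk_bytes(docs: int, doc_bytes: int, pitr_versions: int) -> int:
    """Naive worst-case estimate: every retained PiTR version is a full
    duplicate of the document body (ignores couchstore metadata/B-tree)."""
    return docs * doc_bytes * pitr_versions

for name, params in SCENARIOS.items():
    size = estimated_on_disk_bytes(**params)
    print(f"{name}: ~{size / 2**30:.1f} GiB")
```

Running the same workload (and the KPI collection below) against each tier would then give directly comparable numbers against the baseline.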
In terms of stats to gather/KPIs:
1) cbcollect_info (for backup/cluster)
2) couchstore fragmentation (data size, duplicated data size and "live" doc size)
3) Histograms for get/set (might be included in cbcollect_info)
4) Backup sizes (data size, duplicated data size and "live" doc size)
...
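For the fragmentation KPI, the figure could be derived from the sizes we already plan to record (total file size vs "live" doc size). A minimal sketch, assuming fragmentation is expressed the usual couchstore way as the share of the file not occupied by live documents; the function name and signature are illustrative:

```python
def fragmentation_pct(file_bytes: int, live_doc_bytes: int) -> float:
    """Percentage of the couchstore file that is not live document data
    (stale PiTR versions, duplicates, B-tree waste). Illustrative helper."""
    if file_bytes == 0:
        return 0.0
    return 100.0 * (file_bytes - live_doc_bytes) / file_bytes

# Example: a 1000-byte file holding 250 bytes of live docs is 75% fragmented.
print(fragmentation_pct(1000, 250))
```

Tracking this per tier (baseline vs sensible vs extreme) would show how much of the growth is duplicated PiTR data rather than live data.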
What's the fix?
Write up a "one pager" which outlines the testing we're going to do (i.e. what's stated above, but in more detail). We'll create a separate task to actually perform the testing.