Description
In XDCR, during checkpointing, the destination queries ep-engine stats as follows: Immediately after receiving a checkpoint request from the source, we read open_checkpoint_id and last_persisted_checkpoint_id from ep-engine, then wait until either last_persisted_checkpoint_id becomes equal to that open_checkpoint_id or a 10-second timeout occurs (in which case we log a warning and do not checkpoint). The idea is that checkpointing, unlike regular document updates, bypasses ep-engine and updates Couch directly, so we need to make sure it is "safe" to checkpoint. It is safe only after the open checkpoint id seen at the time the checkpoint request was received has been persisted.
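The wait described above can be sketched roughly as follows. This is an illustration only, not the actual XDCR code: `fetch_stats` is a hypothetical callable standing in for the ep-engine stats query, the 0.1 second poll interval is an assumed value, and the `<` comparison assumes checkpoint ids only ever increase.

```python
import time
from typing import Callable, Tuple


def wait_until_safe_to_checkpoint(
    fetch_stats: Callable[[], Tuple[int, int]],
    timeout_s: float = 10.0,
    poll_interval_s: float = 0.1,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once it is safe to checkpoint, False on timeout.

    fetch_stats is a hypothetical helper returning
    (open_checkpoint_id, last_persisted_checkpoint_id).
    """
    # Remember the open checkpoint id seen when the request arrived.
    target_open_id, persisted_id = fetch_stats()
    deadline = clock() + timeout_s
    # Assumption: ids are monotonically increasing, so "persisted >= target"
    # means the checkpoint seen at request time has reached disk.
    while persisted_id < target_open_id:
        if clock() >= deadline:
            return False  # caller logs a warning and skips the checkpoint
        sleep(poll_interval_s)
        _, persisted_id = fetch_stats()
    return True
```

Note that every replication stream running this loop independently issues its own series of stats requests, which is the polling volume questioned in this ticket.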
Recent runs of XDCR with 1024 vbuckets have revealed that we are hitting this timeout very frequently. This could be due to the following causes:
1. We issue far too many polling requests. It should suffice to query the stats once and share the result across all replication streams.
2. It is likely that ep-engine is actually taking far too long to serve the stats requests. This needs to be investigated and fixed if it turns out to be true.
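One way to address cause 1 is to put a small shared cache in front of the stats query so that all replication streams reuse a single recent result. The sketch below is only an assumption about how that could look: `SharedStatsPoller`, `fetch_all_stats`, and the 0.5 second `max_age_s` are hypothetical names and values, and the stats are assumed to come back as a mapping from vbucket id to (open_checkpoint_id, last_persisted_checkpoint_id).

```python
import threading
import time
from typing import Callable, Dict, Optional, Tuple

# Assumed shape: vbucket id -> (open_checkpoint_id, last_persisted_checkpoint_id)
Stats = Dict[int, Tuple[int, int]]


class SharedStatsPoller:
    """Serve all replication streams from one recent stats query."""

    def __init__(
        self,
        fetch_all_stats: Callable[[], Stats],  # hypothetical stats query
        max_age_s: float = 0.5,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_all_stats = fetch_all_stats
        self._max_age_s = max_age_s
        self._clock = clock
        self._lock = threading.Lock()
        self._cached: Optional[Stats] = None
        self._fetched_at = 0.0

    def get(self) -> Stats:
        # The lock serializes callers, so streams that arrive while a query
        # is in flight wait for it and then reuse its result.
        with self._lock:
            now = self._clock()
            stale = self._cached is None or now - self._fetched_at >= self._max_age_s
            if stale:
                self._cached = self._fetch_all_stats()
                self._fetched_at = now
            return self._cached
```

With this arrangement the number of stats requests depends on `max_age_s` rather than on the number of streams. It does not help with cause 2: if a single stats request is itself slow, that still needs to be investigated in ep-engine.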
Another approach to checkpointing that could improve performance is as follows: In 2.0, ep-engine supports a command to explicitly close the current open checkpoint and open a new one. Doing this once for all pending replication streams would improve checkpointing performance, as we would only have to wait until the last closed checkpoint is persisted.
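A rough sketch of this alternative is below. All names are hypothetical: `close_open_checkpoint` stands in for the ep-engine command mentioned above and is assumed to return the id of the checkpoint it just closed, `last_persisted_id` stands in for the stats query, and `write_checkpoint` stands in for the per-stream checkpoint write.

```python
import time
from typing import Callable, Iterable, List


def checkpoint_pending_streams(
    close_open_checkpoint: Callable[[], int],  # hypothetical: returns closed id
    last_persisted_id: Callable[[], int],      # hypothetical stats query
    pending_streams: Iterable[str],
    write_checkpoint: Callable[[str], None],   # hypothetical checkpoint write
    timeout_s: float = 10.0,
    poll_interval_s: float = 0.1,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Checkpoint every pending stream after a single close-and-wait.

    Returns the streams that were checkpointed (empty on timeout).
    """
    streams = list(pending_streams)
    if not streams:
        return []
    # Close the open checkpoint once on behalf of all pending streams.
    closed_id = close_open_checkpoint()
    deadline = clock() + timeout_s
    # Assumption: ids increase monotonically, so ">= closed_id" means persisted.
    while last_persisted_id() < closed_id:
        if clock() >= deadline:
            return []  # timed out: skip checkpointing for this round
        sleep(poll_interval_s)
    for stream in streams:
        write_checkpoint(stream)
    return streams
```

The point of the design is that the close and the wait happen once per batch, not once per stream, and the wait targets a checkpoint that is already closed.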