Look at your clusters. In your test case, both clusters have some front-end workload on top of ongoing bidirectional XDCR on the same bucket, and the XDCR workload is much heavier than the front-end workload. There is also little chance for ep_engine to deduplicate, since the front-end workloads use different key sets. The write queue on both clusters sits constantly at around 100K-500K items per node, while the disk drain rate is only 2-4K items/sec per node.

Today the XDCR checkpointing timeout is 10 seconds: if ep_engine is unable to persist the open checkpoint issued by XDCR within 10 seconds, we give up, skip this checkpoint, raise a "target commit failure" error, and move on without a checkpoint. In your test case we are apparently unable to persist the XDCR checkpoint within 10 seconds in most cases, which is why you see checkpoint failures in the UI on both sides. The root cause is that the drain rate cannot keep up with your combined workload (XDCR plus front-end).
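To make that timeout behavior concrete, here is a minimal illustrative sketch (Python pseudocode, not the actual ep_engine/XDCR implementation; the function names are hypothetical):

{code:python}
# Illustrative only: the 10-second checkpoint timeout described above.
import time

CHECKPOINT_TIMEOUT = 10  # seconds; the current XDCR default

def try_checkpoint(persist_open_checkpoint, poll_interval=0.5):
    """Wait up to CHECKPOINT_TIMEOUT for ep_engine to persist the open
    checkpoint; on timeout, skip it and report a target commit failure."""
    deadline = time.monotonic() + CHECKPOINT_TIMEOUT
    while time.monotonic() < deadline:
        if persist_open_checkpoint():  # hypothetical persistence probe
            return True                # checkpoint succeeded in time
        time.sleep(poll_interval)
    # Give up: raise "target commit failure" and continue replication
    # without a checkpoint, as described above.
    return False
{code}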
Per Chiyoung, ep_engine now has a priority checkpoint command, which is used in rebalance. However, the XDCR use case is a bit different, since we would need to issue 32 concurrent checkpoints per node to ep_engine. If XDCR issued these priority checkpoints directly, the performance impact on ep_engine, as well as on rebalance, is unknown to both Chiyoung and me (a rough sketch of the load pattern follows). Given the risk, it seems better to us to postpone this issue to post 2.0. Again, the root cause is that the disk drain rate is not fast enough for the workload in this test case.
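For context, the load pattern we worry about looks roughly like this (hypothetical sketch; the worker count matches the comment above, but the checkpoint call is a placeholder, not the real ep_engine API):

{code:python}
# Hypothetical sketch of 32 XDCR replicators per node, each issuing
# a priority checkpoint at the same time.
from concurrent.futures import ThreadPoolExecutor

NUM_REPLICATORS = 32  # concurrent checkpoint requests per node

def priority_checkpoint(replicator_id):
    """Placeholder for an ep_engine priority checkpoint call."""
    pass

with ThreadPoolExecutor(max_workers=NUM_REPLICATORS) as pool:
    # All 32 requests hit ep_engine simultaneously -- the unknown
    # performance impact that motivated postponing this change.
    futures = [pool.submit(priority_checkpoint, i)
               for i in range(NUM_REPLICATORS)]
    for f in futures:
        f.result()
{code}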
At this time, what you can do is increase the checkpoint interval and timeout (see the sketch after the list below for one way to apply them), say:
XDCR_CHECKPOINT_INTERVAL = 1800
XDCR_CAPI_CHECKPOINT_TIMEOUT = 60
That means we wait up to 60 seconds for a checkpoint once per 30 minutes (1800 seconds). The benefits are:
1) We increase the chance of a successful checkpoint. It does not make sense to keep issuing checkpoints that always fail.
2) The aggregate waiting time stays the same as before (60 seconds per 30 minutes), so there is no increase in XDCR's time overhead.
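If it helps, here is a hedged sketch of pushing those settings over REST. The endpoint and both parameter names are assumptions derived from the variable names above, not confirmed setting names; please verify against your build first:

{code:python}
# Hypothetical sketch: apply the suggested XDCR tuning to the cluster.
# Both parameter names below are guesses based on the variables above.
import requests

CLUSTER = "http://localhost:8091"
AUTH = ("Administrator", "password")  # placeholder credentials

resp = requests.post(
    CLUSTER + "/internalSettings",  # assumed endpoint
    auth=AUTH,
    data={
        "xdcrCheckpointInterval": 1800,   # checkpoint once per 30 min
        "xdcrCapiCheckpointTimeout": 60,  # wait up to 60 s (assumed name)
    },
)
print(resp.status_code)
{code}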
BTW, this should not be a blocker. The system is doing what it is supposed to do; we have simply reached the limits of the current design in your test case.