Copying Mike's helpful explanation from hidden ticket:
I have looked at this issue and similar ones, and there are two issues here, plus one bug that I need you to fix on the ns_server side.
The first issue here is that some of the XXX-use-case servers are heavily overloaded. In some of the logs I have looked at, I saw over 500k items in a lot of the checkpoint queues. As a result, server memory is overloaded with items that are waiting to be persisted to disk.
The stat that ns_server uses for the alerts is also not the correct stat. ep_overhead is used to keep track of the size of the checkpoint queues and other structures used by ep-engine. The correct stat to use is the sum (vb_active_meta_data + vb_replica_meta_data + vb_pending_meta_data). Switching the alert to that sum should resolve the issue. Also, note that the issue of heavy checkpointing overhead will benefit greatly from having multiple writers in our next release.
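
For illustration only, here is a rough Python sketch of the suggested calculation. It is not the ns_server code (which would make this change on its own side); the function name, the shape of the stats dictionary, and the sample values are all assumptions made up for this example. It only shows that the alert value should be built from the three *_meta_data stats and that ep_overhead should be left out.

```python
# Illustrative sketch only: names and sample values are hypothetical,
# not taken from ns_server or ep-engine source.

META_DATA_STATS = (
    "vb_active_meta_data",
    "vb_replica_meta_data",
    "vb_pending_meta_data",
)


def metadata_bytes(stats):
    """Return the sum of the three metadata stats named in the ticket.

    `stats` is assumed to be a mapping of stat name to a number or a
    numeric string. ep_overhead is deliberately ignored, since it tracks
    checkpoint queues and other ep-engine structures instead.
    """
    return sum(int(stats[name]) for name in META_DATA_STATS)


# Made-up sample values, chosen only to make the arithmetic easy to check.
sample_stats = {
    "vb_active_meta_data": "100",
    "vb_replica_meta_data": "50",
    "vb_pending_meta_data": "25",
    "ep_overhead": "9999",  # present, but not part of the sum
}

print(metadata_bytes(sample_stats))  # 175
```

Whatever threshold the alert compares this value against is outside the scope of the sketch and is left unchanged.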