Details
Description
the intra-replication queue grows to unacceptable limits, exposing dataloss of multiple seconds of queued replication.
the problem is more pronounced on the RightScale provision cluster, but can be seen on local physical clusters with long enough test run (>20min). recovery requires stopping input request queue.
initial measurements of the erlang process suggests that minor retries on scheduled network i/o eventually build up into a limit for push of replication data, scheduler_wait appears to be the consuming element, epoll_wait counter increases per measurement, as does the mean time wait, suggesting thrashing in the erlang event scheduler. there are various papers/presentations that suggest Erlang is sensitive to the balance of tasks (a mix of long event and short event can cause performance thruput issues).
cbcollectinfo logs will be attached shortly