Details

Type: Bug
Resolution: Won't Fix
Priority: Critical
Fix Version/s: 4.0.0
Affects Version/s: 3.0-Beta, 3.0, 3.0.1
Component/s: ns_server
Security Level: Public
Labels: None
Environment: 4-node cluster; 4-core nodes; beer-sample application run at 60Kops (50/50 ratio); nodes provisioned on RightScale EC2 x1.large images
Triage: Untriaged
Operating System: CentOS 64-bit
Is this a Regression?: Yes

Description

The intra-replication queue grows to unacceptable limits, exposing the cluster to loss of multiple seconds' worth of queued replication data.
The problem is more pronounced on the RightScale-provisioned cluster, but it can also be seen on local physical clusters given a long enough test run (>20 min). Recovery requires stopping the input request queue.
Initial measurements of the Erlang process suggest that minor retries on scheduled network I/O eventually build up into a limit on the push of replication data. scheduler_wait appears to be the dominant consumer: the epoll_wait counter increases with each measurement, as does the mean wait time, suggesting thrashing in the Erlang event scheduler. Various papers and presentations suggest that Erlang is sensitive to the balance of tasks (a mix of long and short events can cause throughput problems).
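
To make the exposure concrete, here is a minimal sketch of how a small, sustained gap between the write rate and the replication push rate turns into seconds of unreplicated data. The numbers are illustrative assumptions, not measurements from this cluster: 30,000 writes/sec assumes the 50/50 ratio means half of the 60K ops are writes, and the 29,000/sec drain rate is hypothetical. The function names are also made up for this example.

```python
def replication_backlog(write_rate, drain_rate, seconds):
    """Return the queued-item count after each one-second tick.

    write_rate: mutations/sec entering the replication queue (assumed constant)
    drain_rate: mutations/sec the replicator manages to push (assumed constant)
    """
    backlog = 0
    history = []
    for _ in range(seconds):
        # The queue grows by the shortfall each second and never goes negative.
        backlog = max(0, backlog + write_rate - drain_rate)
        history.append(backlog)
    return history


def exposure_seconds(backlog, write_rate):
    """Seconds of the most recent writes that are not yet replicated.

    Assumes a FIFO queue, so the queued items are the newest writes.
    """
    return backlog / write_rate


# Hypothetical figures: 30K writes/sec in, 29K/sec replicated, 20-minute run.
history = replication_backlog(write_rate=30_000, drain_rate=29_000, seconds=20 * 60)
final_backlog = history[-1]                           # 1,200,000 queued mutations
window = exposure_seconds(final_backlog, 30_000)      # 40.0 seconds at risk
```

Under these assumed rates, a shortfall of only about 3% leaves 40 seconds of writes unreplicated after 20 minutes, which matches the shape of the symptom described above: the queue only recovers once the input rate drops below the drain rate.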

cbcollect_info logs will be attached shortly.

People

Assignee: Thomas Anderson (Inactive)

Reporter: Thomas Anderson (Inactive)

Votes: 0

Watchers: 4

Dates

Created: 09/Sep/14 1:51 PM

Updated: 07/Aug/15 4:43 PM

Resolved: 08/Apr/15 9:00 PM

Intrareplication falls behind OPs causing data loss situation