  1. Couchbase Server
  2. MB-12157

Intra-replication falls behind ops, causing a data loss situation

Details

    • Bug
    • Resolution: Won't Fix
    • Critical
    • 4.0.0
    • 3.0-Beta, 3.0, 3.0.1
    • ns_server
    • Security Level: Public
    • None
    • 4 node cluster; 4 core nodes; beer-sample application run at 60Kops (50/50 ratio), nodes provisioned on RightScale EC2 x1.large images
    • Untriaged
    • Centos 64-bit
    • Yes

    Description

      The intra-replication queue grows to unacceptable limits, exposing multiple seconds of queued replication data to loss.
      The problem is more pronounced on the RightScale-provisioned cluster, but it can also be seen on local physical clusters given a long enough test run (>20 min). Recovery requires stopping the input request queue.
      Initial measurements of the Erlang process suggest that minor retries on scheduled network I/O eventually build up into a limit on how fast replication data can be pushed. scheduler_wait appears to be the dominant consumer of time, and the epoll_wait counter increases with each measurement, as does the mean wait time, which suggests thrashing in the Erlang event scheduler. Various papers and presentations suggest that Erlang is sensitive to the balance of tasks (a mix of long and short events can cause throughput problems).
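
      To make the exposure concrete, the following is a rough sketch (not a measurement from this cluster) of how a small, sustained gap between the write rate and the replication drain rate turns into seconds of unreplicated data. The function names are hypothetical. The 30,000 writes/s figure assumes that half of the reported 60K ops are writes, and the 29,500 items/s drain rate is invented purely for illustration.

```python
def replication_backlog(write_rate, drain_rate, seconds):
    """Simulate a queue filled at write_rate and drained at drain_rate.

    Rates are in items per second. Returns the backlog (in items) at the
    end of each simulated second. This is a toy model, not ns_server logic.
    """
    backlog = 0
    history = []
    for _ in range(seconds):
        # The backlog can never go below zero: an empty queue stays empty.
        backlog = max(0, backlog + write_rate - drain_rate)
        history.append(backlog)
    return history


def exposure_seconds(backlog, write_rate):
    """Approximate how many seconds of writes are still unreplicated."""
    if write_rate <= 0:
        return 0.0
    return backlog / write_rate


# Assumed numbers: 30,000 writes/s in, 29,500 items/s replicated out,
# observed over a 20-minute (1,200 s) run.
history = replication_backlog(write_rate=30_000, drain_rate=29_500, seconds=1_200)
print(history[-1])                            # items still queued
print(exposure_seconds(history[-1], 30_000))  # seconds of writes at risk
```

      In this model the queue only recovers once the write rate drops below the drain rate, which matches the observation that recovery requires stopping the input.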

      cbcollect_info logs will be attached shortly.

      Attachments

          People

            thomas Thomas Anderson (Inactive)