Couchbase Server / MB-3449

Recovering from multiple crashes in a large cluster left the system unusable

Details

    Description

      This is a higher-level tracking bug. I am attempting to gather logs, but the cluster has been wiped, so that may be difficult. The anecdotal scenario is as follows:
      - A 10-node cluster with ~1 billion keys (plus 1 replica) had been running fine for an extended period of time.
      - 5 nodes crashed due to bug MB-3443.
      - After the memcached processes finished warming up, replication started.
      - This is where things get hazier: when the cluster was inspected some time after replication started, it was clear that almost ALL items (replica and active) had been ejected to disk.

      We need to attempt to reproduce this behavior.

          People

            thuan Thuan Nguyen
            perry Perry Krug
