  1. Couchbase Server
  2. MB-8039

Failover is not quick when any node (including the one being failed over) is not responding


Details

    • Bug
    • Resolution: Fixed
    • Critical
    • 2.5.0
    • 2.0, 2.0.1
    • ns_server
    • Security Level: Public
    • Yes
    • 02/Sep/2013 - 20/Sep/2013

    Description

      See subject.

      This happens because janitor_agent can be stuck waiting for:

      *) tap connections "ping" (which we do in order to discover and clean up dead connections)

      *) a stuck vbucket filter change request (which is sent to the "other" side, i.e. a non-local memcached)

      The corresponding ebucketmigrator can get stuck there too.

      So the unresponsiveness of one node can cause this critical component on all other nodes to get stuck. We cannot activate any vbuckets without first stopping replication into them, and that requires:

      *) janitor_agent not being stuck

      *) corresponding ebucketmigrators not being stuck
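The stall described above comes from a caller waiting without bound on a peer that never answers. As a rough illustration only (this is not ns_server's Erlang code, and `call_with_deadline`, `ping_peer` and `unresponsive_peers` are hypothetical names invented for this sketch), a Python model of putting a deadline on each remote wait might look like this:

```python
import queue
import threading
import time


def call_with_deadline(fn, timeout_s):
    """Run fn() in a daemon thread and wait at most timeout_s seconds.

    Returns (True, result) if fn finished in time, (False, None) otherwise.
    Exceptions raised by fn are not propagated in this simplified sketch.
    """
    result = queue.Queue(maxsize=1)
    worker = threading.Thread(target=lambda: result.put(fn()), daemon=True)
    worker.start()
    try:
        return True, result.get(timeout=timeout_s)
    except queue.Empty:
        # The worker is still blocked; the caller moves on without it.
        return False, None


def ping_peer(delay_s):
    """Hypothetical stand-in for a remote request such as a connection ping."""
    time.sleep(delay_s)
    return "pong"


def unresponsive_peers(peer_delays, timeout_s):
    """Return the names of peers whose ping did not answer within timeout_s."""
    stuck = []
    for name, delay in peer_delays.items():
        ok, _ = call_with_deadline(lambda d=delay: ping_peer(d), timeout_s)
        if not ok:
            stuck.append(name)
    return stuck


if __name__ == "__main__":
    # "n2" simulates a node that takes far longer than the deadline to reply.
    print(unresponsive_peers({"n1": 0.0, "n2": 2.0}, timeout_s=0.2))
```

With a deadline in place, one slow peer costs the caller at most `timeout_s` rather than blocking it indefinitely. The abandoned request still has to be cleaned up somewhere, which this sketch does not attempt.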

      I've revisited this problem just now. Ideally, the fix will be made with support from the ep-engine side, which could be done as part of the UPR work.

      Without ep-engine support, the fix will require significant changes in ns_server, which are harder to make right now, particularly because of 1.8.x backwards compatibility support. It would be doable but would take at least several days of work.


          People

            andreibaranouski Andrei Baranouski
            alkondratenko Aleksey Kondratenko (Inactive)