Couchbase Server / MB-46011

Unsafe failover orchestrator after split brain


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not a Bug
    • Affects Version/s: Cheshire-Cat
    • Fix Version/s: 7.0.0
    • Component/s: ns_server
    • Environment: CB EE 7.0.0-5050

    Description

      Filing this after talking to Meni Hillel and Steve Watanabe this morning, based on the comment at:
      https://issues.couchbase.com/browse/MB-37842?focusedCommentId=494107&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-494107

       

      Steps to reproduce
      1. 3-node cluster: .215, .217, .219
      2. Split-brain the cluster around .217 by blocking traffic from .219 on .215 and vice versa (one way to do this is sketched below).
      So now we have two cluster halves:
      First half: .215 and .217, with .219 unresponsive
      Second half: .217 and .219, with .215 unresponsive
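      For reference, a minimal sketch of one way to create this asymmetric partition, assuming iptables is available on the nodes (the exact commands used in this run are not recorded in the ticket):

      # on .215: drop all traffic coming from .219
      iptables -A INPUT -s 172.23.105.219 -j DROP
      # on .219: drop all traffic coming from .215
      iptables -A INPUT -s 172.23.105.215 -j DROP
      # traffic to and from .217 is left untouched, so .217 still sees both halves;
      # to heal the partition later, delete the same rules with -D in place of -A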
      Now it appears that we can't fail over the .215 (orchestrator) node out of the cluster, even unsafely, as the following options all fail.

      Option 1: Unsafe failover request to .215/.217

      curl -v -X POST -u Administrator:password http://172.23.105.217:8091/controller/failOver -d 'otpNode=ns_1@172.23.105.215' -d 'allowUnsafe=true'

      will fail as:

      Unexpected server error: {error,
                                   {not_in_peers,'ns_1@172.23.105.215',
                                       ['ns_1@172.23.105.217',

      as .217 and .215 both think that the orchestrator (.215) is healthy

      Option 2: Unsafe failover request to .219

      curl -v -X POST -u Administrator:password http://172.23.105.219:8091/controller/failOver -d 'otpNode=ns_1@172.23.105.215' -d 'allowUnsafe=true'

      won't return a response, as the request ultimately gets routed to .215, which .219 cannot communicate with.

      Option 3: Do a regular hard failover of .215 by making a request to the .217 node

      curl -v -X POST -u Administrator:password http://172.23.105.217:8091/controller/failOver -d 'otpNode=ns_1@172.23.105.215' 

      will fail as:

      Failover exited with reason
          {{badmatch,
            {error,
             {no_quorum,
              [{required_quorum,
                [majority,
                 {majority,
                  {set,2,16,16,8,80,48,
                   {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                   {{[],[],[],[],[],[],[],
                     ['ns_1@172.23.105.217'],
                     [],
                     ['ns_1@172.23.105.219'],
                     [],[],[],[],[],[]}}}}]},
               {leases,
                ['ns_1@172.23.105.215',
                 'ns_1@172.23.105.217']}]}}},
           [{failover,deactivate_nodes,2,
      (Note that we can unsafely fail over .219; that works fine. The issue here is that we can't get .215 (the orchestrator) out of the cluster.)
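      For reference, the unsafe failover of .219 that does work is presumably a request along the same lines as the ones above, issued to .215 or .217 (the exact target node is an assumption, not recorded in the ticket):

      curl -v -X POST -u Administrator:password http://172.23.105.215:8091/controller/failOver -d 'otpNode=ns_1@172.23.105.219' -d 'allowUnsafe=true'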

      Logs attached

      Attachments


        Activity

          dfinlay Dave Finlay added a comment -

          Thanks for this, Sumedh Basarkod. These failures are an artifact of the fact that we still have an orchestrator, combined with the asymmetric network partition and the requirement that the nodes that survive a quorum failover be up.

          The following remedies should work in this situation, I would think.

          • user regularly fails over .219 from .215 or .217
          • user stops .215 and then, from .217 or .219, regularly fails over .215 (see the command sketch after this comment)
          • user stops .217 and, from .215, unsafely fails over .217 and .219, or, from .219, unsafely fails over .215 and .217

          Did you get a chance to test these?
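          A minimal command sketch of the second remedy, by analogy with the requests above (the service-stop command and the choice of .217 as the target node are assumptions, not taken from the actual run):

          # on .215: stop Couchbase Server so the remaining nodes see it as down
          systemctl stop couchbase-server

          # then, from .217 (or .219), request a regular hard failover of .215
          curl -v -X POST -u Administrator:password http://172.23.105.217:8091/controller/failOver -d 'otpNode=ns_1@172.23.105.215'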


          sumedh.basarkod Sumedh Basarkod added a comment -

          Thanks Dave Finlay. I was looking for a way to get .215 out of the cluster, leaving a 2-node cluster of .217 and .219, and the second remedy works for this (stopping .215 was the key, I believe). Resolving.

          dfinlay Dave Finlay added a comment -

          Thanks Sumedh.


          People

            Assignee: dfinlay Dave Finlay
            Reporter: sumedh.basarkod Sumedh Basarkod
            Votes: 0
            Watchers: 4
