  1. Couchbase Server
  2. MB-46011

Unsafe failover orchestrator after split brain

Details

    • Bug
    • Resolution: Not a Bug
    • Major
    • 7.0.0
    • Cheshire-Cat
    • ns_server
    • CB EE 7.0.0-5050

    Description

      Filed after discussing this with Meni Hillel and Steve Watanabe this morning, as a follow-up to this comment:
      https://issues.couchbase.com/browse/MB-37842?focusedCommentId=494107&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-494107

      Steps to reproduce
      1. Set up a 3-node cluster: .215, .217, .219
      2. Create a split brain around .217 by blocking traffic from .219 on .215 and vice versa
      This leaves two overlapping cluster halves:
      First half - .215 and .217, with .219 unresponsive
      Second half - .217 and .219, with .215 unresponsive
      It now appears that the orchestrator node (.215) cannot be failed over out of the cluster, with or without the unsafe option, because each of the following options fails.
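
The report does not say how the traffic in step 2 was blocked. One common approach is a pair of iptables DROP rules, one on each of the two nodes. The sketch below only prints the rules instead of applying them; the helper name `print_block_rule` is hypothetical, and the use of iptables is an assumption rather than the reporter's actual method.

```shell
#!/usr/bin/env bash
# print_block_rule PEER_IP
# Print (do not run) an iptables rule that drops inbound traffic from PEER_IP.
# Assumption: the partition was created with iptables; the report does not say.
print_block_rule() {
  echo "iptables -A INPUT -s $1 -j DROP"
}

# .215 drops traffic from .219, and .219 drops traffic from .215.
# .217 is left untouched, so it can still reach both nodes.
echo "on 172.23.105.215: $(print_block_rule 172.23.105.219)"
echo "on 172.23.105.219: $(print_block_rule 172.23.105.215)"
```

Running the printed rules as root on the named hosts would isolate .215 and .219 from each other while leaving .217 connected to both.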

      Option 1: Unsafe failover request to .215/.217

      curl -v -X POST -u Administrator:password http://172.23.105.217:8091/controller/failOver -d 'otpNode=ns_1@172.23.105.215' -d 'allowUnsafe=true'

      fails with:

      Unexpected server error: {error,
                                   {not_in_peers,'ns_1@172.23.105.215',
                                       ['ns_1@172.23.105.217',

      because .217 and .215 both consider the orchestrator (.215) healthy.

      Option 2: Unsafe failover request to .219

      curl -v -X POST -u Administrator:password http://172.23.105.219:8091/controller/failOver -d 'otpNode=ns_1@172.23.105.215' -d 'allowUnsafe=true'

      never returns a response, because the request is ultimately routed to .215, which .219 cannot reach.
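
Because the Option 2 request never returns, a client-side timeout keeps a reproduction script from hanging. The sketch below builds the same failover request with curl's `--max-time` option added and prints it rather than sending it. The helper name `failover_cmd` and the 30-second limit are hypothetical choices, not part of the original report.

```shell
#!/usr/bin/env bash
# failover_cmd HOST NODE [UNSAFE]
# Print the curl command for a failover request to HOST that targets NODE.
# UNSAFE=true appends allowUnsafe=true. --max-time 30 is an assumed limit.
failover_cmd() {
  local host=$1 node=$2 unsafe=${3:-false}
  local cmd="curl -sS --max-time 30 -X POST -u Administrator:password"
  cmd="$cmd http://$host:8091/controller/failOver -d otpNode=$node"
  if [ "$unsafe" = true ]; then
    cmd="$cmd -d allowUnsafe=true"
  fi
  echo "$cmd"
}

# Option 2 from the report, with a timeout so the call gives up after 30 s.
failover_cmd 172.23.105.219 ns_1@172.23.105.215 true
```

With the timeout in place, curl exits with a non-zero status once the limit is reached instead of waiting indefinitely.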

      Option 3: Regular hard failover of .215 via a request to the .217 node

      curl -v -X POST -u Administrator:password http://172.23.105.217:8091/controller/failOver -d 'otpNode=ns_1@172.23.105.215' 

      fails with:

      Failover exited with reason {{badmatch,
                                    {error,
                                     {no_quorum,
                                      [{required_quorum,
                                        [majority,
                                         {majority,
                                          {set,2,16,16,8,80,48,
                                           {[],[],[],[],[],[],[],[],[],[],[],[],[],
                                            [],[],[]},
                                           {{[],[],[],[],[],[],[],
                                             ['ns_1@172.23.105.217'],
                                             [],
                                             ['ns_1@172.23.105.219'],
                                             [],[],[],[],[],[]}}}}]},
                                       {leases,
                                        ['ns_1@172.23.105.215',
                                         'ns_1@172.23.105.217']}]}}},
                                   [{failover,deactivate_nodes,2,

      (Note that an unsafe failover of .219 works fine. The issue here is that the orchestrator, .215, cannot be removed from the cluster.)
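
One plausible reading of the no_quorum error in Option 3 is that two majorities are required: a majority of all three nodes, and a majority of the set {.217, .219} that would remain after .215 is failed over. The leases cover only .215 and .217, so the first check passes and the second does not. The sketch below illustrates that arithmetic only. The helper `has_majority` is hypothetical, and the interpretation of `required_quorum` is an assumption, not a description of the actual ns_server logic.

```shell
#!/usr/bin/env bash
# has_majority MEMBERS LEASES
# Succeed if strictly more than half of the space-separated MEMBERS
# also appear in the space-separated LEASES list.
has_majority() {
  local members=$1 leases=$2 total=0 held=0 m
  for m in $members; do
    total=$((total + 1))
    case " $leases " in
      *" $m "*) held=$((held + 1)) ;;
    esac
  done
  [ $((held * 2)) -gt "$total" ]
}

all_nodes="ns_1@172.23.105.215 ns_1@172.23.105.217 ns_1@172.23.105.219"
remaining="ns_1@172.23.105.217 ns_1@172.23.105.219"  # set shown in the error
leases="ns_1@172.23.105.215 ns_1@172.23.105.217"     # leases shown in the error

if has_majority "$all_nodes" "$leases"; then
  echo "majority of all nodes: satisfied"
fi
if ! has_majority "$remaining" "$leases"; then
  echo "majority of remaining nodes: not satisfied"
fi
```

Under this reading, 2 of 3 leases satisfy the first majority, but only 1 of the 2 remaining nodes holds a lease, which falls short of a strict majority.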

      Logs attached

          People

            dfinlay Dave Finlay
            sumedh.basarkod Sumedh Basarkod (Inactive)