Couchbase Server / MB-37842

Partition-failover of the first node in the cluster fails with "500 Internal server error; config sync failed"


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Not a Bug
    • Affects Version: Cheshire-Cat
    • Fix Version: 7.1.0
    • Component: ns_server

    Description

      Steps to reproduce this bug are as follows:

      1. Create a 4-node cluster (10.112.194.101, 10.112.194.102, 10.112.194.103, 10.112.194.104, with 10.112.194.101 being the node that initiates the cluster creation).
      2. Isolate nodes 10.112.194.101 and 10.112.194.102 from each other. This introduces a network partition such that the two nodes cannot communicate with each other but can still communicate with all other nodes (a cleanup sketch for undoing the partition follows the steps).
        This can be done by executing the following commands on node 1 and node 2 respectively:

      iptables -A INPUT -s 10.112.194.102 -j DROP 

      iptables -A INPUT -s 10.112.194.101 -j DROP

       3. Hard fail over the first node with a REST call to the third node, e.g. from node 1 execute:

      curl -v -X POST -u Administrator:password http://10.112.194.103:8091/controller/failOver -d 'otpNode=ns_1@10.112.194.101'

      The failover fails with the above-mentioned error. Screenshots are attached.
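
      For reference, the partition introduced in step 2 can be reverted after the test by deleting the same iptables rules; this is a minimal cleanup sketch, assuming the rules were added exactly as shown above:

      # On node 1 (10.112.194.101): stop dropping traffic from node 2
      iptables -D INPUT -s 10.112.194.102 -j DROP

      # On node 2 (10.112.194.102): stop dropping traffic from node 1
      iptables -D INPUT -s 10.112.194.101 -j DROP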

      I originally found this intermittent bug when running the jepsen-durability-misc-daily-new tests (http://qa.sc.couchbase.com/job/jepsen-durability-misc-daily-new/) for the partition-failover workload, in runs where the failed-over node happens to be the first node in the cluster.
      The nemesis crashes because the failover fails with the above-mentioned error, and the test result is reported as "unknown". The config for that workload is as follows:

      workload=partition-failover,node-count=6,replicas=2,no-autofailover,kv-timeout=30,durability=0:100:0:0

      Note that this may fail or succeed depending on whether the failed-over node is the first node of the cluster or not.
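
      A quick way to confirm that node 1 was not actually failed over is to inspect the node list on an unpartitioned node. This is an illustrative sketch using the standard /pools/default endpoint and jq; it is not part of the original test:

      # Run against any node that can see the whole cluster, e.g. 10.112.194.103
      curl -s -u Administrator:password http://10.112.194.103:8091/pools/default | \
        jq '.nodes[] | {hostname, status, clusterMembership}'

      # A successfully hard-failed-over node shows clusterMembership "inactiveFailed";
      # after the 500 error, .101 is expected to still show "active".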


        Activity

          Richard deMellow added a comment:

          Sumedh Basarkod I don't believe this is a bug with the Jepsen tests, unless it's a misunderstanding of how the cluster config gets propagated during hard failover, which, in my understanding, should not need to be synced to all nodes in the cluster before the node is expelled from the topology by the master.

          Dave Finlay can the ns_server team give some input on whether this is a valid bug or a misunderstanding in the test?

          Aliaksey Artamonau (Inactive) added a comment:

          Node .101 is still the orchestrator node even with the introduced partition, since it can maintain leases with a majority of the cluster, so the failover request is directed to .101. Because of the weaknesses of ns_config, we do require the orchestrator to synchronize the config with all non-failed nodes before proceeding with the failover, and that is exactly what happens here. Note that even before the durability changes the behavior was more or less the same: you wouldn't get an explicit error, but vbucket activation would not happen until the orchestrator could talk to all non-failed nodes.

          This behavior is obviously undesirable, but it cannot be addressed until we move to the new quorum-based metadata system and change the failover code itself not to require all remaining nodes to be healthy (aka partial janitoring).
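
          For anyone reproducing this, one way to confirm which node currently holds the orchestrator role is the diag/eval endpoint. This is a common diagnostic call shown here as an illustrative sketch (it requires Full Admin credentials and diag/eval to be enabled), not something from the original report:

          # Ask any reachable node which node is running the ns_orchestrator process
          curl -s -u Administrator:password http://10.112.194.103:8091/diag/eval \
            -d 'node(global:whereis_name(ns_orchestrator)).'
          # Per the comment above, this should still point at 'ns_1@10.112.194.101'
          # despite the partition, since .101 keeps its lease with the majority.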
          Meni Hillel (Inactive) added a comment (edited):

          In the case of an orchestrator failover, the failed-over node needs to reach all of the remaining nodes to successfully complete the failover. In this case, the orchestrator is asked to fail over itself. With chronicle, we can relax the janitor to require only a majority instead. Thus, moving to cc.next.

          Sumedh Basarkod added a comment:

          Meni Hillel / Dave Finlay, this may be unrelated, but would the same be expected to work now in CC if we did an unsafe failover (instead of a regular hard failover)? I.e., if we have three nodes A, B, C and we introduce a split-brain partition around node B such that A and C cannot communicate with each other, we end up with two halves:
          First half: nodes A and B
          Second half: nodes B and C
          If we now wished to fail over the orchestrator node (A), we would have to do an unsafe failover, and I think that wouldn't work either.
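
          For reference, quorum-loss ("unsafe") failover in CC is exposed through the same REST endpoint with an extra flag. The call below is an illustrative sketch based on the documented allowUnsafe parameter; the placeholder host names are hypothetical and this was not run as part of this ticket:

          # Issue an unsafe hard failover of node A from a node on the surviving side
          curl -v -X POST -u Administrator:password \
            http://<surviving-node>:8091/controller/failOver \
            -d 'otpNode=ns_1@<node-A>' -d 'allowUnsafe=true'
          # After an unsafe failover, the nodes on the losing side typically have to be
          # re-initialized and added back to the cluster rather than simply rejoining.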

          People

            Assignee: Sumedh Basarkod
            Reporter: Sumedh Basarkod

