  1. Couchbase Server
  2. MB-5020

Rebalance state incorrectly reported as running even when it's not and user is unable to stop it or fail over/add nodes

    Details

      Description

      One of our customers had the master node fail in the middle of a rebalance. As a result, the rebalance was aborted, but the ns_config flag that marks a rebalance as running was still set.

      What's most notable is that the UI disallows many actions while a rebalance is running. So the UI incorrectly thought that a rebalance was running and did not allow the broken node to be failed over.

      Stop rebalance didn't work either, because no rebalance was actually running.

      The customer had to manually reset the rebalance state via a /diag/eval snippet that sets the rebalance_status config variable. I recommended something like this:

      ns_config:set(rebalance_status, {node, <<"stopped by human">>}).
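      As a rough illustration of how such a snippet could be submitted, here is a minimal Python sketch that builds (but does not send) a POST to /diag/eval with the expression as the request body. The host, port 8091, credentials, and the helper name build_diag_eval_request are assumptions made for the example and do not come from this ticket.

```python
import base64
import urllib.request


def build_diag_eval_request(host, user, password, erlang_expr, port=8091):
    """Build (without sending) a POST to /diag/eval carrying an Erlang expression.

    Hypothetical helper: host, port and credentials are placeholders.
    """
    url = f"http://{host}:{port}/diag/eval"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=erlang_expr.encode("utf-8"),          # the expression is the raw body
        headers={"Authorization": f"Basic {token}"},
        method="POST",
    )


# The expression quoted in the description, unchanged.
EXPR = 'ns_config:set(rebalance_status, {node, <<"stopped by human">>}).'

req = build_diag_eval_request("127.0.0.1", "Administrator", "password", EXPR)
# To actually send it (on a cluster you administer): urllib.request.urlopen(req)
```

      Keeping request construction separate from sending makes it easy to inspect exactly what would be evaluated on the node before running it.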

      It's notable that 1.8.0 actually has code to clean up stale rebalance status, but it is only triggered when all nodes are healthy, which was not the case for this customer.

      So the decision was to clear the rebalance status whenever asked, but to warn the user if our orchestrator is clearly not running a rebalance. The warning is needed because, under a network partition, some other partition may still have an old orchestrator that is trying to run the rebalance.
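      The decision described above can be sketched as a small state check. This is only a Python model of the intended behaviour, not the ns_server implementation (which is Erlang); the class RebalanceView, the function handle_stop_request, and the outcome strings are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class RebalanceView:
    """Hypothetical snapshot of what the local orchestrator knows."""
    status_says_running: bool   # the stored rebalance status flag
    rebalancer_alive: bool      # is our orchestrator really running a rebalance?


def handle_stop_request(view: RebalanceView, user_confirmed: bool = False) -> str:
    """Model of 'stop rebalance' after the fix (assumed outcome names)."""
    if not view.status_says_running:
        return "nothing_to_stop"
    if view.rebalancer_alive:
        # Normal case: a real rebalance is running here, so stop it.
        view.rebalancer_alive = False
        view.status_says_running = False
        return "stopped"
    if not user_confirmed:
        # Flag is set but we run no rebalance: another partition's
        # orchestrator might, so ask the user before clearing.
        return "warn_user"
    view.status_says_running = False
    return "stale_status_cleared"


stale = RebalanceView(status_says_running=True, rebalancer_alive=False)
first = handle_stop_request(stale)                        # warns first
second = handle_stop_request(stale, user_confirmed=True)  # then clears
```

      The point of the two-step flow is that a stale flag can always be cleared, but never silently.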


        Activity

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Fix merged as a bunch of commits

        thuan Thuan Nguyen added a comment -

        Integrated in github-ns-server-2-0 #329 (See http://qa.hq.northscale.net/job/github-ns-server-2-0/329/)
        Store rebalancer PID in config. MB-5020 (Revision 2a05fa0ebebd94822f7ab42ea12ebb849a7216b5)
        Drop rebalance status even when rebalance isn't running. MB-5020 (Revision 30e1951fe98f97b8976e03774db284bdfd9d3906)
        Add stopRebalanceIsSafe to pool details. MB-5020 (Revision 2671ab08b868b1ba908e35ef13d8d27b2f540d43)
        Warn user on unsafe rebalance stop attempt. MB-5020 (Revision 6d653e9d0f4043ac4c0bf84c6857831df675ec6b)

        Result = SUCCESS
        Aliaksey Kandratsenka :
        Files :

        • src/ns_janitor.erl
        • src/ns_orchestrator.erl

        Aliaksey Kandratsenka :
        Files :

        • src/ns_janitor.erl
        • src/ns_orchestrator.erl

        Aliaksey Kandratsenka :
        Files :

        • src/ns_cluster_membership.erl
        • src/menelaus_web.erl

        Aliaksey Kandratsenka :
        Files :

        • priv/public/js/servers.js
        • priv/public/index.html
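        The commits above add a stopRebalanceIsSafe field to pool details and a UI warning. A client-side check might look roughly like the Python sketch below. The function name, the sample JSON, and the choice to treat a missing field as safe are assumptions made for illustration; only the field name comes from the commit message.

```python
import json


def stop_needs_confirmation(pool_details: dict) -> bool:
    """Return True when the user should be warned before stopping rebalance.

    Assumption: a missing field is treated as 'safe', e.g. for servers
    that predate this fix.
    """
    return pool_details.get("stopRebalanceIsSafe", True) is False


# Hypothetical pool-details fragments; real responses carry many more fields.
unsafe = json.loads('{"rebalanceStatus": "running", "stopRebalanceIsSafe": false}')
safe = json.loads('{"rebalanceStatus": "running", "stopRebalanceIsSafe": true}')

warn_for_unsafe = stop_needs_confirmation(unsafe)
warn_for_safe = stop_needs_confirmation(safe)
```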

          People

          • Assignee: Aliaksey Artamonau
          • Reporter: Aleksey Kondratenko (Inactive)