Couchbase Server / MB-43353

address issue with dist_manager crashing during rename

Details

    • Triaged
    • Yes

    Description

      As described in the comments here:
      http://review.couchbase.org/c/ns_server/+/135827

      I think this is still racy.

      If the supervisor managing root_sup is busy with something when dist_manager crashes during a rename, or if some of the ns_server_cluster_sup processes take a long time to terminate, that may give ns_node_disco enough time to process the DOWN message and self-eject.

      I don't quite know what to do about the two rename-related changes. I can see how they narrow the window for some races. But neither solves the problem in its entirety, so it's hard to say whether we end up in a better place overall.

      To clarify a little: it's easier for me to convince myself that the previous change (ns_config checking for rename in init) strictly improves the state of affairs. It's harder to come to the same conclusion about this change.

      A quick (but not so clean) way to make the situation better would be for processes like ns_node_disco to check the termination reason of the renaming transaction. If it's 'normal', then assume everything went fine. Otherwise, terminate the process and let the logic in the init() function deal with it. One problem with this, though, is that ns_node_disco might start monitoring the renaming process too late to get any reason other than 'noproc'.
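
A minimal sketch of the reason check proposed above, written as a language-neutral model in Python rather than as actual ns_server code. The function name and the returned action strings are hypothetical. It illustrates why 'noproc' is awkward: the watcher cannot tell a clean finish it missed from a crash, so under this rule it restarts in both cases.

```python
def action_on_rename_down(reason: str) -> str:
    """Model of a watcher reacting to the renaming process going down.

    `reason` stands for the exit reason carried by the DOWN message.
    The return values are hypothetical labels, not ns_server behaviour.
    """
    if reason == "normal":
        # The rename finished cleanly; keep running.
        return "continue"
    # Any other reason, including 'noproc' (the process was already gone
    # when the monitor was set up), is treated as a possible failure:
    # terminate and let init() sort out the state on restart.
    return "terminate_and_reinit"


if __name__ == "__main__":
    for reason in ("normal", "killed", "noproc"):
        print(reason, "->", action_on_rename_down(reason))
```

Note that in this model a late monitor turns every successful rename into a restart, which is the cost of the 'noproc' ambiguity described above.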

      A potential solution might be to monitor dist_manager from ns_node_disco and terminate ns_node_disco immediately if dist_manager crashes.
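
A rough model of the monitoring idea, in Python and not taken from ns_server. The event tuples, the function name and the action strings are all assumptions made for illustration. It treats the process mailbox as an ordered list of DOWN events and stops as soon as a non-normal DOWN from dist_manager is seen.

```python
def run_mailbox(events):
    """Process (source, reason) DOWN events in order; return actions taken.

    Hypothetical model: "stop" means the watcher terminates immediately,
    "handle_rename_down" means it went on to process the rename DOWN
    (the step that, per the description, could lead to a self-eject).
    """
    actions = []
    for source, reason in events:
        if source == "dist_manager" and reason != "normal":
            actions.append("stop")
            break  # a stopped process handles nothing further
        if source == "rename_txn":
            actions.append("handle_rename_down")
    return actions


if __name__ == "__main__":
    # dist_manager DOWN is seen first: the rename DOWN is never handled.
    print(run_mailbox([("dist_manager", "killed"), ("rename_txn", "noproc")]))
    # Reverse order: the rename DOWN is handled before the stop.
    print(run_mailbox([("rename_txn", "noproc"), ("dist_manager", "killed")]))
```

In this model the benefit depends on the dist_manager DOWN being handled before the rename DOWN, so the ordering of the two messages would still need to be considered.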


            People

              navdeep.boparai Navdeep Boparai
              artem Artem Stemkovski