Details
Bug
Resolution: Fixed
Major
Cheshire-Cat
Triaged
Yes
Description
As described in the comments here:
http://review.couchbase.org/c/ns_server/+/135827
I think this is still racy.
If the supervisor managing root_sup is busy with something when dist_manager crashes during a rename, or if some of the ns_server_cluster_sup processes take a long time to terminate, ns_node_disco may have enough time to process the DOWN message and self-eject.
I don't quite know what to do about the two rename-related changes. I can see how they narrow the window for some races, but neither solves the problem in its entirety, so it's hard to say whether we end up in a better place overall.
To clarify a little: it's easier for me to convince myself that the previous change (ns_config checking for rename in init) strictly improves the state of affairs. It's harder to come to the same conclusion about this change.
A quick (but not so clean) way to make the situation better would be for processes like ns_node_disco to check the termination reason of the renaming transaction. If it's 'normal', then assume everything went fine. Otherwise, terminate the process and let the logic in the init() function deal with it. One problem with this, though, is that ns_node_disco might start monitoring the renaming process too late to get any reason other than 'noproc'.
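The decision rule in the previous paragraph can be sketched as a small, language-neutral model. This is Python rather than ns_server's Erlang, and the names `Action` and `on_rename_down` are hypothetical. Treating 'noproc' as a failure is an assumption made here for illustration; the ticket only notes that 'noproc' carries no information about how the rename ended.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"    # rename finished cleanly, keep running
    TERMINATE = "terminate"  # exit and let init() re-check the rename state


def on_rename_down(reason: str) -> Action:
    """Decide what a watcher such as ns_node_disco does when the renaming
    process it monitors goes down with the given exit reason."""
    if reason == "normal":
        return Action.CONTINUE
    # Assumption: any other reason, including 'noproc', is handled
    # conservatively. 'noproc' only says the process was already gone when
    # the monitor was set up, so success cannot be inferred from it.
    return Action.TERMINATE
```

Under this rule a late monitor always ends in a restart, which is safe but may restart the process even after a successful rename.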
A potential solution might be to monitor dist_manager from ns_node_disco and terminate ns_node_disco immediately if dist_manager crashes.
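A minimal sketch of that idea, again as a Python model rather than actual ns_server code. `Down` stands in for an Erlang 'DOWN' monitor message, and the return values loosely mimic gen_server's stop/noreply convention. `Down`, `handle_info` and the `dist_manager_down` stop reason are hypothetical names, and stopping on every exit of dist_manager (not only abnormal ones) is an assumption.

```python
from typing import Any, NamedTuple, Tuple


class Down(NamedTuple):
    """Model of a monitor 'DOWN' message: which monitor fired, and why."""
    ref: int
    reason: str


def handle_info(msg: Any, dist_manager_ref: int) -> Tuple[str, Any]:
    """Model of how ns_node_disco could react to incoming messages when it
    holds a monitor on dist_manager (identified by dist_manager_ref)."""
    if isinstance(msg, Down) and msg.ref == dist_manager_ref:
        # dist_manager is gone: stop right away instead of going on to
        # process node-down events that could lead to a self-eject.
        return ("stop", ("dist_manager_down", msg.reason))
    # Every other message is handled as before.
    return ("noreply", None)
```

For example, `handle_info(Down(1, "killed"), 1)` yields a stop, while a DOWN for an unrelated monitor reference is ignored by this clause.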