Details
Description
When node-to-node encryption is enabled, a node rename can fail with the error shown below. The failure usually causes node addition to fail, and it is hard to reproduce.
{function_clause,
 [{dist_manager,decode_status,
   [{error,
     {{shutdown,
       {failed_to_start_child,ssl_dist_sup,
        {already_started,<0.218.0>}}},
      {child,undefined,net_sup_dynamic,
       {erl_distribution,start_link,
        [#{clean_halt => false,name => 'n_0@127.0.0.1',
           name_domain => longnames,net_tickintensity => 4,
           net_ticktime => 60,supervisor => net_sup_dynamic}]},
       permanent,false,1000,supervisor,
       [erl_distribution]}}}],
   [{file,"src/dist_manager.erl"},{line,220}]},
  {dist_manager,bringup,2,[{file,"src/dist_manager.erl"},{line,251}]},
  {dist_manager,do_adjust_address,4,
   [{file,"src/dist_manager.erl"},{line,357}]},
  {async,'-async_init/4-fun-1-',3,[{file,"src/async.erl"},{line,199}]}]}
In short, ssl_dist_sup cannot be started because it is reported as already started. In reality this is not a second start: the previous instance simply has not finished stopping yet. During a rename we stop distribution before starting it again. That stop is synchronous and is supposed to wait for all distribution processes to exit (the stop is actually a stop of the net_sup_dynamic supervisor).
The problem is that the shutdown timeout for net_sup_dynamic is only 1 second. If the whole net_kernel:stop() procedure takes longer than that, the supervisor is simply killed while its children are still running. The children all stop eventually, but if we try to restart distribution immediately we get already_started.
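One possible way to avoid the race on the caller's side is to wait until the old ssl_dist_sup process is really gone before starting distribution again. The sketch below is only an illustration, not the fix adopted for this ticket: the module name dist_wait and the function wait_for_exit/2 are hypothetical, and it assumes ssl_dist_sup is a locally registered name (which the already_started error suggests).

```erlang
-module(dist_wait).
-export([wait_for_exit/2]).

%% Hypothetical helper (not part of dist_manager): block until the process
%% registered locally as Name has exited, or until TimeoutMs elapses.
%% Returns ok if the process is gone, {error, timeout} otherwise.
wait_for_exit(Name, TimeoutMs) ->
    case erlang:whereis(Name) of
        undefined ->
            %% Nothing is registered under Name, so there is nothing to wait for.
            ok;
        Pid ->
            %% A monitor delivers 'DOWN' even if Pid died just before this call.
            MRef = erlang:monitor(process, Pid),
            receive
                {'DOWN', MRef, process, Pid, _Reason} ->
                    ok
            after TimeoutMs ->
                %% Drop the monitor and flush a late 'DOWN' message, if any.
                erlang:demonitor(MRef, [flush]),
                {error, timeout}
            end
    end.
```

Under these assumptions, a caller could run net_kernel:stop(), then dist_wait:wait_for_exit(ssl_dist_sup, 10000), and only then start distribution again. The 10 second value is arbitrary.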
net_sup_dynamic spec:
start(Opts) ->
    C = #{id => net_sup_dynamic,
          start => {?MODULE,start_link,[Opts#{clean_halt => false,
                                              supervisor => net_sup_dynamic}]},
          restart => permanent,
          shutdown => 1000,
          type => supervisor,
          modules => [erl_distribution]},
    supervisor:start_child(kernel_sup, C).
Issue Links
- is triggering: MB-58652 Rebuild erlang packages (caused by MB-58630), Closed