Description
Here's what I saw:
{<0.738.0>,
[
,
,
{initial_call,{proc_lib,init_p,5}},
{backtrace,
[<<"Program counter: 0xf7393db0 (gen:do_call/4 + 304)">>,
<<"CP: 0x00000000 (invalid)">>,<<"arity = 0">>,<<>>,
<<"0xef337a74 Return addr 0xf1c9d738 (gen_server:call/2 + 64)">>,
<<"y(0) #Ref<0.0.0.16323>">>,
<<"y(1) 'n_0@10.17.2.163'">>,<<"y(2) []">>,
<<"y(3) 5000">>,<<"y(4) get_nodes">>,
<<"y(5) '$gen_call'">>,<<"y(6) <0.1830.0>">>,
<<>>,
<<"0xef337a94 Return addr 0xef4588c0 (ns_doctor:get_nodes/0 + 44)">>,
<<"y(0) get_nodes">>,<<"y(1) ns_doctor">>,
<<"y(2) Catch 0xf1c9d738 (gen_server:call/2 + 64)">>,
<<>>,
<<"0xef337aa4 Return addr 0xef44c628 (cb_replication:supported_mode/0 + 20)">>,
<<"y(0) Catch 0xef4588d0 (ns_doctor:get_nodes/0 + 60)">>,
<<>>,
<<"0xef337aac Return addr 0xef44c5e4 (cb_replication:node_replicator_triples/2 + 36)">>,
<<"y(0) []">>,<<>>,
<<"0xef337ab4 Return addr 0xef481144 (failover_safeness_level:'-build_local_safeness_info_new/">>,
<<"y(0) 'n_0@10.17.2.163'">>,
<<"y(1) \"default\"">>,<<>>,
<<"0xef337ac0 Return addr 0xef4806d0 (failover_safeness_level:build_local_safeness_info_new/1 ">>,
<<"y(0) []">>,<<"y(1) \"default\"">>,<<>>,
<<"0xef337acc Return addr 0xef480644 (failover_safeness_level:build_local_safeness_info/1 + 28">>,
<<"y(0) [
]">>,<<>>,
<<"0xef337ad4 Return addr 0xef46909c (ns_heart:current_status/1 + 484)">>,
<<"y(0) [\"default\"]">>,<<>>,
<<"0xef337adc Return addr 0xef468974 (ns_heart:handle_call/3 + 100)">>,
<<"y(0) []">>,<<"y(1) []">>,
<<"y(2) []">>,<<"y(3) []">>,
<<"y(4) [
,
{curr_items_tot,0},
{vb_replica_curr_items,0}]">>,
<<"y(5) []">>,<<"y(6) 1">>,
<<"y(7) [
<<"y(8) [{cpu_idle_ms,7900},{cpu_local_ms,8010},{cpu_utilization_rate,1.373283e+00},{mem_a">>,
<<>>,
<<"0xef337b04 Return addr 0xf1c9fadc (gen_server:handle_msg/5 + 148)">>,
<<"y(0) {state,undefined,[{meminfo,<<1170 bytes>>}
,{system_memory_data,[{system_total_memo">>,
<<>>,
<<"0xef337b0c Return addr 0xf73a1aa0 (proc_lib:init_p_do_apply/3 + 28)">>,
<<"y(0) ns_heart">>,
<<"y(1) {state,undefined,[
,{system_memory_data,[{system_total_memo">>,
<<"y(2) ns_heart">>,<<"y(3) <0.696.0>">>,
<<"y(4) status">>,
<<"y(5) {<0.1834.0>,{#Ref<0.0.0.16279>,'n_0@10.17.2.163'}}">>,
<<"y(6) Catch 0xf1c9fadc (gen_server:handle_msg/5 + 148)">>,
<<>>,
<<"0xef337b2c Return addr 0x0827ae54 (<terminate process normally>)">>,
<<"y(0) Catch 0xf73a1ab0 (proc_lib:init_p_do_apply/3 + 44)">>,
<<>>]},
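This is right when ns_doctor is starting and grabbing the initial status.
So ns_doctor is calling ns_heart (the status call) while ns_heart, quite subtly, calls back into ns_doctor: ns_heart:current_status/1 goes through failover_safeness_level and cb_replication:supported_mode/0 into ns_doctor:get_nodes/0, which is a gen_server:call with a 5000 ms timeout. ns_doctor cannot answer get_nodes while it is itself blocked waiting on ns_heart, so the status request stalls until that inner call gives up.
This is that 2.0 issue of slow/failing node joins.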
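
To make the cycle concrete, here is a minimal sketch of the same pattern, not the actual ns_server code. The module name call_cycle_demo and the registered names heart and doctor are hypothetical stand-ins for ns_heart and ns_doctor, and the 200 ms inner timeout is an assumption chosen to keep the demo short (the trace above shows 5000).

```erlang
-module(call_cycle_demo).
-behaviour(gen_server).

-export([run/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% Starts "heart", then "doctor". doctor's init/1 calls heart, and heart's
%% handler calls doctor back, so doctor's startup is delayed by heart's
%% inner call timeout.
run() ->
    {ok, _} = gen_server:start({local, heart}, ?MODULE, heart, []),
    T0 = erlang:monotonic_time(millisecond),
    {ok, _} = gen_server:start({local, doctor}, ?MODULE, doctor, []),
    Elapsed = erlang:monotonic_time(millisecond) - T0,
    Status = gen_server:call(doctor, initial_status),
    ok = gen_server:stop(doctor),
    ok = gen_server:stop(heart),
    {Status, Elapsed}.

init(heart) ->
    {ok, heart};
init(doctor) ->
    %% Stand-in for ns_doctor grabbing the initial status while starting.
    Status = gen_server:call(heart, status, 2000),
    {ok, {doctor, Status}}.

handle_call(status, _From, heart) ->
    %% Stand-in for the hidden ns_doctor:get_nodes/0 call. doctor is still
    %% inside init/1, so this can only end by timing out.
    Nodes = try gen_server:call(doctor, get_nodes, 200)
            catch exit:{timeout, _} -> timed_out
            end,
    {reply, {status, Nodes}, heart};
handle_call(get_nodes, _From, {doctor, _} = State) ->
    {reply, [node()], State};
handle_call(initial_status, _From, {doctor, Status} = State) ->
    {reply, Status, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Ignore any late reply to a call that already timed out.
handle_info(_Msg, State) ->
    {noreply, State}.
```

Calling call_cycle_demo:run() should return {{status, timed_out}, Elapsed} with Elapsed no smaller than the inner timeout: doctor only finishes starting after heart has given up on it. With the 5000 ms timeout from the trace, that is a five second stall on every such status request, which would fit the slow node joins.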