Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: 6.6.6
Affects Version/s: 6.6.0, 6.6.1, 6.6.2, 6.6.3, 6.6.4, 6.6.5
Component/s: build, ns_server
Labels:
- approved-for-6.6.6
- verified-6.6.6

Triage: Triaged
Story Points: 1
Is this a Regression?: Unknown
Sprint: Build Team 2022 Sprint 19

Description

The scenario is as follows. There are two nodes in the cluster, NodeA and NodeB. NodeA goes down and its machine changes IP. NodeB stays up and tries to reconnect to NodeA. Then NodeA starts and also tries to reconnect to NodeB.
Both nodes are using TLS distribution (node to node encryption).

Only Erlang versions < 21 are affected.

A few facts first:
1. Every time net_kernel tries to connect to a node, it makes a call to the TLS dist proxy (only when TLS is used):

     gen_server:call(?MODULE, {connect, Driver, Ip, Port, ExtraOpts}, infinity).

2. The tls:connect function takes 127 seconds to complete (until the OS returns etimedout) when no SYN-ACK comes back (the exact time depends on the machine's TCP settings).
3. There is a timeout inside net_kernel that triggers a reconnect every 7 seconds if the current connection can’t be established.
4. NodeA can’t connect to NodeB if NodeB has a pending connection to NodeA and NodeA < NodeB (conflict resolution in Erlang distribution).
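
The 127 seconds in fact 2 is consistent with the usual Linux SYN retransmission schedule. The sketch below reproduces that arithmetic under stated assumptions that are not in the ticket: an initial retransmission timeout of 1 second, a doubling wait after each unanswered SYN, and net.ipv4.tcp_syn_retries left at 6. The function name is made up for illustration.

```python
def syn_connect_timeout(syn_retries: int = 6, initial_rto: float = 1.0) -> float:
    """Seconds until connect() gives up with ETIMEDOUT when no SYN-ACK arrives.

    Assumption: the first SYN waits initial_rto seconds and every
    retransmission doubles the wait. The kernel's upper bound on a single
    wait is ignored, which does not matter for small retry counts.
    """
    # One initial SYN plus syn_retries retransmissions.
    return sum(initial_rto * 2 ** attempt for attempt in range(syn_retries + 1))


print(syn_connect_timeout())  # 127.0  (1 + 2 + 4 + 8 + 16 + 32 + 64)
```

On a machine with different TCP settings the total differs, which is why the 127 seconds should be read as a typical value rather than a constant.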

What I think happens there:

Phase 1: NodeA’s old IP is no longer valid, but name resolution still returns the old IP for some time (it is not clear how long this interval is).
During that time NodeB tries to reconnect to NodeA every 7 seconds (see #3), and each such attempt (see #1) gets stuck for 127 seconds (see #2).
So during this phase the TLS dist proxy process at NodeB accumulates {'$gen_call', _, _} messages with the wrong IP in its mailbox (it receives a message every 7 seconds, but it takes 127 seconds to handle each of them).
During this phase the nodes can’t connect to each other (because of #4).
Phase 2: Name resolution starts returning the correct IP.
NodeB tries to use the correct IP address, but the mailbox of the TLS dist proxy process is still full of {'$gen_call', _, _} messages with the “old” IP that are still being handled, so we still see SYN_SENT sockets being created in netstat.
The mailbox is now also filling up with another kind of {'$gen_call', _, _} message, this time containing the correct IP address.
During this phase the nodes can’t connect to each other (because of #4).
Phase 3: The TLS proxy process is done with the {'$gen_call', _, _} messages that contain the wrong IP.
The TLS proxy is now handling {'$gen_call', _, _} messages with the correct IP address, but most of them expired long ago (net_kernel got a timeout from gen_server:call for them).
If the remote node is available, this message queue can be handled relatively fast; otherwise, if each connect attempt takes seconds, the long queue may stay there for a while.
During this phase the nodes still should not be able to connect to each other, but if both nodes are up, this phase doesn’t take long.
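
To get a feel for how long phase 2 can last, here is a rough model of the proxy mailbox. It is a simplification and not taken from the Erlang sources: it assumes one request arrives at t = 0 and then exactly every 7 seconds, that the proxy handles requests strictly one at a time, and that every request carrying the old IP blocks for the full 127 seconds. The function names and the 600-second example are hypothetical.

```python
RECONNECT_INTERVAL = 7   # seconds between net_kernel reconnect attempts (fact 3)
CONNECT_TIMEOUT = 127    # seconds one connect to the dead IP blocks the proxy (fact 2)


def stale_backlog(phase1_seconds: int) -> int:
    """Old-IP requests not yet finished when phase 1 ends (incl. the one in progress)."""
    arrived = phase1_seconds // RECONNECT_INTERVAL + 1  # one at t=0, then every 7 s
    finished = phase1_seconds // CONNECT_TIMEOUT        # handled one at a time
    return arrived - finished


def drain_seconds(phase1_seconds: int) -> int:
    """Seconds after phase 1 ends until the last old-IP request has timed out."""
    arrived = phase1_seconds // RECONNECT_INTERVAL + 1
    # Requests arrive faster than they finish, so the proxy is busy from t=0
    # until all of them have timed out.
    return arrived * CONNECT_TIMEOUT - phase1_seconds


print(stale_backlog(600))   # 82
print(drain_seconds(600))   # 10322
```

Under these assumptions, ten minutes of stale name resolution leaves the proxy working through old-IP requests for almost three more hours, which matches the observation that SYN_SENT sockets keep appearing long after resolution is fixed.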

This can be confirmed in the logs by checking the mailbox of the TLS dist proxy: it should contain many messages with the incorrect IP.

Summary:

1. Basically this is a bug in Erlang 20. The bug leads to the following behavior:

If a node gets a wrong IP once, it loses the ability to connect to any node for 127 seconds;
This node also loses the ability to accept connections if the remote node's name "is less than" this node's name;
Every 7 seconds spent in that state adds another 127 seconds of inability to connect.
Only 6.* nodes should be affected because the TLS proxy is gone in Erlang 21.

2. Hypothetically we can come up with some script to fix it without node restart but I'm not sure if we need it.

3. Hypothetically we can fix it in erlang 20 for future 6.* releases, but I'm not sure we need it.

4. Hypothetically we can reduce net.ipv4.tcp_syn_retries (or change other TCP settings) so that each reconnect takes less time and the OS returns etimedout faster. If it takes less than 7 seconds, we should see no problems at all.

5. The simplest workaround seems to be to upgrade to 7.0 or even 7.1.
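
For item 4, a quick way to estimate which net.ipv4.tcp_syn_retries values would keep a failed connect under the 7-second reconnect interval. This is only a sketch: it assumes a 1-second initial retransmission timeout that doubles on every retry (so the total is initial_rto * (2^(n+1) - 1) for n retries), and the helper name is invented for illustration. It has not been checked against a real cluster.

```python
from typing import Optional


def max_syn_retries_under(limit_seconds: float, initial_rto: float = 1.0) -> Optional[int]:
    """Largest retry count whose total connect timeout stays strictly below the limit.

    Returns None if even zero retries would not fit.
    """
    best = None
    retries = 0
    # Total wait for n retries with doubling back-off: initial_rto * (2^(n+1) - 1).
    while initial_rto * (2 ** (retries + 1) - 1) < limit_seconds:
        best = retries
        retries += 1
    return best


print(max_syn_retries_under(7))  # 1  (timeout 3 s; 2 retries would give exactly 7 s)
```

With these assumptions, a value of 2 lands exactly on the 7-second boundary, so 1 would be the largest setting that stays clearly below it. Lowering this setting affects every outgoing TCP connection on the machine, not only Erlang distribution.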