Couchbase Server / MB-52898

[n2n encryption] Nodes can't connect to each other if at least one node in cluster changes IP (or disappears)


Details

    • Triaged
    • Build Team 2022 Sprint 19

    Description

      The scenario is the following: there are two nodes in the cluster, NodeA and NodeB. NodeA goes down and NodeA’s machine changes its IP. NodeB stays up and tries to reconnect to NodeA. Then NodeA starts and also tries to reconnect to NodeB.
      Both nodes are using TLS distribution (node-to-node encryption).

      Only Erlang versions < 21 are affected.

      A few facts first:
      1. Every time net_kernel tries to connect to a node, it makes a call to the TLS dist proxy (only when TLS is used):

           gen_server:call(
             ?MODULE, {connect, Driver, Ip, Port, ExtraOpts}, infinity).
      

      2. The tls:connect function takes 127 seconds to complete (for the OS to return etimedout) when no SYN-ACK comes back (this actually depends on the machine's TCP settings);
      3. There is a timeout inside net_kernel that triggers reconnects every 7 seconds if the current connection can't be established.
      4. NodeA can't connect to NodeB if NodeB has a pending connection to NodeA and NodeA < NodeB (connection conflict resolution in Erlang distribution).
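      The 127-second figure in fact #2 lines up with the Linux SYN retransmission schedule: assuming the default net.ipv4.tcp_syn_retries=6 (a default I'm assuming here, not something stated in the ticket), the kernel waits 1 s before the first retransmit and doubles the wait each time, for 2^7 - 1 = 127 seconds in total before etimedout. A quick sketch of that arithmetic:

```shell
# Sum the SYN retransmission backoff: 1 + 2 + 4 + ... seconds, one term
# per wait, assuming the Linux default tcp_syn_retries=6
# (1 initial SYN + 6 retransmits = 7 waits in total).
retries=6
total=0
delay=1
for _ in $(seq 0 "$retries"); do
  total=$((total + delay))
  delay=$((delay * 2))
done
echo "${total}s before the OS gives up"   # 127s
```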

      What I think happens there:

      Phase 1: old NodeA's IP is no longer valid, but name resolution still returns the old IP for some time (it is not clear how long this interval lasts).
      During that time NodeB tries to reconnect to NodeA every 7 seconds (see #3), so each such attempt (see #1) gets stuck for 127 seconds (see #2).
      So during that phase the TLS dist proxy process on NodeB accumulates {'$gen_call', _, _} messages with the wrong IP in its mailbox (it receives a message every 7 seconds, but it takes 127 seconds to handle each of them).
      During that phase the nodes can't connect to each other (because of #4).
      Phase 2: name resolution starts returning the correct IP.
      NodeB tries to use the correct IP address, but the mailbox of the TLS dist proxy process is still full of {'$gen_call', _, _} messages with the "old" IP, which are still being handled => we still see SYN_SENT sockets being created in netstat.
      Also, the mailbox is now getting filled with another kind of {'$gen_call', _, _} message, this time containing the correct IP address.
      During that phase the nodes can't connect to each other (because of #4).
      Phase 3: the TLS proxy process is done with the {'$gen_call', _, _} messages that contain the wrong IP.
      The TLS proxy is now handling {'$gen_call', _, _} messages with the correct IP address, but most of them expired long ago (net_kernel already got a timeout from gen_server:call for them).
      If the remote node is available, this message queue can be drained relatively quickly; otherwise, if each connect attempt takes seconds, the long queue may stay there for a while.
      During this phase the nodes still should not be able to connect to each other, but if both nodes are up, this phase doesn't take long.
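      The backlog arithmetic behind these phases can be sketched as follows. The 10-minute stale-DNS window is an assumed figure for illustration only; the 7 s and 127 s constants come from the facts above:

```shell
# Rough worst-case backlog estimate, assuming name resolution served the
# stale IP for 10 minutes: one reconnect call is enqueued every 7 s
# (net_kernel), and each can take up to 127 s to handle (blocked connect).
stale_dns_secs=600
enqueued=$((stale_dns_secs / 7))     # calls queued while DNS was stale
drain_secs=$((enqueued * 127))       # worst case to work through them all
echo "${enqueued} stale messages, ~$((drain_secs / 60)) min to drain"
```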

      In the logs this can be confirmed by checking the mailbox of the TLS dist proxy: it should contain many messages with the incorrect IP.

      Summary:

      1. Basically this is a bug in Erlang 20. The bug leads to the following behavior:

      If a node gets a wrong IP once, it loses the ability to connect to any node for 127 seconds;
      this node also loses the ability to accept connections from any remote node whose name "is less than" this node's name;
      every 7 seconds spent in this state adds another 127 seconds of inability to connect.
      Only 6.* nodes should be affected because the TLS proxy is gone in Erlang 21.

      2. Hypothetically we could come up with some script to fix it without a node restart, but I'm not sure we need it.

      3. Hypothetically we could fix it in Erlang 20 for future 6.* releases, but I'm not sure we need it.

      4. Hypothetically we can reduce net.ipv4.tcp_syn_retries (or change other TCP settings) so each reconnect takes less time and the OS returns etimedout faster. If a failed connect takes less than 7 seconds, we should see no problems at all.

      5. The simplest workaround seems to be to upgrade to 7.0 or even 7.1.
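      Workaround #4 could look like the sketch below. The value 2 is a hypothetical choice, not from the ticket, picked so that a failed connect gives up after roughly 1 + 2 + 4 = 7 seconds, i.e. within net_kernel's reconnect interval:

```shell
# Hypothetical mitigation sketch (requires root): lower the SYN retransmit
# count so connect() fails fast instead of blocking for ~127 s.
sysctl -w net.ipv4.tcp_syn_retries=2
# Persist the setting across reboots:
echo 'net.ipv4.tcp_syn_retries = 2' >> /etc/sysctl.conf
```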

      Attachments

        Issue Links


          Activity

            build-team Couchbase Build Team added a comment:
            Build couchbase-server-7.0.5-7623 contains tlm commit 8e346ed with commit message:
            MB-52898: Merge remote-tracking branch 'origin/6.6.5' into mad-hatter

            build-team Couchbase Build Team added a comment:
            Build couchbase-server-7.0.5-7623 contains tlm commit be8cba5 with commit message:
            MB-52898: Update to erlang 20.3.8.11-cb15

            timofey.barmin Timofey Barmin added a comment:
            Shaazin Sheikh Here is how you can reproduce it in order to verify the fix:
            1) Create a 3-node cluster that uses FQDNs (for example node1.test, node2.test, node3.test). N2n encryption should be enabled; the "control" level is enough.
            2) Shut down node1 (at this point node2 and node3 are trying to connect to node1).
            3) On node2 and node3, change the /etc/hosts file and make sure the node1 FQDN points to some non-local-network IP that does not reply (for example "1.2.3.4 node1.test" works for me). After that, messages start to accumulate in the ssl proxy; this can be verified by

            curl  -u Administrator '127.0.0.1:8091/diag/eval' -d 'erlang:process_info(whereis(ssl_tls_dist_proxy), message_queue_len).'
            

            4) Wait some time to make sure the number of messages in the ssl_tls_dist_proxy mailbox is at least ~20.
            5) Kill the distribution connection between node2 and node3:

            curl  -u Administrator '127.0.0.1:8091/diag/eval' -d '[erlang:disconnect_node(N) || N <- nodes()].'
            

            6) The connection between node2 and node3 should not recover (in the UI they see each other as "red"); this means the bug is reproduced.


            shaazin.sheikh Shaazin Sheikh added a comment:
            Reproduced the issue on EE 6.6.5-10080 with encryption on (control).
            Killing the distribution connection between node2 and node3 did not recover the connection between the two nodes.

            Followed the same steps on EE 6.6.5-10080 with encryption off, as well as on the latest build containing the fix, EE 6.6.5-10117, with encryption on (control).
            The number of messages in the ssl_tls_dist_proxy mailbox remains 0, and on killing the distribution connection between node2 and node3, the nodes go down for a few seconds and then come back.

            Thus verified the fix.


            shaazin.sheikh Shaazin Sheikh added a comment:
            Verified on Enterprise Edition 6.6.6 build 10574.

            • N2n encryption was set to control.
            • The number of messages in the ssl_tls_dist_proxy mailbox remains 0.
            • On killing the distribution connection between node2 and node3, the nodes go down for a few seconds and then come back.

            Closing the ticket.


            People

              shaazin.sheikh Shaazin Sheikh
              timofey.barmin Timofey Barmin
              Votes: 0
              Watchers: 10

              Time Tracking

                Original Estimate: Not Specified
                Remaining Estimate: 0h
                Time Spent: 1h