Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-37059

"Reconnect <node> retries exceeded" error while enabling n2n encryption

    XMLWordPrintable

Details

    • Untriaged
    • Yes

    Description

      Build 6.5.0-4837

      Observed that while enabling node to node encryption following error is thrown-

      Turned on encryption for node: http://172.23.96.100:8091
      ERROR: _ - Reconnect to 'ns_1@172.23.96.106' retries exceeded

      Note that out of 10 runs we saw this error 8 times, with reconnect to different nodes failing.

      Command used to enable node to node encryption-

      /opt/couchbase/bin/couchbase-cli node-to-node-encryption --enable --cluster 172.23.96.100:8091 -u Administrator -p password

      Here are details of one of run where we see this error-

      Job- http://perf.jenkins.couchbase.com/job/titan-reb-dev/90/

      Logs-
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/reconnect_error/collectinfo-2019-11-20T060144-ns_1%40172.23.96.100.zip
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/reconnect_error/collectinfo-2019-11-20T060144-ns_1%40172.23.96.101.zip
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/reconnect_error/collectinfo-2019-11-20T060144-ns_1%40172.23.96.102.zip
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/reconnect_error/collectinfo-2019-11-20T060144-ns_1%40172.23.96.103.zip
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/reconnect_error/collectinfo-2019-11-20T060144-ns_1%40172.23.96.104.zip
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/reconnect_error/collectinfo-2019-11-20T060144-ns_1%40172.23.96.105.zip
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/reconnect_error/collectinfo-2019-11-20T060144-ns_1%40172.23.96.106.zip
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/reconnect_error/collectinfo-2019-11-20T060144-ns_1%40172.23.96.107.zip
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/reconnect_error/collectinfo-2019-11-20T060144-ns_1%40172.23.96.108.zip
       
      In older builds like 6.5.0-4647, we don't used to see this often.

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          dfinlay Dave Finlay added a comment - - edited

          I'm not quite sure what's going on. I see the setupNetConfig REST request get returned with a 200:

          172.23.96.100 - Administrator [19/Nov/2019:21:57:08 -0800] "POST /node/controller/enableExternalListener HTTP/1.1" 200 0 - couchbase-cli  6.5.0-4837
          172.23.96.100 - Administrator [19/Nov/2019:21:57:09 -0800] "POST /node/controller/setupNetConfig HTTP/1.1" 200 0 - couchbase-cli  6.5.0-4837
          

          Inside ns_server we see:

          [ns_server:info,2019-11-19T21:57:09.081-08:00,ns_1@172.23.96.100:netconfig_updater<0.202.0>:netconfig_updater:apply_config_unprotected:158]Node is going to apply the following settings: [{nodeEncryption,true}]
          [ns_server:debug,2019-11-19T21:57:09.093-08:00,ns_1@172.23.96.100:cb_dist<0.2356.0>:cb_dist:info_msg:752]cb_dist: Updated cb_dist config "/opt/couchbase/var/lib/couchbase/config/dist_cfg": [{external_listeners,
                                                                                                [inet_tcp_dist,
                                                                                                 inet_tls_dist,
                                                                                                 inet6_tcp_dist]},
                                                                                               {preferred_external_proto,
                                                                                                inet_tls_dist},
                                                                                               {preferred_local_proto,
                                                                                                inet_tcp_dist}]
          

          .100 plans to reconnect to all the other nodes in the cluster

          ns_server:info,2019-11-19T21:57:09.094-08:00,ns_1@172.23.96.100:netconfig_updater<0.202.0>:netconfig_updater:change_ext_dist_proto:241]Reconnecting to all known erl nodes since dist protocol settings changed, expected afamily is inet, expected encryption is true, nodes: ['ns_1@172.23.96.101',                                                                                                                                         'ns_1@172.23.96.102',                                                                                                                                         'ns_1@172.23.96.103',                                                                                                                                         'ns_1@172.23.96.104',                                                                                                                                         'ns_1@172.23.96.105',                                                                                                                                         'ns_1@172.23.96.106',                                                                                                                                         'ns_1@172.23.96.107',                                                                                                                                         'ns_1@172.23.96.108']
          ...
          [ns_server:debug,2019-11-19T21:57:09.404-08:00,ns_1@172.23.96.100:netconfig_updater<0.202.0>:netconfig_updater:ensure_connection_proto:254]Reconnected to 'ns_1@172.23.96.106', checking connection type...
          

          And the change is successfully applied:

          [ns_server:info,2019-11-19T21:57:09.526-08:00,ns_1@172.23.96.100:netconfig_updater<0.202.0>:netconfig_updater:apply_config_unprotected:187]Node network settings ([{nodeEncryption,true}]) successfully applied
          

          Mahesh Mandhare: I'm guessing you're sure these are the correct logs, right?

          Assigning to Timofey Barmin for a closer look.

          dfinlay Dave Finlay added a comment - - edited I'm not quite sure what's going on. I see the setupNetConfig REST request get returned with a 200: 172.23.96.100 - Administrator [19/Nov/2019:21:57:08 -0800] "POST /node/controller/enableExternalListener HTTP/1.1" 200 0 - couchbase-cli 6.5.0-4837 172.23.96.100 - Administrator [19/Nov/2019:21:57:09 -0800] "POST /node/controller/setupNetConfig HTTP/1.1" 200 0 - couchbase-cli 6.5.0-4837 Inside ns_server we see: [ns_server:info,2019-11-19T21:57:09.081-08:00,ns_1@172.23.96.100:netconfig_updater<0.202.0>:netconfig_updater:apply_config_unprotected:158]Node is going to apply the following settings: [{nodeEncryption,true}] [ns_server:debug,2019-11-19T21:57:09.093-08:00,ns_1@172.23.96.100:cb_dist<0.2356.0>:cb_dist:info_msg:752]cb_dist: Updated cb_dist config "/opt/couchbase/var/lib/couchbase/config/dist_cfg": [{external_listeners, [inet_tcp_dist, inet_tls_dist, inet6_tcp_dist]}, {preferred_external_proto, inet_tls_dist}, {preferred_local_proto, inet_tcp_dist}] .100 plans to reconnect to all the other nodes in the cluster ns_server:info,2019-11-19T21:57:09.094-08:00,ns_1@172.23.96.100:netconfig_updater<0.202.0>:netconfig_updater:change_ext_dist_proto:241]Reconnecting to all known erl nodes since dist protocol settings changed, expected afamily is inet, expected encryption is true, nodes: ['ns_1@172.23.96.101', 'ns_1@172.23.96.102', 'ns_1@172.23.96.103', 'ns_1@172.23.96.104', 'ns_1@172.23.96.105', 'ns_1@172.23.96.106', 'ns_1@172.23.96.107', 'ns_1@172.23.96.108'] ... [ns_server:debug,2019-11-19T21:57:09.404-08:00,ns_1@172.23.96.100:netconfig_updater<0.202.0>:netconfig_updater:ensure_connection_proto:254]Reconnected to 'ns_1@172.23.96.106', checking connection type... And the change is successfully applied: [ns_server:info,2019-11-19T21:57:09.526-08:00,ns_1@172.23.96.100:netconfig_updater<0.202.0>:netconfig_updater:apply_config_unprotected:187]Node network settings ([{nodeEncryption,true}]) successfully applied Mahesh Mandhare : I'm guessing you're sure these are the correct logs, right? Assigning to Timofey Barmin for a closer look.

          Logs do match timestamp on console logs on jenkins job, so they looks correct.

          mahesh.mandhare Mahesh Mandhare (Inactive) added a comment - Logs do match timestamp on console logs on jenkins job, so they looks correct.
          dfinlay Dave Finlay added a comment -

          Yes, Timofey looked into it today and indeed it is an issue in this situation.

          dfinlay Dave Finlay added a comment - Yes, Timofey looked into it today and indeed it is an issue in this situation.

          Build couchbase-server-6.5.0-4904 contains ns_server commit 1cd9c64 with commit message:
          MB-37059: Exponential backoff for netconfig_updater

          build-team Couchbase Build Team added a comment - Build couchbase-server-6.5.0-4904 contains ns_server commit 1cd9c64 with commit message: MB-37059 : Exponential backoff for netconfig_updater

          Build couchbase-server-7.0.0-1096 contains ns_server commit 1cd9c64 with commit message:
          MB-37059: Exponential backoff for netconfig_updater

          build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.0-1096 contains ns_server commit 1cd9c64 with commit message: MB-37059 : Exponential backoff for netconfig_updater
          mahesh.mandhare Mahesh Mandhare (Inactive) added a comment - Build 6.5.0-4908 Verified that we did not see this issue in 3 runs of same test on same hardware. Jobs- http://perf.jenkins.couchbase.com/job/titan-reb-dev/104/ http://perf.jenkins.couchbase.com/job/titan-reb-dev/105/ http://perf.jenkins.couchbase.com/job/titan-reb-dev/106/

          People

            mahesh.mandhare Mahesh Mandhare (Inactive)
            mahesh.mandhare Mahesh Mandhare (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty