Couchbase Server / MB-45929

Couchbase web server yields connection refused after several rebalance ins & outs


Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Affects Version/s: 7.0.0
    • Fix Version/s: Cheshire-Cat
    • Component/s: ns_server
    • test[ClusterCbRemoteLinksLifecycleIT 2: lifecycle: alternate-address-rebalance-out]
    • 1
    • Yes

    Description

      After a number of rebalance ins and outs on a cluster_run cluster of n_1 & n_2, attempts to contact the web server on n_2 (:9002) get connection refused. This condition persists at least until our test framework gives up (~90s after the last n_2 rebalance out).

      2021-04-27T07:14:30.469-07:00 INFO ClusterExecutionITBase [main] Running cli: rebalance -c 172.18.0.3:9001 -u couchbase -p couchbase --server-remove 172.18.0.3:9002
      2021-04-27T07:16:05.480-07:00 INFO ClusterExecutionITBase [main+] >> Unable to display progress bar on this os
      2021-04-27T07:16:05.480-07:00 INFO ClusterExecutionITBase [main+] >> SUCCESS: Rebalance complete
      ...
      2021-04-27T07:17:35.601-07:00 ERRO TestExecutor [main] testFile src/test/resources/runtimets/queries/remote/cb/lifecycle/alternate-address-rebalance-out/test.15.cb.cmd raised an unexpected exception
      java.util.concurrent.ExecutionException: java.lang.IllegalStateException: timed out before desired response received (last result: org.apache.http.conn.HttpHostConnectException: Connect to 172.18.0.3:9002 [/172.18.0.3] failed: Connection refused (Connection refused))
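
      For reference, below is a minimal sketch (not the actual ClusterExecutionITBase code) of the kind of poll-until-timeout loop the test runs against n_2's REST port; the class name, endpoint path, and intervals are illustrative assumptions only:

      import java.io.IOException;
      import java.net.URI;
      import java.net.http.HttpClient;
      import java.net.http.HttpRequest;
      import java.net.http.HttpResponse;
      import java.time.Duration;

      // Illustrative sketch: keep polling the node's REST port until it answers
      // or ~90s elapse, roughly the behaviour described above.
      public class PollNodeUntilUp {
          public static void main(String[] args) throws InterruptedException {
              HttpClient client = HttpClient.newBuilder()
                      .connectTimeout(Duration.ofSeconds(5))
                      .build();
              HttpRequest request = HttpRequest.newBuilder()
                      .uri(URI.create("http://172.18.0.3:9002/pools")) // placeholder endpoint
                      .timeout(Duration.ofSeconds(5))
                      .GET()
                      .build();

              long deadline = System.nanoTime() + Duration.ofSeconds(90).toNanos();
              Exception last = null;
              while (System.nanoTime() < deadline) {
                  try {
                      HttpResponse<String> resp =
                              client.send(request, HttpResponse.BodyHandlers.ofString());
                      System.out.println("node answered with HTTP " + resp.statusCode());
                      return;
                  } catch (IOException e) {
                      // e.g. "Connection refused" while the n_2 web server is down
                      last = e;
                      Thread.sleep(1000);
                  }
              }
              throw new IllegalStateException(
                      "timed out before desired response received (last: " + last + ")");
          }
      }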
      

      This does not seem to be intermittent; it fails reliably on every test run, both locally on a MacBook and in the Jenkins Ubuntu environment.

      Note: the cbcollect_infos failed with a dump-guts failure, so while I've attached them they seem to be useless. I have also attached the raw logs for these nodes.

      2021-04-27T07:18:06.716-07:00 INFO ClusterExecutionITBase [main+] >> Found dump-guts: /home/couchbase/jenkins/workspace/cbas-cbcluster-test2/install/bin/dump-guts
      2021-04-27T07:18:06.721-07:00 INFO ClusterExecutionITBase [ForkJoinPool.commonPool-worker-5+] >> Raw PID 1 control groups /proc/1/cgroup (cat /proc/1/cgroup) - OK
      2021-04-27T07:18:06.721-07:00 INFO ClusterExecutionITBase [ForkJoinPool.commonPool-worker-5+] >> Found dump-guts: /home/couchbase/jenkins/workspace/cbas-cbcluster-test2/install/bin/dump-guts
      2021-04-27T07:18:07.652-07:00 INFO ClusterExecutionITBase [main+] >> Error occurred getting server guts: Got exception: {error,badarg}
      2021-04-27T07:18:07.652-07:00 INFO ClusterExecutionITBase [main+] >> [{lists,keyfind,[port_meta,1,'_deleted'],[]},
      2021-04-27T07:18:07.652-07:00 INFO ClusterExecutionITBase [main+] >>  {'dump-guts__escript__1619__533086__994357__5',extract_rest_port,2,
      2021-04-27T07:18:07.652-07:00 INFO ClusterExecutionITBase [main+] >>      [{file,
      2021-04-27T07:18:07.652-07:00 INFO ClusterExecutionITBase [main+] >>           "/home/couchbase/jenkins/workspace/cbas-cbcluster-test2/install/bin/dump-guts"},
      2021-04-27T07:18:07.652-07:00 INFO ClusterExecutionITBase [main+] >>       {line,458}]},
      2021-04-27T07:18:07.652-07:00 INFO ClusterExecutionITBase [main+] >>  {'dump-guts__escript__1619__533086__994357__5',main_with_everything,4,
      2021-04-27T07:18:07.652-07:00 INFO ClusterExecutionITBase [main+] >>      [{file,
      2021-04-27T07:18:07.652-07:00 INFO ClusterExecutionITBase [main+] >>           "/home/couchbase/jenkins/workspace/cbas-cbcluster-test2/install/bin/dump-guts"},
      2021-04-27T07:18:07.652-07:00 INFO ClusterExecutionITBase [main+] >>       {line,553}]},
      2021-04-27T07:18:07.652-07:00 INFO ClusterExecutionITBase [main+] >>  {'dump-guts__escript__1619__533086__994357__5',main,1,
      2021-04-27T07:18:07.652-07:00 INFO ClusterExecutionITBase [main+] >>      [{file,
      2021-04-27T07:18:07.652-07:00 INFO ClusterExecutionITBase [main+] >>           "/home/couchbase/jenkins/workspace/cbas-cbcluster-test2/install/bin/dump-guts"},
      2021-04-27T07:18:07.652-07:00 INFO ClusterExecutionITBase [main+] >>       {line,136}]},
      2021-04-27T07:18:07.652-07:00 INFO ClusterExecutionITBase [main+] >>  {escript,run,2,[{file,"escript.erl"},{line,758}]},
      2021-04-27T07:18:07.652-07:00 INFO ClusterExecutionITBase [main+] >>  {escript,start,1,[{file,"escript.erl"},{line,277}]},
      2021-04-27T07:18:07.653-07:00 INFO ClusterExecutionITBase [main+] >>  {init,start_em,1,[]},
      2021-04-27T07:18:07.653-07:00 INFO ClusterExecutionITBase [main+] >>  {init,do_boot,3,[]}]
      
      

      The n_2 node seems to be stuck in a state where it logs this repeatedly:

      {net_kernel,{auto_connect,'couchdb_n_2@cb.local',
                                {1132,#Ref<0.2357520616.3444178948.168890>}}}
      [ns_server:debug,2021-04-27T07:19:59.374-07:00,n_2@172.18.0.3:net_kernel<0.1669.0>:cb_dist:info_msg:778]cb_dist: Setting up new connection to 'couchdb_n_2@cb.local' using inet_tcp_dist
      [ns_server:debug,2021-04-27T07:19:59.374-07:00,n_2@172.18.0.3:cb_dist<0.1666.0>:cb_dist:info_msg:778]cb_dist: Added connection {con,#Ref<0.2357520616.3444310017.167261>,
                                     inet_tcp_dist,undefined,undefined}
      [ns_server:debug,2021-04-27T07:19:59.374-07:00,n_2@172.18.0.3:cb_dist<0.1666.0>:cb_dist:info_msg:778]cb_dist: Updated connection: {con,#Ref<0.2357520616.3444310017.167261>,
                                        inet_tcp_dist,<0.14481.4>,
                                        #Ref<0.2357520616.3444310017.167264>}
      [error_logger:info,2021-04-27T07:19:59.386-07:00,n_2@172.18.0.3:net_kernel<0.1669.0>:ale_error_logger_handler:do_log:101]
      =========================NOTICE REPORT=========================
      {net_kernel,{'EXIT',<0.14481.4>,{recv_challenge_ack_failed,{error,closed}}}}
      [ns_server:debug,2021-04-27T07:19:59.386-07:00,n_2@172.18.0.3:cb_dist<0.1666.0>:cb_dist:info_msg:778]cb_dist: Connection down: {con,#Ref<0.2357520616.3444310017.167261>,
                                     inet_tcp_dist,<0.14481.4>,
                                     #Ref<0.2357520616.3444310017.167264>}
      [error_logger:info,2021-04-27T07:19:59.386-07:00,n_2@172.18.0.3:net_kernel<0.1669.0>:ale_error_logger_handler:do_log:101]
      =========================NOTICE REPORT=========================
      {net_kernel,{net_kernel,1054,nodedown,'couchdb_n_2@cb.local'}}
      [ns_server:debug,2021-04-27T07:19:59.387-07:00,n_2@172.18.0.3:<0.14365.4>:ns_server_nodes_sup:do_wait_link_to_couchdb_node:161]ns_couchdb is not ready: {badrpc,nodedown}
      [error_logger:info,2021-04-27T07:19:59.588-07:00,n_2@172.18.0.3:net_kernel<0.1669.0>:ale_error_logger_handler:do_log:101]
      

      Attachments

        1. test-1.log
          6 kB
        2. n_2_logs.zip
          8.24 MB
        3. n_1_logs.zip
          7.94 MB
        4. cbcollect_info_n_2.zip
          14.30 MB
        5. cbcollect_info_n_1.zip
          14.30 MB

        Issue Links


          Activity

            People

              Assignee: dfinlay Dave Finlay
              Reporter: michael.blow Michael Blow
              Votes: 0
              Watchers: 3


                Gerrit Reviews

                  There are no open Gerrit changes
