Couchbase Server / MB-51319

Some nodes in the cluster are rendered unusable post cleanup

Details

    • Untriaged
    • Centos 64-bit
    • 1
    • Unknown
    • KV 2022-Feb, KV March-22

    Description

      Script to Repro

      There is no particular test that reproduces this. Cleanup can fail for any test, leaving one or more nodes in the cluster unusable.
      

      Test logs from before the node became unusable:

      2022-03-06 13:49:41,844 | test  | INFO    | MainThread | [basetestcase:log_setup_status:647] ========= BaseTestCase setup started for test #5 test_data_load_collections_with_graceful_failover_rebalance_out =========
      2022-03-06 13:50:22,311 | test  | INFO    | MainThread | [rest_client:monitorRebalance:1610] Rebalance done. Taken 11.0540001392 seconds to complete
      2022-03-06 13:50:22,319 | test  | INFO    | MainThread | [common_lib:sleep:23] Sleep 5 seconds. Reason: Wait after rebalance complete
      2022-03-06 13:50:27,359 | test  | ERROR   | MainThread | [rest_client:_http_request:834] GET http://172.23.100.35:8091/nodes/self body:  headers: {'Accept': '*/*', 'Connection': 'close', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==', 'Content-Type': 'application/json'} error: 404 reason: unknown "Node is unknown to this cluster." auth: Administrator:password
      http://172.23.100.35:8091/nodes/self with status 0: Node is unknown to this cluster.
      2022-03-06 13:50:27,362 | test  | ERROR   | MainThread | [rest_client:__init__:312] Error Node is unknown to this cluster. was gotten, 5 seconds sleep before retry
      2022-03-06 13:50:32,378 | test  | ERROR   | MainThread | [rest_client:_http_request:834] GET http://172.23.100.35:8091/nodes/self body:  headers: {'Accept': '*/*', 'Connection': 'close', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==', 'Content-Type': 'application/json'} error: 404 reason: unknown "Node is unknown to this cluster." auth: Administrator:password
      http://172.23.100.35:8091/nodes/self with status 0: Node is unknown to this cluster.
      2022-03-06 13:50:32,380 | test  | ERROR   | MainThread | [rest_client:__init__:312] Error Node is unknown to this cluster. was gotten, 5 seconds sleep before retry
      2022-03-06 13:50:37,394 | test  | ERROR   | MainThread | [rest_client:_http_request:834] GET http://172.23.100.35:8091/nodes/self body:  headers: {'Accept': '*/*', 'Connection': 'close', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==', 'Content-Type': 'application/json'} error: 404 reason: unknown "Node is unknown to this cluster." auth: Administrator:password
      http://172.23.100.35:8091/nodes/self with status 0: Node is unknown to this cluster.
      2022-03-06 13:50:37,395 | test  | ERROR   | MainThread | [rest_client:__init__:312] Error Node is unknown to this cluster. was gotten, 5 seconds sleep before retry
      2022-03-06 13:50:42,404 | test  | ERROR   | MainThread | [rest_client:__init__:317] Node 172.23.100.35:8091 is in a broken state!
      2022-03-06 13:50:42,404 | test  | ERROR   | MainThread | [cluster_ready_functions:cleanup_cluster:232] Can't create rest connection after rebalance out for ejected nodes, will retry after 10 seconds according to MB-8430: Unable to reach the host @ 172.23.100.35
      2022-03-06 13:50:42,411 | test  | INFO    | MainThread | [common_lib:sleep:23] Sleep 10 seconds. Reason: MB-8430
      2022-03-06 13:50:52,420 | test  | ERROR   | MainThread | [rest_client:_http_request:834] GET http://172.23.100.35:8091/nodes/self body:  headers: {'Accept': '*/*', 'Connection': 'close', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==', 'Content-Type': 'application/json'} error: 404 reason: unknown "Node is unknown to this cluster." auth: Administrator:password
      http://172.23.100.35:8091/nodes/self with status 0: Node is unknown to this cluster.
      2022-03-06 13:50:52,421 | test  | ERROR   | MainThread | [rest_client:__init__:312] Error Node is unknown to this cluster. was gotten, 5 seconds sleep before retry
      2022-03-06 13:50:57,436 | test  | ERROR   | MainThread | [rest_client:_http_request:834] GET http://172.23.100.35:8091/nodes/self body:  headers: {'Accept': '*/*', 'Connection': 'close', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==', 'Content-Type': 'application/json'} error: 404 reason: unknown "Node is unknown to this cluster." auth: Administrator:password
      http://172.23.100.35:8091/nodes/self with status 0: Node is unknown to this cluster.
      2022-03-06 13:50:57,437 | test  | ERROR   | MainThread | [rest_client:__init__:312] Error Node is unknown to this cluster. was gotten, 5 seconds sleep before retry
      2022-03-06 13:51:02,453 | test  | ERROR   | MainThread | [rest_client:_http_request:834] GET http://172.23.100.35:8091/nodes/self body:  headers: {'Accept': '*/*', 'Connection': 'close', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==', 'Content-Type': 'application/json'} error: 404 reason: unknown "Node is unknown to this cluster." auth: Administrator:password
      http://172.23.100.35:8091/nodes/self with status 0: Node is unknown to this cluster.
      2022-03-06 13:51:02,456 | test  | ERROR   | MainThread | [rest_client:__init__:312] Error Node is unknown to this cluster. was gotten, 5 seconds sleep before retry
      2022-03-06 13:51:07,463 | test  | ERROR   | MainThread | [rest_client:__init__:317] Node 172.23.100.35:8091 is in a broken state!
      Traceback (most recent call last):
        File "pytests/basetestcase.py", line 363, in setUp
          self.cluster_util.cluster_cleanup(cluster,
        File "pytests/basetestcase.py", line 363, in setUp
          self.cluster_util.cluster_cleanup(cluster,
        File "couchbase_utils/cluster_utils/cluster_ready_functions.py", line 169, in cluster_cleanup
          self.cleanup_cluster(cluster, master=cluster.master)
        File "couchbase_utils/cluster_utils/cluster_ready_functions.py", line 237, in cleanup_cluster
          rest = RestConnection(removed)
        File "lib/membase/api/rest_client.py", line 319, in __init__
          raise ServerUnavailableException(self.ip)
      ServerUnavailableException: Unable to reach the host @ 172.23.100.35
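
      The retry pattern visible in the test log above (three probes of /nodes/self, a 5-second sleep after each failure, then an exception) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual code in lib/membase/api/rest_client.py: wait_for_node, the probe callable, and the injectable sleep parameter are hypothetical names, and the exception class is a local stand-in for the framework's ServerUnavailableException.

```python
import time


class ServerUnavailableException(Exception):
    """Local stand-in for the test framework's exception of the same name."""


def wait_for_node(probe, ip, attempts=3, delay=5, sleep=time.sleep):
    """Probe a node up to `attempts` times before declaring it unreachable.

    `probe` is a hypothetical callable that returns (ok, message), for
    example a wrapper around GET http://<ip>:8091/nodes/self.
    `sleep` is injectable so the loop can be exercised without waiting.
    """
    for _ in range(attempts):
        ok, message = probe()
        if ok:
            return True
        # Mirrors the log: a sleep follows every failed probe, including the last.
        print("Error %s was gotten, %d seconds sleep before retry" % (message, delay))
        sleep(delay)
    raise ServerUnavailableException("Unable to reach the host @ %s" % ip)
```

      With a probe that always reports "Node is unknown to this cluster.", the loop sleeps three times and then raises, which matches the roughly 15 seconds between the first failed request and the "broken state" line in the log.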
      

      Logs from node 172.23.100.35:

      [ns_server:debug,2022-03-06T21:12:03.979-08:00,ns_1@172.23.100.35:<0.17157.46>:ns_memcached:ensure_bucket_inner:1318]Bucket "default" not found during ensure_bucket
      [ns_server:error,2022-03-06T21:12:04.012-08:00,ns_1@172.23.100.35:<0.17160.46>:ns_server_stats:report_prom_stats:172]ns_server stats reporting exception: error:badarg
      [{ets,lookup,
            [ns_server_stats,{c,{<<"rest_request_enters">>,[]}}],
            [{error_info,#{cause => id,module => erl_stdlib_errors}}]},
       {ns_server_stats,'-report_ns_server_lc_stats/1-fun-0-',2,
                        [{file,"src/ns_server_stats.erl"},{line,257}]},
       {lists,foreach,2,[{file,"lists.erl"},{line,1342}]},
       {ns_server_stats,'-report_prom_stats/2-fun-0-',2,
                        [{file,"src/ns_server_stats.erl"},{line,170}]},
       {ns_server_stats,report_prom_stats,2,
                        [{file,"src/ns_server_stats.erl"},{line,180}]},
       {async,'-async_init/4-fun-1-',3,[{file,"src/async.erl"},{line,191}]}]
      [ns_server:error,2022-03-06T21:12:04.012-08:00,ns_1@172.23.100.35:<0.17160.46>:ns_server_stats:report_prom_stats:172]system stats reporting exception: exit:{noproc,
                                              {gen_server,call,
                                               [ns_server_stats,get_stats]}}
      [{gen_server,call,2,[{file,"gen_server.erl"},{line,239}]},
       {ns_server_stats,report_system_stats,1,
                        [{file,"src/ns_server_stats.erl"},{line,188}]},
       {ns_server_stats,'-report_prom_stats/2-fun-0-',2,
                        [{file,"src/ns_server_stats.erl"},{line,170}]},
       {ns_server_stats,report_prom_stats,2,
                        [{file,"src/ns_server_stats.erl"},{line,182}]},
       {async,'-async_init/4-fun-1-',3,[{file,"src/async.erl"},{line,191}]}]
      [ns_server:debug,2022-03-06T21:12:04.110-08:00,ns_1@172.23.100.35:<0.17158.46>:ns_memcached:ensure_bucket_inner:1318]Bucket "default" not found during ensure_bucket
      [ns_server:error,2022-03-06T21:12:04.324-08:00,ns_1@172.23.100.35:<0.17166.46>:ns_server_stats:report_prom_stats:172]ns_server stats reporting exception: error:badarg
      [{ets,safe_fixtable,
            [ns_server_stats,true],
            [{error_info,#{cause => id,module => erl_stdlib_errors}}]},
       {ets,foldl,3,[{file,"ets.erl"},{line,625}]},
       {ns_server_stats,report_ns_server_hc_stats,1,
                        [{file,"src/ns_server_stats.erl"},{line,264}]},
       {ns_server_stats,'-report_prom_stats/2-fun-0-',2,
                        [{file,"src/ns_server_stats.erl"},{line,170}]},
       {ns_server_stats,report_prom_stats,2,
                        [{file,"src/ns_server_stats.erl"},{line,178}]},
       {async,'-async_init/4-fun-1-',3,[{file,"src/async.erl"},{line,191}]}]
      [ns_server:error,2022-03-06T21:12:04.651-08:00,ns_1@172.23.100.35:<0.17058.46>:menelaus_util:reply_server_error_before_close:210]Server error during processing: ["web request failed",
                                       {path,"/pools/default"},
                                       {method,'GET'},
                                       {type,exit},
                                       {what,
                                        {noproc,
                                         {gen_server,call,
                                          ['service_status_keeper-index',
                                           get_version]}}},
                                       {trace,
                                        [{gen_server,call,2,
                                          [{file,"gen_server.erl"},{line,239}]},
                                         {menelaus_web_pools,do_build_pool_info,4,
                                          [{file,"src/menelaus_web_pools.erl"},
                                           {line,211}]},
                                         {menelaus_web_pools,pool_info,6,
                                          [{file,"src/menelaus_web_pools.erl"},
                                           {line,106}]},
                                         {menelaus_web_pools,handle_pool_info_wait,
                                          5,
                                          [{file,"src/menelaus_web_pools.erl"},
                                           {line,118}]},
                                         {request_tracker,request,2,
                                          [{file,"src/request_tracker.erl"},
                                           {line,40}]},
                                         {menelaus_util,handle_request,2,
                                          [{file,"src/menelaus_util.erl"},
                                           {line,221}]},
                                         {mochiweb_http,headers,6,
                                          [{file,
                                            "/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/mochiweb/mochiweb_http.erl"},
                                           {line,153}]},
                                         {proc_lib,init_p_do_apply,3,
                                          [{file,"proc_lib.erl"},{line,226}]}]}]
      
      

      This may be another side effect of MB-49512, which is also hit frequently during cleanup and relates to a bucket not being dropped completely.
      cbcollect_info attached.
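
      For triage, a console log in the format shown above can be scanned for nodes reported as broken. The sketch below is a hedged helper, not part of the test framework: LINE_RE, BROKEN_RE, and broken_nodes are hypothetical names, and the regular expressions assume the "timestamp | test | LEVEL | thread | [module:function:line] message" layout seen in the pasted test log.

```python
import re
from collections import Counter

# Assumed layout: "2022-03-06 13:50:42,404 | test  | ERROR   | MainThread | [mod:func:317] msg"
LINE_RE = re.compile(
    r"^\s*(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) \| \w+\s*\| "
    r"(?P<level>\w+)\s*\| (?P<thread>\S+) \| "
    r"\[(?P<module>[^:\]]+):(?P<func>[^:\]]+):(?P<line>\d+)\] (?P<msg>.*)$"
)
BROKEN_RE = re.compile(r"Node (\S+) is in a broken state!")


def broken_nodes(lines):
    """Count 'broken state' ERROR entries per node address."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None or match.group("level") != "ERROR":
            continue  # skip continuation lines, tracebacks, and non-error entries
        hit = BROKEN_RE.search(match.group("msg"))
        if hit:
            counts[hit.group(1)] += 1
    return counts
```

      Usage might look like broken_nodes(open("consoleText_MB-51319.txt")), assuming the attachment has been downloaded locally; the result maps each host:port to the number of times it was declared broken.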

      Attachments

        1. consoleText_MB-51319.txt
          1.98 MB
        2. debug.log.1
          40.00 MB
        3. latest-diag.log
          1.52 MB
        4. memcached.log.000021.txt
          7.94 MB
        5. node35.gdb.log
          789 kB
          People

            Balakumaran.Gopal Balakumaran Gopal
            Balakumaran.Gopal Balakumaran Gopal