Couchbase Server / MB-51319

Some nodes in the cluster are rendered unusable post cleanup


Details

    • Untriaged
    • Centos 64-bit
    • 1
    • Unknown
    • KV 2022-Feb, KV March-22

    Description

      Script to Repro

      There is no particular test that reproduces this; cleanup can fail after any test, rendering one or more nodes in the cluster unusable.
      

      Test-framework logs from before the node became unusable:

      2022-03-06 13:49:41,844 | test  | INFO    | MainThread | [basetestcase:log_setup_status:647] ========= BaseTestCase setup started for test #5 test_data_load_collections_with_graceful_failover_rebalance_out =========
      2022-03-06 13:50:22,311 | test  | INFO    | MainThread | [rest_client:monitorRebalance:1610] Rebalance done. Taken 11.0540001392 seconds to complete
      2022-03-06 13:50:22,319 | test  | INFO    | MainThread | [common_lib:sleep:23] Sleep 5 seconds. Reason: Wait after rebalance complete
      2022-03-06 13:50:27,359 | test  | ERROR   | MainThread | [rest_client:_http_request:834] GET http://172.23.100.35:8091/nodes/self body:  headers: {'Accept': '*/*', 'Connection': 'close', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==', 'Content-Type': 'application/json'} error: 404 reason: unknown "Node is unknown to this cluster." auth: Administrator:password
      http://172.23.100.35:8091/nodes/self with status 0: Node is unknown to this cluster.
      2022-03-06 13:50:27,362 | test  | ERROR   | MainThread | [rest_client:__init__:312] Error Node is unknown to this cluster. was gotten, 5 seconds sleep before retry
      2022-03-06 13:50:32,378 | test  | ERROR   | MainThread | [rest_client:_http_request:834] GET http://172.23.100.35:8091/nodes/self body:  headers: {'Accept': '*/*', 'Connection': 'close', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==', 'Content-Type': 'application/json'} error: 404 reason: unknown "Node is unknown to this cluster." auth: Administrator:password
      http://172.23.100.35:8091/nodes/self with status 0: Node is unknown to this cluster.
      2022-03-06 13:50:32,380 | test  | ERROR   | MainThread | [rest_client:__init__:312] Error Node is unknown to this cluster. was gotten, 5 seconds sleep before retry
      2022-03-06 13:50:37,394 | test  | ERROR   | MainThread | [rest_client:_http_request:834] GET http://172.23.100.35:8091/nodes/self body:  headers: {'Accept': '*/*', 'Connection': 'close', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==', 'Content-Type': 'application/json'} error: 404 reason: unknown "Node is unknown to this cluster." auth: Administrator:password
      http://172.23.100.35:8091/nodes/self with status 0: Node is unknown to this cluster.
      2022-03-06 13:50:37,395 | test  | ERROR   | MainThread | [rest_client:__init__:312] Error Node is unknown to this cluster. was gotten, 5 seconds sleep before retry
      2022-03-06 13:50:42,404 | test  | ERROR   | MainThread | [rest_client:__init__:317] Node 172.23.100.35:8091 is in a broken state!
      2022-03-06 13:50:42,404 | test  | ERROR   | MainThread | [cluster_ready_functions:cleanup_cluster:232] Can't create rest connection after rebalance out for ejected nodes, will retry after 10 seconds according to MB-8430: Unable to reach the host @ 172.23.100.35
      2022-03-06 13:50:42,411 | test  | INFO    | MainThread | [common_lib:sleep:23] Sleep 10 seconds. Reason: MB-8430
      2022-03-06 13:50:52,420 | test  | ERROR   | MainThread | [rest_client:_http_request:834] GET http://172.23.100.35:8091/nodes/self body:  headers: {'Accept': '*/*', 'Connection': 'close', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==', 'Content-Type': 'application/json'} error: 404 reason: unknown "Node is unknown to this cluster." auth: Administrator:password
      http://172.23.100.35:8091/nodes/self with status 0: Node is unknown to this cluster.
      2022-03-06 13:50:52,421 | test  | ERROR   | MainThread | [rest_client:__init__:312] Error Node is unknown to this cluster. was gotten, 5 seconds sleep before retry
      2022-03-06 13:50:57,436 | test  | ERROR   | MainThread | [rest_client:_http_request:834] GET http://172.23.100.35:8091/nodes/self body:  headers: {'Accept': '*/*', 'Connection': 'close', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==', 'Content-Type': 'application/json'} error: 404 reason: unknown "Node is unknown to this cluster." auth: Administrator:password
      http://172.23.100.35:8091/nodes/self with status 0: Node is unknown to this cluster.
      2022-03-06 13:50:57,437 | test  | ERROR   | MainThread | [rest_client:__init__:312] Error Node is unknown to this cluster. was gotten, 5 seconds sleep before retry
      2022-03-06 13:51:02,453 | test  | ERROR   | MainThread | [rest_client:_http_request:834] GET http://172.23.100.35:8091/nodes/self body:  headers: {'Accept': '*/*', 'Connection': 'close', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==', 'Content-Type': 'application/json'} error: 404 reason: unknown "Node is unknown to this cluster." auth: Administrator:password
      http://172.23.100.35:8091/nodes/self with status 0: Node is unknown to this cluster.
      2022-03-06 13:51:02,456 | test  | ERROR   | MainThread | [rest_client:__init__:312] Error Node is unknown to this cluster. was gotten, 5 seconds sleep before retry
      2022-03-06 13:51:07,463 | test  | ERROR   | MainThread | [rest_client:__init__:317] Node 172.23.100.35:8091 is in a broken state!
      Traceback (most recent call last):
        File "pytests/basetestcase.py", line 363, in setUp
          self.cluster_util.cluster_cleanup(cluster,
        File "pytests/basetestcase.py", line 363, in setUp
          self.cluster_util.cluster_cleanup(cluster,
        File "couchbase_utils/cluster_utils/cluster_ready_functions.py", line 169, in cluster_cleanup
          self.cleanup_cluster(cluster, master=cluster.master)
        File "couchbase_utils/cluster_utils/cluster_ready_functions.py", line 237, in cleanup_cluster
          rest = RestConnection(removed)
        File "lib/membase/api/rest_client.py", line 319, in __init__
          raise ServerUnavailableException(self.ip)
      ServerUnavailableException: Unable to reach the host @ 172.23.100.35
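
      For context, here is a minimal sketch of the retry behaviour the log lines above suggest; it is reconstructed from the logs, not from the actual rest_client/cluster_ready_functions source, and the function names, retry counts and sleeps below are illustrative assumptions. Each ejected node is polled with GET /nodes/self; a 404 ("Node is unknown to this cluster.") triggers a 5-second sleep and retry, three failures mark the node as broken, and cleanup_cluster retries once more after a 10-second sleep (the MB-8430 workaround) before raising ServerUnavailableException.

      # Sketch only; approximates the behaviour visible in the logs above,
      # not the framework's real implementation.
      import time
      import requests

      class ServerUnavailableException(Exception):
          pass

      def wait_for_node_self(ip, auth=("Administrator", "password"),
                             retries=3, delay=5):
          url = "http://%s:8091/nodes/self" % ip
          for _ in range(retries):
              resp = requests.get(url, auth=auth, timeout=30)
              if resp.status_code == 200:
                  return resp.json()
              # e.g. 404 "Node is unknown to this cluster." right after eject
              print("Error %s, %d seconds sleep before retry" % (resp.reason, delay))
              time.sleep(delay)
          print("Node %s:8091 is in a broken state!" % ip)
          raise ServerUnavailableException(ip)

      def connect_to_ejected_node(ip):
          try:
              return wait_for_node_self(ip)
          except ServerUnavailableException:
              # MB-8430 workaround seen in cleanup_cluster: one more attempt
              # after a 10-second sleep; the second failure propagates.
              time.sleep(10)
              return wait_for_node_self(ip)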
      

      ns_server log excerpt from node 172.23.100.35:

      [ns_server:debug,2022-03-06T21:12:03.979-08:00,ns_1@172.23.100.35:<0.17157.46>:ns_memcached:ensure_bucket_inner:1318]Bucket "default" not found during ensure_bucket
      [ns_server:error,2022-03-06T21:12:04.012-08:00,ns_1@172.23.100.35:<0.17160.46>:ns_server_stats:report_prom_stats:172]ns_server stats reporting exception: error:badarg
      [{ets,lookup,
            [ns_server_stats,{c,{<<"rest_request_enters">>,[]}}],
            [{error_info,#{cause => id,module => erl_stdlib_errors}}]},
       {ns_server_stats,'-report_ns_server_lc_stats/1-fun-0-',2,
                        [{file,"src/ns_server_stats.erl"},{line,257}]},
       {lists,foreach,2,[{file,"lists.erl"},{line,1342}]},
       {ns_server_stats,'-report_prom_stats/2-fun-0-',2,
                        [{file,"src/ns_server_stats.erl"},{line,170}]},
       {ns_server_stats,report_prom_stats,2,
                        [{file,"src/ns_server_stats.erl"},{line,180}]},
       {async,'-async_init/4-fun-1-',3,[{file,"src/async.erl"},{line,191}]}]
      [ns_server:error,2022-03-06T21:12:04.012-08:00,ns_1@172.23.100.35:<0.17160.46>:ns_server_stats:report_prom_stats:172]system stats reporting exception: exit:{noproc,
                                              {gen_server,call,
                                               [ns_server_stats,get_stats]}}
      [{gen_server,call,2,[{file,"gen_server.erl"},{line,239}]},
       {ns_server_stats,report_system_stats,1,
                        [{file,"src/ns_server_stats.erl"},{line,188}]},
       {ns_server_stats,'-report_prom_stats/2-fun-0-',2,
                        [{file,"src/ns_server_stats.erl"},{line,170}]},
       {ns_server_stats,report_prom_stats,2,
                        [{file,"src/ns_server_stats.erl"},{line,182}]},
       {async,'-async_init/4-fun-1-',3,[{file,"src/async.erl"},{line,191}]}]
      [ns_server:debug,2022-03-06T21:12:04.110-08:00,ns_1@172.23.100.35:<0.17158.46>:ns_memcached:ensure_bucket_inner:1318]Bucket "default" not found during ensure_bucket
      [ns_server:error,2022-03-06T21:12:04.324-08:00,ns_1@172.23.100.35:<0.17166.46>:ns_server_stats:report_prom_stats:172]ns_server stats reporting exception: error:badarg
      [{ets,safe_fixtable,
            [ns_server_stats,true],
            [{error_info,#{cause => id,module => erl_stdlib_errors}}]},
       {ets,foldl,3,[{file,"ets.erl"},{line,625}]},
       {ns_server_stats,report_ns_server_hc_stats,1,
                        [{file,"src/ns_server_stats.erl"},{line,264}]},
       {ns_server_stats,'-report_prom_stats/2-fun-0-',2,
                        [{file,"src/ns_server_stats.erl"},{line,170}]},
       {ns_server_stats,report_prom_stats,2,
                        [{file,"src/ns_server_stats.erl"},{line,178}]},
       {async,'-async_init/4-fun-1-',3,[{file,"src/async.erl"},{line,191}]}]
      [ns_server:error,2022-03-06T21:12:04.651-08:00,ns_1@172.23.100.35:<0.17058.46>:menelaus_util:reply_server_error_before_close:210]Server error during processing: ["web request failed",
                                       {path,"/pools/default"},
                                       {method,'GET'},
                                       {type,exit},
                                       {what,
                                        {noproc,
                                         {gen_server,call,
                                          ['service_status_keeper-index',
                                           get_version]}}},
                                       {trace,
                                        [{gen_server,call,2,
                                          [{file,"gen_server.erl"},{line,239}]},
                                         {menelaus_web_pools,do_build_pool_info,4,
                                          [{file,"src/menelaus_web_pools.erl"},
                                           {line,211}]},
                                         {menelaus_web_pools,pool_info,6,
                                          [{file,"src/menelaus_web_pools.erl"},
                                           {line,106}]},
                                         {menelaus_web_pools,handle_pool_info_wait,
                                          5,
                                          [{file,"src/menelaus_web_pools.erl"},
                                           {line,118}]},
                                         {request_tracker,request,2,
                                          [{file,"src/request_tracker.erl"},
                                           {line,40}]},
                                         {menelaus_util,handle_request,2,
                                          [{file,"src/menelaus_util.erl"},
                                           {line,221}]},
                                         {mochiweb_http,headers,6,
                                          [{file,
                                            "/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/mochiweb/mochiweb_http.erl"},
                                           {line,153}]},
                                         {proc_lib,init_p_do_apply,3,
                                          [{file,"proc_lib.erl"},{line,226}]}]}]
      
      

      This may be another side effect of MB-49512, which is also hit frequently during cleanups when buckets are not dropped completely.
      cbcollect_info attached.
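
      For anyone triaging a node stuck in this state, a quick manual probe of the two REST endpoints that appear in the logs can help confirm the symptom. This is only a diagnostic sketch (host and credentials are placeholders), not part of the test suite.

      # Probe the endpoints seen in the logs: on the broken node, /nodes/self
      # keeps returning 404 "Node is unknown to this cluster." and the
      # /pools/default request fails server-side (see the noproc errors in
      # the ns_server log excerpt above).
      import requests

      NODE = "172.23.100.35"
      AUTH = ("Administrator", "password")

      for path in ("/nodes/self", "/pools/default"):
          url = "http://%s:8091%s" % (NODE, path)
          try:
              resp = requests.get(url, auth=AUTH, timeout=10)
              print(url, resp.status_code, resp.text[:200])
          except requests.RequestException as exc:
              print(url, "request failed:", exc)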

      Attachments

        1. consoleText_MB-51319.txt
          1.98 MB
        2. debug.log.1
          40.00 MB
        3. latest-diag.log
          1.52 MB
        4. memcached.log.000021.txt
          7.94 MB
        5. node35.gdb.log
          789 kB

        Activity

          Balakumaran.Gopal Balakumaran Gopal created issue -
          Balakumaran.Gopal Balakumaran Gopal made changes -
          Summary changed from "Certain nodes are not accessible post cleanup of the cluster" to "Some nodes in the cluster are rendered unusable post cleanup"
          Balakumaran.Gopal Balakumaran Gopal made changes -
          Description updated (minor wording change; the full old and new text duplicate the description above and are omitted)
          Balakumaran.Gopal Balakumaran Gopal made changes -
          Attachment consoleText_MB-51319.txt [ 179076 ]
          Balakumaran.Gopal Balakumaran Gopal made changes -
          Description updated (added the test-framework log excerpt; the full old and new text duplicate the description above and are omitted)
          Balakumaran.Gopal Balakumaran Gopal made changes -
          Description +Script to Repro+
          {noformat}
          There is not a particular test that repro's this. Basically cleanup can fail for any test rendering one or more nodes in the cluster unusable.
          {noformat}

          Sequence of steps that led to the node being unusable.

          {noformat}
          2022-03-06 13:49:41,844 | test | INFO | MainThread | [basetestcase:log_setup_status:647] ========= BaseTestCase setup started for test #5 test_data_load_collections_with_graceful_failover_rebalance_out =========
          2022-03-06 13:50:22,311 | test | INFO | MainThread | [rest_client:monitorRebalance:1610] Rebalance done. Taken 11.0540001392 seconds to complete
          2022-03-06 13:50:22,319 | test | INFO | MainThread | [common_lib:sleep:23] Sleep 5 seconds. Reason: Wait after rebalance complete
          2022-03-06 13:50:27,359 | test | ERROR | MainThread | [rest_client:_http_request:834] GET http://172.23.100.35:8091/nodes/self body: headers: {'Accept': '*/*', 'Connection': 'close', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==', 'Content-Type': 'application/json'} error: 404 reason: unknown "Node is unknown to this cluster." auth: Administrator:password
          http://172.23.100.35:8091/nodes/self with status 0: Node is unknown to this cluster.
          2022-03-06 13:50:27,362 | test | ERROR | MainThread | [rest_client:__init__:312] Error Node is unknown to this cluster. was gotten, 5 seconds sleep before retry
          2022-03-06 13:50:32,378 | test | ERROR | MainThread | [rest_client:_http_request:834] GET http://172.23.100.35:8091/nodes/self body: headers: {'Accept': '*/*', 'Connection': 'close', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==', 'Content-Type': 'application/json'} error: 404 reason: unknown "Node is unknown to this cluster." auth: Administrator:password
          http://172.23.100.35:8091/nodes/self with status 0: Node is unknown to this cluster.
          2022-03-06 13:50:32,380 | test | ERROR | MainThread | [rest_client:__init__:312] Error Node is unknown to this cluster. was gotten, 5 seconds sleep before retry
          2022-03-06 13:50:37,394 | test | ERROR | MainThread | [rest_client:_http_request:834] GET http://172.23.100.35:8091/nodes/self body: headers: {'Accept': '*/*', 'Connection': 'close', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==', 'Content-Type': 'application/json'} error: 404 reason: unknown "Node is unknown to this cluster." auth: Administrator:password
          http://172.23.100.35:8091/nodes/self with status 0: Node is unknown to this cluster.
          2022-03-06 13:50:37,395 | test | ERROR | MainThread | [rest_client:__init__:312] Error Node is unknown to this cluster. was gotten, 5 seconds sleep before retry
          2022-03-06 13:50:42,404 | test | ERROR | MainThread | [rest_client:__init__:317] Node 172.23.100.35:8091 is in a broken state!
          2022-03-06 13:50:42,404 | test | ERROR | MainThread | [cluster_ready_functions:cleanup_cluster:232] Can't create rest connection after rebalance out for ejected nodes, will retry after 10 seconds according to MB-8430: Unable to reach the host @ 172.23.100.35
          2022-03-06 13:50:42,411 | test | INFO | MainThread | [common_lib:sleep:23] Sleep 10 seconds. Reason: MB-8430
          2022-03-06 13:50:52,420 | test | ERROR | MainThread | [rest_client:_http_request:834] GET http://172.23.100.35:8091/nodes/self body: headers: {'Accept': '*/*', 'Connection': 'close', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==', 'Content-Type': 'application/json'} error: 404 reason: unknown "Node is unknown to this cluster." auth: Administrator:password
          http://172.23.100.35:8091/nodes/self with status 0: Node is unknown to this cluster.
          2022-03-06 13:50:52,421 | test | ERROR | MainThread | [rest_client:__init__:312] Error Node is unknown to this cluster. was gotten, 5 seconds sleep before retry
          2022-03-06 13:50:57,436 | test | ERROR | MainThread | [rest_client:_http_request:834] GET http://172.23.100.35:8091/nodes/self body: headers: {'Accept': '*/*', 'Connection': 'close', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==', 'Content-Type': 'application/json'} error: 404 reason: unknown "Node is unknown to this cluster." auth: Administrator:password
          http://172.23.100.35:8091/nodes/self with status 0: Node is unknown to this cluster.
          2022-03-06 13:50:57,437 | test | ERROR | MainThread | [rest_client:__init__:312] Error Node is unknown to this cluster. was gotten, 5 seconds sleep before retry
          2022-03-06 13:51:02,453 | test | ERROR | MainThread | [rest_client:_http_request:834] GET http://172.23.100.35:8091/nodes/self body: headers: {'Accept': '*/*', 'Connection': 'close', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==', 'Content-Type': 'application/json'} error: 404 reason: unknown "Node is unknown to this cluster." auth: Administrator:password
          http://172.23.100.35:8091/nodes/self with status 0: Node is unknown to this cluster.
          2022-03-06 13:51:02,456 | test | ERROR | MainThread | [rest_client:__init__:312] Error Node is unknown to this cluster. was gotten, 5 seconds sleep before retry
          2022-03-06 13:51:07,463 | test | ERROR | MainThread | [rest_client:__init__:317] Node 172.23.100.35:8091 is in a broken state!
          Traceback (most recent call last):
            File "pytests/basetestcase.py", line 363, in setUp
              self.cluster_util.cluster_cleanup(cluster,
            File "pytests/basetestcase.py", line 363, in setUp
              self.cluster_util.cluster_cleanup(cluster,
            File "couchbase_utils/cluster_utils/cluster_ready_functions.py", line 169, in cluster_cleanup
              self.cleanup_cluster(cluster, master=cluster.master)
            File "couchbase_utils/cluster_utils/cluster_ready_functions.py", line 237, in cleanup_cluster
              rest = RestConnection(removed)
            File "lib/membase/api/rest_client.py", line 319, in __init__
              raise ServerUnavailableException(self.ip)
          ServerUnavailableException: Unable to reach the host @ 172.23.100.35
          {noformat}

          +172.23.100.35+
          {noformat}
          [ns_server:debug,2022-03-06T21:12:03.979-08:00,ns_1@172.23.100.35:<0.17157.46>:ns_memcached:ensure_bucket_inner:1318]Bucket "default" not found during ensure_bucket
          [ns_server:error,2022-03-06T21:12:04.012-08:00,ns_1@172.23.100.35:<0.17160.46>:ns_server_stats:report_prom_stats:172]ns_server stats reporting exception: error:badarg
          [{ets,lookup,
                [ns_server_stats,{c,{<<"rest_request_enters">>,[]}}],
                [{error_info,#{cause => id,module => erl_stdlib_errors}}]},
           {ns_server_stats,'-report_ns_server_lc_stats/1-fun-0-',2,
                            [{file,"src/ns_server_stats.erl"},{line,257}]},
           {lists,foreach,2,[{file,"lists.erl"},{line,1342}]},
           {ns_server_stats,'-report_prom_stats/2-fun-0-',2,
                            [{file,"src/ns_server_stats.erl"},{line,170}]},
           {ns_server_stats,report_prom_stats,2,
                            [{file,"src/ns_server_stats.erl"},{line,180}]},
           {async,'-async_init/4-fun-1-',3,[{file,"src/async.erl"},{line,191}]}]
          [ns_server:error,2022-03-06T21:12:04.012-08:00,ns_1@172.23.100.35:<0.17160.46>:ns_server_stats:report_prom_stats:172]system stats reporting exception: exit:{noproc,
                                                  {gen_server,call,
                                                   [ns_server_stats,get_stats]}}
          [{gen_server,call,2,[{file,"gen_server.erl"},{line,239}]},
           {ns_server_stats,report_system_stats,1,
                            [{file,"src/ns_server_stats.erl"},{line,188}]},
           {ns_server_stats,'-report_prom_stats/2-fun-0-',2,
                            [{file,"src/ns_server_stats.erl"},{line,170}]},
           {ns_server_stats,report_prom_stats,2,
                            [{file,"src/ns_server_stats.erl"},{line,182}]},
           {async,'-async_init/4-fun-1-',3,[{file,"src/async.erl"},{line,191}]}]
          [ns_server:debug,2022-03-06T21:12:04.110-08:00,ns_1@172.23.100.35:<0.17158.46>:ns_memcached:ensure_bucket_inner:1318]Bucket "default" not found during ensure_bucket
          [ns_server:error,2022-03-06T21:12:04.324-08:00,ns_1@172.23.100.35:<0.17166.46>:ns_server_stats:report_prom_stats:172]ns_server stats reporting exception: error:badarg
          [{ets,safe_fixtable,
                [ns_server_stats,true],
                [{error_info,#{cause => id,module => erl_stdlib_errors}}]},
           {ets,foldl,3,[{file,"ets.erl"},{line,625}]},
           {ns_server_stats,report_ns_server_hc_stats,1,
                            [{file,"src/ns_server_stats.erl"},{line,264}]},
           {ns_server_stats,'-report_prom_stats/2-fun-0-',2,
                            [{file,"src/ns_server_stats.erl"},{line,170}]},
           {ns_server_stats,report_prom_stats,2,
                            [{file,"src/ns_server_stats.erl"},{line,178}]},
           {async,'-async_init/4-fun-1-',3,[{file,"src/async.erl"},{line,191}]}]
          [ns_server:error,2022-03-06T21:12:04.651-08:00,ns_1@172.23.100.35:<0.17058.46>:menelaus_util:reply_server_error_before_close:210]Server error during processing: ["web request failed",
                                           {path,"/pools/default"},
                                           {method,'GET'},
                                           {type,exit},
                                           {what,
                                            {noproc,
                                             {gen_server,call,
                                              ['service_status_keeper-index',
                                               get_version]}}},
                                           {trace,
                                            [{gen_server,call,2,
                                              [{file,"gen_server.erl"},{line,239}]},
                                             {menelaus_web_pools,do_build_pool_info,4,
                                              [{file,"src/menelaus_web_pools.erl"},
                                               {line,211}]},
                                             {menelaus_web_pools,pool_info,6,
                                              [{file,"src/menelaus_web_pools.erl"},
                                               {line,106}]},
                                             {menelaus_web_pools,handle_pool_info_wait,
                                              5,
                                              [{file,"src/menelaus_web_pools.erl"},
                                               {line,118}]},
                                             {request_tracker,request,2,
                                              [{file,"src/request_tracker.erl"},
                                               {line,40}]},
                                             {menelaus_util,handle_request,2,
                                              [{file,"src/menelaus_util.erl"},
                                               {line,221}]},
                                             {mochiweb_http,headers,6,
                                              [{file,
                                                "/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/mochiweb/mochiweb_http.erl"},
                                               {line,153}]},
                                             {proc_lib,init_p_do_apply,3,
                                              [{file,"proc_lib.erl"},{line,226}]}]}]

          {noformat}

          Maybe it is another side effect of MB-49512 which is also hit frequently during cleanups related to bucket not being dropped completely.
          cbcollect_info attached.
          Balakumaran.Gopal Balakumaran Gopal made changes -
          Labels affects-neo-testing functional-test affects-neo-testing functional-test magma
          Balakumaran.Gopal Balakumaran Gopal made changes -
          Assignee Balakumaran Gopal [ balakumaran.gopal ] Daniel Owen [ owend ]
          owend Daniel Owen made changes -
          Status Open [ 1 ] In Progress [ 3 ]
          owend Daniel Owen added a comment - - edited

          Hi Balakumaran Gopal,

          On .35, in memcached.log, we see the last delete of the "default" bucket:

          2022-03-06T11:54:09.830406-08:00 INFO <none>: Delete bucket [default]. Notifying engine
          2022-03-06T11:54:09.837324-08:00 INFO <none>: Delete bucket [default]. Engine ready for shutdown
          2022-03-06T11:54:09.837327-08:00 INFO <none>: Delete bucket [default]. Wait for 6 clients to disconnect
          2022-03-06T11:54:09.838533-08:00 INFO <none>: Delete bucket [default]. Shut down the bucket
          2022-03-06T11:54:14.502380-08:00 INFO (default) Deleted KvBucket.
          2022-03-06T11:54:16.056152-08:00 INFO <none>: Delete bucket [default]. Clean up allocated resources
          2022-03-06T11:54:16.056200-08:00 INFO <none>: Delete bucket [default] complete
          

          Looking at the 2022-03-07/172.23.100.35.zip logs, the last create of the "default" bucket is:

          2022-03-06T11:55:30.125015-08:00 INFO 96: Create Couchbase bucket [default]
          

          It appears Warmup completed successfully

          2022-03-06T11:55:30.147945-08:00 INFO (default) Warmup::processCreateVBucketsComplete unblocking 1 cookie(s)
          2022-03-06T11:55:30.148367-08:00 INFO (default) Warmup completed: 0 keys and 0 values loaded in 4231 us (0 keys/s), mem_used now at 4.588072 MB (1084.349639130889 MB/s)
          

          However, we can see from ns_server.debug.log that ns_server appears stuck:

          bash-5.1$ grep "Bucket \"default\" not found during ensure_bucket" ns_server.debug.log  | grep "2022-03-06T12"  | wc -l
                 0
          bash-5.1$ grep "Bucket \"default\" not found during ensure_bucket" ns_server.debug.log  | grep "2022-03-06T13"  | wc -l
              3710
          bash-5.1$ grep "Bucket \"default\" not found during ensure_bucket" ns_server.debug.log  | grep "2022-03-06T14"  | wc -l
             15766
          bash-5.1$ grep "Bucket \"default\" not found during ensure_bucket" ns_server.debug.log  | grep "2022-03-06T15"  | wc -l
             14758
          bash-5.1$ grep "Bucket \"default\" not found during ensure_bucket" ns_server.debug.log  | grep "2022-03-06T16"  | wc -l
             14758
          bash-5.1$ grep "Bucket \"default\" not found during ensure_bucket" ns_server.debug.log  | grep "2022-03-06T17"  | wc -l
             14632
          bash-5.1$ grep "Bucket \"default\" not found during ensure_bucket" ns_server.debug.log  | grep "2022-03-06T18"  | wc -l
             13969
          bash-5.1$ grep "Bucket \"default\" not found during ensure_bucket" ns_server.debug.log  | grep "2022-03-06T19" | wc -l
             13680
          bash-5.1$ grep "Bucket \"default\" not found during ensure_bucket" ns_server.debug.log  | grep "2022-03-06T20" | wc -l
             13425
          bash-5.1$ grep "Bucket \"default\" not found during ensure_bucket" ns_server.debug.log  | grep "2022-03-06T21" | wc -l
              5716
          
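          For reference, the same per-hour tally can be produced in a single pass; a minimal sketch, assuming GNU grep is available on the node:

          # count occurrences of the ensure_bucket message per date+hour
          grep 'Bucket "default" not found during ensure_bucket' ns_server.debug.log | grep -o '2022-03-06T[0-9][0-9]' | sort | uniq -c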

          From the KV side it appears traffic has been enabled:

          2022-03-06T11:55:31.106216-08:00 INFO (default) EventuallyPersistentEngine::enableTraffic() result true
          2022-03-06T11:55:31.106224-08:00 INFO (default) EventuallyPersistentEngine::handleTrafficControlCmd() Data traffic to persistence engine is enabled
          

          owend Daniel Owen added a comment -

          In the ns_server.debug.log we are seeing:

          [ns_server:error,2022-03-06T21:25:34.012-08:00,ns_1@172.23.100.35:<0.8764.47>:ns_server_stats:report_prom_stats:172]ns_server stats reporting exception: error:badarg
          [{ets,lookup,
                [ns_server_stats,{c,{<<"rest_request_enters">>,[]}}],
                [{error_info,#{cause => id,module => erl_stdlib_errors}}]},
           {ns_server_stats,'-report_ns_server_lc_stats/1-fun-0-',2,
                            [{file,"src/ns_server_stats.erl"},{line,257}]},
           {lists,foreach,2,[{file,"lists.erl"},{line,1342}]},
           {ns_server_stats,'-report_prom_stats/2-fun-0-',2,
                            [{file,"src/ns_server_stats.erl"},{line,170}]},
           {ns_server_stats,report_prom_stats,2,
                            [{file,"src/ns_server_stats.erl"},{line,180}]},
           {async,'-async_init/4-fun-1-',3,[{file,"src/async.erl"},{line,191}]}]
          [ns_server:error,2022-03-06T21:25:34.012-08:00,ns_1@172.23.100.35:<0.8764.47>:ns_server_stats:report_prom_stats:172]system stats reporting exception: exit:{noproc,
                                                  {gen_server,call,
                                                   [ns_server_stats,get_stats]}}
          [{gen_server,call,2,[{file,"gen_server.erl"},{line,239}]},
           {ns_server_stats,report_system_stats,1,
                            [{file,"src/ns_server_stats.erl"},{line,188}]},
           {ns_server_stats,'-report_prom_stats/2-fun-0-',2,
                            [{file,"src/ns_server_stats.erl"},{line,170}]},
           {ns_server_stats,report_prom_stats,2,
                            [{file,"src/ns_server_stats.erl"},{line,182}]},
           {async,'-async_init/4-fun-1-',3,[{file,"src/async.erl"},{line,191}]}]
          

          The issue appears similar to MB-50962, where there was also a problem deleting a bucket, and we do see:

          [ns_server:info,2022-03-06T11:54:09.160-08:00,ns_1@172.23.100.35:ns_memcached-default<0.1374.23>:ns_memcached:delete_bucket:876]Deleting bucket "default" from memcached (force = true)
          [ns_server:error,2022-03-06T11:54:09.163-08:00,ns_1@172.23.100.35:ns_memcached-default<0.1374.23>:ns_memcached:delete_bucket:883]Failed to delete bucket "default": {error,{badmatch,{error,closed}}}
          

          However, the underlying cause looks to be different, as in this case KV believes it has successfully deleted the bucket.
          Currently this looks to be an ns_server issue.
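
          Both exceptions point at the ns_server_stats process (and its ETS table of the same name) no longer existing; a hypothetical way to confirm that, following the diag/eval pattern used later in this ticket:

          # hypothetical checks: both should return 'undefined' if the process/table is gone
          curl -s localhost:8091/diag/eval -u Administrator:password -d "whereis(ns_server_stats)"
          curl -s localhost:8091/diag/eval -u Administrator:password -d "ets:info(ns_server_stats, size)"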

          owend Daniel Owen made changes -
          Component/s couchbase-bucket [ 10173 ]
          Component/s ns_server [ 10019 ]
          owend Daniel Owen made changes -
          Status In Progress [ 3 ] Open [ 1 ]
          owend Daniel Owen made changes -
          Assignee Daniel Owen [ owend ] Dave Finlay [ dfinlay ]
          dfinlay Dave Finlay added a comment - - edited

          The problem looks to be that KV hasn't returned from deleting bucket2 after .35 was failed over at 12:56:14. (These logs are in latest-diag.log which I manually grabbed from the cluster and will attach.)

          2022-03-06T12:56:13.974-08:00, failover:0:info:message(ns_1@172.23.105.164) - Starting failing over ['ns_1@172.23.100.35']
          2022-03-06T12:56:13.974-08:00, ns_orchestrator:0:info:message(ns_1@172.23.105.164) - Starting failover of nodes ['ns_1@172.23.100.35']. Operation Id = 29f343f24f75cfd315eb971c4e193d8d
          2022-03-06T12:56:14.311-08:00, ns_memcached:0:info:message(ns_1@172.23.100.35) - Shutting down bucket "bucket1" on 'ns_1@172.23.100.35' for deletion
          2022-03-06T12:56:14.455-08:00, ns_memcached:0:info:message(ns_1@172.23.100.35) - Shutting down bucket "bucket2" on 'ns_1@172.23.100.35' for deletion
          2022-03-06T12:56:14.590-08:00, failover:0:info:message(ns_1@172.23.105.164) - Failed over ['ns_1@172.23.100.35']: ok
          2022-03-06T12:56:14.596-08:00, failover:0:info:message(ns_1@172.23.105.164) - Deactivating failed over nodes ['ns_1@172.23.100.35']
          2022-03-06T12:56:14.762-08:00, ns_orchestrator:0:info:message(ns_1@172.23.105.164) - Failover completed successfully.
          

          We can see that bucket shutdown has been initiated for bucket2.

          30 seconds later a "slow bucket stop" diag trace is triggered and we can see bucket2 is waiting for a response from memcached for the delete_bucket command. Note that the name of the process is 'ns_memcached-bucket2' with pid <0.3068.0>. (These traces are from debug.log.1 which I manually grabbed from the cluster and will attach.)

          [ns_server:error,2022-03-06T12:56:44.618-08:00,ns_1@172.23.100.35:<0.32666.16>:timeout_diag_logger:do_diag:123]
          {<0.3068.0>,
           [{backtrace,[<<"Program counter: 0x00007efd8184c578 (prim_inet:recv0/3 + 240)">>,
                        <<>>,<<"0x00007efcc046b8b0 53">>,<<>>,
                        <<"0x00007efcc046b8b8 #Port<0.201>">>,<<>>,
                        <<"0x00007efcc046b8c0 Return addr 0x00007efccae86f2c (mc_binary:recv/3 + 228)">>,
                        <<"y(0)     infinity">>,<<"y(1)     res">>,
                        <<"y(2)     #Port<0.201>">>,<<>>,
                        <<"0x00007efcc046b8e0 Return addr 0x00007efccae7b300 (mc_client_binary:cmd_vocal_recv/5 + 144)">>,
                        <<"y(0)     []">>,<<"y(1)     undefined">>,
                        <<"y(2)     undefined">>,<<"y(3)     134">>,<<>>,
                        <<"0x00007efcc046b908 Return addr 0x00007efccae7c1f4 (mc_client_binary:delete_bucket/3 + 420)">>,
                        <<"y(0)     []">>,<<"y(1)     []">>,<<>>,
                        <<"0x00007efcc046b920 Return addr 0x00007efccacc612c (ns_memcached:delete_bucket/4 + 372)">>,
                        <<"y(0)     false">>,<<"y(1)     []">>,
                        <<"y(2)     \"bucket2\"">>,<<"y(3)     []">>,
                        <<"y(4)     Catch 0x00007efccacc6157 (ns_memcached:delete_bucket/4 + 415)">>,
                        <<"y(5)     Catch 0x00007efccacc62a4 (ns_memcached:delete_bucket/4 + 748)">>,
                        <<>>,
                        <<"0x00007efcc046b958 Return addr 0x00007efccacc5a80 (ns_memcached:terminate/2 + 232)">>,
                        <<"y(0)     []">>,<<"y(1)     []">>,<<"y(2)     \"bucket2\"">>,
                        <<"y(3)     Catch 0x00007efccacc5b3f (ns_memcached:terminate/2 + 423)">>,
                        <<>>,
          ...
            {messages,[<<"{'$gen_call',{<0.31320.16>,[alias|#Ref<0.1760518760.1410138119.215166>]},warmed}">>,
                       <<"check_config">>,
                       <<"{'$gen_call',{<0.31792.16>,[alias|#Ref<0.1760518760.1409875976.139592>]},warmed}">>,
                       <<"{'$gen_call',{<0.32230.16>,[alias|#Ref<0.1760518760.1411448833.201399>]},warmed}">>]},
           {dictionary,[{'$initial_call',{ns_memcached,init,1}},
                         {'$ancestors',[<0.3067.0>,'single_bucket_kv_sup-bucket2',
                                        ns_bucket_sup,ns_bucket_worker_sup,
                                        ns_server_sup,ns_server_nodes_sup,<0.1752.0>,
                                        ns_server_cluster_sup,root_sup,<0.145.0>]}]},
            {registered_name,'ns_memcached-bucket2'},
          

          Approx 45 mins later .35 is added back into the cluster. This fails because we're still waiting to delete bucket2:

          2022-03-06T13:42:49.784-08:00, ns_orchestrator:0:info:message(ns_1@172.23.105.164) - Starting rebalance, KeepNodes = ['ns_1@172.23.105.206','ns_1@172.23.106.177',
                                           'ns_1@172.23.100.35','ns_1@172.23.105.164'], EjectNodes = [], Failed over and being ejected nodes = [], Delta recovery nodes = ['ns_1@172.23.100.35'],  Delta recovery b\
          uckets = all; Operation Id = e3e3a85f0e65f39a1b9eae473faeeed4
          2022-03-06T13:44:49.789-08:00, ns_rebalancer:0:critical:message(ns_1@172.23.105.164) - Failed to wait deletion of some buckets on some nodes: [{'ns_1@172.23.100.35',
                                                                   {'EXIT',
                                                                    {old_buckets_shutdown_wait_failed,
                                                                     ["default",
                                                                      "bucket2"]}}}]
          

          We know it's still the same process that's waiting to terminate, and that it's stuck on the same backtrace, as I ran this diag/eval on the cluster just now:

          # date; curl -s localhost:8091/diag/eval -u Administrator:password -d "Pid = whereis('ns_memcached-bucket2')"
          Mon Mar  7 05:25:03 PST 2022
          <0.3068.0>
          ...
          # date; curl -s localhost:8091/diag/eval -u Administrator:password -d "Pid = whereis('ns_memcached-bucket2'), erlang:process_info(Pid, backtrace)"
          Mon Mar  7 05:23:30 PST 2022
          {backtrace,<<"Program counter: 0x00007efd8184c578 (prim_inet:recv0/3 + 240)\n\n0x00007efcc046b8b0 53\n\n0x00007efcc046b8b8 #Port<0.201>\n\n0x00007efcc046b8c0 Return addr 0x00007efccae86f2c (mc_binary:recv/3 + 228)\ny(0)     infinity\ny(1)     res\ny(2)     #Port<0.201>\n\n0x00007efcc046b8e0 Return addr 0x00007efccae7b300 (mc_client_binary:cmd_vocal_recv/5 + 144)\ny(0)     []\ny(1)     undefined\ny(2)     undefined\ny(3)     134\n\n0x00007efcc046b908 Return addr 0x00007efccae7c1f4 (mc_client_binary:delete_bucket/3 + 420)\ny(0)     []\ny(1)     []\n\n0x00007efcc046b920 Return addr 0x00007efccacc612c (ns_memcached:delete_bucket/4 + 372)\ny(0)     false\ny(1)     []\ny(2)     \"bucket2\"\ny(3)     []\ny(4)     Catch 0x00007efccacc6157 (ns_memcached:delete_bucket/4 + 415)\ny(5)     Catch 0x00007efccacc62a4 (ns_memcached:delete_bucket/4 + 748)\n\n0x00007efcc046b958 Return addr 0x00007efccacc5a80 (ns_memcached:terminate/2 + 232)\ny(0)     []\ny(1)     []\ny(2)     \"bucket2\"\ny(3)     Catch 0x00007efccacc5b3f (ns_memcached:terminate/2 + 423)\n\n0x00007efcc046b980 Return addr 0x00007efd81a120f4 (gen_server:try_terminate/3 + 332)\ny(0)     []\ny(1)     []\ny(2)     []\ny(3)     Catch 0x00007efd81a1214a (gen_server:try_terminate/3 + 418)\n\n0x00007efcc046b9a8 Return addr 0x00007efd81a14738 (gen_server:terminate/10 + 192)\ny(0)     []\ny(1)     []\ny(2)     []\ny(3)     {state,0,0,0,{[],[]},{[],[]},{[],[]},warmed,{1646,596522,559857},\"bucket2\",[collections,json],#Port<0.201>,[{<0.3102.0>,#Ref<0.1760518760.1411383297.200086>},{<0.3099.0>,#Ref<0.1760518760.1409548295.56458>},{<0.3101.0>,#Ref<0.1760518760.1409548293.57702>},{<0.3100.0>,#Ref<0.1760518760.1409548295.56455>}],[],<0.3069.0>,false,10000}\ny(4)     ns_memcached\ny(5)     {'EXIT',<0.3067.0>,shutdown}\ny(6)     undefined\ny(7)     'ns_memcached-bucket2'\ny(8)     shutdown\ny(9)     [{gen_server,decode_msg,9,[{file,\"gen_server.erl\"},{line,481}]},{proc_lib,init_p_do_apply,3,[{file,\"proc_lib.erl\"},{line,226}]}]\ny(10)    shutdown\ny(11)    exit\n\n0x00007efcc046ba10 Return addr 0x00007efd819416d0 (proc_lib:init_p_do_apply/3 + 192)\ny(0)     []\ny(1)     []\ny(2)     Catch 0x00007efd819416f0 (proc_lib:init_p_do_apply/3 + 224)\n\n0x00007efcc046ba30 Return addr 0x00007efd818308c4 (unknown function)\n\n0x00007efcc046ba38 Return addr 0x00007efd818308c8 (<terminate process normally>)\n">>}
          

          The reason the node isn't accessible is that it has been removed from the cluster and has only partly shut down.

          In memcached we see similar evidence: bucket1 is deleted, and the deletion of bucket2 begins but never completes. (The log traces are from memcached.log.00021 which I manually grabbed and will attach.)

          2022-03-06T12:56:14.313614-08:00 INFO 2663: Delete bucket [bucket1]. Wait for 254 clients to disconnect
          2022-03-06T12:56:14.405602-08:00 INFO 2663: Delete bucket [bucket1] complete
          2022-03-06T12:56:14.456964-08:00 INFO 67: Delete bucket [bucket2].
          

          Will assign to KV to look at the deletion of bucket2. (Note the cluster is still up so it may be possible to get a backtrace from Memcached to understand where the delete bucket is stuck.)

          dfinlay Dave Finlay made changes -
          Assignee Dave Finlay [ dfinlay ] Daniel Owen [ owend ]
          dfinlay Dave Finlay made changes -
          Component/s couchbase-bucket [ 10173 ]
          dfinlay Dave Finlay made changes -
          Attachment latest-diag.log [ 179095 ]
          Attachment debug.log.1 [ 179094 ]
          Attachment memcached.log.000021.txt [ 179093 ]
          owend Daniel Owen added a comment - - edited

          Thank you Dave Finlay for your help and analysis.
          Looking at what happens to bucket2 on .35 we see

          2022-03-06T12:56:14.455819-08:00 INFO 67: Delete bucket [bucket2]. Notifying engine
          2022-03-06T12:56:14.455836-08:00 INFO (bucket2) Shutting down all DCP connections in preparation for bucket deletion.
          2022-03-06T12:56:14.455840-08:00 INFO (bucket2) Shutting down dcp connections!
          2022-03-06T12:56:14.455885-08:00 INFO 1433: (bucket2) DCP (Producer) eq_dcpq:replication:ns_1@172.23.100.35->ns_1@172.23.106.177:bucket2 - Destroying connection. Created 3646 s ago. Paused 18501 times, for 1h:0m:46s total. Details: {"Initializing": {"count":1, "duration":"55 us"},"ReadyListEmpty": {"count":18500, "duration":"1h:0m:46s"}}
          2022-03-06T12:56:14.455934-08:00 INFO 1438: (bucket2) DCP (Producer) eq_dcpq:replication:ns_1@172.23.100.35->ns_1@172.23.105.206:bucket2 - Destroying connection. Created 3646 s ago. Paused 17701 times, for 1h:0m:46s total. Details: {"Initializing": {"count":1, "duration":"36 us"},"ReadyListEmpty": {"count":17700, "duration":"1h:0m:46s"}}
          2022-03-06T12:56:14.455977-08:00 INFO 1436: (bucket2) DCP (Producer) eq_dcpq:replication:ns_1@172.23.100.35->ns_1@172.23.105.164:bucket2 - Destroying connection. Created 3646 s ago. Paused 16684 times, for 1h:0m:46s total. Details: {"Initializing": {"count":1, "duration":"37 us"},"ReadyListEmpty": {"count":16683, "duration":"1h:0m:46s"}}
          2022-03-06T12:56:14.456010-08:00 INFO 1442: (bucket2) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.106.177->ns_1@172.23.100.35:bucket2 - Destroying connection. Created 3646 s ago. Paused 13323 times, for 1h:0m:46s total. Details: {"ReadyListEmpty": {"count":13323, "duration":"1h:0m:46s"}}
          2022-03-06T12:56:14.456038-08:00 INFO 1432: (bucket2) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.105.206->ns_1@172.23.100.35:bucket2 - Destroying connection. Created 3646 s ago. Paused 13312 times, for 1h:0m:46s total. Details: {"ReadyListEmpty": {"count":13312, "duration":"1h:0m:46s"}}
          2022-03-06T12:56:14.456060-08:00 INFO 1405: (bucket2) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.105.164->ns_1@172.23.100.35:bucket2 - Destroying connection. Created 3646 s ago. Paused 13763 times, for 1h:0m:47s total. Details: {"ReadyListEmpty": {"count":13763, "duration":"1h:0m:47s"}}
          2022-03-06T12:56:14.456946-08:00 INFO 67: Delete bucket [bucket2]. Engine ready for shutdown
          2022-03-06T12:56:14.456964-08:00 INFO 67: Delete bucket [bucket2]. Wait for 255 clients to disconnect
          2022-03-06T12:56:14.461984-08:00 INFO 67: Delete bucket [bucket2]. Shut down the bucket
          2022-03-06T12:56:14.461999-08:00 INFO (bucket2) Shutting down dcp connections!
          2022-03-06T12:56:14.462005-08:00 INFO (bucket2) Attempting to stop flusher:0
          2022-03-06T12:56:14.462054-08:00 INFO (bucket2) Attempting to stop flusher:1
          2022-03-06T12:56:14.462057-08:00 INFO (bucket2) Attempting to stop flusher:2
          2022-03-06T12:56:14.462062-08:00 INFO (bucket2) Attempting to stop flusher:3
          2022-03-06T12:56:14.462131-08:00 INFO (bucket2) Attempting to stop flusher:4
          2022-03-06T12:56:14.462135-08:00 INFO (bucket2) Attempting to stop flusher:5
          2022-03-06T12:56:14.462140-08:00 INFO (bucket2) Attempting to stop flusher:6
          2022-03-06T12:56:14.462146-08:00 INFO (bucket2) Attempting to stop flusher:7
          2022-03-06T12:56:14.462161-08:00 INFO (bucket2) Stopping bg fetchers
          2022-03-06T12:56:14.462760-08:00 INFO (bucket2) KVBucket::deinitialize forceShutdown:true
          2022-03-06T12:56:14.475861-08:00 INFO (bucket2) MagmaKVStore: 0 deinitializing
          2022-03-06T12:56:14.475878-08:00 INFO [(bucket2) magma_0]Closing magma db (thread pool refcount 16)
          2022-03-06T12:56:14.478263-08:00 INFO [(bucket2) magma_0/kvstore-0/rev-000000001]KVStore::Shutdown
          ..
          2022-03-06T12:56:14.664401-08:00 INFO [(bucket2) magma_0]Completed closing magma db
          2022-03-06T12:56:14.665002-08:00 INFO (bucket2) MagmaKVStore: 0 deinitialized
          2022-03-06T12:56:14.665011-08:00 INFO (bucket2) MagmaKVStore: 1 deinitializing
          2022-03-06T12:56:14.665017-08:00 INFO [(bucket2) magma_1]Closing magma db (thread pool refcount 15)
          2022-03-06T12:56:14.666930-08:00 INFO [(bucket2) magma_1/kvstore-1/rev-000000001]KVStore::Shutdown
          ..
          2022-03-06T12:56:14.929878-08:00 INFO [(bucket2) magma_2]Completed closing magma db
          2022-03-06T12:56:14.930431-08:00 INFO (bucket2) MagmaKVStore: 2 deinitialized
          2022-03-06T12:56:14.930446-08:00 INFO (bucket2) MagmaKVStore: 3 deinitializing
          2022-03-06T12:56:14.930450-08:00 INFO [(bucket2) magma_3]Closing magma db (thread pool refcount 13)
          2022-03-06T12:56:14.932092-08:00 INFO [(bucket2) magma_3/kvstore-3/rev-000000001]KVStore::Shutdown
          

          We DO NOT see

          INFO [(bucket2) magma_3]Completed closing magma db
          INFO (bucket2) MagmaKVStore: 3 deinitialized
          

          owend Daniel Owen made changes -
          Component/s ns_server [ 10019 ]
          Component/s storage-engine [ 10175 ]
          owend Daniel Owen made changes -
          Assignee Daniel Owen [ owend ] Sarath Lakshman [ sarath ]
          owend Daniel Owen made changes -
          Assignee Sarath Lakshman [ sarath ] Paolo Cocchi [ paolo.cocchi ]
          owend Daniel Owen added a comment -

          Quick grep of the code shows:

          magma/db.cc:        LOG_INFO("Completed closing magma db");
          

          It looks like the issue is that we never reach "Completed", i.e. the magma db for shard 3 is never fully closed.
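
          A quick way to confirm which shard never finished is to compare how many shard closes were started versus completed; a hypothetical check against the memcached log attached earlier in this ticket:

          # hypothetical: shard closes started vs completed for bucket2
          grep -c '(bucket2) magma_[0-9]*]Closing magma db' memcached.log.000021.txt
          grep -c '(bucket2) magma_[0-9]*]Completed closing magma db' memcached.log.000021.txt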

          owend Daniel Owen added a comment -

          Hey Paolo Cocchi, could you take a look at the live cluster and do a gdb attach on 172.23.100.35 to see where we are hanging?

          thanks

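          For reference, a capture along these lines would produce a thread-backtrace file like the node35.gdb.log attached below; this is only a sketch, assuming gdb and the memcached debug symbols are installed on the node:

          # attach to the running memcached, dump all thread backtraces, then exit
          gdb -p "$(pgrep -x memcached)" -batch -ex "thread apply all bt" > node35.gdb.log 2>&1
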
          paolo.cocchi Paolo Cocchi made changes -
          Attachment node35.gdb.log [ 179104 ]
          paolo.cocchi Paolo Cocchi added a comment - - edited

          Last logging for bucket2:

          ..
          2022-03-06T12:56:14.929567-08:00 INFO [(bucket2) magma_2/kvstore-850/rev-000000001]KVStore::Shutdown
          2022-03-06T12:56:14.929878-08:00 INFO [(bucket2) magma_2]Completed closing magma db
          2022-03-06T12:56:14.930431-08:00 INFO (bucket2) MagmaKVStore: 2 deinitialized
          2022-03-06T12:56:14.930446-08:00 INFO (bucket2) MagmaKVStore: 3 deinitializing
          2022-03-06T12:56:14.930450-08:00 INFO [(bucket2) magma_3]Closing magma db (thread pool refcount 13)
          2022-03-06T12:56:14.932092-08:00 INFO [(bucket2) magma_3/kvstore-3/rev-000000001]KVStore::Shutdown
          2022-03-06T12:56:14.933891-08:00 INFO [(bucket2) magma_3/kvstore-11/rev-000000001]KVStore::Shutdown
          2022-03-06T12:56:14.935874-08:00 INFO [(bucket2) magma_3/kvstore-19/rev-000000001]KVStore::Shutdown
          2022-03-06T12:56:14.937778-08:00 INFO [(bucket2) magma_3/kvstore-27/rev-000000001]KVStore::Shutdown
          ..
          2022-03-06T12:56:15.054037-08:00 INFO [(bucket2) magma_3/kvstore-851/rev-000000001]KVStore::Shutdown
          

          So it seems that magma has just completed shutdown for shard-2, but shard-3 is still shutting down.

          Thread 25 seems stuck in magma::Magma::Impl::Close():

          Thread 25 (Thread 0x7f9e53fff700 (LWP 123256)):
          #0  0x00007f9ec7cb9a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
          #1  0x00007f9ec5ea88bc in __gthread_cond_wait (__mutex=<optimized out>, __cond=__cond@entry=0x7f9ec01d8ba0)
              at /tmp/deploy/objdir/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:865
          #2  std::condition_variable::wait (this=this@entry=0x7f9ec01d8ba0, __lock=...) at /tmp/deploy/objdir/../gcc-10.2.0/libstdc++-v3/src/c++11/condition_variable.cc:53
          #3  0x000000000098bfcc in magma::WaitGroup::Wait (this=this@entry=0x7f9ec01d8b68) at /home/couchbase/jenkins/workspace/couchbase-server-unix/magma/util/common.cc:74
          #4  0x000000000091e681 in magma::Magma::Impl::Close() () at /home/couchbase/jenkins/workspace/couchbase-server-unix/magma/magma/db.cc:204
          #5  0x000000000091e8c8 in magma::Magma::Close (this=<optimized out>) at /home/couchbase/jenkins/workspace/couchbase-server-unix/magma/magma/db.cc:216
          #6  0x0000000000887c1b in MagmaMemoryTrackingProxy::Close() ()
              at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/kvstore/magma-kvstore/magma-memory-tracking-proxy.cc:113
          #7  0x0000000000865d81 in MagmaKVStore::deinitialize (this=0x7f9ec016b400)
              at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/kvstore/magma-kvstore/magma-kvstore.cc:734
          #8  0x000000000081a586 in operator() (shard=..., __closure=<synthetic pointer>)
              at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/kvshard.h:72
          #9  forEachShard<EPBucket::deinitialize()::<lambda(KVShard&)> > (p=..., this=<optimized out>)
              at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/vbucketmap.h:148
          #10 EPBucket::deinitialize (this=<optimized out>) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/ep_bucket.cc:333
          #11 0x00000000006a30d2 in EventuallyPersistentEngine::destroyInner (this=0x7f9ec00f8000, force=<optimized out>)
              at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/ep_engine.cc:2256
          #12 0x00000000006a3162 in EventuallyPersistentEngine::destroy(bool) ()
              at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/ep_engine.cc:207
          #13 0x00000000005c4eca in operator() (engine=<optimized out>, this=<optimized out>)
              at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/include/memcached/engine.h:793
          #14 reset (__p=0x0, this=<optimized out>) at /opt/gcc-10.2.0/include/c++/10.2.0/bits/unique_ptr.h:182
          #15 reset (__p=0x0, this=<optimized out>) at /opt/gcc-10.2.0/include/c++/10.2.0/bits/unique_ptr.h:456
          #16 destroyEngine (force=true, this=<optimized out>) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/daemon/buckets.cc:62
          #17 BucketManager::destroy(Cookie*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool) ()
              at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/daemon/buckets.cc:436
          #18 0x000000000062cbbe in operator() (__closure=0x7f9e07b91420) at /opt/gcc-10.2.0/include/c++/10.2.0/bits/char_traits.h:322
          #19 __invoke_impl<void, CreateRemoveBucketCommandContext::remove()::<lambda()>&> (__f=...) at /opt/gcc-10.2.0/include/c++/10.2.0/bits/invoke.h:60
          #20 __invoke_r<void, CreateRemoveBucketCommandContext::remove()::<lambda()>&> (__fn=...) at /opt/gcc-10.2.0/include/c++/10.2.0/bits/invoke.h:110
          #21 std::_Function_handler<void (), CreateRemoveBucketCommandContext::remove()::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
              at /opt/gcc-10.2.0/include/c++/10.2.0/bits/std_function.h:291
          #22 0x000000000062b204 in operator() (this=0x7f9eb81b1608) at /opt/gcc-10.2.0/include/c++/10.2.0/bits/std_function.h:248
          #23 OneShotTask::run (this=0x7f9eb81b1590) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/daemon/one_shot_task.h:50
          #24 0x0000000000a9fb89 in GlobalTask::execute(std::basic_string_view<char, std::char_traits<char> >) ()
              at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/executor/globaltask.cc:98
          #25 0x0000000000a9927a in FollyExecutorPool::TaskProxy::scheduleViaCPUPool()::{lambda()#2}::operator()() const (__closure=0x7f9e53fed650)
              at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/executor/folly_executorpool.cc:309
          #26 0x0000000000aa0f3e in operator() (this=0x7f9e53fed650)
              at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/executor/cancellable_cpu_executor.cc:42
          #27 CancellableCPUExecutor::add(GlobalTask*, folly::Function<void ()>)::{lambda()#1}::operator()() const ()
              at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/executor/cancellable_cpu_executor.cc:42
          #28 0x0000000000bf5a40 in operator() (this=0x7f9e53fed840)
              at /home/couchbase/jenkins/workspace/cbdeps-platform-build-old/deps/packages/build/folly/folly-prefix/src/folly/folly/Function.h:416
          #29 folly::ThreadPoolExecutor::runTask(std::shared_ptr<folly::ThreadPoolExecutor::Thread> const&, folly::ThreadPoolExecutor::Task&&) (this=this@entry=0x7f9ec3f51d00, 
              thread=..., task=task@entry=<unknown type in /usr/lib/debug/opt/couchbase/bin/memcached-7.1.0-2440.x86_64.debug, CU 0xa389644, DIE 0xa40d579>)
              at /home/couchbase/jenkins/workspace/cbdeps-platform-build-old/deps/packages/build/folly/folly-prefix/src/folly/folly/executors/ThreadPoolExecutor.cpp:97
          #30 0x0000000000be04da in folly::CPUThreadPoolExecutor::threadRun (this=0x7f9ec3f51d00, thread=...)
              at /home/couchbase/jenkins/workspace/cbdeps-platform-build-old/deps/packages/build/folly/folly-prefix/src/folly/folly/executors/CPUThreadPoolExecutor.cpp:265
          #31 0x0000000000bf89f9 in __invoke_impl<void, void (folly::ThreadPoolExecutor::*&)(std::shared_ptr<folly::ThreadPoolExecutor::Thread>), folly::ThreadPoolExecutor*&, std::shared_ptr<folly::ThreadPoolExecutor::Thread>&> (__t=<optimized out>, __f=<optimized out>) at /usr/local/include/c++/7.3.0/bits/invoke.h:73
          #32 __invoke<void (folly::ThreadPoolExecutor::*&)(std::shared_ptr<folly::ThreadPoolExecutor::Thread>), folly::ThreadPoolExecutor*&, std::shared_ptr<folly::ThreadPoolExecutor::Thread>&> (__fn=<optimized out>) at /usr/local/include/c++/7.3.0/bits/invoke.h:95
          #33 __call<void, 0, 1> (__args=<optimized out>, this=<optimized out>) at /usr/local/include/c++/7.3.0/functional:467
          #34 operator()<> (this=<optimized out>) at /usr/local/include/c++/7.3.0/functional:551
          #35 folly::detail::function::FunctionTraits<void ()>::callBig<std::_Bind<void (folly::ThreadPoolExecutor::*(folly::ThreadPoolExecutor*, std::shared_ptr<folly::ThreadPoolExecutor::Thread>))(std::shared_ptr<folly::ThreadPoolExecutor::Thread>)> >(folly::detail::function::Data&) (p=...)
              at /home/couchbase/jenkins/workspace/cbdeps-platform-build-old/deps/packages/build/folly/folly-prefix/src/folly/folly/Function.h:401
          #36 0x0000000000a98f74 in operator() (this=0x7f9ec43e1d40) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/executor/folly_executorpool.cc:49
          #37 operator() (__closure=0x7f9ec43e1d40) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/executor/folly_executorpool.cc:49
          #38 folly::detail::function::FunctionTraits<void ()>::callBig<CBRegisteredThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}>(folly::detail::function::Data&) (p=...) at /home/couchbase/jenkins/workspace/couchbase-server-unix/server_build/tlm/deps/folly.exploded/include/folly/Function.h:401
          #39 0x00007f9ec5eadd40 in execute_native_thread_routine () at /tmp/deploy/objdir/../gcc-10.2.0/libstdc++-v3/src/c++11/thread.cc:80
          #40 0x00007f9ec7cb5ea5 in start_thread () from /lib64/libpthread.so.0
          #41 0x00007f9ec55f69fd in clone () from /lib64/libc.so.6
          

          void EPBucket::deinitialize() {
              stopFlusher();
              stopBgFetcher();
           
              KVBucket::deinitialize();
           
              // Perform a snapshot of the stats before shutting down so we can
              // persist the type of shutdown (stats.forceShutdown), and consequently
              // on the next warmup can determine if there was a clean shutdown - see
              // Warmup::cleanShutdown. This must be called after KVBucket::deinitialize()
              // to prevent a race condition with the StatSnap task as the deinitialize
              // function will cancel all running tasks.
              snapshotStats(true /*shuttingDown*/);
           
              // Now that we've stopped all of our tasks, stop any tasks the storage
              // layer may have created.
              vbMap.forEachShard([](KVShard& shard) {
                  shard.getRWUnderlying()->deinitialize();
              });
          }
          

          void MagmaKVStore::deinitialize() {
              logger->info("MagmaKVStore: {} deinitializing", configuration.getShardId());
           
              // Close shuts down all of the magma background threads (compaction is the
              // one that we care about here). The compaction callbacks require the magma
              // instance to exist so we must do this before we reset it.
              magma->Close();
           
              // Flusher should have already been stopped so it should be safe to destroy
              // the magma instance now
              magma.reset();
           
              logger->info("MagmaKVStore: {} deinitialized", configuration.getShardId());
          }
          

          void Magma::Impl::Close() {
              if (opened) {
                  LOG_INFO("Closing magma db (thread pool refcount {})",
                           threadPool.use_count());
                  heartbeatWg.Deactivate();
                  heartbeatToken.Cancel();
                  if (threadPool) {
                      threadPool->CancelTasks(threadPoolGroupId);
                  }
           
                  heartbeatWg.Wait();
           
                  kvstores.Shutdown(cfg.CheckpointOnShutdown, wal.get());
           
                  kvstoreWg.Wait();                      <--- !!
                  opened = false;
                  LOG_INFO("Completed closing magma db");
              }
              {
                  ExecutionEnvGuard envGuard(
                          ExecutionEnvRegistry::GetDefaultExecutionEnv());
                  threadPool.reset();
              }
          }
          

          That Wait() blocks on a condition variable internal to magma's WaitGroup; these are its semantics:

          KVStore::KVStore(KVStoreConfig& config,
                           Magma::KVStoreID kvID,
                           Magma::KVStoreRevision kvsRev)
              : cfg(config),
                kvID(kvID),
                kvsRev(kvsRev),
                purgeChecker(cfg.GetExpiryFragThreshold(), cfg.GetTimeNow),
                remover(config.GetConfig().FS,
                        config.GetThreadPool(),
                        config.GetThreadPoolGroupId(),
                        config.GetConfig().ExecutionEnv,
                        fileRemoverWg) {
          ..
              cfg.GetWaitGroup().Add(1);
          }
          

          KVStore::~KVStore() {
              // Wait for any pending flushes
              flusherWg.Deactivate();
              flusherWg.Wait();
           
              currSnapshot.reset();
              currDiskSnapshot.reset();
              keyIndex.reset();
              seqIndex.reset();
              localIndex.reset();
           
              fileRemoverWg.Deactivate();
              fileRemoverWg.Wait();
           
              {
                  auto cb = destroyCallback.lock();
                  if (*cb) {
                      LOG_INFO("KVStore::~KVStore Destroying {}", GetPath());
                      (*cb)(GetPath());
                  }
              }
           
              cfg.GetWaitGroup().Add(-1);
          }
          

          So it seems that we haven't reached the end of ~KVStore yet for at least one kvstore.
          Note that ~KVStore is in turn potentially stuck waiting for the completion of other operations, such as a flush or file removal.
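
          For reference, a minimal model of those Add()/Wait() semantics. This is a hypothetical sketch, not magma's actual WaitGroup implementation; it only illustrates why a Wait() on this group cannot return until every KVStore that called Add(1) in its constructor has called Add(-1) in its destructor.

          #include <condition_variable>
          #include <mutex>
          #include <thread>

          // Hypothetical stand-in for magma::WaitGroup, inferred from the KVStore
          // ctor/dtor quoted above; not the actual implementation.
          class WaitGroup {
          public:
              void Add(int delta) {
                  std::lock_guard<std::mutex> guard(mutex);
                  count += delta;
                  if (count == 0) {
                      cv.notify_all();
                  }
              }

              // Blocks on an internal condition variable until the count drops back
              // to zero, i.e. until every KVStore created against this config has
              // run its destructor (the Add(-1) above).
              void Wait() {
                  std::unique_lock<std::mutex> lock(mutex);
                  cv.wait(lock, [this] { return count == 0; });
              }

          private:
              std::mutex mutex;
              std::condition_variable cv;
              int count = 0;
          };

          int main() {
              WaitGroup wg;
              wg.Add(1);                              // KVStore constructor
              std::thread t([&wg] { wg.Add(-1); });   // ~KVStore running elsewhere
              wg.Wait();                              // Close()-style wait unblocks only after Add(-1)
              t.join();
          }

          Under this reading (assuming cfg.GetWaitGroup() hands back the same kvstoreWg that Close() waits on), kvstoreWg.Wait() in Magma::Impl::Close() only returns once ~KVStore has executed its final Add(-1) for every kvstore in the shard, so any outstanding shared_ptr to a kvstore keeps Close() blocked, which is exactly what the backtrace shows.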

          dfinlay Dave Finlay added a comment -

          Balakumaran Gopal, Ritam Sharma: my guess is the Magma folks may need to take a look at this live cluster. Please don't shut down the cluster for a while yet, until they get a chance to look at it.

          Balakumaran.Gopal Balakumaran Gopal added a comment - Sure Dave Finlay
          paolo.cocchi Paolo Cocchi made changes -
          Assignee Paolo Cocchi [ paolo.cocchi ] → Balakumaran Gopal [ balakumaran.gopal ]
          sarath Sarath Lakshman added a comment - - edited

          Thanks Paolo Cocchi.

          Looking at the shard, there are 65 kvstores and all of them have logged a shutdown message.

          # grep 'KVStore::Shutdown' /opt/couchbase/var/lib/couchbase/logs/memcached.log.000021.txt | grep bucket2 | grep magma_3  |wc -l
          65
          

          But the destructor has not executed for some of the kvstores, as their shared_ptrs are still held by somebody.

          Is there a possibility that kv-engine is still holding a seqtree iterator created through the NewSeqIterator API?

          owend Daniel Owen made changes -
          Assignee Balakumaran Gopal [ balakumaran.gopal ] → Srinath Duvuru [ srinath.duvuru ]
          sarath Sarath Lakshman made changes -
          Assignee Srinath Duvuru [ srinath.duvuru ] → Sarath Lakshman [ sarath ]
          owend Daniel Owen made changes -
          Sprint KV Post-Neo [ 2016 ]
          owend Daniel Owen made changes -
          Sprint KV Post-Neo [ 2016 ] → KV 2022-Feb [ 2002 ]
          owend Daniel Owen made changes -
          Rank Ranked higher
          owend Daniel Owen made changes -
          Sprint KV 2022-Feb [ 2002 ] → KV 2022-Feb, KV Post-Neo 2 [ 2002, 2050 ]
          paolo.cocchi Paolo Cocchi added a comment - - edited

          Hi Sarath Lakshman, as per offline chat, the backtrace doesn't show any backfill executing on any AuxIO thread, so that isn't expected to hold any magma::KVStore reference. Thanks

          owend Daniel Owen made changes -
          Component/s couchbase-bucket [ 10173 ]
          sarath Sarath Lakshman added a comment - - edited

          As part of MB-49512, we introduced cancellable tasks. During db shutdown, we cancel the tasks that are queued but have not yet been executed by the thread pool. This requires a cancel method to be set up for each task. When a purger task is scheduled, it holds a kvstore shared_ptr as part of the purger task context. Since no cancel method was set up for the purger task, the context is not reset when the task is cancelled during db shutdown. Magma::Close() then waits for all live kvstore refcounts to drop to zero.

          In this particular issue, a purger task was queued at the time of db shutdown and it ran into the above problem.
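
          To make the failure mode concrete, here is a purely illustrative sketch (Task, KVStoreStub and the cancel hook below are hypothetical names, not magma's actual API): a queued task whose closure owns a shared_ptr to a kvstore must drop that reference in its cancel hook; otherwise the kvstore outlives shutdown and the refcount wait in Close() never completes.

          #include <cassert>
          #include <functional>
          #include <memory>
          #include <vector>

          struct KVStoreStub {};  // stands in for a magma kvstore; purely illustrative

          // Illustrative cancellable task: run() does the work, cancel() must release
          // whatever context run() captured.
          struct Task {
              std::function<void()> run;
              std::function<void()> cancel;
          };

          int main() {
              std::vector<Task> queue;  // tasks scheduled but not yet executed
              std::weak_ptr<KVStoreStub> observer;

              {
                  auto kvstore = std::make_shared<KVStoreStub>();
                  observer = kvstore;

                  // Purger-like task: its context owns a shared_ptr to the kvstore.
                  auto ctx = std::make_shared<std::shared_ptr<KVStoreStub>>(kvstore);
                  queue.push_back(Task{[ctx] { /* purge using **ctx */ },
                                       [ctx] { ctx->reset(); }});  // fix: cancel drops the ref
              }

              // Shutdown: cancel queued tasks instead of running them. The task objects
              // (and their closures) may be retained for a while by the executor, so the
              // cancel hook is the only point where the captured kvstore reference is
              // guaranteed to be released.
              for (auto& task : queue) {
                  if (task.cancel) {
                      task.cancel();
                  }
              }

              // With the cancel hook the kvstore is destroyed and a Close()-style wait on
              // the kvstore refcount would return. Without it, the retained closure keeps
              // the kvstore alive and the wait blocks, as in the backtrace above.
              assert(observer.expired());
              return 0;
          }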

          srinath.duvuru Srinath Duvuru made changes -
          Due Date 10/Mar/22

          build-team Couchbase Build Team added a comment -

          Build couchbase-server-7.1.0-2455 contains magma commit e499a90 with commit message:
          MB-51319 lsm: Make cancel method mandatory for scheduled tasks

          build-team Couchbase Build Team added a comment -

          Build couchbase-server-7.1.0-2455 contains magma commit 4ef992f with commit message:
          MB-51319 tests: Add unit test for db shutdown during purging

          build-team Couchbase Build Team added a comment -

          Build couchbase-server-7.1.0-2455 contains magma commit 42dbe00 with commit message:
          MB-51319 lsm: Fix setting up cancel method for purger task
          sarath Sarath Lakshman made changes -
          Resolution Fixed [ 1 ]
          Status Open [ 1 ] → Resolved [ 5 ]

          build-team Couchbase Build Team added a comment -

          Build couchbase-server-7.2.0-1011 contains magma commit e499a90 with commit message:
          MB-51319 lsm: Make cancel method mandatory for scheduled tasks

          build-team Couchbase Build Team added a comment -

          Build couchbase-server-7.2.0-1011 contains magma commit 4ef992f with commit message:
          MB-51319 tests: Add unit test for db shutdown during purging

          build-team Couchbase Build Team added a comment -

          Build couchbase-server-7.2.0-1011 contains magma commit 42dbe00 with commit message:
          MB-51319 lsm: Fix setting up cancel method for purger task
          ritam.sharma Ritam Sharma made changes -
          Assignee Sarath Lakshman [ sarath ] → Balakumaran Gopal [ balakumaran.gopal ]

          Balakumaran.Gopal Balakumaran Gopal added a comment -

          Two runs are currently in progress. Even so, the weekly run over the weekend, where we run a couple of thousand jobs, would be a better way to validate this.

          http://172.23.121.80/job/temp_rebalance_magma_1/35/console
          http://172.23.121.80/job/temp_rebalance_magma/292/console

          Balakumaran.Gopal Balakumaran Gopal added a comment -

          Did not notice any occurrence of this issue in the runs we had on 7.1.0-2475.
          Logs: http://greenboard.sc.couchbase.com/#!/server/7.1.0/2475?platforms=CENTOS&bucket_storage=COUCH&features=MAGMA

          Marking this closed.
          Balakumaran.Gopal Balakumaran Gopal made changes -
          Status Resolved [ 5 ] → Closed [ 6 ]

          People

            Balakumaran.Gopal Balakumaran Gopal
            Balakumaran.Gopal Balakumaran Gopal
            Votes: 0
            Watchers: 11


              Gerrit Reviews

                There are no open Gerrit changes
