Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-45877

[windows] Rebalance failures with error buckets_shutdown_wait_failed

    XMLWordPrintable

Details

    • Untriaged
    • 1
    • No

    Description

      7.0.0-4960

      6 GB RAM and 6 core boxes

      Test:
      ./testrunner -i /tmp/win10-gsi.ini -p get-cbcollect-info=True -t clitest.collectinfotest.CollectinfoTests.collectinfo_test,sasl_buckets=1,standard_buckets=1,GROUP=P0

      [2021-04-22 19:32:07,487] - [rest_client:1873] ERROR -

      {'status': 'none', 'errorMessage': 'Rebalance failed. See logs for detailed reason. You can try again.'}

      - rebalance failed
      [2021-04-22 19:32:10,762] - [rest_client:3804] INFO - Latest logs from UI on 172.23.106.249:
      [2021-04-22 19:32:10,763] - [rest_client:3805] ERROR - {'node': 'ns_1@172.23.106.249', 'type': 'critical', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1619145120056, 'shortText': 'message', 'text': 'Rebalance exited with reason {buckets_shutdown_wait_failed,\n [{\'ns_1@172.23.106.249\',\n {\'EXIT\',\n

      {old_buckets_shutdown_wait_failed,\n ["standard_bucket0"]}

      }}]}.\nRebalance Operation Id = 9bc9227c38dc7da499ddf0916205a12e', 'serverTime': '2021-04-22T19:32:00.056Z'}
      [2021-04-22 19:32:10,763] - [rest_client:3805] ERROR - {'node': 'ns_1@172.23.106.249', 'type': 'critical', 'code': 0, 'module': 'ns_rebalancer', 'tstamp': 1619145120054, 'shortText': 'message', 'text': 'Failed to wait deletion of some buckets on some nodes: [{\'ns_1@172.23.106.249\',\n {\'EXIT\',\n

      {old_buckets_shutdown_wait_failed,\n ["standard_bucket0"]}

      }}]\n', 'serverTime': '2021-04-22T19:32:00.054Z'}
      [2021-04-22 19:32:10,763] - [rest_client:3805] ERROR -

      {'node': 'ns_1@172.23.106.249', 'type': 'info', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1619145060052, 'shortText': 'message', 'text': "Starting rebalance, KeepNodes = ['ns_1@172.23.106.249'], EjectNodes = ['ns_1@172.23.136.127',\n 'ns_1@172.23.136.129',\n 'ns_1@172.23.136.252',\n 'ns_1@172.23.136.253'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 9bc9227c38dc7da499ddf0916205a12e", 'serverTime': '2021-04-22T19:31:00.052Z'}

      [2021-04-22 19:32:10,763] - [rest_client:3805] ERROR -

      {'node': 'ns_1@172.23.106.249', 'type': 'warning', 'code': 102, 'module': 'menelaus_web', 'tstamp': 1619145060049, 'shortText': 'client-side error report', 'text': 'Client-side error-report for user "Administrator" on node \'ns_1@172.23.106.249\':\nUser-Agent:Python-httplib2/0.13.1 (gzip)\nStarting rebalance from test, ejected nodes [\'ns_1@172.23.136.127\', \'ns_1@172.23.136.129\', \'ns_1@172.23.136.252\', \'ns_1@172.23.136.253\']', 'serverTime': '2021-04-22T19:31:00.049Z'}

      [2021-04-22 19:32:10,763] - [rest_client:3805] ERROR -

      {'node': 'ns_1@172.23.106.249', 'type': 'info', 'code': 0, 'module': 'ns_memcached', 'tstamp': 1619145024855, 'shortText': 'message', 'text': 'Shutting down bucket "standard_bucket0" on \'ns_1@172.23.106.249\' for deletion', 'serverTime': '2021-04-22T19:30:24.855Z'}

      [2021-04-22 19:32:10,763] - [rest_client:3805] ERROR -

      {'node': 'ns_1@172.23.106.249', 'type': 'info', 'code': 11, 'module': 'menelaus_web', 'tstamp': 1619145018262, 'shortText': 'message', 'text': 'Deleted bucket "default"\n', 'serverTime': '2021-04-22T19:30:18.262Z'}

      [2021-04-22 19:32:10,763] - [rest_client:3805] ERROR -

      {'node': 'ns_1@172.23.106.249', 'type': 'info', 'code': 0, 'module': 'auto_failover', 'tstamp': 1619144999705, 'shortText': 'message', 'text': 'Enabled auto-failover with timeout 120 and max count 1 (repeated 1 times, last seen 13.904235 secs ago)', 'serverTime': '2021-04-22T19:29:59.705Z'}

      [2021-04-22 19:32:10,763] - [rest_client:3805] ERROR -

      {'node': 'ns_1@172.23.106.249', 'type': 'info', 'code': 0, 'module': 'ns_memcached', 'tstamp': 1619144989195, 'shortText': 'message', 'text': 'Shutting down bucket "default" on \'ns_1@172.23.106.249\' for deletion', 'serverTime': '2021-04-22T19:29:49.195Z'}

      [2021-04-22 19:32:10,763] - [rest_client:3805] ERROR -

      {'node': 'ns_1@172.23.106.249', 'type': 'info', 'code': 0, 'module': 'auto_failover', 'tstamp': 1619144982686, 'shortText': 'message', 'text': 'Enabled auto-failover with timeout 120 and max count 1', 'serverTime': '2021-04-22T19:29:42.686Z'}

      [2021-04-22 19:32:10,763] - [rest_client:3805] ERROR -

      {'node': 'ns_1@172.23.106.249', 'type': 'info', 'code': 0, 'module': 'ns_memcached', 'tstamp': 1619144940096, 'shortText': 'message', 'text': 'Shutting down bucket "bucket0" on \'ns_1@172.23.106.249\' for deletion', 'serverTime': '2021-04-22T19:29:00.096Z'}

      Cluster instance shutdown with force

      Attaching logs

      Attachments

        1. 172.23.106.249-20210422-1932-diag.zip
          12.34 MB
        2. 172.23.136.127-20210422-1932-diag.zip
          1.75 MB
        3. 172.23.136.129-20210422-1932-diag.zip
          1.86 MB
        4. 172.23.136.250-20210422-1932-diag.zip
          1.32 MB
        5. 172.23.136.252-20210422-1932-diag.zip
          1.99 MB
        6. 172.23.136.253-20210422-1932-diag.zip
          2.05 MB
        7. screenshot-1.png
          screenshot-1.png
          48 kB
        8. screenshot-2.png
          screenshot-2.png
          32 kB
        9. screenshot-3.png
          screenshot-3.png
          25 kB
        10. screenshot-4.png
          screenshot-4.png
          33 kB
        11. screenshot-5.png
          screenshot-5.png
          128 kB
        12. sys_cpu_utilization_rate.png
          sys_cpu_utilization_rate.png
          30 kB

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            owend Daniel Owen added a comment -

            Taking a look at ns_server.debug.log on .249

            [user:info,2021-04-22T19:30:24.855-07:00,ns_1@172.23.106.249:ns_memcached-standard_bucket0<0.10204.25>:ns_memcached:do_terminate:809]Shutting down bucket "standard_bucket0" on 'ns_1@172.23.106.249' for deletion
            [ns_server:info,2021-04-22T19:30:24.859-07:00,ns_1@172.23.106.249:ns_memcached-standard_bucket0<0.10204.25>:ns_memcached:delete_bucket:838]Deleting bucket "standard_bucket0" from memcached (force = true)
             
            [ns_server:warn,2021-04-22T19:30:54.344-07:00,ns_1@172.23.106.249:<0.3677.18>:ns_orchestrator:idle:641]Nodes ['ns_1@172.23.106.249'] failed to delete bucket "standard_bucket0" within expected time.
             
            [rebalance:error,2021-04-22T19:32:00.054-07:00,ns_1@172.23.106.249:<0.27456.34>:ns_rebalancer:do_wait_buckets_shutdown:226]Failed to wait deletion of some buckets on some nodes: [{'ns_1@172.23.106.249',
                                                                     {'EXIT',
                                                                      {old_buckets_shutdown_wait_failed,
                                                                       ["standard_bucket0"]}}}]
            

            From memcached POV we finished deletion at 19:30:25

            2021-04-22T19:30:24.882004-07:00 INFO 2416: Delete bucket [standard_bucket0]. Notifying engine
            2021-04-22T19:30:24.903240-07:00 INFO 2416: Delete bucket [standard_bucket0]. Engine ready for shutdown
            2021-04-22T19:30:24.903263-07:00 INFO 2416: Delete bucket [standard_bucket0]. Wait for 18 clients to disconnect
            2021-04-22T19:30:24.906028-07:00 INFO 2416: Delete bucket [standard_bucket0]. Shut down the bucket
            2021-04-22T19:30:25.108134-07:00 INFO (standard_bucket0) Deleted KvBucket.
            2021-04-22T19:30:25.176408-07:00 INFO 2416: Delete bucket [standard_bucket0]. Clean up allocated resources 
            2021-04-22T19:30:25.176456-07:00 INFO 2416: Delete bucket [standard_bucket0] complete
            

            However ns_server complains that the delete failed after a 30 second time period. Time of interest is 2021-04-22T19:30:24 to 2021-04-22T19:30:54
            Interestingly we see repeated Attempt to access non existent bucket e.g.

            [ns_server:debug,2021-04-22T19:30:28.891-07:00,ns_1@172.23.106.249:<0.22365.34>:menelaus_web:check_bucket_uuid:1164]Attempt to access non existent bucket "standard_bucket0"
            

            Also see

            [ns_server:error,2021-04-22T17:26:02.559-07:00,ns_1@172.23.106.249:<0.8301.19>:timeout_diag_logger:do_diag:121]Got timeout {slow_bucket_stop,"travel-sample"}
            [ns_server:error,2021-04-22T18:12:46.186-07:00,ns_1@172.23.106.249:<0.25148.24>:timeout_diag_logger:do_diag:121]Got timeout {slow_bucket_stop,"default"}
            [ns_server:error,2021-04-22T19:29:46.013-07:00,ns_1@172.23.106.249:<0.24140.34>:timeout_diag_logger:do_diag:121]Got timeout {slow_bucket_stop,"bucket0"}
            [ns_server:error,2021-04-22T19:30:54.526-07:00,ns_1@172.23.106.249:<0.27057.34>:timeout_diag_logger:do_diag:121]Got timeout {slow_bucket_stop,"standard_bucket0"}
            

            cpu_utilization appears to be pegged at 100%, we are also seeing some stats blips
            Therefore I'm inclined to put it down to the issues seeing in MB-45061. However will let ns_server make the call...

            owend Daniel Owen added a comment - Taking a look at ns_server.debug.log on .249 [user:info,2021-04-22T19:30:24.855-07:00,ns_1@172.23.106.249:ns_memcached-standard_bucket0<0.10204.25>:ns_memcached:do_terminate:809]Shutting down bucket "standard_bucket0" on 'ns_1@172.23.106.249' for deletion [ns_server:info,2021-04-22T19:30:24.859-07:00,ns_1@172.23.106.249:ns_memcached-standard_bucket0<0.10204.25>:ns_memcached:delete_bucket:838]Deleting bucket "standard_bucket0" from memcached (force = true)   [ns_server:warn,2021-04-22T19:30:54.344-07:00,ns_1@172.23.106.249:<0.3677.18>:ns_orchestrator:idle:641]Nodes ['ns_1@172.23.106.249'] failed to delete bucket "standard_bucket0" within expected time.   [rebalance:error,2021-04-22T19:32:00.054-07:00,ns_1@172.23.106.249:<0.27456.34>:ns_rebalancer:do_wait_buckets_shutdown:226]Failed to wait deletion of some buckets on some nodes: [{'ns_1@172.23.106.249', {'EXIT', {old_buckets_shutdown_wait_failed, ["standard_bucket0"]}}}] From memcached POV we finished deletion at 19:30:25 2021-04-22T19:30:24.882004-07:00 INFO 2416: Delete bucket [standard_bucket0]. Notifying engine 2021-04-22T19:30:24.903240-07:00 INFO 2416: Delete bucket [standard_bucket0]. Engine ready for shutdown 2021-04-22T19:30:24.903263-07:00 INFO 2416: Delete bucket [standard_bucket0]. Wait for 18 clients to disconnect 2021-04-22T19:30:24.906028-07:00 INFO 2416: Delete bucket [standard_bucket0]. Shut down the bucket 2021-04-22T19:30:25.108134-07:00 INFO (standard_bucket0) Deleted KvBucket. 2021-04-22T19:30:25.176408-07:00 INFO 2416: Delete bucket [standard_bucket0]. Clean up allocated resources 2021-04-22T19:30:25.176456-07:00 INFO 2416: Delete bucket [standard_bucket0] complete However ns_server complains that the delete failed after a 30 second time period. Time of interest is 2021-04-22T19:30:24 to 2021-04-22T19:30:54 Interestingly we see repeated Attempt to access non existent bucket e.g. [ns_server:debug,2021-04-22T19:30:28.891-07:00,ns_1@172.23.106.249:<0.22365.34>:menelaus_web:check_bucket_uuid:1164]Attempt to access non existent bucket "standard_bucket0" Also see [ns_server:error,2021-04-22T17:26:02.559-07:00,ns_1@172.23.106.249:<0.8301.19>:timeout_diag_logger:do_diag:121]Got timeout {slow_bucket_stop,"travel-sample"} [ns_server:error,2021-04-22T18:12:46.186-07:00,ns_1@172.23.106.249:<0.25148.24>:timeout_diag_logger:do_diag:121]Got timeout {slow_bucket_stop,"default"} [ns_server:error,2021-04-22T19:29:46.013-07:00,ns_1@172.23.106.249:<0.24140.34>:timeout_diag_logger:do_diag:121]Got timeout {slow_bucket_stop,"bucket0"} [ns_server:error,2021-04-22T19:30:54.526-07:00,ns_1@172.23.106.249:<0.27057.34>:timeout_diag_logger:do_diag:121]Got timeout {slow_bucket_stop,"standard_bucket0"} cpu_utilization appears to be pegged at 100%, we are also seeing some stats blips Therefore I'm inclined to put it down to the issues seeing in MB-45061 . However will let ns_server make the call...

            Looking at ns_server.debug.log we see the bucket deletion request come in and
            we start to tear things down.

            [ns_server:debug,2021-04-22T19:30:23.189-07:00,ns_1@172.23.106.249:<0.25552.34>:menelaus_web_buckets:handle_bucket_delete:421]Received request to delete bucket "standard_bucket0"; will attempt to delete
            [ns_server:debug,2021-04-22T19:30:23.191-07:00,ns_1@172.23.106.249:<0.3677.18>:menelaus_users:cleanup_bucket_roles:774]Delete all roles for bucket "standard_bucket0"
            [ns_server:debug,2021-04-22T19:30:23.413-07:00,ns_1@172.23.106.249:ns_janitor_server<0.3675.18>:ns_janitor_server:handle_call:101]Deleted bucket "standard_bucket0" from janitor_requests
             
            [ns_server:debug,2021-04-22T19:30:24.209-07:00,ns_1@172.23.106.249:ns_bucket_worker<0.3502.18>:ns_bucket_worker:stop_one_bucket:108]Stopping child for dead bucket: "standard_bucket0"
            

            But then ns_memcached gets an exception when it does an ensure_bucket which gets
            and exception as the bucket is gone (from the ns_bucket module perspective).
            This exception has been resolved via MB-45682.

            ns_server:debug,2021-04-22T19:30:24.544-07:00,ns_1@172.23.106.249:<0.25660.34>:ns_memcached_sockets_pool:executing_on_socket:90]Exception while executing on socket {{127,0,0,1},50142}: {error,
                                                                      {badmatch,
                                                                       {error,
                                                                        {badmatch,
                                                                         not_present}}},
                                                                      [{ns_memcached,
                                                                        '-perform_very_long_call_with_timing/3-fun-0-',
                                                                        3,
                                                                        [{file,
                                                                          "src/ns_memcached.erl"},
                                                                         {line,867}]},
                                                                       {ns_memcached,
                                                                        '-perform_very_long_call/3-fun-0-',
                                                                        2,
                                                                        [{file,
                                                                          "src/ns_memcached.erl"},
                                                                         {line,323}]},
                                                                       {ns_memcached_sockets_pool,
                                                                        '-executing_on_socket/3-fun-0-',
                                                                        3,
                                                                        [{file,
                                                                          "src/ns_memcached_sockets_pool.erl"},
                                                                         {line,86}]},
                                                                       {async,
                                                                        '-async_init/4-fun-1-',
                                                                        3,
                                                                        [{file,
                                                                          "src/async.erl"},
                                                                         {line,191}]}]}
            

            and here's the associated crash

            =========================CRASH REPORT=========================
              crasher:
                initial call: work_queue:init/1
                pid: <0.10205.25>
                registered_name: []
                exception error: no match of right hand side value
                                 {error,{badmatch,not_present}}
                  in function  ns_memcached:'-perform_very_long_call_with_timing/3-fun-0-'/3 (src/ns_memcached.erl, line 867)
                  in call from ns_memcached:'-perform_very_long_call/3-fun-0-'/2 (src/ns_memcached.erl, line 323)
                  in call from ns_memcached_sockets_pool:'-executing_on_socket/3-fun-0-'/3 (src/ns_memcached_sockets_pool.erl, line 86)
                  in call from async:'-async_init/4-fun-1-'/3 (src/async.erl, line 191)
                ancestors: ['ns_memcached-standard_bucket0',<0.9149.25>,
                              'single_bucket_kv_sup-standard_bucket0',ns_bucket_sup,
                              ns_bucket_worker_sup,ns_server_sup,ns_server_nodes_sup,
                              <0.2637.18>,ns_server_cluster_sup,root_sup,<0.137.0>]
                message_queue_len: 0
                messages: []
                links: [<0.10204.25>]
                dictionary: []
                trap_exit: false
                status: running
                heap_size: 987
                stack_size: 27
                reductions: 72065
              neighbours:
            

            We see ns_memcached report the deletion

            [user:info,2021-04-22T19:30:24.855-07:00,ns_1@172.23.106.249:ns_memcached-standard_bucket0<0.10204.25>:ns_memcached:do_terminate:809]Shutting down bucket "standard_bucket0" on 'ns_1@172.23.106.249' for deletion
            [ns_server:info,2021-04-22T19:30:24.859-07:00,ns_1@172.23.106.249:ns_memcached-standard_bucket0<0.10204.25>:ns_memcached:delete_bucket:838]Deleting bucket "standard_bucket0" from memcached (force = true)
            [ns_server:debug,2021-04-22T19:30:25.176-07:00,ns_1@172.23.106.249:ns_memcached-standard_bucket0<0.10204.25>:ns_memcached:delete_bucket:849]Proceeding into vbuckets dbs deletions
            

            And the notice of the failure to delete on time

            [ns_server:warn,2021-04-22T19:30:54.344-07:00,ns_1@172.23.106.249:<0.3677.18>:ns_orchestrator:idle:641]Nodes ['ns_1@172.23.106.249'] failed to delete bucket "standard_bucket0" within expected time.
            [ns_server:debug,2021-04-22T19:30:54.406-07:00,ns_1@172.23.106.249:<0.25676.34>:ns_bucket_worker:stop_one_bucket:113]Observing slow bucket supervisor stop request for "standard_bucket0"
            [ns_server:error,2021-04-22T19:30:54.526-07:00,ns_1@172.23.106.249:<0.27057.34>:timeout_diag_logger:do_diag:121]Got timeout {slow_bucket_stop,"standard_bucket0"}
            Processes snapshot is:
             
            [ns_server:error,2021-04-22T19:30:54.527-07:00,ns_1@172.23.106.249:<0.27057.34>:timeout_diag_logger:do_diag:123]
            {<0.0.0>,
             [{backtrace,[<<"Program counter: 0x000002310e839c88 (init:boot_loop/2 + 56)">>,
                          <<"CP: 0x0000000000000000 (invalid)">>,<<>>,
                          <<"0x000002310f498950 Return addr 0x00007ffc69164398 (<terminate process normally>)">>,
                          <<"y(0)     []">>,
                          <<"(1)     {state,[{root,[<<\"C:/Program Files/Couchbase/Server\">>]},{progname,[<<\"erl\">>]},{home,[<<\"C:\\Windows\\system32\\co">>,
                          <<"y(2)     <0.9.0>">>,<<>>]},
              {messages,[]},
              {dictionary,[]},
              {registered_name,init},
              {status,waiting},
              {initial_call,{erl_init,start,2}},
              {error_handler,error_handler},
              {garbage_collection,[{max_heap_size,#{error_logger => true,kill => true,
                                                    size => 0}},
                                   {min_bin_vheap_size,46422},
                                   {min_heap_size,233},
                                   {fullsweep_after,512},
                                   {minor_gcs,12}]},
              {garbage_collection_info,[{old_heap_block_size,1598},
                                        {heap_block_size,2586},
                                        {mbuf_size,0},
                                        {recent_size,21},
                                        {stack_size,4},
                                        {old_heap_size,959},
                                        {heap_size,546},
                                        {bin_vheap_size,0},
                                        {bin_vheap_block_size,46422},
                                        {bin_old_vheap_size,0},
                                        {bin_old_vheap_block_size,46422}]},
              {links,[<0.10.0>,<0.42.0>,<0.44.0>,<0.9.0>]},
              {monitors,[]},
              {monitored_by,[]},
              {memory,34392},
              {message_queue_len,0},
              {reductions,27081},
              {trap_exit,true},
              {current_location,{init,boot_loop,2,[{file,"init.erl"},{line,331}]}}]}
            

            steve.watanabe Steve Watanabe added a comment - Looking at ns_server.debug.log we see the bucket deletion request come in and we start to tear things down. [ns_server:debug,2021-04-22T19:30:23.189-07:00,ns_1@172.23.106.249:<0.25552.34>:menelaus_web_buckets:handle_bucket_delete:421]Received request to delete bucket "standard_bucket0"; will attempt to delete [ns_server:debug,2021-04-22T19:30:23.191-07:00,ns_1@172.23.106.249:<0.3677.18>:menelaus_users:cleanup_bucket_roles:774]Delete all roles for bucket "standard_bucket0" [ns_server:debug,2021-04-22T19:30:23.413-07:00,ns_1@172.23.106.249:ns_janitor_server<0.3675.18>:ns_janitor_server:handle_call:101]Deleted bucket "standard_bucket0" from janitor_requests   [ns_server:debug,2021-04-22T19:30:24.209-07:00,ns_1@172.23.106.249:ns_bucket_worker<0.3502.18>:ns_bucket_worker:stop_one_bucket:108]Stopping child for dead bucket: "standard_bucket0" But then ns_memcached gets an exception when it does an ensure_bucket which gets and exception as the bucket is gone (from the ns_bucket module perspective). This exception has been resolved via MB-45682 . ns_server:debug,2021-04-22T19:30:24.544-07:00,ns_1@172.23.106.249:<0.25660.34>:ns_memcached_sockets_pool:executing_on_socket:90]Exception while executing on socket {{127,0,0,1},50142}: {error, {badmatch, {error, {badmatch, not_present}}}, [{ns_memcached, '-perform_very_long_call_with_timing/3-fun-0-', 3, [{file, "src/ns_memcached.erl"}, {line,867}]}, {ns_memcached, '-perform_very_long_call/3-fun-0-', 2, [{file, "src/ns_memcached.erl"}, {line,323}]}, {ns_memcached_sockets_pool, '-executing_on_socket/3-fun-0-', 3, [{file, "src/ns_memcached_sockets_pool.erl"}, {line,86}]}, {async, '-async_init/4-fun-1-', 3, [{file, "src/async.erl"}, {line,191}]}]} and here's the associated crash =========================CRASH REPORT========================= crasher: initial call: work_queue:init/1 pid: <0.10205.25> registered_name: [] exception error: no match of right hand side value {error,{badmatch,not_present}} in function ns_memcached:'-perform_very_long_call_with_timing/3-fun-0-'/3 (src/ns_memcached.erl, line 867) in call from ns_memcached:'-perform_very_long_call/3-fun-0-'/2 (src/ns_memcached.erl, line 323) in call from ns_memcached_sockets_pool:'-executing_on_socket/3-fun-0-'/3 (src/ns_memcached_sockets_pool.erl, line 86) in call from async:'-async_init/4-fun-1-'/3 (src/async.erl, line 191) ancestors: ['ns_memcached-standard_bucket0',<0.9149.25>, 'single_bucket_kv_sup-standard_bucket0',ns_bucket_sup, ns_bucket_worker_sup,ns_server_sup,ns_server_nodes_sup, <0.2637.18>,ns_server_cluster_sup,root_sup,<0.137.0>] message_queue_len: 0 messages: [] links: [<0.10204.25>] dictionary: [] trap_exit: false status: running heap_size: 987 stack_size: 27 reductions: 72065 neighbours: We see ns_memcached report the deletion [user:info,2021-04-22T19:30:24.855-07:00,ns_1@172.23.106.249:ns_memcached-standard_bucket0<0.10204.25>:ns_memcached:do_terminate:809]Shutting down bucket "standard_bucket0" on 'ns_1@172.23.106.249' for deletion [ns_server:info,2021-04-22T19:30:24.859-07:00,ns_1@172.23.106.249:ns_memcached-standard_bucket0<0.10204.25>:ns_memcached:delete_bucket:838]Deleting bucket "standard_bucket0" from memcached (force = true) [ns_server:debug,2021-04-22T19:30:25.176-07:00,ns_1@172.23.106.249:ns_memcached-standard_bucket0<0.10204.25>:ns_memcached:delete_bucket:849]Proceeding into vbuckets dbs deletions And the notice of the failure to delete on time [ns_server:warn,2021-04-22T19:30:54.344-07:00,ns_1@172.23.106.249:<0.3677.18>:ns_orchestrator:idle:641]Nodes ['ns_1@172.23.106.249'] failed to delete bucket "standard_bucket0" within expected time. [ns_server:debug,2021-04-22T19:30:54.406-07:00,ns_1@172.23.106.249:<0.25676.34>:ns_bucket_worker:stop_one_bucket:113]Observing slow bucket supervisor stop request for "standard_bucket0" [ns_server:error,2021-04-22T19:30:54.526-07:00,ns_1@172.23.106.249:<0.27057.34>:timeout_diag_logger:do_diag:121]Got timeout {slow_bucket_stop,"standard_bucket0"} Processes snapshot is:   [ns_server:error,2021-04-22T19:30:54.527-07:00,ns_1@172.23.106.249:<0.27057.34>:timeout_diag_logger:do_diag:123] {<0.0.0>, [{backtrace,[<<"Program counter: 0x000002310e839c88 (init:boot_loop/2 + 56)">>, <<"CP: 0x0000000000000000 (invalid)">>,<<>>, <<"0x000002310f498950 Return addr 0x00007ffc69164398 (<terminate process normally>)">>, <<"y(0) []">>, <<"(1) {state,[{root,[<<\"C:/Program Files/Couchbase/Server\">>]},{progname,[<<\"erl\">>]},{home,[<<\"C:\\Windows\\system32\\co">>, <<"y(2) <0.9.0>">>,<<>>]}, {messages,[]}, {dictionary,[]}, {registered_name,init}, {status,waiting}, {initial_call,{erl_init,start,2}}, {error_handler,error_handler}, {garbage_collection,[{max_heap_size,#{error_logger => true,kill => true, size => 0}}, {min_bin_vheap_size,46422}, {min_heap_size,233}, {fullsweep_after,512}, {minor_gcs,12}]}, {garbage_collection_info,[{old_heap_block_size,1598}, {heap_block_size,2586}, {mbuf_size,0}, {recent_size,21}, {stack_size,4}, {old_heap_size,959}, {heap_size,546}, {bin_vheap_size,0}, {bin_vheap_block_size,46422}, {bin_old_vheap_size,0}, {bin_old_vheap_block_size,46422}]}, {links,[<0.10.0>,<0.42.0>,<0.44.0>,<0.9.0>]}, {monitors,[]}, {monitored_by,[]}, {memory,34392}, {message_queue_len,0}, {reductions,27081}, {trap_exit,true}, {current_location,{init,boot_loop,2,[{file,"init.erl"},{line,331}]}}]}

            Arunkumar Senthilnathan A fix for the ns_memcached crashes was made via MB-45682 first in build 4991. Would you run the test on a build containing the fix to see if the issue persists. And if it does, please gather the logs.

            steve.watanabe Steve Watanabe added a comment - Arunkumar Senthilnathan  A fix for the ns_memcached crashes was made via MB-45682 first in build 4991. Would you run the test on a build containing the fix to see if the issue persists. And if it does, please gather the logs.
            dfinlay Dave Finlay added a comment -

            I decided to take a look at this one as I looked at one that was quite similar to this in the past: MB-44452 (and MB-41895).

            It just takes a long time to delete all the vbuckets. It starts here:

            [ns_server:info,2021-04-22T19:30:24.859-07:00,ns_1@172.23.106.249:ns_memcached-standard_bucket0<0.10204.25>:ns_memcached:delete_bucket:838]Deleting bucket "standard_bucket0" from memcached (force = true)
            

            Bucket deletion happens rapidly in memcached and we get to the part where ns_server deletes the files on disk, which begins here:

            [ns_server:debug,2021-04-22T19:30:25.176-07:00,ns_1@172.23.106.249:ns_memcached-standard_bucket0<0.10204.25>:ns_memcached:delete_bucket:849]Proceeding into vbuckets dbs deletions
            

            The master DB is deleted and we continue to delete the data files:

            [ns_server:info,2021-04-22T19:30:25.195-07:00,couchdb_ns_1@cb.local:<0.20988.2>:ns_couchdb_storage:delete_couch_database:65]Deleting database <<"standard_bucket0/master">>: ok
             
            [ns_server:info,2021-04-22T19:30:25.195-07:00,couchdb_ns_1@cb.local:<0.20988.2>:ns_couchdb_storage:delete_databases_and_files:30]Couch dbs are deleted. Proceeding with bucket directory "c:/Program Files/Couchbase/Server/var/lib/couchbase/data/standard_bucket0"
            

            It turns out that .249 is the only node in the cluster so there are 1024 files to delete:

            They're small files though - just 8 MB in total:

            And zero items - which of course is why it got deleted so fast in memcached:

            30 s later the "slow bucket stop" hits and we get the backtrace of all the processes. We can see that the deletion task is still busy trying to remove all the files. In particular it's working on 15.couch.1 at this time:

            [ns_server:error,2021-04-22T19:30:54.494-07:00,couchdb_ns_1@cb.local:<0.21275.2>:timeout_diag_logger:do_diag:123]
            {<0.20988.2>,
             [{backtrace,[<<"Program counter: 0x000001cff7f18dc8 (gen:do_call/4 + 320)">>,
                          <<"CP: 0x0000000000000000 (invalid)">>,<<>>,
                          <<"0x000001cffb272128 Return addr 0x000001cff7df15f8 (gen_server:call/3 + 128)">>,
                          <<"y(0)     infinity">>,<<"y(1)     []">>,
                          <<"y(2)     #Ref<0.3113985660.1347944450.214387>">>,
                          <<"y(3)     <0.71.0>">>,<<>>,
                          <<"0x000001cffb272150 Return addr 0x000001cffb487730 (file:call/2 + 224)">>,
                          <<"y(0)     infinity">>,
                          <<"y(1)     {read_link_info,\"c:/Program Files/Couchbase/Server/var/lib/couchbase/data/standard_bucket0/15.couch.1\"}">>,
                          <<"y(2)     file_server_2">>,
                          <<"y(3)     Catch 0x000001cff7df15f8 (gen_server:call/3 + 128)">>,
                          <<>>,
                          <<"0x000001cffb272178 Return addr 0x000001cffb0dd608 (misc:rm_rf_is_dir/1 + 56)">>,
                          <<"y(0)     true">>,<<>>,
                          <<"0x000001cffb272188 Return addr 0x000001cffb0dd150 (misc:rm_rf/1 + 88)">>,
                      <<>>,
                      <<"0x000001cffb272190 Return addr 0x000001cffb0dd7a8 (misc:rm_rf_loop/2 + 136)">>,
                      <<"y(0)     []">>,
                      <<"y(1)     \"c:/Program Files/Couchbase/Server/var/lib/couchbase/data/standard_bucket0/15.couch.1\"">>,
                          <<>>,
                          <<"0x000001cffb2721a8 Return addr 0x000001cffb0dd3c8 (misc:rm_rf/1 + 720)">>,
                          <<"y(0)     []">>,
                          <<"y(1)     \"c:/Program Files/Couchbase/Server/var/lib/couchbase/data/standard_bucket0/15.couch.1\"">>,
                          <<"(2)     [\"150.couch.1\",\"151.couch.1\",\"152.couch.1\",\"153.couch.1\",\"154.couch.1\",\"155.couch.1\",\"156.couch.1\",\"157.couch.1\"">>,
                          <<"y(3)     \"c:/Program Files/Couchbase/Server/var/lib/couchbase/data/standard_bucket0\"">>,
                          <<>>,
                          <<"0x000001cffb2721d0 Return addr 0x000001cffd37c260 (ns_couchdb_storage:delete_databases_and_files/1 + 560)">>,
                          <<"y(0)     []">>,
                          <<"y(1)     \"c:/Program Files/Couchbase/Server/var/lib/couchbase/data/standard_bucket0\"">>,
                          <<>>,
                          <<"0x000001cffb2721e8 Return addr 0x000001cffaa39fa8 (rpc:'-handle_call_call/6-fun-0-'/5 + 144)">>,
                          <<"y(0)     []">>,<<"y(1)     \"standard_bucket0\"">>,<<>>,
                          <<"0x000001cffb272200 Return addr 0x00007ffc69164398 (<terminate process normally>)">>,
                          <<"y(0)     {<8864.10204.25>,#Ref<8864.317157537.4271374341.2802>}">>,
                          <<"y(1)     []">>,<<"y(2)     []">>,<<"y(3)     []">>,
                          <<"y(4)     Catch 0x000001cffaa39fa8 (rpc:'-handle_call_call/6-fun-0-'/5 + 144)">>,
            

            Some 6 minutes later it's finally done:

            [ns_server:info,2021-04-22T19:36:03.780-07:00,couchdb_ns_1@cb.local:<0.20988.2>:ns_couchdb_storage:delete_databases_and_files:40]Bucket "standard_bucket0" deletion has finished with ok
            

            So, why is it taking so long? For one thing, deleting files on Windows is slow. For a second, the machine is completely maxed out:

            Its also completely unclear which process is consuming all the CPU. None of the Couchbase Server processes are using more than 1/2 a core - and this is a 6 core machine.

            I think this is environmental.

            dfinlay Dave Finlay added a comment - I decided to take a look at this one as I looked at one that was quite similar to this in the past: MB-44452 (and MB-41895 ). It just takes a long time to delete all the vbuckets. It starts here: [ns_server:info,2021-04-22T19:30:24.859-07:00,ns_1@172.23.106.249:ns_memcached-standard_bucket0<0.10204.25>:ns_memcached:delete_bucket:838]Deleting bucket "standard_bucket0" from memcached (force = true) Bucket deletion happens rapidly in memcached and we get to the part where ns_server deletes the files on disk, which begins here: [ns_server:debug,2021-04-22T19:30:25.176-07:00,ns_1@172.23.106.249:ns_memcached-standard_bucket0<0.10204.25>:ns_memcached:delete_bucket:849]Proceeding into vbuckets dbs deletions The master DB is deleted and we continue to delete the data files: [ns_server:info,2021-04-22T19:30:25.195-07:00,couchdb_ns_1@cb.local:<0.20988.2>:ns_couchdb_storage:delete_couch_database:65]Deleting database <<"standard_bucket0/master">>: ok   [ns_server:info,2021-04-22T19:30:25.195-07:00,couchdb_ns_1@cb.local:<0.20988.2>:ns_couchdb_storage:delete_databases_and_files:30]Couch dbs are deleted. Proceeding with bucket directory "c:/Program Files/Couchbase/Server/var/lib/couchbase/data/standard_bucket0" It turns out that .249 is the only node in the cluster so there are 1024 files to delete: They're small files though - just 8 MB in total: And zero items - which of course is why it got deleted so fast in memcached: 30 s later the "slow bucket stop" hits and we get the backtrace of all the processes. We can see that the deletion task is still busy trying to remove all the files. In particular it's working on 15.couch.1 at this time: [ns_server:error,2021-04-22T19:30:54.494-07:00,couchdb_ns_1@cb.local:<0.21275.2>:timeout_diag_logger:do_diag:123] {<0.20988.2>, [{backtrace,[<<"Program counter: 0x000001cff7f18dc8 (gen:do_call/4 + 320)">>, <<"CP: 0x0000000000000000 (invalid)">>,<<>>, <<"0x000001cffb272128 Return addr 0x000001cff7df15f8 (gen_server:call/3 + 128)">>, <<"y(0) infinity">>,<<"y(1) []">>, <<"y(2) #Ref<0.3113985660.1347944450.214387>">>, <<"y(3) <0.71.0>">>,<<>>, <<"0x000001cffb272150 Return addr 0x000001cffb487730 (file:call/2 + 224)">>, <<"y(0) infinity">>, <<"y(1) {read_link_info,\"c:/Program Files/Couchbase/Server/var/lib/couchbase/data/standard_bucket0/15.couch.1\"}">>, <<"y(2) file_server_2">>, <<"y(3) Catch 0x000001cff7df15f8 (gen_server:call/3 + 128)">>, <<>>, <<"0x000001cffb272178 Return addr 0x000001cffb0dd608 (misc:rm_rf_is_dir/1 + 56)">>, <<"y(0) true">>,<<>>, <<"0x000001cffb272188 Return addr 0x000001cffb0dd150 (misc:rm_rf/1 + 88)">>, <<>>, <<"0x000001cffb272190 Return addr 0x000001cffb0dd7a8 (misc:rm_rf_loop/2 + 136)">>, <<"y(0) []">>, <<"y(1) \"c:/Program Files/Couchbase/Server/var/lib/couchbase/data/standard_bucket0/15.couch.1\"">>, <<>>, <<"0x000001cffb2721a8 Return addr 0x000001cffb0dd3c8 (misc:rm_rf/1 + 720)">>, <<"y(0) []">>, <<"y(1) \"c:/Program Files/Couchbase/Server/var/lib/couchbase/data/standard_bucket0/15.couch.1\"">>, <<"(2) [\"150.couch.1\",\"151.couch.1\",\"152.couch.1\",\"153.couch.1\",\"154.couch.1\",\"155.couch.1\",\"156.couch.1\",\"157.couch.1\"">>, <<"y(3) \"c:/Program Files/Couchbase/Server/var/lib/couchbase/data/standard_bucket0\"">>, <<>>, <<"0x000001cffb2721d0 Return addr 0x000001cffd37c260 (ns_couchdb_storage:delete_databases_and_files/1 + 560)">>, <<"y(0) []">>, <<"y(1) \"c:/Program Files/Couchbase/Server/var/lib/couchbase/data/standard_bucket0\"">>, <<>>, <<"0x000001cffb2721e8 Return addr 0x000001cffaa39fa8 (rpc:'-handle_call_call/6-fun-0-'/5 + 144)">>, <<"y(0) []">>,<<"y(1) \"standard_bucket0\"">>,<<>>, <<"0x000001cffb272200 Return addr 0x00007ffc69164398 (<terminate process normally>)">>, <<"y(0) {<8864.10204.25>,#Ref<8864.317157537.4271374341.2802>}">>, <<"y(1) []">>,<<"y(2) []">>,<<"y(3) []">>, <<"y(4) Catch 0x000001cffaa39fa8 (rpc:'-handle_call_call/6-fun-0-'/5 + 144)">>, Some 6 minutes later it's finally done: [ns_server:info,2021-04-22T19:36:03.780-07:00,couchdb_ns_1@cb.local:<0.20988.2>:ns_couchdb_storage:delete_databases_and_files:40]Bucket "standard_bucket0" deletion has finished with ok So, why is it taking so long? For one thing, deleting files on Windows is slow. For a second, the machine is completely maxed out: Its also completely unclear which process is consuming all the CPU. None of the Couchbase Server processes are using more than 1/2 a core - and this is a 6 core machine. I think this is environmental.

            Reran the tests against 7.0.0-5017 - http://qa.sc.couchbase.com/job/temp_win_private_eventing_job_1/14/consoleFull - issue was not seen in this run

            arunkumar Arunkumar Senthilnathan added a comment - Reran the tests against 7.0.0-5017 - http://qa.sc.couchbase.com/job/temp_win_private_eventing_job_1/14/consoleFull - issue was not seen in this run

            People

              arunkumar Arunkumar Senthilnathan
              arunkumar Arunkumar Senthilnathan
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty