Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Cannot Reproduce
- Version: Cheshire-Cat
- Triage: Untriaged
- 1
- No
Description
Build 7.0.0-4960, running on 6 GB RAM, 6-core boxes.
Test:
./testrunner -i /tmp/win10-gsi.ini -p get-cbcollect-info=True -t clitest.collectinfotest.CollectinfoTests.collectinfo_test,sasl_buckets=1,standard_buckets=1,GROUP=P0
[2021-04-22 19:32:07,487] - [rest_client:1873] ERROR - {'status': 'none', 'errorMessage': 'Rebalance failed. See logs for detailed reason. You can try again.'} - rebalance failed
[2021-04-22 19:32:10,762] - [rest_client:3804] INFO - Latest logs from UI on 172.23.106.249:
[2021-04-22 19:32:10,763] - [rest_client:3805] ERROR - {'node': 'ns_1@172.23.106.249', 'type': 'critical', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1619145120056, 'shortText': 'message', 'text': 'Rebalance exited with reason {buckets_shutdown_wait_failed,\n [{\'ns_1@172.23.106.249\',\n {\'EXIT\',\n}}]}.\nRebalance Operation Id = 9bc9227c38dc7da499ddf0916205a12e', 'serverTime': '2021-04-22T19:32:00.056Z'}
[2021-04-22 19:32:10,763] - [rest_client:3805] ERROR - {'node': 'ns_1@172.23.106.249', 'type': 'critical', 'code': 0, 'module': 'ns_rebalancer', 'tstamp': 1619145120054, 'shortText': 'message', 'text': 'Failed to wait deletion of some buckets on some nodes: [{\'ns_1@172.23.106.249\',\n {\'EXIT\',\n}}]\n', 'serverTime': '2021-04-22T19:32:00.054Z'}
[2021-04-22 19:32:10,763] - [rest_client:3805] ERROR -
[2021-04-22 19:32:10,763] - [rest_client:3805] ERROR - {'node': 'ns_1@172.23.106.249', 'type': 'warning', 'code': 102, 'module': 'menelaus_web', 'tstamp': 1619145060049, 'shortText': 'client-side error report', 'text': 'Client-side error-report for user "Administrator" on node \'ns_1@172.23.106.249\':\nUser-Agent:Python-httplib2/0.13.1 (gzip)\nStarting rebalance from test, ejected nodes [\'ns_1@172.23.136.127\', \'ns_1@172.23.136.129\', \'ns_1@172.23.136.252\', \'ns_1@172.23.136.253\']', 'serverTime': '2021-04-22T19:31:00.049Z'}
[2021-04-22 19:32:10,763] - [rest_client:3805] ERROR - {'node': 'ns_1@172.23.106.249', 'type': 'info', 'code': 0, 'module': 'ns_memcached', 'tstamp': 1619145024855, 'shortText': 'message', 'text': 'Shutting down bucket "standard_bucket0" on \'ns_1@172.23.106.249\' for deletion', 'serverTime': '2021-04-22T19:30:24.855Z'}
[2021-04-22 19:32:10,763] - [rest_client:3805] ERROR - {'node': 'ns_1@172.23.106.249', 'type': 'info', 'code': 11, 'module': 'menelaus_web', 'tstamp': 1619145018262, 'shortText': 'message', 'text': 'Deleted bucket "default"\n', 'serverTime': '2021-04-22T19:30:18.262Z'}
[2021-04-22 19:32:10,763] - [rest_client:3805] ERROR - {'node': 'ns_1@172.23.106.249', 'type': 'info', 'code': 0, 'module': 'auto_failover', 'tstamp': 1619144999705, 'shortText': 'message', 'text': 'Enabled auto-failover with timeout 120 and max count 1 (repeated 1 times, last seen 13.904235 secs ago)', 'serverTime': '2021-04-22T19:29:59.705Z'}
[2021-04-22 19:32:10,763] - [rest_client:3805] ERROR - {'node': 'ns_1@172.23.106.249', 'type': 'info', 'code': 0, 'module': 'ns_memcached', 'tstamp': 1619144989195, 'shortText': 'message', 'text': 'Shutting down bucket "default" on \'ns_1@172.23.106.249\' for deletion', 'serverTime': '2021-04-22T19:29:49.195Z'}
[2021-04-22 19:32:10,763] - [rest_client:3805] ERROR - {'node': 'ns_1@172.23.106.249', 'type': 'info', 'code': 0, 'module': 'auto_failover', 'tstamp': 1619144982686, 'shortText': 'message', 'text': 'Enabled auto-failover with timeout 120 and max count 1', 'serverTime': '2021-04-22T19:29:42.686Z'}
[2021-04-22 19:32:10,763] - [rest_client:3805] ERROR - {'node': 'ns_1@172.23.106.249', 'type': 'info', 'code': 0, 'module': 'ns_memcached', 'tstamp': 1619144940096, 'shortText': 'message', 'text': 'Shutting down bucket "bucket0" on \'ns_1@172.23.106.249\' for deletion', 'serverTime': '2021-04-22T19:29:00.096Z'}
Cluster instance shutdown with force
Attaching logs
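For reference, the "rebalance failed" error at the top of the output comes from polling the cluster's rebalance status over REST until it completes or reports an error. A minimal sketch of that style of check, assuming the public /pools/default/rebalanceProgress endpoint and placeholder host/credentials (testrunner's rest_client has its own helpers for this):

import time
import requests  # stand-in HTTP client; rest_client itself uses httplib2

def wait_for_rebalance(host, auth, timeout=1200, poll_interval=10):
    # Poll until the rebalance finishes, fails, or we give up. A payload of
    # {'status': 'none', 'errorMessage': ...} is the failure shape seen in
    # the test log above.
    url = f"http://{host}:8091/pools/default/rebalanceProgress"
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = requests.get(url, auth=auth, timeout=30).json()
        if status.get("errorMessage"):
            raise RuntimeError("rebalance failed: " + status["errorMessage"])
        if status.get("status") == "none":
            return  # no rebalance in flight any more: it completed
        time.sleep(poll_interval)
    raise TimeoutError(f"rebalance still running after {timeout}s")

# e.g. wait_for_rebalance("172.23.106.249", ("Administrator", "password"))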
Taking a look at ns_server.debug.log on .249
[user:info,2021-04-22T19:30:24.855-07:00,ns_1@172.23.106.249:ns_memcached-standard_bucket0<0.10204.25>:ns_memcached:do_terminate:809]Shutting down bucket "standard_bucket0" on 'ns_1@172.23.106.249' for deletion
[ns_server:info,2021-04-22T19:30:24.859-07:00,ns_1@172.23.106.249:ns_memcached-standard_bucket0<0.10204.25>:ns_memcached:delete_bucket:838]Deleting bucket "standard_bucket0" from memcached (force = true)
[ns_server:warn,2021-04-22T19:30:54.344-07:00,ns_1@172.23.106.249:<0.3677.18>:ns_orchestrator:idle:641]Nodes ['ns_1@172.23.106.249'] failed to delete bucket "standard_bucket0" within expected time.
[rebalance:error,2021-04-22T19:32:00.054-07:00,ns_1@172.23.106.249:<0.27456.34>:ns_rebalancer:do_wait_buckets_shutdown:226]Failed to wait deletion of some buckets on some nodes: [{'ns_1@172.23.106.249',
  {'EXIT',
    {old_buckets_shutdown_wait_failed,
      ["standard_bucket0"]}}}]
From memcached's point of view, the deletion finished at 19:30:25:
2021-04-22T19:30:24.882004-07:00 INFO 2416: Delete bucket [standard_bucket0]. Notifying engine
2021-04-22T19:30:24.903240-07:00 INFO 2416: Delete bucket [standard_bucket0]. Engine ready for shutdown
2021-04-22T19:30:24.903263-07:00 INFO 2416: Delete bucket [standard_bucket0]. Wait for 18 clients to disconnect
2021-04-22T19:30:24.906028-07:00 INFO 2416: Delete bucket [standard_bucket0]. Shut down the bucket
2021-04-22T19:30:25.108134-07:00 INFO (standard_bucket0) Deleted KvBucket.
2021-04-22T19:30:25.176408-07:00 INFO 2416: Delete bucket [standard_bucket0]. Clean up allocated resources
2021-04-22T19:30:25.176456-07:00 INFO 2416: Delete bucket [standard_bucket0] complete
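End to end, memcached took roughly 0.3 seconds to tear the bucket down, well inside the 30-second budget. A quick sanity check on the two timestamps copied from the log above:

from datetime import datetime

# First and last memcached log timestamps for the standard_bucket0 deletion.
start = datetime.fromisoformat("2021-04-22T19:30:24.882004-07:00")
end = datetime.fromisoformat("2021-04-22T19:30:25.176456-07:00")
print((end - start).total_seconds())  # ~0.294 seconds for the whole deletion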
However, ns_server complains that the delete failed after a 30-second wait. The window of interest is 2021-04-22T19:30:24 to 2021-04-22T19:30:54.
Interestingly, we see repeated "Attempt to access non existent bucket" messages, e.g.
[ns_server:debug,2021-04-22T19:30:28.891-07:00,ns_1@172.23.106.249:<0.22365.34>:menelaus_web:check_bucket_uuid:1164]Attempt to access non existent bucket "standard_bucket0"
We also see:
[ns_server:error,2021-04-22T17:26:02.559-07:00,ns_1@172.23.106.249:<0.8301.19>:timeout_diag_logger:do_diag:121]Got timeout {slow_bucket_stop,"travel-sample"}
[ns_server:error,2021-04-22T18:12:46.186-07:00,ns_1@172.23.106.249:<0.25148.24>:timeout_diag_logger:do_diag:121]Got timeout {slow_bucket_stop,"default"}
[ns_server:error,2021-04-22T19:29:46.013-07:00,ns_1@172.23.106.249:<0.24140.34>:timeout_diag_logger:do_diag:121]Got timeout {slow_bucket_stop,"bucket0"}
[ns_server:error,2021-04-22T19:30:54.526-07:00,ns_1@172.23.106.249:<0.27057.34>:timeout_diag_logger:do_diag:121]Got timeout {slow_bucket_stop,"standard_bucket0"}
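There is one slow_bucket_stop diag per bucket deleted during the run, so it is worth pulling all of them out of the debug log in one pass. A throwaway scan along these lines (log path is a placeholder):

import re

# Print the timestamp and bucket name for every slow_bucket_stop diag
# in an ns_server debug log.
pattern = re.compile(r'\[ns_server:error,([^,]+),.*\{slow_bucket_stop,"([^"]+)"\}')
with open("ns_server.debug.log") as log:  # placeholder path
    for line in log:
        match = pattern.search(line)
        if match:
            print(match.group(1), match.group(2))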
cpu_utilization appears to be pegged at 100%, and we are also seeing some stats blips.
Therefore I'm inclined to put this down to the issues seen in MB-45061. However, I will let the ns_server team make the call...