Details
- Bug
- Resolution: Fixed
- Major
- master
- Untriaged
- Unknown
- Magma: Jan 20 - Feb 2
Description
I had a cluster with ~1.8B 1k documents spread across 4 nodes. It was in the middle of a rebalance when all nodes were shut down by the cloud provider (automated shutdown, not unexpected).
When I started them all up a few hours later, the cluster came back online but initially appeared stuck in a warmup loop. After a while (~14 minutes) the nodes started to show some signs of progress. However, the warmup messages in the UI for the bucket showed lines such as "ec2-34-210-32-106.us-west-2.compute.amazonaws.com:8091 starting ep-engine" repeated multiple times for all nodes, the warmup progress indicator stayed at 0%, and the statistics for the nodes themselves kept flashing on and off as if the nodes were restarting. At first I thought this was an infinite loop, but I eventually noticed that some nodes were making progress, and all of them finished after about 25 minutes.
The memcached logs on the nodes contain the following sequence of messages repeated many times:
2020-02-01T15:29:56.851285+00:00 INFO 13264: Client 127.0.0.1:49156 authenticated as <ud>@ns_server</ud>
2020-02-01T15:29:56.851413+00:00 INFO 13264: HELO [regular] [ 127.0.0.1:49156 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
2020-02-01T15:29:56.852536+00:00 INFO 13264 Create bucket [magma]
2020-02-01T15:29:56.852542+00:00 WARNING 13264 Create bucket [magma] failed - Already exists
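To put a number on how noisy this is, the warnings can be counted per minute. The following is only a rough sketch, assuming the log lines have exactly the format shown above; the function name `count_failed_creates` and the sample input are hypothetical and not part of any Couchbase tooling.

```python
import re
from collections import Counter

# Matches warning lines in the format quoted in this report, e.g.
# 2020-02-01T15:29:56.852542+00:00 WARNING 13264 Create bucket [magma] failed - Already exists
# The timestamp is truncated to the minute so that counts can be grouped.
FAILED_CREATE = re.compile(
    r"^(?P<minute>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2}\.\d+\S*\s+"
    r"WARNING\s+\d+:?\s+"
    r"Create bucket \[(?P<bucket>[^\]]+)\] failed - Already exists"
)


def count_failed_creates(lines):
    """Count failed create-bucket warnings, keyed by (minute, bucket name)."""
    counts = Counter()
    for line in lines:
        match = FAILED_CREATE.match(line.strip())
        if match:
            counts[(match.group("minute"), match.group("bucket"))] += 1
    return counts


# Hypothetical sample input: two of the lines quoted in this report.
sample = [
    "2020-02-01T15:29:56.852536+00:00 INFO 13264 Create bucket [magma]",
    "2020-02-01T15:29:56.852542+00:00 WARNING 13264 Create bucket [magma] failed - Already exists",
]

for (minute, bucket), n in sorted(count_failed_creates(sample).items()):
    print(minute, bucket, n)
```

Against a real log, the same function can be fed an open file handle instead of the sample list, since it only iterates over lines.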
Logs about 10 minutes into warmup:
https://s3.amazonaws.com/cb-engineering/perry/magma-warmup/collectinfo-2020-02-01T153035-ns_1%40ec2-34-210-32-106.us-west-2.compute.amazonaws.com.zip
https://s3.amazonaws.com/cb-engineering/perry/magma-warmup/collectinfo-2020-02-01T153035-ns_1%40ec2-44-231-57-190.us-west-2.compute.amazonaws.com.zip
https://s3.amazonaws.com/cb-engineering/perry/magma-warmup/collectinfo-2020-02-01T153035-ns_1%40ec2-44-231-99-176.us-west-2.compute.amazonaws.com.zip
https://s3.amazonaws.com/cb-engineering/perry/magma-warmup/collectinfo-2020-02-01T153035-ns_1%40ec2-44-232-229-241.us-west-2.compute.amazonaws.com.zip
And logs once it finally finished:
https://s3.amazonaws.com/cb-engineering/perry/magma-warmup-finished/collectinfo-2020-02-01T154757-ns_1%40ec2-34-210-32-106.us-west-2.compute.amazonaws.com.zip
https://s3.amazonaws.com/cb-engineering/perry/magma-warmup-finished/collectinfo-2020-02-01T154757-ns_1%40ec2-44-231-57-190.us-west-2.compute.amazonaws.com.zip
https://s3.amazonaws.com/cb-engineering/perry/magma-warmup-finished/collectinfo-2020-02-01T154757-ns_1%40ec2-44-231-99-176.us-west-2.compute.amazonaws.com.zip
https://s3.amazonaws.com/cb-engineering/perry/magma-warmup-finished/collectinfo-2020-02-01T154757-ns_1%40ec2-44-232-229-241.us-west-2.compute.amazonaws.com.zip
ec2-34-210-32-106.us-west-2.compute.amazonaws.com seemed to take the longest to finish.
Ultimately warmup did succeed, so the main issue is the lack of any progress indication to the user, who may otherwise conclude that it is stuck. I suspect there are also lessons here about how to improve warmup speed and/or reduce the amount of noise in the logs.
Issue Links
- is triggering: MB-37786 Large number of "Create bucket [magma] failed" (Closed)