  1. Couchbase Server
  2. MB-37763

[Magma] - Warmup takes a long time with no indication of progress

Details

    • Bug
    • Resolution: Fixed
    • Major
    • 7.0.0
    • master
    • storage-engine
    • Untriaged
    • Unknown
    • Magma: Jan 20 - Feb 2

    Description

      I had a cluster with ~1.8B 1k documents spread across 4 nodes.  It was in the middle of a rebalance when all nodes were shut down by the cloud provider (automated shutdown, not unexpected).

       

      When I started them all up a few hours later, the cluster came back online but initially appeared stuck in a warmup loop. After about 14 minutes the nodes began to show some signs of progress, but the warmup messages in the UI for the bucket showed "ec2-34-210-32-106.us-west-2.compute.amazonaws.com:8091 starting ep-engine" repeated for all nodes multiple times, the warmup indicator stayed at 0% progress, and the statistics for the nodes themselves kept flashing on and off as if they were restarting. I initially thought this was an infinite loop, but eventually noticed that some nodes were making progress, and all of them finished after about 25 minutes.

      The memcached logs on the nodes contain the log messages below, repeated many times:

      2020-02-01T15:29:56.851285+00:00 INFO 13264: Client 127.0.0.1:49156 authenticated as <ud>@ns_server</ud> 
      2020-02-01T15:29:56.851413+00:00 INFO 13264: HELO [regular] [ 127.0.0.1:49156 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ] 
      2020-02-01T15:29:56.852536+00:00 INFO 13264 Create bucket [magma] 
      2020-02-01T15:29:56.852542+00:00 WARNING 13264 Create bucket [magma] failed - Already exists
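
      To quantify how noisy these repeated messages are, a small script along the following lines could tally them. This is only a sketch: the line layout (timestamp, level, connection id, message) is inferred from the four sample lines above, and the names `LINE_RE`, `normalize`, and `count_messages` are hypothetical helpers, not part of any Couchbase tool.

```python
import re
from collections import Counter

# Assumed layout, inferred only from the sample lines in this report:
#   <timestamp> <LEVEL> <connection id>[:] <message>
LINE_RE = re.compile(r"^(\S+)\s+([A-Z]+)\s+(\d+):?\s+(.*)$")

# IPv4 address followed by a port, e.g. 127.0.0.1:49156
ADDR_RE = re.compile(r"\d+\.\d+\.\d+\.\d+:\d+")


def normalize(message):
    """Mask address:port pairs so messages that differ only by client port group together."""
    return ADDR_RE.sub("<addr>", message).strip()


def count_messages(lines):
    """Return a Counter keyed by (level, normalized message)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        level, message = match.group(2), match.group(4)
        counts[(level, normalize(message))] += 1
    return counts


SAMPLE = """\
2020-02-01T15:29:56.851285+00:00 INFO 13264: Client 127.0.0.1:49156 authenticated as <ud>@ns_server</ud>
2020-02-01T15:29:56.851413+00:00 INFO 13264: HELO [regular] [ 127.0.0.1:49156 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
2020-02-01T15:29:56.852536+00:00 INFO 13264 Create bucket [magma]
2020-02-01T15:29:56.852542+00:00 WARNING 13264 Create bucket [magma] failed - Already exists
"""

if __name__ == "__main__":
    # Print each distinct message with its count, most frequent first.
    for (level, message), n in count_messages(SAMPLE.splitlines()).most_common():
        print(n, level, message)
```

      Passing an open log file to `count_messages` instead of `SAMPLE.splitlines()` would give the same tally for a full log; the exact file name and location depend on the installation.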
      

      Logs about 10 minutes into warmup:

      https://s3.amazonaws.com/cb-engineering/perry/magma-warmup/collectinfo-2020-02-01T153035-ns_1%40ec2-34-210-32-106.us-west-2.compute.amazonaws.com.zip
      https://s3.amazonaws.com/cb-engineering/perry/magma-warmup/collectinfo-2020-02-01T153035-ns_1%40ec2-44-231-57-190.us-west-2.compute.amazonaws.com.zip
      https://s3.amazonaws.com/cb-engineering/perry/magma-warmup/collectinfo-2020-02-01T153035-ns_1%40ec2-44-231-99-176.us-west-2.compute.amazonaws.com.zip
      https://s3.amazonaws.com/cb-engineering/perry/magma-warmup/collectinfo-2020-02-01T153035-ns_1%40ec2-44-232-229-241.us-west-2.compute.amazonaws.com.zip

       

      And logs once it finally finished:
      https://s3.amazonaws.com/cb-engineering/perry/magma-warmup-finished/collectinfo-2020-02-01T154757-ns_1%40ec2-34-210-32-106.us-west-2.compute.amazonaws.com.zip
      https://s3.amazonaws.com/cb-engineering/perry/magma-warmup-finished/collectinfo-2020-02-01T154757-ns_1%40ec2-44-231-57-190.us-west-2.compute.amazonaws.com.zip
      https://s3.amazonaws.com/cb-engineering/perry/magma-warmup-finished/collectinfo-2020-02-01T154757-ns_1%40ec2-44-231-99-176.us-west-2.compute.amazonaws.com.zip
      https://s3.amazonaws.com/cb-engineering/perry/magma-warmup-finished/collectinfo-2020-02-01T154757-ns_1%40ec2-44-232-229-241.us-west-2.compute.amazonaws.com.zip
      ec2-34-210-32-106.us-west-2.compute.amazonaws.com seemed to take the longest to finish.

      Ultimately it did succeed, so the main issue is the lack of any progress indication: users need one so that they don't think warmup is stuck. I suspect there are also lessons here about how to improve warmup speed and/or reduce the amount of noise in the logs.

            People

              apaar.gupta Apaar Gupta
              perry Perry Krug

                Time Tracking

                  Estimated: 16h
                  Remaining: 16h
                  Logged: Not Specified