Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-51477

[Magma] Rebalance exited with reason buckets_shutdown_wait_failed

    XMLWordPrintable

Details

    • Untriaged
    • Centos 64-bit
    • 1
    • Unknown

    Description

      Seeing this issue during cluster teardown:

      rebalance_report_20220314T223148.json:
       
      {"stageInfo":{"data":{"startTime":"2022-03-14T15:30:48.099-07:00","completedTime":false,"timeTaken":60280}},"rebalanceId":"966b008ec2d7d181ce4f40571d563618","nodesInfo":{"active_nodes":["ns_1@172.23.100.38","ns_1@172.23.100.39"],"keep_nodes":["ns_1@172.23.100.38"],"eject_nodes":["ns_1@172.23.100.39"],"delta_nodes":[],"failed_nodes":[]},"masterNode":"ns_1@172.23.100.38","startTime":"2022-03-14T15:30:48.098-07:00","completedTime":"2022-03-14T15:31:48.379-07:00","timeTaken":60281,"completionMessage":"Rebalance exited with reason {buckets_shutdown_wait_failed,\n                              [{'ns_1@172.23.100.38',\n                                {'EXIT',\n                                 {old_buckets_shutdown_wait_failed,\n                                  [\"default\"]}}}]}."}
      
      

      Attachments

        1. 172.23.100.38.zip
          24.32 MB
        2. 172.23.100.39.zip
          28.18 MB

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            ritam.sharma Ritam Sharma added a comment -

            Tests running with toy builds - http://latestbuilds.service.couchbase.com/builds/latestbuilds/couchbase-server/toybuilds/14543/ and http://latestbuilds.service.couchbase.com/builds/latestbuilds/couchbase-server/toybuilds/14544/ . This will help us possibly mitigate the issue. Please run the tests with 5 minute timeout.

            Test are running for XDCR for different RR ratio - 1%, 50%, 100%

            ritam.sharma Ritam Sharma added a comment - Tests running with toy builds - http://latestbuilds.service.couchbase.com/builds/latestbuilds/couchbase-server/toybuilds/14543/ and http://latestbuilds.service.couchbase.com/builds/latestbuilds/couchbase-server/toybuilds/14544/ . This will help us possibly mitigate the issue. Please run the tests with 5 minute timeout. Test are running for XDCR for different RR ratio - 1%, 50%, 100%

            Jobs on toybuild 7.1.0-14543 are in progress and havent run into the rebalance issue yet.
            Posting the jobs here for tracking:
            unidirectional-1-magma-100DGM: http://qe-jenkins1.sc.couchbase.com/job/test_suite_executor/706/console
            unidirectional-1-magma-50DGM: http://qe-jenkins1.sc.couchbase.com/job/test_suite_executor/707/console
            unidirectional-1-magma-10DGM: http://qe-jenkins1.sc.couchbase.com/job/test_suite_executor/705/console
            unidirectional-1-magma-1DGM: http://qe-jenkins1.sc.couchbase.com/job/test_suite_executor/704/console

            Jobs on toybuild 7.1.0-14544 have just been triggerred, will update in a few hours.

            pavithra.mahamani Pavithra Mahamani added a comment - Jobs on toybuild 7.1.0-14543 are in progress and havent run into the rebalance issue yet. Posting the jobs here for tracking: unidirectional-1-magma-100DGM: http://qe-jenkins1.sc.couchbase.com/job/test_suite_executor/706/console unidirectional-1-magma-50DGM: http://qe-jenkins1.sc.couchbase.com/job/test_suite_executor/707/console unidirectional-1-magma-10DGM: http://qe-jenkins1.sc.couchbase.com/job/test_suite_executor/705/console unidirectional-1-magma-1DGM: http://qe-jenkins1.sc.couchbase.com/job/test_suite_executor/704/console Jobs on toybuild 7.1.0-14544 have just been triggerred, will update in a few hours.
            pavithra.mahamani Pavithra Mahamani added a comment - - edited Rebalance exit not seen in 7.1.0-14544 runs too. http://qe-jenkins.sc.couchbase.com/job/test_suite_executor/723/ http://qe-jenkins.sc.couchbase.com/job/test_suite_executor/722/
            ritam.sharma Ritam Sharma added a comment - New toy build available, running against regular machines http://latestbuilds.service.couchbase.com/builds/latestbuilds/couchbase-server/toybuilds/14563/couchbase-server-enterprise-7.1.0-14563-centos7.x86_64.rpm http://qe-jenkins.sc.couchbase.com/job/test_suite_executor/726/ http://qe-jenkins.sc.couchbase.com/job/test_suite_executor/727/ http://qe-jenkins.sc.couchbase.com/job/test_suite_executor/728/ http://qe-jenkins.sc.couchbase.com/job/test_suite_executor/729/ http://qe-jenkins.sc.couchbase.com/job/test_suite_executor/731/

             Toy build verification and code review in progress. Updating the due to date.

            srinath.duvuru Srinath Duvuru added a comment -  Toy build verification and code review in progress. Updating the due to date.

            Saw 1 occurrence of the issue on latest toybuild 7.1.0-14563:

            {"stageInfo":{"data":{"startTime":"2022-03-18T13:25:55.861-07:00","completedTime":false,"timeTaken":60392}},"rebalanceId":"60936e895117fcae12449ab2145737bd","nodesInfo":{"active_nodes":["ns_1@172.23.100.38","ns_1@172.23.100.39"],"keep_nodes":["ns_1@172.23.100.38"],"eject_nodes":["ns_1@172.23.100.39"],"delta_nodes":[],"failed_nodes":[]},"masterNode":"ns_1@172.23.100.38","startTime":"2022-03-18T13:25:55.860-07:00","completedTime":"2022-03-18T13:26:56.254-07:00","timeTaken":60393,"completionMessage":"Rebalance exited with reason {buckets_shutdown_wait_failed,\n                              [{'ns_1@172.23.100.38',\n                                {'EXIT',\n                                 {old_buckets_shutdown_wait_failed,\n                                  [\"default\"]}}}]}."}
            

            Logs:
            http://cb-jira.s3.us-east-2.amazonaws.com/logs/magma/172.23.100.38_14563.zip
            http://cb-jira.s3.us-east-2.amazonaws.com/logs/magma/172.23.100.39_14563.zip

            pavithra.mahamani Pavithra Mahamani added a comment - Saw 1 occurrence of the issue on latest toybuild 7.1.0-14563: {"stageInfo":{"data":{"startTime":"2022-03-18T13:25:55.861-07:00","completedTime":false,"timeTaken":60392}},"rebalanceId":"60936e895117fcae12449ab2145737bd","nodesInfo":{"active_nodes":["ns_1@172.23.100.38","ns_1@172.23.100.39"],"keep_nodes":["ns_1@172.23.100.38"],"eject_nodes":["ns_1@172.23.100.39"],"delta_nodes":[],"failed_nodes":[]},"masterNode":"ns_1@172.23.100.38","startTime":"2022-03-18T13:25:55.860-07:00","completedTime":"2022-03-18T13:26:56.254-07:00","timeTaken":60393,"completionMessage":"Rebalance exited with reason {buckets_shutdown_wait_failed,\n [{'ns_1@172.23.100.38',\n {'EXIT',\n {old_buckets_shutdown_wait_failed,\n [\"default\"]}}}]}."} Logs: http://cb-jira.s3.us-east-2.amazonaws.com/logs/magma/172.23.100.38_14563.zip http://cb-jira.s3.us-east-2.amazonaws.com/logs/magma/172.23.100.39_14563.zip
            apaar.gupta Apaar Gupta added a comment - - edited

            The issue seems to be better but will still show up when the machine is very slow. In this case it is taking extremely long for kv flushers to process mutations. Similarly Magma::Shutdown is also taking very long since we do a bunch of IO during shutdown.

            2022-03-18T08:32:10.824923-07:00 WARNING (sasl_bucket_1) Slow runtime for 'Running a flusher loop: flusher 3' on thread WriterPool3: 197 s
            

            This message is similar to MB-51422. Pavithra Mahamani was suspecting a performance regression so I ran some Magma perf tests to ensure that we did not have a regression in the toy build. It seems to be fine. I think the shared VM's are getting slow during a sustained workload since we did not see this when the test was ran on machines with dedicated disks.

            The fix to abort ongoing flushes seems to mitigate the issue but we can't be sure. I would recommend we either raise the timeout to 10 minutes or defer the issue. There also are cases where kv_engine will be unable to similarly shutdown in the given 5 min. This operation failing is not fatal and retrying the rebalance works.

            apaar.gupta Apaar Gupta added a comment - - edited The issue seems to be better but will still show up when the machine is very slow. In this case it is taking extremely long for kv flushers to process mutations. Similarly Magma::Shutdown is also taking very long since we do a bunch of IO during shutdown. 2022 - 03 -18T08: 32 : 10.824923 - 07 : 00 WARNING (sasl_bucket_1) Slow runtime for 'Running a flusher loop: flusher 3' on thread WriterPool3: 197 s This message is similar to MB-51422. Pavithra Mahamani was suspecting a performance regression so I ran some Magma perf tests to ensure that we did not have a regression in the toy build. It seems to be fine. I think the shared VM's are getting slow during a sustained workload since we did not see this when the test was ran on machines with dedicated disks. The fix to abort ongoing flushes seems to mitigate the issue but we can't be sure. I would recommend we either raise the timeout to 10 minutes or defer the issue. There also are cases where kv_engine will be unable to similarly shutdown in the given 5 min. This operation failing is not fatal and retrying the rebalance works.

            Thanks Apaar Gupta. Do we want to defer the fix to Morpheus then?

            pavithra.mahamani Pavithra Mahamani added a comment - Thanks Apaar Gupta . Do we want to defer the fix to Morpheus then?
            ritam.sharma Ritam Sharma added a comment -

            Apaar Gupta - We did try with dedicated disk without the toy and with 10 mins sleep, the results were not very good; the tests failed with bucket deletion timeout.
            Pavithra Mahamani - Please confirm on above.

            ritam.sharma Ritam Sharma added a comment - Apaar Gupta - We did try with dedicated disk without the toy and with 10 mins sleep, the results were not very good; the tests failed with bucket deletion timeout. Pavithra Mahamani - Please confirm on above.
            apaar.gupta Apaar Gupta added a comment -

            Ritam Sharma does that mean the toy build was better than the 10 min timeout build? In the toy build we saw this issue only once. If we are seeing a measurable benefit from the toy build we will look into merging the fix for neo too. If we are not sure we can merge just to master.

            apaar.gupta Apaar Gupta added a comment - Ritam Sharma does that mean the toy build was better than the 10 min timeout build? In the toy build we saw this issue only once. If we are seeing a measurable benefit from the toy build we will look into merging the fix for neo too. If we are not sure we can merge just to master.
            apaar.gupta Apaar Gupta added a comment -

            We have merged patches to mitigate this issue. The issue cannot be completely solved since Magma does Io during shutdown and extremely slow disks can still cause us to exceed the 5 minute threshold.

            apaar.gupta Apaar Gupta added a comment - We have merged patches to mitigate this issue. The issue cannot be completely solved since Magma does Io during shutdown and extremely slow disks can still cause us to exceed the 5 minute threshold.

            Build couchbase-server-7.2.0-1069 contains magma commit c768712 with commit message:
            MB-51477 magma: Don't checkpoint on shutdown based on a threshold

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.2.0-1069 contains magma commit c768712 with commit message: MB-51477 magma: Don't checkpoint on shutdown based on a threshold

            Build couchbase-server-7.2.0-1069 contains magma commit 757617d with commit message:
            MB-51477 magma: Cancel inflight SSTable writes during shutdown

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.2.0-1069 contains magma commit 757617d with commit message: MB-51477 magma: Cancel inflight SSTable writes during shutdown

            People

              apaar.gupta Apaar Gupta
              pavithra.mahamani Pavithra Mahamani
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There is 1 open Gerrit change

                  PagerDuty