Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-49512

[Magma] - Cleaning up of the cluster fails with "Rebalance exited with reason {buckets_shutdown_wait_failed"

    XMLWordPrintable

Details

    • Triaged
    • Centos 64-bit
    • 1
    • No
    • KV 2021-Dec, KV 2022-Feb, KV March-22

    Description

      Script to Repro

      This can happen in the tearDown part of the any test. So, in tearDown method we drop all the buckets and remove all the nodes in the cluster. This fails as shown below.
      

      172.23.120.206 10:05:03 PM 11 Nov, 2021 ( 2021-11-11T22:05:03.228-08:00 )

      Rebalance exited with reason {buckets_shutdown_wait_failed,
      [{'ns_1@172.23.120.206',
      {'EXIT',
      {old_buckets_shutdown_wait_failed,
      ["-6AT-Evkts1eHVShDkwV6uJIF5j5BxpFu2DwiLTw0PnB0bYy-33-378000"]}}}]}.
      Rebalance Operation Id = dbb8d76ebc02c654f2c23fbbabac68e9
      

      Even retried rebalance failed.
      172.23.120.206 10:06:28 PM 11 Nov, 2021

      Rebalance exited with reason {buckets_shutdown_wait_failed,
      [{'ns_1@172.23.120.206',
      {'EXIT',
      {old_buckets_shutdown_wait_failed,
      ["-6AT-Evkts1eHVShDkwV6uJIF5j5BxpFu2DwiLTw0PnB0bYy-33-378000"]}}}]}.
      Rebalance Operation Id = 1754a1e783a53551c9f546338cebb3d7
      

      Based on the failures it does look like the previously dropped bucket too longer than expected to get deleted.

      172.23.104.186 10:03:32 PM 11 Nov, 2021

      Shutting down bucket "-6AT-Evkts1eHVShDkwV6uJIF5j5BxpFu2DwiLTw0PnB0bYy-33-378000" on 'ns_1@172.23.104.186' for deletion
      

      Maybe we need to figure out a way to disable the rebalance button until the bucket is fully deleted.

      cbcollect_info attached.

      Attachments

        1. 172.23.100.38.zip
          24.32 MB
        2. 172.23.100.39.zip
          28.18 MB
        3. consoleText_MB-49512_rerun.txt
          320 kB
        4. consoleText_MB-49512_run2_2211.txt
          3.08 MB
        5. screenshot-1.png
          screenshot-1.png
          36 kB
        6. Screenshot 2022-02-26 at 4.19.57 PM.png
          Screenshot 2022-02-26 at 4.19.57 PM.png
          294 kB
        7. UI_MB-49512.png
          UI_MB-49512.png
          595 kB

        Issue Links

          For Gerrit Dashboard: MB-49512
          # Subject Branch Project Status CR V

          Activity

            I am still seeing this issue in 7.1.0-2475 in the XDCR functional tests.

            rebalance_report_20220314T223148.json:
            {"stageInfo":{"data":{"startTime":"2022-03-14T15:30:48.099-07:00","completedTime":false,"timeTaken":60280}},"rebalanceId":"966b008ec2d7d181ce4f40571d563618","nodesInfo":{"active_nodes":["ns_1@172.23.100.38","ns_1@172.23.100.39"],"keep_nodes":["ns_1@172.23.100.38"],"eject_nodes":["ns_1@172.23.100.39"],"delta_nodes":[],"failed_nodes":[]},"masterNode":"ns_1@172.23.100.38","startTime":"2022-03-14T15:30:48.098-07:00","completedTime":"2022-03-14T15:31:48.379-07:00","timeTaken":60281,"completionMessage":"Rebalance exited with reason {buckets_shutdown_wait_failed,\n                              [{'ns_1@172.23.100.38',\n                                {'EXIT',\n                                 {old_buckets_shutdown_wait_failed,\n                                  [\"default\"]}}}]}."}
            

            Logs: 172.23.100.38.zip , 172.23.100.39.zip

            pavithra.mahamani Pavithra Mahamani (Inactive) added a comment - - edited I am still seeing this issue in 7.1.0-2475 in the XDCR functional tests. rebalance_report_20220314T223148.json: {"stageInfo":{"data":{"startTime":"2022-03-14T15:30:48.099-07:00","completedTime":false,"timeTaken":60280}},"rebalanceId":"966b008ec2d7d181ce4f40571d563618","nodesInfo":{"active_nodes":["ns_1@172.23.100.38","ns_1@172.23.100.39"],"keep_nodes":["ns_1@172.23.100.38"],"eject_nodes":["ns_1@172.23.100.39"],"delta_nodes":[],"failed_nodes":[]},"masterNode":"ns_1@172.23.100.38","startTime":"2022-03-14T15:30:48.098-07:00","completedTime":"2022-03-14T15:31:48.379-07:00","timeTaken":60281,"completionMessage":"Rebalance exited with reason {buckets_shutdown_wait_failed,\n [{'ns_1@172.23.100.38',\n {'EXIT',\n {old_buckets_shutdown_wait_failed,\n [\"default\"]}}}]}."} Logs: 172.23.100.38.zip , 172.23.100.39.zip

            Looks to be a different root cause for this issue Pavithra Mahamani.

            2022-03-14T19:38:26.828070-07:00 INFO 62: Delete bucket [default]. Notifying engine
            2022-03-14T19:38:26.829941-07:00 INFO 62: Delete bucket [default]. Engine ready for shutdown
            2022-03-14T19:38:26.829959-07:00 INFO 62: Delete bucket [default]. Wait for 6 clients to disconnect
            2022-03-14T19:38:26.830542-07:00 INFO 62: Delete bucket [default]. Shut down the bucket
            2022-03-14T19:38:26.831182-07:00 INFO (default) KVBucket::deinitialize forceShutdown:true
            2022-03-14T19:39:33.410204-07:00 INFO (default) MagmaKVStore: 0 deinitializing
            2022-03-14T19:39:33.410240-07:00 INFO [(default) magma_0]Closing magma db (thread pool refcount 8)
            ...
            2022-03-14T19:41:44.954724-07:00 INFO [(default) magma_0]Completed closing magma db
            2022-03-14T19:41:44.955078-07:00 INFO (default) MagmaKVStore: 0 deinitialized
            2022-03-14T19:41:44.955092-07:00 INFO (default) MagmaKVStore: 1 deinitializing
            2022-03-14T19:41:44.955097-07:00 INFO [(default) magma_1]Closing magma db (thread pool refcount 7)
            ...
            2022-03-14T19:43:44.957415-07:00 INFO [(default) magma_1]Completed closing magma db
            2022-03-14T19:43:44.957819-07:00 INFO (default) MagmaKVStore: 1 deinitialized
            2022-03-14T19:43:44.957831-07:00 INFO (default) MagmaKVStore: 2 deinitializing
            2022-03-14T19:43:44.957835-07:00 INFO [(default) magma_2]Closing magma db (thread pool refcount 6)
            ...
            2022-03-14T19:45:54.011065-07:00 INFO [(default) magma_2]Completed closing magma db
            2022-03-14T19:45:54.011488-07:00 INFO (default) MagmaKVStore: 2 deinitialized
            2022-03-14T19:45:54.011502-07:00 INFO (default) MagmaKVStore: 3 deinitializing
            2022-03-14T19:45:54.011506-07:00 INFO [(default) magma_3]Closing magma db (thread pool refcount 5)
            ...
            2022-03-14T19:46:07.993157-07:00 INFO [(default) magma_3/kvstore-147/rev-000000001]KVStore::Shutdown
            

            The cleanup in kv_engine looks to be prompt, but the shutdown of the magma shards is rather slow taking multiple minutes. Sarath Lakshman, probably worth somebody on the magma team taking a look at the latest set of logs.

            ben.huddleston Ben Huddleston added a comment - Looks to be a different root cause for this issue Pavithra Mahamani . 2022-03-14T19:38:26.828070-07:00 INFO 62: Delete bucket [default]. Notifying engine 2022-03-14T19:38:26.829941-07:00 INFO 62: Delete bucket [default]. Engine ready for shutdown 2022-03-14T19:38:26.829959-07:00 INFO 62: Delete bucket [default]. Wait for 6 clients to disconnect 2022-03-14T19:38:26.830542-07:00 INFO 62: Delete bucket [default]. Shut down the bucket 2022-03-14T19:38:26.831182-07:00 INFO (default) KVBucket::deinitialize forceShutdown:true 2022-03-14T19:39:33.410204-07:00 INFO (default) MagmaKVStore: 0 deinitializing 2022-03-14T19:39:33.410240-07:00 INFO [(default) magma_0]Closing magma db (thread pool refcount 8) ... 2022-03-14T19:41:44.954724-07:00 INFO [(default) magma_0]Completed closing magma db 2022-03-14T19:41:44.955078-07:00 INFO (default) MagmaKVStore: 0 deinitialized 2022-03-14T19:41:44.955092-07:00 INFO (default) MagmaKVStore: 1 deinitializing 2022-03-14T19:41:44.955097-07:00 INFO [(default) magma_1]Closing magma db (thread pool refcount 7) ... 2022-03-14T19:43:44.957415-07:00 INFO [(default) magma_1]Completed closing magma db 2022-03-14T19:43:44.957819-07:00 INFO (default) MagmaKVStore: 1 deinitialized 2022-03-14T19:43:44.957831-07:00 INFO (default) MagmaKVStore: 2 deinitializing 2022-03-14T19:43:44.957835-07:00 INFO [(default) magma_2]Closing magma db (thread pool refcount 6) ... 2022-03-14T19:45:54.011065-07:00 INFO [(default) magma_2]Completed closing magma db 2022-03-14T19:45:54.011488-07:00 INFO (default) MagmaKVStore: 2 deinitialized 2022-03-14T19:45:54.011502-07:00 INFO (default) MagmaKVStore: 3 deinitializing 2022-03-14T19:45:54.011506-07:00 INFO [(default) magma_3]Closing magma db (thread pool refcount 5) ... 2022-03-14T19:46:07.993157-07:00 INFO [(default) magma_3/kvstore-147/rev-000000001]KVStore::Shutdown The cleanup in kv_engine looks to be prompt, but the shutdown of the magma shards is rather slow taking multiple minutes. Sarath Lakshman , probably worth somebody on the magma team taking a look at the latest set of logs.

            Closing, will file a new ticket based on Ben Huddleston's comment that the root cause is not the same for the latest occurrence.

            pavithra.mahamani Pavithra Mahamani (Inactive) added a comment - Closing, will file a new ticket based on Ben Huddleston 's comment that the root cause is not the same for the latest occurrence.

            Build couchbase-server-7.2.0-1021 contains kv_engine commit f74b76b with commit message:
            MB-49512: Reset task ptr on scheduler thread during taskable shutdown

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.2.0-1021 contains kv_engine commit f74b76b with commit message: MB-49512 : Reset task ptr on scheduler thread during taskable shutdown

            Build couchbase-server-7.2.0-1021 contains kv_engine commit 0517833 with commit message:
            MB-49512: Make Taskable::isShutdown() const

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.2.0-1021 contains kv_engine commit 0517833 with commit message: MB-49512 : Make Taskable::isShutdown() const

            People

              apaar.gupta Apaar Gupta
              Balakumaran.Gopal Balakumaran Gopal
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There is 1 open Gerrit change

                  PagerDuty