Couchbase Server / MB-49512

[Magma] - Cleaning up of the cluster fails with "Rebalance exited with reason {buckets_shutdown_wait_failed"


Details

    • Triaged
    • Centos 64-bit
    • 1
    • No
    • KV 2021-Dec, KV 2022-Feb, KV March-22

    Description

      Script to Repro

      This can happen in the tearDown part of any test. In the tearDown method we drop all the buckets and remove all the nodes from the cluster. This fails as shown below.
      

      172.23.120.206 10:05:03 PM 11 Nov, 2021 ( 2021-11-11T22:05:03.228-08:00 )

      Rebalance exited with reason {buckets_shutdown_wait_failed,
      [{'ns_1@172.23.120.206',
      {'EXIT',
      {old_buckets_shutdown_wait_failed,
      ["-6AT-Evkts1eHVShDkwV6uJIF5j5BxpFu2DwiLTw0PnB0bYy-33-378000"]}}}]}.
      Rebalance Operation Id = dbb8d76ebc02c654f2c23fbbabac68e9
      

      Even the retried rebalance failed.
      172.23.120.206 10:06:28 PM 11 Nov, 2021

      Rebalance exited with reason {buckets_shutdown_wait_failed,
      [{'ns_1@172.23.120.206',
      {'EXIT',
      {old_buckets_shutdown_wait_failed,
      ["-6AT-Evkts1eHVShDkwV6uJIF5j5BxpFu2DwiLTw0PnB0bYy-33-378000"]}}}]}.
      Rebalance Operation Id = 1754a1e783a53551c9f546338cebb3d7
      

      Based on the failures it does look like the previously dropped bucket took longer than expected to be deleted.

      172.23.104.186 10:03:32 PM 11 Nov, 2021

      Shutting down bucket "-6AT-Evkts1eHVShDkwV6uJIF5j5BxpFu2DwiLTw0PnB0bYy-33-378000" on 'ns_1@172.23.104.186' for deletion
      

      Maybe we need to figure out a way to disable the rebalance button until the bucket is fully deleted.

      cbcollect_info attached.

      Attachments

        1. 172.23.100.38.zip
          24.32 MB
        2. 172.23.100.39.zip
          28.18 MB
        3. consoleText_MB-49512_rerun.txt
          320 kB
        4. consoleText_MB-49512_run2_2211.txt
          3.08 MB
        5. screenshot-1.png
          36 kB
        6. Screenshot 2022-02-26 at 4.19.57 PM.png
          294 kB
        7. UI_MB-49512.png
          595 kB


          Activity

            Balakumaran.Gopal Balakumaran Gopal created issue -
            Balakumaran.Gopal Balakumaran Gopal made changes -
            Field Original Value New Value
            Assignee Balakumaran Gopal [ balakumaran.gopal ] Daniel Owen [ owend ]
            owend Daniel Owen made changes -
            Rank Ranked higher
            owend Daniel Owen made changes -
            Assignee Daniel Owen [ owend ] James Harrison [ james.harrison ]
            james.harrison James Harrison made changes -
            Sprint KV 2021-Dec [ 1906 ]
            owend Daniel Owen made changes -
            Sprint KV 2021-Dec [ 1906 ]
            owend Daniel Owen made changes -
            Rank Ranked higher
            owend Daniel Owen made changes -
            Assignee James Harrison [ james.harrison ] Daniel Owen [ owend ]
            richard.demellow Richard deMellow made changes -
            Assignee Daniel Owen [ owend ] Richard deMellow [ richard.demellow ]
            owend Daniel Owen made changes -
            Sprint KV 2021-Dec [ 1906 ]
            owend Daniel Owen made changes -
            Rank Ranked higher
            richard.demellow Richard deMellow made changes -
            Attachment screenshot-1.png [ 171995 ]

            richard.demellow Richard deMellow added a comment -

            Looking at node 172.23.120.206 at 2021-11-11T22:03 we see the shutdown request come in and we start it on time. We progress quickly up to the "Shut down the bucket" log message, after which there is a long gap:

            2021-11-11T22:03:32.120181-08:00 INFO 45: Delete bucket [-6AT-Evkts1eHVShDkwV6uJIF5j5BxpFu2DwiLTw0PnB0bYy-33-378000]. Notifying engine
            2021-11-11T22:03:32.127597-08:00 INFO 45: Delete bucket [-6AT-Evkts1eHVShDkwV6uJIF5j5BxpFu2DwiLTw0PnB0bYy-33-378000]. Engine ready for shutdown
            2021-11-11T22:03:32.127608-08:00 INFO 45: Delete bucket [-6AT-Evkts1eHVShDkwV6uJIF5j5BxpFu2DwiLTw0PnB0bYy-33-378000]. Wait for 4 clients to disconnect
            2021-11-11T22:03:32.127916-08:00 INFO 45: Delete bucket [-6AT-Evkts1eHVShDkwV6uJIF5j5BxpFu2DwiLTw0PnB0bYy-33-378000]. Shut down the bucket
            2021-11-11T22:07:18.983387-08:00 INFO 45: Delete bucket [-6AT-Evkts1eHVShDkwV6uJIF5j5BxpFu2DwiLTw0PnB0bYy-33-378000]. Clean up allocated resources
            2021-11-11T22:07:18.983425-08:00 INFO 45: Delete bucket [-6AT-Evkts1eHVShDkwV6uJIF5j5BxpFu2DwiLTw0PnB0bYy-33-378000] complete
            

            Between these two log messages we're shutting down ep-engine for the bucket:

                LOG_INFO("{}: Delete bucket [{}]. Shut down the bucket",
                         connection_id,
                         name);
                bucket.destroyEngine(force);
             
                LOG_INFO("{}: Delete bucket [{}]. Clean up allocated resources ",
            

            Looking at the logs we see that ~4.5 minutes were spent deinitializing one of the magma shards.

            2021-11-11T22:03:32.127922-08:00 INFO (-6AT-Evkts1eHVShDkwV6uJIF5j5BxpFu2DwiLTw0PnB0bYy-33-378000) Shutting down dcp connections!
            2021-11-11T22:03:32.127972-08:00 INFO (-6AT-Evkts1eHVShDkwV6uJIF5j5BxpFu2DwiLTw0PnB0bYy-33-378000) Attempting to stop flusher:0
            ..
            2021-11-11T22:03:32.127984-08:00 INFO (-6AT-Evkts1eHVShDkwV6uJIF5j5BxpFu2DwiLTw0PnB0bYy-33-378000) Attempting to stop flusher:3
            2021-11-11T22:03:32.127986-08:00 INFO (-6AT-Evkts1eHVShDkwV6uJIF5j5BxpFu2DwiLTw0PnB0bYy-33-378000) Stopping bg fetchers
            2021-11-11T22:03:32.128613-08:00 INFO (-6AT-Evkts1eHVShDkwV6uJIF5j5BxpFu2DwiLTw0PnB0bYy-33-378000) KVBucket::deinitialize forceShutdown:true
            2021-11-11T22:03:32.357658-08:00 INFO (-6AT-Evkts1eHVShDkwV6uJIF5j5BxpFu2DwiLTw0PnB0bYy-33-378000) MagmaKVStore: 0 deinitializing
            2021-11-11T22:07:00.530267-08:00 INFO [(-6AT-Evkts1eHVShDkwV6uJIF5j5BxpFu2DwiLTw0PnB0bYy-33-378000) magma_0]Closing magma db (thread pool refcount 4)
            2021-11-11T22:07:00.530300-08:00 INFO [(-6AT-Evkts1eHVShDkwV6uJIF5j5BxpFu2DwiLTw0PnB0bYy-33-378000) magma_0/kvstore-0/rev-000000001]KVStore::Shutdown
            2021-11-11T22:07:00.580500-08:00 INFO [(-6AT-Evkts1eHVShDkwV6uJIF5j5BxpFu2DwiLTw0PnB0bYy-33-378000) magma_0/kvstore-4/rev-000000001]KVStore::Shutdown
            

            Looking at MagmaKVStore::deinitialize() we see that the time must be spent in the Sync() method, as the KVStore::Shutdown messages are triggered by the call to Close().

            void MagmaKVStore::deinitialize() {
                logger->info("MagmaKVStore: {} deinitializing", configuration.getShardId());
             
                magma->Sync(true);
             
                // Close shuts down all of the magma background threads (compaction is the
                // one that we care about here). The compaction callbacks require the magma
                // instance to exist so we must do this before we reset it.
                magma->Close();
             
                // Flusher should have already been stopped so it should be safe to destroy
                // the magma instance now
                magma.reset();
             
                logger->info("MagmaKVStore: {} deinitialized", configuration.getShardId());
            }
            

            Looking at the disk queue items we don't see anything in them, so we're not trying to flush anything to disk, which makes sense given the flushers stopped quickly.

            Conclusion from KV-Engine point of view
            From a KV point of view ep-engine is shutting down fine but we're waiting on Magma to sync shard 0. I'm going to re-assign this over to the Magma team to investigate why the Sync is taking so long for one shard.

            richard.demellow Richard deMellow made changes -
            Component/s couchbase-bucket [ 10173 ]
            Component/s storage-engine [ 10175 ]
            richard.demellow Richard deMellow made changes -
            Assignee Richard deMellow [ richard.demellow ] John Liang [ jliang ]
            richard.demellow Richard deMellow made changes -
            Triage Untriaged [ 10351 ] Triaged [ 10350 ]
            richard.demellow Richard deMellow made changes -
            Assignee John Liang [ jliang ] Sarath Lakshman [ sarath ]
            Balakumaran.Gopal Balakumaran Gopal made changes -
            Assignee Sarath Lakshman [ sarath ] Srinath Duvuru [ srinath.duvuru ]
            srinath.duvuru Srinath Duvuru made changes -
            Assignee Srinath Duvuru [ srinath.duvuru ] Apaar Gupta [ apaar.gupta ]
            apaar.gupta Apaar Gupta added a comment - - edited

            Sync of Magma_0 took 268 sec, Magma_1 took 14 sec and Magma_3,4 under a second.

            Magma::Sync(true) flushes all memtables and creates checkpoints for all magma instances sequentially. It also waits for pending flushes in the background flushers to complete since there can only be one pending/queued flush per kvstore. This wait is what is causing the Sync to take ages for the first shard since the Flusher threadpool is shared and it is having to wait for flusher tasks from other shards to also complete. That would also explain why the time taken is reducing from the first Sync to the last.

            Unfortunately I am unable to verify how long the flushes are taking because the magma histograms are gone once the bucket is deleted. But I was able to reproduce a similar pattern (first shard taking ages and subsequent syncs taking less time).

            Let me see what can be done to make sure Sync's flush does not wait, but I don't think it will be an easy fix.
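
            To illustrate the shape of the problem, here is a minimal sketch (hypothetical names only, not the actual magma API) of how a shared flusher pool makes the first shard's sequential Sync absorb almost all of the waiting:

            // Hypothetical illustration: Sync(true) on a shard must wait for the
            // flushes already queued for it on the shared flusher pool before it
            // can create checkpoints. Because the pool is shared, the first shard
            // synced ends up waiting behind work queued by every shard; by the
            // time later shards are synced the pool has mostly drained.
            #include <future>
            #include <vector>

            struct Shard {
                std::vector<std::future<void>> pendingFlushes; // queued on the shared pool

                void Sync() {
                    for (auto& flush : pendingFlushes) {
                        flush.wait(); // the first shard pays nearly all of this cost
                    }
                    // ... flush memtables and create a checkpoint ...
                }
            };

            void deinitializeAllShards(std::vector<Shard>& shards) {
                for (auto& shard : shards) {
                    shard.Sync(); // sequential, one shard's KVStore at a time
                }
            }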

            srinath.duvuru Srinath Duvuru made changes -
            Priority Major [ 3 ] Critical [ 2 ]

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.1.0-2133 contains magma commit a6db4ed with commit message:
            MB-49512 lsm: Allow for scheduled LSMTree flushes to be cancelled

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.1.0-2133 contains magma commit 5c35c0f with commit message:
            MB-49512 magma: Add a cleanup function to worker Tasks

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.1.0-2133 contains magma commit c41478b with commit message:
            MB-49512 magma: Add task groups to ThreadPool

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.1.0-2135 contains magma commit 094c0a3 with commit message:
            MB-49512 magma: Cancel all pending tasks during shutdown

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.1.0-2170 contains magma commit 1f377a1 with commit message:
            MB-49512 magma: Create checkpoints during Magma::Close()

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.1.0-2184 contains kv_engine commit 296897d with commit message:
            MB-49512: Remove magma checkpoint creation during shutdown

            apaar.gupta Apaar Gupta added a comment -

            This issue should be fixed; we added code to mitigate tasks waiting on tasks from other magma instances during shutdown. The shutdown times of all shards should be comparable now.

            If there is still an issue, we will have to raise the timeout value, since magma has a lot of files and pending tasks to clean up on shutdown.
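
            As a rough sketch of the idea behind the commit messages above (hypothetical names, not the real magma ThreadPool API): pending work in the shared pool is tagged with a group per magma instance, and shutdown cancels that group's still-queued tasks instead of waiting for them to drain.

            #include <deque>
            #include <functional>
            #include <mutex>

            // Illustration only: a shared pool whose queued tasks carry a group id,
            // so one instance's shutdown can drop its pending work in a single pass
            // rather than queueing behind tasks belonging to other instances.
            class GroupedThreadPool {
            public:
                using GroupId = int;

                void schedule(GroupId group, std::function<void()> fn) {
                    std::lock_guard<std::mutex> guard(lock);
                    pending.push_back({group, std::move(fn)});
                }

                // Called when a magma instance shuts down: cancel its queued tasks.
                void cancelGroup(GroupId group) {
                    std::lock_guard<std::mutex> guard(lock);
                    std::deque<Entry> keep;
                    for (auto& entry : pending) {
                        if (entry.group != group) {
                            keep.push_back(std::move(entry));
                        }
                    }
                    pending.swap(keep);
                }

            private:
                struct Entry {
                    GroupId group;
                    std::function<void()> fn;
                };
                std::mutex lock;
                std::deque<Entry> pending;
            };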

            apaar.gupta Apaar Gupta made changes -
            Resolution Fixed [ 1 ]
            Status Open [ 1 ] Resolved [ 5 ]
            Balakumaran.Gopal Balakumaran Gopal made changes -
            Attachment consoleText_MB-49512_run2_2211.txt [ 176391 ]
            Balakumaran.Gopal Balakumaran Gopal made changes -
            Resolution Fixed [ 1 ]
            Status Resolved [ 5 ] Reopened [ 4 ]
            apaar.gupta Apaar Gupta added a comment -

            We made changes to magma's shutdown to reduce the time spent waiting on pending tasks. I ran grep -r -e "deinitializing" -e "deinitialized" * on all four nodes and looked at the first and last messages of magma shutdown on all the nodes.
            On all the nodes we see

            cbcollect_info_ns_1@172.23.105.206_20220203-041554/memcached.log:2022-02-02T20:16:58.309641-08:00 INFO (bucket2) MagmaKVStore: 0 deinitializing
            cbcollect_info_ns_1@172.23.105.206_20220203-041554/memcached.log:2022-02-02T20:17:00.117316-08:00 INFO (bucket2) MagmaKVStore: 7 deinitialized
            cbcollect_info_ns_1@172.23.105.206_20220203-041554/memcached.log:2022-02-02T20:17:03.370625-08:00 INFO (default) MagmaKVStore: 0 deinitializing
            cbcollect_info_ns_1@172.23.105.206_20220203-041554/memcached.log:2022-02-02T20:17:05.056566-08:00 INFO (default) MagmaKVStore: 7 deinitialized
             
            cbcollect_info_ns_1@172.23.105.36_20220203-041554/memcached.log:2022-02-02T20:16:17.028810-08:00 INFO (bucket2) MagmaKVStore: 0 deinitializing
            cbcollect_info_ns_1@172.23.105.36_20220203-041554/memcached.log:2022-02-02T20:16:19.070000-08:00 INFO (bucket2) MagmaKVStore: 7 deinitialized
            cbcollect_info_ns_1@172.23.105.36_20220203-041554/memcached.log:2022-02-02T20:16:23.042844-08:00 INFO (default) MagmaKVStore: 0 deinitializing
            cbcollect_info_ns_1@172.23.105.36_20220203-041554/memcached.log:2022-02-02T20:16:24.849667-08:00 INFO (default) MagmaKVStore: 7 deinitialized
            

            This indicates 172.23.105.206 took around 2 sec and 172.23.105.36 under a second per bucket (we have two buckets) to shut down its magma instances. It looks like the magma shutdown time issue has been resolved.

            I am not very sure why the rebalance is timing out. Assigning to kv-team to take a look.

            apaar.gupta Apaar Gupta made changes -
            Assignee Apaar Gupta [ apaar.gupta ] Daniel Owen [ owend ]
            owend Daniel Owen made changes -
            Component/s couchbase-bucket [ 10173 ]
            owend Daniel Owen made changes -
            Rank Ranked lower
            owend Daniel Owen made changes -
            Assignee Daniel Owen [ owend ] Ben Huddleston [ ben.huddleston ]
            owend Daniel Owen made changes -
            Sprint KV 2021-Dec [ 1906 ] KV 2021-Dec, KV 2022-Feb [ 1906, 2002 ]
            owend Daniel Owen made changes -
            Rank Ranked higher

            ben.huddleston Ben Huddleston added a comment -

            Last bucket shutdown is slow because we're running through the list of scheduled compaction tasks.

            2022-02-02T20:11:02.019733-08:00 INFO 3076: Delete bucket [bucket1]. Shut down the bucket
            ...
            2022-02-02T20:11:02.020675-08:00 INFO (bucket1) KVBucket::deinitialize forceShutdown:true]
            ...
            2022-02-02T20:11:02.022445-08:00 INFO (bucket1) Compaction of vb:56 done (ok). purged tombstones:0, prepares:0, prepareBytes:0 collection_items_erased:alive:100,deleted:100, collections_erased:100, size/items/tombstones/purge_seqno pre{15
            1651, 100, 209, 0}, post{16483, 0, 109, 0}
            2022-02-02T20:11:02.023649-08:00 WARNING (No Engine) Slow scheduling for AuxIO task 'Compact DB file 453' on thread AuxIoPool2. Schedule overhead: 17 s
            2022-02-02T20:11:02.023674-08:00 INFO (bucket2) MagmaKVStore::compactDBInternal: vb:453 purge_before_ts:1643602261 purge_before_seq:0 drop_deletes:false retain_erroneous_tombstones:true
            2022-02-02T20:11:02.023737-08:00 INFO (bucket1) Compaction of vb:435 done (ok). purged tombstones:0, prepares:0, prepareBytes:0 collection_items_erased:alive:100,deleted:0, collections_erased:100, size/items/tombstones/purge_seqno pre{143
            459, 100, 109, 0}, post{16483, 0, 109, 0}
            ...
            2022-02-02T20:17:53.289950-08:00 WARNING (No Engine) Slow scheduling for AuxIO task 'Compact DB file 672' on thread AuxIoPool6. Schedule overhead: 382 s
            

            And we didn't log "EventuallyPersistentEngine::destroyInner(): Completed deinitialize", so by the looks of it we're still waiting on tasks to finish.


            ben.huddleston Ben Huddleston added a comment -

            Collection dropping a minute earlier probably scheduled all of these:

            2022-02-02T20:10:10.338734-08:00 INFO (default) drop collection:id:0x100, scope:0x10, manifest:0x103
            

            drigby Dave Rigby added a comment -

            Note that CompactTask appears to have completeBeforeShutdown set to false, so we shouldn't schedule any additional tasks to run (which are not already running) when a bucket is deleted.

            However, a Compact task could potentially take a long time to complete once it starts running...


            ben.huddleston Ben Huddleston added a comment -

            We could have a lot in the running state if we have exceeded the 5 second execution delay, as the running state also includes tasks that would run but are waiting on workers to run them, right?

            Also, what actually stops us from scheduling a new task during shutdown? At this point we'll be polling in unregisterTaskable() for numTasksForOwner to be 0 but we won't have removed the taskable yet.

            drigby Dave Rigby added a comment -

            Yes, many could be running - at least up to the number of threads in the given thread pool.

            In terms of what stops scheduling - see http://src.couchbase.org/source/xref/trunk/kv_engine/executor/folly_executorpool.cc#538 where we immediately call cancel on any task not marked as "completeBeforeShutdown". That ultimately sets the GlobalTask's state to TASK_DEAD, which is checked just before we actually run it on the CPU pool here: http://src.couchbase.org/source/xref/trunk/kv_engine/executor/folly_executorpool.cc#163

            So ultimately we should only be waiting for all tasks which were already running at the point the bucket was deleted to finish; no more should run.
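
            A rough sketch of the mechanism described above (hypothetical names, not the real FollyExecutorPool code): unregistering a taskable cancels every task not marked completeBeforeShutdown, and the worker checks for the dead state just before running the task body, so only tasks already executing have to finish.

            #include <atomic>
            #include <vector>

            enum class TaskState { Runnable, Dead };

            struct Task {
                std::atomic<TaskState> state{TaskState::Runnable};
                bool completeBeforeShutdown = false;
                void cancel() { state = TaskState::Dead; } // analogous to TASK_DEAD
                void execute() { /* task body, e.g. a compaction */ }
            };

            // On bucket deletion: cancel anything that doesn't have to finish first.
            void unregisterTaskable(std::vector<Task*>& tasks) {
                for (auto* task : tasks) {
                    if (!task->completeBeforeShutdown) {
                        task->cancel();
                    }
                }
            }

            // Worker thread: tasks cancelled while still queued are skipped; only
            // tasks already running when the bucket was deleted run to completion.
            void workerRun(Task& task) {
                if (task.state == TaskState::Dead) {
                    return;
                }
                task.execute();
            }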


            ben.huddleston Ben Huddleston added a comment -

            Thanks for the explanation. What I'm seeing in these logs is dozens or hundreds of compactions running after we have started the deinitialization of the KVBucket. I'll keep digging into it.

            ben.huddleston Ben Huddleston added a comment - - edited

            I need to confirm this with a unit test tomorrow, but I believe I have a theory as to what is happening here. We are waiting to unregister the taskable for bucket1, but it's bucket2 and the default bucket that are stuck running our compactions. I suspect that the AUX IO pool is full of pending compaction tasks that are due to run immediately and the run of the task to cancel it is at the back of that queue. The next question would be how we could end up with so many compaction tasks running, as we're supposed to limit the number that can run concurrently and any more should be snoozed indefinitely. That is the case when we create a new compaction task and we already have more than our allocated running tasks, but it looks like if we find a task already existing for this vBucket then we'll re-snooze it with the collection drop compaction execution delay of 5 seconds (rather than re-snooze it indefinitely if it was already snoozed indefinitely). I suspect that this also leads to saturation of the AUX IO threads with compaction tasks regardless of shutdown issues, but I need to confirm that too.

            Additionally, given that we've now moved the compaction tasks to run on the AUX IO pool, should we still be sizing the number of concurrent compaction tasks as a percentage of our number of writer threads? If writers > AUX IO threads then we could starve other AUX IO tasks.
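
            A simplified sketch of the suspected scheduling bug (hypothetical names and structure, my own reading of the behaviour described above): the concurrency limit is only honoured when a new compaction task is created, while rescheduling an existing snoozed task blindly applies the short collection-drop delay, letting it run past the limit.

            #include <limits>
            #include <map>

            constexpr double kSnoozeForever = std::numeric_limits<double>::max();
            constexpr double kCollectionDropDelaySecs = 5.0;

            struct PendingCompaction {
                double sleepSecs = 0.0; // when the task becomes runnable
            };

            struct CompactionScheduler {
                size_t concurrencyLimit;
                size_t running = 0;
                std::map<int, PendingCompaction> byVbucket;

                void scheduleCompaction(int vbid) {
                    auto it = byVbucket.find(vbid);
                    if (it == byVbucket.end()) {
                        // New task: the concurrency limit is respected here...
                        PendingCompaction task;
                        task.sleepSecs = (running < concurrencyLimit)
                                                 ? kCollectionDropDelaySecs
                                                 : kSnoozeForever;
                        byVbucket[vbid] = task;
                        return;
                    }
                    // ...but an existing task, even one snoozed indefinitely because
                    // of the limit, is re-snoozed with the short delay; once the
                    // delay expires it runs regardless of how many are running.
                    it->second.sleepSecs = kCollectionDropDelaySecs;
                }
            };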


            ben.huddleston Ben Huddleston added a comment -

            The tasks stats show a bunch of compaction tasks with negative sleep (they should have already run) and 8 running concurrently (the number of AUX IO threads), so I think that the above is likely.

            14522    5  R   default    -05:36.71   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f023893d0  Compact DB file 701
            14523    5  R   default    -05:38.73   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f0a31d750  Compact DB file 191
            14524    5  R   default    -05:38.96   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f0254f710  Compact DB file 830
            14525    5  R   default    -05:39.46   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f025529d0  Compact DB file 819
            14526    5  R   default    -05:41.45   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02552810  Compact DB file 162
            14527    5  R   default    -05:41.52   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02553450  Compact DB file 353
            14528    5  R   default    -05:41.59   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02553990  Compact DB file 832
            14529    5  R   default    -05:42.58   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02552d50  Compact DB file 709
            14530    5  R   default    -05:43.44   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02553df0  Compact DB file 588
            14531    5  R   default    -05:43.54   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02554a30  Compact DB file 351
            14532    5  R   default    -05:43.64   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02555830  Compact DB file 827
            14533    5  R   default    -05:45.02   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02553a70  Compact DB file 194
            14534    5  R   default    -05:46.11   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02553fb0  Compact DB file 361
            14535    5  R   default    -05:46.70   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02555f30  Compact DB file 150
            14536    5  R   default    -05:46.90   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02554b10  Compact DB file 144
            14537    5  R   default    -05:48.20   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02572c50  Compact DB file 692
            14538    5  R   default    -05:48.54   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f025546b0  Compact DB file 202
            14539    5  R   default    -05:49.00   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02552570  Compact DB file 199
            14540    5  R   default    -05:49.10   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02555e50  Compact DB file 835
            14541    5  R   default    -05:50.92   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02573270  Compact DB file 158
            14542    5  R   default    -05:51.08   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f30156350  Compact DB file 789
            14543    5  R   default    -05:51.31   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02553ed0  Compact DB file 553
            14544    5  R   default    -05:52.16   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02572010  Compact DB file 152
            14545    5  R   default    -05:52.57   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02572d30  Compact DB file 700
            14546    5  R   default    -05:52.83   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f037db330  Compact DB file 147
            14547    5  R   default    -05:53.03   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02342450  Compact DB file 359
            14548    5  R   default    -05:53.31   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02572e10  Compact DB file 350
            14549    5  R   default    -05:54.43   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02553d10  Compact DB file 354
            14551    5  R   default    -05:55.44   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02573510  Compact DB file 821
            14552    5  R   default    -05:55.75   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02554db0  Compact DB file 155
            14553    5  R   default    -05:57.03   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02573b30  Compact DB file 708
            14554    5  R   default    -05:58.14   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f0233f890  Compact DB file 198
            14555    5  R   default    -05:59.22   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f0254fe10  Compact DB file 561
            14556    5  R   default    -05:59.55   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f025729b0  Compact DB file 160
            14557    5  R   default    -05:59.84   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f089f0850  Compact DB file 163
            14558    5  R   default    -06:00.09   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f037dac30  Compact DB file 358
            14559    5  R   default    -06:00.50   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f031b7eb0  Compact DB file 551
            14560    5  R   default    -06:00.76   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02555910  Compact DB file 362
            14561    5  R   default    -06:00.94   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f037db6b0  Compact DB file 788
            14562    5  R   default    -06:01.76   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f037d6d30  Compact DB file 829
            14563    5  R   default    -06:03.37   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f025728d0  Compact DB file 601
            14564    5  R   default    -06:03.86   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02572b70  Compact DB file 394
            14565    5  R   default    -06:03.87   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02573970  Compact DB file 195
            14566    5  R   default    -06:04.62   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f025730b0  Compact DB file 192
            14567    5  R   default    -06:04.71   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f037daa70  Compact DB file 559
            14568    5  R   default    -06:05.62   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02552e30  Compact DB file 203
            14569    5  R   default    -06:06.31   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02574770  Compact DB file 550
            14570    5  R   default    -06:06.69   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02555c90  Compact DB file 617
            14571    5  R   default    -06:07.50   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02573190  Compact DB file 989
            14572    5  R   default    -06:08.03   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02573c10  Compact DB file 820
            14573    5  R   default    -06:08.43   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f30156eb0  Compact DB file 200
            14574    5  R   default    -06:08.48   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02574cb0  Compact DB file 599
            14575    5  R   default    -06:10.97   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f301582d0  Compact DB file 554
            14576    5  R   default    -06:11.07   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f025720f0  Compact DB file 625
            14577    5  R   default    -06:11.13   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02573cf0  Compact DB file 149
            14578    5  R   default    -06:11.30   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02573f90  Compact DB file 352
            14579    5  R   default    -06:11.62   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02573a50  Compact DB file 558
            14580    5  R   default    -06:12.37   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f038e7670  Compact DB file 828
            14581    5  R   default    -06:12.40   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02574850  Compact DB file 355
            14582    5  R   default    -06:13.49   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f08c346b0  Compact DB file 615
            14583    5  R   default    -06:13.58   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f08b4b730  Compact DB file 598
            14584    5  R   default    -06:14.22   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f08b4c450  Compact DB file 157
            14585    5  R   default    -06:16.19   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02550a50  Compact DB file 602
            14586    5  R   default    -06:16.72   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f0a176030  Compact DB file 988
            14587    5  R   default    -06:17.04   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f03ddf650  Compact DB file 360
            14588    5  R   default    -06:17.59   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02573eb0  Compact DB file 363
            14589    5  R   default    -06:18.56   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f03de0290  Compact DB file 633
            14590    5  R   default    -06:18.69   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02574f50  Compact DB file 614
            14591    5  R   default    -06:19.23   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02573350  Compact DB file 197
            14592    5  R   default    -06:20.75   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f301558d0  Compact DB file 623
            14593    5  R   default    -06:21.60   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f0361c2b0  Compact DB file 552
            14594    5  R   default    -06:21.69   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02574150  Compact DB file 618
            14595    5  R   default    -06:21.73   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f03de0450  Compact DB file 148
            14596    5  R   default    -06:22.14   0:00.00   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f0a198150  Compact DB file 641
            14597    5  R   default      0:00.00  *0:00.03   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02342ed0  Compact DB file 631
            14598    5  R   default      0:00.00  *0:00.91   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f038e1fd0  Compact DB file 349
            14599    5  R   default      0:00.00  *0:00.94   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02eae070  Compact DB file 156
            14600    5  R   default      0:00.00  *0:01.16   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f09d42db0  Compact DB file 622
            14601    5  R   default      0:00.00  *0:02.33   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f03de0610  Compact DB file 626
            14603    5  R   default      0:00.00  *0:03.46   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f30157770  Compact DB file 560
            14609    5  R   default      0:00.00  *0:04.97   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02574690  Compact DB file 634
            14612    5  R   default      0:00.00  *0:06.40   0:00.00        0  AuxIO   CompactVBucketTask                   0x00007f8f02575570  Compact DB file 630
            


            ben.huddleston Ben Huddleston added a comment -

            Confirmed with a unit test that rescheduling a Compaction task that was snoozing due to the concurrency limit reschedules without regard to the concurrency limit, meaning that we can run more compaction tasks than intended.
            drigby Dave Rigby added a comment -

            Good spot Ben Huddleston. Something similar came up a week or two ago when I was looking at the ns_server scheduled compactions exceeding the limit - see the Slack discussion here: https://couchbase.slack.com/archives/CFJDXSGUA/p1643205220328800

            I agree we should fix this - the compaction concurrency limit should not be exceeded at any point, to avoid impacting Flusher latency and, in turn, front-end write latency.

            However, even if we address the issue you found (not obeying the compaction concurrency limit), I think we could still run into problems with not being able to cancel tasks. For example, consider a scenario with 3 buckets, each of which has been told to compact (either by ns_server or due to a collection drop). Even if we assume that we change the current concurrency max:

                        const int maxConcurrentWriterTasks = std::min(
                                ExecutorPool::get()->getNumWriters(), vbMap.getNumShards());
             
                        // Calculate how many compaction tasks we will permit. We always
                        // allow at least one (see `if (handle->size() > 1)` above,
                        // then we limit to a fraction of the available WriterTasks,
                        // however imposing an upper bound so there is at least 1
                        // Writer task slot available for other tasks (i.e. Flusher).
                        const int maxConcurrentCompactTasks = std::min(
                                int(maxConcurrentWriterTasks * compactionMaxConcurrency),
                                maxConcurrentWriterTasks - 1);
            

            Say we include getNumAuxIO() in the min calculation for maxConcurrentWriterTasks so we end up only using at most half of the AuxIO threads for Compaction tasks. However, the limit is only per-Bucket - so with 3 buckets we are already "over-subscribed", and could end up with all AuxIO threads running compaction tasks, and unable to wake (as part of cancel) any other ones.
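            For illustration, a sketch of the kind of adjustment being described here, reusing the names from the snippet above plus the getNumAuxIO() accessor; the change that was actually merged may differ:

                const int maxConcurrentBGTasks =
                        std::min(std::min(ExecutorPool::get()->getNumWriters(),
                                          ExecutorPool::get()->getNumAuxIO()),
                                 vbMap.getNumShards());

                // As before, only use a fraction of those slots for compaction and
                // keep at least one free for other tasks (i.e. the Flusher).
                const int maxConcurrentCompactTasks = std::min(
                        int(maxConcurrentBGTasks * compactionMaxConcurrency),
                        maxConcurrentBGTasks - 1);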

            I think the fundamental problem here is once a CompactionTask starts running, there is no way to cancel it - that's what we ideally want to happen, not to waste a bunch of time waiting for unnecessary compaction to complete.


            ben.huddleston Ben Huddleston added a comment -

            > However, the limit is only per-Bucket - so with 3 buckets we are already "over-subscribed", and could end up with all AuxIO threads running compaction tasks, and unable to wake (as part of cancel) any other ones.

            Indeed, we could still saturate the AuxIO pool with enough (3) buckets and hit a similar problem. We could perhaps limit running compactors to be process-wide rather than bucket-wide, but that would limit performance in some situations.
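            As a rough illustration of the process-wide variant (a hypothetical type, not existing kv_engine code), every bucket's compaction task could be made to acquire a shared limiter before running:

                #include <atomic>

                class GlobalCompactionLimiter {
                public:
                    explicit GlobalCompactionLimiter(int limit) : limit(limit) {
                    }

                    // Returns true if the caller may run a compaction now; the caller
                    // must call release() once its compaction has finished (or been
                    // cancelled).
                    bool tryAcquire() {
                        int current = running.load();
                        while (current < limit) {
                            if (running.compare_exchange_weak(current, current + 1)) {
                                return true;
                            }
                        }
                        return false;
                    }

                    void release() {
                        running.fetch_sub(1);
                    }

                private:
                    const int limit;
                    std::atomic<int> running{0};
                };

            A CompactVBucketTask that fails tryAcquire() would snooze and retry, much as it does today for the per-bucket limit; the trade-off, as noted above, is that one bucket's compaction backlog can then delay another bucket's.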

            > I think the fundamental problem here is once a CompactionTask starts running, there is no way to cancel it - that's what we ideally want to happen, not to waste a bunch of time waiting for unnecessary compaction to complete.

            I think that that's one part of it, certainly that work is unnecessary if the bucket in question is going away, but we'd also have to consider what other buckets are doing. If we have a large AuxIO backlog then task cancels are going to take some time to be processed. We could perhaps use priorities which folly seems to support to make this slightly better (task cancels could be re-scheduled with highest priority) but we would still have to wait for one other task to finish to start cancelling.

            On the topic of cancelling the task, we could plumb something down to the compaction callbacks and have them abort the compaction, which feels like it might be good enough.
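            A minimal sketch of that idea (hypothetical names, and not necessarily how the eventual fix was implemented): the bucket sets a flag when it starts shutting down, and the per-item callback that the KVStore invokes during compaction checks it so the compaction can be aborted early:

                #include <atomic>

                enum class CompactionCallbackStatus { Continue, Abort };

                struct CompactionCancelContext {
                    // Owned by the bucket; set to true when the bucket begins shutdown.
                    std::atomic<bool> shuttingDown{false};
                };

                CompactionCallbackStatus onCompactionItem(const CompactionCancelContext& ctx) {
                    if (ctx.shuttingDown.load(std::memory_order_relaxed)) {
                        // The KVStore unwinds and reports the compaction as cancelled.
                        return CompactionCallbackStatus::Abort;
                    }
                    // ... normal expiry / purge / dropped-collection processing ...
                    return CompactionCallbackStatus::Continue;
                }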

            drigby Dave Rigby added a comment -

            > However, the limit is only per-Bucket - so with 3 buckets we are already "over-subscribed", and could end up with all AuxIO threads running compaction tasks, and unable to wake (as part of cancel) any other ones.
            > Indeed, we could still saturate the AuxIO pool with enough (3) buckets and hit a similar problem. We could perhaps limit running compactors to be process-wide rather than bucket-wide, but that would limit performance in some situations.

            We could; however then you're back to triggering similar things with Backfill / AccessScanner / all the other AuxIO tasks. Ultimately it's a contention problem - if some tasks have low-latency requirements, and others don't (and have potentially long runtimes) - then the tasks with low latency requirements can get stuck behind the long runtime ones.

            This is the same problem as we had with the Flusher/Compactor on Writer threads - but arguably more severe - as the Flusher tasks are running all the time (not just bucket shutdown) and hence latency times are always important.

            > I think the fundamental problem here is once a CompactionTask starts running, there is no way to cancel it - that's what we ideally want to happen, not to waste a bunch of time waiting for unnecessary compaction to complete.
            > I think that that's one part of it, certainly that work is unnecessary if the bucket in question is going away, but we'd also have to consider what other buckets are doing.

            That's a fair point - and related to the fact that we perform userspace co-operative scheduling, where we cannot pre-empt already running tasks to do something else in the meantime, then return to the original task.

            > If we have a large AuxIO backlog then task cancels are going to take some time to be processed. We could perhaps use priorities which folly seems to support to make this slightly better (task cancels could be re-scheduled with highest priority) but we would still have to wait for one other task to finish to start cancelling.

            Priorities could help, but as you highlight you still have the problem that if a task is already running, it doesn't matter what its priority is vs the one(s) waiting.

            Probably the "best" way to solve this particular cancellation problem would be to make it possible to somehow cancel tasks (for buckets being deleted) without having to wait to run() them on their thread pool. (I don't immediately know how to do that, but if we could, I think it would solve all of these issues.)

            IIRC, we did allow that for CB3ExecutorPool, but I changed it with FollyExecutorPool, for a combination of reasons:

            1. The readyQueue which holds all tasks ready to run is simply the tasks queued against FollyExecutorPool::readerPool/writerPool etc. - and once items are queued I don't believe they can be removed without the threadPool attempting to run them.
            2. To simplify task cancellation - have the CPU threadPool always cancel tasks, as even if you allow tasks to be cancelled before they are added to the CPU pools, you also have to handle the cases where (a) they are queued on the CPU pool but not yet run or (b) they are already running.

            Note part of the issue here is that the tasks are already on the CPU pool (i.e. we have decided that N Compaction tasks should run "now", but there are only A < N AuxIO threads for them to run on, so there are a non-zero number of tasks already on the CPU pool queue) which we cannot remove without finding a thread to run them.

            drigby Dave Rigby added a comment -

            On the topic of cross-Bucket limiting of how many tasks of a given type run at once (we limit Compaction tasks to only ever using some fraction of the threads given they are long-running and cannot really be paused) - we could look at using https://github.com/facebook/folly/blob/main/folly/executors/MeteredExecutor.h which I think would allow this kind of thing.
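            To make the idea concrete, a generic sketch of a "metered" front-queue (this is not folly::MeteredExecutor's actual interface, just the shape of the idea): tasks from any bucket are parked in a queue we own, and at most maxInFlight of them are handed to the underlying pool at a time:

                #include <deque>
                #include <functional>
                #include <mutex>
                #include <utility>

                class MeteredQueue {
                public:
                    // submitToPool forwards work to the real thread pool (assumed to
                    // run it asynchronously, e.g. a CPU thread pool's add()).
                    MeteredQueue(std::function<void(std::function<void()>)> submitToPool,
                                 int maxInFlight)
                        : submitToPool(std::move(submitToPool)), maxInFlight(maxInFlight) {
                    }

                    void add(std::function<void()> task) {
                        std::lock_guard<std::mutex> guard(lock);
                        pending.push_back(std::move(task));
                        maybeSubmitLocked();
                    }

                private:
                    // Called with `lock` held.
                    void maybeSubmitLocked() {
                        while (inFlight < maxInFlight && !pending.empty()) {
                            auto task = std::move(pending.front());
                            pending.pop_front();
                            ++inFlight;
                            submitToPool([this, task = std::move(task)] {
                                task();
                                std::lock_guard<std::mutex> guard(lock);
                                --inFlight;
                                maybeSubmitLocked();
                            });
                        }
                    }

                    std::function<void(std::function<void()>)> submitToPool;
                    std::mutex lock;
                    std::deque<std::function<void()>> pending;
                    int inFlight = 0;
                    const int maxInFlight;
                };

            Routing every bucket's CompactVBucketTasks through one such queue in front of the AuxIO pool would give the cross-bucket cap being discussed here, while short-running tasks could continue to go straight to the pool.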


            ben.huddleston Ben Huddleston added a comment -

            > Probably the "best" way to solve this particular cancellation problem would be to make it possible to somehow cancel tasks (for buckets being deleted) without having to wait to run() them on their thread pool.

            I had the same thought, and I agree, but it's definitely a harder problem to solve. I'll investigate it some more and see what we can do.

            drigby Dave Rigby added a comment -

            (One final thought - prior to Collections there wasn't an issue with multiple Buckets having compaction tasks outstanding - as ns_server always scheduled compaction serially across each bucket in turn. It is the introduction of collections and having kv-engine directly trigger compaction which exposes this issue.)

            drigby Dave Rigby added a comment -

            One final final thought - I think this is a problem exclusive to the CompactVBucketTask - every other task is either short-running (say under 1s), or can be pause-resumed such that a single run() call is short-running (e.g. Warmup tasks). As such, it raises the question of whether this is something we should try to solve generally (e.g. allowing tasks to be cancelled without running in their CPU pool), or specifically for Compaction?


            ben.huddleston Ben Huddleston added a comment -

            MB-48872 looks to be very similar to the issue that we've been discussing. It looks like rollback can cause these issues too, but that looks to have been worsened by priorities, which I believe aren't currently implemented in the folly executor pool. It's a similar issue though, so it might be worth trying to come up with a more general solution.

            owend Daniel Owen made changes -
            Component/s storage-engine [ 10175 ]

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-2241 contains kv_engine commit 280622f with commit message: MB-49512 : Obey concurrent compaction limit when rescheduling

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-2242 contains kv_engine commit f9c4a69 with commit message: MB-49512 : Use min of AuxIO and Writer threads to calc compaction limit

            ben.huddleston Ben Huddleston added a comment -

            > Probably the "best" way to solve this particular cancellation problem would be to make it possible to somehow cancel tasks (for buckets being deleted) without having to wait to run() them on their thread pool.

            This is looking to be pretty difficult to solve. There are two ways we could try to make this work with the current implementation: 1) somehow remove all the cancelled tasks from the CPUThreadPoolExecutor, or 2) allow the Taskable to go away with the Task still in the queue. I investigated 1) first by subclassing the CPUThreadPoolExecutor to get at the underlying queue, with the idea that during shutdown we'd pull everything out of the queue, check if it's for a bucket that's going away, then put back anything else. That fell apart though, because the queue contains Folly's final version of a std::function<void()> rather than some class object that we could subclass to get the proxy/task info out of the thing we stored in the queue. Looking then at idea 2), we end up with a bit of a nightmare when it comes to managing TaskProxy objects, in particular with respect to the futurePool scheduling, which holds raw pointers to TaskProxies.

            Limiting the threads which can compact cross-bucket would fix this particular issue so will investigate that now.

            drigby Dave Rigby added a comment -

            re: the Folly queues, one approach (similar to other folly Executor subclasses) might be to interpose our own queue in front of the underlying folly one; which could store a pair of say (Bucket, Folly::Function). Then you could check this "pre-queue" on shutdown and discard anything matching the bucket. Essentially we would have a separate "readyQueue" which is still in "Couchbase-land" (and which we can manipulate at will); the underlying Folly Executor would only dequeue off this when it was actually ready to run.

            This might end up being quite invasive, but it appeals to me more than the alternatives as we solve the problem of already queued tasks "having" to be run on a thread pool before they can be cancelled, for all Task types, without having to code anything for them independently.

            I agree that we'd struggle with getting rid of the Taskable before its Tasks - given that Tasks are allowed to hold references to the Taskable / sub-objects.

            In terms of limiting stuff cross-bucket - I'm not sure that's a good route to go down - while we could do that, I suspect we could end up with similar problems with Warmup / Rollback etc - and then you can end up in the situation where you are not actually using the resources you want to use (i.e. the entire thread pool).


            ben.huddleston Ben Huddleston added a comment -

            Got a work in progress for this that seems to work, by putting the tasks in a queue that we own and having all the work we schedule on the actual executor pool just pop a task from our queue and run it.
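            Very roughly, the shape of that approach looks like the following (hypothetical types; the real change lives in FollyExecutorPool and its TaskProxy machinery):

                #include <algorithm>
                #include <deque>
                #include <memory>
                #include <mutex>

                struct Task {
                    virtual ~Task() = default;
                    virtual void run() = 0;
                    virtual const void* taskable() const = 0; // identifies the owning bucket
                };

                class OwnedReadyQueue {
                public:
                    void schedule(std::shared_ptr<Task> task) {
                        {
                            std::lock_guard<std::mutex> guard(lock);
                            queue.push_back(std::move(task));
                        }
                        // cpuPool.add([this] { runOne(); }); // the pool only sees thunks
                    }

                    // Executed by the thunk on a CPU pool thread.
                    void runOne() {
                        std::shared_ptr<Task> task;
                        {
                            std::lock_guard<std::mutex> guard(lock);
                            if (queue.empty()) {
                                return; // task was removed, e.g. its bucket was deleted
                            }
                            task = std::move(queue.front());
                            queue.pop_front();
                        }
                        task->run();
                    }

                    // Called during bucket shutdown: drop queued tasks for that bucket
                    // without ever having to run them on a CPU pool thread.
                    void unregisterTaskable(const void* taskable) {
                        std::lock_guard<std::mutex> guard(lock);
                        queue.erase(std::remove_if(queue.begin(),
                                                   queue.end(),
                                                   [taskable](const std::shared_ptr<Task>& t) {
                                                       return t->taskable() == taskable;
                                                   }),
                                    queue.end());
                    }

                private:
                    std::mutex lock;
                    std::deque<std::shared_ptr<Task>> queue;
                };

            Because the folly CPU pool only ever sees the "pop one task and run it" thunks, unregistering a bucket can simply erase its queued tasks from the owned queue - no CPU pool thread has to pick them up just to cancel them. This matches the shape of the later commits ("FollyExecutorPool use custom queue for actual work" and "Remove tasks from custom cpuPool queue on unregister"), though the code above is only illustrative.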

            owend Daniel Owen made changes -
            Due Date 24/Feb/22

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-2280 contains kv_engine commit d9a9d3f with commit message: MB-49512 : Move logging and common executor code to GlobalTask::execute

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-2301 contains kv_engine commit cb836e2 with commit message: MB-49512 : Cancel compactions during shutdown

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-2301 contains kv_engine commit 568d2c0 with commit message: MB-49512 : Join cpuPool threads before reset

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-2302 contains kv_engine commit 47cdb1c with commit message: MB-49512 : Wait for flusher in test multiple vb compactions

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-2302 contains kv_engine commit bcbb76e with commit message: MB-49512 : Avoid deadlock in cancel_can_schedule test

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-2310 contains kv_engine commit 8fa87af with commit message: MB-49512 : Don't run STItemPagerTest for nexus

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-2310 contains kv_engine commit 5ae473d with commit message: MB-49512 : Split DurabilityEPBucketTest into smaller suites

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-2311 contains kv_engine commit e2309d3 with commit message: MB-49512 : Unregister taskable in PoolThreadsAreRegisteredWithPhosphor
            drigby Dave Rigby made changes -
            Link This issue relates to MB-50988 [ MB-50988 ]
            ben.huddleston Ben Huddleston made changes -
            Status Reopened [ 4 ] In Progress [ 3 ]

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-2320 contains kv_engine commit fdc7f2d with commit message: MB-49512 : engine_testapp use std::cout over printf for Running...

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-2320 contains kv_engine commit cdbd3b5 with commit message: MB-49512 : Always flush stdout on engine_testapp result

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-2320 contains kv_engine commit fe6f4ca with commit message: MB-50941 : Revert " MB-49512 : Obey concurrent compaction limit when rescheduling"

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-2328 contains kv_engine commit 8572934 with commit message: MB-49512 : FollyExecutorPool - allow scheduler thread re-entrancy [1/2]

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-2331 contains kv_engine commit 84d6594 with commit message: MB-49512 : Reset tasks on scheduler thread [2/2]
            ben.huddleston Ben Huddleston made changes -
            Link This issue is duplicated by MB-48872 [ MB-48872 ]

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-2339 contains kv_engine commit 2d34894 with commit message: MB-49512 : FollyExecutorPool use custom queue for actual work

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-2340 contains kv_engine commit 6fc9835 with commit message: MB-49512 : Remove tasks from custom cpuPool queue on unregister
            ben.huddleston Ben Huddleston made changes -
            Assignee Ben Huddleston [ ben.huddleston ] Balakumaran Gopal [ balakumaran.gopal ]
            Resolution Fixed [ 1 ]
            Status In Progress [ 3 ] Resolved [ 5 ]

            Balakumaran.Gopal Balakumaran Gopal added a comment -

            We have 3 jobs running to verify the fix:

            http://qe-jenkins.sc.couchbase.com/job/temp_rebalance_even/175/console
            http://qe-jenkins.sc.couchbase.com/job/temp_rebalance_magma/252/console
            http://qe-jenkins.sc.couchbase.com/job/temp_rebalance_magma_win/51/console

            temp_rebalance_even is the most important, because it was on this setup that we were able to reproduce the issue 100/100 times. So far I have seen it happen only 3/70 times in that particular job and 0 times in the other 2 jobs.

            Balakumaran.Gopal Balakumaran Gopal made changes -
            Status Resolved [ 5 ] Closed [ 6 ]
            Balakumaran.Gopal Balakumaran Gopal made changes -
            Assignee Balakumaran Gopal [ balakumaran.gopal ] Daniel Owen [ owend ]
            Balakumaran.Gopal Balakumaran Gopal made changes -
            Attachment UI_MB-49512.png [ 178114 ]
            Balakumaran.Gopal Balakumaran Gopal made changes -
            Resolution Fixed [ 1 ]
            Status Closed [ 6 ] Reopened [ 4 ]
            owend Daniel Owen made changes -
            Assignee Daniel Owen [ owend ] Ben Huddleston [ ben.huddleston ]

            ben.huddleston Ben Huddleston added a comment -

            Latest logs show that we are slow between "Shut down the bucket" and "Clean up allocated resources". 15 seconds is normal though, and I believe that the timeout should be 60 seconds per https://github.com/couchbase/ns_server/blob/master/src/ns_rebalancer.erl#L42.

            2022-02-22T04:42:25.712728-08:00 INFO 1312: Delete bucket [bucket-3]. Notifying engine
            2022-02-22T04:42:25.715244-08:00 INFO 1312: Delete bucket [bucket-3]. Engine ready for shutdown
            2022-02-22T04:42:25.715255-08:00 INFO 1312: Delete bucket [bucket-3]. Wait for 4 clients to disconnect
            2022-02-22T04:42:25.716086-08:00 INFO 1312: Delete bucket [bucket-3]. Shut down the bucket
            2022-02-22T04:42:41.266252-08:00 INFO 1312: Delete bucket [bucket-3]. Clean up allocated resources
            2022-02-22T04:42:41.266384-08:00 INFO 1312: Delete bucket [bucket-3] complete
            

            The majority of our 15 second shutdown was spent shutting down KVStores in magma. We also have some noisy logs to address; it's probably not worth logging cancelled compactions, for example:

            2022-02-22T04:42:25.756320-08:00 WARNING (bucket-3) MagmaKVStore::compactDBInternal CompactKVStore vb:509 CID:<ud>cid:0x6d:</ud> failed status:Cancelled: Shutting down
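            For example, a guard along these lines would quieten those messages (illustrative names only, not the actual MagmaKVStore code):

                #include <spdlog/spdlog.h>
                #include <string>

                enum class CompactResult { Success, Cancelled, Failed };

                void logCompactionResult(spdlog::logger& logger,
                                         const std::string& vbid,
                                         CompactResult result) {
                    if (result == CompactResult::Cancelled) {
                        return; // expected during bucket shutdown; not worth a WARNING
                    }
                    if (result == CompactResult::Failed) {
                        logger.warn("compactDBInternal CompactKVStore {} failed", vbid);
                    }
                }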
            

            Approximately a minute after this ns_server threw out the error:

            [rebalance:error,2022-02-22T04:43:55.847-08:00,ns_1@172.23.136.156:<0.4431.5>:ns_rebalancer:do_wait_buckets_shutdown:192]Failed to wait deletion of some buckets on some nodes: [{'ns_1@172.23.136.156',
                                                                     {'EXIT',
                                                                      {old_buckets_shutdown_wait_failed,
                                                                       ["bucket-3"]}}}]
            

            Probably worth somebody from the ns_server team taking a look at this; it looks like the shutdown completed in a timely (enough) manner in kv_engine.


            ben.huddleston Ben Huddleston added a comment -

            Dave Finlay/Steve Watanabe, could somebody from the ns_server team please take a look at the latest set of logs here? ns_server appears to time out during a rebalance waiting for old buckets to be cleaned up, but from the kv_engine logs they should have been deleted a minute prior.

            ben.huddleston Ben Huddleston made changes -
            Assignee Ben Huddleston [ ben.huddleston ] Dave Finlay [ dfinlay ]

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-2363 contains kv_engine commit 773e09e with commit message: MB-49512 : Remove commented out include
            steve.watanabe Steve Watanabe made changes -
            Assignee Dave Finlay [ dfinlay ] Steve Watanabe [ steve.watanabe ]

            steve.watanabe Steve Watanabe added a comment -

            The code is waiting 60 seconds for the ns_memcached-bucket-3 process to terminate... but it takes 67 seconds to terminate.

            [ns_server:debug,2022-02-22T04:42:55.845-08:00,ns_1@172.23.136.156:<0.4440.5>:ns_rebalancer:local_buckets_shutdown_loop:143]Waiting until the following old bucket instances are gone: ["bucket-3"]
             
            [rebalance:error,2022-02-22T04:43:55.847-08:00,ns_1@172.23.136.156:<0.4431.5>:ns_rebalancer:do_wait_buckets_shutdown:192]Failed to wait deletion of some buckets on some nodes: [{'ns_1@172.23.136.156',
                                                                     {'EXIT',
                                                                      {old_buckets_shutdown_wait_failed,
                                                                       ["bucket-3"]}}}]
            [ns_server:debug,2022-02-22T04:44:02.658-08:00,ns_1@172.23.136.156:ns_memcached-bucket-3<0.5119.0>:ns_memcached:terminate:832]Terminated.
            

            During that time we get a slow bucket stop which prints backtraces of the processes. The ns_memcached-bucket-3 process is deleting files associated with the bucket.

             [{backtrace,[<<"Program counter: 0x00000236eeec1788 (erpc:call/5 + 1616)">>,
                          <<>>,<<"0x0000023682cd59f0 []">>,<<>>,
                          <<"0x0000023682cd59f8 infinity">>,<<>>,
                          <<"0x0000023682cd5a00 []">>,<<>>,
                          <<"0x0000023682cd5a08 #Ref<0.433470769.1614282754.49484>">>,
                          <<>>,<<"0x0000023682cd5a10 -576460752303423481">>,<<>>,
                          <<"0x0000023682cd5a18 #Ref<0.433470769.1614282754.49483>">>,
                          <<>>,
                          <<"0x0000023682cd5a20 Return addr 0x00000236eb3667ec (rpc:call/5 + 244)">>,
                          <<"y(0)     infinity">>,<<"y(1)     []">>,
                          <<"y(2)     [{delete_databases_and_files,\"bucket-3\"}]">>,
                          <<"y(3)     handle_rpc">>,<<"y(4)     ns_couchdb_api">>,
                          <<"y(5)     'couchdb_ns_1@cb.local'">>,
                          <<"y(6)     Catch 0x00000236eb366863 (rpc:call/5 + 363)">>,<<>>,
                          <<"0x0000023682cd5a60 Return addr 0x00000236eeeb77e0 (ns_couchdb_api:rpc_couchdb_node/4 + 168)">>,
                          <<"y(0)     []">>,<<"y(1)     undefined">>,
                          <<"y(2)     {delete_databases_and_files,\"bucket-3\"}">>,<<>>,
                          <<"0x0000023682cd5a80 Return addr 0x00000236eeeb6b1c (ns_couchdb_api:delete_databases_and_files/1 + 132)">>,
                          <<"y(0)     []">>,<<"y(1)     \"bucket-3\"">>,<<>>,
                          <<"0x0000023682cd5a98 Return addr 0x00000236ef0548fc (ns_memcached:delete_bucket/4 + 724)">>,
                          <<"y(0)     []">>,<<"y(1)     []">>,<<"y(2)     []">>,
                          <<"y(3)     []">>,<<"y(4)     ok">>,<<"y(5)     []">>,<<>>,
                          <<"0x0000023682cd5ad0 Return addr 0x00000236ef0540b0 (ns_memcached:terminate/2 + 240)">>,
                          <<"y(0)     []">>,<<"y(1)     []">>,<<"y(2)     \"bucket-3\"">>,
                          <<"y(3)     Catch 0x00000236ef05416f (ns_memcached:terminate/2 + 431)">>,
                          <<>>,
                          <<"0x0000023682cd5af8 Return addr 0x00000236eb220a00 (gen_server:try_terminate/3 + 360)">>,
                          <<"y(0)     []">>,<<"y(1)     []">>,<<"y(2)     []">>,
                          <<"y(3)     Catch 0x00000236eb220a5a (gen_server:try_terminate/3 + 450)">>,
                          <<>>,
                          <<"0x0000023682cd5b20 Return addr 0x00000236eb223224 (gen_server:terminate/10 + 196)">>,
                          <<"y(0)     []">>,<<"y(1)     []">>,<<"y(2)     []">>,
                          <<"(3)     {state,0,0,0,{[],[]},{[],[]},{[],[]},warmed,{1645,532539,512000},\"bucket-3\",[collections,json],#Port<0.439>,[{<0">>,
                          <<"y(4)     ns_memcached">>,
                          <<"y(5)     {'EXIT',<0.5118.0>,shutdown}">>,
                          <<"y(6)     undefined">>,<<"y(7)     'ns_memcached-bucket-3'">>,
                          <<"y(8)     shutdown">>,
                          <<"(9)     [{gen_server,decode_msg,9,[{file,\"gen_server.erl\"},{line,481}]},{proc_lib,init_p_do_apply,3,[{file,\"proc_lib.erl">>,
                          <<"y(10)    shutdown">>,<<"y(11)    exit">>,<<>>,
                          <<"0x0000023682cd5b88 Return addr 0x00000236eb147478 (proc_lib:init_p_do_apply/3 + 208)">>,
                          <<"y(0)     []">>,<<"y(1)     []">>,
                          <<"y(2)     Catch 0x00000236eb147498 (proc_lib:init_p_do_apply/3 + 240)">>,
                          <<>>,
                          <<"0x0000023682cd5ba8 Return addr 0x00000236eb071bfc (unknown function)">>,
                          <<>>,
                          <<"0x0000023682cd5bb0 Return addr 0x00000236eb071c00 (<terminate process normally>)">>,
                          <<>>]},
              {messages,[<<"check_config">>,<<"check_config">>,<<"check_config">>,
                         <<"check_config">>,<<"check_config">>,<<"check_config">>,
                         <<"check_config">>,<<"check_config">>,<<"check_config">>,
                         <<"check_config">>]},
              {dictionary,[{'$initial_call',{ns_memcached,init,1}},
                           {'$ancestors',[<0.5118.0>,'single_bucket_kv_sup-bucket-3',
                                          ns_bucket_sup,ns_bucket_worker_sup,
                                          ns_server_sup,ns_server_nodes_sup,<0.270.0>,
                                          ns_server_cluster_sup,root_sup,<0.143.0>]}]},
              {registered_name,'ns_memcached-bucket-3'},
            

            This is running on a Windows VM

            ===Checker results for 'WIN-HS1RJUPNDMD / 'ns_1@172.23.136.156''===
            [info]  Collection Time     : 2022-02-22 04:56:46
            [warn]  OS Registered Org   : Unable to determine
            [info]  OS Name             : Microsoft Windows Server 2019 Datacenter
            [info]  OS Version          : Microsoft Windows Server 2019 Datacenter
            [info]  HW Platform         : HVM domU
            [BAD]   CB Version          : 7.1.0-2349-enterprise - Unsupported build
            

            The deletion of the files associated with bucket-3 is taking too long. This appears similar to MB-45877.

            steve.watanabe Steve Watanabe made changes -
            Assignee Steve Watanabe [ steve.watanabe ] Ben Huddleston [ ben.huddleston ]
            steve.watanabe Steve Watanabe made changes -
            Environment Enterprise Edition 7.1.0 build 1694 ‧ Enterprise Edition 7.1.0 build 1694 ‧
            ben.huddleston Ben Huddleston made changes -
            Assignee Ben Huddleston [ ben.huddleston ] Balakumaran Gopal [ balakumaran.gopal ]

            Balakumaran.Gopal Balakumaran Gopal added a comment -

            Ben Huddleston - Could you please explain what is needed from me for this bug? I have already attached the repros from both the Windows and Linux runs. Is there something else you are looking for?


            ben.huddleston Ben Huddleston added a comment -

            > I am pretty sure we can repro this on 1 bucket on linux runs too. Let me give it a shot.

            Sorry Balakumaran Gopal, I thought by this you meant that you were trying to run this on linux. Are the logs you linked the same test on linux machines?

            Balakumaran.Gopal Balakumaran Gopal added a comment - - edited

            Ya Ben Huddleston.

            1. Windows - Single bucket - We have cbcollect
            2. Linux - Multi bucket but on Magma boxes - We have cbcollect + Live cluster if you need.
            3. Linux - Single bucket but on regression boxes (slower than the magma boxes on which this bug was filed) - This run is in progress.

            Hopefully that clarifies things. I am running 3 now. Will update once I hit it.

            The following logs are for 2.
            Supportal
            http://supportal.couchbase.com/snapshot/e467c957c865e617994c62381f018423::0

            s3://cb-customers-secure/mb-49512_rerun_post_fix_repro_linux/2022-02-23/collectinfo-2022-02-22t170313-ns_1@172.23.100.34.zip
            s3://cb-customers-secure/mb-49512_rerun_post_fix_repro_linux/2022-02-23/collectinfo-2022-02-22t170313-ns_1@172.23.100.35.zip
            s3://cb-customers-secure/mb-49512_rerun_post_fix_repro_linux/2022-02-23/collectinfo-2022-02-22t170313-ns_1@172.23.100.36.zip
            s3://cb-customers-secure/mb-49512_rerun_post_fix_repro_linux/2022-02-23/collectinfo-2022-02-22t170313-ns_1@172.23.105.164.zip
            s3://cb-customers-secure/mb-49512_rerun_post_fix_repro_linux/2022-02-23/collectinfo-2022-02-22t170313-ns_1@172.23.105.206.zip
            s3://cb-customers-secure/mb-49512_rerun_post_fix_repro_linux/2022-02-23/collectinfo-2022-02-22t170313-ns_1@172.23.106.177.zip
            s3://cb-customers-secure/mb-49512_rerun_post_fix_repro_linux/2022-02-23T04/collectinfo-2022-02-22t170313-ns_1@172.23.100.34.zip


            Thanks Balakumaran Gopal. For this run we're waiting to disconnect compaction clients.

            2022-02-22T08:58:40.528824-08:00 INFO 2985: Delete bucket [bucket1]. Notifying engine
            2022-02-22T08:58:40.530570-08:00 INFO 2985: Delete bucket [bucket1]. Engine ready for shutdown
            2022-02-22T08:58:40.530591-08:00 INFO 2985: Delete bucket [bucket1]. Wait for 10 clients to disconnect
            2022-02-22T09:00:41.208924-08:00 INFO 2985: Delete bucket [bucket1]. Still waiting: 4 clients connected: {"22288":{"agent_name":"regular","bucket_index":2,"clustermap":{"epoch":0,"revno":0},"connection":"0x00007f1fb8db7680","connection_id
            ":"127.0.0.1:51268","cookies":[{"aiostat":"would block","connection":"[ {\"ip\":\"127.0.0.1\",\"port\":51268} - {\"ip\":\"127.0.0.1\",\"port\":11209} (System, <ud>@ns_server</ud>) ]","engine_storage":"0x00007f206c057000","ewouldblock":tru
            e,"packet":{"bodylen":24,"cas":0,"datatype":"raw","extlen":24,"key":"<ud></ud>","keylen":0,"magic":"ClientRequest","opaque":0,"opcode":"COMPACT_DB","vbucket":3},"refcount":0}],"datatype":"raw","dcp":false,"dcp_deleted_user_xattr":false,"d
            cp_no_value":false,"dcp_xattr_aware":false,"features":["tcp nodelay"],"internal":true,"max_reqs_per_event":20,"max_sched_time":"53244","min_sched_time":"689","nevents":19,"parent_port":11209,"peername":"{\"ip\":\"127.0.0.1\",\"port\":5126
            8}","priority":"Medium","protocol":"memcached","refcount":1,"sasl_enabled":true,"sendqueue":{"last":1082731650301306,"size":0,"term":false},"socket":22288,"sockname":"{\"ip\":\"127.0.0.1\",\"port\":11209}","ssl":false,"state":"running","t
            hread":4,"total_cpu_time":"4153140819","total_queued_send":12749537,"total_recv":547037,"total_send":12749537,"type":"normal","user":{"domain":"local","name":"@ns_server"},"yields":0},"23247":{"agent_name":"regular","bucket_index":2,"clus
            termap":{"epoch":0,"revno":0},"connection":"0x00007f207c1c9600","connection_id":"127.0.0.1:60652","cookies":[{"aiostat":"would block","connection":"[ {\"ip\":\"127.0.0.1\",\"port\":60652} - {\"ip\":\"127.0.0.1\",\"port\":11209} (System, <
            ud>@ns_server</ud>) ]","engine_storage":"0x00007f206c057000","ewouldblock":true,"packet":{"bodylen":24,"cas":0,"datatype":"raw","extlen":24,"key":"<ud></ud>","keylen":0,"magic":"ClientRequest","opaque":0,"opcode":"COMPACT_DB","vbucket":0}
            ,"refcount":0}],"datatype":"raw","dcp":false,"dcp_deleted_user_xattr":false,"dcp_no_value":false,"dcp_xattr_aware":false,"features":["tcp nodelay"],"internal":true,"max_reqs_per_event":20,"max_sched_time":"49329","min_sched_time":"991","n
            events":19,"parent_port":11209,"peername":"{\"ip\":\"127.0.0.1\",\"port\":60652}","priority":"Medium","protocol":"memcached","refcount":1,"sasl_enabled":true,"sendqueue":{"last":1082735234030157,"size":0,"term":false},"socket":23247,"sock
            name":"{\"ip\":\"127.0.0.1\",\"port\":11209}","ssl":false,"state":"running","thread":2,"total_cpu_time":"391642092","total_queued_send":1569430,"total_recv":38525,"total_send":1569430,"type":"normal","user":{"domain":"local","name":"@ns_s
            erver"},"yields":0},"27852":{"agent_name":"regular","bucket_index":2,"clustermap":{"epoch":0,"revno":0},"connection":"0x00007f207c0c0a00","connection_id":"127.0.0.1:33826","cookies":[{"aiostat":"would block","connection":"[ {\"ip\":\"127.
            0.0.1\",\"port\":33826} - {\"ip\":\"127.0.0.1\",\"port\":11209} (System, <ud>@ns_server</ud>) ]","engine_storage":"0x00007f206c057000","ewouldblock":true,"packet":{"bodylen":24,"cas":0,"datatype":"raw","extlen":24,"key":"<ud></ud>","keyle
            n":0,"magic":"ClientRequest","opaque":0,"opcode":"COMPACT_DB","vbucket":1},"refcount":0}],"datatype":"raw","dcp":false,"dcp_deleted_user_xattr":false,"dcp_no_value":false,"dcp_xattr_aware":false,"features":["tcp nodelay"],"internal":true,
            "max_reqs_per_event":20,"max_sched_time":"47040","min_sched_time":"1103","nevents":19,"parent_port":11209,"peername":"{\"ip\":\"127.0.0.1\",\"port\":33826}","priority":"Medium","protocol":"memcached","refcount":1,"sasl_enabled":true,"send
            queue":{"last":1082734699042133,"size":0,"term":false},"socket":27852,"sockname":"{\"ip\":\"127.0.0.1\",\"port\":11209}","ssl":false,"state":"running","thread":0,"total_cpu_time":"948942413","total_queued_send":3654740,"total_recv":70897,
            "total_send":3654740,"type":"normal","user":{"domain":"local","name":"@ns_server"},"yields":0},"70":{"agent_name":"regular","bucket_index":2,"clustermap":{"epoch":0,"revno":0},"connection":"0x00007f207c0c0280","connection_id":"127.0.0.1:6
            0480","cookies":[{"aiostat":"would block","connection":"[ {\"ip\":\"127.0.0.1\",\"port\":60480} - {\"ip\":\"127.0.0.1\",\"port\":11209} (System, <ud>@ns_server</ud>) ]","engine_storage":"0x00007f206c057000","ewouldblock":true,"packet":{"b
            odylen":24,"cas":0,"datatype":"raw","extlen":24,"key":"<ud></ud>","keylen":0,"magic":"ClientRequest","opaque":0,"opcode":"COMPACT_DB","vbucket":2},"refcount":0}],"datatype":"raw","dcp":false,"dcp_deleted_user_xattr":false,"dcp_no_value":f
            alse,"dcp_xattr_aware":false,"features":["tcp nodelay"],"internal":true,"max_reqs_per_event":20,"max_sched_time":"23917","min_sched_time":"667","nevents":19,"parent_port":11209,"peername":"{\"ip\":\"127.0.0.1\",\"port\":60480}","priority"
            :"Medium","protocol":"memcached","refcount":1,"sasl_enabled":true,"sendqueue":{"last":1082734976915686,"size":0,"term":false},"socket":70,"sockname":"{\"ip\":\"127.0.0.1\",\"port\":11209}","ssl":false,"state":"running","thread":2,"total_c
            pu_time":"3511137765","total_queued_send":11567945,"total_recv":471627,"total_send":11567945,"type":"normal","user":{"domain":"local","name":"@ns_server"},"yields":0}}
            

            I suspect that my change to abort compactions during shutdown is not notifying the client, but is instead trying to reschedule the task, which won't be run because we're shutting down.
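
            As a minimal, self-contained sketch of this suspicion (illustrative names only, not the actual kv_engine code): the important behavioural difference is whether the compaction task notifies the blocked cookie when it observes shutdown, or merely asks to be rescheduled; a rescheduled task never runs once the bucket is shutting down, so the client is never counted as disconnected.

            {code:cpp}
            // Toy model of the reschedule-vs-notify decision described above.
            // Names (Cookie, CompactionTask, notifyIoComplete) are illustrative.
            #include <functional>
            #include <iostream>

            struct Cookie {
                int id;
                bool notified = false;
            };

            struct CompactionTask {
                Cookie* cookie;

                // Returns true if the task asks to be rescheduled.
                bool run(bool shuttingDown,
                         const std::function<void(Cookie&)>& notifyIoComplete) {
                    if (shuttingDown) {
                        // Desired behaviour: tell the frontend the request is
                        // finished (aborted) so the connection can be released.
                        // The buggy behaviour would be to return true here
                        // ("run me later") without notifying, leaving the
                        // cookie blocked forever.
                        notifyIoComplete(*cookie);
                        return false;
                    }
                    // ... real compaction work would happen here ...
                    notifyIoComplete(*cookie);
                    return false;
                }
            };

            int main() {
                Cookie c{1};
                CompactionTask task{&c};
                bool reschedule = task.run(/*shuttingDown=*/true, [](Cookie& ck) {
                    ck.notified = true;
                    std::cout << "notified cookie " << ck.id << "\n";
                });
                std::cout << "reschedule=" << std::boolalpha << reschedule
                          << ", cookie notified=" << c.notified << "\n";
            }
            {code}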

            ben.huddleston Ben Huddleston added a comment

            Sure Ben Huddleston. I do have runs scheduled for point no 3. Will keep the bug updated once I hit it.

            Balakumaran.Gopal Balakumaran Gopal added a comment

            This is similar to my theory but subtly different. We cancel compaction tasks during engine destruction, but before we do that we wait for clients to disconnect, so at that point we are still waiting for the compactions to run. We need to move the compaction cancelling to the cancel_all_operations_in_ewb_state() phase of destruction.
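
            The ordering is the key point here. A rough model of the two orderings (simplified, with made-up names rather than the real memcached delete-bucket code) shows why cancelling only during engine destruction can never unblock the wait:

            {code:cpp}
            // Toy model (not the real code paths) of the ordering problem:
            // if compactions are only cancelled during engine destruction,
            // which runs *after* "wait for clients to disconnect", the wait
            // never finishes because the compaction cookies are still blocked
            // in would-block state. Cancelling in the EWB-cancel phase, which
            // runs before the wait, lets the wait complete.
            #include <iostream>
            #include <vector>

            struct Cookie { bool blockedInEwb = true; };

            static void cancelEwbOperations(std::vector<Cookie>& cookies) {
                for (auto& c : cookies) {
                    c.blockedInEwb = false; // notify: client can now disconnect
                }
            }

            static bool waitForClientsToDisconnect(const std::vector<Cookie>& cookies,
                                                   int maxIterations) {
                for (int i = 0; i < maxIterations; ++i) {
                    bool anyBlocked = false;
                    for (const auto& c : cookies) {
                        anyBlocked |= c.blockedInEwb;
                    }
                    if (!anyBlocked) {
                        return true; // all clients gone
                    }
                    // the real system sleeps and logs "Still waiting: N clients"
                }
                return false; // timed out, as seen in the memcached logs
            }

            static bool deleteBucket(bool cancelBeforeWait) {
                std::vector<Cookie> compactionCookies(4); // e.g. pending COMPACT_DB ops
                if (cancelBeforeWait) {
                    cancelEwbOperations(compactionCookies); // the proposed fix
                }
                bool ok = waitForClientsToDisconnect(compactionCookies, 100);
                // Engine destruction would cancel compactions here, but by then
                // it is too late to unblock the wait above.
                return ok;
            }

            int main() {
                std::cout << "cancel during engine destruction only: "
                          << (deleteBucket(false) ? "shutdown ok" : "stuck waiting") << "\n";
                std::cout << "cancel during EWB-cancel phase:        "
                          << (deleteBucket(true) ? "shutdown ok" : "stuck waiting") << "\n";
            }
            {code}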

            ben.huddleston Ben Huddleston added a comment
            owend Daniel Owen made changes -
            Sprint KV 2021-Dec, KV 2022-Feb [ 1906, 2002 ] KV 2021-Dec [ 1906 ]
            owend Daniel Owen made changes -
            Rank Ranked lower
            ben.huddleston Ben Huddleston made changes -
            Due Date 24/Feb/22 25/Feb/22
            ben.huddleston Ben Huddleston made changes -
            Assignee Balakumaran Gopal [ balakumaran.gopal ] Ben Huddleston [ ben.huddleston ]
            drigby Dave Rigby made changes -
            Sprint KV 2021-Dec [ 1906 ] KV 2021-Dec, KV 2022-Feb [ 1906, 2002 ]
            drigby Dave Rigby made changes -
            Rank Ranked higher

            Build couchbase-server-7.1.0-2378 contains kv_engine commit e6d4ba8 with commit message:
            MB-49512: Return enum status from KVStore::compactDB

            build-team Couchbase Build Team added a comment
            ben.huddleston Ben Huddleston made changes -
            Labels functional-test magma functional-test magma releasenote
            ben.huddleston Ben Huddleston added a comment - - edited

            Description for release notes:

            Summary: The per-KVStore stat "failure_compaction" (e.g. "rw_0:failure_compaction") has now been consolidated into a single bucket-wide stat named "ep_compaction_failed".
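
            Conceptually the change looks like the following sketch (illustrative code, not the actual EPStats/KVStore implementation): a compaction failure on any shard now increments one shared bucket-level counter rather than a per-shard rw_<n>:failure_compaction stat.

            {code:cpp}
            // Sketch of aggregating per-shard compaction failures into a single
            // bucket-wide counter; names and structure are illustrative only.
            #include <atomic>
            #include <cstddef>
            #include <iostream>

            struct BucketStats {
                std::atomic<std::size_t> compactionFailed{0}; // "ep_compaction_failed"
            };

            struct KVStoreShard {
                BucketStats& bucketStats;

                void onCompactionResult(bool success) {
                    if (!success) {
                        // Previously a per-shard rw_<n>:failure_compaction stat
                        // would have been bumped; now all shards share one counter.
                        bucketStats.compactionFailed.fetch_add(1, std::memory_order_relaxed);
                    }
                }
            };

            int main() {
                BucketStats stats;
                KVStoreShard shards[] = {{stats}, {stats}, {stats}, {stats}};
                shards[0].onCompactionResult(false);
                shards[3].onCompactionResult(false);
                std::cout << "ep_compaction_failed=" << stats.compactionFailed.load() << "\n";
            }
            {code}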


            Build couchbase-server-7.1.0-2379 contains kv_engine commit c30171b with commit message:
            MB-49512: Move compaction failure stat out of KVStore

            build-team Couchbase Build Team added a comment

            Build couchbase-server-7.1.0-2386 contains kv_engine commit b32189c with commit message:
            MB-49512: Drop abort compaction log to debug

            build-team Couchbase Build Team added a comment

            Build couchbase-server-7.1.0-2386 contains kv_engine commit 36beef8 with commit message:
            MB-49512: Don't treat aborted compactions as failures

            build-team Couchbase Build Team added a comment
            Balakumaran.Gopal Balakumaran Gopal made changes -
            Attachment consoleText_MB-49512_rerun.txt [ 178436 ]
            Balakumaran.Gopal Balakumaran Gopal made changes -
            ben.huddleston Ben Huddleston made changes -
            Due Date 25/Feb/22 01/Mar/22

            Build couchbase-server-7.1.0-2398 contains kv_engine commit 27275ed with commit message:
            MB-49512: Cancel compactions during EWB cancel

            build-team Couchbase Build Team added a comment
            ben.huddleston Ben Huddleston made changes -
            Assignee Ben Huddleston [ ben.huddleston ] Balakumaran Gopal [ balakumaran.gopal ]
            Resolution Fixed [ 1 ]
            Status Reopened [ 4 ] Resolved [ 5 ]
            Balakumaran.Gopal Balakumaran Gopal added a comment - - edited

            Status update

            Have 2 jobs running
            http://172.23.121.80/job/temp_rebalance_even/203/console
            http://172.23.121.80/job/temp_rebalance_magma_win/67/console

            However, a weekly run where hundreds of magma jobs are run would be a better way to validate this.

            Balakumaran.Gopal Balakumaran Gopal made changes -
            Resolution Fixed [ 1 ]
            Status Resolved [ 5 ] Reopened [ 4 ]
            Balakumaran.Gopal Balakumaran Gopal made changes -
            Assignee Balakumaran Gopal [ balakumaran.gopal ] Ben Huddleston [ ben.huddleston ]
            ben.huddleston Ben Huddleston added a comment - - edited

            Test was also run on a Windows VM.

            Host Name:                 WIN-HS1RJUPNDMD
            OS Name:                   Microsoft Windows Server 2019 Datacenter
            OS Version:                10.0.17763 N/A Build 17763
            OS Manufacturer:           Microsoft Corporation
            

            Interesting logs:

            [ns_server:debug,2022-03-01T23:17:22.115-08:00,ns_1@172.23.136.195:<0.32166.4>:ns_rebalancer:local_buckets_shutdown_loop:143]Waiting until the following old bucket instances are gone: ["bucket-1"]
            

            2022-03-01T23:17:25.250975-08:00 INFO 1352: Delete bucket [bucket-1]. Clean up allocated resources
            2022-03-01T23:17:25.251117-08:00 INFO 1352: Delete bucket [bucket-1] complete
            

            [ns_server:debug,2022-03-01T23:17:25.251-08:00,ns_1@172.23.136.195:ns_memcached-bucket-1<0.6295.0>:ns_memcached:delete_bucket:887]Proceeding into vbuckets dbs deletions
            [rebalance:error,2022-03-01T23:18:22.116-08:00,ns_1@172.23.136.195:<0.32155.4>:ns_rebalancer:do_wait_buckets_shutdown:192]Failed to wait deletion of some buckets on some nodes: [{'ns_1@172.23.136.195',
                                                                     {'EXIT',
                                                                      {old_buckets_shutdown_wait_failed,
                                                                       ["bucket-1"]}}}]
            

            Looks like it's slow to delete files again.

            ben.huddleston Ben Huddleston made changes -
            Assignee Ben Huddleston [ ben.huddleston ] Balakumaran Gopal [ balakumaran.gopal ]
            Resolution Fixed [ 1 ]
            Status Reopened [ 4 ] Resolved [ 5 ]
            Balakumaran.Gopal Balakumaran Gopal made changes -
            Resolution Fixed [ 1 ]
            Status Resolved [ 5 ] Reopened [ 4 ]
            Balakumaran.Gopal Balakumaran Gopal made changes -
            Assignee Balakumaran Gopal [ balakumaran.gopal ] Ben Huddleston [ ben.huddleston ]

            Thanks Balakumaran Gopal. This occurrence wasn't caused by slow disks; rather, it was similar to the earlier issue where waiting cookies were blocking shutdown.

            2022-03-02T12:06:49.052207-08:00 INFO 709: Delete bucket [bucket1]. Notifying engine
            2022-03-02T12:06:49.053474-08:00 INFO 709: Delete bucket [bucket1]. Engine ready for shutdown
            2022-03-02T12:06:49.053490-08:00 INFO 709: Delete bucket [bucket1]. Wait for 8 clients to disconnect
            2022-03-02T12:08:49.605986-08:00 INFO 709: Delete bucket [bucket1]. Still waiting: 1 clients connected: {"34097":{"agent_name":"regular","bucket_index":2,"clustermap":{"epoch":0,"revno":0},"connection":"0x00007f2c3819f900","connection_id":"127.0.0.1:56402","cookies":[{"aiostat":"would block","connection":"[ {\"ip\":\"127.0.0.1\",\"port\":56402} - {\"ip\":\"127.0.0.1\",\"port\":11209} (System, <ud>@ns_server</ud>) ]","engine_storage":"0x00007f2bf40f9000","ewouldblock":true,"packet":{"bodylen":24,"cas":0,"datatype":"raw","extlen":24,"key":"<ud></ud>","keylen":0,"magic":"ClientRequest","opaque":0,"opcode":"COMPACT_DB","vbucket":240},"refcount":0}],"datatype":"raw","dcp":false,"dcp_deleted_user_xattr":false,"dcp_no_value":false,"dcp_xattr_aware":false,"features":["tcp nodelay"],"internal":true,"max_reqs_per_event":20,"max_sched_time":"94796","min_sched_time":"725","nevents":19,"parent_port":11209,"peername":"{\"ip\":\"127.0.0.1\",\"port\":56402}","priority":"Medium","protocol":"memcached","refcount":1,"sasl_enabled":true,"sendqueue":{"last":357063307164346,"size":0,"term":false},"socket":34097,"sockname":"{\"ip\":\"127.0.0.1\",\"port\":11209}","ssl":false,"state":"running","thread":1,"total_cpu_time":"1326254748","total_queued_send":3193022,"total_recv":265059,"total_send":3193022,"type":"normal","user":{"domain":"local","name":"@ns_server"},"yields":0}}
            2022-03-02T12:09:49.871400-08:00 INFO 709: Delete bucket [bucket1]. Still waiting: 1 clients connected: {"34097":{"agent_name":"regular","bucket_index":2,"clustermap":{"epoch":0,"revno":0},"connection":"0x00007f2c3819f900","connection_id":"127.0.0.1:56402","cookies":[{"aiostat":"would block","connection":"[ {\"ip\":\"127.0.0.1\",\"port\":56402} - {\"ip\":\"127.0.0.1\",\"port\":11209} (System, <ud>@ns_server</ud>) ]","engine_storage":"0x00007f2bf40f9000","ewouldblock":true,"packet":{"bodylen":24,"cas":0,"datatype":"raw","extlen":24,"key":"<ud></ud>","keylen":0,"magic":"ClientRequest","opaque":0,"opcode":"COMPACT_DB","vbucket":240},"refcount":0}],"datatype":"raw","dcp":false,"dcp_deleted_user_xattr":false,"dcp_no_value":false,"dcp_xattr_aware":false,"features":["tcp nodelay"],"internal":true,"max_reqs_per_event":20,"max_sched_time":"94796","min_sched_time":"725","nevents":19,"parent_port":11209,"peername":"{\"ip\":\"127.0.0.1\",\"port\":56402}","priority":"Medium","protocol":"memcached","refcount":1,"sasl_enabled":true,"sendqueue":{"last":357063307164346,"size":0,"term":false},"socket":34097,"sockname":"{\"ip\":\"127.0.0.1\",\"port\":11209}","ssl":false,"state":"running","thread":1,"total_cpu_time":"1326254748","total_queued_send":3193022,"total_recv":265059,"total_send":3193022,"type":"normal","user":{"domain":"local","name":"@ns_server"},"yields":0}}
            2022-03-02T12:10:50.200004-08:00 INFO 709: Delete bucket [bucket1]. Still waiting: 1 clients connected: {"34097":{"agent_name":"regular","bucket_index":2,"clustermap":{"epoch":0,"revno":0},"connection":"0x00007f2c3819f900","connection_id":"127.0.0.1:56402","cookies":[{"aiostat":"would block","connection":"[ {\"ip\":\"127.0.0.1\",\"port\":56402} - {\"ip\":\"127.0.0.1\",\"port\":11209} (System, <ud>@ns_server</ud>) ]","engine_storage":"0x00007f2bf40f9000","ewouldblock":true,"packet":{"bodylen":24,"cas":0,"datatype":"raw","extlen":24,"key":"<ud></ud>","keylen":0,"magic":"ClientRequest","opaque":0,"opcode":"COMPACT_DB","vbucket":240},"refcount":0}],"datatype":"raw","dcp":false,"dcp_deleted_user_xattr":false,"dcp_no_value":false,"dcp_xattr_aware":false,"features":["tcp nodelay"],"internal":true,"max_reqs_per_event":20,"max_sched_time":"94796","min_sched_time":"725","nevents":19,"parent_port":11209,"peername":"{\"ip\":\"127.0.0.1\",\"port\":56402}","priority":"Medium","protocol":"memcached","refcount":1,"sasl_enabled":true,"sendqueue":{"last":357063307164346,"size":0,"term":false},"socket":34097,"sockname":"{\"ip\":\"127.0.0.1\",\"port\":11209}","ssl":false,"state":"running","thread":1,"total_cpu_time":"1326254748","total_queued_send":3193022,"total_recv":265059,"total_send":3193022,"type":"normal","user":{"domain":"local","name":"@ns_server"},"yields":0}}
            2022-03-02T12:11:50.501228-08:00 INFO 709: Delete bucket [bucket1]. Still waiting: 1 clients connected: {"34097":{"agent_name":"regular","bucket_index":2,"clustermap":{"epoch":0,"revno":0},"connection":"0x00007f2c3819f900","connection_id":"127.0.0.1:56402","cookies":[{"aiostat":"would block","connection":"[ {\"ip\":\"127.0.0.1\",\"port\":56402} - {\"ip\":\"127.0.0.1\",\"port\":11209} (System, <ud>@ns_server</ud>) ]","engine_storage":"0x00007f2bf40f9000","ewouldblock":true,"packet":{"bodylen":24,"cas":0,"datatype":"raw","extlen":24,"key":"<ud></ud>","keylen":0,"magic":"ClientRequest","opaque":0,"opcode":"COMPACT_DB","vbucket":240},"refcount":0}],"datatype":"raw","dcp":false,"dcp_deleted_user_xattr":false,"dcp_no_value":false,"dcp_xattr_aware":false,"features":["tcp nodelay"],"internal":true,"max_reqs_per_event":20,"max_sched_time":"94796","min_sched_time":"725","nevents":19,"parent_port":11209,"peername":"{\"ip\":\"127.0.0.1\",\"port\":56402}","priority":"Medium","protocol":"memcached","refcount":1,"sasl_enabled":true,"sendqueue":{"last":357063307164346,"size":0,"term":false},"socket":34097,"sockname":"{\"ip\":\"127.0.0.1\",\"port\":11209}","ssl":false,"state":"running","thread":1,"total_cpu_time":"1326254748","total_queued_send":3193022,"total_recv":265059,"total_send":3193022,"type":"normal","user":{"domain":"local","name":"@ns_server"},"yields":0}}
            2022-03-02T12:12:50.739003-08:00 INFO 709: Delete bucket [bucket1]. Still waiting: 1 clients connected: {"34097":{"agent_name":"regular","bucket_index":2,"clustermap":{"epoch":0,"revno":0},"connection":"0x00007f2c3819f900","connection_id":"127.0.0.1:56402","cookies":[{"aiostat":"would block","connection":"[ {\"ip\":\"127.0.0.1\",\"port\":56402} - {\"ip\":\"127.0.0.1\",\"port\":11209} (System, <ud>@ns_server</ud>) ]","engine_storage":"0x00007f2bf40f9000","ewouldblock":true,"packet":{"bodylen":24,"cas":0,"datatype":"raw","extlen":24,"key":"<ud></ud>","keylen":0,"magic":"ClientRequest","opaque":0,"opcode":"COMPACT_DB","vbucket":240},"refcount":0}],"datatype":"raw","dcp":false,"dcp_deleted_user_xattr":false,"dcp_no_value":false,"dcp_xattr_aware":false,"features":["tcp nodelay"],"internal":true,"max_reqs_per_event":20,"max_sched_time":"94796","min_sched_time":"725","nevents":19,"parent_port":11209,"peername":"{\"ip\":\"127.0.0.1\",\"port\":56402}","priority":"Medium","protocol":"memcached","refcount":1,"sasl_enabled":true,"sendqueue":{"last":357063307164346,"size":0,"term":false},"socket":34097,"sockname":"{\"ip\":\"127.0.0.1\",\"port\":11209}","ssl":false,"state":"running","thread":1,"total_cpu_time":"1326254748","total_queued_send":3193022,"total_recv":265059,"total_send":3193022,"type":"normal","user":{"domain":"local","name":"@ns_server"},"yields":0}}
            

            With my latest change, any compaction that is running at this point will see that the clients are pending disconnection and abort, but this failed to take into account that the compaction may not currently be running. The compactions could actually be stuck in the AuxIO queue behind compactions for the other buckets (I believe that is the case here, as the default bucket has a lot of compaction tasks).

            12927    5  R   default    -0:27.62   0:00.00   0:00.00       0  AuxIO   CompactVBucketTask                   0x00007f2b73c81c30  Compact DB file 19
            12931    5  R   default    -0:56.14   0:00.00   0:00.00       0  AuxIO   CompactVBucketTask                   0x00007f2b5c77e310  Compact DB file 0
            ... (repeated for almost every vBucket) ...
            13845    5  R   default   -05:31.05   0:00.00   0:00.00       0  AuxIO   CompactVBucketTask                   0x00007f2b5cecf0f0  Compact DB file 979
            13858    5  R   default   -05:38.30   0:00.00   0:00.00       0  AuxIO   CompactVBucketTask                   0x00007f2b5cecf630  Compact DB file 987
            

            So it seems that it's not enough to wait for the compaction to run here in order to abort it; because other buckets could be running compactions and saturating the AuxIO pool, we need to be more proactive in disconnecting these connections. Perhaps we should notify all of the cookies at this point instead and allow the ExecutorPool cleanup method to deal with the extra tasks.
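
            A sketch of that more proactive approach (illustrative names, not the actual EPEngine/ExecutorPool API): during the EWB-cancel phase, walk the cookies blocked on COMPACT_DB and notify them immediately, leaving any CompactVBucketTasks still queued on the AuxIO pool to be discarded later by the executor-pool cleanup.

            {code:cpp}
            // Sketch only: notify every blocked compaction cookie right away
            // instead of waiting for its task to reach the front of a saturated
            // AuxIO queue. Names are illustrative.
            #include <cstdint>
            #include <iostream>
            #include <unordered_map>

            enum class Status { Success, Aborted };

            struct Cookie { uint16_t vbid; };

            class PendingCompactions {
            public:
                void add(uint16_t vbid, Cookie* cookie) {
                    pending[vbid] = cookie;
                }

                // Called from the EWB-cancel phase of bucket deletion.
                template <typename NotifyFn>
                void cancelAll(NotifyFn notifyIoComplete) {
                    for (auto& entry : pending) {
                        notifyIoComplete(*entry.second, Status::Aborted);
                    }
                    pending.clear(); // tasks left in the AuxIO queue become no-ops
                }

            private:
                std::unordered_map<uint16_t, Cookie*> pending;
            };

            int main() {
                Cookie c0{0}, c1{1}, c240{240};
                PendingCompactions compactions;
                compactions.add(0, &c0);
                compactions.add(1, &c1);
                compactions.add(240, &c240);

                compactions.cancelAll([](Cookie& c, Status s) {
                    std::cout << "notify cookie for vb:" << c.vbid
                              << " status=" << (s == Status::Aborted ? "aborted" : "ok")
                              << "\n";
                });
            }
            {code}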

            ben.huddleston Ben Huddleston added a comment
            ben.huddleston Ben Huddleston made changes -
            Due Date 01/Mar/22 02/Mar/22
            ben.huddleston Ben Huddleston made changes -
            Due Date 02/Mar/22 04/Mar/22

            Build couchbase-server-7.1.0-2427 contains kv_engine commit 369c193 with commit message:
            MB-49512: Disconnect compaction cookies on EWB cancel immediately

            build-team Couchbase Build Team added a comment

            Build couchbase-server-7.2.0-1002 contains kv_engine commit 369c193 with commit message:
            MB-49512: Disconnect compaction cookies on EWB cancel immediately

            build-team Couchbase Build Team added a comment
            ben.huddleston Ben Huddleston made changes -
            Assignee Ben Huddleston [ ben.huddleston ] Balakumaran Gopal [ balakumaran.gopal ]
            Resolution Fixed [ 1 ]
            Status Reopened [ 4 ] Resolved [ 5 ]
            wayne Wayne Siu made changes -
            Link This issue causes MB-51132 [ MB-51132 ]
            Balakumaran.Gopal Balakumaran Gopal made changes -
            Status Resolved [ 5 ] Closed [ 6 ]
            Balakumaran.Gopal Balakumaran Gopal made changes -
            Assignee Balakumaran Gopal [ balakumaran.gopal ] Ben Huddleston [ ben.huddleston ]
            Balakumaran.Gopal Balakumaran Gopal made changes -
            Resolution Fixed [ 1 ]
            Status Closed [ 6 ] Reopened [ 4 ]

            The first set of logs showed that we had spent 5 minutes or so waiting for the KVBucket deinitialization by the time the logs were collected. It's likely that we were still in the unregisterTaskable call in the ExecutorPool.

            The second set of logs is probably more interesting: the same situation, but after 6 minutes we "unblocked" and made progress.

            2022-03-07T11:45:55.639841+00:00 INFO (bucket2) KVBucket::deinitialize forceShutdown:true
            2022-03-07T11:45:55.639924+00:00 INFO (bucket2) Compaction of vb:517 done (0). purged tombstones:0, prepares:0, prepareBytes:0 collection_items_erased:alive:14,deleted:16, collections_erased:115, size/items/tombstones/purge_seqno pre{360448, 14, 0, 0}, post{266240, 0, 0, 0}
            2022-03-07T11:45:56.293427+00:00 INFO (bucket2) Compaction of vb:806 done (1). purged tombstones:0, prepares:0, prepareBytes:0 collection_items_erased:alive:16,deleted:4, collections_erased:44, size/items/tombstones/purge_seqno pre{184320, 22, 0, 0}, post{0, 0, 0, 0}
            2022-03-07T11:45:56.380545+00:00 INFO (bucket2) Compaction of vb:164 done (1). purged tombstones:0, prepares:0, prepareBytes:0 collection_items_erased:alive:12,deleted:0, collections_erased:33, size/items/tombstones/purge_seqno pre{454656, 60, 0, 0}, post{0, 0, 0, 0}
            2022-03-07T11:45:56.392765+00:00 INFO (bucket2) Compaction of vb:664 done (1). purged tombstones:0, prepares:0, prepareBytes:0 collection_items_erased:alive:56,deleted:0, collections_erased:39, size/items/tombstones/purge_seqno pre{643072, 105, 0, 0}, post{0, 0, 0, 0}
            2022-03-07T11:45:56.453551+00:00 INFO (bucket2) Compaction of vb:525 done (1). purged tombstones:0, prepares:0, prepareBytes:0 collection_items_erased:alive:0,deleted:0, collections_erased:25, size/items/tombstones/purge_seqno pre{368640, 45, 0, 0}, post{0, 0, 0, 0}
            2022-03-07T11:45:56.483648+00:00 INFO (bucket2) Compaction of vb:475 done (1). purged tombstones:0, prepares:0, prepareBytes:0 collection_items_erased:alive:0,deleted:0, collections_erased:28, size/items/tombstones/purge_seqno pre{307200, 14, 0, 0}, post{0, 0, 0, 0}
            2022-03-07T11:45:56.511478+00:00 INFO (bucket2) Compaction of vb:768 done (1). purged tombstones:0, prepares:0, prepareBytes:0 collection_items_erased:alive:0,deleted:0, collections_erased:23, size/items/tombstones/purge_seqno pre{798720, 120, 0, 0}, post{0, 0, 0, 0}
            2022-03-07T11:45:56.561981+00:00 INFO (bucket2) Compaction of vb:846 done (1). purged tombstones:0, prepares:0, prepareBytes:0 collection_items_erased:alive:0,deleted:0, collections_erased:17, size/items/tombstones/purge_seqno pre{630784, 120, 0, 0}, post{0, 0, 0, 0}
            2022-03-07T11:51:58.370756+00:00 INFO (bucket2) MagmaKVStore: 0 deinitializing
            

            I suspect that a task is executing while we are calling CancellableCPUExecutor::removeTasksForTaskable(), and when it finishes it puts a reset-task-pointer task into the cpuPool queue, which we then have to wait for before we can destroy the bucket. Looks like an issue with the fix for MB-51132.
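
            A simplified model of that suspected race (not the real CancellableCPUExecutor code): removeTasksForTaskable() can only drop queued work, so a task that is already executing can still enqueue one more cleanup item after the queue has been cleared, and the taskable shutdown then has to wait for the pool to consume it.

            {code:cpp}
            // Toy model only; names and structure are illustrative.
            #include <deque>
            #include <functional>
            #include <iostream>
            #include <string>
            #include <utility>

            struct CpuPool {
                std::deque<std::pair<std::string, std::function<void()>>> queue;

                // Drops queued-but-not-yet-running work; cannot touch a task
                // that is already executing on a worker thread.
                void removeTasksForTaskable(const std::string& taskable) {
                    std::erase_if(queue, [&](const auto& item) {
                        return item.first == taskable;
                    });
                }

                void drain() {
                    while (!queue.empty()) {
                        auto item = std::move(queue.front());
                        queue.pop_front();
                        item.second();
                    }
                }
            };

            int main() {
                CpuPool pool;
                pool.queue.push_back(
                        {"bucket2", [] { std::cout << "queued compaction (never runs)\n"; }});

                // unregisterTaskable(bucket2) starts: clear bucket2's queued work.
                pool.removeTasksForTaskable("bucket2");

                // Meanwhile a bucket2 task that was *already running* finishes and
                // schedules its cleanup ("reset task ptr") back onto the pool...
                pool.queue.push_back(
                        {"bucket2", [] { std::cout << "late reset-task-ptr item\n"; }});

                // ...so shutdown still has to wait for the pool to get to it.
                pool.drain();
            }
            {code}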

            ben.huddleston Ben Huddleston added a comment
            ben.huddleston Ben Huddleston made changes -
            Due Date 04/Mar/22 09/Mar/22
            owend Daniel Owen made changes -
            Sprint KV 2021-Dec, KV 2022-Feb [ 1906, 2002 ] KV 2021-Dec, KV 2022-Feb, KV Post-Neo 2 [ 1906, 2002, 2050 ]

            Build couchbase-server-7.2.0-1010 contains kv_engine commit b0bde22 with commit message:
            MB-49512: Make Taskable::isShutdown() const

            build-team Couchbase Build Team added a comment

            Build couchbase-server-7.2.0-1010 contains kv_engine commit e0128e5 with commit message:
            MB-49512: Remove redundant if statement

            build-team Couchbase Build Team added a comment
            owend Daniel Owen made changes -
            Due Date 09/Mar/22 10/Mar/22

            Build couchbase-server-7.1.0-2455 contains kv_engine commit 0517833 with commit message:
            MB-49512: Make Taskable::isShutdown() const

            build-team Couchbase Build Team added a comment

            Build couchbase-server-7.1.0-2456 contains kv_engine commit f74b76b with commit message:
            MB-49512: Reset task ptr on scheduler thread during taskable shutdown

            build-team Couchbase Build Team added a comment
            ben.huddleston Ben Huddleston made changes -
            Assignee Ben Huddleston [ ben.huddleston ] Balakumaran Gopal [ balakumaran.gopal ]
            Resolution Fixed [ 1 ]
            Status Reopened [ 4 ] Resolved [ 5 ]

            2 runs are currently in progress. Even then, a weekly run over the weekend, where a couple of thousand jobs are run, would be a better way to validate this.

            http://172.23.121.80/job/temp_rebalance_magma_1/35/console
            http://172.23.121.80/job/temp_rebalance_magma/292/console

            Balakumaran.Gopal Balakumaran Gopal added a comment

            Validated this on 7.1.0-2475.
            Logs : http://greenboard.sc.couchbase.com/#!/server/7.1.0/2475?platforms=CENTOS&bucket_storage=COUCH&features=MAGMA

            Did not find a single occurrence of this issue. Marking this closed.

            Balakumaran.Gopal Balakumaran Gopal added a comment
            Balakumaran.Gopal Balakumaran Gopal made changes -
            Status Resolved [ 5 ] Closed [ 6 ]
            pavithra.mahamani Pavithra Mahamani (Inactive) made changes -
            Attachment 172.23.100.38.zip [ 179628 ]

            I am still seeing this issue in 7.1.0-2475 in the XDCR functional tests.

            rebalance_report_20220314T223148.json:
            {"stageInfo":{"data":{"startTime":"2022-03-14T15:30:48.099-07:00","completedTime":false,"timeTaken":60280}},"rebalanceId":"966b008ec2d7d181ce4f40571d563618","nodesInfo":{"active_nodes":["ns_1@172.23.100.38","ns_1@172.23.100.39"],"keep_nodes":["ns_1@172.23.100.38"],"eject_nodes":["ns_1@172.23.100.39"],"delta_nodes":[],"failed_nodes":[]},"masterNode":"ns_1@172.23.100.38","startTime":"2022-03-14T15:30:48.098-07:00","completedTime":"2022-03-14T15:31:48.379-07:00","timeTaken":60281,"completionMessage":"Rebalance exited with reason {buckets_shutdown_wait_failed,\n                              [{'ns_1@172.23.100.38',\n                                {'EXIT',\n                                 {old_buckets_shutdown_wait_failed,\n                                  [\"default\"]}}}]}."}
            

            Logs: 172.23.100.38.zip , 172.23.100.39.zip

            pavithra.mahamani Pavithra Mahamani (Inactive) added a comment
            pavithra.mahamani Pavithra Mahamani (Inactive) made changes -
            Resolution Fixed [ 1 ]
            Status Closed [ 6 ] Reopened [ 4 ]
            pavithra.mahamani Pavithra Mahamani (Inactive) made changes -
            Assignee Balakumaran Gopal [ balakumaran.gopal ] Ben Huddleston [ ben.huddleston ]
            ritam.sharma Ritam Sharma made changes -
            Labels functional-test magma releasenote affects-neo-testing functional-test magma releasenote
            pavithra.mahamani Pavithra Mahamani (Inactive) made changes -
            Attachment 172.23.100.39.zip [ 179629 ]

            Looks to be a different root cause for this issue, Pavithra Mahamani.

            2022-03-14T19:38:26.828070-07:00 INFO 62: Delete bucket [default]. Notifying engine
            2022-03-14T19:38:26.829941-07:00 INFO 62: Delete bucket [default]. Engine ready for shutdown
            2022-03-14T19:38:26.829959-07:00 INFO 62: Delete bucket [default]. Wait for 6 clients to disconnect
            2022-03-14T19:38:26.830542-07:00 INFO 62: Delete bucket [default]. Shut down the bucket
            2022-03-14T19:38:26.831182-07:00 INFO (default) KVBucket::deinitialize forceShutdown:true
            2022-03-14T19:39:33.410204-07:00 INFO (default) MagmaKVStore: 0 deinitializing
            2022-03-14T19:39:33.410240-07:00 INFO [(default) magma_0]Closing magma db (thread pool refcount 8)
            ...
            2022-03-14T19:41:44.954724-07:00 INFO [(default) magma_0]Completed closing magma db
            2022-03-14T19:41:44.955078-07:00 INFO (default) MagmaKVStore: 0 deinitialized
            2022-03-14T19:41:44.955092-07:00 INFO (default) MagmaKVStore: 1 deinitializing
            2022-03-14T19:41:44.955097-07:00 INFO [(default) magma_1]Closing magma db (thread pool refcount 7)
            ...
            2022-03-14T19:43:44.957415-07:00 INFO [(default) magma_1]Completed closing magma db
            2022-03-14T19:43:44.957819-07:00 INFO (default) MagmaKVStore: 1 deinitialized
            2022-03-14T19:43:44.957831-07:00 INFO (default) MagmaKVStore: 2 deinitializing
            2022-03-14T19:43:44.957835-07:00 INFO [(default) magma_2]Closing magma db (thread pool refcount 6)
            ...
            2022-03-14T19:45:54.011065-07:00 INFO [(default) magma_2]Completed closing magma db
            2022-03-14T19:45:54.011488-07:00 INFO (default) MagmaKVStore: 2 deinitialized
            2022-03-14T19:45:54.011502-07:00 INFO (default) MagmaKVStore: 3 deinitializing
            2022-03-14T19:45:54.011506-07:00 INFO [(default) magma_3]Closing magma db (thread pool refcount 5)
            ...
            2022-03-14T19:46:07.993157-07:00 INFO [(default) magma_3/kvstore-147/rev-000000001]KVStore::Shutdown
            

            The cleanup in kv_engine looks to be prompt, but the shutdown of the magma shards is rather slow: in the logs above magma_0, magma_1 and magma_2 each took roughly two minutes to close, which far exceeds the ~60 seconds the rebalance waited before failing. Sarath Lakshman, it is probably worth somebody on the magma team taking a look at the latest set of logs.

            ben.huddleston Ben Huddleston added a comment
            ben.huddleston Ben Huddleston made changes -
            Assignee Ben Huddleston [ ben.huddleston ] Sarath Lakshman [ sarath ]
            ben.huddleston Ben Huddleston made changes -
            Component/s couchbase-bucket [ 10173 ]
            Component/s storage-engine [ 10175 ]
            sarath Sarath Lakshman made changes -
            Assignee Sarath Lakshman [ sarath ] Apaar Gupta [ apaar.gupta ]

            Closing, will file a new ticket based on Ben Huddleston's comment that the root cause is not the same for the latest occurrence.

            pavithra.mahamani Pavithra Mahamani (Inactive) added a comment
            pavithra.mahamani Pavithra Mahamani (Inactive) made changes -
            Resolution Fixed [ 1 ]
            Status Reopened [ 4 ] Closed [ 6 ]
            pavithra.mahamani Pavithra Mahamani (Inactive) made changes -
            Link This issue relates to MB-51477 [ MB-51477 ]

            Build couchbase-server-7.2.0-1021 contains kv_engine commit f74b76b with commit message:
            MB-49512: Reset task ptr on scheduler thread during taskable shutdown

            build-team Couchbase Build Team added a comment

            Build couchbase-server-7.2.0-1021 contains kv_engine commit 0517833 with commit message:
            MB-49512: Make Taskable::isShutdown() const

            build-team Couchbase Build Team added a comment

            People

              Assignee: apaar.gupta Apaar Gupta
              Reporter: Balakumaran.Gopal Balakumaran Gopal