We implemented a task cancellation mechanism in 7.1 to improve shutdown times for the magma DB. A taskGroupID was introduced so that a group of tasks can be canceled together. The taskGroupID is designed to be unique for each database. Unfortunately, a bug in the initialization of the taskGroupID caused each database to be assigned a random ID instead. Most of the time this works fine and the databases end up with unique IDs, but whenever two or more databases are assigned the same ID, we run into problems.
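For illustration only, one common way to guarantee a unique ID per database is to draw it from a process-wide atomic counter at construction time. This is a minimal sketch, not magma's actual code: `TaskGroupID`, `nextTaskGroupID`, and `Database` are hypothetical names chosen for this example.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical type alias for this sketch; not magma's real definition.
using TaskGroupID = std::uint64_t;

// Returns a process-wide unique ID from a monotonically increasing counter.
// Every call yields a different value, so two databases can never share one
// (short of the 64-bit counter wrapping around).
inline TaskGroupID nextTaskGroupID() {
    static std::atomic<TaskGroupID> counter{1};
    return counter.fetch_add(1, std::memory_order_relaxed);
}

// Hypothetical stand-in for a per-database object.
struct Database {
    // Explicitly initialized when the object is constructed, so the ID is
    // never left to chance.
    const TaskGroupID taskGroupID{nextTaskGroupID()};
};
```

With this scheme, any two `Database` objects created in the same process hold different `taskGroupID` values, regardless of the order or thread in which they are constructed.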
This problem occurs when there are multiple buckets and one bucket is warming up while another is shutting down. If a database in each of the two buckets has the same taskGroupID, the task cancellation request from the bucket that is shutting down can cancel tasks belonging to the bucket that is warming up. As a result, some vbuckets were initialized without their SSTables actually being opened (the SSTable open runs as a task, and that task was canceled). When we later try to read from such an SSTable, the read crashes.
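The failure mode can be sketched with a toy task queue shared by two buckets. This is a simplified model under stated assumptions, not magma's implementation: `TaskQueue`, `cancelGroup`, and `warmupSurvivesShutdown` are hypothetical names, and the "SSTable open" is reduced to setting a flag.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <iterator>
#include <utility>
#include <vector>

using TaskGroupID = std::uint64_t;  // hypothetical alias for this sketch

struct Task {
    TaskGroupID group;
    std::function<void()> work;
};

// Toy queue shared by all buckets; tasks are identified only by group ID.
class TaskQueue {
public:
    void submit(TaskGroupID group, std::function<void()> work) {
        pending_.push_back(Task{group, std::move(work)});
    }

    // Drops every pending task in the given group and returns the count.
    std::size_t cancelGroup(TaskGroupID group) {
        auto it = std::remove_if(
            pending_.begin(), pending_.end(),
            [group](const Task& t) { return t.group == group; });
        auto dropped = static_cast<std::size_t>(
            std::distance(it, pending_.end()));
        pending_.erase(it, pending_.end());
        return dropped;
    }

    // Runs whatever is still pending, then empties the queue.
    void runAll() {
        for (auto& t : pending_) t.work();
        pending_.clear();
    }

private:
    std::vector<Task> pending_;
};

// Returns true if the warming-up bucket's "SSTable open" task still ran
// after the shutting-down bucket canceled its own task group.
inline bool warmupSurvivesShutdown(TaskGroupID warmingId,
                                   TaskGroupID shuttingId) {
    TaskQueue queue;
    bool tableOpened = false;
    queue.submit(warmingId, [&tableOpened] { tableOpened = true; });
    queue.cancelGroup(shuttingId);  // issued by the shutting-down bucket
    queue.runAll();
    return tableOpened;
}
```

When the two IDs differ, the open task runs and `warmupSurvivesShutdown` returns `true`. When they collide, the cancellation silently removes the other bucket's open task and the function returns `false`, which corresponds to a vbucket being initialized without its SSTable opened.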
|For Gerrit Dashboard: MB-53964|
|180809,3||MB-53964 [BP] lsm: Add unit test for table loader work cancellation||neo||magma||Status: NEW||0||-1|
|180802,1||MB-53964 [BP] lsm: Add cancel task method for table loader||master||magma||Status: ABANDONED||-1||-1|
|180803,1||MB-53964 [BP] lsm: Fix incorrect assignment of taskGroupID||master||magma||Status: ABANDONED||-1||-1|
|180804,1||MB-53964 [BP] lsm: Add unit test for table loader work cancellation||master||magma||Status: ABANDONED||-1||-1|
|180807,2||MB-53964 [BP] lsm: Add cancel task method for table loader||neo||magma||Status: MERGED||+2||+1|
|180808,2||MB-53964 [BP] lsm: Fix incorrect assignment of taskGroupID||neo||magma||Status: MERGED||+2||+1|