Couchbase Server / MB-53964

Backport for MB-53662 / Process crashes during delta recovery

    Description

      We implemented a task cancellation mechanism in 7.1 to improve shutdown times for Magma databases. A taskGroupID was introduced so that a group of tasks can be cancelled together, and it is designed to be unique for each database. Unfortunately, a bug in the initialization of the taskGroupID caused each database to be assigned a random ID. Most of the time this works and most databases end up with a unique ID, but whenever the same ID is assigned to more than one database, we run into problems.

      The problem occurs when there are multiple buckets and one bucket is warming up while another is shutting down. If a database in each of the two buckets has the same taskGroupID, the task cancellation request from the bucket that is shutting down can cancel tasks belonging to the bucket that is warming up. This resulted in some vBuckets being initialized without their SSTables actually being opened (the SSTable open ran as a task, and that task was cancelled). When we later try to read from such an SSTable, the process crashes.
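
      The collision described above can be illustrated with a small simulation. This is a sketch only, not Magma's code: `Task`, `TaskScheduler`, `cancel_group` and `simulate` are hypothetical names, and the monotonic counter is just one assumed way of handing out unique IDs, not necessarily what the actual fix does.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Task:
    name: str
    group_id: int
    cancelled: bool = False


class TaskScheduler:
    """Hypothetical scheduler shared by all buckets; cancels tasks by group ID."""

    def __init__(self):
        self.tasks = []

    def submit(self, name, group_id):
        task = Task(name, group_id)
        self.tasks.append(task)
        return task

    def cancel_group(self, group_id):
        # Every task carrying this ID is cancelled, regardless of which
        # database submitted it.
        for task in self.tasks:
            if task.group_id == group_id:
                task.cancelled = True


def simulate(id_shutting_down, id_warming_up):
    """Return True if the warming-up bucket's SSTable open task gets cancelled."""
    scheduler = TaskScheduler()
    open_task = scheduler.submit("open SSTable (warming-up bucket)", id_warming_up)
    # The bucket that is shutting down cancels its own task group.
    scheduler.cancel_group(id_shutting_down)
    return open_task.cancelled


# Two databases that happen to share a taskGroupID: the open task is lost.
collision = simulate(7, 7)

# Assumed alternative: IDs drawn from a monotonic counter never collide.
next_id = count(1)
unique = simulate(next(next_id), next(next_id))
```

      With colliding IDs the warming-up bucket's open task is cancelled by the other bucket's shutdown; with distinct IDs it is left alone.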

          Activity

            sarath Sarath Lakshman added a comment - This is a low-risk fix and I will merge the change today.

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.2-3445 contains magma commit edf5c5c with commit message: MB-53964 [BP] lsm: Fix incorrect assignment of taskGroupID

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.2-3445 contains magma commit 8a2b18b with commit message: MB-53964 [BP] lsm: Add cancel task method for table loader

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.2.0-5000 contains magma commit edf5c5c with commit message: MB-53964 [BP] lsm: Fix incorrect assignment of taskGroupID

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.2.0-5000 contains magma commit 8a2b18b with commit message: MB-53964 [BP] lsm: Add cancel task method for table loader

            ritam.sharma Ritam Sharma added a comment - Balakumaran Gopal / Ankush Sharma - Can you please validate this bug on priority for 7.1.2?

            ankush.sharma Ankush Sharma added a comment - Sure Ritam Sharma, started working on this.

            ankush.sharma Ankush Sharma added a comment - This issue was not easily reproducible in 7.2 builds (where we first saw it). Even in 7.2 we could reproduce it only twice out of hundreds of iterations. We ran many iterations of the delta recovery tests on build 7.1.2-3450 and did not hit it once. Hence closing this.

            People

              Balakumaran.Gopal Balakumaran Gopal
              sarath Sarath Lakshman