Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-47749

handleIndexMergeSnapshot() panic from extraneous s.muSnap.Unlock() calls

    XMLWordPrintable

Details

    • Untriaged
    • 1
    • Unknown

    Description

      secondary/indexer/storage_manager.go handleIndexMergeSnapshot() does not lock, and its caller does not hold, the s.muSnap mutex (apparently it used to lock this at the top), but it still contains rarely-entered error reporting code blocks that unlock this mutex. If one of these error blocks gets entered, it triggers a panic.

      The fix is to delete the s.muSnap.Unlock() calls from this method. There are three of them (one is in commented-out code and two are in live code).
       

      Seen in currently undelivered new version of set14_rebalance_test.go TestFailoverAndRebalance:
      http://ci2i-unstable.northscale.in/gsi-04.08.2021-09.40.fail.html

      fatal error: sync: unlock of unlocked mutex
       
      goroutine 166 [running]:
      runtime.throw(0x131592e, 0x1e)
              /home/buildbot/.cbdepscache/exploded/x86_64/go-1.16.5/go/src/runtime/panic.go:1117 +0x72 fp=0xc006507ba0 sp=0xc006507b70 pc=0x43fdd2
      sync.throw(0x131592e, 0x1e)
              /home/buildbot/.cbdepscache/exploded/x86_64/go-1.16.5/go/src/runtime/panic.go:1103 +0x35 fp=0xc006507bc0 sp=0xc006507ba0 pc=0x474375
      sync.(*Mutex).unlockSlow(0xc004e96610, 0xffffffff)
              /home/buildbot/.cbdepscache/exploded/x86_64/go-1.16.5/go/src/sync/mutex.go:196 +0xd8 fp=0xc006507be8 sp=0xc006507bc0 pc=0x491e18
       
       
      Unlock call that panicked:
       
      sync.(*Mutex).Unlock(...)
              /home/buildbot/.cbdepscache/exploded/x86_64/go-1.16.5/go/src/sync/mutex.go:190
       
       
      Caused by s.muSnap.Unlock() in handleIndexMergeSnapshot (storage_manager.go:1720):
       
      github.com/couchbase/indexing/secondary/indexer.(*storageMgr).handleIndexMergeSnapshot(0xc004e96580, 0x14a4a80, 0xc00cdd0ff0)
              /opt/build/goproj/src/github.com/couchbase/indexing/secondary/indexer/storage_manager.go:1720 +0x699 fp=0xc006507f20 sp=0xc006507be8 pc=0xfb67f9
       
       
      github.com/couchbase/indexing/secondary/indexer.(*storageMgr).handleSupvervisorCommands(0xc004e96580, 0x14a4a80, 0xc00cdd0ff0)
              /opt/build/goproj/src/github.com/couchbase/indexing/secondary/indexer/storage_manager.go:224 +0x1b2 fp=0xc006507f58 sp=0xc006507f20 pc=0xfab8f2
      github.com/couchbase/indexing/secondary/indexer.(*storageMgr).run(0xc004e96580)
              /opt/build/goproj/src/github.com/couchbase/indexing/secondary/indexer/storage_manager.go:182 +0x48 fp=0xc006507fd8 sp=0xc006507f58 pc=0xfab628
      runtime.goexit()
              /home/buildbot/.cbdepscache/exploded/x86_64/go-1.16.5/go/src/runtime/asm_amd64.s:1371 +0x1 fp=0xc006507fe0 sp=0xc006507fd8 pc=0x4792e1
      created by github.com/couchbase/indexing/secondary/indexer.NewStorageManager
              /opt/build/goproj/src/github.com/couchbase/indexing/secondary/indexer/storage_manager.go:157 +0x2e5
      

      Attachments

        Issue Links

          For Gerrit Dashboard: MB-47749
          # Subject Branch Project Status CR V

          Activity

            kevin.cherkauer Kevin Cherkauer (Inactive) created issue -
            varun.velamuri Varun Velamuri made changes -
            Field Original Value New Value
            Component/s secondary-index [ 11211 ]
            Component/s storage-engine [ 10175 ]
            varun.velamuri Varun Velamuri made changes -
            Affects Version/s 7.0.0 [ 17233 ]
            Affects Version/s Neo [ 17615 ]
            varun.velamuri Varun Velamuri made changes -
            Fix Version/s 7.0.1 [ 17104 ]
            Fix Version/s Neo [ 17615 ]
            varun.velamuri Varun Velamuri made changes -
            Assignee John Liang [ jliang ] Jeelan Poola [ jeelan.poola ]
            kevin.cherkauer Kevin Cherkauer (Inactive) made changes -
            Assignee Jeelan Poola [ jeelan.poola ] Kevin Cherkauer [ JIRAUSER25138 ]

            Jeelan Poola, Mihir Kamdar, I think it would be good to fix this panic in 7.0.1. Although this panic happens in -ve case, crashing indexer does not seem to be a good idea.

            varun.velamuri Varun Velamuri added a comment - Jeelan Poola , Mihir Kamdar , I think it would be good to fix this panic in 7.0.1. Although this panic happens in -ve case, crashing indexer does not seem to be a good idea.
            kevin.cherkauer Kevin Cherkauer (Inactive) made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            kevin.cherkauer Kevin Cherkauer (Inactive) made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            kevin.cherkauer Kevin Cherkauer (Inactive) made changes -
            Summary handleIndexMergeSnapshot() delete extraneous s.muSnap.Unlock() calls handleIndexMergeSnapshot() panic from extraneous s.muSnap.Unlock() calls

            Varun Velamuri Agree. Would you share what the new test does? We can have an equivalent functional test implemented.

            mihir.kamdar Mihir Kamdar (Inactive) added a comment - Varun Velamuri Agree. Would you share what the new test does? We can have an equivalent functional test implemented.

            Kevin Cherkauer Is the author of that test. He should have more information  

            varun.velamuri Varun Velamuri added a comment - Kevin Cherkauer  Is the author of that test. He should have more information  

            Presumably follow-on failures from the panic, the following fatal errors were logged later during the same failed rebalance attempt, but Mihir Kamdar pinged me to say these look possibly the same as MB-46601 (which was already supposed to be fixed in Cheshire-Cat):

            17213:2021-08-04T13:20:49.204+05:30 [Error] plasmaSlice:NewplasmaSlice Id 0x102eaa0 IndexInstId 7292211632063708996 fatal error occured: Unable to initialize /opt/build/ns_server/data/n_3/data/@2i/default_set14_idx_7PARTITIONS_guid_balance_7292211632063708996_5.index/mainIndex, err = fatal: Fail to find shard for dedicated instance /opt/build/ns_server/data/n_3/data/@2i/default_set14_idx_7PARTITIONS_guid_balance_7292211632063708996_5.index/mainIndex due to instance not present on disk
            17214:2021-08-04T13:20:49.204+05:30 [Error] plasmaSlice:NewplasmaSlice Id 0 IndexInstId 7292211632063708996 PartitionId 5 fatal error occured: Storage corrupted and unrecoverable
            17231:2021-08-04T13:20:49.205+05:30 [Error] plasmaSlice:NewplasmaSlice Id 0x102eaa0 IndexInstId 7292211632063708996 fatal error occured: Unable to initialize /opt/build/ns_server/data/n_3/data/@2i/default_set14_idx_7PARTITIONS_guid_balance_7292211632063708996_6.index/mainIndex, err = fatal: Fail to find shard for dedicated instance /opt/build/ns_server/data/n_3/data/@2i/default_set14_idx_7PARTITIONS_guid_balance_7292211632063708996_6.index/mainIndex due to instance not present on disk
            17232:2021-08-04T13:20:49.205+05:30 [Error] plasmaSlice:NewplasmaSlice Id 0 IndexInstId 7292211632063708996 PartitionId 6 fatal error occured: Storage corrupted and unrecoverable
            17551:2021-08-04T13:20:49.534+05:30 [Error] plasmaSlice:NewplasmaSlice Id 0x102eaa0 IndexInstId 8114273593663889297 fatal error occured: Unable to initialize /opt/build/ns_server/data/n_3/data/@2i/default_set14_idx_7PARTITIONS_docid_isActive_8114273593663889297_6.index/mainIndex, err = fatal: Fail to find shard for dedicated instance /opt/build/ns_server/data/n_3/data/@2i/default_set14_idx_7PARTITIONS_docid_isActive_8114273593663889297_6.index/mainIndex due to instance not present on disk
            17552:2021-08-04T13:20:49.534+05:30 [Error] plasmaSlice:NewplasmaSlice Id 0 IndexInstId 8114273593663889297 PartitionId 6 fatal error occured: Storage corrupted and unrecoverable
            17569:2021-08-04T13:20:49.535+05:30 [Error] plasmaSlice:NewplasmaSlice Id 0x102eaa0 IndexInstId 8114273593663889297 fatal error occured: Unable to initialize /opt/build/ns_server/data/n_3/data/@2i/default_set14_idx_7PARTITIONS_docid_isActive_8114273593663889297_7.index/mainIndex, err = fatal: Fail to find shard for dedicated instance /opt/build/ns_server/data/n_3/data/@2i/default_set14_idx_7PARTITIONS_docid_isActive_8114273593663889297_7.index/mainIndex due to instance not present on disk
            17570:2021-08-04T13:20:49.535+05:30 [Error] plasmaSlice:NewplasmaSlice Id 0 IndexInstId 8114273593663889297 PartitionId 7 fatal error occured: Storage corrupted and unrecoverable
            

            kevin.cherkauer Kevin Cherkauer (Inactive) added a comment - - edited Presumably follow-on failures from the panic, the following fatal errors were logged later during the same failed rebalance attempt, but Mihir Kamdar pinged me to say these look possibly the same as MB-46601 (which was already supposed to be fixed in Cheshire-Cat): 17213:2021-08-04T13:20:49.204+05:30 [Error] plasmaSlice:NewplasmaSlice Id 0x102eaa0 IndexInstId 7292211632063708996 fatal error occured: Unable to initialize /opt/build/ns_server/data/n_3/data/@2i/default_set14_idx_7PARTITIONS_guid_balance_7292211632063708996_5.index/mainIndex, err = fatal: Fail to find shard for dedicated instance /opt/build/ns_server/data/n_3/data/@2i/default_set14_idx_7PARTITIONS_guid_balance_7292211632063708996_5.index/mainIndex due to instance not present on disk 17214:2021-08-04T13:20:49.204+05:30 [Error] plasmaSlice:NewplasmaSlice Id 0 IndexInstId 7292211632063708996 PartitionId 5 fatal error occured: Storage corrupted and unrecoverable 17231:2021-08-04T13:20:49.205+05:30 [Error] plasmaSlice:NewplasmaSlice Id 0x102eaa0 IndexInstId 7292211632063708996 fatal error occured: Unable to initialize /opt/build/ns_server/data/n_3/data/@2i/default_set14_idx_7PARTITIONS_guid_balance_7292211632063708996_6.index/mainIndex, err = fatal: Fail to find shard for dedicated instance /opt/build/ns_server/data/n_3/data/@2i/default_set14_idx_7PARTITIONS_guid_balance_7292211632063708996_6.index/mainIndex due to instance not present on disk 17232:2021-08-04T13:20:49.205+05:30 [Error] plasmaSlice:NewplasmaSlice Id 0 IndexInstId 7292211632063708996 PartitionId 6 fatal error occured: Storage corrupted and unrecoverable 17551:2021-08-04T13:20:49.534+05:30 [Error] plasmaSlice:NewplasmaSlice Id 0x102eaa0 IndexInstId 8114273593663889297 fatal error occured: Unable to initialize /opt/build/ns_server/data/n_3/data/@2i/default_set14_idx_7PARTITIONS_docid_isActive_8114273593663889297_6.index/mainIndex, err = fatal: Fail to find shard for dedicated instance /opt/build/ns_server/data/n_3/data/@2i/default_set14_idx_7PARTITIONS_docid_isActive_8114273593663889297_6.index/mainIndex due to instance not present on disk 17552:2021-08-04T13:20:49.534+05:30 [Error] plasmaSlice:NewplasmaSlice Id 0 IndexInstId 8114273593663889297 PartitionId 6 fatal error occured: Storage corrupted and unrecoverable 17569:2021-08-04T13:20:49.535+05:30 [Error] plasmaSlice:NewplasmaSlice Id 0x102eaa0 IndexInstId 8114273593663889297 fatal error occured: Unable to initialize /opt/build/ns_server/data/n_3/data/@2i/default_set14_idx_7PARTITIONS_docid_isActive_8114273593663889297_7.index/mainIndex, err = fatal: Fail to find shard for dedicated instance /opt/build/ns_server/data/n_3/data/@2i/default_set14_idx_7PARTITIONS_docid_isActive_8114273593663889297_7.index/mainIndex due to instance not present on disk 17570:2021-08-04T13:20:49.535+05:30 [Error] plasmaSlice:NewplasmaSlice Id 0 IndexInstId 8114273593663889297 PartitionId 7 fatal error occured: Storage corrupted and unrecoverable

            Mihir Kamdar You can see the new set14_rebalance_test.go functional tests here:

            http://review.couchbase.org/c/indexing/+/158772/1/secondary/tests/functionaltests/set14_rebalance_test.go

            The comment headers document in some detail what each one does, and the code is not long. There are 10 tests, but five of them are setup and teardown. The actual test cases are:

            • TestIndexNodeRebalanceIn
            • TestIndexNodeRebalanceOut
            • TestFailoverAndRebalance
            • TestSwapRebalance
            • TestRebalanceReplicaRepair

             The indexes used are created in

            • TestCreateIndexesBeforeRebalance
            • TestCreateReplicatedIndexesBeforeRebalance
            kevin.cherkauer Kevin Cherkauer (Inactive) added a comment - - edited Mihir Kamdar You can see the new set14_rebalance_test.go functional tests here: http://review.couchbase.org/c/indexing/+/158772/1/secondary/tests/functionaltests/set14_rebalance_test.go The comment headers document in some detail what each one does, and the code is not long. There are 10 tests, but five of them are setup and teardown. The actual test cases are: TestIndexNodeRebalanceIn TestIndexNodeRebalanceOut TestFailoverAndRebalance TestSwapRebalance TestRebalanceReplicaRepair  The indexes used are created in TestCreateIndexesBeforeRebalance TestCreateReplicatedIndexesBeforeRebalance
            jeelan.poola Jeelan Poola made changes -
            Link This issue is cloned by MB-47760 [ MB-47760 ]
            jeelan.poola Jeelan Poola made changes -
            Fix Version/s 7.0.1 [ 17104 ]

            Backporting to 7.0.1 via MB-47760.

            kevin.cherkauer Kevin Cherkauer (Inactive) added a comment - Backporting to 7.0.1 via MB-47760 .
            wayne Wayne Siu made changes -
            Link This issue backports to MB-47760 [ MB-47760 ]
            wayne Wayne Siu made changes -
            Link This issue is cloned by MB-47760 [ MB-47760 ]

            Build couchbase-server-7.1.0-1124 contains indexing commit 23371d9 with commit message:
            MB-47749 Delete s.muSnap.Unlock() calls from handleIndexMergeSnapshot()

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-1124 contains indexing commit 23371d9 with commit message: MB-47749 Delete s.muSnap.Unlock() calls from handleIndexMergeSnapshot()
            kevin.cherkauer Kevin Cherkauer (Inactive) made changes -
            Resolution Fixed [ 1 ]
            Status In Progress [ 3 ] Resolved [ 5 ]
            mihir.kamdar Mihir Kamdar (Inactive) made changes -
            Assignee Kevin Cherkauer [ JIRAUSER25138 ] Hemant Rajput [ hemant.rajput ]

             Kevin Cherkauer, can you please provide steps to validate?

            hemant.rajput Hemant Rajput added a comment -   Kevin Cherkauer , can you please provide steps to validate?

            Hemant Rajput I think this one needs to be dev verified as it was found and fixed through code inspection, not from actually hitting the bugs.

            kevin.cherkauer Kevin Cherkauer (Inactive) added a comment - Hemant Rajput I think this one needs to be dev verified as it was found and fixed through code inspection, not from actually hitting the bugs.
            hemant.rajput Hemant Rajput made changes -
            Labels request-dev-verify

            Kevin Cherkauer, I've marked the ticket for Dev-verification. Please close the ticket as per your convenience. 

            hemant.rajput Hemant Rajput added a comment - Kevin Cherkauer , I've marked the ticket for Dev-verification. Please close the ticket as per your convenience. 
            hemant.rajput Hemant Rajput made changes -
            Assignee Hemant Rajput [ hemant.rajput ] Kevin Cherkauer [ JIRAUSER25138 ]

            Dev verifying per QE request.

            kevin.cherkauer Kevin Cherkauer (Inactive) added a comment - Dev verifying per QE request.
            kevin.cherkauer Kevin Cherkauer (Inactive) made changes -
            Status Resolved [ 5 ] Closed [ 6 ]

            People

              kevin.cherkauer Kevin Cherkauer (Inactive)
              kevin.cherkauer Kevin Cherkauer (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty