MB-37227

Indexer panics in CI test TestIndexNodeRebalanceOut


Details

    • Untriaged
    • Unknown

    Description

      The indexer crashes with the following stack trace in the TestIndexNodeRebalanceOut test during the CI run http://ci2i-unstable.northscale.in/gsi-10.12.2019-20.06.fail.html

      2019-12-10T22:48:42.710+05:30 [Info] clustMgrAgent::OnIndexDelete Success for Drop IndexId 14723988844231693918
      panic: runtime error: index out of range
       
      goroutine 125393 [running]:
      panic(0xf8a3e0, 0xc420018150)
              /home/buildbot/.cbdepscache/exploded/x86_64/go-1.7.6/go/src/runtime/panic.go:500 +0x1a1
      github.com/couchbase/indexing/secondary/indexer.(*StreamState).updateRepairState(0xc4202ba000, 0xc4237e0001, 0xc424e75618, 0x7, 0xc4238a8000, 0x1af, 0x200, 0x0, 0x0, 0x0)
              goproj/src/github.com/couchbase/indexing/secondary/indexer/stream_state.go:541 +0x3a5
      github.com/couchbase/indexing/secondary/indexer.(*timekeeper).sendRestartMsg(0xc420136080, 0x1a9cde0, 0xc427d5a150)
              goproj/src/github.com/couchbase/indexing/secondary/indexer/timekeeper.go:3103 +0x22ca
      created by github.com/couchbase/indexing/secondary/indexer.(*timekeeper).repairStream
              goproj/src/github.com/couchbase/indexing/secondary/indexer/timekeeper.go:3016 +0xf27
      

      This issue seems to have been fixed under MB-36341, but the panic is still being seen with build 6.5.0-4928.

       


        Activity

          varun.velamuri Varun Velamuri added a comment - - edited

          The issue is due to a race condition between bucket clean-up and stream repair. If bucket clean-up happens while stream repair is in progress, it cleans up all the bookkeeping related to the bucket in the stream state. When the stream repair code path then tries to access that bookkeeping, it panics.

          The issue can be reproduced using the following steps:

          a. Add a sleep of 30 seconds in sendRestartMsg, at KV_SENDER_RESTART_VBUCKETS_RESPONSE, after the needsRollback call

          b. Add a sleep of 30 seconds in the indexer at removeIndexesFromStream, after sending the message to the timekeeper

          c. Start a cluster run with 1KV+n1ql, 1KV+index, 1 index node

          d. Create and build an index on one indexer node

          e. After the index is created, remove the indexer node on which the index was built. Trigger rebalance. Rebalance will move the index to the other node and clean up the bucket from the stream

          f. Rebalance should fail and the indexer should panic

          The indexer panics because the stream status is validated only once while processing KV_SENDER_RESTART_VBUCKETS_RESPONSE. If stream clean-up happens after the status is validated, the updateRepairState method panics because the stream's bookkeeping has already been cleaned up. We lock around the stream state variables twice but validate the status only once.
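
          The check-once, lock-twice pattern described above can be illustrated with a small standalone Go model. This is a sketch under stated assumptions, not the indexer's code: the streamState type, its fields, and the repair, cleanupBucket, and tryRepair functions are hypothetical stand-ins, and the clean-up is injected through a callback in the unlocked window so the outcome is deterministic rather than timing-dependent.

```go
package main

import (
	"fmt"
	"sync"
)

// streamState is a simplified, hypothetical stand-in for per-stream
// bookkeeping; it is not the indexer's real StreamState.
type streamState struct {
	mu          sync.Mutex
	active      map[string]bool  // bucket -> stream status is still valid
	repairState map[string][]int // bucket -> per-vbucket repair state
}

func newStreamState(bucket string, numVb int) *streamState {
	return &streamState{
		active:      map[string]bool{bucket: true},
		repairState: map[string][]int{bucket: make([]int, numVb)},
	}
}

// cleanupBucket models bucket clean-up: it drops all bookkeeping for the bucket.
func (s *streamState) cleanupBucket(bucket string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.active, bucket)
	delete(s.repairState, bucket)
}

// repair validates the status in a first critical section, releases the
// lock, and updates the bookkeeping in a second critical section.
// between runs while the lock is not held. When recheck is true, the
// status is validated again after the lock is re-acquired.
func (s *streamState) repair(bucket string, vb int, recheck bool, between func()) bool {
	s.mu.Lock()
	ok := s.active[bucket]
	s.mu.Unlock()
	if !ok {
		return false
	}

	between() // window in which a concurrent clean-up can run

	s.mu.Lock()
	defer s.mu.Unlock()
	if recheck && !s.active[bucket] {
		return false // bucket was cleaned up in the window; skip the update
	}
	s.repairState[bucket][vb] = 1 // index out of range if the entry is gone
	return true
}

// tryRepair runs one repair with a clean-up injected into the unlocked
// window and reports whether the update happened and whether it panicked.
func tryRepair(recheck bool) (updated bool, panicked bool) {
	s := newStreamState("default", 8)
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	updated = s.repair("default", 7, recheck, func() { s.cleanupBucket("default") })
	return updated, false
}

func main() {
	updated, panicked := tryRepair(false)
	fmt.Println("check once:", updated, panicked)
	updated, panicked = tryRepair(true)
	fmt.Println("re-check:", updated, panicked)
}
```

          In this model the first call hits the index-out-of-range panic, while the second returns without touching the removed bookkeeping because the status is validated again under the second lock.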


          build-team Couchbase Build Team added a comment -

          Build couchbase-server-6.5.0-4948 contains indexing commit d526c84 with commit message:
          MB-37227 Check bucket status after timekeeper lock acquire in sendRestartMsg

          build-team Couchbase Build Team added a comment -

          Build couchbase-server-7.0.0-1131 contains indexing commit 1755149 with commit message:
          MB-37227 Check bucket status after timekeeper lock acquire in sendRestartMsg

          build-team Couchbase Build Team added a comment -

          Build couchbase-server-6.5.1-6008 contains indexing commit d526c84 with commit message:
          MB-37227 Check bucket status after timekeeper lock acquire in sendRestartMsg

          mihir.kamdar Mihir Kamdar added a comment -

          Varun Velamuri, is there an equivalent functional test that we can implement? We have a lot of rebalance-out test cases in the functional regression and do not see any issues with those. If not, can you please close this based on the CI test?


          varun.velamuri Varun Velamuri added a comment -

          Mihir Kamdar, this is a race condition between stream repair and stream bucket clean-up. The failure was sporadic on our side too. You can try to implement a test similar to TestIndexNodeRebalanceOut. The test does the following:

          a. Set up a cluster with 1kv+n1ql (n0) and 1kv+index (n1) nodes

          b. Build 4 indexes

          c. Add a new index node into the cluster (i.e. n2)

          d. Remove node n1 from the cluster

          e. Trigger rebalance

          During rebalance, there will be a stream clean-up on node n1 as indexes from node n1 will be moved to node n2. Also, there will be a stream repair on node n1 as the KV node is being moved out of the cluster. Before the fix, this could trigger the race and the indexer could panic.

          I will close this issue based on dev-verification.

          varun.velamuri Varun Velamuri added a comment -

          Closing this as the CI test TestIndexNodeRebalanceOut is not failing.

          People

            varun.velamuri Varun Velamuri
            varun.velamuri Varun Velamuri