Details
-
Bug
-
Status: Closed
-
Critical
-
Resolution: Fixed
-
6.5.0
-
Untriaged
-
Unknown
Description
The indexer crashes with the following stack trace in the TestIndexNodeRebalanceOut test during the run http://ci2i-unstable.northscale.in/gsi-10.12.2019-20.06.fail.html
2019-12-10T22:48:42.710+05:30 [Info] clustMgrAgent::OnIndexDelete Success for Drop IndexId 14723988844231693918
panic: runtime error: index out of range

goroutine 125393 [running]:
panic(0xf8a3e0, 0xc420018150)
/home/buildbot/.cbdepscache/exploded/x86_64/go-1.7.6/go/src/runtime/panic.go:500 +0x1a1
github.com/couchbase/indexing/secondary/indexer.(*StreamState).updateRepairState(0xc4202ba000, 0xc4237e0001, 0xc424e75618, 0x7, 0xc4238a8000, 0x1af, 0x200, 0x0, 0x0, 0x0)
goproj/src/github.com/couchbase/indexing/secondary/indexer/stream_state.go:541 +0x3a5
github.com/couchbase/indexing/secondary/indexer.(*timekeeper).sendRestartMsg(0xc420136080, 0x1a9cde0, 0xc427d5a150)
goproj/src/github.com/couchbase/indexing/secondary/indexer/timekeeper.go:3103 +0x22ca
created by github.com/couchbase/indexing/secondary/indexer.(*timekeeper).repairStream
goproj/src/github.com/couchbase/indexing/secondary/indexer/timekeeper.go:3016 +0xf27
This issue appeared to have been fixed under MB-36341, but the panic is still being seen with build 6.5.0-4928.
The issue is due to a race condition between bucket clean-up and stream repair. If bucket clean-up happens while stream repair is in progress, it removes all the bookkeeping related to the bucket from the stream state. When the repair stream code path then tries to access that bookkeeping, it panics.
The issue can be reproduced using the following steps:
a. Add a 30-second sleep in sendRestartMsg, at KV_SENDER_RESTART_VBUCKETS_RESPONSE, after the needsRollback call
b. Add a 30-second sleep in the indexer at removeIndexesFromStream, after the message is sent to the timekeeper
c. Cluster run with 1KV+n1ql, 1KV+index, 1 index node
d. Create and build an index on one indexer node
e. After the index is created, remove the indexer node on which the index was built and trigger a rebalance. The rebalance will move the index to the other node and clean up the bucket from the stream
f. The rebalance should fail and the indexer should panic
The reason for the indexer panic is that the stream status is validated only once while processing the KV_SENDER_RESTART_VBUCKETS_RESPONSE. If stream clean-up happens after the status has been validated, the updateRepairState method panics because the stream state has already been cleaned up. We lock around the stream state variables twice but validate the status only once.
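One way to close such a window is to revalidate the status every time the lock is re-acquired. The Go sketch below shows the pattern under simplifying assumptions; it is not the actual fix. The streamState struct, its fields, and the functions newStreamState, isActive, cleanup and updateRepairStateChecked are hypothetical and do not match the real StreamState signatures.

```go
package main

import (
	"fmt"
	"sync"
)

// streamState is a hypothetical, simplified model of the stream state.
type streamState struct {
	mu     sync.Mutex
	active map[string]bool  // bucket -> stream status is active
	repair map[string][]int // bucket -> per-vbucket repair state
}

func newStreamState(bucket string, numVb int) *streamState {
	return &streamState{
		active: map[string]bool{bucket: true},
		repair: map[string][]int{bucket: make([]int, numVb)},
	}
}

// isActive is the first locked section: it validates the stream status.
func (s *streamState) isActive(bucket string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.active[bucket]
}

// cleanup removes all bookkeeping for the bucket under the lock.
func (s *streamState) cleanup(bucket string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.active, bucket)
	delete(s.repair, bucket)
}

// updateRepairStateChecked is the second locked section. It revalidates
// the status before touching the bookkeeping and reports whether the
// update was applied.
func (s *streamState) updateRepairStateChecked(bucket string, vbs []int) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.active[bucket] {
		return false // stream was cleaned up between the two locked sections
	}
	state := s.repair[bucket]
	for _, vb := range vbs {
		if vb >= 0 && vb < len(state) {
			state[vb] = 1
		}
	}
	return true
}

func main() {
	s := newStreamState("default", 1024)

	fmt.Println("status valid at first check:", s.isActive("default"))

	// Clean-up lands between the two locked sections.
	s.cleanup("default")

	fmt.Println("update applied after clean-up:", s.updateRepairStateChecked("default", []int{7}))
}
```

The caller can treat a false return as "stream no longer exists" and skip the rest of the repair instead of panicking.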