Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Release: Cheshire-Cat
- Build: Enterprise Edition 7.0.0 build 5060 (Windows)
- Triage: Untriaged
- OS: Windows 64-bit
Description
Build: 7.0.0 - 5060
Scenario:
Adding 2 nodes into the cluster (1 kv node and 1 n1ql+index node).
The cluster has active FTS, 2i, eventing, and cbas services running.
Observed the index rebalance step stuck for more than 5 hrs.
Rebalance Operation Id = cb4133936ca1cf441dfb7e7e80b08fd8
+----------------+----------------+-----------------------+----------------+--------------+
| Nodes          | Services       | Version               | CPU            | Status       |
+----------------+----------------+-----------------------+----------------+--------------+
| 172.23.136.114 | index, n1ql    | 7.0.0-5017-enterprise | 16.2759689664  | Cluster node |
| 172.23.136.106 | kv             | 7.0.0-5017-enterprise | 96.3590072138  | Cluster node |
| 172.23.136.107 | kv             | 7.0.0-5017-enterprise | 95.904451103   | Cluster node |
| 172.23.136.108 | index, n1ql    | 7.0.0-5017-enterprise | 6.25161456988  | Cluster node |
| 172.23.136.115 | backup         | 7.0.0-5017-enterprise | 0.247504125069 | Cluster node |
| 172.23.136.113 | eventing, fts  | 7.0.0-5017-enterprise | 48.0959898431  | Cluster node |
| 172.23.136.110 | kv             | 7.0.0-5017-enterprise | 82.6015216793  | Cluster node |
| 172.23.136.105 | kv             | 7.0.0-5017-enterprise | 97.6586680867  | Cluster node |
| 172.23.136.112 | ['kv']         |                       |                | <--- IN ---  |
| 172.23.138.127 | ['n1ql,index'] |                       |                | <--- IN ---  |
+----------------+----------------+-----------------------+----------------+--------------+
Attachments
Issue Links
- relates to MB-46251 Handle errors in shutdownVbuckets code path (Closed)
For Gerrit Dashboard: MB-46005

# | Subject | Branch | Project | Status | CR | V
---|---|---|---|---|---|---
153091,5 | MB-46005 Use UUID when building feed name for getting failover logs | unstable | indexing | MERGED | +2 | +1
153095,2 | MB-46005 Prepare stream for fresh start during KV repair | unstable | indexing | ABANDONED | +1 | 0
153196,4 | MB-46005 Clean-up keyspace on error during shutdownVBuckets | unstable | indexing | ABANDONED | +1 | 0
The other issue concerns how the indexer manages stream repair.
a. For vb:929, the indexer received more than one StreamBegin, so it marked this vb as having a connection error (a sketch of this bookkeeping follows the log excerpt below).
2021-04-29T00:36:15.922-07:00 [Info] Timekeeper::handleStreamBegin Owner count > 1. Treat as CONN_ERR. StreamId MAINT_STREAM MutationMeta KeyspaceId: travel-sample Vbucket: 929 Vbuuid: 42283421376440 Seqno: 32167 FirstSnap: false
2021-04-29T00:36:15.922-07:00 [Info] StreamState::connection error - set repair state to RESTART_VB for MAINT_STREAM keyspaceId travel-sample vb 929
2021-04-29T00:36:15.922-07:00 [Info] Timekeeper::handleStreamConnError RepairStream due to ConnError. StreamId MAINT_STREAM KeyspaceId travel-sample VbList [929]
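A minimal Go sketch of the step (a) bookkeeping, assuming hypothetical types and names (vbState, handleStreamBegin) rather than the actual indexing code: a second StreamBegin for a vb that already has an owner is treated as a connection error and the vb is queued for repair.

```go
package main

import "fmt"

// Hypothetical repair states mirroring the RESTART_VB value seen in the logs;
// the real constants and bookkeeping live in the indexing project.
type repairState int

const (
	repairNone repairState = iota
	repairRestartVb
)

// vbState tracks, per vbucket, how many owners (StreamBegins) have been seen
// and whether a repair action is pending.
type vbState struct {
	ownerCount int
	repair     repairState
}

// handleStreamBegin sketches step (a): a second StreamBegin for the same vb
// means two streams claim ownership, so it is treated as a connection error
// and the vb is queued for a restart.
func handleStreamBegin(states map[int]*vbState, vb int) (connErr bool) {
	st, ok := states[vb]
	if !ok {
		st = &vbState{}
		states[vb] = st
	}
	st.ownerCount++
	if st.ownerCount > 1 {
		st.repair = repairRestartVb
		return true
	}
	return false
}

func main() {
	states := map[int]*vbState{}
	handleStreamBegin(states, 929) // first StreamBegin: normal ownership
	if handleStreamBegin(states, 929) {
		fmt.Println("vb 929: owner count > 1, treat as CONN_ERR, repair state RESTART_VB")
	}
}
```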
b. For this, the indexer sent a message with ConnErrVbs containing vb:929, which led to ShutdownVbuckets being called for vb:929 (a sketch of the request follows the log excerpt below).
2021-04-29T00:36:16.925-07:00 [Info] KVSender::sendRestartVbuckets Projector 172.23.136.112:9999 Topic MAINT_STREAM_TOPIC_aee54a90f72022800c2cd6f5f6023d1b travel-sample travel-sample
2021-04-29T00:36:16.925-07:00 [Info] KVSender::sendRestartVbuckets ShutdownVbuckets 172.23.136.112:9999 Topic MAINT_STREAM_TOPIC_aee54a90f72022800c2cd6f5f6023d1b travel-sample travel-sample ConnErrVbs [929]
2021-04-29T00:36:16.925-07:00 [Info] KVSender::sendRestartVbuckets ShutdownVbuckets Projector 172.23.136.112:9999 Topic MAINT_STREAM_TOPIC_aee54a90f72022800c2cd6f5f6023d1b travel-sample travel-sample
ShutdownTs bucket: travel-sample, scope :, collectionIDs: [], vbuckets: 1 -
{vbno, vbuuid, manifest, seqno, snapshot-start, snapshot-end}
{ 929 2674e04f9fb8 0 32203 32203 32203}
2021-04-29T00:36:16.976-07:00 [Error] KVSender::sendRestartVbuckets Unexpected Error During ShutdownVbuckets Request for Projector 172.23.136.112:9999 Topic MAINT_STREAM_TOPIC_aee54a90f72022800c2cd6f5f6023d1b travel-sample. Err feed.feeder.
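For illustration only, a small Go sketch of how the ShutdownVbuckets request in step (b) could be assembled; the shutdownTs struct mirrors the fields printed in the ShutdownTs log line above, while the struct and function names are hypothetical, not the real kvsender API.

```go
package main

import "fmt"

// shutdownTs mirrors the fields printed in the ShutdownTs log line above
// ({vbno, vbuuid, manifest, seqno, snapshot-start, snapshot-end}); the type
// and the buildShutdownRequest helper are illustrative, not the real API.
type shutdownTs struct {
	Vbno      uint16
	Vbuuid    uint64
	Manifest  uint64
	Seqno     uint64
	SnapStart uint64
	SnapEnd   uint64
}

// buildShutdownRequest sketches step (b): collect the shutdown timestamps for
// every vb flagged with a connection error; sending the request to the
// projector (and the follow-up restart) is elided here.
func buildShutdownRequest(connErrVbs []uint16, current map[uint16]shutdownTs) []shutdownTs {
	var req []shutdownTs
	for _, vb := range connErrVbs {
		if ts, ok := current[vb]; ok {
			req = append(req, ts)
		}
	}
	return req
}

func main() {
	// Values taken from the ShutdownTs log entry for vb 929.
	current := map[uint16]shutdownTs{
		929: {Vbno: 929, Vbuuid: 0x2674e04f9fb8, Seqno: 32203, SnapStart: 32203, SnapEnd: 32203},
	}
	fmt.Printf("ShutdownVbuckets request: %+v\n", buildShutdownRequest([]uint16{929}, current))
}
```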
c. The shutdown failed due to the error mentioned in the comment above, i.e. two DCP streams started to use the same UUID.
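The merged Gerrit change ("Use UUID when building feed name for getting failover logs") addresses this collision by making each feed name unique. A minimal Go sketch of that idea, with a hypothetical newFeedName helper and a random suffix standing in for the UUID:

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newFeedName sketches the idea behind the merged change "Use UUID when
// building feed name for getting failover logs": appending a freshly
// generated unique suffix (a UUID in the real fix; random bytes here) keeps
// two concurrent DCP feeds for the same topic/keyspace from colliding.
// The naming scheme and helper are hypothetical.
func newFeedName(topic, keyspace string) (string, error) {
	buf := make([]byte, 8)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return fmt.Sprintf("%s-%s-%x", topic, keyspace, buf), nil
}

func main() {
	a, _ := newFeedName("MAINT_STREAM_TOPIC", "travel-sample")
	b, _ := newFeedName("MAINT_STREAM_TOPIC", "travel-sample")
	fmt.Println(a)
	fmt.Println(b) // distinct even for the same topic and keyspace
}
```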
d. As the restart was successful, the timekeeper updated the state of the vbucket to SHUTDOWN_VB so that it can clear its state when a StreamBegin arrives.
2021-04-29T00:37:17.050-07:00 [Info] StreamState::set repair state to SHUTDOWN_VB for MAINT_STREAM keyspaceId travel-sample vb 929
e. The StreamBegin never arrives because the projector thinks nothing is wrong with the vb (since the shutdown failed), and as the state has already been updated to REPAIR_SHUTDOWN_VB, the indexer does not attempt another shutdown.
Since no further shutdown is attempted for this vb, the state remains corrupt and the timekeeper never comes out of the restart loop (a sketch of this stuck transition follows).
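A minimal Go sketch of the state transition described in steps (d)-(e), using illustrative names; the handleShutdownError flag models the direction of the related fix tracked in MB-46251 (handle errors in the shutdownVbuckets code path), not the shipped implementation.

```go
package main

import "fmt"

// Repair states named after the log messages above; the types and the
// transition logic here are illustrative only.
type repairState string

const (
	restartVb  repairState = "RESTART_VB"
	shutdownVb repairState = "SHUTDOWN_VB"
)

// nextRepairState sketches the transition from steps (d)-(e). In the observed
// behaviour the vb moves to SHUTDOWN_VB regardless of whether the
// ShutdownVbuckets request succeeded; if it failed, no StreamBegin ever
// arrives and the vb stays parked there. The handleShutdownError flag models
// the direction of MB-46251: on error, stay in RESTART_VB so the shutdown is
// retried on the next repair pass.
func nextRepairState(cur repairState, shutdownErr error, handleShutdownError bool) repairState {
	if cur != restartVb {
		return cur
	}
	if shutdownErr != nil && handleShutdownError {
		return restartVb // retry the shutdown instead of waiting forever
	}
	return shutdownVb // wait for a StreamBegin to clear the vb state
}

func main() {
	err := fmt.Errorf("feed.feeder") // the error seen in the logs above
	fmt.Println(nextRepairState(restartVb, err, false)) // SHUTDOWN_VB: stuck
	fmt.Println(nextRepairState(restartVb, err, true))  // RESTART_VB: retried
}
```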