Some more observations:
Index stuck is on indexer 172.23.96.253.
1. Looks like flusher is waiting to flush mutations
3 @ 0x43485a 0x443ef5 0x442c8c 0x4ff836 0x465a11
|
# 0x4ff835 github.com/couchbase/indexing/secondary/indexer.(*flusher).flushSingleVbucketUptoSeqno+0x365 goproj/src/github.com/couchbase/indexing/secondary/indexer/flusher.go:308
|
2. From the logs, it looks like for 3 vbuckets (191, 189, 248), flush is not happening. The same state remains for around 19 hours.
2020-11-29T14:53:23.331-08:00 [Warn] Indexer::MutationQueue Dequeue Waiting For Seqno 173667 Bucket default Vbucket 191 for 35000 ms. Last Dequeue 0.
|
2020-11-29T14:53:23.364-08:00 [Warn] Indexer::MutationQueue Dequeue Waiting For Seqno 186635 Bucket default Vbucket 189 for 35000 ms. Last Dequeue 0.
|
2020-11-29T14:53:23.377-08:00 [Warn] Indexer::MutationQueue Dequeue Waiting For Seqno 148443 Bucket default Vbucket 248 for 35000 ms. Last Dequeue 0.
|
...
|
2020-11-30T09:51:48.685-08:00 [Warn] Indexer::MutationQueue Dequeue Waiting For Seqno 148443 Bucket default Vbucket 248 for 61105000 ms. Last Dequeue 0.
|
2020-11-30T09:51:48.961-08:00 [Warn] Indexer::MutationQueue Dequeue Waiting For Seqno 186635 Bucket default Vbucket 189 for 61105000 ms. Last Dequeue 0.
|
2020-11-30T09:51:49.495-08:00 [Warn] Indexer::MutationQueue Dequeue Waiting For Seqno 173667 Bucket default Vbucket 191 for 61105000 ms. Last Dequeue 0.
|
3. Focussing on bucket 191, it belongs to kv node 172.23.108.103. As per the kvdata logs, for the vbucket 191, kvdata has already seen the seqno 173667 for which indexer is waiting.
2020-11-29T14:52:36.978-08:00 [Info] KVDT[<-default<-127.0.0.1:8091 #MAINT_STREAM_TOPIC_b5498457cf95d0d23003751bba32455d] ##288 vbseqnos: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 178600 0 161146 174703 0 187001 214810 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 186393 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 148480 148274 175199 200724 147850 173997 187845 186973 187352 0 0 0 0 0 0 173717 173842 186635 160590 173667 149685 169891 181165 180896 181008 194135 149387 149665 149870 150749 193192 191726 169058 193596 157406 168843 192752 182023 149647 149850 168664 150703 172227 193448 192503 168743 169207 157475 149484 189661 180605 180819 204723 180149 149287 148678 157265 168226 168663 179858 180591 181055 149193 168675 148979 149473 180677 254131 168182 167680 203868 168405 168144 192085 148605 154020 148443 148879 179547 179945 179332 192233 167514 0 186877 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 204797 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
|
4. At 2020-11-29T14:53:23, the flusher was waiting for 35 seconds for the seqno 173667 (i.e. since 2020-11-29T14:52:48), but at that time there isn't any active vbucket stream as per the memcached logs. Following is the last occurrence of vbucket stream for vb 191 for this indexer. Note that, after this time, wee can see other indexers successfully creating vbucket streams for vb 191.
2020-11-29T14:52:29.704357-08:00 INFO 134: (default) DCP (Producer) eq_dcpq:secidx:proj-default-MAINT_STREAM_TOPIC_b5498457cf95d0d23003751bba32455d-12886222343688388930/3 - (vb:191) Creating stream with start seqno 173247 and end seqno 18446744073709551615; requested end seqno was 18446744073709551615, collections-manifest uid:none, sid:none
|
2020-11-29T14:52:29.704407-08:00 INFO 134: (default) DCP (Producer) eq_dcpq:secidx:proj-default-MAINT_STREAM_TOPIC_b5498457cf95d0d23003751bba32455d-12886222343688388930/3 - (vb:191) ActiveStream::scheduleBackfill_UNLOCKED register cursor with name "eq_dcpq:secidx:proj-default-MAINT_STREAM_TOPIC_b5498457cf95d0d23003751bba32455d-12886222343688388930/3" backfill:true, seqno:175142
|
2020-11-29T14:52:29.704441-08:00 INFO 134: (default) DCP (Producer) eq_dcpq:secidx:proj-default-MAINT_STREAM_TOPIC_b5498457cf95d0d23003751bba32455d-12886222343688388930/3 - (vb:191) Scheduling backfill from 173248 to 175141, reschedule flag : False
|
2020-11-29T14:52:30.469143-08:00 INFO 134: (default) DCP (Producer) eq_dcpq:secidx:proj-default-MAINT_STREAM_TOPIC_b5498457cf95d0d23003751bba32455d-12886222343688388930/3 - (vb:191) ActiveStream::markDiskSnapshot: Sending disk snapshot with start 173247, end 175791, and high completed --, max visible 175791
|
2020-11-29T14:52:30.469295-08:00 INFO 134: (default) DCP (Producer) eq_dcpq:secidx:proj-default-MAINT_STREAM_TOPIC_b5498457cf95d0d23003751bba32455d-12886222343688388930/3 - (vb:191) ActiveStream::registerCursor name "eq_dcpq:secidx:proj-default-MAINT_STREAM_TOPIC_b5498457cf95d0d23003751bba32455d-12886222343688388930/3", backfill:false, seqno:175792
|
2020-11-29T14:52:32.281168-08:00 INFO 134: (default) DCP (Producer) eq_dcpq:secidx:proj-default-MAINT_STREAM_TOPIC_b5498457cf95d0d23003751bba32455d-12886222343688388930/3 - (vb:191) Backfill complete, 1079 items read from disk, 434 from memory, last seqno read: 175791, pendingBackfill : False
|
2020-11-29T14:52:36.877757-08:00 WARNING 134: (default) DCP (Producer) eq_dcpq:secidx:proj-default-MAINT_STREAM_TOPIC_b5498457cf95d0d23003751bba32455d-12886222343688388930/3 - (vb:191) Stream closing, sent until seqno 174633 remaining items 0, reason: The stream closed early because the conn was disconnected
|
This means that something went wrong on the indexer side.
5. Indexer tried to open mutation stream at 2020-11-29T14:52:36, but projector on node 172.23.108.103 returned error feed.feeder - because previous request at 2020-11-29T14:51:43 hasn't yet responded.
2020-11-29T14:52:36.917-08:00 [Error] KVSender::sendMutationTopicRequest Projector 172.23.108.103:9999 Topic MAINT_STREAM_TOPIC_b5498457cf95d0d23003751bba32455d default
|
2020-11-29T14:52:36.917-08:00 [Error] KVSender::openMutationStream MAINT_STREAM default Error Received feed.feeder from 172.23.108.103:9999
|
|
prev req: 2020-11-29T14:51:43.594-08:00 [Info] KVSender::sendMutationTopicRequest Projector 172.23.108.103:9999 Topic MAINT_STREAM_TOPIC_b5498457cf95d0d23003751bba32455d default
|
6. Meanwhile, previous request receives the rollback. So, stream end issue recieved.
2020-11-29T14:52:45.741-08:00 [Info] KVSender::openMutationStream MAINT_STREAM default Projector 172.23.99.25:9999 Rollback Received pool:"default" bucket:"default" vbnos:303 vbnos:304 vbnos:305 vbnos:306 vbnos:307 vbnos:308 vbnos:309 vbnos:310 vbnos:311 vbnos:315 vbnos:316 vbnos:623 vbnos:624 vbnos:625 vbnos:626 vbnos:627 vbnos:628 vbnos:629 vbnos:630 vbnos:631 vbnos:695 vbnos:696 seqnos:0 seqnos:0 seqnos:0 seqnos:0 seqnos:0 seqnos:0 seqnos:0 seqnos:0 seqnos:0 seqnos:0 seqnos:0 seqnos:0 seqnos:0 seqnos:0 seqnos:0 seqnos:0 seqnos:0 seqnos:0 seqnos:0 seqnos:0 seqnos:0 seqnos:0 vbuuids:133585739715497 vbuuids:15950125255989 vbuuids:23175550451884 vbuuids:52741518790536 vbuuids:225866375618113 vbuuids:49139206490282 vbuuids:39216553385743 vbuuids:30148342475671 vbuuids:9819956268752 vbuuids:46902178522420 vbuuids:111148728589406 vbuuids:26729517547856 vbuuids:131238491713084 vbuuids:223399997626612 vbuuids:53381103950333 vbuuids:153017239541313 vbuuids:85507659899778 vbuuids:194725927067558 vbuuids:171303516790908 vbuuids:116485497368401 vbuuids:23362301620939 vbuuids:141866089159540 snapshots:<start:146776 end:146776 > snapshots:<start:216325 end:216325 > snapshots:<start:178079 end:178079 > snapshots:<start:175404 end:175404 > snapshots:<start:174288 end:174288 > snapshots:<start:147156 end:147156 > snapshots:<start:177598 end:177598 > snapshots:<start:175335 end:175335 > snapshots:<start:205062 end:205062 > snapshots:<start:146450 end:146450 > snapshots:<start:175374 end:175374 > snapshots:<start:185681 end:185681 > snapshots:<start:186475 end:186475 > snapshots:<start:185884 end:185884 > snapshots:<start:146512 end:146512 > snapshots:<start:148348 end:148348 > snapshots:<start:172879 end:172879 > snapshots:<start:146561 end:146561 > snapshots:<start:173788 end:173788 > snapshots:<start:199003 end:199003
|
> snapshots:<start:158756 end:158756 > snapshots:<start:179480 end:179480 >
|
2020-11-29T14:52:36.933-08:00 [Info] TK StreamEnd MAINT_STREAM default 191 135392966901152 173667
|
7. The restart TS for vb 191 is less than the seqno for which flusher is waiting ...
2020-11-29T14:52:45.741-08:00 [Info] KVSender::openMutationStream restartTsList pool:"default" bucket:"default"
|
'seqnos:173247'
|
'<start:173247 end:173247 > '
|
8. Indexer starts processing rollback and goes into stream recovery. Recovery code finds ongoing flush and decides to wait until flush finishes. This ensures high watermark timestamp for this bucket is greater than or equal to flush timestamp. Looking at the flush ts below, the seqno 173667 is seen by the stream reader.
2020-11-29T14:52:45.745-08:00 [Info] Indexer::startBucketStream Rollback from Projector For Stream MAINT_STREAM Bucket default SessionId 3
|
2020-11-29T14:52:45.745-08:00 [Info] Indexer::handleInitPrepRecovery StreamId MAINT_STREAM Bucket default State STREAM_PREPARE_RECOVERY SessionId 3
|
2020-11-29T14:52:45.745-08:00 [Info] Timekeeper::handleInitPrepRecovery MAINT_STREAM default
|
2020-11-29T14:52:45.745-08:00 [Info] Timekeeper::prepareRecovery StreamId MAINT_STREAM Bucket default
|
2020-11-29T14:52:45.745-08:00 [Info] Timekeeper::prepareRecovery Stream MAINT_STREAM Bucket default State Changed to PREPARE_RECOVERY
|
2020-11-29T14:52:45.745-08:00 [Info] Timekeeper::flushOrAbortInProgressTS Processing Flush TS bucket: default, vbuckets: 1024 Crc64: 2021842410562851897 snapType INMEM_SNAP -
|
{vbno, vbuuid, seqno, snapshot-start, snapshot-end}
|
{ 191 7b23a1b27da0 173667 173247 175791}
|
9. We can see that the flush was triggered at around 2020-11-29T14:52:36, as per the flushMonitor logs.
2020-11-29T14:53:46.881-08:00 [Warn] Timekeeper::flushMonitor Waiting for flush to finish for 70 seconds. Stream MAINT_STREAM Bucket default.
|
10. At 2020-11-29T14:52:36 has also received flush abort request. Where we can see that the seqnos for Flush TS are less the seqnos for which flusher is waiting.
2020-11-29T14:52:36.856-08:00 [Info] Timekeeper::processFlushAbort Flush Abort Received MAINT_STREAM default
|
FlushTs bucket: default, vbuckets: 1024 Crc64: 9616317055526481323 snapType INMEM_SNAP -
|
{vbno, vbuuid, seqno, snapshot-start, snapshot-end}
|
{ 189 129803b17ebc 186635 186359 188009
|
{ 191 7b23a1b27da0 173667 173247 175791}
|
{ 248 b6b637351efe 148443 148009 149970}
|
|
LastFlushTs bucket: default, vbuckets: 1024 Crc64: 9616317055526481323 snapType INMEM_SNAP -
|
{vbno, vbuuid, seqno, snapshot-start, snapshot-end}
|
{ 189 129803b17ebc 186630 186359 188009}
|
{ 191 7b23a1b27da0 173547 173247 175791}
|
{ 248 b6b637351efe 148389 148009 149970}
|
|
2020-11-29T14:52:36.859-08:00 [Info] Timekeeper::processFlushAbort MAINT_STREAM default 3 Generate InitPrepRecovery
|
2020-11-29T14:52:36.859-08:00 [Info] Indexer::handleInitPrepRecovery StreamId MAINT_STREAM Bucket default SessionId 3. Cannot initiate another recovery while previous recovery in progress.
|
Checking if this is same as
MB-36980...