Details
- Bug
- Resolution: Unresolved
- Major
- 6.6.6, 7.2.0, 7.1.3
- Untriaged
- 0
- Unknown
Description
================================================================
1. The problem starts with the indexer skipping the build completion disk snapshot due to MB-46797.
2. These are the last disk snapshots created before build completion (2 indexes, 3 partitions):
2023-03-18T17:20:50.439-07:00 [Info] plasmaSlice::OpenSnapshot SliceId 0 IndexInstId 17574301434787230136 PartitionId 4 Creating New Snapshot SnapshotInfo: count:513621127 committed:true
2023-03-18T17:20:50.439-07:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 17574301434787230136, PartitionId 4 Creating recovery point ...
2023-03-18T17:20:50.466-07:00 [Info] plasmaSlice::OpenSnapshot SliceId 0 IndexInstId 3887017921767071747 PartitionId 2 Creating New Snapshot SnapshotInfo: count:514303824 committed:true
2023-03-18T17:20:50.466-07:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 3887017921767071747, PartitionId 2 Creating recovery point ...
2023-03-18T17:20:50.466-07:00 [Info] plasmaSlice::OpenSnapshot SliceId 0 IndexInstId 3887017921767071747 PartitionId 5 Creating New Snapshot SnapshotInfo: count:513387893 committed:true
2023-03-18T17:20:50.466-07:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 3887017921767071747, PartitionId 5 Creating recovery point ...
3. One index/partition finishes creating its recovery point:
2023-03-18T17:25:21.930-07:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 17574301434787230136, PartitionId 4 Created recovery point (took 4m31.491268326s)
2023-03-18T17:25:21.933-07:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 17574301434787230136, PartitionId 4 Cleanup mainstore recovery point SnapshotInfo: count:498180407 committed:false. num RPs 3.
4. The next set of disk snapshots (the build completion disk snapshots) gets created:
2023-03-18T17:25:52.715-07:00 [Info] plasmaSlice::OpenSnapshot SliceId 0 IndexInstId 17574301434787230136 PartitionId 4 Creating New Snapshot SnapshotInfo: count:528652160 committed:true
2023-03-18T17:25:52.715-07:00 [Info] plasmaSlice::OpenSnapshot SliceId 0 IndexInstId 3887017921767071747 PartitionId 5 Creating New Snapshot SnapshotInfo: count:528416182 committed:true
2023-03-18T17:25:52.717-07:00 [Info] plasmaSlice::OpenSnapshot SliceId 0 IndexInstId 3887017921767071747 PartitionId 2 Creating New Snapshot SnapshotInfo: count:529367068 committed:true
5. Only one disk snapshot actually gets initiated; the other two are skipped because the disk snapshots triggered in Step 2 are still in progress:
2023-03-18T17:25:52.716-07:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 17574301434787230136, PartitionId 4 Creating recovery point ...
2023-03-18T17:25:52.715-07:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 3887017921767071747, PartitionId 5 Skipping ondisk snapshot. A snapshot writer is in progress.
2023-03-18T17:25:52.717-07:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 3887017921767071747, PartitionId 2 Skipping ondisk snapshot. A snapshot writer is in progress.
6. The pending disk snapshots from Step 2 finish:
2023-03-18T17:26:02.935-07:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 3887017921767071747, PartitionId 2 Created recovery point (took 5m12.468437027s)
2023-03-18T17:26:02.938-07:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 3887017921767071747, PartitionId 2 Cleanup mainstore recovery point SnapshotInfo: count:498844929 committed:false. num RPs 3.
2023-03-18T17:26:59.286-07:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 3887017921767071747, PartitionId 5 Created recovery point (took 6m8.819277349s)
2023-03-18T17:26:59.289-07:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 3887017921767071747, PartitionId 5 Cleanup mainstore recovery point SnapshotInfo: count:497955619 committed:false. num RPs 3.
7. The disk snapshot initiated in Step 5 gets created:
2023-03-18T17:27:59.523-07:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 17574301434787230136, PartitionId 4 Created recovery point (took 2m6.807509695s)
2023-03-18T17:27:59.527-07:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 17574301434787230136, PartitionId 4 Cleanup mainstore recovery point SnapshotInfo: count:513621127 committed:false. num RPs 3.
8. Now the skipped disk snapshots would only get created if there were more mutations in the bucket, which does not happen in this case.
9. When the indexer restarts, it computes the minimum seqno for each vbucket to restart the DCP stream. Note that the seqnos of some vbuckets are lower than their snapshot-end:
2023-03-18T20:20:21.568-07:00 [Info] Indexer::startBucketStream Stream: MAINT_STREAM Bucket: MLEhist
{vbno, vbuuid, seqno, snapshot-start, snapshot-end}
{ 72 3150111d5795 5766646 0 8700722}
{ 80 f4943cd4534f 6140788 0 8697218}
{ 96 d6c0afb9feb1 6069126 0 8675696}
{ 158 fac37da0dcc8 3830175 0 8699706}
{ 195 4190b0fe34f0 4255513 0 8683042}
10. This leads to a rollback to 0, as the KV purge_seqno is already higher:
2023-03-18T20:20:22.612-07:00 [Warn] Timekeeper::handleStreamBegin StreamBegin rollback for StreamId MAINT_STREAM Bucket MLEhist vbucket 72. Rollback (0, 3150111d5795).
2023-03-18T20:20:22.618-07:00 [Warn] Timekeeper::handleStreamBegin StreamBegin rollback for StreamId MAINT_STREAM Bucket MLEhist vbucket 80. Rollback (0, f4943cd4534f).
2023-03-18T20:20:22.637-07:00 [Warn] Timekeeper::handleStreamBegin StreamBegin rollback for StreamId MAINT_STREAM Bucket MLEhist vbucket 96. Rollback (0, d6c0afb9feb1).
2023-03-18T20:20:23.209-07:00 [Warn] Timekeeper::handleStreamBegin StreamBegin rollback for StreamId MAINT_STREAM Bucket MLEhist vbucket 158. Rollback (0, fac37da0dcc8).
2023-03-18T20:20:23.223-07:00 [Warn] Timekeeper::handleStreamBegin StreamBegin rollback for StreamId MAINT_STREAM Bucket MLEhist vbucket 166. Rollback (0, ae4aed62ffd3).
2023-03-18T20:20:23.491-07:00 [Warn] Timekeeper::handleStreamBegin StreamBegin rollback for StreamId MAINT_STREAM Bucket MLEhist vbucket 195. Rollback (0, 4190b0fe34f0).
11. In case of a rollback to 0, the Storage Manager tries all disk snapshots to start the DCP streams, beginning with the latest snapshots:
2023-03-18T20:20:34.200-07:00 [Info] StorageMgr::handleRollback latestSnapInfo SnapshotInfo: count:533326858 committed:false lastRollbackTs <nil>. Use latest snapshot.
2023-03-18T20:20:45.892-07:00 [Info] StorageMgr::handleRollback Rollback Index: 17574301434787230136 PartitionId: 4 SliceId: 0 To Snapshot SnapshotInfo: count:533326858 committed:false
2023-03-18T20:21:01.009-07:00 [Info] StorageMgr::handleRollback Rollback Index: 4260340739290744432 PartitionId: 3 SliceId: 0 To Snapshot SnapshotInfo: count:711175443 committed:false
2023-03-18T20:21:07.610-07:00 [Info] StorageMgr::handleRollback Rollback Index: 8324473573058655730 PartitionId: 5 SliceId: 0 To Snapshot SnapshotInfo: count:711219055 committed:false
2023-03-18T20:21:15.218-07:00 [Info] StorageMgr::handleRollback Rollback Index: 3887017921767071747 PartitionId: 2 SliceId: 0 To Snapshot SnapshotInfo: count:529367068 committed:false
2023-03-18T20:21:21.131-07:00 [Info] StorageMgr::handleRollback Rollback Index: 3887017921767071747 PartitionId: 5 SliceId: 0 To Snapshot SnapshotInfo: count:528416182 committed:false
2023-03-18T20:21:22.821-07:00 [Info] Indexer::startBucketStream Stream: MAINT_STREAM Bucket: MLEhist
{ 72 3150111d5795 5766646 0 8700722}
{ 80 f4943cd4534f 6140788 0 8697218}
{ 96 d6c0afb9feb1 6069126 0 8675696}
12. It then retries with the next older snapshots:
2023-03-18T20:21:35.422-07:00 [Info] StorageMgr::handleRollback Discarding Already Used Snapshot SnapshotInfo: count:529367068 committed:false. Using Next snapshot SnapshotInfo: count:514303824 committed:false
2023-03-18T20:21:57.137-07:00 [Info] StorageMgr::handleRollback Discarding Already Used Snapshot SnapshotInfo: count:528416182 committed:false. Using Next snapshot SnapshotInfo: count:513387893 committed:false
2023-03-18T20:22:03.031-07:00 [Info] StorageMgr::handleRollback Discarding Already Used Snapshot SnapshotInfo: count:533326858 committed:false. Using Next snapshot SnapshotInfo: count:528652160 committed:false
2023-03-18T20:22:09.011-07:00 [Info] StorageMgr::handleRollback Discarding Already Used Snapshot SnapshotInfo: count:711175443 committed:false. Using Next snapshot SnapshotInfo: count:711175443 committed:false
2023-03-18T20:22:16.733-07:00 [Info] StorageMgr::handleRollback Discarding Already Used Snapshot SnapshotInfo: count:711219055 committed:false. Using Next snapshot SnapshotInfo: count:711219055 committed:false
13. But this time, it does not compute the restart timestamp correctly:
2023-03-18T20:22:25.364-07:00 [Info] Indexer::startBucketStream Stream: MAINT_STREAM Bucket: MLEhist
{ 72 3150111d5795 8700722 8700722 8700722}
{ 80 f4943cd4534f 8697218 8697218 8697218}
{ 96 d6c0afb9feb1 8675696 8675696 8675696}
The minimum restart timestamp should have been lower than the one used in the previous attempt, but it is actually greater. As a result, the rolled-back mutations are never streamed back from DCP.