1 bucket, 10 collections, 275M docs, uneven data across collections, mixed replicas (0-2), 3 partitioned, 600 index instances in total, 3 indexer nodes, rebalance-in of 1 index node.
KV DGM was ~20% and reached 0 because of background mutations.
Further evidence of disk subsystem slowness in the indexer logs from node 110.55:
yogendraacharya@Yogendras-MacBook-Pro cbcollect_info_ns_1@172.23.110.55_20211215-052750 % grep "12-14T13:3[7-9].*Created recovery point.*" ns_server.indexer.log
2021-12-14T13:37:35.289-08:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 10953906878286803332, PartitionId 3 Created recovery point (took 1h5m13.454246981s)
2021-12-14T13:38:20.875-08:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 5773277496167614067, PartitionId 3 Created recovery point (took 1h5m59.009447114s)
2021-12-14T13:38:21.206-08:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 593601200835328506, PartitionId 2 Created recovery point (took 1h5m59.054128074s)
2021-12-14T13:38:30.415-08:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 8700133237930603353, PartitionId 1 Created recovery point (took 1h6m8.029553564s)
2021-12-14T13:38:42.437-08:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 2362769368038117700, PartitionId 3 Created recovery point (took 1h6m17.162656301s)
2021-12-14T13:38:43.338-08:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 15738064979878177523, PartitionId 3 Created recovery point (took 1h4m46.029806599s)
2021-12-14T13:38:43.435-08:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 2401943159049586821, PartitionId 2 Created recovery point (took 1h2m43.142353309s)
2021-12-14T13:38:43.737-08:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 8410473484300152842, PartitionId 2 Created recovery point (took 1h2m19.986989164s)
2021-12-14T13:38:44.050-08:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 16324477407284375889, PartitionId 1 Created recovery point (took 57m53.807281703s)
2021-12-14T13:39:37.240-08:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 14534674835979403970, PartitionId 1 Created recovery point (took 58m46.954088254s)
2021-12-14T13:39:37.755-08:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 3295221149229327593, PartitionId 1 Created recovery point (took 58m47.463711111s)
2021-12-14T13:39:37.904-08:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 5462993316292563902, PartitionId 3 Created recovery point (took 58m47.60407019s)
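For anyone who wants to compare these recovery point times across nodes or runs, the durations can be pulled out of the log lines and converted to seconds. The following is only a rough sketch: the helper names `parse_go_duration` and `recovery_point_seconds` are made up for illustration, and it assumes the log format shown above (a Go-style duration inside `(took ...)`).

```python
import re

# Assumption: recovery point lines end with "(took <Go-style duration>)",
# as in the indexer log excerpt above.
TOOK_RE = re.compile(r"Created recovery point \(took ([^)]+)\)")

# One "<number><unit>" component of a Go-style duration, e.g. "1h", "5m", "13.45s".
# Longer unit names come first so "ms" is not read as "m" followed by "s".
UNIT_RE = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|h|m|s)")

UNIT_SECONDS = {
    "h": 3600.0,
    "m": 60.0,
    "s": 1.0,
    "ms": 1e-3,
    "us": 1e-6,
    "µs": 1e-6,
    "ns": 1e-9,
}


def parse_go_duration(text):
    """Convert a duration such as '1h5m13.454246981s' to seconds."""
    return sum(float(value) * UNIT_SECONDS[unit]
               for value, unit in UNIT_RE.findall(text))


def recovery_point_seconds(lines):
    """Return the recovery point duration (in seconds) for each matching line."""
    durations = []
    for line in lines:
        match = TOOK_RE.search(line)
        if match:
            durations.append(parse_go_duration(match.group(1)))
    return durations
```

Passing the lines of `ns_server.indexer.log` to `recovery_point_seconds` and taking `max()` or a mean of the result gives a quick number to compare before and after resizing the nodes.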
Also, the resident ratio of many indexes is well below 10%, and free memory as reported by the plasma tuner is also below 10%.
From these logs, I do not see a problem with the indexer so far; the disk bottleneck and high DGM are causing the indexer to be slow. As a result, the indexer has not caught up with mutations, and session_consistency queries are failing with timeouts while waiting for a snapshot to become available.
I think we need to revisit the test, as it is not really at 10% DGM but well below that, and the disk I/O bottleneck needs to be reduced.
Yogendra Acharya
Vikas Chaudhary, as discussed over Slack, you can retry the test with appropriate sizing to bring the index RR closer to 10%. Let me know if you still face the issue with better-sized nodes or a modified test.
Yogendra Acharya
Vikas Chaudhary - Is this a regression?