Couchbase Server / MB-50097

Performance: Scans are failing when both KV and Indexer are in DGM with rebalance in


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: User Error
    • Affects Version: 7.1.0
    • None
    • Component: secondary-index
    • Untriaged
    • 1
    • Unknown

    Description

      Test: 

      1 bucket, 10 collections, 275M docs, uneven data across collections, mixed replicas (0-2), 3-way partitioned indexes, 600 index instances in total, 3 indexer nodes; rebalance in of 1 index node.

      KV DGM: ~20% resident, and it reached 0% because of background mutations.

      Indexer DGM: ~10%.
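      For context, the resident ratio (DGM) of an index can be read off the per-index stats quoted later in this ticket (recs_in_mem, recs_on_disk, resident_percent). A minimal Go sketch of that relationship, using values from the index6-9 stats line further down; illustrative only, not indexer code:

      package main

      import "fmt"

      // residentPercent approximates DGM: the fraction of index records held in memory.
      func residentPercent(recsInMem, recsOnDisk uint64) float64 {
          total := recsInMem + recsOnDisk
          if total == 0 {
              return 100.0 // an empty index is trivially fully resident
          }
          return 100.0 * float64(recsInMem) / float64(total)
      }

      func main() {
          // recs_in_mem:404923, recs_on_disk:9609987 -> ~4%, matching the reported resident_percent:4.
          fmt.Printf("resident ratio: %.1f%%\n", residentPercent(404923, 9609987))
      }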

       

      Seeing the following errors in the scans:

      03:09:30 2021-12-14T13:39:30.643-08:00 [Error] [GsiScanClient:"172.23.110.55:9101"] Range(./opt/couchbase/bin/cbindexperf2282) response failed `Index scan timed out`
      03:09:30 2021-12-14T13:39:30.643-08:00 [Error] [GsiScanClient:"172.23.110.55:9101"] Range(./opt/couchbase/bin/cbindexperf2279) response failed `Index scan timed out`
      03:09:30 2021-12-14T13:39:30.643-08:00 [Error] PickRandom: Fail to find indexer for all index partitions. Num partition 3.  Partition with instances 2 
      03:09:30 REQ:1 scan error occurred:  Index scan timed out from [172.23.110.72:9101 172.23.110.55:9101]
      03:09:30 2021-12-14T13:39:30.643-08:00 [Error] [GsiScanClient:"172.23.110.55:9101"] Range(./opt/couchbase/bin/cbindexperf2426) response failed `Index scan timed out`
      03:09:30 2021-12-14T13:39:30.643-08:00 [Error] PickRandom: Fail to find indexer for all index partitions. Num partition 3.  Partition with instances 2 
      03:09:30 REQ:1 scan error occurred:  Index scan timed out from [172.23.110.72:9101 172.23.110.72:9101 172.23.110.55:9101]
      03:09:30 2021-12-14T13:39:30.643-08:00 [Error] PickRandom: Fail to find indexer for all index partitions. Num partition 3.  Partition with instances 0 
      03:09:30 REQ:1 scan error occurred:  Index scan timed out from [172.23.110.56:9101 172.23.110.55:9101]
      03:09:30 2021-12-14T13:39:30.649-08:00 [Error] [GsiScanClient:"172.23.110.55:9101"] Range(./opt/couchbase/bin/cbindexperf2284) response failed `Index scan timed out`
      03:09:30 2021-12-14T13:39:30.649-08:00 [Error] PickRandom: Fail to find indexer for all index partitions. Num partition 3.  Partition with instances 2 
      03:09:30 REQ:1 scan error occurred:  Index scan timed out from [172.23.110.55:9101 172.23.110.72:9101]
      03:09:30 2021-12-14T13:39:30.654-08:00 [Error] [GsiScanClient:"172.23.110.55:9101"] Range(./opt/couchbase/bin/cbindexperf2430) response failed `Index scan timed out`
      03:09:30 2021-12-14T13:39:30.654-08:00 [Error] PickRandom: Fail to find indexer for all index partitions. Num partition 3.  Partition with instances 2 
      03:09:30 2021-12-14T13:39:30.654-08:00 [Error] [GsiScanClient:"172.23.110.55:9101"] Range(./opt/couchbase/bin/cbindexperf2431) response failed `Index scan timed out`
      03:09:30 2021-12-14T13:39:30.654-08:00 [Error] PickRandom: Fail to find indexer for all index partitions. Num partition 3.  Partition with instances 2 
      03:09:30 2021-12-14T13:39:30.664-08:00 [Error] [GsiScanClient:"172.23.110.55:9101"] Range(./opt/couchbase/bin/cbindexperf2292) response failed `Index scan timed out`
      03:09:30 2021-12-14T13:39:30.664-08:00 [Error] PickRandom: Fail to find indexer for all index partitions. Num partition 3.  Partition with instances 2 
      03:09:30 REQ:1 scan error occurred:  Index scan timed out from [172.23.110.72:9101 172.23.110.72:9101 172.23.110.55:9101] 
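      The PickRandom errors above indicate that, for a 3-partition index, the scan client could find a usable instance for only 2 (or 0) of the partitions while the rebalance was in flight. A rough Go sketch of that idea; hypothetical helper, not the actual GsiScanClient code:

      package main

      import (
          "fmt"
          "math/rand"
      )

      // pickEndpoints picks one indexer endpoint per partition and fails if any
      // partition has no live candidate (e.g. its instance timed out or moved).
      func pickEndpoints(candidates map[int][]string, numPartitions int) (map[int]string, error) {
          withInstances := 0
          for p := 0; p < numPartitions; p++ {
              if len(candidates[p]) > 0 {
                  withInstances++
              }
          }
          if withInstances < numPartitions {
              return nil, fmt.Errorf("fail to find indexer for all index partitions: num partitions %d, partitions with instances %d",
                  numPartitions, withInstances)
          }
          picked := make(map[int]string, numPartitions)
          for p := 0; p < numPartitions; p++ {
              eps := candidates[p]
              picked[p] = eps[rand.Intn(len(eps))]
          }
          return picked, nil
      }

      func main() {
          // Made-up example: partition 2 has no usable instance during rebalance.
          candidates := map[int][]string{
              0: {"172.23.110.55:9101", "172.23.110.72:9101"},
              1: {"172.23.110.72:9101"},
          }
          if _, err := pickEndpoints(candidates, 3); err != nil {
              fmt.Println("scan error:", err)
          }
      }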

      Test job: http://perf.jenkins.couchbase.com/job/aether/1240/console

      Cbmonitor: 

      http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=aether_710-1887_rebalance_e43d

      The same test with rebalance out passed: http://perf.jenkins.couchbase.com/job/aether/1238/console

      Attachments


        Activity

          ritam.sharma Ritam Sharma added a comment -

          Vikas Chaudhary - Is this a regression?

          vikas.chaudhary Vikas Chaudhary added a comment -

          Ritam Sharma - It's a new test.

          yogendra.acharya Yogendra Acharya added a comment -

          The cause of the scan failures is that the indexer is still waiting for the snapshot needed by the scan request:

          2021-12-14T13:39:01.939-08:00 [Info] SCAN##2373 Error in getRequestedIndexSnapshot Index scan timed out
          2021-12-14T13:39:01.939-08:00 [Info] SCAN##2373::logSnapInfoAtTimeout Vbucket seqno in snapTs is lagging reqTs at timeout. InstId: 13902284464657407248, snapTs.Seqnos[0]: 564406, reqTs.Seqnos[0]: 579502
          2021-12-14T13:39:01.939-08:00 [Info] SCAN##2373 REQUEST defnId:410057540232448091, instId:13902284464657407248, index:bucket-1/index8-8, type:scan, partitions:[1], scans: <ud>([{[1] [2] 3 range [] <nil>}])</ud>, limit:1, consistency:session_consistency, requestId:./opt/couchbase/bin/cbindexperf2384
          2021-12-14T13:39:01.939-08:00 [Info] SCAN##2373 RESPONSE status:(error = Index scan timed out), requestId: ./opt/couchbase/bin/cbindexperf2384
          2021-12-14T13:39:02.035-08:00 [Info] updateSnapMapAndNotify::logSnapInfoAtTimeout Vbucket seqno in snapTs is lagging reqTs at timeout. InstId: 13902284464657407248, snapTs.Seqnos[0]: 564408, reqTs.Seqnos[0]: 57950
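          In other words, a session_consistency scan can only be served once the indexer has a snapshot whose per-vbucket seqnos have caught up with the request timestamp; the logSnapInfoAtTimeout line shows snapTs.Seqnos[0] (564406) still behind reqTs.Seqnos[0] (579502), so the scan waits and eventually times out. A minimal Go sketch of that check; hypothetical helper, not the actual indexer logic:

          package main

          import "fmt"

          // snapshotCovers reports whether every vbucket seqno in the snapshot
          // timestamp has reached the seqno required by the scan request.
          func snapshotCovers(snapSeqnos, reqSeqnos []uint64) bool {
              for vb := range reqSeqnos {
                  if snapSeqnos[vb] < reqSeqnos[vb] {
                      return false // snapshot lagging for this vbucket: scan must wait
                  }
              }
              return true
          }

          func main() {
              // From the log above: snapTs.Seqnos[0]=564406, reqTs.Seqnos[0]=579502.
              fmt.Println(snapshotCovers([]uint64{564406}, []uint64{579502})) // false -> wait, then "Index scan timed out"
          }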

          num_docs_pending is very high, and snapshot generation latency is above 1 s:

          2021-12-14T13:39:37.851-08:00 index_bucket-1:scope-1:collection-8:index8-8:13902284464657407248 { "flush_queue_size":0, "num_docs_pending":16584355, "num_docs_queued":225858, "num_scan_timeouts":47,"num_snapshot_waiters":1,"resident_percent":2, "snapshot_gen_latency_dist":{"(1000ms-5000ms)":932,"(100ms-1000ms)":3952,"(30ms-50ms)":41,"(50ms-100ms)":425}}

          yogendra.acharya Yogendra Acharya added a comment - - edited

          The indexer is slow due to high DGM, and disk utilization appears to be the bottleneck here. As per the cbmonitor graphs (http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=aether_710-1887_rebalance_e43d#e100384d82a4cedec02cac1d5c48b322), data_utll shows consistently 100% or higher utilization.

          From the projector logs on 110.53 we can see that mutChLen and rcvChLen are both at 1000, indicating the indexer is slow:

          2021-12-14T13:39:28.468-08:00 [Info] DCPT[secidx:proj-bucket-1-MAINT_STREAM_TOPIC_7a871c45eb6b7fc901b248b610d85b7a-13820348739579072406/2] ##28 stats: {"bytes":3772368250,"bufferacks":900,"toAckBytes":1185248,"streamreqs":128,"snapshots":63773,"mutations":3470354,"collectionCreate":0,"collectionDrop":0,"collectionFlush":0,"scopeCreate":0,"scopeDrop":0,"collectionChanged":0,"seqnoAdvanced":0,"osoSnapshotStart":0,"osoSnapshotEnd":0,"streamends":0,"closestreams":0,"lastAckTime":1.787456799s,"lastNoopSend":16.525676257s,"lastNoopRecv":16.525691022s,"lastMsgSend":1.787456799s,"lastMsgRecv":887.813744ms,"rcvchLen":1000,"incomingMsg":3535260}
          2021-12-14T13:39:28.469-08:00 [Info] KVDT[<-bucket-1<-127.0.0.1:8091 #MAINT_STREAM_TOPIC_7a871c45eb6b7fc901b248b610d85b7a] ##28 stats: {"eventCount":14135033,"reqCount":512,"endCount":0,"snapStat.samples":255684,"snapStat.min":1,"snapStat.max":15695,"snapStat.avg":66,"upsertCount":13878836,"deleteCount":0,"exprCount":0,"ainstCount":34,"dinstCount":25,"tsCount":0,"mutChLen":1000,"collectionCreate":0,"collectionDrop":0,"collectionFlush":0,"scopeCreate":0,"scopeDrop":0,"collectionChanged":0,"seqnoAdvanced":0,"numDocsProcessed":289158963,"numDocsPending":8330303} 
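          A pegged rcvchLen / mutChLen of 1000 means the bounded queues between the projector and the indexer are full, i.e. the consumer cannot keep up and backpressure propagates upstream. A tiny Go illustration of that effect with a buffered channel; the names are hypothetical and the real projector pipeline is more involved:

          package main

          import (
              "fmt"
              "time"
          )

          func main() {
              mutCh := make(chan int, 1000) // bounded queue, like the 1000-deep channels in the stats

              // Slow consumer: stand-in for a disk-bound indexer.
              go func() {
                  for range mutCh {
                      time.Sleep(time.Millisecond)
                  }
              }()

              // Fast producer: stand-in for DCP mutations arriving from KV.
              for i := 0; i < 5000; i++ {
                  mutCh <- i // blocks whenever the queue is full
              }
              fmt.Println("queue length (rcvchLen analogue):", len(mutCh)) // hovers around 1000
          }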

          Further evidence of disk subsystem slowness, from the indexer logs on node 110.55:

          yogendraacharya@Yogendras-MacBook-Pro cbcollect_info_ns_1@172.23.110.55_20211215-052750 % grep "12-14T13:3[7-9].*Created recovery point.*" ns_server.indexer.log
          2021-12-14T13:37:35.289-08:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 10953906878286803332, PartitionId 3 Created recovery point (took 1h5m13.454246981s)
          2021-12-14T13:38:20.875-08:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 5773277496167614067, PartitionId 3 Created recovery point (took 1h5m59.009447114s)
          2021-12-14T13:38:21.206-08:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 593601200835328506, PartitionId 2 Created recovery point (took 1h5m59.054128074s)
          2021-12-14T13:38:30.415-08:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 8700133237930603353, PartitionId 1 Created recovery point (took 1h6m8.029553564s)
          2021-12-14T13:38:42.437-08:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 2362769368038117700, PartitionId 3 Created recovery point (took 1h6m17.162656301s)
          2021-12-14T13:38:43.338-08:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 15738064979878177523, PartitionId 3 Created recovery point (took 1h4m46.029806599s)
          2021-12-14T13:38:43.435-08:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 2401943159049586821, PartitionId 2 Created recovery point (took 1h2m43.142353309s)
          2021-12-14T13:38:43.737-08:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 8410473484300152842, PartitionId 2 Created recovery point (took 1h2m19.986989164s)
          2021-12-14T13:38:44.050-08:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 16324477407284375889, PartitionId 1 Created recovery point (took 57m53.807281703s)
          2021-12-14T13:39:37.240-08:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 14534674835979403970, PartitionId 1 Created recovery point (took 58m46.954088254s)
          2021-12-14T13:39:37.755-08:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 3295221149229327593, PartitionId 1 Created recovery point (took 58m47.463711111s)
          2021-12-14T13:39:37.904-08:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 5462993316292563902, PartitionId 3 Created recovery point (took 58m47.60407019s)
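          Recovery point creation times of roughly an hour are consistent with a saturated disk. A quick offline Go sketch (diagnostic helper only, not product code) for pulling the "took" durations out of lines like the ones above and averaging them:

          package main

          import (
              "fmt"
              "regexp"
              "time"
          )

          var tookRE = regexp.MustCompile(`Created recovery point \(took ([^)]+)\)`)

          func main() {
              logLines := []string{
                  "2021-12-14T13:37:35.289-08:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 10953906878286803332, PartitionId 3 Created recovery point (took 1h5m13.454246981s)",
                  "2021-12-14T13:39:37.904-08:00 [Info] PlasmaSlice Slice Id 0, IndexInstId 5462993316292563902, PartitionId 3 Created recovery point (took 58m47.60407019s)",
              }

              var total time.Duration
              var n int
              for _, line := range logLines {
                  if m := tookRE.FindStringSubmatch(line); m != nil {
                      if d, err := time.ParseDuration(m[1]); err == nil {
                          total += d
                          n++
                      }
                  }
              }
              if n > 0 {
                  fmt.Printf("average recovery point time over %d samples: %s\n", n, total/time.Duration(n))
              }
          }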

          Also, the resident ratio of many indexes is well below 10%, and free memory as per the Plasma tuner is also below 10%:

          2021-12-14T13:39:01.913-08:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 49.492847% blocked
          2021-12-14T13:39:11.934-08:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 49.543744% blocked
          2021-12-14T13:39:15.636-08:00 [Info] Plasma: Adaptive memory quota tuning RSS:120424849408, freePercent:7.041058248480157, currentQuota=95745346985, heapSz=124371731828, limitOn=false
          2021-12-14T13:39:17.224-08:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 49.543608% blocked
          2021-12-14T13:39:17.237-08:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 49.570755% blocked
          2021-12-14T13:39:22.064-08:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 49.595100% blocked
          2021-12-14T13:39:25.639-08:00 [Info] Plasma: Adaptive memory quota tuning RSS:120437108736, freePercent:7.193576240760824, currentQuota=95745346985, heapSz=123087368683, limitOn=false
          2021-12-14T13:39:25.648-08:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 49.586493% blocked
          2021-12-14T13:39:26.665-08:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 49.618473% blocked
           
           
          2021-12-14T13:39:37.844-08:00 indexer {"avg_disk_bps":562588129,"avg_drain_rate":6869,"avg_mutation_rate":40139,"cpu_utilization":935.6,"memory_free":1842851840,"memory_rss":116881563648,"memory_total_storage":124102139904,"memory_used":127486716120,"memory_used_queue":381905969,"memory_used_storage":94997167922,"num_cgo_call":13649120313,"num_goroutine":7836,"timestamp":"1639517977843985713","timings/stats_response":"9136 24627139401 1197636947028960717","total_data_size":300657937807,"total_disk_size":146874113167,"uptime":"3h16m4.949182857s"}
          2021-12-14T13:39:37.844-08:00 index_bucket-1:scope-1:collection-6:index6-9 (replica 1):15344651329982121655 {"avg_ts_interval":584699636,"avg_ts_items_count":217,"data_size":919019293,"data_size_on_disk":350739155,"disk_size":747048715,"flush_queue_size":0,"get_bytes":11401109504,"insert_bytes":37622632448,"log_space_on_disk":707449818,"memory_used":122635748,"num_docs_indexed":11036194,"num_docs_pending":16584355,"num_docs_processed":578223606,"num_docs_queued":225858,"num_flush_queued":11036194,"progress_stat_time":"1639517974054084087","recs_in_mem":404923,"recs_on_disk":9609987,"resident_percent":4,"since_last_snapshot":630908749,"snapshot_gen_latency_dist":{"(1000ms-5000ms)":1164,"(100ms-1000ms)":3887,"(50ms-100ms)":303}}
          2021-12-14T13:39:37.844-08:00 index_bucket-1:scope-1:collection-10:index10-7 (replica 1):13002947246258695397 {"avg_ts_interval":513949678,"avg_ts_items_count":190,"data_size":1528560420,"data_size_on_disk":413920085,"disk_size":954667401,"flush_queue_size":1,"get_bytes":10811944960,"insert_bytes":39807483904,"log_space_on_disk":893679203,"memory_used":93083679,"num_docs_indexed":17722552,"num_docs_pending":16584355,"num_docs_processed":578223606,"num_docs_queued":225858,"num_flush_queued":17722553,"progress_stat_time":"1639517974054084087","recs_in_mem":36525,"recs_on_disk":16656513,"resident_percent":0,"since_last_snapshot":319943651,"snapshot_gen_latency_dist":{"(1000ms-5000ms)":1120,"(100ms-1000ms)":3985,"(50ms-100ms)":260}}
          2021-12-14T13:39:37.844-08:00 index_bucket-1:scope-1:collection-1:index1-9:1215949941826829021 {"avg_ts_interval":530203235,"avg_ts_items_count":192,"data_size":150598711,"data_size_on_disk":53427719,"disk_size":101503896,"flush_queue_size":0,"frag_percent":46,"get_bytes":6816403456,"insert_bytes":28152713216,"log_space_on_disk":99501032,"memory_used":19021618,"num_docs_indexed":2651859,"num_docs_pending":16584355,"num_docs_processed":578223606,"num_docs_queued":225858,"num_flush_queued":2651859,"progress_stat_time":"1639517974054084087","recs_in_mem":0,"recs_on_disk":1666401,"since_last_snapshot":498752346,"snapshot_gen_latency_dist":{"(1000ms-5000ms)":588,"(100ms-1000ms)":4172,"(50ms-100ms)":478}}
          2021-12-14T13:39:37.844-08:00 index_bucket-1:scope-1:collection-4:index4-6:15563223294563529092 {"avg_ts_interval":575070728,"avg_ts_items_count":186,"data_size":569177761,"data_size_on_disk":192907450,"disk_size":401590701,"flush_queue_size":0,"get_bytes":8346054656,"insert_bytes":37622632448,"log_space_on_disk":390705807,"memory_used":164682297,"num_docs_indexed":7680480,"num_docs_pending":16584355,"num_docs_processed":578223606,"num_docs_queued":225858,"num_flush_queued":7680480,"progress_stat_time":"1639517974054084087","recs_in_mem":123759,"recs_on_disk":6541301,"since_last_snapshot":612310347,"snapshot_gen_latency_dist":{"(1000ms-5000ms)":1013,"(100ms-1000ms)":3964,"(30ms-50ms)":28,"(50ms-100ms)":354}}
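          A back-of-the-envelope check on the indexer-level stats above (assuming avg_drain_rate and avg_mutation_rate are items/sec, which is an assumption): with the drain rate far below the incoming mutation rate, the ~16.5M-document backlog cannot shrink, which is why session_consistency scans keep timing out.

          package main

          import "fmt"

          func main() {
              const (
                  numDocsPending  = 16584355.0 // from the stats line above
                  avgDrainRate    = 6869.0     // items/sec (assumed unit)
                  avgMutationRate = 40139.0    // items/sec (assumed unit)
              )

              // Even with no new mutations, draining the backlog would take ~40 minutes.
              fmt.Printf("catch-up time with no new mutations: ~%.0f min\n", numDocsPending/avgDrainRate/60)
              // With mutations arriving faster than they drain, the backlog grows instead.
              fmt.Printf("net backlog growth: ~%.0f items/sec\n", avgMutationRate-avgDrainRate)
          }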

          yogendra.acharya Yogendra Acharya added a comment - - edited

          From these logs, so far I do not see a problem with the indexer; the disk bottleneck and high DGM are causing the indexer to be slow. As a result the indexer has not caught up with mutations, and hence session_consistency queries are failing with a timeout while waiting for the snapshot to become available.

          I think we need to relook at the test, as it's not really 10% DGM but well below that; in addition, the disk I/O bottleneck needs to be reduced.


          yogendra.acharya Yogendra Acharya added a comment -

          Vikas Chaudhary - As discussed over Slack, you can retry the test with appropriate sizing to ensure closer to 10% RR for the indexes. Let me know if you still face the issue with better-sized nodes or a modified test.


          People

            Assignee: vikas.chaudhary Vikas Chaudhary
            Reporter: vikas.chaudhary Vikas Chaudhary
            Votes: 0
            Watchers: 5


              Gerrit Reviews

                There are no open Gerrit changes
