Couchbase Server / MB-32162

Add more GSI stats to UI and periodic stats

Details

    Description

      Consider adding stats like the ones below to the UI as well as to periodic stats. The list needs further pruning to arrive at a useful set of stats.

      Projector CPU/Memory
      Mutation Queued docs vs Pending docs vs Flush queued docs
      Disk IO stats
      Warmup progress
      Number of outstanding scans
      Number of scan timeouts/scan errors
      Largest key size
      JEMALLOC stats
      End-to-end scan latency (including GsiClient)

        Activity

          ajay.bhullar Ajay Bhullar added a comment -

          Prathibha Bisarahalli hi, I need some more information on certain stats.

          The num_items_flushed, what is this specifically referring to? The amount of items deleted from the index since the index was created?

          initial_build_progress, I assume this is the stat that the UI pulls its build percentage from. What is new about it, or am I wrong and it is not the build progress of the index?

          avg_drain_rate, not sure what this stat is supposed to track

          outstanding scans, does this mean the number of scans pending before a query completes/ what is a good way to test that this stat is being updated properly?

          num_scan_timeouts/num_scan_errors - what is the best way to trigger a scan timeout/scan error

          num_pending_requests, this is the backlog of scan requests right? How is this different than outstanding scans stat

          And finally the key_size_distribution and arrkey_size_distribution are existing stats that should now be persisted after the snapshot interval passes correct? Are there any other extra stats that should now be persisted that were not previously in the new stats added?

          ajay.bhullar Ajay Bhullar added a comment -

          Prathibha Bisarahalli is there any UI impact in this change that also needs to be tested?

          prathibha Prathibha Bisarahalli (Inactive) added a comment - edited

          Hi Ajay, answers below:

          1. The num_items_flushed, what is this specifically referring to? The amount of items deleted from the index since the index was created?

          This is the total number of items flushed (sent) to storage. It is not a new stat; it is an existing stat in PeriodicStat that is now exposed through the official REST API. If 10 documents are indexed and each document emits one entry (e.g. a non-array index), then num_items_flushed will be 10. This stat can be obtained per index or per partition.

          2. initial_build_progress, I assume this is the stat that the UI pulls its build percentage from. What is new about it, or am I wrong and it is not the build progress of the index?

          Yes, this is the initial build progress, the same as in the UI.

          3. avg_drain_rate, not sure what this stat is supposed to track

          It is the rate of num_items_flushed as a simple moving average. IMO, there is no need to test these stats as they are not new.
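As an illustration of "rate of num_items_flushed as a simple moving average", here is a minimal sketch. The class name, the window size, and the sampling interval are assumptions made for the example; the comment above does not say how the indexer samples the counter or how many samples it averages.

```python
from collections import deque


class DrainRateSMA:
    """Simple moving average of the flush rate (illustrative, not indexer code)."""

    def __init__(self, window=5):
        # window: number of most recent rate samples to average (assumed value)
        self.samples = deque(maxlen=window)
        self.last_flushed = None

    def observe(self, num_items_flushed, interval_secs=1.0):
        """Feed the cumulative counter; return the current average rate."""
        if self.last_flushed is not None:
            # Rate for this interval = items flushed since the last observation.
            rate = (num_items_flushed - self.last_flushed) / interval_secs
            self.samples.append(rate)
        self.last_flushed = num_items_flushed
        if not self.samples:
            return 0.0
        return sum(self.samples) / len(self.samples)
```

Feeding the cumulative counter once per interval yields a rate that smooths out short bursts, since only the last `window` samples contribute.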

          4. outstanding scans, does this mean the number of scans pending before a query completes/ what is a good way to test that this stat is being updated properly?

          Sorry, outstanding scans is the same as num_pending_requests. I modified the summary in my previous comment. num_pending_requests = num_requests - numCompletedRequests
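The relationship above can be sketched with two counters. The class and method names below are hypothetical and only mirror the formula num_pending_requests = num_requests - numCompletedRequests; they are not the indexer's actual types.

```python
class ScanCounters:
    """Hypothetical counters mirroring the formula in the comment above."""

    def __init__(self):
        self.num_requests = 0            # scans received so far
        self.num_completed_requests = 0  # scans that have finished

    def request_received(self):
        self.num_requests += 1

    def request_completed(self):
        self.num_completed_requests += 1

    @property
    def num_pending_requests(self):
        # Outstanding scans: received but not yet completed.
        return self.num_requests - self.num_completed_requests
```

Under this model, one way to check that the stat updates is to issue several long-running scans concurrently and confirm the value rises while they run and returns to zero once they finish.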

          5. num_scan_timeouts/num_scan_errors - what is the best way to trigger a scan timeout/scan error

          A timeout can be triggered by having a very large backlog in indexing, or in mutations travelling from the projector to the indexer, and then running a stale=false query. The scan timeout setting can be reduced from its default of 2 minutes to 5s or less to trigger a scan timeout. The num_scan_errors stat is only incremented when a scan error does not fall into any known category such as ErrClientCancel, ErrScanTimedOut, or ErrIndexNotReady, so it is hard to trigger that error.
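The counting rule described above can be sketched as follows. The error names come from the comment; the function name and the use of plain strings for errors are assumptions for illustration. The comment does not say how ErrClientCancel and ErrIndexNotReady are accounted for, so this sketch simply leaves them out of both counters.

```python
from collections import Counter

# Known categories named in the comment above.
KNOWN_CATEGORIES = {"ErrClientCancel", "ErrScanTimedOut", "ErrIndexNotReady"}


def record_scan_failure(stats, err_name):
    """Update the two counters for one failed scan (illustrative only)."""
    if err_name == "ErrScanTimedOut":
        stats["num_scan_timeouts"] += 1
    elif err_name not in KNOWN_CATEGORIES:
        # Only errors outside every known category count as scan errors.
        stats["num_scan_errors"] += 1
    # Other known categories are not counted here (assumption).
    return stats
```

This is why num_scan_errors is hard to trigger in a test: the common failure modes already have their own categories.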

          6. num_pending_requests, this is the backlog of scan requests right? How is this different than outstanding scans stat

          This is the same as in #4.

          7. And finally the key_size_distribution and arrkey_size_distribution are existing stats that should now be persisted after the snapshot interval passes correct? Are there any other extra stats that should now be persisted that were not previously in the new stats added?

          key_size_distribution and arrkey_size_distribution are new stats introduced as part of the current release.

          key_size_distribution: a bucketized stat that shows how many keys fall into each size bracket. This stat applies to main index entries for both non-array and array indexes in MOI and Plasma. For example:

          "default:i1:key_size_distribution" : {
              "(0-64)" : 9,
              "(65-256)" : 1,
              "(257-1024)" : 0,
              "(1025-4096)" : 0,
              "(4097-102400)" : 0,
              "(102401-max)" : 0
          }
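To make the bracket semantics concrete, here is a minimal sketch that assigns key sizes to the brackets shown in the example. The bracket labels are taken from the example output; the function name and the use of `bisect` are illustrative choices, not the indexer's implementation.

```python
import bisect

# Inclusive upper bound of each bracket except the last, open-ended one.
UPPER_BOUNDS = [64, 256, 1024, 4096, 102400]
LABELS = [
    "(0-64)", "(65-256)", "(257-1024)",
    "(1025-4096)", "(4097-102400)", "(102401-max)",
]


def key_size_distribution(key_sizes):
    """Count how many key sizes (in bytes) fall into each bracket."""
    dist = {label: 0 for label in LABELS}
    for size in key_sizes:
        # bisect_left returns the first bracket whose upper bound is >= size,
        # or len(UPPER_BOUNDS) when the size exceeds every bound.
        idx = bisect.bisect_left(UPPER_BOUNDS, size)
        dist[LABELS[idx]] += 1
    return dist
```

For instance, nine keys of at most 64 bytes and one key of 100 bytes produce the same counts as the example above.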

          arrkey_size_distribution: This stat is specific to array indexes on Plasma storage only (the array index implementation differs between MOI and Plasma, and this stat does not add value for MOI array indexes). It gives the key size distribution of the full array key entry before the entries are split. It uses the same size brackets as the example above.

          Yes, you are right about the persistence of key_size_distribution and arrkey_size_distribution. They are persisted as part of the snapshot's metadata, and this happens in every persistence interval. If the indexer crashes, it recovers from a persisted snapshot, and so do these stats.

          8. is there any UI impact in this change that also needs to be tested?

          Some of the above-mentioned stats have been requested to be added to the UI. That is tracked by MB-33896.


          prathibha Prathibha Bisarahalli (Inactive) added a comment - edited

          Summary of changes made to stats:

          Changes to official stats REST endpoint are:
          num_items_flushed
          last_known_scan_time
          initial_build_progress
          avg_drain_rate
          num_scan_timeouts
          num_scan_errors
          num_pending_requests
          memory_total_storage

          Changes to internal stats endpoint are:
          num_scan_timeouts: number of requests that timed out (either while waiting for snapshots or while the scan was in progress)
          num_scan_errors: number of requests that failed due to any other errors
          avg_scan_latency: updated to a running average instead of a simple average
          key_size_distribution: a distribution of key sizes in various buckets
          arrkey_size_distribution: distribution of the full array key size for Plasma array indexes
          last_known_scan_time
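The avg_scan_latency change in the list above moves from a simple average to a running average. The summary does not say which form of running average is used, so the sketch below assumes an exponentially weighted one purely for illustration; the class name and the smoothing factor `alpha` are hypothetical.

```python
def simple_average(samples):
    """Mean over every sample seen so far; old samples never fade."""
    return sum(samples) / len(samples)


class RunningAverage:
    """Exponentially weighted running average (an assumed form)."""

    def __init__(self, alpha=0.5):
        # alpha: weight given to the newest sample (assumed value)
        self.alpha = alpha
        self.value = None

    def update(self, sample):
        if self.value is None:
            self.value = float(sample)
        else:
            # Blend the new sample with the previous average.
            self.value = self.alpha * sample + (1 - self.alpha) * self.value
        return self.value
```

The practical difference is that a running average of this kind tracks recent latency, while a simple average over the indexer's whole lifetime reacts more and more slowly as samples accumulate.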

          Projector stats:
          n1qlevaluate duration stat - Will be available in projector logs

          ajay.bhullar Ajay Bhullar added a comment -

          These stats have been manually verified and automated:

          http://review.couchbase.org/#/c/110728/


          People

            r.kalyanasundaram Ramalingam Kalyanasundaram [X] (Inactive)
            jeelan.poola Jeelan Poola