  1. Couchbase Server
  2. MB-63130

[BP-7.2.6] Under-accounting of flush stats in case of complex page iterators

    Description

      Flush stats (fdSz and hdrSz) are under-accounted in newPgOperator for complex pages with merge deltas. As a result, too little is subtracted when such pages are compacted, and the in-memory FlushDataSz and FlushHdrSz end up over-accounted.

      Consequently, LSS cleaners underestimate fragmentation and run less frequently. Stale data accumulates on disk and causes disk bloat, which is particularly evident in workloads with frequent merges (e.g., time-series data). A slow mutation rate aggravates the issue.
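
      The effect on the cleaner can be illustrated with a rough sketch. It is not Plasma's cleaner code: the fragmentation formula, the function names (fragmentationPct, shouldClean), the log size, and the 50% threshold are all assumptions chosen for illustration. Only the 53 and 124 bytes are taken from this ticket's example, as the stale bytes that were never subtracted.

```go
package main

import "fmt"

// fragmentationPct is an ASSUMED, simplified model (not Plasma's formula):
// the percentage of the log that is not covered by the bytes the store
// believes are still live.
func fragmentationPct(logSz, liveSz int64) float64 {
	if logSz <= 0 {
		return 0
	}
	return 100 * float64(logSz-liveSz) / float64(logSz)
}

// shouldClean reports whether a hypothetical cleaner would run, given a
// hypothetical fragmentation threshold in percent.
func shouldClean(logSz, liveSz int64, thresholdPct float64) bool {
	return fragmentationPct(logSz, liveSz) > thresholdPct
}

func main() {
	const (
		logSz        = 1000 // illustrative log size in bytes
		trueLive     = 400  // illustrative bytes that are really live
		missedStale  = 53 + 124 // stale data + header bytes never subtracted
		thresholdPct = 50.0 // illustrative cleaner threshold
	)
	overCounted := int64(trueLive + missedStale)

	fmt.Printf("accurate stats:     %.1f%% clean=%v\n",
		fragmentationPct(logSz, trueLive), shouldClean(logSz, trueLive, thresholdPct))
	fmt.Printf("over-counted stats: %.1f%% clean=%v\n",
		fragmentationPct(logSz, overCounted), shouldClean(logSz, overCounted, thresholdPct))
}
```

      Under these assumptions the over-counted live size pushes the computed fragmentation below the threshold, so the cleaner is skipped even though the log really is fragmented enough to clean.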

      Example:

      After the merge delta is added, if we have a parent page like:

      "low:":         <ud>(key-       401, sn:2, insert:true)</ud> (len:44),
      "high:":        maxItem (len:7),
      "chainLen:":    3,
      "numItems:":    0,
      "state:":       8006,
      "version:":     6,
      "flushed:":     true,
      "evicted:":     false,
      "compressed:":  false
       
       0 merge: op compress[false]purge[false]empty[false]op[opPageMergeDelta] 
           0 delta: op compress[false]purge[false]empty[false]op[opPageRemoveDelta] ptr[0x10eb4c000]
           1 flush: op compress[false]purge[false]empty[false]op[opRelocPageDelta] NumRecords 0 NumSegments 1 bloomFilter: <nil> flushDataSz: 43, flushHdrSz:80 
           2 base:
       1 flush: op compress[false]purge[false]empty[false]op[opRelocPageDelta] NumRecords 0 NumSegments 1 bloomFilter: <nil> flushDataSz: 53, flushHdrSz:124 
       2 base:
      

       

      After compaction, we would expect staleDataSz to be 43 + 53 = 96 and staleHdrSz to be 80 + 124 = 204.
      After compaction, the page becomes:
      Plasma: 

      "low:":         <ud>(key-       401, sn:2, insert:true)</ud> (len:44),
      "high:":        maxItem (len:7),
      "chainLen:":    0,
      "numItems:":    0,
      "state:":       7,
      "version:":     7,
      "flushed:":     false,
      "evicted:":     false,
      "compressed:":  false
       
       0 base:
      

      However, the staleDataSz returned is 43 and the staleHdrSz returned is 80.
      As a result, the flush stats are under-subtracted (by 53 and 124 here), which eventually leads to over-counting of these stats in memory.
      The stats persisted on disk are correct, so recovery is able to correct the situation.
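
      The expected totals can be reproduced with a small model of the example chain. This is a minimal sketch, not Plasma's implementation: the delta struct, staleSizes, and exampleChain are hypothetical names, and only the op names and sizes come from the dump above. It sums the flush stats of every flush delta, including the one nested under the merge delta.

```go
package main

import "fmt"

// deltaOp names the delta kinds that appear in the page dump.
type deltaOp string

const (
	opPageMergeDelta  deltaOp = "opPageMergeDelta"
	opPageRemoveDelta deltaOp = "opPageRemoveDelta"
	opRelocPageDelta  deltaOp = "opRelocPageDelta"
)

// delta is a hypothetical, simplified stand-in for one entry of a page chain.
type delta struct {
	op          deltaOp
	flushDataSz int64
	flushHdrSz  int64
	merged      []delta // sub-chain carried by a merge delta
}

// staleSizes sums the flush stats of every flush delta in the chain,
// descending into the sub-chain of each merge delta.
func staleSizes(chain []delta) (dataSz, hdrSz int64) {
	for _, d := range chain {
		if d.op == opRelocPageDelta {
			dataSz += d.flushDataSz
			hdrSz += d.flushHdrSz
		}
		md, mh := staleSizes(d.merged)
		dataSz += md
		hdrSz += mh
	}
	return dataSz, hdrSz
}

// exampleChain rebuilds the parent page from the dump (base pages omitted,
// since they carry no flush stats in the dump).
func exampleChain() []delta {
	return []delta{
		{op: opPageMergeDelta, merged: []delta{
			{op: opPageRemoveDelta},
			{op: opRelocPageDelta, flushDataSz: 43, flushHdrSz: 80},
		}},
		{op: opRelocPageDelta, flushDataSz: 53, flushHdrSz: 124},
	}
}

func main() {
	dataSz, hdrSz := staleSizes(exampleChain())
	fmt.Println("expected staleDataSz, staleHdrSz:", dataSz, hdrSz) // 96 204
	// The values actually returned in this ticket were 43 and 80.
	fmt.Println("bytes never subtracted:", dataSz-43, hdrSz-80) // 53 124
}
```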

       

      Workaround:
      Recovery log blocks correctly persist flushDataSz and flushHdrSz without issues. During recovery, the in-memory stats FlushDataSz and FlushHdrSz are recomputed. Restarting the indexer process serves as a temporary fix.


            People

              jinesh.parakh Jinesh Parakh