  1. Couchbase Server
  2. MB-47462

Improve checkpoint removal performance

Details

    Description

      Investigation under MB-35075 uncovered that during a bulk load, closed unreffed checkpoint removal occupies a non-trivial amount of the item pager runtime.

      It is reasonable that the pager would attempt to remove these checkpoints to recover memory before evicting items.

      However, closed unreffed checkpoints can be removed at any time; if they were removed more "promptly", then at any given time relatively little memory would be occupied by unreffed checkpoints, and the pager would not need to spend time destroying checkpoints while under memory pressure.

      In experiments with a modified build in which the paging visitor does no checkpoint removal, the dedicated ClosedUnreffedCheckpointRemover task becomes a bottleneck under reasonable load: persistence may consume checkpoints faster than they can be removed, leaving a small disk write queue (DWQ) and large amounts of memory in unreffed checkpoints. In a long-duration bulk load, this can eventually lead to almost all of the quota being used for unreffed checkpoints. Once the bulk load ends and the node becomes idle, this memory will still eventually be recovered by the ClosedUnreffedCheckpointRemover, leaving a very low residency bucket with unexpectedly low memory usage.

      With ongoing work to introduce a quota for the checkpoint manager, checkpoint removal may become a direct rate-limiting factor for incoming ops.

      Possibilities:

      • Decrease the time taken to destroy a single checkpoint
        • use F14FastMap over F14NodeMap for checkpoint index as fewer individual allocations need freeing during destruction, at the cost of increased memory usage
        • investigate viability of std::deque<Item> over std::list<STRCPtr<Item>>, reducing indirection and number of deallocations
      • Decrease overhead of checkpoint removal task
        • Move to a more event-driven model, e.g., VBReadyQueue and notification once a checkpoint becomes unreffed, rather than the task scanning every vb
      • Parallelise checkpoint removal (likely by sharding the removal task)
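
      The event-driven and sharded options above could be combined roughly as follows. This is a minimal sketch, not the kv_engine implementation: ReadyQueue, VbId, notifyUnreffed, drainShard and the modulo shard mapping are all hypothetical names and choices made for illustration, and the actual checkpoint destruction is left as a comment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <optional>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a vBucket id; not the real kv_engine type.
using VbId = std::uint16_t;

// Illustrative de-duplicating queue of vBuckets that have at least one
// closed, unreffed checkpoint. Not the real VBReadyQueue.
class ReadyQueue {
public:
    // Queue vb unless it is already queued. Returns true if the queue was
    // empty beforehand, i.e. the caller should wake the removal task.
    bool pushUnique(VbId vb) {
        std::lock_guard<std::mutex> lock(mutex);
        const bool wasEmpty = order.empty();
        if (queued.insert(vb).second) {
            order.push_back(vb);
            return wasEmpty;
        }
        return false;
    }

    // Take the next vBucket to process, if any.
    std::optional<VbId> pop() {
        std::lock_guard<std::mutex> lock(mutex);
        if (order.empty()) {
            return std::nullopt;
        }
        const VbId vb = order.front();
        order.pop_front();
        queued.erase(vb);
        return vb;
    }

private:
    std::mutex mutex;
    std::deque<VbId> order;          // FIFO order of pending vBuckets
    std::unordered_set<VbId> queued; // membership, for de-duplication
};

// Called when a checkpoint in vb becomes closed and unreffed.
// Assumption: vBuckets map to removal shards by simple modulo.
// Returns true if the owning shard's task needs waking.
bool notifyUnreffed(std::vector<ReadyQueue>& shards, VbId vb) {
    assert(!shards.empty());
    return shards[vb % shards.size()].pushUnique(vb);
}

// One run of a shard's removal task: visit only the notified vBuckets
// instead of scanning every vb. Returns the vBuckets visited, in order.
std::vector<VbId> drainShard(ReadyQueue& shard) {
    std::vector<VbId> visited;
    while (auto vb = shard.pop()) {
        // A real task would destroy the vb's closed unreffed checkpoints here.
        visited.push_back(*vb);
    }
    return visited;
}
```

      With this shape, each shard's task does work proportional to the number of notified vBuckets rather than the total vBucket count, and repeated notifications for the same vb before the task runs collapse into a single queue entry.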

            People

              james.harrison James Harrison (Inactive)