  1. Couchbase Server
  2. MB-53779

Phosphor: Keep "important" spans for an increased period of time

Details

    • Improvement
    • Resolution: Unresolved
    • Major
    • techdebt-backlog
    • master
    • phosphor
    • None
    • 1

    Description

      Background

      Phosphor currently uses a fixed-size ring buffer by default. Each thread that wishes to record trace events takes a "loan" on a block from the ring buffer and records its events in that block.

      While blocks are broadly assigned in oldest-first order, threads can have very different event recording rates, and this results in an uneven "history" of events across threads. For example, consider a trace where we have over 3 hours of history for front-end executor threads (as they record few events), but less than 6 minutes of information from nonIO threads, and less than 2 minutes from reader threads, as they record significantly more events.

      (It's worth highlighting that a "quiet" thread is not guaranteed to have a long history duration: once that thread fills its current event block and returns it, that "old" block could be loaned out to another thread, say one with a much higher event rate, and its events would be quickly overwritten.)
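
      The effect described above can be reproduced with a small single-threaded model. This is only a sketch under stated assumptions, not Phosphor's implementation: the names `Block`, `Ring`, `Recorder` and `simulate` are hypothetical, timestamps are abstract "ticks", the ring has 8 blocks of 4 events each, and there is no real concurrency or locking. A "quiet" thread records one event every 100 ticks and a "busy" thread records one event per tick.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical simplified model; none of these names are Phosphor's API.
struct Block {
    int owner = -1;            // thread that last took this block on loan
    bool loaned = false;       // is a thread currently holding it?
    std::vector<long> stamps;  // event timestamps, oldest first
};

class Ring {
public:
    Ring(std::size_t numBlocks, std::size_t blockCapacity)
        : blocks(numBlocks), capacity(blockCapacity) {
        assert(numBlocks > 0 && blockCapacity > 0);
    }

    // Loan the next free block in ring order, discarding its old events.
    // Returns nullptr if every block is currently on loan.
    Block* loan(int thread) {
        for (std::size_t tried = 0; tried < blocks.size(); ++tried) {
            Block& b = blocks[next];
            next = (next + 1) % blocks.size();
            if (!b.loaned) {
                b.loaned = true;
                b.owner = thread;
                b.stamps.clear();
                return &b;
            }
        }
        return nullptr;
    }

    // A returned block keeps its events until it is loaned out again.
    void giveBack(Block* b) { b->loaned = false; }

    bool full(const Block* b) const { return b->stamps.size() >= capacity; }

    // Oldest timestamp still present for a thread, or -1 if none survive.
    long oldest(int thread) const {
        long best = -1;
        for (const Block& b : blocks) {
            if (b.owner == thread && !b.stamps.empty() &&
                (best < 0 || b.stamps.front() < best)) {
                best = b.stamps.front();
            }
        }
        return best;
    }

private:
    std::vector<Block> blocks;
    std::size_t capacity;
    std::size_t next = 0;
};

// Per-thread view: holds one loaned block and swaps it when it fills up.
class Recorder {
public:
    Recorder(Ring& r, int t) : ring(r), thread(t) {}

    void record(long now) {
        if (held != nullptr && ring.full(held)) {
            ring.giveBack(held);
            held = nullptr;
        }
        if (held == nullptr) {
            held = ring.loan(thread);
            assert(held != nullptr);  // model assumes more blocks than threads
        }
        held->stamps.push_back(now);
    }

private:
    Ring& ring;
    int thread;
    Block* held = nullptr;
};

struct History {
    long quiet;  // ticks of history surviving for the quiet thread
    long busy;   // ticks of history surviving for the busy thread
};

History simulate(long ticks) {
    Ring ring(8, 4);
    Recorder quiet(ring, 0);
    Recorder busy(ring, 1);
    for (long now = 0; now < ticks; ++now) {
        if (now % 100 == 0) {
            quiet.record(now);
        }
        busy.record(now);
    }
    const long last = ticks - 1;
    return {last - ring.oldest(0), last - ring.oldest(1)};
}
```

      With `simulate(1000)`, the quiet thread ends up with 199 ticks of history while the busy thread has fewer than 30. The quiet thread's earlier blocks do not survive either: once it returns a full block, the busy thread takes that block on loan within a few ticks and its events are discarded.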

      Problem

      The variable-length history is expected given the current design. However, it can cause issues when some events are more important than others and it would be desirable to keep them for an extended period of time. Examples of such events include:

      • Slow operation spans
      • Slow task runtime spans
      • Timed mutex events (mutexes held, or waiting to be held for long time periods).

      Possible Solutions

      1. Get rid of frequent events - Simply reduce or remove the instances of high-frequency events, e.g. task runtimes for very commonly running tasks. This is architecturally simple: no changes are needed to phosphor, and it would extend the average thread history duration. In practical terms it is challenging: it's not always obvious how frequent events will actually be, so predicting which ones we need to remove or simplify is difficult. Additionally, we potentially lose a lot of useful information if we stop tracking certain trace events.
      2. Split the ring buffer into two logical ring buffers, high and low priority. Each thread can take a loan on two blocks at once - a high-priority block and a low-priority block. Events can be (statically) categorised as high or low priority, and they are recorded to the appropriate block when they occur.
        When loaning and returning blocks, the two sections of the ring buffer are treated independently - i.e. a request for a low-priority block will only ever be served from the low-priority ring buffer; hence the high-priority (and what should be low-throughput) ring buffer should end up with a much longer history.
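
      The second option could be sketched roughly as follows. This is an illustrative model only, not a proposed Phosphor API: `Priority`, `Ring`, `SplitBuffer`, `Recorder` and `simulate` are hypothetical names, the split of 6 low-priority and 2 high-priority blocks (4 events each) is an arbitrary assumption, and the model is single-threaded with abstract tick timestamps. One thread records a low-priority event every tick and a high-priority event every 100 ticks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical simplified model; none of these names are Phosphor's API.
enum class Priority { Low, High };

struct Block {
    bool loaned = false;       // is a thread currently holding it?
    std::vector<long> stamps;  // event timestamps, oldest first
};

// One logical ring; blocks are loaned in ring order, skipping loaned ones.
class Ring {
public:
    Ring(std::size_t numBlocks, std::size_t blockCapacity)
        : blocks(numBlocks), capacity(blockCapacity) {
        assert(numBlocks > 0 && blockCapacity > 0);
    }

    // Returns nullptr if every block in this ring is currently on loan.
    Block* loan() {
        for (std::size_t tried = 0; tried < blocks.size(); ++tried) {
            Block& b = blocks[next];
            next = (next + 1) % blocks.size();
            if (!b.loaned) {
                b.loaned = true;
                b.stamps.clear();
                return &b;
            }
        }
        return nullptr;
    }

    void giveBack(Block* b) { b->loaned = false; }

    bool full(const Block* b) const { return b->stamps.size() >= capacity; }

    // Oldest timestamp still present in this ring, or -1 if it is empty.
    long oldest() const {
        long best = -1;
        for (const Block& b : blocks) {
            if (!b.stamps.empty() &&
                (best < 0 || b.stamps.front() < best)) {
                best = b.stamps.front();
            }
        }
        return best;
    }

private:
    std::vector<Block> blocks;
    std::size_t capacity;
    std::size_t next = 0;
};

// Two independent rings: a loan for one priority never touches the other.
class SplitBuffer {
public:
    SplitBuffer(std::size_t lowBlocks, std::size_t highBlocks,
                std::size_t blockCapacity)
        : low(lowBlocks, blockCapacity), high(highBlocks, blockCapacity) {}

    Ring& ring(Priority p) { return p == Priority::High ? high : low; }

private:
    Ring low;
    Ring high;
};

// Per-thread view: holds up to one loaned block per priority.
class Recorder {
public:
    explicit Recorder(SplitBuffer& b) : buffer(b) {}

    void record(Priority p, long now) {
        Ring& ring = buffer.ring(p);
        Block*& slot = (p == Priority::High) ? heldHigh : heldLow;
        if (slot != nullptr && ring.full(slot)) {
            ring.giveBack(slot);
            slot = nullptr;
        }
        if (slot == nullptr) {
            slot = ring.loan();
            assert(slot != nullptr);  // model assumes a free block exists
        }
        slot->stamps.push_back(now);
    }

private:
    SplitBuffer& buffer;
    Block* heldLow = nullptr;
    Block* heldHigh = nullptr;
};

struct History {
    long low;   // ticks of history surviving for low-priority events
    long high;  // ticks of history surviving for high-priority events
};

History simulate(long ticks) {
    SplitBuffer buffer(6, 2, 4);
    Recorder thread(buffer);
    for (long now = 0; now < ticks; ++now) {
        thread.record(Priority::Low, now);
        if (now % 100 == 0) {
            thread.record(Priority::High, now);
        }
    }
    const long last = ticks - 1;
    return {last - buffer.ring(Priority::Low).oldest(),
            last - buffer.ring(Priority::High).oldest()};
}
```

      In this model, `simulate(1000)` leaves 599 ticks of high-priority history against 23 ticks of low-priority history, even though the high-priority ring holds only a quarter of the blocks. The trade-off is that the block split has to be chosen up front, and any blocks reserved for high-priority events shorten the low-priority history accordingly.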

          People

            jwalker Jim Walker
            drigby Dave Rigby (Inactive)