Uploaded image for project: 'Couchbase .NET client library'
  1. Couchbase .NET client library
  2. NCBC-1439

Fix race condition with opaque mismatch

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Major
    • 2.4.7
    • 2.4.6
    • library
    • None

    Description

      From https://github.com/couchbase/couchbase-net-client/pull/72:

      "We encountered an issue in the .NET SDK with Multiplexing IO. Our operations sometimes return with ClientFailure status and message: Expected opaque 154777 but got 154681������������\9����s�. They also sometimes end with NullReferenceException occurring in the SDK code.

      We traced this down to a weird race condition in the Multiplexing connection implementation. It's between the Send thread public override byte[] Send(byte[] request), Receive thread internal void ParseReceivedData(), .NET Task queue and SyncState pooling. Chances for this to occur are increasing with combination of high traffic, slow network and high utilization of Tasks (so Task may be queued for longer time).

      Here's the race condition scenario:

      1. Send thread enters public override byte[] Send(byte[] request) and executed the method up until var didComplete = state.SyncWait.WaitOne(Configuration.SendTimeout); where it waits
      2. Receive thread gets somewhat delayed Couchbase response (network issues). It executes the complete internal void ParseReceivedData() method including the Task.Run(()=> state.Complete(response));. If the app uses Task framework a lot, our Task might be stuck in a queue for a while.
      3. The WaitOne call on Send thread times out and the {{public override byte[] Send(byte[] request)}}resumes. The current state instance is cleaned up and placed back into the state pool. Note that at this point we still have a Task queued up to set received results on this state instance.
      4. A new Send thread grabs the same SyncState instance from the state pool and uses it for its own operation.
      5. Our original enqueued Task finally executes. The problem is that we are now setting received data from the previous operation on a state which was already cleaned and reused for something else.

      My proposed fix has two parts. First is to move the result assignment of SyncState from .NET thread pool into the receive thread. This way we have to worry only about synchronizing two threads. So instead of Task.Run(()=> state.Complete(response)); we execute plain state.Complete(response);. I understand that the AsyncState has some long and potentially blocking Callback execution in its scope. So i pushed it out of the Receive thread context again by adding the Task.Run call around the Callback execution.

      The second part is to make sure that setting results on the state is synchronized to eliminate the race condition. The state.Complete(response) call got moved inside the lock (_statesInFlight) scope. Both Sync and Async states will execute very short and straight-forward code there, so any effects of increasing the lock length will be negligible. Now in the same scenario there are two options what might happen:

      a) Send thread acquires the lock on _statesInFlight first and removes the state from this collection. When Receive thread acquires the lock, it can't find the state so it can't mess with it after it's been reused for something else.
      b) Receive thread acquires the lock on _statesInFlight first and sets the results on state within the lock scope. Even if the Send thread releases on timeout at the same time, it has to wait for Receive thread to finish it's business. When it gets to ReleaseState(state); line we are 100% sure no one else can touch the same state instance anymore."

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            People

              jmorris Jeff Morris
              jmorris Jeff Morris
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty