Uploaded image for project: 'Couchbase .NET client library'
  1. Couchbase .NET client library
  2. NCBC-1439

Fix race condition with opaque mismatch

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.6
    • Fix Version/s: 2.4.7
    • Component/s: library
    • Labels:
      None

      Description

      From https://github.com/couchbase/couchbase-net-client/pull/72:

      "We encountered an issue in the .NET SDK with Multiplexing IO. Our operations sometimes return with ClientFailure status and message: Expected opaque 154777 but got 154681������������\9����s�. They also sometimes end with NullReferenceException occurring in the SDK code.

      We traced this down to a weird race condition in the Multiplexing connection implementation. It's between the Send thread public override byte[] Send(byte[] request), Receive thread internal void ParseReceivedData(), .NET Task queue and SyncState pooling. Chances for this to occur are increasing with combination of high traffic, slow network and high utilization of Tasks (so Task may be queued for longer time).

      Here's the race condition scenario:

      1. Send thread enters public override byte[] Send(byte[] request) and executed the method up until var didComplete = state.SyncWait.WaitOne(Configuration.SendTimeout); where it waits
      2. Receive thread gets somewhat delayed Couchbase response (network issues). It executes the complete internal void ParseReceivedData() method including the Task.Run(()=> state.Complete(response));. If the app uses Task framework a lot, our Task might be stuck in a queue for a while.
      3. The WaitOne call on Send thread times out and the {{public override byte[] Send(byte[] request)}}resumes. The current state instance is cleaned up and placed back into the state pool. Note that at this point we still have a Task queued up to set received results on this state instance.
      4. A new Send thread grabs the same SyncState instance from the state pool and uses it for its own operation.
      5. Our original enqueued Task finally executes. The problem is that we are now setting received data from the previous operation on a state which was already cleaned and reused for something else.

      My proposed fix has two parts. First is to move the result assignment of SyncState from .NET thread pool into the receive thread. This way we have to worry only about synchronizing two threads. So instead of Task.Run(()=> state.Complete(response)); we execute plain state.Complete(response);. I understand that the AsyncState has some long and potentially blocking Callback execution in its scope. So i pushed it out of the Receive thread context again by adding the Task.Run call around the Callback execution.

      The second part is to make sure that setting results on the state is synchronized to eliminate the race condition. The state.Complete(response) call got moved inside the lock (_statesInFlight) scope. Both Sync and Async states will execute very short and straight-forward code there, so any effects of increasing the lock length will be negligible. Now in the same scenario there are two options what might happen:

      a) Send thread acquires the lock on _statesInFlight first and removes the state from this collection. When Receive thread acquires the lock, it can't find the state so it can't mess with it after it's been reused for something else.
      b) Receive thread acquires the lock on _statesInFlight first and sets the results on state within the lock scope. Even if the Send thread releases on timeout at the same time, it has to wait for Receive thread to finish it's business. When it gets to ReleaseState(state); line we are 100% sure no one else can touch the same state instance anymore."

        Attachments

          Issue Links

          For Gerrit Dashboard: NCBC-1439
          # Subject Branch Project Status CR V

            Activity

            Hide
            jmorris Jeff Morris added a comment -

            Jae Park [X] -

            Can we test this patch: git fetch ssh://jmorris@review.couchbase.org:29418/couchbase-net-client refs/changes/92/79192/2 && git checkout FETCH_HEAD

            Thanks,

            Jeff

            Show
            jmorris Jeff Morris added a comment - Jae Park [X] - Can we test this patch: git fetch ssh://jmorris@review.couchbase.org:29418/couchbase-net-client refs/changes/92/79192/2 && git checkout FETCH_HEAD Thanks, Jeff
            Hide
            jaekwon.park Jae Park [X] (Inactive) added a comment -

            started regression test for sync and async mode against 4.1 and 4.5 as we usually do

            Show
            jaekwon.park Jae Park [X] (Inactive) added a comment - started regression test for sync and async mode against 4.1 and 4.5 as we usually do

              People

              • Assignee:
                jmorris Jeff Morris
                Reporter:
                jmorris Jeff Morris
              • Votes:
                0 Vote for this issue
                Watchers:
                5 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Gerrit Reviews

                  There are no open Gerrit changes

                    PagerDuty

                    Error rendering 'com.pagerduty.jira-server-plugin:PagerDuty'. Please contact your Jira administrators.