Uploaded image for project: 'Couchbase Gateway'
  1. Couchbase Gateway
  2. CBG-520

Initialization race for channel cache validFrom

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.7.0
    • Component/s: SyncGateway
    • Security Level: Public
    • Labels:
      None
    • Required Mobile Fields:
      Hide
      Mandatory:
       - CBL / SG Version:
         - SG Config:
       - Steps to Reproduce:
       - Actual Result:
       - Expected Result:
       - Logs :
            SGW LOGS: sgcollect info
            CBL LOGS:
            Logcat LOGS: for Android tickets
       - Github link for the code:
       - Jenkins job failure link:
       - Pytest Command
       - What is the last build this test passed:
      Show
      Mandatory:  - CBL / SG Version:    - SG Config:  - Steps to Reproduce:  - Actual Result:  - Expected Result:  - Logs :       SGW LOGS: sgcollect info       CBL LOGS:       Logcat LOGS: for Android tickets  - Github link for the code:  - Jenkins job failure link:  - Pytest Command  - What is the last build this test passed:
    • Sprint:
      CBG Sprint 32, CBG Sprint 33, CBG Sprint 34, CBG Sprint 35
    • Story Points:
      3

      Description

      TestContinuousChangesBackfill is failing intermittently due to doc-1 not being sent over changes.

      http://mobile.jenkins.couchbase.com/job/sgw-unix-build/9909/consoleFull#283334476db2b220c-562c-42fc-9783-a473165a58a0

        Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

          Hide
          adamf Adam Fraser added a comment -

          The test logging suggests that doc-1 (seq #1) is being received over DCP as expected (based on sequence buffering messages), but isn't ending up in the cache (based on Cache: GetCachedChanges logging. Suggests a potential race between per-channel cache initialization (triggered by the changes request), and doc-1 arriving over DCP. SG should handle all variations of timing here - need to identify which scenario isn't covered properly.

          Show
          adamf Adam Fraser added a comment - The test logging suggests that doc-1 (seq #1) is being received over DCP as expected (based on sequence buffering messages), but isn't ending up in the cache (based on Cache: GetCachedChanges logging. Suggests a potential race between per-channel cache initialization (triggered by the changes request), and doc-1 arriving over DCP. SG should handle all variations of timing here - need to identify which scenario isn't covered properly.
          Hide
          adamf Adam Fraser added a comment -

          There's another failure in TestChangesAfterChannelAdded that could be potentially related - needs review.
          http://mobile.jenkins.couchbase.com/view/Sync_Gateway/job/sgw-unix-build/9915/consoleFull#-49449668db2b220c-562c-42fc-9783-a473165a58a0

          Show
          adamf Adam Fraser added a comment - There's another failure in TestChangesAfterChannelAdded that could be potentially related - needs review. http://mobile.jenkins.couchbase.com/view/Sync_Gateway/job/sgw-unix-build/9915/consoleFull#-49449668db2b220c-562c-42fc-9783-a473165a58a0
          Hide
          adamf Adam Fraser added a comment -

          There was an initialization race related to validFrom when instantiating a new channel cache, between AddToCache and addChannelCache.

          A. AddToCache:
          1. If any of the mutation's channels are active in the channel cache map, add entry to the cache
          2. Acquire the seqLock, and update highCachedSequence

          B. addChannelCache:
          1. Acquire the seqLock. Add a new entry to the channel cache map, with validFrom = highCachedSequence + 1

          In the case where B1 happens between A1 and A2, the validFrom for the new singleChannelCache will be invalid, as the sequence associated with the mutation being processed is greater than validFrom, but won't be in the cache.

          To address, addToCache will need to hold a mutex across A1 and A2. It's preferred to not use seqLock for this, as it will introduce contention with changes processing that needs to retrieve highCachedSequence (independent of changes to the channel cache map). Instead, a new 'validFromLock' can be used. This shouldn't introduce significant contention on the caching side, as addToCache is already single-threaded (inside processEntry).

          Show
          adamf Adam Fraser added a comment - There was an initialization race related to validFrom when instantiating a new channel cache, between AddToCache and addChannelCache. A. AddToCache: 1. If any of the mutation's channels are active in the channel cache map, add entry to the cache 2. Acquire the seqLock, and update highCachedSequence B. addChannelCache: 1. Acquire the seqLock. Add a new entry to the channel cache map, with validFrom = highCachedSequence + 1 In the case where B1 happens between A1 and A2, the validFrom for the new singleChannelCache will be invalid, as the sequence associated with the mutation being processed is greater than validFrom, but won't be in the cache. To address, addToCache will need to hold a mutex across A1 and A2. It's preferred to not use seqLock for this, as it will introduce contention with changes processing that needs to retrieve highCachedSequence (independent of changes to the channel cache map). Instead, a new 'validFromLock' can be used. This shouldn't introduce significant contention on the caching side, as addToCache is already single-threaded (inside processEntry).
          Hide
          build-team Couchbase Build Team added a comment -

          Build sync_gateway-2.7.0-138 contains sync_gateway commit c582819 with commit message:
          CBG-520 Add mutex for validFrom handling (#4346)

          Show
          build-team Couchbase Build Team added a comment - Build sync_gateway-2.7.0-138 contains sync_gateway commit c582819 with commit message: CBG-520 Add mutex for validFrom handling (#4346)

            People

            Assignee:
            adamf Adam Fraser
            Reporter:
            adamf Adam Fraser
            Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

              Dates

              Created:
              Updated:
              Resolved:

                Gerrit Reviews

                There are no open Gerrit changes

                  PagerDuty