Couchbase Gateway / CBG-463

Potential feedback loop when replicating large attachments

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.6.0
    • Fix Version/s: 2.7.0
    • Component/s: SyncGateway
    • Security Level: Public
    • Labels:
      None
    • Sprint:
      CBG Sprint 28, CBG Sprint 29
    • Story Points:
      5

      Description

      During test fest, QE and CBL teams hit a situation where Sync Gateway became non-responsive while attempting to replicate ~350 documents, where each document had large attachments (2-3 MB).

      The Sync Gateway logs showed many 30s timeouts between Sync Gateway and Couchbase Server while trying to push attachments. The SG logs suggested that retry handling was taking place, so that after a timeout SG would re-attempt the request up to 11 times.

      This test was running against a single Couchbase Server node on AWS. This suggests that the requests were timing out because of the large amount of data in gocb's single pipeline to the server. The concern is that the retry handling is exacerbating the situation by retrying the attachment write on timeout, which increases the amount of data being pushed through the pipeline and makes future timeouts more likely.
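
To put a rough number on that concern, here is a minimal Go sketch of the worst-case data volume. The figures are assumptions for illustration only: one 2.5 MB attachment per document (the midpoint of the reported 2-3 MB range), every attempt timing out, and "up to 11 times" read as 11 retries on top of the original request. `worstCaseMB` is a hypothetical helper, not part of Sync Gateway.

```go
package main

import "fmt"

// worstCaseMB estimates the data pushed through the pipeline when every
// attempt for every attachment times out and is retried.
func worstCaseMB(docs int, attachmentMB float64, retries int) float64 {
	attempts := 1 + retries // the original request plus each retry
	return float64(docs) * attachmentMB * float64(attempts)
}

func main() {
	// Assumed figures: 350 docs, one 2.5 MB attachment each.
	fmt.Printf("no retries: %.0f MB\n", worstCaseMB(350, 2.5, 0))
	fmt.Printf("11 retries: %.0f MB\n", worstCaseMB(350, 2.5, 11))
}
```

Under these assumptions the retries could turn roughly 875 MB of attachment data into as much as 10,500 MB queued on the same pipeline.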

      Generally speaking this would be mitigated with a larger server cluster, but we should still avoid the cascading failures due to retry handling.

      Need to review a few things to identify how best to avoid this scenario:

      • backoff settings when pushing/pulling attachments during blip replication
      • whether retry handling on timeout should be disabled for large attachments
      • whether timeout should automatically be extended for large attachments
      • whether attachments should have their own dedicated gocb connection, to avoid bringing down the rest of SG in this scenario
      • whether SG should be increasing the number of gocb pipelines per CBS node (I believe gocb added support for this, but I'm not sure whether there's uptake required). This can be configured using the kv_pool_size option in the gocb connection string.
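
As an illustration of the last bullet, the following Go sketch appends the kv_pool_size option to a connection string. The host name and the pool size of 4 are placeholders, and `withKVPoolSize` is a hypothetical helper; how Sync Gateway itself builds its connection string is not shown in this ticket.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// withKVPoolSize returns connStr with the kv_pool_size query option set,
// preserving any options that are already present.
func withKVPoolSize(connStr string, poolSize int) (string, error) {
	u, err := url.Parse(connStr)
	if err != nil {
		return "", err
	}
	q := u.Query()
	q.Set("kv_pool_size", strconv.Itoa(poolSize))
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	// cbs.example.com and the value 4 are illustrative only.
	s, err := withKVPoolSize("couchbase://cbs.example.com", 4)
	if err != nil {
		panic(err)
	}
	fmt.Println(s) // couchbase://cbs.example.com?kv_pool_size=4
}
```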

          Activity

          sridevi.saragadam Sridevi Saragadam added a comment -

          CBS version: 6.5.0-3821

          SGW version: 2.6.0-125

          This happened during test fest, in the following scenario that Jay performed:

          1. Create 350 docs and add attachments to all 350 docs and share

          Here is the sgcollect log: testfest_sgw_crash.zip

          adamf Adam Fraser added a comment -

          Hitting the gocb operation timeout on a write doesn't evict the request from the pipeline, so the write request that timed out can eventually succeed on the server. A subsequent retry will then return a 'key already exists' error. This suggests that the existing retry handling for timeouts on Add/AddRaw is always going to return an error, and should be removed.
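
The sequence described above can be reproduced with a small in-memory model. This is a sketch only: `slowServer` and `addWithRetry` are hypothetical stand-ins, not gocb or Sync Gateway code, and the model assumes the timed-out write has already landed on the server by the time the retry is issued.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errTimeout   = errors.New("operation timed out")
	errKeyExists = errors.New("key already exists")
)

// slowServer is a hypothetical stand-in for a bucket whose pipeline is
// backed up: writes are applied, but the client stops waiting first.
type slowServer struct {
	docs map[string][]byte
	slow bool
}

// Add stores the value only if the key is absent. While the server is slow
// the write still lands, but the caller sees a timeout.
func (s *slowServer) Add(key string, value []byte) error {
	if _, ok := s.docs[key]; ok {
		return errKeyExists
	}
	s.docs[key] = value
	if s.slow {
		return errTimeout
	}
	return nil
}

// addWithRetry mimics retry-on-timeout handling for Add.
func addWithRetry(s *slowServer, key string, value []byte, maxRetries int) error {
	err := s.Add(key, value)
	for i := 0; i < maxRetries && errors.Is(err, errTimeout); i++ {
		err = s.Add(key, value)
	}
	return err
}

func main() {
	s := &slowServer{docs: map[string][]byte{}, slow: true}
	err := addWithRetry(s, "att1", []byte("blob"), 11)
	_, stored := s.docs["att1"]
	fmt.Println(err, stored) // key already exists true
}
```

In this model the caller gets an error even though the document was stored, which is the behaviour that motivates removing the retry.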

          We could switch from an Add to a Get for retry handling after timeout, to avoid returning a timeout error. However, this approach has some corner cases that are difficult to address. At minimum we'd need to check that the document body and expiry retrieved by the Get were the same as those associated with the previous Add. If the document exists and the body doesn't match, though, we wouldn't be able to tell whether the original Add failed with a 'key already exists', or someone did a Set on the document after our original Add succeeded.
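
A rough sketch of that Get-based check shows where the ambiguity arises. All names here (`stored`, `outcome`, `checkAfterTimeout`) are hypothetical, and the expiry is compared as an opaque value, which is a simplifying assumption.

```go
package main

import (
	"bytes"
	"fmt"
)

// stored is what a Get returned for the key; nil means the key was not found.
type stored struct {
	body   []byte
	expiry uint32
}

type outcome string

const (
	addSucceeded outcome = "add succeeded"
	notFound     outcome = "not found"
	ambiguous    outcome = "ambiguous"
)

// checkAfterTimeout classifies what a Get issued after a timed-out Add
// can tell us about that Add.
func checkAfterTimeout(got *stored, wantBody []byte, wantExpiry uint32) outcome {
	if got == nil {
		// Nothing on the server yet; the Add may still be queued.
		return notFound
	}
	if bytes.Equal(got.body, wantBody) && got.expiry == wantExpiry {
		return addSucceeded
	}
	// The document exists with different content: either our Add lost to an
	// earlier write, or a later Set replaced it. The Get cannot tell which.
	return ambiguous
}

func main() {
	body := []byte("attachment bytes")
	fmt.Println(checkAfterTimeout(nil, body, 0))
	fmt.Println(checkAfterTimeout(&stored{body: body}, body, 0))
	fmt.Println(checkAfterTimeout(&stored{body: []byte("other")}, body, 0))
}
```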

          The Set case is a bit more challenging, but we should also consider it: in that scenario we'd want to retry until the response... If we extend this to...

          adamf Adam Fraser added a comment -

          This change is too risky to add to Cobalt - moving to Mercury.

          Users can mitigate this issue today by increasing the bucket_op_timeout_ms Sync Gateway database config property. Deployments with more Couchbase Server nodes are less likely to hit this issue, as SG maintains one pipeline per Couchbase Server node.
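
For reference, the following Go sketch renders a minimal configuration fragment with the property set. The database name "db" and the 60000 ms value are placeholders rather than recommendations, and `renderConfig` is a hypothetical helper; a real Sync Gateway config contains many other properties.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// dbConfig holds only the property discussed here.
type dbConfig struct {
	BucketOpTimeoutMs uint32 `json:"bucket_op_timeout_ms"`
}

// renderConfig returns an indented JSON fragment that sets
// bucket_op_timeout_ms for a single database.
func renderConfig(dbName string, timeoutMs uint32) (string, error) {
	cfg := map[string]map[string]dbConfig{
		"databases": {dbName: {BucketOpTimeoutMs: timeoutMs}},
	}
	out, err := json.MarshalIndent(cfg, "", "  ")
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	// "db" and 60000 are illustrative values only.
	s, err := renderConfig("db", 60000)
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```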

          build-team Couchbase Build Team added a comment -

          Build sync_gateway-2.7.0-21 contains sync_gateway commit 41057c3 with commit message:
          CBG-463 - Remove retry timeout on write operations (#4222)


            People

            Assignee:
            adamf Adam Fraser
            Reporter:
            adamf Adam Fraser
            Votes:
            0
            Watchers:
            4