During the test fest, the QE and CBL teams hit a situation where Sync Gateway became unresponsive while attempting to replicate ~350 documents, each carrying large attachments (2-3 MB).
The Sync Gateway logs showed numerous 30s timeouts between Sync Gateway and Couchbase Server while pushing attachments. The logs also indicated that retry handling was kicking in: after each timeout, SG would re-attempt the request up to 11 times.
This test was running against a single Couchbase Server node on AWS, which suggests the requests were timing out because of the large amount of data flowing through gocb's single pipeline to the server. The concern is that retry handling exacerbates the situation: retrying an attachment on timeout pushes even more data through the pipeline, making subsequent timeouts more likely.
Generally speaking, this would be mitigated by a larger server cluster, but we should still avoid cascading failures caused by retry handling.
Need to review a few things to identify how best to avoid this scenario:
- backoff settings when pushing/pulling attachments during blip replication
- whether retry handling on timeout should be disabled for large attachments
- whether timeout should automatically be extended for large attachments
- whether attachments should have their own dedicated gocb connection, to avoid bringing down the rest of SG in this scenario
- whether SG should increase the number of gocb pipelines per CBS node (I believe gocb added support for this, but I'm not sure whether uptake work is required on the SG side). This can be configured using the kv_pool_size option in the gocb connection string.
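For illustration, since kv_pool_size is set via the gocb connection string, surfacing it in a Sync Gateway database config might look something like the following (hostname and bucket name are placeholders, and whether SG passes connection-string query parameters through to gocb unchanged would need verification):

```json
{
  "databases": {
    "db": {
      "server": "couchbase://cbs.example.com?kv_pool_size=2",
      "bucket": "data-bucket"
    }
  }
}
```

Here kv_pool_size=2 would open two KV connections (pipelines) per Couchbase Server node instead of the default one, spreading attachment traffic across them.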