Details
-
Improvement
-
Resolution: Unresolved
-
Major
Description
Since before the beginning of (Couchbase) time, users have attributed poor performance to the wrong thing. Two mistakes we see regularly are running a tight serial loop of operations (e.g. `while true; do`) and comparing the latency of something local or in their own datacenter to something remote. The second has been enshrined since the 1990s as one of the Fallacies of Distributed Computing, #2: "Latency is zero."
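To make the tight-loop point concrete, here is a small arithmetic sketch. It assumes a loop that waits for each reply before sending the next request, and the round-trip times are illustrative assumptions, not measurements of any Couchbase deployment.

```python
def max_serial_ops_per_sec(round_trip_ms: float) -> float:
    """Upper bound on throughput for a loop that waits for each reply
    before sending the next request: one operation per round trip."""
    return 1000.0 / round_trip_ms


# Round-trip times below are assumed values for illustration only.
for label, rtt_ms in [("app in the same region", 1.0),
                      ("laptop across the Internet", 50.0)]:
    bound = max_serial_ops_per_sec(rtt_ms)
    print(f"{label}: {rtt_ms} ms RTT -> at most {bound:.0f} ops/s per serial loop")
```

With these assumed numbers the same cluster looks 50 times slower from the laptop, even though nothing on the server side has changed.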
We currently describe how to select regions and how to set up VPC peering, but we do not explain anywhere that where one runs and deploys their app matters. We leave it up to the user to (hopefully) figure this out and far too many naïve users don't.
Currently neither a "region site:docs.couchbase.com/cloud/" nor a "latency site:docs.couchbase.com/cloud/" Google search turns up anything useful. The region search lists the regions, but not why the choice matters, and the latency search seems to turn up something about FTS indexes. There's also a release note about Private Networking, which isn't likely to be noticed: https://docs.couchbase.com/cloud/release-notes/release-notes.html#25-november-2021-release
To improve this, I'd recommend we try to cover a few things…
- The concept that the choice of where the app runs matters (and yes, we should probably even be explicit that running on your laptop across the Internet is not an apples-to-apples comparison with running next to the cluster)
- What one should expect when running various setups
- That Couchbase runs in multiple regions so users can optimize
- Why VPC peering or App Endpoints matter beyond security, and what improvements can be had setting that up
- If you see something that isn't what you expect, what kind of diagnostics you may run to figure out why there is a problem.
- Possibly some guidance/samples on how code may need to be written to properly test performance, and what additional tools (response time observability, OTel) are available to dig deeper if need be. (As an aside, in one cloud-based customer evaluation, Disk IO was the limiting factor. That was immediately apparent with OTel, but would otherwise have been nearly impossible to find just by looking at the outer surface area of the requests.)
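As a starting point for the kind of sample the last bullet has in mind, here is a minimal sketch of a latency harness. It is not tied to any Couchbase SDK: `op` is any zero-argument callable, `fake_remote_call` is a hypothetical stand-in that just sleeps, and all names here are illustrative. The idea is to warm up first, run several workers concurrently rather than one tight loop, and report percentiles rather than a single average.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def time_operation(op: Callable[[], object], iterations: int) -> List[float]:
    """Run op serially and return per-call latencies in milliseconds."""
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        op()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies_ms: List[float]) -> Dict[str, float]:
    """Report min, p50, p99 and max using nearest-rank style indexing."""
    ordered = sorted(latencies_ms)

    def pct(p: float) -> float:
        return ordered[int(round(p / 100.0 * (len(ordered) - 1)))]

    return {"min": ordered[0], "p50": pct(50), "p99": pct(99), "max": ordered[-1]}


def run_benchmark(op: Callable[[], object], workers: int,
                  iterations_per_worker: int, warmup: int = 10) -> Dict[str, float]:
    """Warm up, then run op from several threads and summarize the results."""
    for _ in range(warmup):  # exclude connection setup and cold caches
        op()
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_worker = list(pool.map(
            lambda _: time_operation(op, iterations_per_worker), range(workers)))
    elapsed = time.perf_counter() - start
    all_latencies = [ms for worker in per_worker for ms in worker]
    summary = summarize(all_latencies)
    summary["ops_per_sec"] = len(all_latencies) / elapsed
    return summary


def fake_remote_call() -> None:
    """Hypothetical stand-in for a network operation (assumed 5 ms round trip)."""
    time.sleep(0.005)


if __name__ == "__main__":
    for workers in (1, 4):
        print(workers, "worker(s):", run_benchmark(fake_remote_call, workers, 20))
```

In a real test, `fake_remote_call` would be replaced by an actual SDK operation, and the harness would run from the same place the application will be deployed.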
Further analysis is probably needed and my suggestion above is just high level. Maybe we do have good artifacts for this, but I've not found them and it comes up often enough that we should have something good to point users to. As has been pointed out previously, competitors often document how to set up and run benchmarks.
The good news is we have a few tools that can be useful for understanding if this is an issue. One is sdk-doctor, which performs a set of pings against each endpoint and issues a warning if the latencies seem to indicate that it's being run across a WAN. Second, the SDKs have the health check 'ping'. Finally, the big tool is OpenTelemetry and the SDK support for it, which allows one to understand where the time is going and intuit whether it's a network, backend IO, or other issue.
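For illustration only, the general idea of a "does this look like a WAN?" check can be sketched with plain TCP connect timing. This is not how sdk-doctor or the SDK ping is implemented; the 10 ms cutoff, the function names, and the example host are all assumptions made up for this sketch.

```python
import socket
import time
from typing import List

WAN_THRESHOLD_MS = 10.0  # hypothetical cutoff, not taken from sdk-doctor


def tcp_connect_ms(host: str, port: int, timeout: float = 3.0) -> float:
    """Time one TCP connect in milliseconds.
    If host is a name rather than an IP, DNS resolution time is included."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0


def probe(host: str, port: int, attempts: int = 5) -> List[float]:
    """Collect several connect-time samples for one endpoint."""
    return [tcp_connect_ms(host, port) for _ in range(attempts)]


def classify(samples_ms: List[float],
             threshold_ms: float = WAN_THRESHOLD_MS) -> str:
    """Use the fastest sample, which is the least affected by transient noise."""
    return "likely WAN" if min(samples_ms) > threshold_ms else "likely local"


# Example usage (host and port are placeholders, replace with a real endpoint):
#   samples = probe("cb.example.com", 11207)
#   print(classify(samples), [f"{s:.1f} ms" for s in samples])
```

A connect time well above the cutoff suggests the client is not in the same region or network as the cluster, which is the point at which OpenTelemetry traces become the better tool for seeing where the remaining time goes.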
The bad news is a great many users won't notice this without reading the docs or engaging us. Two other ideas, just to drop them here… we could make suggestions based on the location of the browser at cluster creation time. We could also, at a low level, reverse ping inbound connections (assuming ICMP is available) and make recommendations.