Details
- Improvement
- Resolution: Unresolved
- Major
- 7.0.3, 7.1.0
- None
- 1
Description
As seen in some CBSEs, when nodes in a cluster have different disk performance, users can experience unexpected behaviour from various Couchbase services that is hard to troubleshoot and explain. In one case, it caused periodic spikes in num_docs_pending (documents waiting to be indexed), which can potentially cause stale=false scans to time out. It often takes a long time to complete an RCA and attribute the reported symptoms to underlying disk issues, and multiple teams need to get involved.
It would be good if the server could detect this condition automatically: raise an alert in the server UI (and perhaps a system event), collect the relevant data in cbcollect, and have mortimer highlight the issue when a supportal snapshot is uploaded. This would help multiple stakeholders (customers, support, engineering, etc.).
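To make the request concrete, here is a minimal sketch of one way such detection could work. It is only an illustration, not a proposal for the actual ns_server implementation. It assumes that per-node disk latency samples (in milliseconds) are already available. The function name `find_slow_disk_nodes`, the input shape, and both thresholds are hypothetical.

```python
from statistics import median


def find_slow_disk_nodes(latency_ms_by_node, ratio_threshold=3.0, min_latency_ms=5.0):
    """Return nodes whose disk latency is far above that of their peers.

    latency_ms_by_node: hypothetical input, a dict mapping node name to a
    list of disk latency samples in milliseconds.
    ratio_threshold and min_latency_ms are illustrative values, not tuned.
    """
    # Summarise each node by its median latency; skip nodes with no samples.
    node_medians = {
        node: median(samples)
        for node, samples in latency_ms_by_node.items()
        if samples
    }
    if len(node_medians) < 2:
        return []  # Nothing to compare against.

    flagged = []
    for node, value in sorted(node_medians.items()):
        # Baseline is the median of the *other* nodes, so one slow node
        # does not distort the value it is compared against.
        baseline = median(v for n, v in node_medians.items() if n != node)
        if value >= min_latency_ms and value > ratio_threshold * baseline:
            flagged.append(node)
    return flagged


samples = {
    "node1": [1.0, 1.2, 0.9],
    "node2": [1.1, 1.0, 1.3],
    "node3": [9.0, 12.0, 10.5],
}
slow_nodes = find_slow_disk_nodes(samples)
```

Comparing each node against the median of its peers, rather than against a fixed limit, targets the specific condition described here: nodes in the same cluster whose disks perform differently from one another. The absolute floor (`min_latency_ms`) is there to avoid flagging nodes when all disks are fast and the differences are negligible.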
This applies to all Couchbase services that store data. I added ns_server and UI as the initial components because that is probably where we should start. We can add or change components as necessary once we have spent some time figuring out how to approach this platform-wide.
Issue Links
- relates to MB-34155 Support Auto-failover for exceptionally slow/hanging disks (Open)