Details
Type: Improvement
Status: In Progress
Priority: Critical
Resolution: Unresolved
None
1
Description
Follow-on to MB-44738 Autofailover for GSI main feature.
The counting and reporting of disk read/write errors is probably not fully containable in Neo for GSI code. This MB is to add that to the feature separately from implementing the main feature itself.
Additional changes needed beyond the Indexer:
- For external marketing, error counting is meant to be released all-or-nothing. To deliver work as it is completed rather than in a single big-bang delivery, we need to temporarily hard-code HealthCheck() in autofailover_service_manager.go to return 0 in the HealthInfo.DiskFailures field. (Currently it returns the detected error count.) Once everything that needs wrapping has been wrapped, this can be changed back to return the error count. (This is now done in https://review.couchbase.org/c/indexing/+/171285/31..32/secondary/indexer/autofailover_service_manager.go.)
- To avoid Plasma dependency on Index code, move new package goproj/src/github.com/couchbase/indexing/secondary/iowrap/ to goproj/src/github.com/couchbase/goutils/ioutils/. (This can be done after initial patch delivery.)
- Go package path/filepath also does disk IO but was not covered in "Autof2 Part 1" patch, e.g. common/util.go DiskUsage() calls filepath.Walk() which does disk IO. (Note the Go documentation says a more efficient version of this function, WalkDir(), exists starting in Go 1.16.) There may be uses of filepath functions that need to be covered in a follow-on patch.
- gometa needs wrappers.
- Plasma and ForestDB need wrappers. The Index-owned non-C parts of ForestDB are wrapped by the initial patch https://review.couchbase.org/c/indexing/+/171285, but the C code is not. (The C code will require a different implementation.) Indexer stores metadata in FDB, so the IO errors of those codepaths must be counted even though Autofailover is not supported in Community Edition.
- The following suggestion from Amit Kulkarni collides with the "do error counting same as KV" requirement from Jeelan Poola: "As of now, countDiskFailures counts Not Found errors in os.Open() as disk errors. This needs to be contextual." My (Kevin Cherkauer) original implementation counted only errors that are probably disk hardware failures (whitelist), so it never counted things like "file not found", "wrong permissions", or "disk full", because they are not hardware failures. Jeelan asked me instead to count errors the same way KV does, for cross-product consistency, so I replaced it with the current implementation, which counts every error returned by a disk operation except EINTR (blacklist). Dave Rigby said this is how KV does it, and in his view we should count errors from disk calls even when they are not hardware failures, because they could still mean somebody needs to do something to the disk (such as restoring files they accidentally deleted, fixing permissions they accidentally changed, or mounting more disk space).
Attachments
Gerrit Reviews
For Gerrit Dashboard: MB-48612
# | Subject | Branch | Project | Status | CR | V
---|---|---|---|---|---|---
171285,34 | MB-48612 Autof2 Part 1 (7.1.0 2335) Disk error counting for indexer code | unstable | indexing | MERGED | +2 | +1
zerrors_[os_name].go disk hardware and configuration error strings
The indexes and strings are not the same in all OSes, and not all OSes have the same set of errors either. The symbolic error code names like ENOSPC seem to be the same across OSes, but Go does not provide these (or the numeric indexes) in its error messages.
var error[] is the table of strings. The indexes are defined in a const block using the familiar Unix error code names (for example ENOENT and ENOSPC), which are generally documented in the man pages (https://www.man7.org/linux/man-pages/man2/open.2.html).
Go looks up the error strings for OS errors when it constructs an error string for a system call. It does NOT include the symbolic error name or numeric code in the error string. This means we have to collect and check for all variants of the human-readable strings across supported OSes. (Confirmed by trying to open a nonexistent file using path "non/existent/file"; the error returned is just "open non/existent/file: no such file or directory". "no such file or directory" is the zerrors_darwin_amd64.go error string for error code 0x2 ENOENT, but the error code and symbolic name are not in the returned error message.)
Disk errors we should count
Close but no banana. These are ones that initially sound like they could be hardware-related but are not: