Couchbase Server / MB-48612

Autofailover: Add disk read/write error counting




      Follow-on to MB-44738 Autofailover for GSI main feature.

      The counting and reporting of disk read/write errors probably cannot be fully completed for GSI code within Neo. This MB tracks adding it to the feature separately from implementing the main feature itself.


      Additional changes besides Indexer needed:

      1. For external marketing, error counting is meant to be released all-or-nothing. To deliver the work as it is done rather than in a single big-bang delivery, we need to temporarily hard-code autofailover_service_manager.go HealthCheck() to return 0 in the HealthInfo.DiskFailures field. (Currently it returns the detected error count.) Once everything that needs wrapping has been wrapped, this can be changed back to returning the error count. (This is now done in https://review.couchbase.org/c/indexing/+/171285/31..32/secondary/indexer/autofailover_service_manager.go.)
      2. To avoid a Plasma dependency on Index code, move the new package goproj/src/github.com/couchbase/indexing/secondary/iowrap/ to goproj/src/github.com/couchbase/goutils/ioutils/. (This can be done after the initial patch delivery.)
      3. The Go package path/filepath also does disk IO but was not covered in the "Autof2 Part 1" patch. For example, common/util.go DiskUsage() calls filepath.Walk(), which does disk IO. (Note that the Go documentation says a more efficient version of this function, WalkDir(), exists starting in Go 1.16.) There may be other uses of filepath functions that need to be covered in a follow-on patch.
      4. gometa needs wrappers.
      5. Plasma and ForestDB need wrappers. The Index-owned non-C parts of ForestDB are wrapped by the initial patch https://review.couchbase.org/c/indexing/+/171285, but the C code is not. (The C code will require a different implementation.) Indexer stores metadata in FDB, so the IO errors of those codepaths must be counted even though Autofailover is not supported in Community Edition.
      6. The following suggestion from Amit Kulkarni conflicts with the "do error counting same as KV" requirement from Jeelan Poola: "As of now, countDiskFailures counts Not Found errors in os.Open() as disk errors. This needs to be contextual." My (Kevin Cherkauer) original implementation counted only errors that probably indicate disk hardware failures (whitelist), so it never counted things like "file not found", "wrong permissions", or "disk full", because they are not hardware failures. Jeelan asked me to count errors the same way KV does instead, for cross-product consistency, so I replaced this with the current implementation, which counts every error from a disk operation except EINTR (blacklist). Dave Rigby said this is how KV does it, and in his view we should count errors from disk calls even when they are not hardware failures, as they could still mean somebody needs to do something to the disk (such as restoring files they accidentally deleted, fixing permissions they accidentally changed, or mounting more disk space).





            amit.kulkarni Amit Kulkarni
            kevin.cherkauer Kevin Cherkauer (Inactive)


