Couchbase Server / MB-48612

Autofailover: Add disk read/write error counting

    Description

      Follow-on to MB-44738 Autofailover for GSI main feature.

      The counting and reporting of disk read/write errors is probably not fully containable in Neo for GSI code. This MB adds it separately from the implementation of the main feature.

      Additional changes besides Indexer needed:

      1. For external marketing, error counting is meant to be released all-or-nothing. To deliver work as it is completed rather than in one big-bang delivery, we need to temporarily hard-code autofailover_service_manager.go HealthCheck() to return 0 in the HealthInfo.DiskFailures field. (Currently it returns the detected error count.) Once everything that needs wrapping has been wrapped, this can be changed back to return the error count. (This is now done in https://review.couchbase.org/c/indexing/+/171285/31..32/secondary/indexer/autofailover_service_manager.go.)
      2. To avoid Plasma dependency on Index code, move new package goproj/src/github.com/couchbase/indexing/secondary/iowrap/ to goproj/src/github.com/couchbase/goutils/ioutils/. (This can be done after initial patch delivery.)
      3. Go package path/filepath also does disk IO but was not covered in "Autof2 Part 1" patch, e.g. common/util.go DiskUsage() calls filepath.Walk() which does disk IO. (Note the Go documentation says a more efficient version of this function, WalkDir(), exists starting in Go 1.16.) There may be uses of filepath functions that need to be covered in a follow-on patch.
      4. gometa needs wrappers.
      5. Plasma and ForestDB need wrappers. Index-owned non-C parts of ForestDB are wrapped by initial patch https://review.couchbase.org/c/indexing/+/171285 but not the C code. (C code will require a different implementation.) Indexer stores metadata in FDB so the IO errors of those codepaths must be counted even though Autofailover is not supported in Community Edition.
      6. The following suggestion from Amit Kulkarni collides with the "do error counting same as KV" requirement from Jeelan Poola: "As of now, countDiskFailures counts Not Found errors in os.Open() as disk errors. This needs to be contextual." My (Kevin Cherkauer) original implementation counted only errors that are probably disk hardware failures (whitelist), so it never counted things like "file not found" or "wrong permissions" or "disk full" because they are not hardware failures. Jeelan asked me instead to count errors the same way KV does for cross-product consistency, so I replaced this with the current implementation that counts everything from a disk operation as an error other than EINTR (blacklist). Dave Rigby said this is how KV does it, and in his view we should count errors from disk calls even when they are not hardware failures, as they still could mean somebody needs to do something to the disk (like restore files they accidentally deleted or permissions they accidentally changed or mount more disk space).

        Activity

          kevin.cherkauer Kevin Cherkauer added a comment (edited)

          zerrors_[os_name].go disk hardware and configuration error strings. The indexes and strings are not the same in all OSes, and not all OSes have the same set of errors either. The symbolic error code names like ENOSPC seem to be the same across OSes, but Go does not provide these (or the numeric indexes) in its error messages.

          var errors (an array of strings indexed by error number) is the table of strings. The indexes are defined in a const block with familiar Unix error code names that generally are documented (https://www.man7.org/linux/man-pages/man2/open.2.html), e.g.

          • EIO = Errno(0x5)
          • ENOSPC = Errno(0x1c) – decimal 28

          Go looks up the error strings for OS errors when it constructs an error string for a system call. It does NOT include the symbolic error name or numeric code in the error string. This means we have to collect and check for all variants of the human-readable strings across supported OSes. (Confirmed by trying to open a nonexistent file using path "non/existent/file"; the error returned is just "open non/existent/file: no such file or directory". "no such file or directory" is the zerrors_darwin_amd64.go error string for error code 0x2 ENOENT, but the error code and symbolic name are not in the returned error message.)

          Disk errors we should count

          • mac – zerrors_darwin_amd64.go
          • linux – zerrors_linux_amd64.go
          • win – zerrors_windows.go – indexes omitted as they are not explicit in the file
          OS | Dec | Hex | Name | Error String | Comments
          mac, linux, win | 5 | 0x5 | EIO | input/output error | Usually a disk-related hardware failure.
          mac, linux, win | 28 | 0x1c | ENOSPC | no space left on device |
          mac | 69 | 0x45 | EDQUOT | disc quota exceeded | Same effect as ENOSPC. Mac uses British spelling.
          linux, win | 122 | 0x7a | EDQUOT | disk quota exceeded |
          mac | 82 | 0x52 | EPWROFF | device power is off | Does not exist on Linux or Windows.
          mac | 83 | 0x53 | EDEVERR | device error | Does not exist on Linux or Windows.
          linux, win | 121 | 0x79 | EREMOTEIO | remote I/O error | Does not exist on Mac.

          Close but no banana. These are ones that initially sound like they could be hardware-related but are not:

          OS | Dec | Hex | Name | Error String | Comments
          mac | 6 | 0x6 | ENXIO | device not configured | System config problem, e.g. "The file is a device special file and no corresponding device exists."
          linux, win | 6 | 0x6 | ENXIO | no such device or address |
          mac, linux, win | 9 | 0x9 | EBADF | bad file descriptor | Opened a file without the needed flags for doing the attempted operation, e.g. opened for read but tried to write.
          mac, linux, win | 29 | 0x1d | ESPIPE | illegal seek | Attempted an invalid operation on a (socket) stream.

           


          kevin.cherkauer Kevin Cherkauer added a comment

          Amit Kulkarni, Deepkaran Salooja, Jeelan Poola: I would prefer to get the initial patch delivered myself to wherever it needs to go before someone else takes it up while I am in Capella. I have temporarily hard-coded HealthCheck() to return 0 disk errors so that work can be delivered piece by piece without affecting external marketing messaging. Once 100% of the work is done, this can be changed back to return the detected error count. This way the feature can be enabled all-or-nothing, as Jeelan wanted. (Note that, as I feared, because this sat undelivered for so long it now has a merge conflict, but apparently only in one file.)

          build-team Couchbase Build Team added a comment

          Build couchbase-server-7.2.0-1099 contains indexing commit 45d60b6 with commit message:
          MB-48612 Autof2 Part 1 (7.1.0 2335) Disk error counting for indexer code

          People

            kevin.cherkauer Kevin Cherkauer
            kevin.cherkauer Kevin Cherkauer