Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-50273

Autofailover: getCachedIndexTopology remove scheduled indexes

    XMLWordPrintable

Details

    • Untriaged
    • 1
    • No

    Description

      request_handler.go getCachedIndexToplogy() is a custom, stripped-down version of getIndexStatus() used by autofailover_service_manager.go IsSafe() --> getLastKnownIndexTopology(cached = true) via REST call. Its purpose is to get cached index metadata from the local node to make the failover safety decision quickly. Thus its primary goals are:

      1. Do not contact any remote nodes.
      2. Do not use ClusterInfoCache

      as both of the above can be very slow and thus hold up Autofailover.

      Goal #2 is currently violated by the call chain IsSafe() --> getLastKnownIndexTopology(cached = true) --> REST /getCachedIndexToplogy --> handleCachedIndexTopologyRequest() --> getCachedIndexTopology() --> getScheduledIndexes() --> getIndexes() --> getIndexesFromTokens() --> makeIndexStatus() --> getNodeAddr() --> GetClusterInfoCache().

      An additional problem is that the caller is not holding the s.cinfo mutex, since it was unknown at the time of writing that any use of ClusterInfoCache was buried under getScheduledIndexes(). This could lead to a request handler crash.

      It was agreed in GSI scrum that IsSafe() can ignore scheduled indexes, because they do not exist yet, so failing over a node or set of nodes that constitute the only scheduled future host(s) of the index or some of its partition does not lose any indexes or disrupt any queries. The failover will migrate any scheduled tokens that need new homes to surviving nodes.

      Thus the fix is just to remove the lines

      // Add statuses of scheduled indexes that are not built yetschedIndexList := m.getScheduledIndexes(defns)
      indexStatuses = append(indexStatuses, schedIndexList...)

      from request_handler.go getCachedIndexTopology().

      The fix also needs to add a new flag, omitScheduled=true, to regular getIndexStatus() to be used by autofailover_service_manager.go getLastKnownIndexTopology(cached = false) so the details it logs will exclude scheduled indexes both for cached and non-cached retrievals, else debugging any discrepancies will be very difficult.

      This is not a regression as it occurs in code newly written for the Neo GSI Autofailover feature.

       

      Thanks to Sai Krishna Teja for discovering this problem!

      Attachments

        For Gerrit Dashboard: MB-50273
        # Subject Branch Project Status CR V

        Activity

          People

            kevin.cherkauer Kevin Cherkauer (Inactive)
            kevin.cherkauer Kevin Cherkauer (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty