  1. Couchbase Server
  2. MB-60495

Handle doScan/GetScanPorts for corrupt metadataClient bookkeeping


Details


    Description

      In a recent customer case, the bookkeeping of metadataClient was seen to go wrong when it failed to connect to a newly added indexer node.

      Once the rebalance is over and the new node takes over scans for the moved indices, scans also start failing because we cannot find the scanport for the destination node.

      The scenario goes as follows:

      1. We have a function, `safeupdate`, which is responsible for correcting the internal metadata.
      2. This function is called with either (map, true) or (nil, false), and it can be called by multiple goroutines.
      3. The hypothesis is that, for (nil, false) calls, we load the current meta into a variable `adminports` and create a new topology from it. After building the new topology, we do a CompareAndSwap (CAS) to update the topology.
      4. If the topology is updated concurrently, the CAS for the (nil, false) call can fail, and we retry. On the retry, however, we do not reset the `adminports` variable, so it can hold stale values. If the CAS succeeds on that run, the stale values are written into the topology and corrupt the bookkeeping.
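      The hypothesis above can be replayed deterministically in a small Go sketch. This is not the indexer code: the `topology` and `client` types, the `reloadOnRetry` flag, the `beforeCAS` hook, and the `nodeA:9100` key are all hypothetical names introduced for illustration. The hook forces a concurrent (map, true) update between the first read and the first CAS, which is one interleaving consistent with the description. With `reloadOnRetry` set to false the retry reuses the stale `adminports`; setting it to true shows one possible way to avoid that, by reloading the current meta on every attempt.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// topology is a hypothetical stand-in for IndexTopology: adminport -> indexer ID.
type topology struct {
	indexers map[string]string
}

// client is a hypothetical holder of the current topology pointer.
type client struct {
	topo atomic.Pointer[topology]
}

// clone copies the map so each topology snapshot owns its own data.
func clone(m map[string]string) map[string]string {
	out := make(map[string]string, len(m))
	for k, v := range m {
		out[k] = v
	}
	return out
}

// safeUpdate mimics the (map, true) / (nil, false) calling convention.
// reloadOnRetry=false reproduces the hypothesised bug: adminports is loaded
// once and reused on retries. beforeCAS is a test hook (an assumption of this
// sketch) that runs just before each CAS attempt.
func (c *client) safeUpdate(adminports map[string]string, force, reloadOnRetry bool, beforeCAS func(attempt int)) {
	for attempt := 1; ; attempt++ {
		curr := c.topo.Load()
		if !force && (adminports == nil || reloadOnRetry) {
			adminports = curr.indexers // load the current meta
		}
		next := &topology{indexers: clone(adminports)}
		if beforeCAS != nil {
			beforeCAS(attempt)
		}
		if c.topo.CompareAndSwap(curr, next) {
			return
		}
		// CAS failed: another update won the race, so try again.
	}
}

// replay runs the interleaving and returns the indexer ID left for nodeA.
func replay(reloadOnRetry bool) string {
	c := &client{}
	c.topo.Store(&topology{indexers: map[string]string{"nodeA:9100": "temp-id"}})
	c.safeUpdate(nil, false, reloadOnRetry, func(attempt int) {
		if attempt == 1 {
			// The watcher connects and pushes the correct ID concurrently.
			c.safeUpdate(map[string]string{"nodeA:9100": "real-id"}, true, false, nil)
		}
	})
	return c.topo.Load().indexers["nodeA:9100"]
}

func main() {
	fmt.Println("stale adminports on retry:", replay(false))
	fmt.Println("reloaded adminports on retry:", replay(true))
}
```

      In the stale variant the final topology holds the temporary ID even though the correct ID had already been published, which matches the corruption described in the ticket.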

      Detailed walkthrough:

      1. WatchMetadata(new node, nodeA) -> fails on the first attempt, creates a temp ID for the indexer, adds it to IndexTopology, and retries in the background.
      2. safeupdate(nil, false) -> reads currmeta (with the temp indexer ID), stores it in adminports, creates a new topology, and tries to update IndexTopology using CAS. The CAS fails.
      3. Watcher connected for nodeA -> safeupdate(adminports, true) -> creates a new topology using the adminports passed as a parameter (correct indexer ID) and tries to update IndexTopology using CAS. The CAS passes; it sends a message to update the scan clients and exits.
      4. <retry> safeupdate(nil, false) -> adminports is not reset, so it uses the old adminports (with the temp indexer ID) from step 2, creates a new topology, and tries to update IndexTopology using CAS. The CAS passes; it sends a message to update the scan clients and exits.

      1. When the second safeupdate (the retried (nil, false) call) goes through, it also removes the scanport for the new node from IndexTopology.
      2. Once the rebalance is over, the new node takes over the indices, and scans start considering the replica on the new node.
      3. If `pickRandom` selects the replica on the new node but no scanport is available for it, the scan fails with the error `Fail to find indexers to satisfy query request. Terminate scan for index %v`
      4. This needs to be improved: there can be an active replica with a valid scanport, and we do not select it. One thing to consider is how to tell temporarily unavailable scanports apart from corrupt bookkeeping.

            People

              dhruvil.ketanshah Dhruvil Shah