Details

Type: Improvement
Resolution: Unresolved
Priority: Critical
Fix Version/s: Morpheus
Affects Version/s: 7.2.2
Component/s: secondary-index
Labels:
- customer-issue

Story Points:
0

Description

In a recent customer case, the metadataClient bookkeeping went wrong and the client failed to connect successfully to the newly added indexer node.

Once the rebalance is over and the new node takes over scans for the moved indices, scans also start failing because we cannot find the scanport for the destination node.

The scenario is as follows:

We have a function, safeupdate, which is responsible for correcting the internal metadata.
This function is called with either (map, true) or (nil, false), and it can be called by multiple goroutines.
The hypothesis is that, for (nil, false) calls, we load the current metadata into the variable `adminports` and create a new topology from it. After building the new topology, we do a CompareAndSwap (CAS) to update the topology.
If the topology is updated concurrently, the CAS for the (nil, false) call can fail, and we retry. The difference on the retry is that we do not reset the `adminports` variable, so it can still hold stale values. When the CAS then succeeds on the retry, those stale values are written into the topology and corrupt the bookkeeping.
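One possible way to avoid the stale retry described above is to derive the ports from the snapshot loaded in each attempt, rather than from a variable that survives across attempts. The following is a minimal sketch under stated assumptions, not the actual safeupdate code: the `topology` and `client` types, the `safeUpdate` signature, the `beforeCAS` hook, and the `nodeA:9100` key are all hypothetical and exist only to make the example self-contained.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// topology is a hypothetical, simplified stand-in for the client's topology.
type topology struct {
	adminports map[string]string // adminport -> indexer ID (simplified)
}

// client holds the current topology behind an atomic pointer.
type client struct {
	topo atomic.Pointer[topology]
}

// clonePorts copies the map so that topologies never share state.
func clonePorts(in map[string]string) map[string]string {
	out := make(map[string]string, len(in))
	for k, v := range in {
		out[k] = v
	}
	return out
}

// safeUpdate rebuilds the topology and publishes it with a CAS retry loop.
// When adminports is nil, the ports come from the snapshot loaded in the
// current attempt, so a retry never reuses values from a failed attempt.
// beforeCAS is a hypothetical hook used here to inject a concurrent update.
func (c *client) safeUpdate(adminports map[string]string, beforeCAS func()) {
	for {
		curr := c.topo.Load()
		ports := adminports // per-attempt variable; the parameter is never overwritten
		if ports == nil {
			ports = curr.adminports
		}
		next := &topology{adminports: clonePorts(ports)}
		if beforeCAS != nil {
			beforeCAS()
			beforeCAS = nil // fire only on the first attempt
		}
		if c.topo.CompareAndSwap(curr, next) {
			return
		}
	}
}

func main() {
	c := &client{}
	c.topo.Store(&topology{adminports: map[string]string{"nodeA:9100": "temp-id"}})

	// A concurrent writer publishes the real indexer ID between load and CAS,
	// which forces the first CAS to fail and the loop to retry.
	c.safeUpdate(nil, func() {
		c.topo.Store(&topology{adminports: map[string]string{"nodeA:9100": "real-id"}})
	})

	fmt.Println(c.topo.Load().adminports["nodeA:9100"]) // prints "real-id"
}
```

Because `ports` is recomputed at the top of every iteration, the retry sees the concurrently published value instead of the one read before the failed CAS.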

Detailed walkthrough:

1. WatchMetadata(new node, nodeA) -> fails on the first attempt, creates a temp ID for the indexer, adds it to IndexTopology, and retries in the background.

2. safeupdate(nil, false) -> reads currmeta (with the temp indexer ID), stores it in adminports, creates a new topology, and tries to update IndexTopology using CAS. The CAS fails.

3. Watcher connects for nodeA -> safeupdate(adminports, true) -> creates a new topology using the adminports passed as a parameter (with the correct indexer ID) and tries to update IndexTopology using CAS. The CAS passes; it sends a message to update the scan clients and exits.

4. <retry> safeupdate(nil, false) -> adminports is not reset, so it uses the old adminports (with the temp indexer ID) from step 2, creates a new topology, and tries to update IndexTopology using CAS. The CAS passes; it sends a message to update the scan clients and exits.
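The walkthrough above can be replayed deterministically in a single goroutine. This is only a sketch of the hypothesis, assuming that the watcher's update lands between the first read and the first CAS of the (nil, false) call. The `topology` and `client` types, the `cas` helper, and the `temp-id` and `real-id` values are hypothetical simplifications, not the real data structures.

```go
package main

import "fmt"

// topology is a hypothetical, simplified IndexTopology.
type topology struct {
	indexerID map[string]string // node -> indexer ID
}

// client keeps a plain pointer; the replay is single-threaded, so a
// pointer comparison is enough to model CompareAndSwap.
type client struct {
	topo *topology
}

// cas swaps in next only if the published topology is still old.
func (c *client) cas(old, next *topology) bool {
	if c.topo != old {
		return false
	}
	c.topo = next
	return true
}

// replay walks through the four steps and returns the result of the first
// CAS of the (nil, false) call and the final indexer ID recorded for nodeA.
func replay() (firstCAS bool, finalID string) {
	// Step 1: WatchMetadata fails, so nodeA is added with a temp ID.
	c := &client{topo: &topology{indexerID: map[string]string{"nodeA": "temp-id"}}}

	// Step 2: safeupdate(nil, false) reads currmeta and keeps it in adminports.
	curr := c.topo
	adminports := curr.indexerID // becomes stale after step 3
	next := &topology{indexerID: adminports}

	// Step 3: the watcher connects and safeupdate(adminports, true) wins.
	c.cas(c.topo, &topology{indexerID: map[string]string{"nodeA": "real-id"}})

	// Step 2, continued: the CAS fails because the topology has moved on.
	firstCAS = c.cas(curr, next)

	// Step 4: the retry reloads currmeta but does not reset adminports.
	curr = c.topo
	next = &topology{indexerID: adminports} // still carries temp-id
	c.cas(curr, next)                       // passes and overwrites real-id

	return firstCAS, c.topo.indexerID["nodeA"]
}

func main() {
	fmt.Println(replay()) // prints "false temp-id"
}
```

The final state holds the temp ID again even though the watcher had already published the correct one, which matches the corruption described in step 4.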

When the second safeupdate goes through, it also removes the scanport for the new node from IndexTopology.
Once the rebalance is over, the new node takes over the indices and scans start considering the replica on the new node.
If `pickRandom` selects the replica on the new node but no scanport is available for it, the scan fails with the error `Fail to find indexers to satisfy query request. Terminate scan for index %v`.
This needs to be improved, because an active replica with a valid scanport may exist and we do not select it. One thing to consider is how to distinguish temporarily unavailable scanports from corrupt bookkeeping.

Attachments

Issue Links

Clones

MB-60460 Corruption in metadataClient bookkeeping

Closed

Gerrit Reviews

Activity

People

Assignee: Dhruvil Shah

Reporter: Dhruvil Shah

Votes: 0

Watchers: 3

Dates

Due: 25/Jan/24

Created: 23/Jan/24 3:50 AM

Updated: 23/Jan/24 4:10 AM

Gerrit Reviews

There are no open Gerrit changes

Handle doScan/GetScanPorts for corrupt metadataClient bookkeeping
