1. During rebalance of a partitioned index, individual partitions move between nodes.
For example, partitions P1 and P2 move from nodeA to nodeB and nodeC.
2. Once the partitions have successfully moved to nodeB/nodeC, they need to be cleaned up on nodeA. The cleanup involves both metadata cleanup and actual partition data cleanup.
3. After metadata cleanup, a tombstone is created to make sure the actual partition data cleanup can still happen even if the indexer crashes.
In this example, two tombstone records are created (one for P1 and one for P2). Each tombstone has its own instanceId and is in PendingDelete state.
4. The indexer deletes these tombstones on its next restart.
Also, if a partition with a tombstone is recreated as part of a subsequent rebalance (e.g. P1 or P2 moving back to nodeA), the tombstone for that partition is cleaned up so that the indexer doesn't delete a valid partition after restart.
5. The tombstone cleanup mechanism has a bug: if there are multiple tombstones for the same index, some of them can be skipped by the cleanup.
The function returns prematurely after it has found the first tombstone.
6. If such leftover tombstones are present, any indexer restart can lead to partition data being cleaned up for a valid partition on the node.
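
The early return described in step 5 can be illustrated with a minimal Go sketch. This is not the indexer's actual code: the Instance type and the functions removeFirstTombstone, removeAllTombstones and countTombstones are hypothetical names made up for illustration, and the metadata is modeled as a plain slice. It only shows how stopping at the first match leaves a second tombstone behind, and how scanning the full list avoids that.

```go
package main

// InstState is a hypothetical stand-in for an index instance state.
type InstState int

const (
	Active InstState = iota
	PendingDelete // state carried by a tombstone
)

// Instance is a hypothetical, simplified metadata record for one
// index instance. A tombstone is an Instance in PendingDelete state.
type Instance struct {
	InstID    uint64
	Partition int
	State     InstState
}

// removeFirstTombstone mimics the buggy behavior: it drops the first
// tombstone whose partition was recreated and returns immediately,
// so any further tombstones for the same index are left in place.
func removeFirstTombstone(insts []Instance, recreated map[int]bool) []Instance {
	for i, inst := range insts {
		if inst.State == PendingDelete && recreated[inst.Partition] {
			// Premature return: the rest of the list is never checked.
			out := append([]Instance{}, insts[:i]...)
			return append(out, insts[i+1:]...)
		}
	}
	return insts
}

// removeAllTombstones sketches the intended behavior: it scans every
// record and drops all tombstones whose partition was recreated.
func removeAllTombstones(insts []Instance, recreated map[int]bool) []Instance {
	kept := make([]Instance, 0, len(insts))
	for _, inst := range insts {
		if inst.State == PendingDelete && recreated[inst.Partition] {
			continue // tombstone for a recreated partition: remove it
		}
		kept = append(kept, inst)
	}
	return kept
}

// countTombstones returns how many records are still in PendingDelete state.
func countTombstones(insts []Instance) int {
	n := 0
	for _, inst := range insts {
		if inst.State == PendingDelete {
			n++
		}
	}
	return n
}
```

With tombstones for both P1 and P2 and both partitions moving back to nodeA, the first function leaves one tombstone behind while the second leaves none. In the scenario above, that leftover tombstone is what would cause a valid partition to be deleted on the next restart.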