Description
I've run this two-node failure scenario:
- load 100k items into NodeA
- rebalance in NodeB
- crash NodeA during the rebalance
- fail over NodeA and continue rebalancing in NodeB
- Some data loss is expected, but stream requests only return an empty initial snapshot
To keep track of the failover tables, here are NodeA's vb0 stats before crashing:
{'failovers:vb_0:0:id': '53942919883496', 'failovers:vb_0:num_entries': '1', 'failovers:vb_0:0:seq': '0'} {'vb_0:purge_seqno': '0', 'vb_0:uuid': '53942919883496', 'vb_0:high_seqno': '173'}
After NodeA is rebalanced out, here are the stats on NodeB:
{'failovers:vb_0:0:id': '66362171126476', 'failovers:vb_0:num_entries': '1', 'failovers:vb_0:0:seq': '0'} {'vb_0:purge_seqno': '0', 'vb_0:uuid': '66362171126476', 'vb_0:high_seqno': '0'}
There is still data, although high_seqno has become 0.
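For reference, the discrepancy above can be checked mechanically. This is a minimal sketch that operates on the stat values quoted in this ticket (dicts copied verbatim, not live cluster calls); the helper name `vb0_regressed` is mine, not testrunner's:

```python
# Failover-log and vbucket-details stats for vb0, copied from this ticket.
node_a = {'failovers:vb_0:0:id': '53942919883496',
          'failovers:vb_0:num_entries': '1',
          'failovers:vb_0:0:seq': '0',
          'vb_0:purge_seqno': '0',
          'vb_0:uuid': '53942919883496',
          'vb_0:high_seqno': '173'}
node_b = {'failovers:vb_0:0:id': '66362171126476',
          'failovers:vb_0:num_entries': '1',
          'failovers:vb_0:0:seq': '0',
          'vb_0:purge_seqno': '0',
          'vb_0:uuid': '66362171126476',
          'vb_0:high_seqno': '0'}

def vb0_regressed(before, after):
    """True when vb_0 got a new failover UUID but its high seqno fell."""
    new_branch = before['vb_0:uuid'] != after['vb_0:uuid']
    seqno_lost = int(after['vb_0:high_seqno']) < int(before['vb_0:high_seqno'])
    return new_branch and seqno_lost

print(vb0_regressed(node_a, node_b))  # True for the stats above
```

A check along these lines is the kind of assertion the testrunner tests mentioned below could make after the rebalance completes.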
Attempting to stream from NodeB gives:
{'status': 0, 'body': '', 'opcode': 80} {'status': 0, 'opcode': 83, 'failover_log': [(66362171126476, 0)]} {'vbucket': 0, 'opcode': 86}
The behavior here is similar to MB-10947, except the reason for sending empty_item is that the CheckpointManager thinks the vbucket is still in the backfill phase.
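To make the symptom concrete, here is a small sketch that classifies the response trace above as an "empty initial snapshot": the stream request succeeds and the failover log starts at seqno 0, yet no mutations arrive. The opcode constants are assumptions inferred from the printed output (83 appears to be the stream-request response carrying the failover log; 87 for mutations is my guess and should be checked against the client's UPR/DCP opcode table):

```python
# Responses quoted above, as printed by the repro script.
responses = [
    {'status': 0, 'body': '', 'opcode': 80},
    {'status': 0, 'opcode': 83, 'failover_log': [(66362171126476, 0)]},
    {'vbucket': 0, 'opcode': 86},
]

# Hypothetical opcode constants, inferred from this trace.
STREAM_REQ = 83  # response carries the failover log on success
MUTATION = 87    # assumed mutation opcode; none appear in this trace

def empty_initial_snapshot(msgs):
    """True when the stream request succeeded, the failover log starts
    at seqno 0, and no mutation messages were delivered."""
    req = next(m for m in msgs if m['opcode'] == STREAM_REQ)
    ok = req['status'] == 0
    starts_at_zero = req['failover_log'][0][1] == 0
    no_mutations = not any(m['opcode'] == MUTATION for m in msgs)
    return ok and starts_at_zero and no_mutations

print(empty_initial_snapshot(responses))  # True for this trace
```

With 100k items loaded, a healthy stream would deliver mutations after the stream-request response; this trace ends without any, which is the data-loss symptom described above.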
I have a script to repro this, but I'm adding these tests to testrunner.