Description
Explore the option of separating the KV rebalance and service rebalance steps. This would allow the janitor to run while the service rebalance is in progress, and would help solve the extended unavailability that currently arises when the memcached process crashes during an ongoing rebalance of a topology-aware service.
The unavailability problem stems from the fact that if memcached restarts while the service rebalance is in progress, the janitor cannot run to bring the buckets back online. If the service undergoing rebalance requires access to the buckets, this leads to a deadlock. Currently, some services (eventing, for example) abort their rebalance operation when such a situation occurs.
Still, there is value in exploring the possibility of allowing the janitor to run during service rebalance, as this would fix the unavailability problem and allow the ongoing rebalance to succeed.
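To make the deadlock concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions: `ToyCluster`, `simulate`, and the phase names are hypothetical and do not correspond to actual ns_server code. It assumes the janitor is blocked for the whole rebalance today, and that with separated steps only the KV phase would block it.

```python
from dataclasses import dataclass


@dataclass
class ToyCluster:
    """Hypothetical toy model of the scenario; not ns_server code."""
    split_phases: bool            # True: KV and service rebalance are separate steps
    phase: str = "idle"           # "idle", "kv_rebalance", or "service_rebalance"
    buckets_online: bool = True

    def janitor_may_run(self) -> bool:
        if self.phase == "idle":
            return True
        # Assumption: with separated steps, only the KV phase blocks the janitor.
        return self.split_phases and self.phase == "service_rebalance"

    def memcached_restart(self) -> None:
        # A memcached restart takes the buckets offline until the janitor runs.
        self.buckets_online = False

    def run_janitor(self) -> bool:
        # The janitor brings buckets back online, but only if it is allowed to run.
        if self.janitor_may_run():
            self.buckets_online = True
            return True
        return False

    def service_rebalance_step(self) -> str:
        # The rebalancing service is assumed to need bucket access to progress.
        return "progress" if self.buckets_online else "stuck"


def simulate(split_phases: bool) -> str:
    cluster = ToyCluster(split_phases=split_phases, phase="service_rebalance")
    cluster.memcached_restart()
    cluster.run_janitor()
    return cluster.service_rebalance_step()


if __name__ == "__main__":
    print("combined steps: ", simulate(split_phases=False))
    print("separated steps:", simulate(split_phases=True))
```

In this model the combined case leaves the service waiting on buckets that the janitor is not allowed to restore, while the separated case lets the janitor run and the service rebalance continue.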
Issue Links
- depends on: MB-29271 Eventing Rebalance in hangs when memcached is killed on kv and eventing nodes (Closed)