Details
- Task
- Resolution: Unresolved
- Major
- 2.2.0
- Security Level: Public
Description
In the field, we often see a node get auto-failed over when it is 'slow' due to the OS. During this 'slow' period, the memcached process on that node keeps handling gets/sets from clients without any issues.
Often the issue comes down to the Erlang nodes being unable to communicate with each other for some reason that does not affect memcached. It is sometimes blamed on swap, THP (transparent huge pages), Erlang's internal balancing among threads, etc.
Should we look at moving the auto-failover logic out of Erlang to help prevent some of these 'false' failovers?
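To make the question concrete, here is a rough sketch of one possible decision rule. It is an illustration only, not how the current auto-failover works. It assumes a node is failed over only when both the cluster-manager heartbeat and a separate data-path probe against memcached fail for several consecutive samples. `NodeHealth`, `should_auto_failover`, and `required_failures` are hypothetical names made up for this example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class NodeHealth:
    """Hypothetical health sample for one node (illustrative names only)."""
    manager_heartbeat_ok: bool  # cluster manager (Erlang side) answered in time
    data_path_ok: bool          # memcached answered a probe in time


def should_auto_failover(samples: Sequence[NodeHealth],
                         required_failures: int = 3) -> bool:
    """Return True only if the most recent `required_failures` samples all
    show BOTH the manager heartbeat and the data path failing."""
    if required_failures <= 0:
        raise ValueError("required_failures must be positive")
    if len(samples) < required_failures:
        return False  # not enough evidence yet
    recent = samples[-required_failures:]
    return all(
        not s.manager_heartbeat_ok and not s.data_path_ok for s in recent
    )


# A 'slow' node: heartbeats are missed, but memcached still serves clients.
slow_node = [NodeHealth(manager_heartbeat_ok=False, data_path_ok=True)] * 5
# A node that is really down: neither side responds.
dead_node = [NodeHealth(manager_heartbeat_ok=False, data_path_ok=False)] * 5

print(should_auto_failover(slow_node))  # False
print(should_auto_failover(dead_node))  # True
```

Under this assumed rule, the 'slow' node described above would not be failed over as long as memcached keeps answering, while a node that stops responding on both paths still would be.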
Issue Links
- relates to: MB-9321 Implement new cluster orchestration (was: Get us off erlang's global facility and re-elect failed master quickly and safely) (Resolved)