Details
-
Bug
-
Resolution: Fixed
-
Critical
-
6.6.6, 7.1.4, 7.0.5, 7.2.0
-
Triaged
-
0
-
Unknown
Description
TL;DR
timer:apply_after(...) calls may never execute the given function if we hit the system process limit. Should we recover from the issue causing us to hit the system process limit the application may never recover as timers that we expect to have fired may not have. supervisor2 uses timer:apply_after(...) if it fails to restart a process (possibly due to hitting the system process limit), this can causes processes to "go missing" as the timer not firing causes the process to remain stuck in the "restarting" state.
Erlang issue - https://github.com/erlang/otp/issues/7606
Original Description
It was observed in a recent issue that a health monitoring processes crashed and caused a max restart intensity error to crash the health_monitor_sup. The supervisor above it, ns_server_sup, is a "supervisor2". When the health_monitor_sup crashed ns_server_sup attempted to restart it but hit a system limit (too many processes) error and never attempted to restart the process again, even after the system limit error was resolved. The cluster had multiple nodes reported as unhealthy as the health monitoring processes were no longer present for over a week.
I wrote a unit test that reproduces the observed issue as exactly as possible, it has the same supervision tree with a single worker process as the leaf (health monitoring process). It lowers the process limit in the Erlang VM and creates a bunch of dummy processes such that we cannot restart the worker or supervisor. The supervisor2 only attempted to restart the child process that died once, and at the end of the test the child process was not running. The same test was ran for the regular supervisor which restarted the child as expected.
supervisor2 is a copy of supervisor.erl from R16B Erlang/OTP with some modifications to it, and a few bug fix changes applied. Bug fixes were last applied from the base supervisor.erl in 2015.
Attachments
For Gerrit Dashboard: MB-58410 | ||||||
---|---|---|---|---|---|---|
# | Subject | Branch | Project | Status | CR | V |
196012,6 | MB-58410: Supervisor2 broken test | master | ns_server | Status: ABANDONED | 0 | 0 |
196304,16 | MB-58410: Remove supervisor_3 | master | ns_server | Status: MERGED | +2 | +1 |
196306,4 | MB-58410: Remove use of timer:apply_all in supervisor2 | master | ns_server | Status: MERGED | +2 | +1 |
196499,11 | MB-58410: Add max_restart_intensity test | master | ns_server | Status: MERGED | +2 | +1 |
196617,24 | MB-58410: Add suppress_max_r... to avoid max_restart_intensity | master | ns_server | Status: MERGED | +2 | +1 |
196695,4 | MB-58410: Improve logging supervisor_cushion | master | ns_server | Status: MERGED | +2 | +1 |
196804,11 | MB-58410: Migrate child specs to supress_max_restart_intensity | master | ns_server | Status: MERGED | +2 | +1 |
196805,3 | MB-58410: Migrate ns_server_nodes_sup to supress_max_restart_intensity | master | ns_server | Status: ABANDONED | 0 | +1 |
196806,4 | MB-58410: Migrate ns_server_sup to supress_max_restart_intensity | master | ns_server | Status: ABANDONED | 0 | +1 |
196874,5 | MB-58410: Add always_delay option to supervisor_cushion | master | ns_server | Status: MERGED | +2 | +1 |
197292,9 | MB-58410: Swap supervisor2s to supervisor | master | ns_server | Status: MERGED | +2 | +1 |
197293,10 | MB-58410: Remove supervisor2 | master | ns_server | Status: MERGED | +2 | +1 |
197300,5 | MB-58410: Remove internal tuple child spec from supress_max_r | master | ns_server | Status: ABANDONED | 0 | 0 |
197415,2 | MB-58410: Add restartable test | master | ns_server | Status: MERGED | +2 | +1 |
197477,5 | MB-58410: Update comments in suppress_max_restart_intensity | master | ns_server | Status: MERGED | +2 | +1 |
197641,3 | MB-58410: Do not attempt to restart ns_capi_ssl_service if not running | master | ns_server | Status: MERGED | +2 | +1 |