Loading...

XML

Word

Printable

Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: 7.6.0
Affects Version/s: 6.6.6, 7.1.4, 7.0.5, 7.2.0
Component/s: ns_server
Labels:
- request-dev-verify

Triage:
Triaged
Story Points:
0
Is this a Regression?:
Unknown

Description

TL;DR

timer:apply_after(...) calls may never execute the given function if we hit the system process limit. Should we recover from the issue causing us to hit the system process limit the application may never recover as timers that we expect to have fired may not have. supervisor2 uses timer:apply_after(...) if it fails to restart a process (possibly due to hitting the system process limit), this can causes processes to "go missing" as the timer not firing causes the process to remain stuck in the "restarting" state.

Erlang issue - https://github.com/erlang/otp/issues/7606

Original Description

It was observed in a recent issue that a health monitoring processes crashed and caused a max restart intensity error to crash the health_monitor_sup. The supervisor above it, ns_server_sup, is a "supervisor2". When the health_monitor_sup crashed ns_server_sup attempted to restart it but hit a system limit (too many processes) error and never attempted to restart the process again, even after the system limit error was resolved. The cluster had multiple nodes reported as unhealthy as the health monitoring processes were no longer present for over a week.

I wrote a unit test that reproduces the observed issue as exactly as possible, it has the same supervision tree with a single worker process as the leaf (health monitoring process). It lowers the process limit in the Erlang VM and creates a bunch of dummy processes such that we cannot restart the worker or supervisor. The supervisor2 only attempted to restart the child process that died once, and at the end of the test the child process was not running. The same test was ran for the regular supervisor which restarted the child as expected.

supervisor2 is a copy of supervisor.erl from R16B Erlang/OTP with some modifications to it, and a few bug fix changes applied. Bug fixes were last applied from the base supervisor.erl in 2015.

Attachments

Gerrit Reviews

- Issue Only
- Show All Reviews
- Show Open Reviews
- Show All Issues
- Show Open Issues

For Gerrit Dashboard: MB-58410
#	Subject	Branch	Project	Status	CR	V
196012,6	MB-58410: Supervisor2 broken test	master	ns_server	Status: ABANDONED	0	0
196304,16	MB-58410: Remove supervisor_3	master	ns_server	Status: MERGED	+2	+1
196306,4	MB-58410: Remove use of timer:apply_all in supervisor2	master	ns_server	Status: MERGED	+2	+1
196499,11	MB-58410: Add max_restart_intensity test	master	ns_server	Status: MERGED	+2	+1
196617,24	MB-58410: Add suppress_max_r... to avoid max_restart_intensity	master	ns_server	Status: MERGED	+2	+1
196695,4	MB-58410: Improve logging supervisor_cushion	master	ns_server	Status: MERGED	+2	+1
196804,11	MB-58410: Migrate child specs to supress_max_restart_intensity	master	ns_server	Status: MERGED	+2	+1
196805,3	MB-58410: Migrate ns_server_nodes_sup to supress_max_restart_intensity	master	ns_server	Status: ABANDONED	0	+1
196806,4	MB-58410: Migrate ns_server_sup to supress_max_restart_intensity	master	ns_server	Status: ABANDONED	0	+1
196874,5	MB-58410: Add always_delay option to supervisor_cushion	master	ns_server	Status: MERGED	+2	+1
197292,9	MB-58410: Swap supervisor2s to supervisor	master	ns_server	Status: MERGED	+2	+1
197293,10	MB-58410: Remove supervisor2	master	ns_server	Status: MERGED	+2	+1
197300,5	MB-58410: Remove internal tuple child spec from supress_max_r	master	ns_server	Status: ABANDONED	0	0
197415,2	MB-58410: Add restartable test	master	ns_server	Status: MERGED	+2	+1
197477,5	MB-58410: Update comments in suppress_max_restart_intensity	master	ns_server	Status: MERGED	+2	+1
197641,3	MB-58410: Do not attempt to restart ns_capi_ssl_service if not running	master	ns_server	Status: MERGED	+2	+1