  1. Couchbase Server
  2. MB-58410

Supervisor2 may not continue to restart processes if the first restart attempt fails

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • 7.6.0
    • 6.6.6, 7.1.4, 7.0.5, 7.2.0
    • ns_server
    • Triaged
    • 0
    • Unknown

    Description

      TL;DR

      timer:apply_after(...) calls may never execute the given function if we hit the system process limit. Even if we later recover from whatever caused us to hit the limit, the application may not recover, because timers that we expected to fire may never have fired. supervisor2 uses timer:apply_after(...) when it fails to restart a process (possibly because it hit the system process limit). This can cause processes to "go missing": the timer never fires, so the process remains stuck in the "restarting" state.

      Erlang issue - https://github.com/erlang/otp/issues/7606

      Original Description

      In a recent issue, a health monitoring process crashed, and the resulting max restart intensity error crashed health_monitor_sup. The supervisor above it, ns_server_sup, is a "supervisor2". When health_monitor_sup crashed, ns_server_sup attempted to restart it but hit a system limit (too many processes) error. It never attempted the restart again, even after the system limit error was resolved. Multiple nodes in the cluster were reported as unhealthy because the health monitoring processes were absent for over a week.

      I wrote a unit test that reproduces the observed issue as closely as possible. It has the same supervision tree, with a single worker process as the leaf (the health monitoring process). It lowers the process limit in the Erlang VM and creates a bunch of dummy processes so that neither the worker nor the supervisor can be restarted. supervisor2 attempted to restart the dead child process only once, and at the end of the test the child process was not running. The same test was run against the regular supervisor, which restarted the child as expected.
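
      To make the failure mode concrete, here is a toy model of it in Python. It is not the unit test and not supervisor2 code. ToyVM, ToySupervisor and run_scenario are made-up names, and two things are assumptions rather than facts from this ticket: that the timer drops the callback because it needs a free process slot to run it, and that the regular supervisor retries without needing a new process (modelled here as a message to itself).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


class SystemLimit(Exception):
    """Stands in for Erlang's system_limit error (too many processes)."""


@dataclass
class ToyVM:
    """Counts live processes against a fixed limit."""
    process_limit: int
    process_count: int = 0

    def spawn(self) -> None:
        if self.process_count >= self.process_limit:
            raise SystemLimit("too many processes")
        self.process_count += 1

    def exit_process(self, n: int = 1) -> None:
        self.process_count -= n


@dataclass
class ToySupervisor:
    vm: ToyVM
    retry_via_timer: bool  # True models the supervisor2 retry path
    child_state: str = "running"
    attempts: int = 0
    pending: List[Callable[[], None]] = field(default_factory=list)

    def child_died(self) -> None:
        self.child_state = "restarting"
        self.try_restart()

    def try_restart(self) -> None:
        self.attempts += 1
        try:
            self.vm.spawn()
            self.child_state = "running"
        except SystemLimit:
            # Restart failed: schedule exactly one retry.
            self.pending.append(self.try_restart)

    def tick(self) -> None:
        """Fire every retry that is currently due."""
        due, self.pending = self.pending, []
        for retry in due:
            if not self.retry_via_timer:
                retry()  # assumed: self-message, needs no new process
                continue
            try:
                self.vm.spawn()  # assumed: timer needs a helper process
            except SystemLimit:
                continue  # callback is lost and nothing reschedules it
            try:
                retry()
            finally:
                self.vm.exit_process()  # helper exits after the callback


def run_scenario(retry_via_timer: bool) -> Tuple[str, int]:
    vm = ToyVM(process_limit=10)
    vm.spawn()  # the worker child
    sup = ToySupervisor(vm, retry_via_timer)

    vm.exit_process()        # the child crashes...
    for _ in range(10):      # ...and dummy processes fill every slot
        vm.spawn()

    sup.child_died()         # first restart attempt hits the limit
    sup.tick()               # retry comes due while the VM is still full
    vm.exit_process(10)      # dummies exit: the limit error is resolved
    for _ in range(5):
        sup.tick()
    return sup.child_state, sup.attempts


if __name__ == "__main__":
    print(run_scenario(retry_via_timer=True))   # ('restarting', 1)
    print(run_scenario(retry_via_timer=False))  # ('running', 3)
```

      In the timer-based run the single scheduled retry is lost while the VM is full, so the child stays in "restarting" after one attempt even once slots are free again. That matches what the test saw with supervisor2. In the other run each failed attempt reschedules itself, so the child comes back.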

      supervisor2 is a copy of supervisor.erl from Erlang/OTP R16B with some modifications and a few bug fixes applied. Bug fixes from the base supervisor.erl were last applied in 2015.

          People

            ben.huddleston Ben Huddleston