  1. Couchbase Monitoring and Observability Stack
  2. CMOS-345

Agent: handle sub-systems exiting abnormally

Details

    • Task
    • Resolution: Unresolved
    • Major
    • 1.0
    • None
    • cluster-monitor
    • None

    Description

      Currently, agent/cmd/cbhealthagent/main.go shuts the agent down when all services exit cleanly, and aborts start-up if a service cannot be created at all. It does not handle the case where a service starts up and then fails: the failed service decrements the WaitGroup, but the counter never reaches zero while the other services are still running, so nothing happens (and main() only terminates when it receives a SIGINT).
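
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the agent's actual code: runService, waitReturns and demo are hypothetical names, and it assumes each service simply calls Done() on a shared sync.WaitGroup when it returns.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// runService stands in for a sub-system (hypothetical, not the agent's API).
// A non-nil err simulates a service that fails shortly after start-up.
func runService(wg *sync.WaitGroup, stop <-chan struct{}, err error) {
	defer wg.Done()
	if err != nil {
		return // abnormal exit: the counter drops, but nobody is told why
	}
	<-stop // a healthy service runs until it is asked to stop
}

// waitReturns reports whether wg.Wait() returns within the given timeout.
func waitReturns(wg *sync.WaitGroup, timeout time.Duration) bool {
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(timeout):
		return false
	}
}

// demo starts one healthy and one failing service and checks whether the
// WaitGroup is released after the failure, and again after a clean stop.
func demo() (afterFailure, afterStop bool) {
	var wg sync.WaitGroup
	stop := make(chan struct{})
	wg.Add(2)
	go runService(&wg, stop, nil)
	go runService(&wg, stop, errors.New("simulated failure"))

	afterFailure = waitReturns(&wg, 100*time.Millisecond)
	close(stop) // equivalent to a clean shutdown of the remaining service
	afterStop = waitReturns(&wg, time.Second)
	return afterFailure, afterStop
}

func main() {
	afterFailure, afterStop := demo()
	fmt.Println("Wait returned after one service failed:", afterFailure)
	fmt.Println("Wait returned after clean stop:", afterStop)
}
```

In this sketch the failed service's error is silently dropped, and Wait() is released only once the healthy service also stops.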

      We should have a means for sub-systems to notify the agent core that they have exited abnormally. The open question is what the agent core should do in response. Retrying a few times may be sensible, but what should happen if the error is fatal: go into the Waiting state? Notify the cluster monitor somehow?
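
One possible shape for such a notification mechanism is sketched below, purely as an illustration. Everything in it is an assumption rather than the agent's existing design: the Subsystem type, the errFatal sentinel, the exitReport struct, and the supervise and runAll helpers are all hypothetical. Each sub-system is retried a bounded number of times, fatal errors skip the retries, and the final outcome is sent to the core over a channel. Retry back-off and the core's reaction (Waiting state, notifying the cluster monitor) are deliberately left out.

```go
package main

import (
	"errors"
	"fmt"
)

// Subsystem is a hypothetical stand-in for a service's run function.
// It returns nil on a clean exit and an error on an abnormal one.
type Subsystem func() error

// errFatal is a hypothetical sentinel for errors that retrying cannot fix.
var errFatal = errors.New("fatal")

// exitReport is what a supervisor sends to the agent core when it gives up
// on (or cleanly finishes) a sub-system.
type exitReport struct {
	name string
	err  error
}

// supervise runs one sub-system, restarting it up to maxRetries times on
// non-fatal errors, then reports the final outcome on the reports channel.
func supervise(name string, run Subsystem, maxRetries int, reports chan<- exitReport) {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		err = run()
		if err == nil || errors.Is(err, errFatal) {
			break
		}
	}
	reports <- exitReport{name: name, err: err}
}

// runAll starts every sub-system under a supervisor and returns the
// sub-systems that ultimately exited abnormally, keyed by name.
func runAll(subs map[string]Subsystem, maxRetries int) map[string]error {
	reports := make(chan exitReport, len(subs))
	for name, run := range subs {
		go supervise(name, run, maxRetries, reports)
	}
	failed := make(map[string]error)
	for range subs {
		if r := <-reports; r.err != nil {
			failed[r.name] = r.err
		}
	}
	return failed
}

// demo wires up one transiently failing and one fatally failing sub-system.
func demo() (attempts int, failed map[string]error) {
	subs := map[string]Subsystem{
		"flaky": func() error {
			attempts++
			if attempts < 3 {
				return errors.New("transient error")
			}
			return nil
		},
		"broken": func() error {
			return fmt.Errorf("bad configuration: %w", errFatal)
		},
	}
	failed = runAll(subs, 3)
	return attempts, failed
}

func main() {
	attempts, failed := demo()
	fmt.Println("flaky attempts:", attempts)
	for name, err := range failed {
		fmt.Printf("%s exited abnormally: %v\n", name, err)
	}
}
```

With this shape the core learns which sub-system failed and why, so the decision about what to do next can be made in one place instead of inside each sub-system.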

          People

            Assignee: Unassigned
            Reporter: Marks Polakovs (marks.polakovs, Inactive)
            Votes: 0
            Watchers: 2
