Details
Task
Resolution: Unresolved
Major
Description
Currently, in agent/cmd/cbhealthagent/main.go, we shut down the agent when all services exit cleanly, and we abort start-up if a service fails to be created outright, but we don't handle the case where a service starts up and then fails. The failed service decrements the WaitGroup, but the count won't hit zero while other services are still running, so nothing happens (plus, main() will only terminate if it receives a SIGINT).
We should have a means for sub-systems to notify the agent core that they have exited abnormally. The open question is what the agent core should do in response: retrying a few times may be sensible, but what happens if the error is fatal? Should it go into a Waiting state, or notify the cluster monitor somehow?