Details
-
Bug
-
Resolution: Fixed
-
Major
-
None
-
10: Autoscaling, completion
-
1
Description
See Manchester legends 10CC's Dreadlock Holiday for context!
I found this through the chaos monkey that is GKE Autopilot. It turns out that if the Operator deployment gets rescheduled while it's waiting for a pod to be scheduled (made far more likely by cluster autoscaling!), we end up in a situation where:
- We try to get a list of callable members
- None are working, so the server returns an error when we call /pools/default to determine what's clustered and what isn't
- We spin in a retry loop of death forever
We need a mechanism to "off" (retire) uninitialized nodes. Because we version our resources, we should be able to say that a 2.2 pod without the pod.couchbase.com/initialized annotation can be "retired". This can happen transparently, without any special configuration. It also preserves any pre-2.2 pods, and any initialized ones, for log collection.
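
The retirement rule above could be sketched roughly as follows. This is an illustration of the decision only, not the Operator's actual code. The pod.couchbase.com/initialized annotation is the one named in this ticket. The version annotation key, the helper names, and the reading of "a 2.2 pod" as "2.2 or later" are assumptions made for the example.

```python
from typing import Mapping, Optional, Tuple

# Annotation named in this ticket.
INITIALIZED_ANNOTATION = "pod.couchbase.com/initialized"

# Hypothetical key: where the resource version is recorded on the pod is an
# assumption made for this sketch.
VERSION_ANNOTATION = "pod.couchbase.com/version"

# Assumption: "a 2.2 pod" is read as "created by version 2.2 or later".
MIN_RETIRABLE_VERSION = (2, 2)


def parse_major_minor(raw: Optional[str]) -> Optional[Tuple[int, int]]:
    """Return (major, minor) from a string like '2.2.0', or None if unparseable."""
    if not raw:
        return None
    parts = raw.split(".")
    if len(parts) < 2:
        return None
    try:
        return int(parts[0]), int(parts[1])
    except ValueError:
        return None


def can_retire(annotations: Mapping[str, str]) -> bool:
    """Decide whether a pod may be retired without any user configuration.

    Pods with an unknown or pre-2.2 version are always preserved, as are
    pods that carry the initialized annotation, so they stay available
    for log collection.
    """
    version = parse_major_minor(annotations.get(VERSION_ANNOTATION))
    if version is None or version < MIN_RETIRABLE_VERSION:
        return False
    return INITIALIZED_ANNOTATION not in annotations
```

Treating an unparseable or missing version as "preserve" errs on the side of keeping pods around, which matches the stated goal of not losing anything that might be needed for log collection.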