Couchbase Kubernetes / K8S-2071

Deadlock Holiday...


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.2.0
    • Component/s: operator
    • Labels:
    • Sprint:
      10: Autoscaling, completion
    • Story Points:
      1

      Description

      See Manchester legends 10CC's Dreadlock Holiday for context!

      I found this through the chaotic monkey that is GKE Autopilot: it turns out that if the Operator deployment gets rescheduled while it's waiting for a pod to get scheduled (magnified quite a lot by cluster autoscaling!) then we end up in a situation where:

      • We try to get a list of callable members
      • None are working, so the server throws a wobbly when we call /pools/default to determine what's clustered or not
      • We spin in a loop of death forever

      We need a mechanism to "off" uninitialized nodes.  We version our resources, so we should be able to say that a 2.2 pod without the pod.couchbase.com/initialized annotation can be "retired".  This can happen transparently, without any special configuration.  It also preserves any pre-2.2 pods, and any initialized ones, for log collection.
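      The retirement rule above can be sketched as a small predicate. This is a minimal sketch, not the Operator's actual code: the Pod type, field names, and canRetire function are hypothetical stand-ins; only the annotation key comes from the ticket.

```go
package main

import "fmt"

// initializedAnnotation marks pods that completed initialization.
const initializedAnnotation = "pod.couchbase.com/initialized"

// Pod is a hypothetical stand-in for the Operator's view of a
// Couchbase Server pod: the Operator version that created it,
// plus its annotations.
type Pod struct {
	Name        string
	Version     string // e.g. "2.1.0", "2.2.0"
	Annotations map[string]string
}

// canRetire returns true when a pod may be transparently deleted and
// recreated: it was created by a 2.2+ Operator but never received the
// initialized annotation. Pre-2.2 pods and initialized pods are kept
// so logs can still be collected from them.
func canRetire(p Pod) bool {
	if p.Version < "2.2.0" { // naive string compare, fine for this sketch
		return false
	}
	_, initialized := p.Annotations[initializedAnnotation]
	return !initialized
}

func main() {
	pods := []Pod{
		{Name: "cb-0000", Version: "2.1.0"},
		{Name: "cb-0001", Version: "2.2.0", Annotations: map[string]string{initializedAnnotation: "true"}},
		{Name: "cb-0002", Version: "2.2.0"}, // stuck: created but never initialized
	}
	for _, p := range pods {
		fmt.Printf("%s retire=%v\n", p.Name, canRetire(p))
	}
}
```

      Only cb-0002 is retired: it is new enough to be covered by the annotation scheme, yet the annotation never appeared, so deleting and recreating it is safe.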

        Attachments


          Activity

          Matt Carabine added a comment -

          We have seen an issue caused by a similar root cause; the error message was:

          {"level":"error","ts":1614223610.6046593,"logger":"cluster","msg":"Failed to update members","cluster":"2a400ef3-5dbf-4920-ab81-55362bc46bc9/cb","error":"context deadline exceeded: [Get https://cb-0006.cb.2a400ef3-5dbf-4920-ab81-55362bc46bc9.svc:18091/pools/default: uuid is unset]","stacktrace":"github.com/couchbase/couchbase-operator/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).runReconcile\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:360\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).Update\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:387\ngithub.com/couchbase/couchbase-operator/pkg/controller.(*CouchbaseClusterReconciler).Reconcile\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/pkg/controller/controller.go:86\ngithub.com/couchbase/couchbase-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:215\ngithub.com/couchbase/couchbase-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\ngithub.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/home/couchbase/jenkins
/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\ngithub.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\ngithub.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
          

          What happened was that the Operator created the pod but hadn't finished initializing/clustering it; once the Operator restarted, it tried to manage the pod as if initialization had completed, and so spin-locked forever.
          While I think this sequence of events differs slightly from this ticket, I believe the proposed fix is the same; would you agree?
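          The restart hazard described above comes down to the Operator having no durable record of whether initialization finished. A minimal sketch of the intended ordering, with hypothetical names (the real Operator would patch the pod through the Kubernetes API; a plain map stands in here):

```go
package main

import "fmt"

// initializedAnnotation marks pods that completed initialization.
const initializedAnnotation = "pod.couchbase.com/initialized"

// pod stands in for a pod's annotation map as seen through the API.
type pod map[string]string

// initialize records the annotation only AFTER clustering succeeds,
// so a crash or reschedule mid-initialization leaves no annotation.
func initialize(p pod) {
	// ... create pod, wait for scheduling, join it to the cluster ...
	// Only now, once everything succeeded, record the fact:
	p[initializedAnnotation] = "true"
}

// trusted reports whether a restarted Operator may manage the pod as
// a clustered member; an unannotated pod cannot be trusted.
func trusted(p pod) bool {
	_, ok := p[initializedAnnotation]
	return ok
}

func main() {
	p := pod{}
	fmt.Println("before init, trusted:", trusted(p)) // a restarted Operator would retire it
	initialize(p)
	fmt.Println("after init, trusted:", trusted(p))
}
```

          Writing the annotation last is the point: if the Operator dies anywhere before that write, the restarted Operator sees an unannotated pod and retires it instead of spin-locking against /pools/default.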

          Simon Murray added a comment -

          Yeah, looks like what I'd expect of 2.0; 2.1 operates somewhat differently.  Same problem, different symptom.

          Simon Murray added a comment -

          releasenote:

          On dynamic platforms (GKE Autopilot being one example), where deployments can be rescheduled to better utilize system resources, we detected a deadlock situation. The Operator could be terminated while a Couchbase Server pod was still being initialized; on restart the pod would look okay, but Couchbase Server would refuse to respond to the Operator. This has been remedied by annotating Couchbase Server pods once we know they have been fully initialized: pods known to be uninitialized can then be safely terminated and recreated.


            People

            Assignee:
            simon.murray Simon Murray
            Reporter:
            simon.murray Simon Murray
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

              Dates

              Created:
              Updated:
              Resolved:

                Gerrit Reviews

                There are no open Gerrit changes
