Couchbase Kubernetes / K8S-1736

Operator unable to recover custom TLS clusters if all pods go down


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: 2.0.3
    • Affects Version/s: 2.0.0, 2.0.1, 2.0.2
    • Component/s: operator
    • Labels: None
    • 1

    Description

      Summary
      When using custom TLS certificates, if all the Couchbase Server pods go down, the Operator is unable to recover the cluster.
      This violates a core principle of the Operator (hands-off management of the cluster).

      Steps to Reproduce

      1. Install the Operator in a Kubernetes cluster:

        ./bin/cbopcfg | kubectl create -f -
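
      You can confirm the Operator deployment is up before continuing; the deployment name below is an assumption based on the default cbopcfg output:

        # wait for the Operator deployment to become available (name assumed)
        kubectl rollout status deployment/couchbase-operator --timeout=120s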
        

      2. Generate custom TLS certificates for the cluster:

        git clone https://github.com/OpenVPN/easy-rsa
        cd easy-rsa/easyrsa3
        ./easyrsa init-pki
        # build-ca prompts for a new CA passphrase, so don't copy-paste the whole block in one go
        ./easyrsa build-ca
        # signing the server certificate prompts for that CA passphrase again
        ./easyrsa --subject-alt-name='DNS:*.cb-example,DNS:*.cb-example.default,DNS:*.cb-example.default.svc,DNS:cb-example-srv,DNS:cb-example-srv.default,DNS:cb-example-srv.default.svc,DNS:localhost' build-server-full couchbase-server nopass
        cp pki/private/couchbase-server.key pkey.key
        cp pki/issued/couchbase-server.crt chain.pem
        # round-trip the key through DER to convert it to a traditional RSA PEM key
        openssl rsa -in pkey.key -out pkey.key.der -outform DER
        openssl rsa -in pkey.key.der -inform DER -out pkey.key -outform PEM
        kubectl create secret generic couchbase-server-tls --from-file chain.pem --from-file pkey.key
        kubectl create secret generic couchbase-operator-tls --from-file pki/ca.crt
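
      Optionally, verify that the issued server certificate carries the expected SANs before loading it into the secret:

        # sanity-check the SAN list on the server certificate
        openssl x509 -in chain.pem -noout -text | grep -A1 'Subject Alternative Name'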
        

      3. Add the custom TLS certs to the CouchbaseCluster definition:

          networking:
            tls:
              static:
                serverSecret: couchbase-server-tls
                operatorSecret: couchbase-operator-tls
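
      A quick sanity check that both secrets exist and contain the expected keys (chain.pem and pkey.key in the server secret, ca.crt in the operator secret):

        kubectl get secret couchbase-server-tls couchbase-operator-tls
        kubectl describe secret couchbase-server-tls   # should list chain.pem and pkey.key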
        

      4. Add persistent volumes (a default mount) to the server definition, and set the number of servers to 1 (see the sketch below).
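
      A minimal sketch of the relevant spec fragment; the field names follow the 2.0 CRD, while the server group name, claim template name, and storage size are placeholders:

          servers:
          - name: all_services
            size: 1
            services:
            - data
            - index
            - query
            volumeMounts:
              default: couchbase
          volumeClaimTemplates:
          - metadata:
              name: couchbase
            spec:
              resources:
                requests:
                  storage: 1Gi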
      5. Create the cluster (for convenience I've also attached this as couchbase-cluster.yaml):

        kubectl create -f couchbase-cluster.yaml
        

      6. Wait for the cluster to become fully ready (i.e. all pods marked ready)
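
      One way to check this, using the pod name from the example cluster:

        kubectl wait --for=condition=Ready pod/cb-example-0000 --timeout=600s
        kubectl get couchbasecluster cb-example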
      7. Delete the Couchbase Server pod:

        kubectl delete po cb-example-0000
        

      8. Wait for the Operator to recover the pods
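
      The failure is easiest to observe by following the Operator logs while it retries reconciliation (deployment name assumed, as above):

        kubectl logs -f deployment/couchbase-operator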

      Expected Behavior
      The Operator recreates all of the pods: each pod has an associated PVC, so the Operator can simply remount the existing volume onto a new pod.
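
      That the PVC outlives the pod can be confirmed directly; after the pod is deleted, its claim should still be listed as Bound:

        kubectl get pvc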

      Actual Behavior
      The Operator ends up in a spin loop with the following error:

      {"level":"error","ts":1604321398.8870301,"logger":"cluster","msg":"Reconciliation failed","cluster":"default/cb-example","error":"[Get http://cb-example-0000.cb-example.default.svc:8091/settings/clientCertAuth: dial tcp: lookup cb-example-0000.cb-example.default.svc on 10.100.0.10:53: no such host]","stacktrace":"github.com/couchbase/couchbase-operator/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).runReconcile\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:370\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).Update\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:387\ngithub.com/couchbase/couchbase-operator/pkg/controller.(*CouchbaseClusterReconciler).Reconcile\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/pkg/controller/controller.go:86\ngithub.com/couchbase/couchbase-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:215\ngithub.com/couchbase/couchbase-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\ngithub.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\ngithub.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\ngithub.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
      

      In the worst case, if the Operator also restarts (as may happen during a worker node failure), a different code path is hit, but recovery still fails and the cluster is marked as Failed:

      {"level":"error","ts":1604321541.0505052,"logger":"cluster","msg":"Cluster setup failed","cluster":"default/cb-example","error":"","stacktrace":"github.com/couchbase/couchbase-operator/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/couchbase/couchbase-operator/pkg/cluster.New\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:143\ngithub.com/couchbase/couchbase-operator/pkg/controller.(*CouchbaseClusterReconciler).Reconcile\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/pkg/controller/controller.go:71\ngithub.com/couchbase/couchbase-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:215\ngithub.com/couchbase/couchbase-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\ngithub.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\ngithub.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\ngithub.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
      {"level":"error","ts":1604321541.093771,"logger":"controller","msg":"Failed to create Couchbase cluster","cluster":{"namespace":"default","name":"cb-example"},"error":"","stacktrace":"github.com/couchbase/couchbase-operator/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/couchbase/couchbase-operator/pkg/controller.(*CouchbaseClusterReconciler).Reconcile\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/pkg/controller/controller.go:73\ngithub.com/couchbase/couchbase-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:215\ngithub.com/couchbase/couchbase-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\ngithub.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\ngithub.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\ngithub.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
       
      ...
       
      {"level":"error","ts":1604321557.1527011,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"couchbase-controller","request":"default/cb-example","error":"unexpected cluster phase: Failed","stacktrace":"github.com/couchbase/couchbase-operator/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/couchbase/couchbase-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217\ngithub.com/couchbase/couchbase-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\ngithub.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\ngithub.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\ngithub.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
      

      Additional Notes

      This does not happen without custom TLS certificates, which implies the issue lies in the code path used to set up TLS.
      I don't believe 1.x had this issue either.
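
      The DNS failure in the log above can also be reproduced outside the Operator, confirming that the per-pod DNS record disappears along with the pod (the busybox image choice here is arbitrary):

        kubectl run -it --rm dns-check --image=busybox:1.28 --restart=Never -- \
          nslookup cb-example-0000.cb-example.default.svc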

      Workaround

      A workaround is available, but it is not trivial to implement.
      Essentially, you need to manually create a pod that exactly matches the pod the Operator would have created for one of the cluster's members. Once that first pod is available, the Operator can manage the cluster again.
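
      A rough sketch of the idea, assuming a known-good pod spec was captured from the cluster before the failure (it must match what the Operator would generate, including labels, volumes and TLS mounts):

        # captured from a healthy cluster *before* the failure
        kubectl get pod cb-example-0000 -o yaml > cb-example-0000.yaml
        # strip status, resourceVersion, uid and similar fields, then recreate the pod by hand
        kubectl create -f cb-example-0000.yaml
        # once this pod is running, the Operator can resume managing the cluster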

      Attachments

        couchbase-cluster.yaml
      People

        Assignee: Simon Murray
        Reporter: Matt Carabine (Inactive)
