Couchbase Kubernetes / K8S-1783

Unable to run Operator 2.0.x with 2.1.x CRD + Admission Controller


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.1.0
    • Component/s: operator
    • Sprint: 47: Documentation 2
    • Story Points: 1

    Description

      Summary
      After upgrading the cluster-wide CRD to 2.1 and switching the admission controller to the 2.1 version, existing 2.0.x Operators stop working.
      This might not be a bug, but it is being raised as one at this stage in the absence of any clarification on how the upgrade is supposed to work.

      Steps to Reproduce

      1. Install the 2.0.2 Operator, CRD and Admission Controller
      2. Create a cluster with at least one volumeClaimTemplate (see couchbase-cluster.yaml)
      3. Download the 2.1.0-230 package
      4. Replace the CRD with the 2.1.0 version and deploy the 2.1.0 Admission Controller in place of the 2.0.2 one
      5. Validate that the 2.0.2 Operator can still manage its cluster

      Expected Behavior
      As the CRD is still v2, there should be no breaking changes between 2.0.x and 2.1.x.
      The 2.0.x Operator should continue to be able to correctly manage the cluster.

      Actual Behavior
      The 2.0.x Operator stops being able to manage the cluster, throwing the following error:

      {"level":"error","ts":1605666608.7187085,"logger":"cluster","msg":"Status update failed","cluster":"default/cb-example","error":"CouchbaseCluster.couchbase.com \"cb-example\" is invalid: []: Invalid value: map[string]interface {}{\"apiVersion\":\"couchbase.com/v2\", \"kind\":\"CouchbaseCluster\", \"metadata\":map[string]interface {}{\"creationTimestamp\":\"2020-11-18T02:15:46Z\", \"generation\":114, \"name\":\"cb-example\", \"namespace\":\"default\", \"resourceVersion\":\"3636\", \"selfLink\":\"/apis/couchbase.com/v2/namespaces/default/couchbaseclusters/cb-example\", \"uid\":\"faa57471-074a-4872-a9b1-65db14972cf6\"}, \"spec\":map[string]interface {}{\"backup\":map[string]interface {}{}, \"buckets\":map[string]interface {}{\"managed\":true}, \"cluster\":map[string]interface {}{\"analyticsServiceMemoryQuota\":\"1Gi\", \"autoCompaction\":map[string]interface {}{\"databaseFragmentationThreshold\":map[string]interface {}{\"percent\":30}, \"timeWindow\":map[string]interface {}{}, \"tombstonePurgeInterval\":\"72h0m0s\", \"viewFragmentationThreshold\":map[string]interface {}{\"percent\":30}}, \"autoFailoverMaxCount\":3, \"autoFailoverOnDataDiskIssuesTimePeriod\":\"2m0s\", \"autoFailoverTimeout\":\"2m0s\", \"dataServiceMemoryQuota\":\"256Mi\", \"eventingServiceMemoryQuota\":\"256Mi\", \"indexServiceMemoryQuota\":\"256Mi\", \"indexStorageSetting\":\"memory_optimized\", \"searchServiceMemoryQuota\":\"256Mi\"}, \"image\":\"couchbase/server:enterprise-6.5.1\", \"logging\":map[string]interface {}{}, \"networking\":map[string]interface {}{\"adminConsoleServiceType\":\"NodePort\", \"exposedFeatureServiceType\":\"NodePort\"}, \"security\":map[string]interface {}{\"adminSecret\":\"cb-example-auth\", \"rbac\":map[string]interface {}{}}, \"securityContext\":map[string]interface {}{\"fsGroup\":1000}, \"servers\":[]interface {}{map[string]interface {}{\"name\":\"all_services\", \"resources\":map[string]interface {}{}, \"services\":[]interface {}{\"data\", \"analytics\"}, \"size\":3, \"volumeMounts\":map[string]interface {}{\"default\":\"couchbase\"}}}, \"softwareUpdateNotifications\":false, \"volumeClaimTemplates\":[]interface {}{map[string]interface {}{\"metadata\":map[string]interface {}{\"creationTimestamp\":interface {}(nil), \"name\":\"couchbase\"}, \"spec\":map[string]interface {}{\"dataSource\":interface {}(nil), \"resources\":map[string]interface {}{\"requests\":map[string]interface {}{\"storage\":\"10Gi\"}}}, \"status\":map[string]interface {}{}}}, \"xdcr\":map[string]interface {}{}}, \"status\":map[string]interface {}{\"buckets\":[]interface {}{map[string]interface {}{\"compressionMode\":\"passive\", \"conflictResolution\":\"seqno\", \"enableFlush\":false, \"enableIndexReplica\":false, \"evictionPolicy\":\"valueOnly\", \"ioPriority\":\"low\", \"memoryQuota\":100, \"name\":\"default\", \"password\":\"\", \"replicas\":1, \"type\":\"couchbase\"}}, \"clusterId\":\"43f0a413c6f3170936b889c3f9447594\", \"conditions\":[]interface {}{map[string]interface {}{\"lastTransitionTime\":\"2020-11-18T02:30:08Z\", \"lastUpdateTime\":\"2020-11-18T02:30:08Z\", \"message\":\"Data is equally distributed across all nodes in the cluster\", \"reason\":\"Balanced\", \"status\":\"True\", \"type\":\"Balanced\"}, map[string]interface {}{\"lastTransitionTime\":\"2020-11-18T02:19:32Z\", \"lastUpdateTime\":\"2020-11-18T02:19:32Z\", \"reason\":\"Available\", \"status\":\"True\", \"type\":\"Available\"}}, \"currentVersion\":\"enterprise-6.5.1\", \"members\":map[string]interface {}{\"ready\":[]interface {}{\"cb-example-0000\", 
\"cb-example-0001\", \"cb-example-0002\"}}, \"phase\":\"Running\", \"size\":3}}: validation failure list:\nspec.volumeClaimTemplates.spec.dataSource in body must be of type object: \"null\"","stacktrace":"github.com/couchbase/couchbase-operator/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).runReconcile.func1\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:327\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).runReconcile\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:373\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).Update\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:387\ngithub.com/couchbase/couchbase-operator/pkg/controller.(*CouchbaseClusterReconciler).Reconcile\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/pkg/controller/controller.go:86\ngithub.com/couchbase/couchbase-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:215\ngithub.com/couchbase/couchbase-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\ngithub.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\ngithub.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\ngithub.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until\n\t/home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
      

      Attachments


        Activity

          tommie Tommie McAfee added a comment -

          Yeah, we'll need to test what else is out there; I'll chat with Arun about this.

          Meanwhile, I've spent some time looking into the root cause of this, and it seems that 2.0 was built against the k8s 1.13 API and 2.1 against 1.17. From a side-by-side comparison I'm seeing the PersistentVolumeClaim template object being marshalled differently. In 1.13 we have dataSource being marshalled as a null value:

          pvc.couchbase.com/spec: '{"resources":{"requests":{"storage":"10Gi"}},"storageClassName":"default", "dataSource": "null" }' 

          whereas in 1.17 this is omitted:

          pvc.couchbase.com/spec: '{"resources":{"requests":{"storage":"10Gi"}},"storageClassName":"default"}' 

          This same logic is used when the object is marshalled and sent to etcd. The difference is that the 2.1 CRD spec explicitly defines dataSource as an object (if provided), hence the error:

          validation failure list:\nspec.volumeClaimTemplates.spec.dataSource in body must be of type object: \"null\" 

          I was able to reproduce this in 2.0 itself by adding dataSource to the 2.0 CRD validation.

          I wasn't able to build 2.0 against a newer k8s API to truly verify this is the root cause, though; I'll need Simon Murray to fill me in further.

          Anyway, the root cause is with 2.0.2 and how it creates a string where an object should be; unless we get another shot at patching 2.0, I think we'll have to retroactively fix whatever invalid attributes there may be via the admission controller.
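
          To illustrate the marshalling difference described in the comment above, here is a minimal, self-contained Go sketch; the structs are illustrative stand-ins, not the actual Kubernetes PersistentVolumeClaimSpec type. Without omitempty, a nil optional field is serialised as an explicit JSON null, which the stricter 2.1 schema (dataSource must be an object) then rejects.

          package main

          import (
              "encoding/json"
              "fmt"
          )

          // Illustrative only: mirrors the behaviour described for the 1.13-era libs,
          // where the optional field has no omitempty tag.
          type specOld struct {
              StorageClassName string      `json:"storageClassName"`
              DataSource       interface{} `json:"dataSource"`
          }

          // Mirrors the behaviour described for the 1.17-era libs: with omitempty,
          // a nil value is dropped from the output entirely.
          type specNew struct {
              StorageClassName string      `json:"storageClassName"`
              DataSource       interface{} `json:"dataSource,omitempty"`
          }

          func main() {
              oldJSON, _ := json.Marshal(specOld{StorageClassName: "default"})
              newJSON, _ := json.Marshal(specNew{StorageClassName: "default"})
              fmt.Println(string(oldJSON)) // {"storageClassName":"default","dataSource":null}
              fmt.Println(string(newJSON)) // {"storageClassName":"default"}
          }

          The first output is what the 2.1 CRD validation rejects with "must be of type object: \"null\""; the second passes because the field is simply absent.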


          ingenthr Matt Ingenthron added a comment -

          Nice work, Tommie McAfee.

          One good thing is that Simon Murray just recently added automation for proper validation of the CRD against all K8S versions, so we shouldn't have more bugs like this.

          simon.murray Simon Murray added a comment -

          There are actually 3 issues here:

          • Where there was no validation, there is now some, and the old libs didn't have an omitempty in there. I've just pruned this from the CRDs.
          • DNS got messed up on an upgrade, which is actually far more important, as the cluster completely failed.
          • This still isn't automatically tested... until now!

          All of the above has been fixed up. However, there is one caveat to be aware of: the TLS requirements have been updated, so all existing clusters should have their SANs updated and rotated before deploying the new DAC... or you can just ignore validation by removing the Secrets privilege.
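
          As a purely illustrative aside (not part of the Operator or the DAC), here is a short Go sketch for listing the DNS SANs an existing cluster certificate currently carries, which can help decide whether it needs updating and rotating before the new DAC is deployed; the certificate path is an assumption for the example.

          package main

          import (
              "crypto/x509"
              "encoding/pem"
              "fmt"
              "os"
          )

          func main() {
              // Illustrative path; point this at the server certificate actually in use.
              raw, err := os.ReadFile("couchbase-server-tls/chain.pem")
              if err != nil {
                  panic(err)
              }
              block, _ := pem.Decode(raw) // first certificate in the chain
              if block == nil {
                  panic("no PEM block found")
              }
              cert, err := x509.ParseCertificate(block.Bytes)
              if err != nil {
                  panic(err)
              }
              fmt.Println("DNS SANs:", cert.DNSNames)
          }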


          matt.carabine Matt Carabine added a comment -

          Simon Murray, with regard to the automatic testing, is this testing all possible options within the CRD for 'compatibility'?

          Also, what's the golden build number for me to re-run my upgrade tests on 2.1?
          Also (one more thing :columbo:), what's the TLS requirement change? I know there was one between 1.x -> 2.0; have any specific SANs been added?

          simon.murray Simon Murray added a comment - edited

          I care not about what comes from build; ask them.

          http://review.couchbase.org/c/couchbase-operator/+/140233 is the docs thing. Unsurprisingly, it's support for GCCCP. This should be the last additive change; things can be removed in future without impact.

          Testing isn't matrix-style, but it does cover TLS and volumes; it should be fairly trivial to add more features, PROVIDED it runs in a kind cluster.

           


          People

            Assignee: simon.murray Simon Murray
            Reporter: matt.carabine Matt Carabine
            Votes: 0
            Watchers: 9

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes
