Couchbase Kubernetes / K8S-1734

Add NodeSelector to DAC helm specification


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.2.0
    • Component/s: documentation, kubernetes
    • Labels: None
    • Sprint: 14: Helm/Testing/bugfixing
    • Story Points: 1

    Description

      Currently only the Operator has a NodeSelector specified in values.yaml; the DAC (dynamic admission controller) deployment does not.
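
      For illustration, a rough sketch of what the requested values.yaml support could look like once both deployments accept a node selector. The key names follow the couchbase-operator chart layout but are assumptions here and should be checked against the installed chart version:

        # Hypothetical values.yaml excerpt (key names assumed, not taken from the chart docs)
        couchbaseOperator:
          nodeSelector:
            tier: app        # schedule the Operator onto nodes labelled tier=app
        admissionController:
          nodeSelector:
            tier: app        # the DAC deployment would honour the same selector once supported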


          Activity


            ingenthr Matt Ingenthron added a comment - Done by contributor.
            tin.tran Tin Tran added a comment - edited

            Hi Patrick Stephens

            Since Tommie is out for some time, can you help take a look at this issue?
            After the fix Tommie pushed for this issue last week, the customer is hitting this error when deploying the Operator:

            Message: secrets "cb-example" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>
            Reason: ErrorEncountered
            Status: True

            Their values.yaml: values-amex-custom.yaml


            patrick.stephens Patrick Stephens (Inactive) added a comment - edited

            This is an RBAC issue. I encountered a similar one during development of log forwarding, when the secret was created by another user (not the operator, but as part of the test framework) and we were trying to take over control of it (so that it gets deleted as managed by the operator when you delete the cluster). To do that we need permission to `delete` a `Secret`, which is not part of our usual set, and we do not want to expand that if we can help it.

            If the cluster role used by the operator (or the DAC - it is not clear which is generating the error here) is extended to include `delete`, then it should work around this issue.

            For 2.2 we have the following: the operator can `get`, `create`, `update`, `list` and `watch` Secrets, and the DAC can just `get` them if it is asked to validate secrets. You can edit the role directly to add the `delete` verb for `secrets` resources, as it will not be reconciled.
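
            For illustration only, the edited rule for `secrets` could look roughly like this after adding the `delete` verb (the role name is a placeholder - edit whichever role is actually deployed for the operator, e.g. `kubectl edit role <operator-role-name>`):

              # Hypothetical excerpt of the operator Role after the manual edit;
              # only the verbs list changes, gaining "delete".
              rules:
                - apiGroups: [""]
                  resources: ["secrets"]
                  verbs: ["get", "list", "watch", "create", "update", "delete"]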

            I will investigate why it is happening. Note that the admission controller that triggers this error was only seen on OpenShift during testing - are they using OpenShift, or have they enabled any extra admission controllers on a non-OCP platform?


            patrick.stephens Patrick Stephens (Inactive) added a comment -

            Questions:

            1. Are they on OpenShift?
            2. Can you confirm whether this is a fresh install or an upgrade? Installing vanilla 2.2 and then applying the recently delivered node selector update is effectively an upgrade. Can they confirm it appears on a completely fresh install?

            patrick.stephens Patrick Stephens (Inactive) added a comment - edited

            This actually looks like K8S-2205, but the fix for that is included in 2.2 and QE confirm its test case is still passing on OCP. It may be related to the recent update for node selector usage, so I will investigate on an OCP cluster.

            You can see the references to a lot of similar issues on OCP here: https://github.com/spotahome/redis-operator/issues/98 

            It is still an RBAC issue, but we need to confirm which one. Ideally we can get a full cbopinfo so we can see what is happening on OCP as well.

            It may be that adding the `delete` verb to the `couchbaseclusters` CRD group will solve it as well. Adding all verbs and making sure the `/finalizers` subresource is covered will likely do it, but we need to constrain this to the minimum acceptable for a deliverable solution.

            tin.tran Tin Tran added a comment -

            Hi Patrick Stephens
            The customer has uploaded the cbopinfo here

            s3://cb-customers-secure/american-express-az/40311/2021-07-23/cbopinfo-20210723t102706+0530.tar.gz

            Thank you Patrick


            patrick.stephens Patrick Stephens (Inactive) added a comment - edited

            Thanks. It looks like the helm chart creates its RBAC directly rather than relying on the `cbopcfg` approach: https://github.com/couchbase-partners/helm-charts/blob/master/charts/couchbase-operator/templates/operator-deployment.yaml

            This means it has missed the `finalizers` addition from K8S-2205, which is required for the extra admission controller used by OCP: https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#ownerreferencespermissionenforcement

            This admission controller also protects the access to metadata.ownerReferences[x].blockOwnerDeletion of an object, so that only users with "update" permission to the finalizers subresource of the referenced owner can change it.

            This is borne out by a quick test using the latest version (2.2.005):

            helm upgrade --install --debug --wait test couchbase/couchbase-operator --set install.couchbaseCluster=false

            kubectl describe role test-couchbase-operator

            ...

              services                                   []                 []              [get list watch create update delete patch]

              couchbaseclusters.couchbase.com            []                 []              [get list watch update]

              events                                     []                 []              [list create update]

            ...

            After updating the RBAC in the helm chart and repeating the test, we get:

              services                                    []                 []              [get list watch create update delete patch]

              couchbaseclusters.couchbase.com/finalizers  []                 []              [get list watch update]

              couchbaseclusters.couchbase.com             []                 []              [get list watch update]

              events                                      []                 []              [list create update]

            This should resolve the issue. As a workaround, a manual edit of the role to add the `couchbaseclusters.couchbase.com/finalizers` resource permissions would also do it.
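
            As a sketch of that manual workaround (illustrative only - the exact layout depends on the deployed chart), the extra rule added to the operator's Role would look roughly like this, granting `update` on the finalizers subresource that the OwnerReferencesPermissionEnforcement admission controller checks:

              # Hypothetical rule added via `kubectl edit role <operator-role-name>`;
              # the verbs mirror the couchbaseclusters rule in the describe output above.
              - apiGroups: ["couchbase.com"]
                resources: ["couchbaseclusters/finalizers"]
                verbs: ["get", "list", "watch", "update"]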


            patrick.stephens Patrick Stephens (Inactive) added a comment - edited

            I can replicate the problem on KIND with the existing chart version, using the configuration shown in K8S-2322 to add the admission controller that OCP is using.
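
            For reference, a minimal sketch of a KIND cluster configuration that enables the same admission controller (an illustration only, not the exact configuration from K8S-2322):

              # kind-config.yaml (illustrative); create the cluster with
              # `kind create cluster --config kind-config.yaml`.
              kind: Cluster
              apiVersion: kind.x-k8s.io/v1alpha4
              nodes:
                - role: control-plane
                  kubeadmConfigPatches:
                    - |
                      kind: ClusterConfiguration
                      apiServer:
                        extraArgs:
                          enable-admission-plugins: NodeRestriction,OwnerReferencesPermissionEnforcement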

            I can confirm that the change resolves the issue and allows a cluster to deploy. This includes the node selector option in the values file provided: the only tweak I had to make to the values file was the server image name (not using RHCC, so it should be the default of couchbase/server:6.6.2), and the storage class was commented out so it used the default.

            I will ask QE (Roo Thorp) to verify the fix on OCP as well: https://github.com/couchbase-partners/helm-charts/pull/58 

            roo.thorp Roo Thorp added a comment - edited

            I've tested Patrick's change on OCP - labelling one of the nodes as the customer does (tier=app), and then helm installing using their values.yaml from above (I changed the storage class so that the cluster could install properly, but that's the only change I made). I can see that the operator & DAC install on the labelled node - the couchbase cluster pods are spinning up on other nodes, which I think is expected as the nodeSelector was just for the DAC and operator.


            patrick.stephens Patrick Stephens (Inactive) added a comment - edited

            The updated chart should now be available as version 2.2.006; make sure to run `helm repo update` to pick it up as the latest version.

            https://github.com/couchbase-partners/helm-charts/releases/tag/couchbase-operator-2.2.006 



            patrick.stephens Patrick Stephens (Inactive) added a comment - Confirmed with Tin Tran that this is all ok now.


            ingenthr Matt Ingenthron added a comment - Done in 2.2, removing the label.

            People

              Assignee: patrick.stephens Patrick Stephens (Inactive)
              Reporter: tin.tran Tin Tran
              Votes: 0
              Watchers: 5
