Couchbase Kubernetes / K8S-2274

Test Istio in strict mode with post-creation updates

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Done
    • 2.2.0
    • 2.2.1
    • operator, testing
    • None
    • 1

    Description

      Istio needs to be tested in STRICT mode to catch the issues seen during a customer rollout.

      https://github.com/patrick-stephens/couchbase-gitops/blob/d30ea4a6f97555a12a1a82c6151b2442fe1930bb/istio-dac-permissive.sh

      Deploy Istio, set up namespace injection and peer authentication rules for STRICT (apart from DAC and Prometheus exporter).
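
      A minimal sketch of the namespace set-up described in this step, assuming a hypothetical namespace name `couchbase-test`. The `istio-injection` label and the `PeerAuthentication` resource are standard Istio; the exceptions for the DAC and the Prometheus exporter are not shown, because their workload labels are not given in this ticket.

      ```yaml
      # Namespace with automatic Istio sidecar injection enabled.
      # "couchbase-test" is a placeholder name, not taken from the ticket.
      apiVersion: v1
      kind: Namespace
      metadata:
        name: couchbase-test
        labels:
          istio-injection: enabled
      ---
      # Namespace-wide policy (no selector): require mTLS for every workload
      # in the namespace.
      apiVersion: security.istio.io/v1beta1
      kind: PeerAuthentication
      metadata:
        name: default
        namespace: couchbase-test
      spec:
        mtls:
          mode: STRICT
      ```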

      Deploy helm chart with `--set cluster.networking.networkPlatform=Istio`.

      Once the cluster is up, scale it from 3 to 6 pods and confirm that all pods come up correctly.
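
      The Helm settings for the deployment and the scale-up could be kept in a values file, sketched below. `cluster.networking.networkPlatform` comes from the description above; `cluster.servers.default.size` is an assumption about the chart's values layout and should be checked against the chart version in use.

      ```yaml
      # values-istio.yaml (hypothetical file name)
      cluster:
        networking:
          # Same setting as --set cluster.networking.networkPlatform=Istio
          networkPlatform: Istio
        servers:
          default:
            # Assumed key for the server group size; deploy with 3,
            # then change to 6 and upgrade the release to scale up.
            size: 6
      ```

      Applying the file with `helm upgrade` and the `-f` flag against the existing release should trigger the scale-up, after which the new pods can be checked with `kubectl get pods`.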

       

        Activity

          We should see failures because there is an Istio issue; I would expect them to show up in the resize test cases. If they do not, we need to expand our testing until it picks up these failures.

          Separately, we may see failures in a subset of tests due to the Istio configuration, e.g. remote XDCR, Sync Gateway and SDK tests. These are not real defects, but the affected tests will need extra cluster configuration or must be skipped on Istio.

          patrick.stephens Patrick Stephens (Inactive) added the comment above.
          prateek.kumar Prateek Kumar (Inactive) added a comment (edited):

          A regression run was triggered against build 2.2.1-114 with Istio mTLS in STRICT mode.

          Couchbase Server Versions: 6.6.2, 6.5.2

          K8s version: 1.18 (GKE, EKS, AKS)

          GKE:

          http://qa.sc.couchbase.com/view/Cloud/job/k8s-cbop-gke-pipeline-2.2.x/63/testReport/

          http://qa.sc.couchbase.com/view/Cloud/job/k8s-cbop-gke-pipeline-2.2.x/64/testReport/

          EKS: 

          http://qa.sc.couchbase.com/view/Cloud/job/k8s-cbop-eks-pipeline-2.2.x/35/testReport/

          http://qa.sc.couchbase.com/view/Cloud/job/k8s-cbop-eks-pipeline-2.2.x/34/testReport/

          AKS:

          http://qa.sc.couchbase.com/view/Cloud/job/k8s-cbop-aks-pipeline-2.2.x/35/testReport/

          http://qa.sc.couchbase.com/view/Cloud/job/k8s-cbop-aks-pipeline-2.2.x/34/testReport/

           

          Each of the platforms above averaged roughly 63 failures.

          The failures fall into these categories: XDCR, Backup TLS, Sync Gateway Remote, TLS, AutoScaling, and a couple of Persistent Volume Resize test cases. The first four are expected, and release notes are in place stating that we do not support them over Istio mTLS STRICT.

          The Auto Scaling tests are being re-triggered because they have failed on all Kubernetes platforms with Istio STRICT, which could indicate a bug. More information will follow once the run is complete.


          The autoscaling failure is specific to our test framework, because we create a mock metrics service that communicates with the API server over TLS.
          I suspect we just need to exclude the metrics deployment from STRICT mode: https://github.com/couchbase/couchbase-operator/blob/master/test/e2e/e2espec/autoscale.go#L182

          tommie Tommie McAfee added the comment above.

          Unexpected failures were triaged and rerun. All tests except the autoscaling tests passed.

          As Tommie McAfee mentioned, we need to exclude the custom metrics deployment from STRICT mode. This was done by creating a new PeerAuthentication rule:

          apiVersion: "security.istio.io/v1beta1"
          kind: "PeerAuthentication"
          metadata:  
            name: "peer-authentication-custom-metrics"  
            namespace: "istio-system"
          spec:
            selector:
              matchLabels:
                app: "custom-metrics-apiserver"
            mtls:
              mode: PERMISSIVE

          The rule was created; however, the tests still did not pass with STRICT mode applied to the other workloads.

          Status of the pods:

          Prateeks-MacBook-Pro:Downloads prateekkumar$ kubectl get pods -n test-t5mqp
          NAME                                        READY   STATUS    RESTARTS   AGE
          couchbase-operator-6d8f5d7445-grzm6         2/2     Running   1          81s
          custom-metrics-apiserver-59fb9797c4-7x9m9   2/2     Running   2          28s
          test-couchbase-6brcx-0000                   2/2     Running   0          45s 

          Logs have been shared with Tommie, and once he confirms that the metric deployment is not running with STRICT mode, we will investigate further.

          Upon completion of this exercise, QE will sign off on Istio STRICT testing with build 2.2.1-117.

          P.S.: The failure categories that are expected over mTLS STRICT are XDCR, Sync Gateway Remote, Backup TLS, and TLS.

          prateek.kumar Prateek Kumar (Inactive) added the comment above.

          As Patrick Stephens mentioned, in this scenario (autoscaling) the rule needs to be applied to the particular namespace. With that change, the autoscaling tests pass.
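
          A sketch of what the namespace-scoped rule might look like, assuming the "particular namespace" is the test namespace that hosts the custom-metrics-apiserver pod (`test-t5mqp` in the earlier pod listing; the name appears to differ per run). The exact manifest that was applied is not recorded in this ticket. A PeerAuthentication policy with a workload selector is generally understood to match only workloads in its own namespace, which would be consistent with the earlier rule in `istio-system` having no effect.

          ```yaml
          apiVersion: security.istio.io/v1beta1
          kind: PeerAuthentication
          metadata:
            name: peer-authentication-custom-metrics
            # Assumption: the namespace in which the custom-metrics-apiserver
            # pod runs, rather than istio-system.
            namespace: test-t5mqp
          spec:
            selector:
              matchLabels:
                app: custom-metrics-apiserver
            mtls:
              # Accept both plaintext and mTLS traffic for this workload only;
              # the other workloads in the namespace stay under STRICT.
              mode: PERMISSIVE
          ```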

          QE has completed its testing and signs off on build 2.2.1-117 for ISTIO STRICT.

          This issue is being marked as 'Resolved' and will be closed once the 2.2.1 GA build has been identified.

          prateek.kumar Prateek Kumar (Inactive) added the comment above.

          People

            prateek.kumar Prateek Kumar (Inactive)
            patrick.stephens Patrick Stephens (Inactive)