Couchbase Kubernetes / K8S-2092

[TestBackupPVCResize] Backup event did not complete after PVC was resized.


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.0
    • Fix Version/s: 2.2.0
    • Component/s: operator
    • Labels:
    • Sprint:
      12: Logging, Indexing, PE/SB
    • Story Points:
      1

      Description

      Job: http://qa.sc.couchbase.com/job/roo-test/194/console 

      Couchbase Server: couchbase/server:6.6.0

      Istio Enabled with MTLS_PERMISSIVE

      Storage class: standard-rwo

      Gerrit review for backup image: http://review.couchbase.org/c/couchbase-operator-backup/+/148816 

      Stack Trace: 

           --- FAIL: TestOperator/TestBackupPVCResize (1034.95s)
               crd_util.go:42: creating couchbase cluster: test-couchbase-zm2d9
               util.go:1170: context deadline exceeded: failed to wait for event BackupCompleted/Backup `full-only` completed
               util.go:1171: goroutine 802 [running]:
                   runtime/debug.Stack(0x1f0a905, 0xc0004b8480, 0xc0002b2000)
                   	/jenkins/workspace/roo-test/go/src/runtime/debug/stack.go:24 +0xab
                   github.com/couchbase/couchbase-operator/test/e2e/e2eutil.Die(0xc0009f6200, 0x29c66c0, 0xc000752a00)
                   	/jenkins/workspace/roo-test/test/e2e/e2eutil/util.go:1166 +0x34
                   github.com/couchbase/couchbase-operator/test/e2e/e2eutil.mustWaitForResourceEventFromNow(0xc0009f6200, 0xc0004bca90, 0x29d32e0, 0xc0004b8480, 0xc0002b2000, 0x8bb2c97000)
                   	/jenkins/workspace/roo-test/test/e2e/e2eutil/wait_util.go:289 +0x155
                   github.com/couchbase/couchbase-operator/test/e2e/e2eutil.MustWaitForBackupEvent(...)
                   	/jenkins/workspace/roo-test/test/e2e/e2eutil/wait_util.go:592
                   github.com/couchbase/couchbase-operator/test/e2e.testBackupPVCResize(0xc0009f6200, 0xc000612700)
                   	/jenkins/workspace/roo-test/test/e2e/backup_test.go:1037 +0xdc5
                   github.com/couchbase/couchbase-operator/test/e2e.TestBackupPVCResize(0xc0009f6200)
                   	/jenkins/workspace/roo-test/test/e2e/backup_test.go:1058 +0x3e
                   testing.tRunner(0xc0009f6200, 0x2703cf8)
                   	/jenkins/workspace/roo-test/go/src/testing/testing.go:909 +0x19a
                   created by testing.(*T).Run
                   	/jenkins/workspace/roo-test/go/src/testing/testing.go:960 +0x652              

      Event: https://github.com/couchbase/couchbase-operator/blob/master/test/e2e/backup_test.go#L1037

      cbbackupmgr did not log any error message for the failure, which might suggest this is a testware issue and we simply have to increase the timeout period for the event.

      2021-03-18 18:07:35,870 - root - INFO - Exiting Script, success. Output: Backed up to 2021-03-18T18_07_33.169097231Z
      Copied all data in 2.530934555s (Avg. 28.14KB/Sec)
      1 buckets, 100 items / 71.22KB
      Backup successfully completed
      Backed up bucket "default" succeeded
      Mutations backed up: 100, Mutations failed to backup: 0
      Deletions backed up: 0, Deletions failed to backup: 0
      Skipped due to purge number or conflict resolution: Mutations: 0 Deletions: 0 


          Activity

          simon.murray Simon Murray added a comment -

          Initial backup created...

          events.py TestBackupPVCResize/test-5w28c/cbopinfo-20210318T111738-0700/namespace/test-5w28c/cronjob/full-only-full/events.yaml 
          2021-03-18T18:07:02Z Created job full-only-full-1616090820

          And the pod created...

          events.py TestBackupPVCResize/test-5w28c/cbopinfo-20210318T111738-0700/namespace/test-5w28c/job/full-only-full-1616090820/events.yaml 
          2021-03-18T18:07:02Z Created pod: full-only-full-1616090820-kj8j5

          That was 18:07... the cron job was updated

          [Tue 23 Mar 16:00:04 GMT 2021] simon@symphony ~/Downloads/temp cat TestBackupPVCResize/test-5w28c/cbopinfo-20210318T111738-0700/namespace/test-5w28c/cronjob/full-only-full/full-only-full.yaml | grep schedule
              cronjob.couchbase.com/spec: '{"schedule":"11 * * * *","concurrencyPolicy":"Forbid","jobTemplate":{"metadata":{"creationTimestamp":null},"spec":{"backoffLimit":2,"template":{"metadata":{"creationTimestamp":null,"labels":{"app":"couchbase","couchbase_backup":"full-only","couchbase_cluster":"test-couchbase-zm2d9"}},"spec":{"volumes":[{"name":"couchbase-cluster-backup-volume","persistentVolumeClaim":{"claimName":"full-only"}},{"name":"couchbase-admin","secret":{"secretName":"basic-test-secret"}}],"containers":[{"name":"cbbackupmgr-full","image":"prateekkumarv12f/backup:6.6.0-100","command":["backup_script"],"args":["test-couchbase-zm2d9","--strategy","full_only","--config","true","--mode","backup","--backup-ret","720.00","--log-ret","168.00","-v","INFO"],"workingDir":"/","resources":{},"volumeMounts":[{"name":"couchbase-cluster-backup-volume","mountPath":"/data"},{"name":"couchbase-admin","readOnly":true,"mountPath":"/var/run/secrets/couchbase"}]}],"restartPolicy":"Never","serviceAccountName":"couchbase-backup","securityContext":{"fsGroup":1000},"imagePullSecrets":[{"name":"test-docker-pull-secret-zkdkf"},{"name":"test-docker-pull-secret-pnm84"}]}}}},"successfulJobsHistoryLimit":3,"failedJobsHistoryLimit":3}'
                          f:schedulerName: {}
                  f:schedule: {}
                    schedulerName: default-scheduler
            schedule: 11 * * * *

          The cronjob at the time of collection looks like...

          status:
            active:
            - apiVersion: batch/v1
              kind: Job
              name: full-only-full-1616090820
              namespace: test-5w28c
              resourceVersion: "17607"
              uid: 6b1f3a78-9232-410d-bd19-32bb21197e68
            lastScheduleTime: "2021-03-18T18:07:00Z"

          And the job...

          status:
            active: 1
            startTime: "2021-03-18T18:07:02Z"

          And the pod...

            - containerID: docker://35f2c6bb4dd289cb5506a994847748deaadd3d45470b984b19a6638cb3de0a7d
              image: istio/proxyv2:1.9.0
              imageID: docker-pullable://istio/proxyv2@sha256:286b821197d7a9233d1d889119f090cd9a9394468d3a312f66ea24f6e16b2294
              lastState: {}
              name: istio-proxy
              ready: true
              restartCount: 0
              started: true
              state:
                running:
                  startedAt: "2021-03-18T18:07:28Z"

          Oooooh, it seems the job terminating doesn't bring down the pod, and envoy is keeping it alive and preventing further scheduling
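Sketching that check: a minimal, hypothetical helper (the function name and sidecar set are mine, not part of the operator) that inspects a pod's `containerStatuses` and flags the stuck state where only `istio-proxy` is still running:

```python
# Hypothetical helper: given a pod's containerStatuses (as parsed from
# `kubectl get pod -o json`), decide whether only the Istio sidecar is
# still running, i.e. the Job's work is done but the Pod cannot reach a
# terminal phase, so the CronJob (concurrencyPolicy: Forbid) stays blocked.

SIDECAR_NAMES = {"istio-proxy"}  # assumption: default Istio sidecar container name

def only_sidecar_running(container_statuses):
    running = [c["name"] for c in container_statuses if "running" in c.get("state", {})]
    terminated = [c["name"] for c in container_statuses if "terminated" in c.get("state", {})]
    # The stuck condition: every still-running container is a sidecar,
    # and at least one application container has already terminated.
    return bool(running) and set(running) <= SIDECAR_NAMES and bool(terminated)

# Shape matches the pod status above: backup container done, sidecar alive.
statuses = [
    {"name": "cbbackupmgr-full", "state": {"terminated": {"exitCode": 0}}},
    {"name": "istio-proxy", "state": {"running": {"startedAt": "2021-03-18T18:07:28Z"}}},
]
print(only_sidecar_running(statuses))  # True: the sidecar is keeping the Pod alive
```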

          simon.murray Simon Murray added a comment -

          #fml... per https://github.com/istio/istio/issues/6324, it seems the "answer" is to kill Envoy with a /quitquitquit cURL call. Sigh.
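A sketch of that workaround, assuming Istio's pilot-agent default admin port 15020 and its `/quitquitquit` endpoint; the wrapper function below is a hypothetical illustration of the pattern, not the actual `backup_script` entrypoint:

```python
import subprocess
import urllib.request

# Istio's pilot-agent exposes /quitquitquit; POSTing to it asks the Envoy
# sidecar to exit so the Job's Pod can finally reach a terminal phase.
PILOT_AGENT = "http://127.0.0.1:15020"  # assumption: default pilot-agent port

def quit_url(base=PILOT_AGENT):
    return base.rstrip("/") + "/quitquitquit"

def run_then_stop_sidecar(cmd):
    """Run the backup command, then ask the sidecar to shut down."""
    result = subprocess.run(cmd)
    try:
        # Fire-and-forget: if there is no sidecar (Istio disabled), the
        # request simply fails and we ignore it.
        urllib.request.urlopen(
            urllib.request.Request(quit_url(), method="POST"), timeout=5
        )
    except OSError:
        pass
    return result.returncode

# e.g. raise SystemExit(run_then_stop_sidecar(["backup_script", "--mode", "backup"]))
```

The fire-and-forget shape matters: the sidecar kill must run regardless of whether the backup succeeded, and must not itself fail the Job when no sidecar is present.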

           

          simon.murray Simon Murray added a comment -

          releasenote:

          Under Istio in mTLS mode, backup and restore `Job` and `CronJob` resources are unable to complete. This is because the Envoy sidecar proxy does not terminate when a backup or restore `Job` does, keeping the `Pod` alive and blocking further scheduling. This is remedied in Couchbase Operator Backup versions 1.0.0 and above by sending a termination signal to the Envoy sidecar when the backup or restore `Job` finishes.


            People

            Assignee:
            simon.murray Simon Murray
            Reporter:
            prateek.kumar Prateek Kumar
             Votes:
             0
             Watchers:
             2
