  1. Couchbase Kubernetes
  2. K8S-2532

cao certify with --clean can't clean, resources exhausted

Details

    • Bug
    • Resolution: Fixed
    • Major
    • 2.3.0
    • 2.3.0-beta
    • None
    • None
    • 1: Recovery to productivity, 3: SBEE, Multi-Cert
    • 1

    Description

      When attempting to run something simple:

      bin/cao certify --clean -- -test TestCreateCluster

      It creates the certification pod, but the pod never becomes ready, so the command tears everything down and exits with an error:

      {preformat}
      Initializing ...
      Deleting pull secrets ...
      Deleting certification pod ...
      Creating service account ...
      Creating cluster role ...
      Creating cluster role binding ...
      Creating artifacts volume ...
      Creating pull secrets ...
      Creating certification pod ...
      Waiting for certification pod to become ready ...
      Deleting certification pod ...
      Deleting pull secrets ...
      Deleting artifacts volume ...
      Deleting cluster role binding ...
      Deleting cluster role ...
      Deleting service account ...
      Certification error: failed to wait for condition: pod condition missing
      {preformat}

      Investigation shows the pod could not be scheduled because no node had enough free CPU. Many old test namespaces are still present, and their pods are still running:

      {preformat}
      % kubectl describe pod certification
      Name:         certification
      Namespace:    default
      Priority:     0
      Node:         <none>
      Labels:       <none>
      Annotations:  <none>
      Status:       Pending
      IP:
      IPs:          <none>
      Containers:
        certification:
          Image:      couchbase/operator-certification:2.3.0-beta1
          Port:       <none>
          Host Port:  <none>
          Args:
            -test.v
            -test.run
            TestOperator
            -test.timeout
            12h
            -test.parallel
            8
            -color
            -collect-logs
            -test
            TestCreateCluster
          Requests:
            cpu:        2
            memory:     3Gi
          Environment:  <none>
          Mounts:
            /artifacts from artifacts (rw)
            /var/run/secrets/kubernetes.io/serviceaccount from couchbase-operator-certification-token-5bhck (ro)
      Conditions:
        Type           Status
        PodScheduled   False
      Volumes:
        artifacts:
          Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
          ClaimName:  artifacts
          ReadOnly:   false
        couchbase-operator-certification-token-5bhck:
          Type:        Secret (a volume populated by a Secret)
          SecretName:  couchbase-operator-certification-token-5bhck
          Optional:    false
      QoS Class:       Burstable
      Node-Selectors:  <none>
      Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                       node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                       node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
      Events:
        Type     Reason            Age    From               Message
        ----     ------            ----   ----               -------
        Warning  FailedScheduling  2m42s  default-scheduler  0/3 nodes are available: 3 Insufficient cpu.
        Warning  FailedScheduling  2m42s  default-scheduler  0/3 nodes are available: 3 Insufficient cpu.
      ingenthr@ingenthr-mbp ~ % kubectl get namespace
      NAME STATUS AGE
      default Active 11d
      kube-node-lease Active 11d
      kube-public Active 11d
      kube-system Active 11d
      test-586zq Active 3h2m
      test-bxxbh Active 177m
      test-cdwq4 Active 179m
      test-dwwtq Active 3h2m
      test-kcl2r Active 3h2m
      test-kxssz Active 3h3m
      test-skfq6 Active 3h5m
      test-skkls Active 3h6m
      test-tgqkr Active 3h1m
      test-w67bd Active 3h2m
      ingenthr@ingenthr-mbp ~ % kubectl get po -n test-w67bd
      NAME READY STATUS RESTARTS AGE
      couchbase-operator-6dfdb7c9cc-bkbmn 1/1 Running 0 3h2m
      test-couchbase-ssl9j-0000 1/1 Running 0 3h1m
      ingenthr@ingenthr-mbp ~ % kubectl get po -n test-tgqkr
      NAME READY STATUS RESTARTS AGE
      couchbase-operator-6dfdb7c9cc-776bw 1/1 Running 0 3h2m
      test-couchbase-h72cn-0000 1/1 Running 0 179m
      test-couchbase-h72cn-0001 1/1 Running 0 172m
      test-couchbase-h72cn-0002 0/1 Pending 0 25s
      {preformat}

      {preformat}
      % kubectl get namespace
      NAME STATUS AGE
      default Active 11d
      kube-node-lease Active 11d
      kube-public Active 11d
      kube-system Active 11d
      test-586zq Active 3h4m
      test-bxxbh Active 178m
      test-cdwq4 Active 3h1m
      test-dwwtq Active 3h3m
      test-kcl2r Active 3h3m
      test-kxssz Active 3h5m
      test-skfq6 Active 3h7m
      test-skkls Active 3h8m
      test-tgqkr Active 3h3m
      test-w67bd Active 3h3m
      ingenthr@ingenthr-mbp ~ % kubectl get namespace --show-labels
      NAME STATUS AGE LABELS
      default Active 11d <none>
      kube-node-lease Active 11d <none>
      kube-public Active 11d <none>
      kube-system Active 11d addonmanager.kubernetes.io/mode=Reconcile,control-plane=true,kubernetes.io/cluster-service=true
      test-586zq Active 3h5m istio-injection=enabled
      test-bxxbh Active 179m istio-injection=enabled
      test-cdwq4 Active 3h2m istio-injection=enabled
      test-dwwtq Active 3h4m istio-injection=enabled
      test-kcl2r Active 3h4m istio-injection=enabled
      test-kxssz Active 3h6m istio-injection=enabled
      test-skfq6 Active 3h8m istio-injection=enabled
      test-skkls Active 3h9m istio-injection=enabled
      test-tgqkr Active 3h4m istio-injection=enabled
      test-w67bd Active 3h4m istio-injection=enabled
      {preformat}
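
As a stopgap, the leftover namespaces in the listings above can be picked out by name. The following is a minimal sketch, not part of `cao`: it assumes every test namespace is named `test-` plus five lowercase alphanumeric characters, which is only inferred from the output above and is not a documented guarantee. The function name `stale_test_namespaces` is hypothetical.

```python
import re

# Assumed naming pattern, inferred from the listing in this issue:
# "test-" followed by five lowercase alphanumeric characters.
TEST_NAMESPACE = re.compile(r"^test-[a-z0-9]{5}$")


def stale_test_namespaces(listing):
    """Return namespace names from `kubectl get namespace` output that match the pattern."""
    names = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines and the NAME/STATUS/AGE header row.
        if not fields or fields[0] == "NAME":
            continue
        if TEST_NAMESPACE.match(fields[0]):
            names.append(fields[0])
    return names


# A few rows copied from the listing in this issue.
listing = """\
NAME STATUS AGE
default Active 11d
kube-system Active 11d
test-586zq Active 3h4m
test-w67bd Active 3h3m
"""
print(stale_test_namespaces(listing))
```

Matching on names is fragile: any unrelated namespace that happens to fit the pattern would be selected too, so the result should be reviewed before anything is deleted.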

      Suggestion: label the namespaces that certification creates so their origin is clear. `cao certify --clean` could then delete them before it tries to deploy a new pod.
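
The label-based selection suggested here could look roughly like the sketch below. It is an illustration under stated assumptions, not how `cao` works: the label `example.com/created-by=certification` is hypothetical (the issue does not name a label), the function names are made up for this example, and the sample listing is invented to show one labeled and one unlabeled namespace.

```python
# Hypothetical label; the issue proposes a label but does not name one.
CERTIFICATION_LABEL = ("example.com/created-by", "certification")


def parse_labels(field):
    """Turn a LABELS column value such as 'a=1,b=2' into a dict ('<none>' means no labels)."""
    if field == "<none>":
        return {}
    return dict(pair.split("=", 1) for pair in field.split(","))


def certification_namespaces(listing, label=CERTIFICATION_LABEL):
    """Return namespaces in `kubectl get namespace --show-labels` output carrying the label."""
    key, value = label
    selected = []
    for line in listing.splitlines():
        fields = line.split()
        # Expect NAME STATUS AGE LABELS; skip the header and malformed rows.
        if len(fields) != 4 or fields[0] == "NAME":
            continue
        if parse_labels(fields[3]).get(key) == value:
            selected.append(fields[0])
    return selected


# Invented sample: only the first test namespace carries the hypothetical label.
sample = """\
NAME STATUS AGE LABELS
default Active 11d <none>
test-586zq Active 3h5m example.com/created-by=certification,istio-injection=enabled
test-bxxbh Active 179m istio-injection=enabled
"""
print(certification_namespaces(sample))
```

On a live cluster the same selection can be done server-side with a label selector, for example `kubectl get namespace -l example.com/created-by=certification`, which avoids parsing text output altogether.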

          People

            tommie Tommie McAfee (Inactive)
            ingenthr Matt Ingenthron