Couchbase Kubernetes / K8S-3516

[2.6.4-119] EKS worker node upgrade failed from 1.25 to 1.26

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • 2.6.4
    • 2.6.4
    • operator
    • None
    • 11 - Race to Crashpoint Tower
    • 0

    Description

      Couchbase Cluster Description

      • Set up the cluster as per the required specifications
      • Each node is an m5.4xlarge instance. (16 vCPUs and 64GB RAM)
      • 6 Data Service nodes and 4 Index and Query Service nodes.
      • 10 Buckets (with 1 replica), Full Eviction and Auto-failover set to 5s.
      • ~3TB data loaded onto the cluster.
      • 50 Primary Indexes with 1 Replica each. (Total 100 Indexes)
      • Continuous data and query workload on all buckets during the update process.

      Task: Upgrade EKS 1.25 -> 1.26

       

      Observation:-

      1. The control plane was updated successfully.
      2. The worker node update failed.

       

      Follow-Ups :

      1. Why is the operator still looking for cb-example-0005 when a new pod, cb-example-0010, has been added back to the cluster?
      2. How does this impact the overall EKS worker node upgrade? (It may be related to the PodDisruptionBudget, PDB.)
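For the PDB question, a minimal sketch of the arithmetic Kubernetes applies to a PodDisruptionBudget that uses minAvailable may help frame the discussion. The values below (10 pods, min_available of 9) are assumptions chosen for illustration, not the settings of the PDB in this cluster, and the function name is hypothetical.

```python
def disruptions_allowed(healthy_pods, min_available):
    """Number of voluntary evictions a minAvailable PDB would permit.

    Mirrors the PDB status rule: allowed = currentHealthy - desiredHealthy,
    floored at zero, where desiredHealthy equals minAvailable.
    """
    return max(0, healthy_pods - min_available)


# Assumed example: 10 Couchbase pods protected by a PDB with minAvailable=9.
all_healthy = disruptions_allowed(healthy_pods=10, min_available=9)
one_unhealthy = disruptions_allowed(healthy_pods=9, min_available=9)

print(all_healthy)    # one pod may be evicted, so a node drain can proceed
print(one_unhealthy)  # no evictions permitted, so a node drain would block
```

Under these assumed values, a single unhealthy or unrecognised pod is enough to bring the allowed disruptions to zero, which would stall a rolling node drain until the cluster is healthy again.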

       

      Analysis:-

      1. cb-example-0005 was failed over, and a replacement pod named cb-example-0010 was added back to the cluster.
      2. The node hosting the operator and the admission controller is still on 1.25; it did not get upgraded.
      3. Every CB pod except cb-example-0010 (which replaced cb-example-0005) is still on 1.25.
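The member mismatch in the analysis can be sketched as a simple set comparison. This is an illustration only: it assumes the operator's view still holds the original ten member names, and the helper names are hypothetical rather than anything from the operator's code. The running set matches the pods for which collectinfo archives are listed in this ticket.

```python
def pod_name(index):
    # Pod names follow the cb-example-NNNN pattern seen in this ticket.
    return f"cb-example-{index:04d}"


def diff_members(expected, running):
    """Return (missing, unexpected) member names as sorted lists."""
    missing = sorted(expected - running)
    unexpected = sorted(running - expected)
    return missing, unexpected


# Assumption: the operator still expects the original members 0000 to 0009.
expected = {pod_name(i) for i in range(10)}
# Pods actually present after the failover (0005 gone, 0010 added).
running = {pod_name(i) for i in (0, 1, 2, 3, 4, 6, 7, 8, 9, 10)}

missing, unexpected = diff_members(expected, running)
print(missing)     # ['cb-example-0005']
print(unexpected)  # ['cb-example-0010']
```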

      The operator keeps trying to look up cb-example-0005, which no longer exists, and is stuck at this point:

       

      {"level":"info","ts":"2024-05-30T09:53:06Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"-{v2.ClusterStatus}.Conditions[2->?]:{Type:Error Status:True LastUpdateTime:2024-05-30T09:27:32Z LastTransitionTime:2024-05-30T09:27:32Z Reason:ErrorEncountered Message:requested resource not found: unable to lookup node cb-example-0005.cb-example.default.svc:8091};-{v2.ClusterStatus}.Conditions[3->?]:{Type:Scaling Status:True LastUpdateTime:2024-05-30T09:27:34Z LastTransitionTime:2024-05-30T09:27:34Z Reason:ClusterScaling Message:The operator is attempting to scale the cluster};-{v2.ClusterStatus}.Conditions[4->?]:{Type:ScalingUp Status:True LastUpdateTime:2024-05-30T09:27:34Z LastTransitionTime:2024-05-30T09:27:34Z Reason:ScalingUp Message:Scaling Server Class data-only from 5 to 6};+{v2.ClusterStatus}.Autoscalers:[]"}
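When triaging many such operator log lines, it can be convenient to pull the condition type, reason, and message out of the "diff" field. The following is a rough sketch, not a supported parser: the regular expression is an assumption based only on the shape of the line above, and LOG_LINE is a trimmed copy of that line containing just the first condition.

```python
import json
import re

# Trimmed copy of the operator log line quoted above (first condition only).
LOG_LINE = (
    '{"level":"info","ts":"2024-05-30T09:53:06Z","logger":"cluster",'
    '"msg":"Resource updated","cluster":"default/cb-example",'
    '"diff":"-{v2.ClusterStatus}.Conditions[2->?]:{Type:Error Status:True '
    'LastUpdateTime:2024-05-30T09:27:32Z LastTransitionTime:2024-05-30T09:27:32Z '
    'Reason:ErrorEncountered Message:requested resource not found: unable to '
    'lookup node cb-example-0005.cb-example.default.svc:8091}"}'
)

# Matches "{Type:<T> ... Reason:<R> Message:<M>}" inside the diff text.
CONDITION_RE = re.compile(r"\{Type:(\S+) .*?Reason:(\S+) Message:(.*?)\}")


def conditions_in_diff(line):
    """Extract the conditions mentioned in the 'diff' field of one log line."""
    entry = json.loads(line)
    return [
        {"type": ctype, "reason": reason, "message": message}
        for ctype, reason, message in CONDITION_RE.findall(entry["diff"])
    ]


for cond in conditions_in_diff(LOG_LINE):
    print(cond["type"], cond["reason"], cond["message"], sep=" | ")
```

Run against the full line, the same pattern should also pick up the Scaling and ScalingUp conditions, since they follow the same field layout.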

       

      CB logs - http://supportal.couchbase.com/snapshot/563d7ce9bd15b54278019e5f584f95fb::0

      https://cb-engineering.s3.amazonaws.com/EKS_UPGRADE_1.26_CB_7_2_5_failed/collectinfo-2024-05-30T121628-ns_1%40cb-example-0000.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/EKS_UPGRADE_1.26_CB_7_2_5_failed/collectinfo-2024-05-30T121628-ns_1%40cb-example-0001.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/EKS_UPGRADE_1.26_CB_7_2_5_failed/collectinfo-2024-05-30T121628-ns_1%40cb-example-0002.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/EKS_UPGRADE_1.26_CB_7_2_5_failed/collectinfo-2024-05-30T121628-ns_1%40cb-example-0003.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/EKS_UPGRADE_1.26_CB_7_2_5_failed/collectinfo-2024-05-30T121628-ns_1%40cb-example-0004.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/EKS_UPGRADE_1.26_CB_7_2_5_failed/collectinfo-2024-05-30T121628-ns_1%40cb-example-0006.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/EKS_UPGRADE_1.26_CB_7_2_5_failed/collectinfo-2024-05-30T121628-ns_1%40cb-example-0007.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/EKS_UPGRADE_1.26_CB_7_2_5_failed/collectinfo-2024-05-30T121628-ns_1%40cb-example-0008.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/EKS_UPGRADE_1.26_CB_7_2_5_failed/collectinfo-2024-05-30T121628-ns_1%40cb-example-0009.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/EKS_UPGRADE_1.26_CB_7_2_5_failed/collectinfo-2024-05-30T121628-ns_1%40cb-example-0010.cb-example.default.svc.zip

      Operator logs - cbopinfo-20240530T174642+0530.tar.gz

      Cluster link https://us-east-2.console.aws.amazon.com/eks/home?region=us-east-2#/clusters/k8s-eks-setup-manik-56?selectedTab=cluster-compute-tab-  

      Cluster SS :-


            People

              usamah.jassat Usamah Jassat
              manik.mahajan Manik Mahajan