  1. Couchbase Kubernetes
  2. K8S-3599

Swap rebalance is not retried on failure; a new pod is created after every failure, causing infinite pod creation


Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • 2.7.0
    • 2.7.0
    • operator
    • Cluster version: 7.0.5-7659
      Kubernetes version: v1.30.0
      CAO and operator: 2.7.0, built locally
      Environment: Kind cluster
    • 15 - First Frontier
    • 1

    Description

      Cluster Setup

      • Kind cluster running locally on a Mac
      • 2 nodes with all services
      • 1 bucket
      • Initial cluster version: 7.0.5-7659

      Steps taken in the scenario

      • Created a cluster.
      • Created 1 bucket.
      • Changed the cluster config to add a logging sidecar.
      • The operator issued a swap rebalance to reconcile the cluster with the change.
      • A new pod was added.
      • A rebalance was issued immediately.
      • The rebalance failed with a not_all_nodes_are_ready_yet error (tracked in K8S-3598).
      • The rebalance was not retried; instead, a new pod was created and added to the cluster.
      • The rebalance failed again, and another new pod was created and added to the cluster.
      • A pod was failed over. Without this failover, pod creation would have continued in an infinite loop.

       

      Another instance of the same issue: https://cb-engineering.s3.amazonaws.com/K8S-3598/cbopinfo-20240725T172847+0530.tar.gz

      In this instance I did not fail over the problematic pod/node, so every rebalance failure spun up another pod, without end.

      Issue

      • The rebalance is not retried when the first swap rebalance fails.
      • Instead of retrying, a new pod is spun up after each failure, without limit.
      • Each new pod that is added to the cluster is ejected in the very next rebalance. Pods are created, added, and immediately ejected, which wastes resources.
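The behaviour this report expects, retrying the rebalance rather than adding another pod, can be sketched as a toy model. This is not operator code: `FakeCluster`, `add_pod`, `rebalance`, `reconcile_with_retry`, `reconcile_without_retry`, and the attempt limits are all hypothetical names and values chosen for illustration, and the real reconcile loop is not shown in this ticket. Only the error name comes from the report.

```python
"""Toy model: retry a failed swap rebalance instead of adding another pod.

All names here are hypothetical; this is not couchbase-operator code.
"""


class RebalanceError(Exception):
    """Raised when the simulated rebalance fails."""


class FakeCluster:
    """Simulated cluster whose rebalance fails a fixed number of times."""

    def __init__(self, failures_before_success):
        self.failures_left = failures_before_success
        self.pods_created = 0
        self.rebalance_attempts = 0

    def add_pod(self):
        self.pods_created += 1

    def rebalance(self):
        self.rebalance_attempts += 1
        if self.failures_left > 0:
            self.failures_left -= 1
            # Error name taken from the ticket; the failure itself is simulated.
            raise RebalanceError("not_all_nodes_are_ready_yet")


def reconcile_with_retry(cluster, max_attempts=5):
    """Add one replacement pod, then retry the rebalance up to max_attempts times."""
    cluster.add_pod()
    for _ in range(max_attempts):
        try:
            cluster.rebalance()
            return True
        except RebalanceError:
            continue  # retry the same rebalance; do not create another pod
    return False


def reconcile_without_retry(cluster, max_cycles=5):
    """Behaviour described in this ticket: every failure leads to a new pod."""
    for _ in range(max_cycles):  # bounded here; unbounded in the report
        cluster.add_pod()
        try:
            cluster.rebalance()
            return True
        except RebalanceError:
            continue
    return False


if __name__ == "__main__":
    retried = FakeCluster(failures_before_success=3)
    reconcile_with_retry(retried)
    print("with retry:", retried.pods_created, "pod(s) created")

    looping = FakeCluster(failures_before_success=3)
    reconcile_without_retry(looping)
    print("without retry:", looping.pods_created, "pod(s) created")
```

With three simulated failures, the retrying variant creates one pod and the non-retrying variant creates four. A real implementation would presumably also wait between attempts; the delay is left out here to keep the model small.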

       

       


      Operator logs:

      https://cb-engineering.s3.amazonaws.com/K8S-3598/cbopinfo-20240725T170701+0530.tar.gz

      Another instance of the loop: https://cb-engineering.s3.amazonaws.com/K8S-3598/cbopinfo-20240725T172847+0530.tar.gz

      Cluster logs:
      https://cb-engineering.s3.amazonaws.com/K8S-3598/collectinfo-2024-07-25T114628-ns_1%40cb-example-0005.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/K8S-3598/collectinfo-2024-07-25T114628-ns_1%40cb-example-0001.cb-example.default.svc.zip
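To see how often the rebalance error recurs in these bundles, one option is to count matching lines per file once an archive has been extracted. The sketch below assumes the archive is already unpacked into a local directory and makes no assumption about the bundle's internal layout or log format: it walks every file and does a plain substring match. The function name `count_error_lines` and the directory name in the usage note are hypothetical.

```python
"""Count lines containing an error string in an extracted log bundle."""
import os


def count_error_lines(root, needle="not_all_nodes_are_ready_yet"):
    """Return {relative_path: matching_line_count} for files under root."""
    hits = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            count = 0
            try:
                # errors="replace" keeps binary or oddly encoded files
                # from aborting the scan.
                with open(path, "r", encoding="utf-8", errors="replace") as handle:
                    for line in handle:
                        if needle in line:
                            count += 1
            except OSError:
                continue  # unreadable file: skip it
            if count:
                hits[os.path.relpath(path, root)] = count
    return hits
```

For example, `count_error_lines("cbopinfo-extracted")` would return a dictionary mapping each file that mentions the error to its number of matching lines, where `cbopinfo-extracted` stands for wherever the bundle was unpacked.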
       


      The cao tool and operator images were built locally from this commit:

      commit c2e920ddbcfa9b4819d47ad81d0a35c359dd1dc6 (HEAD -> master, origin/master, origin/HEAD)
      Author: usamah jassat <usamah.jassat@couchbase.com>
      Date:   Wed Jul 17 15:11:19 2024 +0100

          K8S-3581: don't attempt backend migration when rebalance required

          Change-Id: I2d2b6d6d4f8dbb0a30db5bd54a05631d17631eee
          Reviewed-on: https://review.couchbase.org/c/couchbase-operator/+/212890
          Reviewed-by: Yusuf Ramzan <yusuf.ramzan@couchbase.com>
          Tested-by: Build Bot <build@couchbase.com>


            People

              raghav.sk Raghav S K
              raghav.sk Raghav S K