  1. Couchbase Kubernetes
  2. K8S-3594

Operator tries to add pods to cluster before failover, leading to an infinite rebalance failure loop


Details

    • Bug
    • Resolution: Fixed
    • Critical
    • 2.8.0
    • 2.7.0
    • operator
    • Upgrade Cluster version : 7.6.1-3200
      Kubernetes Version : v1.30.0
      CAO and operator : 2.7.0 built locally
      Environment : Kind cluster
    • 15 - First Frontier, 16 - Killing Time
    • 2

    Description

      Cluster Setup

      • Kind cluster run locally on a Mac
      • 5 nodes with kv, index, and query services
      • 1 bucket
      • Cluster version: 7.6.1-3200

      Steps taken in the scenario

      • Created a cluster
      • Created 1 bucket
      • Deleted 2 pods using:

       

      $ kubectl delete pod cb-example-0003
      $ kubectl delete pod cb-example-0004

       

      • Since the auto-failover count was 1, ns-server auto-failed over only 1 of the 2 deleted pods.
      • The second pod was not failed over.
      • Yet the replacement pods, cb-example-0005 and cb-example-0006, had already been added to the cluster by the operator.
      • The operator then triggers a rebalance. The rebalance fails because one of the deleted pods is unresponsive and has not been failed over.
      • The rebalance is retried continuously and fails every time, leading to an infinite rebalance retry loop.
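
      The sequence above can be illustrated with a small model. This is only a sketch, not ns-server's implementation: it assumes auto-failover simply stops once the configured count is used up, as described in the steps. The function name is hypothetical, and which of the two pods gets failed over first is arbitrary here.

```python
def auto_failover(down_nodes, max_count):
    """Simplified model of the auto-failover quota.

    Fails over unreachable nodes until max_count is reached.
    Returns (failed_over, down_but_still_active).
    """
    failed_over = list(down_nodes[:max_count])
    still_active = list(down_nodes[max_count:])
    return failed_over, still_active


# Two pods deleted, auto-failover count of 1 (values taken from the report).
failed, stuck = auto_failover(["cb-example-0003", "cb-example-0004"], max_count=1)
print("failed over:", failed)
print("down but still an active member:", stuck)
```

      Any node left in the second list is the one that makes a later rebalance fail if it is ejected.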

      Issues

      • New replacement pods should not be added until all down pods/nodes have been failed over. There should be a status check that verifies whether each pod has actually been failed over.
      • A rebalance is triggered and the pod that was not failed over is ejected in that rebalance. Ejecting an unresponsive node always causes the rebalance to fail. A pod should not be added to the eject parameters unless it has been failed over.
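
      A minimal sketch of the kind of check proposed above. It assumes node state is read from the Couchbase Server `/pools/default` REST response, where each node entry carries `otpNode`, `status`, and `clusterMembership`, and where a failed-over node reports `clusterMembership` as `inactiveFailed`. The helper names and the sample node list are hypothetical; this is not the operator's actual code.

```python
def pending_failover(nodes):
    """Return nodes that are unhealthy but still active, i.e. not yet failed over."""
    return [
        n["otpNode"]
        for n in nodes
        if n["status"] == "unhealthy" and n["clusterMembership"] == "active"
    ]


def rebalance_params(nodes):
    """Build knownNodes/ejectedNodes for a rebalance request.

    Returns None while any down node is still waiting to be failed over,
    so the caller neither adds replacement pods nor starts a rebalance.
    """
    if pending_failover(nodes):
        return None
    known = [n["otpNode"] for n in nodes]
    # Only nodes that were actually failed over are eligible for ejection.
    ejected = [n["otpNode"] for n in nodes if n["clusterMembership"] == "inactiveFailed"]
    return {"knownNodes": ",".join(known), "ejectedNodes": ",".join(ejected)}


def node(name, status, membership):
    """Hypothetical helper that builds one node entry for the example."""
    return {"otpNode": "ns_1@" + name, "status": status, "clusterMembership": membership}


# State after one of the two deleted pods has been auto-failed over.
nodes = [
    node("cb-example-0000", "healthy", "active"),
    node("cb-example-0001", "healthy", "active"),
    node("cb-example-0002", "healthy", "active"),
    node("cb-example-0003", "unhealthy", "inactiveFailed"),
    node("cb-example-0004", "unhealthy", "active"),  # down, not failed over
]

blocked = rebalance_params(nodes)  # None: cb-example-0004 is still active

# Once the remaining down node has been failed over, the rebalance can proceed.
nodes[4]["clusterMembership"] = "inactiveFailed"
allowed = rebalance_params(nodes)
```

      With a guard like this, the decision to add pods and the eject list both depend on the same membership check, so an unresponsive node that is still active never reaches the eject parameters.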

       


      Operator logs:
      https://cb-engineering.s3.amazonaws.com/failover_problem/cbopinfo-20240724T200416+0530.tar.gz
      Cluster logs:
      https://cb-engineering.s3.amazonaws.com/failover_problem/collectinfo-2024-07-24T143343-ns_1%40cb-example-0000.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/failover_problem/collectinfo-2024-07-24T143343-ns_1%40cb-example-0001.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/failover_problem/collectinfo-2024-07-24T143343-ns_1%40cb-example-0002.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/failover_problem/collectinfo-2024-07-24T143343-ns_1%40cb-example-0005.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/failover_problem/collectinfo-2024-07-24T143343-ns_1%40cb-example-0006.cb-example.default.svc.zip
       


      The cao tool and operator images were built locally from this commit:

      commit c2e920ddbcfa9b4819d47ad81d0a35c359dd1dc6 (HEAD -> master, origin/master, origin/HEAD)
      Author: usamah jassat <usamah.jassat@couchbase.com>
      Date:   Wed Jul 17 15:11:19 2024 +0100

          K8S-3581: don't attempt backend migration when rebalance required
          
          Change-Id: I2d2b6d6d4f8dbb0a30db5bd54a05631d17631eee
          Reviewed-on: https://review.couchbase.org/c/couchbase-operator/+/212890
          Reviewed-by: Yusuf Ramzan <yusuf.ramzan@couchbase.com>
          Tested-by: Build Bot <build@couchbase.com>

          People

            yusuf.ramzan Yusuf Ramzan
            raghav.sk Raghav S K