Details
-
Bug
-
Resolution: Fixed
-
Critical
-
2.7.0
-
Upgrade Cluster version : 7.6.1-3200
Kubernetes Version : v1.30.0
CAO and operator : 2.7.0 built locally
Environment : Kind cluster
-
15 - First Frontier, 16 - Killing Time
-
2
Description
Cluster Setup
- Kind cluster running locally on a Mac
- 5 nodes with kv, index, and query services
- 1 bucket
- Cluster version : 7.6.1-3200
Steps taken in the scenario
- Created a cluster
- Created 1 bucket
- Deleted 2 pods using:
$ kubectl delete pod cb-example-0003
$ kubectl delete pod cb-example-0004
- Since the auto-failover count was 1, ns-server automatically failed over only 1 of the pods.
- The second pod was not failed over.
- Even so, the operator had already added the replacement pods cb-example-0005 and cb-example-0006 to the cluster.
- The operator then triggered a rebalance, which failed because it tried to eject the unresponsive pod that had not been failed over.
- The operator keeps retrying the rebalance and it fails every time, leading to an infinite rebalance retry loop.
Issues
- Replacement pods should not be added until all of the downed pods/nodes have been failed over. There should be a status check that verifies whether each pod has actually been failed over.
- The rebalance was triggered with the non-failed-over pod in the eject list. Ejecting an unresponsive node always causes the rebalance to fail. A pod should not be added to the eject parameters unless it has been failed over.
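The status check suggested above could look roughly like the following. This is a minimal sketch, not the operator's actual code: it assumes the operator has already fetched the JSON body of GET /pools/default from Couchbase Server and relies on the per-node `otpNode`, `status`, and `clusterMembership` fields in that response. The names `node`, `poolInfo`, `classify`, and `sample` are hypothetical, and the sample payload is hand-written for illustration (which of the two deleted pods was failed over is an assumption).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// node holds the subset of per-node fields from GET /pools/default
// that this sketch needs.
type node struct {
	OTPNode           string `json:"otpNode"`
	Status            string `json:"status"`            // e.g. "healthy", "unhealthy"
	ClusterMembership string `json:"clusterMembership"` // e.g. "active", "inactiveFailed"
}

// poolInfo is a hypothetical wrapper for the top-level response.
type poolInfo struct {
	Nodes []node `json:"nodes"`
}

// classify returns the nodes that have been failed over (safe to pass as
// ejected nodes in a rebalance) and the nodes that are down but still
// active (not yet failed over, so they must not be ejected).
func classify(raw []byte) ([]string, []string, error) {
	var p poolInfo
	if err := json.Unmarshal(raw, &p); err != nil {
		return nil, nil, err
	}
	var ejectable, pending []string
	for _, n := range p.Nodes {
		switch {
		case n.ClusterMembership == "inactiveFailed":
			ejectable = append(ejectable, n.OTPNode)
		case n.ClusterMembership == "active" && n.Status == "unhealthy":
			pending = append(pending, n.OTPNode)
		}
	}
	return ejectable, pending, nil
}

// sample is a hand-written, trimmed payload (assumption: cb-example-0003
// was the pod that ns-server failed over, cb-example-0004 was not).
const sample = `{"nodes":[
 {"otpNode":"ns_1@cb-example-0000.cb-example.default.svc","status":"healthy","clusterMembership":"active"},
 {"otpNode":"ns_1@cb-example-0003.cb-example.default.svc","status":"unhealthy","clusterMembership":"inactiveFailed"},
 {"otpNode":"ns_1@cb-example-0004.cb-example.default.svc","status":"unhealthy","clusterMembership":"active"}
]}`

func main() {
	ejectable, pending, err := classify([]byte(sample))
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println("safe to eject:", ejectable)
	fmt.Println("down but not failed over:", pending)
	// Only add replacement pods and rebalance once nothing is pending.
	fmt.Println("proceed with replacement:", len(pending) == 0)
}
```

With a gate like this, the operator would hold back the replacement pods and the rebalance while any node is still down but active, instead of retrying a rebalance that cannot succeed.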
Operator logs :
https://cb-engineering.s3.amazonaws.com/failover_problem/cbopinfo-20240724T200416+0530.tar.gz
Cluster logs :
https://cb-engineering.s3.amazonaws.com/failover_problem/collectinfo-2024-07-24T143343-ns_1%40cb-example-0000.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/failover_problem/collectinfo-2024-07-24T143343-ns_1%40cb-example-0001.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/failover_problem/collectinfo-2024-07-24T143343-ns_1%40cb-example-0002.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/failover_problem/collectinfo-2024-07-24T143343-ns_1%40cb-example-0005.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/failover_problem/collectinfo-2024-07-24T143343-ns_1%40cb-example-0006.cb-example.default.svc.zip
The cao tool and operator images were built locally on this commit
commit c2e920ddbcfa9b4819d47ad81d0a35c359dd1dc6 (HEAD -> master, origin/master, origin/HEAD)
Author: usamah jassat <usamah.jassat@couchbase.com>
Date:   Wed Jul 17 15:11:19 2024 +0100

    K8S-3581: don't attempt backend migration when rebalance required

    Change-Id: I2d2b6d6d4f8dbb0a30db5bd54a05631d17631eee
    Reviewed-on: https://review.couchbase.org/c/couchbase-operator/+/212890
    Reviewed-by: Yusuf Ramzan <yusuf.ramzan@couchbase.com>
    Tested-by: Build Bot <build@couchbase.com>