Description
Couchbase Cluster Description
- Set up the cluster as per the required specifications
- Each node is an m5.4xlarge instance. (16 vCPUs and 64GB RAM)
- 6 Data Service nodes and 4 Index and Query Service nodes.
- 10 Buckets (with 1 replica), Full Eviction and Auto-failover set to 5s.
- Total Data around 3TiB on 6 KV nodes.
- 50 Primary Indexes with 1 Replica each. (Total 100 Indexes)
- DeltaRecovery Upgrade to update Couchbase Server from 7.2.5 to 7.6.1
- Continuous data and query workload on all buckets during the update process.
Observation:-
- Successful graceful failover upgrade of cb-example-0001.
- While upgrading cb-example-0000, the operator reported an "unexpected counter change" error and started a new graceful failover.
- Operator completed the delta recovery upgrade for cb-example-0000 in 2 attempts.
- Operator completed the delta recovery upgrade for cb-example-0002 in 2 attempts.
- Operator completed the delta recovery upgrade for cb-example-0003 in 4 attempts.
- Decided to stop upgrade and remove the cluster to save AWS cost.
Conclusion :-
- Multiple attempts to gracefully failover the same nodes lead to unnecessary rebalances, which put pressure on the Couchbase server and degrade performance.
- These redundant attempts can result in an unpredictable upgrade duration.
Expectation :-
- Avoid multiple attempts for a graceful upgrade on the same Couchbase node.
- Ensure the upgrade process is completed within a known duration.
Suggestion for fix :-
I think gracefullyFailoverNode compares more counters than it needs to. Only "graceful_failover_start" and "graceful_failover_success" are required to decide whether a graceful failover succeeded, yet the loop also compares the old and new values of every other counter in the initial snapshot, and those counters can be updated for unrelated reasons.
ErrUnexpectedCounterChange is probably returned because one of those other counters (rebalance_start, rebalance_success, failover, failover_complete) changed while the graceful failover was in progress. The current loop:
for name, curVal := range clusterInfo.Counters {
	oldVal := initialCounters[name]

	switch name {
	case "graceful_failover_start", "graceful_failover_success":
		continue
	}

	if curVal != oldVal {
		return ErrUnexpectedCounterChange, false
	}
}
As the following JSON shows, the other counters (rebalance_start, rebalance_success, failover, failover_complete) can change in the middle of a graceful failover, for example because of an auto-failover, and any such change leads to ErrUnexpectedCounterChange.
"counters":{"rebalance_start":5,"graceful_failover_success":4,"failover":4,"failover_complete":4,"graceful_failover_start":4,"rebalance_success":4}
A git diff that could possibly fix this:
```
diff --git a/pkg/cluster/nodereconcile.go b/pkg/cluster/nodereconcile.go
index 0dc2e9f1..e5375b36 100644
--- a/pkg/cluster/nodereconcile.go
+++ b/pkg/cluster/nodereconcile.go
@@ -1188,24 +1188,12 @@ func (r *ReconcileMachine) gracefullyFailoverNode(candidate couchbaseutil.Member
return nil, true
}
- // Compare the counters to see if anything has changed
- if len(initialCounters) > len(clusterInfo.Counters) {
- return ErrUnexpectedCounterChange, false
+ // If the rebalance is unknown state, may lead to fail attempt for graceful failover
+ if clusterInfo.RebalanceStatus == couchbaseutil.RebalanceStatusUnknown
- for name, curVal := range clusterInfo.Counters {
- oldVal := initialCounters[name]
- switch name
-
- if curVal != oldVal
- }
-
- return nil, false
+ return ErrFailoverSuccessCounterNotIncremented, false
})
if err != nil {
```
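The idea behind the diff can also be sketched as a standalone function that looks only at the two graceful failover counters. This is an illustration under assumptions, not the operator's code: `gracefulFailoverDone` and the two error values are hypothetical names, and whether it is safe to ignore the failover and rebalance counters entirely still needs review.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors for this sketch; the operator defines its own.
var (
	errNotStarted  = errors.New("graceful failover start counter not incremented")
	errNotFinished = errors.New("graceful failover success counter not incremented")
)

const (
	counterStart   = "graceful_failover_start"
	counterSuccess = "graceful_failover_success"
)

// gracefulFailoverDone reports whether a graceful failover completed between
// the two snapshots. It inspects only the two graceful failover counters and
// ignores every other counter.
func gracefulFailoverDone(initial, current map[string]int) (bool, error) {
	if current[counterStart] <= initial[counterStart] {
		return false, errNotStarted
	}
	if current[counterSuccess] <= initial[counterSuccess] {
		return false, errNotFinished
	}
	return true, nil
}

func main() {
	// Hypothetical snapshot taken before the graceful failover started.
	initial := map[string]int{
		"rebalance_start": 4, "rebalance_success": 4,
		"failover": 4, "failover_complete": 4,
		counterStart: 3, counterSuccess: 3,
	}
	// Counters as reported in this ticket.
	current := map[string]int{
		"rebalance_start": 5, "rebalance_success": 4,
		"failover": 4, "failover_complete": 4,
		counterStart: 4, counterSuccess: 4,
	}

	done, err := gracefulFailoverDone(initial, current)
	fmt.Println(done, err) // prints true <nil>
}
```

Here the change in rebalance_start no longer fails the attempt, because only the start and success counters decide the outcome.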
Analysis :-
cb-example-0000
- First attempt at graceful failover of cb-example-0000 failed.
```json
{"level":"info","ts":"2024-05-23T05:29:39Z","logger":"cluster","msg":"cb-example-0000"}
{"level":"info","ts":"2024-05-23T05:29:39Z","logger":"cluster","msg":"Upgrading pods with DeltaRecovery","cluster":"default/cb-example","names":["cb-example-0000"],"target-version":"7.6.1"}
{"level":"error","ts":"2024-05-23T05:42:14Z","logger":"cluster","msg":"Reconciliation failed","cluster":"default/cb-example","error":"graceful failover failed: unexpected counter change","stacktrace":"github.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).runReconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:498\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).Update\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:535\ngithub.com/couchbase/couchbase-operator/pkg/controller.(*CouchbaseClusterReconciler).Reconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/controller/controller.go:90\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}
```
- Second attempt at graceful failover of cb-example-0000 succeeded.
```json
{"level":"info","ts":"2024-05-23T06:07:11Z","logger":"cluster","msg":"Upgrading pods with DeltaRecovery","cluster":"default/cb-example","names":["cb-example-0000"],"target-version":"7.6.1"}
{"level":"info","ts":"2024-05-23T06:21:53Z","logger":"kubernetes","msg":"Creating pod","cluster":"default/cb-example","name":"cb-example-0000","image":"couchbase/server:7.6.1"}
{"level":"info","ts":"2024-05-23T06:23:03Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"-{v2.ClusterStatus}.Conditions[1->?]:{Type:Balanced Status:True LastUpdateTime:2024-05-23T06:07:09Z LastTransitionTime:2024-05-23T06:07:09Z Reason:Balanced Message:Data is equally distributed across all nodes in the cluster};+{v2.ClusterStatus}.Conditions[?->1]:{Type:Balanced Status:False LastUpdateTime:2024-05-23T06:23:03Z LastTransitionTime:2024-05-23T06:23:03Z Reason:Unbalanced Message:The operator is attempting to rebalance the data to correct this issue}"}
```
cb-example-0002
- First attempt at graceful failover of cb-example-0002 failed.
{"level":"info","ts":"2024-05-23T06:39:33Z","logger":"cluster","msg":"cb-example-0002"}
{"level":"info","ts":"2024-05-23T06:39:33Z","logger":"cluster","msg":"Upgrading pods with DeltaRecovery","cluster":"default/cb-example","names":["cb-example-0002"],"target-version":"7.6.1"}
{"level":"error","ts":"2024-05-23T06:50:21Z","logger":"cluster","msg":"Reconciliation failed","cluster":"default/cb-example","error":"graceful failover failed: unexpected counter change","stacktrace":"github.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).runReconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:498\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).Update\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:535\ngithub.com/couchbase/couchbase-operator/pkg/controller.(*CouchbaseClusterReconciler).Reconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/controller/controller.go:90\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}
- Second attempt at graceful failover of cb-example-0002 succeeded.
{"level":"info","ts":"2024-05-23T07:15:05Z","logger":"cluster","msg":"cb-example-0002"}
{"level":"info","ts":"2024-05-23T07:15:05Z","logger":"cluster","msg":"Upgrading pods with DeltaRecovery","cluster":"default/cb-example","names":["cb-example-0002"],"target-version":"7.6.1"}
{"level":"info","ts":"2024-05-23T07:27:02Z","logger":"kubernetes","msg":"Creating pod","cluster":"default/cb-example","name":"cb-example-0002","image":"couchbase/server:7.6.1"}
{"level":"info","ts":"2024-05-23T07:30:26Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"-{v2.ClusterStatus}.Conditions[1->?]:{Type:Balanced Status:True LastUpdateTime:2024-05-23T07:15:02Z LastTransitionTime:2024-05-23T07:15:02Z Reason:Balanced Message:Data is equally distributed across all nodes in the cluster};+{v2.ClusterStatus}.Conditions[?->1]:{Type:Balanced Status:False LastUpdateTime:2024-05-23T07:30:26Z LastTransitionTime:2024-05-23T07:30:26Z Reason:Unbalanced Message:The operator is attempting to rebalance the data to correct this issue}"}
cb-example-0003
- First attempt at graceful failover of cb-example-0003 failed.
{"level":"info","ts":"2024-05-23T07:46:16Z","logger":"cluster","msg":"cb-example-0003"}
{"level":"info","ts":"2024-05-23T07:46:16Z","logger":"cluster","msg":"Upgrading pods with DeltaRecovery","cluster":"default/cb-example","names":["cb-example-0003"],"target-version":"7.6.1"}
{"level":"error","ts":"2024-05-23T07:50:05Z","logger":"cluster","msg":"Reconciliation failed","cluster":"default/cb-example","error":"graceful failover failed: unexpected counter change","stacktrace":"github.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).runReconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:498\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).Update\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:535\ngithub.com/couchbase/couchbase-operator/pkg/controller.(*CouchbaseClusterReconciler).Reconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/controller/controller.go:90\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}
- Second attempt at graceful failover of cb-example-0003 failed.
{"level":"info","ts":"2024-05-23T08:00:57Z","logger":"cluster","msg":"cb-example-0003"}
{"level":"info","ts":"2024-05-23T08:00:57Z","logger":"cluster","msg":"Upgrading pods with DeltaRecovery","cluster":"default/cb-example","names":["cb-example-0003"],"target-version":"7.6.1"}
{"level":"error","ts":"2024-05-23T08:01:24Z","logger":"cluster","msg":"Reconciliation failed","cluster":"default/cb-example","error":"graceful failover failed: unexpected counter change","stacktrace":"github.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).runReconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:498\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).Update\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:535\ngithub.com/couchbase/couchbase-operator/pkg/controller.(*CouchbaseClusterReconciler).Reconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/controller/controller.go:90\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}
- Third attempt at graceful failover of cb-example-0003 failed.
{"level":"info","ts":"2024-05-23T08:12:11Z","logger":"cluster","msg":"cb-example-0003"}
{"level":"info","ts":"2024-05-23T08:12:11Z","logger":"cluster","msg":"Upgrading pods with DeltaRecovery","cluster":"default/cb-example","names":["cb-example-0003"],"target-version":"7.6.1"}
{"level":"error","ts":"2024-05-23T08:12:38Z","logger":"cluster","msg":"Reconciliation failed","cluster":"default/cb-example","error":"graceful failover failed: unexpected counter change","stacktrace":"github.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).runReconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:498\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).Update\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:535\ngithub.com/couchbase/couchbase-operator/pkg/controller.(*CouchbaseClusterReconciler).Reconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/controller/controller.go:90\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}
- Fourth attempt at graceful failover of cb-example-0003 succeeded.
{"level":"info","ts":"2024-05-23T08:23:21Z","logger":"cluster","msg":"cb-example-0003"}
{"level":"info","ts":"2024-05-23T08:23:21Z","logger":"cluster","msg":"Upgrading pods with DeltaRecovery","cluster":"default/cb-example","names":["cb-example-0003"],"target-version":"7.6.1"}
{"level":"info","ts":"2024-05-23T08:23:50Z","logger":"kubernetes","msg":"Creating pod","cluster":"default/cb-example","name":"cb-example-0003","image":"couchbase/server:7.6.1"}
{"level":"info","ts":"2024-05-23T08:27:31Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"-{v2.ClusterStatus}.Conditions[1->?]:{Type:Balanced Status:True LastUpdateTime:2024-05-23T08:23:19Z LastTransitionTime:2024-05-23T08:23:19Z Reason:Balanced Message:Data is equally distributed across all nodes in the cluster};+{v2.ClusterStatus}.Conditions[?->1]:{Type:Balanced Status:False LastUpdateTime:2024-05-23T08:27:31Z LastTransitionTime:2024-05-23T08:27:31Z Reason:Unbalanced Message:The operator is attempting to rebalance the data to correct this issue}"}
Screen Shots:-
Multiple graceful failover attempts for 0001 and 0002.
Multiple graceful failover attempts for 0003.
Operator logs-
cbopinfo-20240523T140321+0530.tar.gz
CB logs -
https://cb-engineering.s3.amazonaws.com/K8S-3492-multiple-failovers/collectinfo-2024-05-23T083914-ns_1%40cb-example-0000.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/K8S-3492-multiple-failovers/collectinfo-2024-05-23T083914-ns_1%40cb-example-0001.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/K8S-3492-multiple-failovers/collectinfo-2024-05-23T083914-ns_1%40cb-example-0002.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/K8S-3492-multiple-failovers/collectinfo-2024-05-23T083914-ns_1%40cb-example-0003.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/K8S-3492-multiple-failovers/collectinfo-2024-05-23T083914-ns_1%40cb-example-0004.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/K8S-3492-multiple-failovers/collectinfo-2024-05-23T083914-ns_1%40cb-example-0005.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/K8S-3492-multiple-failovers/collectinfo-2024-05-23T083914-ns_1%40cb-example-0006.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/K8S-3492-multiple-failovers/collectinfo-2024-05-23T083914-ns_1%40cb-example-0007.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/K8S-3492-multiple-failovers/collectinfo-2024-05-23T083914-ns_1%40cb-example-0008.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/K8S-3492-multiple-failovers/collectinfo-2024-05-23T083914-ns_1%40cb-example-0009.cb-example.default.svc.zip