Description
Upgrade Process Description provided here:
https://issues.couchbase.com/browse/K8S-3547?focusedId=783386&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-783386
Kubernetes Version | 1.28 |
Couchbase Server | 7.2.5 → 7.6.1 |
Operator | 2.7.0-197 |
Cluster Setup
- Each node is an m5.4xlarge instance. (16 vCPUs and 64GB RAM)
- 6 Data Service, 4 Index Service & Query Service Nodes.
- 10 Buckets (with 1 replica), Full Eviction and Auto-failover set to 5s.
- 85-120GB data per bucket → ~1TB data loaded onto cluster before beginning of upgrade.
- 50 Primary Indexes with 1 Replica each. (Total 100 Indexes with Index Storage: Plasma)
Upgrade Process
- DeltaRecovery Upgrade to update Couchbase Server from 7.2.5 to 7.6.1.
- Continuous query and data workload on the buckets during the update process.
- Around 60% CPU load on all servers during the upgrade.
Issue
All the nodes got updated, after which there was a rebalance failure.
- All 10 nodes get upgraded
- Some issue happens with node cb-example-0006
- Cluster state is seen as unbalanced
- Rebalance is initiated, which fails. This is retried.
- Rebalance succeeds yet we see an error message "ErrorEncountered Message:failed to rebalance"
Logs
Pre Delta Recovery Upgrade - 7.2.5
Couchbase Server Logs
http://supportal.couchbase.com/snapshot/e544f65397cb5e33922cc1a1ee4556f7::0
http://supportal.couchbase.com/snapshot/e544f65397cb5e33922cc1a1ee4556f7::1
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0000.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0001.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0002.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0003.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0004.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0005.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0006.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0007.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0008.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0009.cb-example.default.svc.zip
Post Delta Recovery Upgrade - 7.6.1
CAO Collect Logs:
Couchbase Server Logs
http://supportal.couchbase.com/snapshot/e544f65397cb5e33922cc1a1ee4556f7::1
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0000.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0001.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0002.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0003.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0004.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0005.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0006.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0007.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0008.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0009.cb-example.default.svc.zip
Bug / Issue
After all 10 nodes get updated
{"level":"info","ts":"2024-06-30T00:51:56Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"{v2.ClusterStatus}.CurrentVersion:7.2.5->7.6.1"} |
{"level":"info","ts":"2024-06-30T00:51:59Z","logger":"cluster","msg":"Cluster status","cluster":"default/cb-example","balance":"balanced","rebalancing":false} |
Then we see this in the logs
{"level":"info","ts":"2024-06-30T00:52:07Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"-{v2.ClusterStatus}.Members.Ready[6->?]:cb-example-0006;+{v2.ClusterStatus}.Members.Unready:[cb-example-0006]"} |
{"level":"info","ts":"2024-06-30T00:52:07Z","logger":"cluster","msg":"Cluster status","cluster":"default/cb-example","balance":"unbalanced","rebalancing":false} |
Next, CAO logs the state as unbalanced and starts a rebalance. This rebalance fails.
{"level":"info","ts":"2024-06-30T00:52:07Z","logger":"cluster","msg":"Setting recovery type","cluster":"default/cb-example","name":"cb-example-0006","type":"full"} {"level":"info","ts":"2024-06-30T00:52:09Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"-{v2.ClusterStatus}.Conditions[1->?]:{Type:Balanced Status:True LastUpdateTime:2024-06-30T00:51:55Z LastTransitionTime:2024-06-30T00:51:55Z Reason:Balanced Message:Data is equally distributed across all nodes in the cluster};+{v2.ClusterStatus}.Conditions[?->1]:{Type:Balanced Status:False LastUpdateTime:2024-06-30T00:52:09Z LastTransitionTime:2024-06-30T00:52:09Z Reason:Unbalanced Message:The operator is attempting to rebalance the data to correct this issue}"} {"level":"info","ts":"2024-06-30T00:52:10Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":0} {"level":"info","ts":"2024-06-30T00:52:14Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":36} {"level":"info","ts":"2024-06-30T00:52:28Z","logger":"cluster","msg":"Reconciliation failed","cluster":"default/cb-example","error":"failed to rebalance: timeout: unexpected rebalance error","stack":"github.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).verifyRebalance.func1\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/rebalance.go:299\ngithub.com/couchbase/couchbase-operator/pkg/util/retryutil.Retry\n\tgithub.com/couchbase/couchbase-operator/pkg/util/retryutil/retryutil.go:14\ngithub.com/couchbase/couchbase-operator/pkg/util/retryutil.RetryFor\n\tgithub.com/couchbase/couchbase-operator/pkg/util/retryutil/retryutil.go:30\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).verifyRebalance\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/rebalance.go:324\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).rebalanceWithRetriesOnVerifyFails\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/rebalance.go:258\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).rebalance\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/rebalance.go:185\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*ReconcileMachine).handleRebalance\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/nodereconcile.go:1631\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*ReconcileMachine).exec\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/nodereconcile.go:321\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).reconcileMembers\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/reconcile.go:264\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).reconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/reconcile.go:173\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).runReconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:511\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).Update\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:558\ngithub.com/couchbase/couchbase-operator/pkg/controller.(*CouchbaseClusterReconciler).Reconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/controller/controller.go:90\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}
|
The rebalance is initiated again, this rebalance completed successfully yet the error message came that `ErrorEncountered Message:failed to rebalance: timeout: unexpected rebalance error`
{"level":"info","ts":"2024-06-30T00:52:28Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"+{v2.ClusterStatus}.Conditions[?->2]:{Type:Error Status:True LastUpdateTime:2024-06-30T00:52:28Z LastTransitionTime:2024-06-30T00:52:28Z Reason:ErrorEncountered Message:failed to rebalance: timeout: unexpected rebalance error}"} |
{"level":"info","ts":"2024-06-30T00:52:28Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"+{v2.ClusterStatus}.Members.Ready[?->6]:cb-example-0006;-{v2.ClusterStatus}.Members.Unready:[cb-example-0006]"} |
{"level":"info","ts":"2024-06-30T00:52:31Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":0} |
{"level":"info","ts":"2024-06-30T00:52:35Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":18} |
{"level":"info","ts":"2024-06-30T00:52:39Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":48} |
{"level":"info","ts":"2024-06-30T00:52:43Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":61.99999999999999} |
...
|
{"level":"info","ts":"2024-06-30T01:06:56Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":80} |
{"level":"info","ts":"2024-06-30T01:07:00Z","logger":"cluster","msg":"Rebalance completed successfully","cluster":"default/cb-example"} |
{"level":"info","ts":"2024-06-30T01:07:00Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"-{v2.ClusterStatus}.Conditions[1->?]:{Type:Balanced Status:False LastUpdateTime:2024-06-30T00:52:09Z LastTransitionTime:2024-06-30T00:52:09Z Reason:Unbalanced Message:The operator is attempting to rebalance the data to correct this issue};+{v2.ClusterStatus}.Conditions[?->1]:{Type:Balanced Status:True LastUpdateTime:2024-06-30T01:07:00Z LastTransitionTime:2024-06-30T01:07:00Z Reason:Balanced Message:Data is equally distributed across all nodes in the cluster}"} |
{"level":"info","ts":"2024-06-30T01:07:01Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"-{v2.ClusterStatus}.Conditions[2->?]:{Type:Error Status:True LastUpdateTime:2024-06-30T00:52:28Z LastTransitionTime:2024-06-30T00:52:28Z Reason:ErrorEncountered Message:failed to rebalance: timeout: unexpected rebalance error};+{v2.ClusterStatus}.Autoscalers:[]"} |
{"level":"info","ts":"2024-06-30T01:07:03Z","logger":"cluster","msg":"Cluster status","cluster":"default/cb-example","balance":"balanced","rebalancing":false} |