Description
Swap Rebalance Upgrade
Kubernetes Version | 1.25 |
Couchbase Server | 7.2.5 (Pre Upgrade) → 7.6.1 (Post Upgrade) |
Operator | 2.6.4-121 |
Cluster Setup
- Each node is an m5.4xlarge instance. (16 vCPUs and 64GB RAM)
- 6 Data Service, 4 Index Service & Query Service Nodes.
- 10 Buckets (with 1 replica), Full Eviction and Auto-failover set to 5s.
- 115GB data per bucket → ~1.15TB data loaded onto cluster before beginning of upgrade.
- 50 Primary Indexes with 1 Replica each. (Total 100 Indexes with Index Storage: Plasma)
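As a quick sanity check on the figures above, the following sketch recomputes the data and index totals. It assumes that the 115GB per bucket refers to active data only and that vBuckets are spread evenly across the data nodes; neither assumption is stated in the setup, so the per-node figure is only indicative.

```python
# Figures taken from the cluster setup listed above.
BUCKETS = 10
DATA_PER_BUCKET_GB = 115   # assumed to be active data only
BUCKET_REPLICAS = 1
DATA_NODES = 6
PRIMARY_INDEXES = 50
INDEX_REPLICAS = 1

active_gb = BUCKETS * DATA_PER_BUCKET_GB                 # active data across all buckets
total_gb = active_gb * (1 + BUCKET_REPLICAS)             # active plus replica copies
per_node_gb = total_gb / DATA_NODES                      # assumes an even vBucket spread
total_indexes = PRIMARY_INDEXES * (1 + INDEX_REPLICAS)   # primaries plus index replicas

print(f"active data: {active_gb} GB (~{active_gb / 1000:.2f} TB)")
print(f"active + replica: {total_gb} GB, about {per_node_gb:.0f} GB per data node")
print(f"index instances: {total_indexes}")
```

Under these assumptions each data node holds far more data than its 64GB of RAM, which is consistent with the buckets using Full Eviction.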
Upgrade Process
- SwapRebalance Upgrade to update Couchbase Server from 7.2.5 to 7.6.1.
- Continuous query and data workload on the buckets during the update process.
- Around 35-50% CPU load on all servers during the upgrade.
- Node Restarts
Two data service nodes, cb-example-0000 and cb-example-0001, had already been swap rebalance upgraded and replaced by cb-example-0010 and cb-example-0011.
The third data service node, cb-example-0002, was being upgraded and ejected while cb-example-0012 was being added to the cluster. A rebalance was initiated.
During the rebalance, I restarted the physical K8s node hosting cb-example-0012. This caused the rebalance to fail, as expected.
{"level":"info","ts":"2024-05-31T14:09:10Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"+{v2.ClusterStatus}.Conditions[?->5]:{Type:Error Status:True LastUpdateTime:2024-05-31T14:09:10Z LastTransitionTime:2024-05-31T14:09:10Z Reason:ErrorEncountered Message:failed to rebalance: timeout: unexpected rebalance error}"}
{"level":"info","ts":"2024-05-31T14:09:10Z","logger":"cluster","msg":"Cluster status","cluster":"default/cb-example","balance":"unbalanced","rebalancing":false}
Next, before cb-example-0012 had rejoined the cluster, another rebalance was initiated, which eventually failed.
{"level":"info","ts":"2024-05-31T14:10:12Z","logger":"cluster","msg":"reconciler","clustered":["cb-example-0003","cb-example-0004","cb-example-0005","cb-example-0006","cb-example-0007","cb-example-0008","cb-example-0009","cb-example-0010","cb-example-0011","cb-example-0012"],"running":["cb-example-0003","cb-example-0004","cb-example-0005","cb-example-0006","cb-example-0007","cb-example-0008","cb-example-0009","cb-example-0010","cb-example-0011","cb-example-0012"],"eject":["cb-example-0002"],"unclustered":[],"rebalance":true}
{"level":"info","ts":"2024-05-31T14:10:12Z","logger":"cluster","msg":"External address collection failed","cluster":"default/cb-example","name":"cb-example-0012"}
{"level":"info","ts":"2024-05-31T14:10:15Z","logger":"cluster","msg":"Pod add-back failed, forcing full recovery","cluster":"default/cb-example"}
{"level":"info","ts":"2024-05-31T14:10:15Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"{v2.ClusterStatus}.Conditions[5].LastUpdateTime:2024-05-31T14:09:10Z->2024-05-31T14:10:15Z;{v2.ClusterStatus}.Conditions[5].Message:failed to rebalance: timeout: unexpected rebalance error->reconcile was blocked from running: rebalance failed, forcing full recovery"}
{"level":"info","ts":"2024-05-31T14:11:18Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"{v2.ClusterStatus}.Conditions[5].LastUpdateTime:2024-05-31T14:11:00Z->2024-05-31T14:11:18Z;{v2.ClusterStatus}.Conditions[5].Message:failed to rebalance: timeout: unexpected rebalance error->reconcile was blocked from running: recovering pending add node cb-example-0012"}
Once cb-example-0012 came back, another rebalance started and completed successfully.
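The operator log entries quoted above are one JSON object per line, so the sequence of events can be reconstructed mechanically. The sketch below is one possible way to do that; the `timeline` helper and the `SAMPLE` list are illustrative names, and the sample contains three of the log lines quoted in this ticket rather than the full operator log.

```python
import json

# Keys present on every operator log entry quoted above; anything else is event detail.
COMMON_KEYS = {"level", "ts", "logger", "msg", "cluster"}

def timeline(lines):
    """Return (timestamp, message, details) tuples sorted by timestamp."""
    events = []
    for raw in lines:
        # Tolerate stray table pipes and blank lines left over from copy/paste.
        text = raw.strip().rstrip("|").strip()
        if not text.startswith("{"):
            continue
        try:
            entry = json.loads(text)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        details = {k: v for k, v in entry.items() if k not in COMMON_KEYS}
        events.append((entry.get("ts", ""), entry.get("msg", ""), details))
    # ISO 8601 timestamps in the same format sort correctly as plain strings.
    return sorted(events, key=lambda e: e[0])

SAMPLE = [
    '{"level":"info","ts":"2024-05-31T14:09:10Z","logger":"cluster","msg":"Cluster status","cluster":"default/cb-example","balance":"unbalanced","rebalancing":false}',
    '{"level":"info","ts":"2024-05-31T14:10:12Z","logger":"cluster","msg":"External address collection failed","cluster":"default/cb-example","name":"cb-example-0012"}',
    '{"level":"info","ts":"2024-05-31T14:10:15Z","logger":"cluster","msg":"Pod add-back failed, forcing full recovery","cluster":"default/cb-example"}',
]

events = timeline(SAMPLE)
for ts, msg, details in events:
    print(ts, msg, details)
```

Feeding the complete operator log through the same function should make it easier to line up the rebalance failures with the restart of the K8s node.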
The full task description and logs are in the comments of task K8S-3487.
Logs:
CAO Collect: 2024-05-31A_DuringUpgrade_AfterKVRestart_cbopinfo-20240531T212108+0530.tar.gz
Supportal: http://supportal.couchbase.com/snapshot/7476142a9b4960384458d1bb417decbe::1
https://cb-engineering.s3.amazonaws.com/2024-05-31A_K8s1.25_CB7.2.5_DurSwapRebUpgrade/collectinfo-2024-05-31T155058-ns_1%40cb-example-0003.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-05-31A_K8s1.25_CB7.2.5_DurSwapRebUpgrade/collectinfo-2024-05-31T155058-ns_1%40cb-example-0004.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-05-31A_K8s1.25_CB7.2.5_DurSwapRebUpgrade/collectinfo-2024-05-31T155058-ns_1%40cb-example-0005.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-05-31A_K8s1.25_CB7.2.5_DurSwapRebUpgrade/collectinfo-2024-05-31T155058-ns_1%40cb-example-0006.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-05-31A_K8s1.25_CB7.2.5_DurSwapRebUpgrade/collectinfo-2024-05-31T155058-ns_1%40cb-example-0007.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-05-31A_K8s1.25_CB7.2.5_DurSwapRebUpgrade/collectinfo-2024-05-31T155058-ns_1%40cb-example-0008.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-05-31A_K8s1.25_CB7.2.5_DurSwapRebUpgrade/collectinfo-2024-05-31T155058-ns_1%40cb-example-0009.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-05-31A_K8s1.25_CB7.2.5_DurSwapRebUpgrade/collectinfo-2024-05-31T155058-ns_1%40cb-example-0010.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-05-31A_K8s1.25_CB7.2.5_DurSwapRebUpgrade/collectinfo-2024-05-31T155058-ns_1%40cb-example-0011.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-05-31A_K8s1.25_CB7.2.5_DurSwapRebUpgrade/collectinfo-2024-05-31T155058-ns_1%40cb-example-0012.cb-example.default.svc.zip
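The collectinfo links above differ only in the pod index (0003 to 0012), so the list can be regenerated rather than copied by hand. This is a minimal sketch; `collectinfo_url` is a hypothetical helper, and the base path and timestamp are copied from the URLs listed above.

```python
from urllib.parse import quote

# Base path and collection timestamp as they appear in the links above.
BASE = "https://cb-engineering.s3.amazonaws.com/2024-05-31A_K8s1.25_CB7.2.5_DurSwapRebUpgrade"
STAMP = "2024-05-31T155058"

def collectinfo_url(pod_index):
    """Build the collectinfo archive URL for one pod of the cb-example cluster."""
    node = f"ns_1@cb-example-{pod_index:04d}.cb-example.default.svc"
    # quote() percent-encodes the '@' in the node name as %40.
    return f"{BASE}/collectinfo-{STAMP}-{quote(node, safe='')}.zip"

# Pods 0003 through 0012 are the ones with archives listed above.
urls = [collectinfo_url(i) for i in range(3, 13)]
for url in urls:
    print(url)
```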
Gerrit Reviews
For Gerrit Dashboard: K8S-3528
# | Subject | Branch | Project | Status | CR | V |
---|---|---|---|---|---|---|
210855,1 | K8S-3528: wait for inactiveAdded terminating pods | 2.6.x | couchbase-operator | ABANDONED | 0 | +1 |