Details
- Type: Bug
- Resolution: Unresolved
- Priority: Major
- Affects Version: 7.2.5
- Triage: Untriaged
- Operating System: Linux x86_64
- Story Points: 0
- Is this a Regression?: Unknown
Description
The upgrade process is described here:
https://issues.couchbase.com/browse/K8S-3547?focusedId=783386&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-783386
- Kubernetes Version: 1.28 (EKS)
- Couchbase Server: 7.2.5 → 7.6.1
- Operator: 2.7.0-197
Cluster Setup
- Each node is an m5.4xlarge instance (16 vCPUs, 64 GB RAM).
- 6 Data Service nodes and 4 combined Index+Query Service nodes.
- 10 buckets (1 replica each), full eviction, auto-failover timeout set to 5s.
- 85-120 GB of data per bucket → ~1 TB loaded onto the cluster before the upgrade begins.
- 50 primary indexes with 1 replica each (100 indexes total; index storage: Plasma).
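For reference, a minimal sketch of a CouchbaseCluster resource matching this topology. The resource name follows the cb-example-00NN pod names in the logs; the secret name is illustrative, and the 10 bucket definitions (normally separate CouchbaseBucket resources) are omitted:

apiVersion: couchbase.com/v2
kind: CouchbaseCluster
metadata:
  name: cb-example                # matches the cb-example-00NN pods in the logs
spec:
  image: couchbase/server:7.2.5   # pre-upgrade version
  cluster:
    autoFailoverTimeout: 5s       # the 5s auto-failover described above
    indexStorageSetting: plasma   # Plasma index storage, as above
  security:
    adminSecret: cb-example-auth  # illustrative secret name
  buckets:
    managed: true                 # the 10 buckets would be CouchbaseBucket resources
  servers:
  - name: data
    size: 6                       # 6 Data Service nodes
    services:
    - data
  - name: index-query
    size: 4                       # 4 combined Index+Query Service nodes
    services:
    - index
    - query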
Upgrade Process
- Delta-recovery upgrade to move Couchbase Server from 7.2.5 to 7.6.1 (sketched below).
- Continuous query and data workload on the buckets throughout the upgrade.
- Around 60% CPU load on all servers during the upgrade.
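The upgrade itself is declarative: bumping spec.image causes the Operator to roll the pods one at a time, and per the linked K8S-3547 comment the recovery method here is delta recovery rather than a swap rebalance. A sketch of the change, where the upgradeProcess value is an assumption based on that ticket and should be checked against the Operator 2.7.0 docs:

spec:
  image: couchbase/server:7.6.1   # bumped from couchbase/server:7.2.5
  upgradeProcess: DeltaRecovery   # assumption: value taken from the K8S-3547 work; verify against Operator docs

With this in place, each pod is failed over, restarted on the new image, and added back via delta recovery followed by a rebalance, which is the sequence visible for cb-example-0009 in the logs below.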
Logs
Pre Delta Recovery Upgrade - 7.2.5
Couchbase Server Logs
http://supportal.couchbase.com/snapshot/e544f65397cb5e33922cc1a1ee4556f7::0
http://supportal.couchbase.com/snapshot/e544f65397cb5e33922cc1a1ee4556f7::1
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0000.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0001.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0002.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0003.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0004.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0005.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0006.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0007.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0008.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0009.cb-example.default.svc.zip
Post Delta Recovery Upgrade - 7.6.1
Couchbase Server Logs
http://supportal.couchbase.com/snapshot/e544f65397cb5e33922cc1a1ee4556f7::1
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0000.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0001.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0002.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0003.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0004.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0005.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0006.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0007.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0008.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0009.cb-example.default.svc.zip
Issue
- 9 of 10 nodes upgrade successfully.
- While cb-example-0009 is being delta-recovered, the rebalance fails.
- Node cb-example-0006 then hits an issue and is auto-failed over (details below).
Bug / Issue
The Couchbase nodes up to cb-example-0008 were upgraded smoothly. While cb-example-0009 was being delta-recovered, the rebalance exited:
2024-06-30T00:51:54.872Z, ns_orchestrator:0:critical:message(ns_1@cb-example-0006.cb-example.default.svc) - Rebalance exited with reason {service_rebalance_failed,index,
  {agent_died,<0.6832.0>,
    {linked_process_died,<0.4633.2>,
      {'ns_1@cb-example-0006.cb-example.default.svc',
        {timeout,
          {gen_server,call,
            [<0.7894.0>,
              {call,"ServiceAPI.StartTopologyChange",
                #Fun<json_rpc_connection.0.36915653>,
                #{timeout => 60000}},
              60000]}}}}}}.
Rebalance Operation Id = 5aa3fd06f8ea6f80a6bdfbca4d3db2f9
Next, cb-example-0006 was auto-failed over; per the auto_failover message below, the connection to the service on that node had been lost, and with auto-failover configured at 5s the node was failed over within seconds:
2024-06-30T00:52:02.526Z, failover:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Starting failing over ['ns_1@cb-example-0006.cb-example.default.svc']
2024-06-30T00:52:02.526Z, ns_orchestrator:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Starting failover of nodes ['ns_1@cb-example-0006.cb-example.default.svc'] AllowUnsafe = false Operation Id = d984cf6fcc2169763a839c158c57e544
2024-06-30T00:52:02.758Z, failover:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Failed over ['ns_1@cb-example-0006.cb-example.default.svc']: ok
2024-06-30T00:52:02.774Z, failover:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Deactivating failed over nodes ['ns_1@cb-example-0006.cb-example.default.svc']
2024-06-30T00:52:03.031Z, ns_orchestrator:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Failover completed successfully.
Rebalance Operation Id = d984cf6fcc2169763a839c158c57e544
2024-06-30T00:52:03.090Z, auto_failover:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Node ('ns_1@cb-example-0006.cb-example.default.svc') was automatically failed over. Reason: Connection to the service is lost
A rebalance was started after the failover:
2024-06-30T00:52:10.262Z, ns_orchestrator:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Starting rebalance, KeepNodes = ['ns_1@cb-example-0000.cb-example.default.svc',
  'ns_1@cb-example-0001.cb-example.default.svc',
  'ns_1@cb-example-0002.cb-example.default.svc',
  'ns_1@cb-example-0003.cb-example.default.svc',
  'ns_1@cb-example-0004.cb-example.default.svc',
  'ns_1@cb-example-0005.cb-example.default.svc',
  'ns_1@cb-example-0006.cb-example.default.svc',
  'ns_1@cb-example-0007.cb-example.default.svc',
  'ns_1@cb-example-0008.cb-example.default.svc',
  'ns_1@cb-example-0009.cb-example.default.svc'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = bcc78b093a939e4bb3352ac168dc6fba
This rebalance also failed and exited:
2024-06-30T00:52:15.873Z, ns_orchestrator:0:critical:message(ns_1@cb-example-0006.cb-example.default.svc) - Rebalance exited with reason {service_rebalance_failed,index,
  {worker_died,
    {'EXIT',<0.16166.2>,
      {{badmatch,
        {error,
          {bad_nodes,index,prepare_rebalance,
            [{'ns_1@cb-example-0006.cb-example.default.svc',
              {error,
                {unknown_error,
                  <<"indexer rebalance failure - cleanup pending from previous failed/aborted rebalance/failover/move index. please retry the request later.">>}}}]}}},
        [{service_manager,rebalance_op,5,
          [{file,"src/service_manager.erl"},{line,338}]},
         {service_manager,do_run_op,1,
          [{file,"src/service_manager.erl"},{line,257}]},
         {proc_lib,init_p,3,
          [{file,"proc_lib.erl"},{line,225}]}]}}}}.
Rebalance Operation Id = bcc78b093a939e4bb3352ac168dc6fba
After this, another rebalance was run and completed successfully, consistent with the indexer error above ("cleanup pending ... please retry the request later"): once the pending cleanup from the failed rebalance finished, a retry could proceed.