Description
Kubernetes Version | 1.25
Couchbase Server | 7.2.5 → 7.6.1
Operator | 2.6.4-119
Cluster Setup
- Each node is an m5.4xlarge instance. (16 vCPUs and 64GB RAM)
- 6 Data Service nodes and 4 Index + Query Service nodes.
- 10 Buckets (with 1 replica), Full Eviction and Auto-failover set to 5s.
- ~2TiB data loaded onto the cluster before the upgrade began.
- 50 Primary Indexes with 1 Replica each. (Total 100 Indexes)
Experiment :-
- Restart the EC2 node hosting a Data pod during the upgrade.
- Restart the EC2 node hosting a Query + Index pod during the upgrade.
Observation:-
- The graceful failover exited prematurely for the data pod because the node went down. The node later came back with the new Couchbase version and was added back through delta recovery.
- After the node hosting pod cb-example-0007 (Query + Index) restarted, the pod was never re-added to the cluster.
- The cluster is running at reduced capacity, as one Index + Query pod is effectively in an unknown state.
Follow ups:-
- Why was the query node not added back to the cluster?
- From my analysis, cb-0007 was upgraded to version 7.6.1, but the operator is stuck on creating the pod for cb-0007, and the cluster status still shows as 'upgrading'. How can we resolve this kind of issue?
CB logs - http://supportal.couchbase.com/snapshot/04bbaa961d67e0c9b6cd01145ce0be72::0
Operator logs -
cbopinfo-20240529T200123+0530.tar.gz
Analysis :-
CB server's perspective:-
Failover initiated on 0007 for the upgrade:
Starting failing over ['ns_1@cb-example-0007.cb-example.default.svc'] | failover 000 | ns_1@cb-example-0006.cb-example.default.svc | 10:45:08 AM 29 May, 2024
Failover completed successfully:
Failover completed successfully. Rebalance Operation Id = 77809f491a566aa32c488a62fabae607 | ns_orchestrator 000 | ns_1@cb-example-0006.cb-example.default.svc | 10:45:09 AM 29 May, 2024
0007 came up with 7.6.1 and then tried to get added back:
Couchbase Server has started on web port 8091 on node 'ns_1@cb-example-0007.cb-example.default.svc'. Version: "7.6.1-3200-enterprise". | menelaus_web_sup 001 | ns_1@cb-example-0007.cb-example.default.svc | 10:45:20 AM 29 May, 2024
The node hosting pod 0007 then restarted. The operator tried to spawn a new pod, which never reached Running status.
Operator's perspective:-
- The operator attempts delta recovery of 0007.
- It is unable to perform a graceful failover and reverts to hard failover.
- It gets stuck on creating the pod.
{"level":"info","ts":"2024-05-29T10:45:08Z","logger":"cluster","msg":"Upgrading pods with DeltaRecovery","cluster":"default/cb-example","names":["cb-example-0007"],"target-version":"7.6.1"}
{"level":"info","ts":"2024-05-29T10:45:08Z","logger":"cluster","msg":"Unable to perform graceful failover on node. Reverting to hard failover.","cluster":"default/cb-example","name":"cb-example-0007"}
{"level":"info","ts":"2024-05-29T10:45:09Z","logger":"kubernetes","msg":"Creating pod","cluster":"default/cb-example","name":"cb-example-0007","image":"couchbase/server:7.6.1"}
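Since the operator log entries are JSON objects, one per line, the sequence for a single pod can be pulled out of the full log with a few lines of Python. The sketch below is only an illustration: the helper name `events_for_pod` is hypothetical, and it assumes that a pod is referenced through either the `name` or the `names` field, as in the three entries quoted above.

```python
import json

# The three operator log entries from this ticket, one complete JSON object per line.
RAW_LINES = [
    '{"level":"info","ts":"2024-05-29T10:45:08Z","logger":"cluster","msg":"Upgrading pods with DeltaRecovery","cluster":"default/cb-example","names":["cb-example-0007"],"target-version":"7.6.1"}',
    '{"level":"info","ts":"2024-05-29T10:45:08Z","logger":"cluster","msg":"Unable to perform graceful failover on node. Reverting to hard failover.","cluster":"default/cb-example","name":"cb-example-0007"}',
    '{"level":"info","ts":"2024-05-29T10:45:09Z","logger":"kubernetes","msg":"Creating pod","cluster":"default/cb-example","name":"cb-example-0007","image":"couchbase/server:7.6.1"}',
]


def events_for_pod(lines, pod):
    """Return (ts, logger, msg) for every entry mentioning `pod`, in log order."""
    found = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a complete JSON object
        mentioned = list(entry.get("names", [])) + [entry.get("name")]
        if pod in mentioned:
            found.append((entry["ts"], entry["logger"], entry["msg"]))
    return found


for ts, logger, msg in events_for_pod(RAW_LINES, "cb-example-0007"):
    print(ts, logger, msg)
```

For this excerpt the last entry returned for cb-example-0007 is "Creating pod", which matches the point where the operator appears to be stuck.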
The operator status still shows the cluster at 7.2.5, although it has been upgraded to 7.6.1:
- lastTransitionTime: "2024-05-29T09:33:03Z"
  lastUpdateTime: "2024-05-29T09:33:03Z"
  message: Cluster upgrading (progress 9/10)
  reason: Upgrading
  status: "True"
  type: Upgrading
currentVersion: 7.2.5
members:
  ready:
  - cb-example-0000
  - cb-example-0001
  - cb-example-0002
  - cb-example-0003
  - cb-example-0004
  - cb-example-0005
  - cb-example-0006
  - cb-example-0007
  - cb-example-0008
  - cb-example-0009
size: 10
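The mismatch in the status excerpt can be summarised programmatically. The following is a rough sketch, not an operator API: the dictionary layout is a simplification of the excerpt above, the function name `upgrade_summary` is hypothetical, and the reference time is assumed to be the timestamp of the last operator log entry quoted in this ticket.

```python
import re
from datetime import datetime, timezone

# Values copied from the status excerpt; the dict layout is simplified for illustration.
STATUS = {
    "condition": {
        "type": "Upgrading",
        "status": "True",
        "message": "Cluster upgrading (progress 9/10)",
        "lastTransitionTime": "2024-05-29T09:33:03Z",
    },
    "currentVersion": "7.2.5",
    "ready": [f"cb-example-{i:04d}" for i in range(10)],
    "size": 10,
}
TARGET_VERSION = "7.6.1"  # "target-version" in the operator log


def parse_utc(ts):
    """Parse a timestamp such as 2024-05-29T09:33:03Z as an aware UTC datetime."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def upgrade_summary(status, target_version, now):
    """Summarise how far the reported status lags behind the target version."""
    cond = status["condition"]
    done, total = map(int, re.search(r"progress (\d+)/(\d+)", cond["message"]).groups())
    elapsed = now - parse_utc(cond["lastTransitionTime"])
    return {
        "pods_remaining": total - done,
        "version_lag": status["currentVersion"] != target_version,
        "minutes_in_upgrading": int(elapsed.total_seconds() // 60),
        "all_listed_ready": len(status["ready"]) == status["size"],
    }


# Assumed reference time: the "Creating pod" log entry.
print(upgrade_summary(STATUS, TARGET_VERSION, parse_utc("2024-05-29T10:45:09Z")))
```

With the values from this ticket the summary shows one pod still pending and the reported version lagging the target, while all 10 members, including cb-example-0007, are listed as ready.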
SS all pods
SS Cluster
0007 details - 0007.json