Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: 2.6.4
Affects Version/s: None
Component/s: operator
Labels:
None

Sprint:
11 - Race to Crashpoint Tower
Story Points:
2

Description

Kubernetes Version	1.25
Couchbase Server	7.2.5 → 7.6.1
Operator	2.6.4-119

Cluster Setup

Each node is an m5.4xlarge instance (16 vCPUs and 64 GB RAM).
6 Data Service nodes and 4 Index + Query Service nodes.
10 Buckets (with 1 replica), Full Eviction and Auto-failover set to 5s.
~2TiB data loaded onto cluster before the beginning of upgrade.
50 Primary Indexes with 1 Replica each. (Total 100 Indexes)

Experiment :-

Restart the EC2 node hosting a Data pod during the upgrade.
Restart the EC2 node hosting a Query+Index pod during the upgrade.

Observation:-

The graceful failover exited prematurely for the data pod because the node went down. The node was later returned with the new Couchbase version and added back through delta recovery.
After the node hosting pod cb-example-0007 (Query + Index) restarted, the pod was never re-added to the cluster.
The cluster is running at reduced capacity, as one Index+Query pod is effectively in an unknown state.

Follow ups:-

Why was the Query node not added back to the cluster?
From my analysis, cb-0007 was upgraded to version 7.6.1, but the operator is stuck on creating the pod for cb-0007, and the cluster status still shows as 'upgrading'. How can we resolve these kinds of issues?

CB logs - http://supportal.couchbase.com/snapshot/04bbaa961d67e0c9b6cd01145ce0be72::0

Operator logs -

cbopinfo-20240529T200123+0530.tar.gz

Analysis :-

CB server's perspective:-

failover initiated on 0007 for upgrade

Starting failing over ['ns_1@cb-example-0007.cb-example.default.svc']    failover 000    ns_1@cb-example-0006.cb-example.default.svc    10:45:08 AM 29 May, 2024

failover completed successfully

Failover completed successfully.

Rebalance Operation Id = 77809f491a566aa32c488a62fabae607    ns_orchestrator 000    ns_1@cb-example-0006.cb-example.default.svc    10:45:09 AM 29 May, 2024

0007 came up with 7.6.1 and then tried to get added back.

Couchbase Server has started on web port 8091 on node 'ns_1@cb-example-0007.cb-example.default.svc'. Version: "7.6.1-3200-enterprise".    menelaus_web_sup 001    ns_1@cb-example-0007.cb-example.default.svc    10:45:20 AM 29 May, 2024

The node hosting pod 0007 restarts; the operator tries to spawn a new pod, which never reaches Running status.

Operator's perspective:-

Operator is trying to do 0007 delta Recovery
Unable to perform Graceful failover
Stuck on creating pod

{"level":"info","ts":"2024-05-29T10:45:08Z","logger":"cluster","msg":"Upgrading pods with DeltaRecovery","cluster":"default/cb-example","names":["cb-example-0007"],"target-version":"7.6.1"}

{"level":"info","ts":"2024-05-29T10:45:08Z","logger":"cluster","msg":"Unable to perform graceful failover on node. Reverting to hard failover.","cluster":"default/cb-example","name":"cb-example-0007"}

{"level":"info","ts":"2024-05-29T10:45:09Z","logger":"kubernetes","msg":"Creating pod","cluster":"default/cb-example","name":"cb-example-0007","image":"couchbase/server:7.6.1"}
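The "stuck on creating pod" symptom can be spotted mechanically in the operator log. The following is a minimal sketch, not part of the operator or cbopinfo: the function names and the 10-minute threshold are hypothetical, and it assumes one JSON object per line with second-resolution `ts` values as in the excerpt above. It reports pods whose most recent log entry is still "Creating pod" after the threshold has passed.

```python
import json
from datetime import datetime, timedelta


def parse_ts(ts):
    # Assumes timestamps shaped like "2024-05-29T10:45:09Z", as in the excerpt.
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def find_stuck_pod_creations(lines, now, threshold=timedelta(minutes=10)):
    """Return pods whose latest log entry is an old 'Creating pod' message."""
    last_event = {}
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(entry, dict) or "ts" not in entry:
            continue
        # Entries refer to pods through either "name" or "names".
        names = entry.get("names") or ([entry["name"]] if "name" in entry else [])
        for name in names:
            last_event[name] = (entry.get("msg"), parse_ts(entry["ts"]))
    return sorted(
        name
        for name, (msg, ts) in last_event.items()
        if msg == "Creating pod" and now - ts > threshold
    )


# The three operator log lines quoted in this report.
SAMPLE = [
    '{"level":"info","ts":"2024-05-29T10:45:08Z","logger":"cluster","msg":"Upgrading pods with DeltaRecovery","cluster":"default/cb-example","names":["cb-example-0007"],"target-version":"7.6.1"}',
    '{"level":"info","ts":"2024-05-29T10:45:08Z","logger":"cluster","msg":"Unable to perform graceful failover on node. Reverting to hard failover.","cluster":"default/cb-example","name":"cb-example-0007"}',
    '{"level":"info","ts":"2024-05-29T10:45:09Z","logger":"kubernetes","msg":"Creating pod","cluster":"default/cb-example","name":"cb-example-0007","image":"couchbase/server:7.6.1"}',
]

# "now" is an arbitrary point after the excerpt, chosen for illustration.
stuck = find_stuck_pod_creations(SAMPLE, now=datetime(2024, 5, 29, 11, 0, 0))
print(stuck)
```

A later entry for the same pod resets the check, so this only flags pods where the log goes quiet after "Creating pod".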

The operator still shows the cluster at 7.2.5, even though it got upgraded to 7.6.1.

- lastTransitionTime: "2024-05-29T09:33:03Z"
  lastUpdateTime: "2024-05-29T09:33:03Z"
  message: Cluster upgrading (progress 9/10)
  reason: Upgrading
  status: "True"
  type: Upgrading
currentVersion: 7.2.5
members:
  ready:
  - cb-example-0000
  - cb-example-0001
  - cb-example-0002
  - cb-example-0003
  - cb-example-0004
  - cb-example-0005
  - cb-example-0006
  - cb-example-0007
  - cb-example-0008
  - cb-example-0009
size: 10
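The status excerpt above can be summarized programmatically to make the mismatch visible: the Upgrading condition is still true at 9/10 while all 10 members are listed as ready. The following is a minimal sketch under stated assumptions: the status is already loaded into a Python dict, the condition list sits under a `conditions` key (the excerpt does not show the parent key), and the function names are hypothetical.

```python
import re


def upgrade_progress(message):
    """Extract (done, total) from a message like 'Cluster upgrading (progress 9/10)'."""
    match = re.search(r"progress (\d+)/(\d+)", message)
    return (int(match.group(1)), int(match.group(2))) if match else None


def summarize_upgrade(status):
    """Collect the fields relevant to a possibly stalled upgrade."""
    condition = next(
        (
            c
            for c in status.get("conditions", [])  # "conditions" key is assumed
            if c.get("type") == "Upgrading" and c.get("status") == "True"
        ),
        None,
    )
    progress = upgrade_progress(condition.get("message", "")) if condition else None
    ready = status.get("members", {}).get("ready", [])
    return {
        "upgrading": condition is not None,
        "progress": progress,
        "current_version": status.get("currentVersion"),
        "all_members_ready": len(ready) == status.get("size"),
    }


# Values copied from the status excerpt in this report.
STATUS = {
    "conditions": [
        {
            "lastTransitionTime": "2024-05-29T09:33:03Z",
            "lastUpdateTime": "2024-05-29T09:33:03Z",
            "message": "Cluster upgrading (progress 9/10)",
            "reason": "Upgrading",
            "status": "True",
            "type": "Upgrading",
        }
    ],
    "currentVersion": "7.2.5",
    "members": {"ready": ["cb-example-%04d" % i for i in range(10)]},
    "size": 10,
}

summary = summarize_upgrade(STATUS)
print(summary)
```

An incomplete upgrade with every member ready can also occur briefly between pods during a healthy rolling upgrade, so this combination is only a signal worth checking when it persists.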

Screenshot: all pods

Screenshot: cluster

0007 details - 0007.json

Attachments

0007_updated.json	22 kB	29/May/24 7:58 PM
0007.json	22 kB	29/May/24 9:05 AM
cbopinfo-20240529T200123+0530.tar.gz	2.43 MB	29/May/24 7:41 AM
cbopinfo-20240530T081737+0530.tar.gz	2.24 MB	29/May/24 7:54 PM
Screenshot 2024-05-29 at 8.35.27 PM.png	961 kB	29/May/24 8:07 AM
Screenshot 2024-05-29 at 8.45.54 PM.png	546 kB	29/May/24 8:16 AM

Issue Links

is triggered by

K8S-3485 1.25+7.25 -> 1.25+7.6.1 (delta recovery + node restart)

Resolved

People

Assignee: Justin Ashworth

Reporter: Usamah Jassat

Votes: 0

Watchers: 4

Dates

Created: 29/May/24 7:42 AM

Updated: 26/Jun/24 11:55 PM

Resolved: 03/Jun/24 2:11 AM

Gerrit Reviews

K8S-3515: Use context for create pod: Gerrit Review:
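The change itself is not shown here; its title only says that pod creation now uses a context. As a general illustration of that idea (bounding a wait with a deadline so the caller regains control instead of blocking forever), here is a minimal sketch. It is not the operator's code: `wait_for_pod_running`, `PodCreateTimeout`, and the `get_phase` callback are all hypothetical names.

```python
import time


class PodCreateTimeout(Exception):
    """Raised when the pod does not reach Running before the deadline."""


def wait_for_pod_running(get_phase, timeout_s, poll_interval_s=0.01):
    """Poll get_phase() until it returns "Running" or the deadline passes.

    get_phase is a hypothetical callback standing in for a Kubernetes API
    lookup of the pod's phase.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if get_phase() == "Running":
            return
        if time.monotonic() >= deadline:
            raise PodCreateTimeout("pod not Running after %.2fs" % timeout_s)
        time.sleep(poll_interval_s)


# A pod that becomes Running on the third poll: the wait returns normally.
phases = iter(["Pending", "Pending", "Running"])
wait_for_pod_running(lambda: next(phases), timeout_s=1.0)

# A pod that never leaves Pending, as happened to cb-example-0007 when its
# node restarted: the wait ends with an error instead of hanging.
try:
    wait_for_pod_running(lambda: "Pending", timeout_s=0.05)
    timed_out = False
except PodCreateTimeout:
    timed_out = True
print(timed_out)
```

With a bounded wait, the caller can handle the failure (retry, recreate, or report) on its next pass rather than staying stuck on a single pod.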

[operator 2.6.4-119] Query+Index pod is not added back upon restart of EC2 machine during delta recovery upgrade
