Description
Couchbase Cluster Description
- Set up the cluster per the following specifications:
- Each node is an m5.4xlarge instance (16 vCPUs, 64 GB RAM).
- 6 Data Service nodes and 4 Index + Query Service nodes.
- 10 buckets (1 replica each), full eviction, auto-failover timeout set to 5s.
- ~210 GB of data per bucket → ~2 TB of data loaded onto the cluster.
- 50 primary indexes with 1 replica each (100 indexes in total).
- DeltaRecovery upgrade of Couchbase Server from 7.2.5 to 7.6.1.
- Continuous data and query workload on all buckets during the upgrade.
Current:-
- Node 0001 is being gracefully failed over in a loop.
- Node 0001 is still on 7.2.5 even after 3-4 graceful failovers.
Expected:-
- Node 0001 should be upgraded to 7.6.1 after a successful graceful failover and rebalance.
- The whole cluster should end up on 7.6.1 (a sketch for confirming per-node versions follows below).
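A minimal Go sketch of how the per-node server versions can be confirmed while the upgrade is stuck, by reading the standard /pools/default REST endpoint on the admin port; the localhost:8091 address and Administrator/password credentials are placeholders for whatever access the cluster actually exposes, and poolsDefault only models the fields read here.

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
)

// Minimal view of the /pools/default response; only the fields we read.
type poolsDefault struct {
    Nodes []struct {
        Hostname string `json:"hostname"`
        Version  string `json:"version"`
        Status   string `json:"status"`
    } `json:"nodes"`
}

func main() {
    // Placeholder endpoint/credentials; in-cluster this would be the
    // cb-example admin port (8091) reached however access is set up.
    req, err := http.NewRequest("GET", "http://localhost:8091/pools/default", nil)
    if err != nil {
        panic(err)
    }
    req.SetBasicAuth("Administrator", "password")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    var pd poolsDefault
    if err := json.NewDecoder(resp.Body).Decode(&pd); err != nil {
        panic(err)
    }

    // Print each node's reported version; cb-example-0001 staying on
    // 7.2.5 here is what "no progress in upgrade" looks like.
    for _, n := range pd.Nodes {
        fmt.Printf("%s  version=%s  status=%s\n", n.Hostname, n.Version, n.Status)
    }
}

Running this periodically during the upgrade makes it obvious whether cb-example-0001 ever moves off 7.2.5.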
Follow-ups:-
1. Is there a retry limit after which the operator stops the upgrade and informs the user? (A hypothetical bounded-retry guard is sketched below, after the failover counts.)
#0001 graceful failover : 4
#0001 graceful failover : 5
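On follow-up 1: nothing in the excerpts below suggests such a limit exists today, so here is a purely hypothetical Go sketch of a per-pod retry budget; maxFailoverAttempts, upgradeTracker, and keepRetrying are invented names for illustration and are not part of the operator's code base.

package main

import "fmt"

// Hypothetical sketch only: a per-pod retry budget for graceful-failover
// driven upgrades. None of these names come from the operator code base.
const maxFailoverAttempts = 3

type upgradeTracker struct {
    attempts map[string]int // pod name -> graceful failover attempts so far
}

// keepRetrying records one more attempt for the pod and reports whether
// the budget still allows another graceful failover.
func (t *upgradeTracker) keepRetrying(pod string) bool {
    if t.attempts == nil {
        t.attempts = map[string]int{}
    }
    t.attempts[pod]++
    return t.attempts[pod] <= maxFailoverAttempts
}

func main() {
    t := &upgradeTracker{}
    // Simulate the loop seen on cb-example-0001: once the budget is spent,
    // the operator would surface an error instead of retrying forever.
    for i := 1; i <= 5; i++ {
        if !t.keepRetrying("cb-example-0001") {
            fmt.Printf("attempt %d: budget exhausted, stop upgrade and report to user\n", i)
            continue
        }
        fmt.Printf("attempt %d: retry graceful failover\n", i)
    }
}

Once the budget is spent, the reconciler could set an Error condition (like the ErrorEncountered condition in the log excerpt below) and stop requeueing, rather than failing the node over again.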
CB Logs Pre upgrade - http://supportal.couchbase.com/snapshot/eb81ca9e079b0c90bf92068cb8ca76b3::0
CB Logs During Upgrade (stuck):
supportal link - http://supportal.couchbase.com/snapshot/eb81ca9e079b0c90bf92068cb8ca76b3::1
https://cb-engineering.s3.amazonaws.com/K8S-3492-no-progress-in-upgrade/collectinfo-2024-05-20T100447-ns_1%40cb-example-0000.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/K8S-3492-no-progress-in-upgrade/collectinfo-2024-05-20T100447-ns_1%40cb-example-0001.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/K8S-3492-no-progress-in-upgrade/collectinfo-2024-05-20T100447-ns_1%40cb-example-0002.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/K8S-3492-no-progress-in-upgrade/collectinfo-2024-05-20T100447-ns_1%40cb-example-0003.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/K8S-3492-no-progress-in-upgrade/collectinfo-2024-05-20T100447-ns_1%40cb-example-0004.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/K8S-3492-no-progress-in-upgrade/collectinfo-2024-05-20T100447-ns_1%40cb-example-0005.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/K8S-3492-no-progress-in-upgrade/collectinfo-2024-05-20T100447-ns_1%40cb-example-0006.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/K8S-3492-no-progress-in-upgrade/collectinfo-2024-05-20T100447-ns_1%40cb-example-0007.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/K8S-3492-no-progress-in-upgrade/collectinfo-2024-05-20T100447-ns_1%40cb-example-0008.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/K8S-3492-no-progress-in-upgrade/collectinfo-2024-05-20T100447-ns_1%40cb-example-0009.cb-example.default.svc.zip
Operator logs (excerpt):-
{"level":"info","ts":"2024-05-20T09:28:13Z","logger":"cluster","msg":"Upgrading pods with DeltaRecovery","cluster":"default/cb-example","names":["cb-example-0001"],"target-version":"7.6.1"}
...
{"level":"error","ts":"2024-05-20T09:33:14Z","logger":"cluster","msg":"Reconciliation failed","cluster":"default/cb-example","error":"timeout: task is currently running","stacktrace":"github.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).runReconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:498\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).Update\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:535\ngithub.com/couchbase/couchbase-operator/pkg/controller.(*CouchbaseClusterReconciler).Reconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/controller/controller.go:90\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}
...
{"level":"info","ts":"2024-05-20T09:33:14Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"+{v2.ClusterStatus}.Conditions[?->3]:{Type:Error Status:True LastUpdateTime:2024-05-20T09:33:14Z LastTransitionTime:2024-05-20T09:33:14Z Reason:ErrorEncountered Message:timeout: task is currently running}"}
Operator logs (full): cbopinfo-20240520T151320+0530.tar.gz (attached)
Issue Links
- is caused by:
  - K8S-3492 1.25+7.25 -> 1.25+7.6.1 (delta recovery) - Resolved
- relates to:
  - K8S-3472 Operator needs more robust way to detect outcome of graceful failover - Resolved
  - K8S-3494 [Operator 2.6.4] Required a break point to stop infinite rebalances which can lead to potential cascading failures. - Resolved
  - K8S-3496 Operator should query K8S API to see if a pod is terminating before attempting to add back - Resolved
  - K8S-3497 Delta Recovery should retry once before falling back to full recovery - Resolved
  - K8S-3499 Remove timeout on waiting on GracefulFailover to complete - Resolved