Couchbase Kubernetes / K8S-3527

[operator-122] Multiple Rebalance failures on Index+query during delta recovery upgrade.


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: 2.8.0
    • Affects Version/s: 2.6.4
    • Component/s: operator
    • Labels: None
    • Sprint: 19 - A Rock and a Hard Place
    • Story Points: 1

    Description

       

      Kubernetes Version: 1.25
      Couchbase Server:   7.2.5 → 7.6.1
      Upgrade Process:    DeltaRecovery
      Operator:           2.6.4-122

      Cluster Setup:-

      1. Each node is an m5.4xlarge instance. (16 vCPUs and 64GB RAM)
      2. 6 Data Service, 4 Index Service & Query Service Nodes.
      3. 10 Buckets (with 1 replica), Full Eviction and Auto-failover set to 5s.
      4. ~2TiB data loaded onto cluster before the beginning of upgrade.
      5. 50 Primary Indexes with 1 Replica each. (Total 100 Indexes)
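
      For reference, a minimal sketch of how a cluster and upgrade like this might be declared. This is an assumption-laden illustration: the CRD field names below (in particular upgradeProcess and autoFailoverTimeout) are recalled from the couchbase.com/v2 schema and may not match operator 2.6.4-122 exactly, and the bucket and index definitions are omitted.

      import yaml  # PyYAML

      # Hypothetical CouchbaseCluster manifest approximating the setup above.
      cluster = {
          "apiVersion": "couchbase.com/v2",
          "kind": "CouchbaseCluster",
          "metadata": {"name": "cb-example", "namespace": "default"},
          "spec": {
              "image": "couchbase/server:7.2.5",          # bumped to 7.6.1 to trigger the upgrade
              "upgradeProcess": "DeltaRecovery",          # assumed field name for the delta-recovery path
              "cluster": {"autoFailoverTimeout": "5s"},   # matches the 5s auto-failover above
              "buckets": {"managed": True},               # the 10 buckets are defined as separate resources
              "servers": [
                  {"name": "data", "size": 6, "services": ["data"]},
                  {"name": "index-query", "size": 4, "services": ["index", "query"]},
              ],
          },
      }

      print(yaml.safe_dump(cluster, sort_keys=False))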

      Observation:-

      1. Multiple "Rebalance exited" failures reported on nodes 0006 and 0009.

      Result:-

      1. The CB cluster was successfully upgraded from 7.2.5 to 7.6.1.

       

      Rebalance exited with reason {{badmatch,failed}, [{ns_rebalancer,rebalance_body,7, [{file,"src/ns_rebalancer.erl"}, {line,500}]}, {async,'-async_init/4-fun-1-',3, [{file,"src/async.erl"},{line,199}]}]}. Rebalance Operation Id = d11d06f80dbcb3932093d4187135162f (ns_orchestrator 000, ns_1@cb-example-0006.cb-example.default.svc, 6:27:56 AM 3 Jun, 2024)

      Rebalance exited with reason {{badmatch,failed}, [{ns_rebalancer,rebalance_body,7, [{file,"src/ns_rebalancer.erl"}, {line,500}]}, {async,'-async_init/4-fun-1-',3, [{file,"src/async.erl"},{line,199}]}]}. Rebalance Operation Id = ab5f43c47572c4d6e9ae65b267677f0a (ns_orchestrator 000, ns_1@cb-example-0006.cb-example.default.svc, 6:26:56 AM 3 Jun, 2024)

      Rebalance exited with reason {{badmatch,failed}, [{ns_rebalancer,rebalance_body,7, [{file,"src/ns_rebalancer.erl"}, {line,500}]}, {async,'-async_init/4-fun-1-',3, [{file,"src/async.erl"},{line,199}]}]}. Rebalance Operation Id = f2044a766208623faac2833cbaec1487 (ns_orchestrator 000, ns_1@cb-example-0006.cb-example.default.svc, 6:25:56 AM 3 Jun, 2024)

      Rebalance exited with reason {{badmatch,failed}, [{ns_rebalancer,rebalance_body,7, [{file,"src/ns_rebalancer.erl"}, {line,500}]}, {async,'-async_init/4-fun-1-',3, [{file,"src/async.erl"},{line,199}]}]}. Rebalance Operation Id = 9972220fb1f2980822c81804e8055638 (ns_orchestrator 000, ns_1@cb-example-0006.cb-example.default.svc, 6:24:56 AM 3 Jun, 2024)
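
      The same events can also be pulled programmatically from the cluster's standard /logs REST endpoint rather than copied from the UI. A rough sketch (the node address and credentials are placeholders, and response field names may vary slightly by server version):

      import requests

      BASE = "http://cb-example-0006.cb-example.default.svc:8091"  # placeholder node address
      AUTH = ("Administrator", "password")                         # placeholder credentials

      resp = requests.get(f"{BASE}/logs", auth=AUTH, timeout=30)
      resp.raise_for_status()

      # Print every "Rebalance exited" event with its node and server-side timestamp.
      for entry in resp.json().get("list", []):
          if "Rebalance exited" in entry.get("text", ""):
              print(entry.get("serverTime"), entry.get("node"), entry.get("text"))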

      CB logs - http://supportal.couchbase.com/snapshot/51994b67a1f45b0e99c2b08649ba8032::0

       

      Operator logs - cbopinfo-20240603T125514+0530.tar.gz

      Analysis:-

      The rebalance for the delta recovery of 0008 failed at 2024-06-03T06:13:56.089Z:

      2024-06-03T06:12:50.509Z, ns_orchestrator:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Starting rebalance, KeepNodes = ['ns_1@cb-example-0000.cb-example.default.svc', 'ns_1@cb-example-0001.cb-example.default.svc', 'ns_1@cb-example-0002.cb-example.default.svc', 'ns_1@cb-example-0003.cb-example.default.svc', 'ns_1@cb-example-0004.cb-example.default.svc', 'ns_1@cb-example-0005.cb-example.default.svc', 'ns_1@cb-example-0006.cb-example.default.svc', 'ns_1@cb-example-0007.cb-example.default.svc', 'ns_1@cb-example-0008.cb-example.default.svc', 'ns_1@cb-example-0009.cb-example.default.svc'], EjectNodes = [], Failed over and being ejected nodes = []; Delta recovery nodes = ['ns_1@cb-example-0008.cb-example.default.svc'], Delta recovery buckets = all;; Operation Id = 70958a95d46cb219ff9b4da9b04b0e02

      2024-06-03T06:13:56.089Z, ns_orchestrator:0:critical:message(ns_1@cb-example-0006.cb-example.default.svc) - Rebalance exited with reason {service_rebalance_failed,index, {agent_died,<0.2649.0>, {linked_process_died,<0.16481.1>, {'ns_1@cb-example-0006.cb-example.default.svc', {timeout, {gen_server,call, [<0.3793.0>, {call,"ServiceAPI.StartTopologyChange", #Fun<json_rpc_connection.0.36915653>, #{timeout => 60000}}, 60000]}}}}}}. Rebalance Operation Id = 70958a95d46cb219ff9b4da9b04b0e02

      Immediately after the failed rebalance of 0008, the operator failed over 0009 to continue the upgrade procedure:

      2024-06-03T06:14:02.340Z, failover:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Starting failing over ['ns_1@cb-example-0009.cb-example.default.svc'] 
       
      2024-06-03T06:14:02.340Z, ns_orchestrator:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Starting failover of nodes ['ns_1@cb-example-0009.cb-example.default.svc'] AllowUnsafe = false Operation Id = 95eab6d19bd916c005486555bb1f5382

      Immediately after the failover of 0009 was initiated, auto-failover tried to fail over 0006 itself, and the subsequent rebalances kept failing until the upgrade finished:

      2024-06-03T06:14:02.488Z, auto_failover:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Could not automatically fail over nodes (['ns_1@cb-example-0006.cb-example.default.svc']). Failover is running. 
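
      Auto-failover is configured at 5s in this setup. A quick way to confirm the configured window from the standard /settings/autoFailover endpoint; a sketch only, with placeholder host and credentials:

      import requests

      BASE = "http://cb-example-0006.cb-example.default.svc:8091"  # placeholder node address
      AUTH = ("Administrator", "password")                         # placeholder credentials

      settings = requests.get(f"{BASE}/settings/autoFailover", auth=AUTH, timeout=30).json()
      print("auto-failover enabled:", settings.get("enabled"))
      print("auto-failover timeout (s):", settings.get("timeout"))  # expected to be 5 for this setup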

      Operator logs:-

      The operator started upgrading 0008 at 2024-06-03T06:11:43Z and reported the rebalance as completed at 2024-06-03T06:13:58Z:

       

      {"level":"info","ts":"2024-06-03T06:11:42Z","logger":"cluster","msg":"Upgrading pods with DeltaRecovery","cluster":"default/cb-example","names":["cb-example-0008"],"target-version":"7.6.1"}
       
      {"level":"info","ts":"2024-06-03T06:11:42Z","logger":"cluster","msg":"Unable to perform graceful failover on node. Reverting to hard failover.","cluster":"default/cb-example","name":"cb-example-0008"}
       
      {"level":"info","ts":"2024-06-03T06:11:43Z","logger":"kubernetes","msg":"Creating pod","cluster":"default/cb-example","name":"cb-example-0008","image":"couchbase/server:7.6.1"}
       
      {"level":"info","ts":"2024-06-03T06:12:50Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"-{v2.ClusterStatus}.Conditions[1->?]:{Type:Balanced Status:True LastUpdateTime:2024-06-03T06:11:38Z LastTransitionTime:2024-06-03T06:11:38Z Reason:Balanced Message:Data is equally distributed across all nodes in the cluster};+{v2.ClusterStatus}.Conditions[?->1]:{Type:Balanced Status:False LastUpdateTime:2024-06-03T06:12:50Z LastTransitionTime:2024-06-03T06:12:50Z Reason:Unbalanced Message:The operator is attempting to rebalance the data to correct this issue}"}
      {"level":"info","ts":"2024-06-03T06:12:50Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":0}
      {"level":"info","ts":"2024-06-03T06:12:54Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":42}
      {"level":"info","ts":"2024-06-03T06:12:58Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":60}
      {"level":"info","ts":"2024-06-03T06:13:02Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":60}
      {"level":"info","ts":"2024-06-03T06:13:06Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":60}
      {"level":"info","ts":"2024-06-03T06:13:10Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":60}
      {"level":"info","ts":"2024-06-03T06:13:14Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":60}
      {"level":"info","ts":"2024-06-03T06:13:18Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":60}
      {"level":"info","ts":"2024-06-03T06:13:22Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":60}
      {"level":"info","ts":"2024-06-03T06:13:26Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":60}
      {"level":"info","ts":"2024-06-03T06:13:30Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":60}
      {"level":"info","ts":"2024-06-03T06:13:34Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":60}
      {"level":"info","ts":"2024-06-03T06:13:38Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":60}
      {"level":"info","ts":"2024-06-03T06:13:42Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":60}
      {"level":"info","ts":"2024-06-03T06:13:46Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":60}
      {"level":"info","ts":"2024-06-03T06:13:50Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":60}
      {"level":"info","ts":"2024-06-03T06:13:54Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":60}
      {"level":"info","ts":"2024-06-03T06:13:58Z","logger":"cluster","msg":"Rebalance completed successfully","cluster":"default/cb-example"}
       
      {"level":"info","ts":"2024-06-03T06:13:58Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"-{v2.ClusterStatus}.Conditions[1->?]:{Type:Balanced Status:False LastUpdateTime:2024-06-03T06:12:50Z LastTransitionTime:2024-06-03T06:12:50Z Reason:Unbalanced Message:The operator is attempting to rebalance the data to correct this issue};+{v2.ClusterStatus}.Conditions[?->1]:{Type:Balanced Status:True LastUpdateTime:2024-06-03T06:13:58Z LastTransitionTime:2024-06-03T06:13:58Z Reason:Balanced Message:Data is equally distributed across all nodes in the cluster}"}
       

      The operator initiated the failover of 0009 immediately after the 0008 upgrade, which is the right move from the operator's perspective:

      {"level":"info","ts":"2024-06-03T06:14:02Z","logger":"cluster","msg":"Upgrading pods with DeltaRecovery","cluster":"default/cb-example","names":["cb-example-0009"],"target-version":"7.6.1"}
       
      {"level":"info","ts":"2024-06-03T06:14:02Z","logger":"cluster","msg":"Unable to perform graceful failover on node. Reverting to hard failover.","cluster":"default/cb-example","name":"cb-example-0009"} 

      In the server logs, however, the 0008 rebalance failed and should have been retried before initiating 0009.
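
      A hedged sketch of the kind of cross-check this implies: before failing over the next node, ask the server itself (via the standard /pools/default/tasks endpoint) whether the previous rebalance is still running or ended in an error, rather than relying only on the client-side view. Host and credentials are placeholders, and field names such as errorMessage may vary by server version:

      import requests

      BASE = "http://cb-example-0006.cb-example.default.svc:8091"  # placeholder node address
      AUTH = ("Administrator", "password")                         # placeholder credentials

      tasks = requests.get(f"{BASE}/pools/default/tasks", auth=AUTH, timeout=30).json()
      for task in tasks:
          if task.get("type") == "rebalance":
              # "status" is typically "running" or "notRunning"; a failed run usually
              # also carries an error message that should block the next failover.
              print(task.get("status"), task.get("errorMessage", ""))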

       

      This is the same behaviour as in K8S-3529.

       

      Here, we didn't try any restarts in between the upgrades.

            People

              Assignee: justin.ashworth Justin Ashworth
              Reporter: manik.mahajan Manik Mahajan
              Votes: 0
              Watchers: 3
