Details

Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: Morpheus
Affects Version/s: 7.2.5
Component/s: secondary-index
Labels:
- CAO
- cao

Triage:
Untriaged
Operating System:
Linux x86_64
Story Points:
0
Is this a Regression?:
Unknown

Description

Upgrade Process Description provided here:
https://issues.couchbase.com/browse/K8S-3547?focusedId=783386&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-783386

Kubernetes Version	1.28 EKS
Couchbase Server	7.2.5 → 7.6.1
Operator	2.7.0-197

Cluster Setup

Each node is an m5.4xlarge instance. (16 vCPUs and 64GB RAM)
6 Data Service nodes and 4 Index Service & Query Service nodes.
10 Buckets (with 1 replica), Full Eviction and Auto-failover set to 5s.
85-120GB data per bucket → ~1TB data loaded onto cluster before beginning of upgrade.
50 Primary Indexes with 1 Replica each. (Total 100 Indexes with Index Storage: Plasma)

Upgrade Process

DeltaRecovery Upgrade to update Couchbase Server from 7.2.5 to 7.6.1.
Continuous query and data workload on the buckets during the update process.
Around 60% CPU load on all servers during the upgrade.

Logs

Pre Delta Recovery Upgrade - 7.2.5

Couchbase Server Logs

http://supportal.couchbase.com/snapshot/e544f65397cb5e33922cc1a1ee4556f7::0

http://supportal.couchbase.com/snapshot/e544f65397cb5e33922cc1a1ee4556f7::1

Post Delta Recovery Upgrade - 7.6.1

Couchbase Server Logs

http://supportal.couchbase.com/snapshot/e544f65397cb5e33922cc1a1ee4556f7::1

Issue

9 of the 10 nodes were upgraded successfully.
The rebalance failed during the delta recovery of cb-example-0009.
The failure involves cb-example-0006: its index service timed out on StartTopologyChange, and the node was then auto-failed over.

Bug / Issue

The Couchbase nodes up to and including cb-example-0008 were upgraded without issues.

While cb-example-0009 was undergoing delta recovery, the rebalance exited:

2024-06-30T00:51:54.872Z, ns_orchestrator:0:critical:message(ns_1@cb-example-0006.cb-example.default.svc) - Rebalance exited with reason {service_rebalance_failed,index,
                              {agent_died,<0.6832.0>,
                               {linked_process_died,<0.4633.2>,
                                {'ns_1@cb-example-0006.cb-example.default.svc',
                                 {timeout,
                                  {gen_server,call,
                                   [<0.7894.0>,
                                    {call,"ServiceAPI.StartTopologyChange",
                                     #Fun<json_rpc_connection.0.36915653>,
                                     #{timeout => 60000}},
                                    60000]}}}}}}.

Rebalance Operation Id = 5aa3fd06f8ea6f80a6bdfbca4d3db2f9

Next, cb-example-0006 was auto-failed over:

2024-06-30T00:52:02.526Z, failover:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Starting failing over ['ns_1@cb-example-0006.cb-example.default.svc']

2024-06-30T00:52:02.526Z, ns_orchestrator:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Starting failover of nodes ['ns_1@cb-example-0006.cb-example.default.svc'] AllowUnsafe = false Operation Id = d984cf6fcc2169763a839c158c57e544

2024-06-30T00:52:02.758Z, failover:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Failed over ['ns_1@cb-example-0006.cb-example.default.svc']: ok

2024-06-30T00:52:02.774Z, failover:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Deactivating failed over nodes ['ns_1@cb-example-0006.cb-example.default.svc']

2024-06-30T00:52:03.031Z, ns_orchestrator:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Failover completed successfully.

Failover Operation Id = d984cf6fcc2169763a839c158c57e544

2024-06-30T00:52:03.090Z, auto_failover:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Node ('ns_1@cb-example-0006.cb-example.default.svc') was automatically failed over. Reason: Connection to the service is lost
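To make the sequence easier to read, the gaps between the events quoted above can be computed directly from the log timestamps. This is only a sketch for reading the timeline: the event labels are my own shorthand, and the timestamps are copied from the log lines in this report. It shows roughly 7.7s between the rebalance exit and the start of the auto-failover, which is consistent with (but does not prove) the 5s auto-failover timeout having elapsed after the service stopped responding.

```python
from datetime import datetime

# Timestamps copied from the log lines quoted in this report; labels are shorthand.
EVENTS = [
    ("rebalance exited (StartTopologyChange timeout)", "2024-06-30T00:51:54.872Z"),
    ("auto-failover of cb-example-0006 started", "2024-06-30T00:52:02.526Z"),
    ("failover completed", "2024-06-30T00:52:03.031Z"),
]


def parse_ts(ts: str) -> datetime:
    # The logs use ISO 8601 with millisecond precision and a trailing "Z" (UTC).
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")


def gaps_seconds(events):
    """Return (label, seconds since the previous event) for every event after the first."""
    times = [(label, parse_ts(ts)) for label, ts in events]
    return [
        (label, round((t - times[i][1]).total_seconds(), 3))
        for i, (label, t) in enumerate(times[1:])
    ]


for label, gap in gaps_seconds(EVENTS):
    print(f"+{gap:.3f}s  {label}")
```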

A rebalance was started after the failover.

2024-06-30T00:52:10.262Z, ns_orchestrator:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Starting rebalance, KeepNodes = ['ns_1@cb-example-0000.cb-example.default.svc',
                                 'ns_1@cb-example-0001.cb-example.default.svc',
                                 'ns_1@cb-example-0002.cb-example.default.svc',
                                 'ns_1@cb-example-0003.cb-example.default.svc',
                                 'ns_1@cb-example-0004.cb-example.default.svc',
                                 'ns_1@cb-example-0005.cb-example.default.svc',
                                 'ns_1@cb-example-0006.cb-example.default.svc',
                                 'ns_1@cb-example-0007.cb-example.default.svc',
                                 'ns_1@cb-example-0008.cb-example.default.svc',
                                 'ns_1@cb-example-0009.cb-example.default.svc'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = bcc78b093a939e4bb3352ac168dc6fba

This rebalance failed and exited.

2024-06-30T00:52:15.873Z, ns_orchestrator:0:critical:message(ns_1@cb-example-0006.cb-example.default.svc) - Rebalance exited with reason {service_rebalance_failed,index,
                              {worker_died,
                               {'EXIT',<0.16166.2>,
                                {{badmatch,
                                  {error,
                                   {bad_nodes,index,prepare_rebalance,
                                    [{'ns_1@cb-example-0006.cb-example.default.svc',
                                      {error,
                                       {unknown_error,
                                        <<"indexer rebalance failure - cleanup pending from previous  failed/aborted rebalance/failover/move index. please retry the request later.">>}}}]}}},
                                 [{service_manager,rebalance_op,5,
                                   [{file,"src/service_manager.erl"},
                                    {line,338}]},
                                  {service_manager,do_run_op,1,
                                   [{file,"src/service_manager.erl"},
                                    {line,257}]},
                                  {proc_lib,init_p,3,
                                   [{file,"proc_lib.erl"},{line,225}]}]}}}}.

Rebalance Operation Id = bcc78b093a939e4bb3352ac168dc6fba
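The indexer error above asks the caller to retry later, so a client driving rebalances might treat this specific failure as transient. The following is a minimal sketch of that idea only, not how the Operator or ns_server handles it: `start_rebalance` is a hypothetical callable supplied by the caller, and the attempt count and delay are illustrative assumptions.

```python
import time

# Substring taken from the indexer error message quoted above.
CLEANUP_PENDING = "cleanup pending from previous"


def rebalance_with_retry(start_rebalance, attempts=5, delay_s=10.0, sleep=time.sleep):
    """Call start_rebalance() until it succeeds or the attempts run out.

    start_rebalance is a hypothetical callable: it returns normally on success
    and raises RuntimeError carrying the failure reason otherwise.
    Returns the number of the attempt that succeeded.
    """
    for attempt in range(1, attempts + 1):
        try:
            start_rebalance()
            return attempt
        except RuntimeError as err:
            # Retry only the "cleanup pending" failure; re-raise anything else,
            # and re-raise on the final attempt.
            if CLEANUP_PENDING not in str(err) or attempt == attempts:
                raise
            sleep(delay_s)
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real waiting; any other failure reason is surfaced immediately rather than retried.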

After this, another rebalance was started and completed successfully.


Index Rebalance Failures during DeltaRecovery upgrade 7.2.5 to 7.6.1
