  1. Couchbase Server
  2. MB-62586

Index Rebalance Failures during DeltaRecovery upgrade 7.2.5 to 7.6.1

Details

    • Untriaged
    • Linux x86_64
    • 0
    • Unknown

    Description

      The upgrade process is described here:
      https://issues.couchbase.com/browse/K8S-3547?focusedId=783386&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-783386

       

      Kubernetes Version 1.28 EKS
      Couchbase Server 7.2.5 → 7.6.1
      Operator 2.7.0-197

      Cluster Setup

      • Each node is an m5.4xlarge instance (16 vCPUs and 64 GB RAM).
      • 6 Data Service nodes and 4 nodes running the Index and Query Services.
      • 10 buckets (1 replica each), Full Eviction, and auto-failover timeout set to 5 s.
      • 85-120 GB of data per bucket → ~1 TB of data loaded onto the cluster before the upgrade began.
      • 50 primary indexes with 1 replica each (100 indexes in total; index storage: Plasma).
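
      The sizing figures above can be cross-checked with a few lines of arithmetic. This is only a sketch: the constant names are made up for illustration, and the per-node index count assumes an even spread across the 4 index nodes, which the ticket does not state.

```python
# Figures taken from the cluster setup list; names are illustrative only.
BUCKETS = 10
GB_PER_BUCKET = (85, 120)   # low and high end of the per-bucket data size
PRIMARY_INDEXES = 50
INDEX_REPLICAS = 1
INDEX_NODES = 4


def data_range_gb(buckets, per_bucket_gb):
    """Return the (low, high) total data size across all buckets, in GB."""
    low, high = per_bucket_gb
    return buckets * low, buckets * high


def total_index_instances(indexes, replicas):
    """Each index contributes one instance plus one per replica."""
    return indexes * (1 + replicas)


low_gb, high_gb = data_range_gb(BUCKETS, GB_PER_BUCKET)
instances = total_index_instances(PRIMARY_INDEXES, INDEX_REPLICAS)

print(f"total data: {low_gb}-{high_gb} GB")
print(f"index instances: {instances}")
# Assumption: instances are spread evenly over the index nodes.
print(f"instances per index node: {instances / INDEX_NODES}")
```

      The computed range brackets the ~1TB figure quoted above, and the instance count matches the stated total of 100 indexes.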

      Upgrade Process

      • DeltaRecovery Upgrade to update Couchbase Server from 7.2.5 to 7.6.1.
      • Continuous query and data workload on the buckets during the update process.
      • Around 60% CPU load on all servers during the upgrade.

      Logs

      Pre Delta Recovery Upgrade - 7.2.5

      Couchbase Server Logs

      http://supportal.couchbase.com/snapshot/e544f65397cb5e33922cc1a1ee4556f7::0

       

      http://supportal.couchbase.com/snapshot/e544f65397cb5e33922cc1a1ee4556f7::1

      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0000.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0001.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0002.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0003.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0004.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0005.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0006.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0007.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0008.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0009.cb-example.default.svc.zip

       

      Post Delta Recovery Upgrade - 7.6.1

      Couchbase Server Logs

      http://supportal.couchbase.com/snapshot/e544f65397cb5e33922cc1a1ee4556f7::1

       

      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0000.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0001.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0002.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0003.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0004.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0005.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0006.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0007.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0008.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0009.cb-example.default.svc.zip

      Issue

      • 9 of the 10 nodes are upgraded successfully.
      • The rebalance fails during the delta recovery of cb-example-0009.
      • The ServiceAPI.StartTopologyChange call to the index service on cb-example-0006 times out after 60 s, and cb-example-0006 is then auto-failed over.

      Bug / Issue

      The Couchbase nodes up to and including cb-example-0008 were upgraded without problems.

      While cb-example-0009 was undergoing delta recovery, the rebalance exited:

      2024-06-30T00:51:54.872Z, ns_orchestrator:0:critical:message(ns_1@cb-example-0006.cb-example.default.svc) - Rebalance exited with reason {service_rebalance_failed,index,
                                    {agent_died,<0.6832.0>,
                                     {linked_process_died,<0.4633.2>,
                                      {'ns_1@cb-example-0006.cb-example.default.svc',
                                       {timeout,
                                        {gen_server,call,
                                         [<0.7894.0>,
                                          {call,"ServiceAPI.StartTopologyChange",
                                           #Fun<json_rpc_connection.0.36915653>,
                                           #{timeout => 60000}},
                                          60000]}}}}}}.
      Rebalance Operation Id = 5aa3fd06f8ea6f80a6bdfbca4d3db2f9
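
      When triaging entries like the one above, it can help to pull the key fields out programmatically. The following is a minimal sketch, assuming the header layout seen in this ticket (timestamp, module:code:severity, node, then free text); the function and field names are made up for illustration and are not part of any Couchbase tooling.

```python
import re
from datetime import datetime, timezone

# Header layout as it appears in this ticket:
#   <timestamp>, <module>:<code>:<severity>:message(<node>) - <text>
HEADER = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z), "
    r"(?P<module>\w+):(?P<code>\d+):(?P<severity>\w+):"
    r"message\((?P<node>[^)]+)\) - (?P<text>.*)$",
    re.DOTALL,  # the Erlang term after the header spans several lines
)


def parse_event(entry):
    """Split one log entry into header fields plus a few failure details."""
    match = HEADER.match(entry.strip())
    if match is None:
        raise ValueError("entry does not look like an event-log line")
    event = match.groupdict()
    event["ts"] = datetime.strptime(
        event["ts"], "%Y-%m-%dT%H:%M:%S.%fZ"
    ).replace(tzinfo=timezone.utc)
    text = event["text"]
    service = re.search(r"\{service_rebalance_failed,(\w+),", text)
    rpc = re.search(r'\{call,"([^"]+)"', text)
    timeout = re.search(r"#\{timeout => (\d+)\}", text)
    event["failed_service"] = service.group(1) if service else None
    event["rpc"] = rpc.group(1) if rpc else None
    event["timeout_ms"] = int(timeout.group(1)) if timeout else None
    return event


# Excerpt of the entry quoted above (inner lines omitted for brevity).
SAMPLE = """2024-06-30T00:51:54.872Z, ns_orchestrator:0:critical:message(ns_1@cb-example-0006.cb-example.default.svc) - Rebalance exited with reason {service_rebalance_failed,index,
  {call,"ServiceAPI.StartTopologyChange",
   #Fun<json_rpc_connection.0.36915653>,
   #{timeout => 60000}},
  60000]}}}}}}."""

event = parse_event(SAMPLE)
print(event["severity"], event["failed_service"], event["rpc"], event["timeout_ms"])
```

      For this entry the sketch reports the index service as the failed service and a 60000 ms timeout on the StartTopologyChange call.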

      Next, cb-example-0006 was auto-failed over.

      2024-06-30T00:52:02.526Z, failover:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Starting failing over ['ns_1@cb-example-0006.cb-example.default.svc']
      2024-06-30T00:52:02.526Z, ns_orchestrator:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Starting failover of nodes ['ns_1@cb-example-0006.cb-example.default.svc'] AllowUnsafe = false Operation Id = d984cf6fcc2169763a839c158c57e544
      2024-06-30T00:52:02.758Z, failover:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Failed over ['ns_1@cb-example-0006.cb-example.default.svc']: ok
      2024-06-30T00:52:02.774Z, failover:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Deactivating failed over nodes ['ns_1@cb-example-0006.cb-example.default.svc']
      2024-06-30T00:52:03.031Z, ns_orchestrator:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Failover completed successfully.
      Rebalance Operation Id = d984cf6fcc2169763a839c158c57e544
      2024-06-30T00:52:03.090Z, auto_failover:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Node ('ns_1@cb-example-0006.cb-example.default.svc') was automatically failed over. Reason: Connection to the service is lost
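
      The timing between the rebalance failure and the auto-failover can be worked out from the timestamps quoted so far. This is a small sketch using only those timestamps; the event labels are informal names chosen here, and the result shows only how the events are spaced, not what caused them.

```python
from datetime import datetime


def ts(value):
    """Parse a timestamp in the format used by the log lines above."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")


# Timestamps copied from the log excerpts; labels are informal.
events = [
    ("rebalance exited", ts("2024-06-30T00:51:54.872Z")),
    ("failover started", ts("2024-06-30T00:52:02.526Z")),
    ("failover completed", ts("2024-06-30T00:52:03.031Z")),
]

# Gap in seconds between each pair of consecutive events.
gaps = [
    (earlier[0], later[0], (later[1] - earlier[1]).total_seconds())
    for earlier, later in zip(events, events[1:])
]

for start, end, seconds in gaps:
    print(f"{start} -> {end}: {seconds:.3f} s")
```

      The failover starts a little under 8 seconds after the rebalance exits. That spacing is compatible with the 5 s auto-failover timeout configured on this cluster, but it does not by itself establish when the index service on cb-example-0006 became unreachable.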

      A rebalance was started after the failover. 

      2024-06-30T00:52:03.031Z, ns_orchestrator:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Failover completed successfully.
      Rebalance Operation Id = d984cf6fcc2169763a839c158c57e544
      2024-06-30T00:52:03.090Z, auto_failover:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Node ('ns_1@cb-example-0006.cb-example.default.svc') was automatically failed over. Reason: Connection to the service is lost 
      2024-06-30T00:52:10.262Z, ns_orchestrator:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Starting rebalance, KeepNodes = ['ns_1@cb-example-0000.cb-example.default.svc',
                                       'ns_1@cb-example-0001.cb-example.default.svc',
                                       'ns_1@cb-example-0002.cb-example.default.svc',
                                       'ns_1@cb-example-0003.cb-example.default.svc',
                                       'ns_1@cb-example-0004.cb-example.default.svc',
                                       'ns_1@cb-example-0005.cb-example.default.svc',
                                       'ns_1@cb-example-0006.cb-example.default.svc',
                                       'ns_1@cb-example-0007.cb-example.default.svc',
                                       'ns_1@cb-example-0008.cb-example.default.svc',
                                       'ns_1@cb-example-0009.cb-example.default.svc'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = bcc78b093a939e4bb3352ac168dc6fba

      This rebalance failed and exited.

      2024-06-30T00:52:15.873Z, ns_orchestrator:0:critical:message(ns_1@cb-example-0006.cb-example.default.svc) - Rebalance exited with reason {service_rebalance_failed,index,
                                    {worker_died,
                                     {'EXIT',<0.16166.2>,
                                      {{badmatch,
                                        {error,
                                         {bad_nodes,index,prepare_rebalance,
                                          [{'ns_1@cb-example-0006.cb-example.default.svc',
                                            {error,
                                             {unknown_error,
                                              <<"indexer rebalance failure - cleanup pending from previous  failed/aborted rebalance/failover/move index. please retry the request later.">>}}}]}}},
                                       [{service_manager,rebalance_op,5,
                                         [{file,"src/service_manager.erl"},
                                          {line,338}]},
                                        {service_manager,do_run_op,1,
                                         [{file,"src/service_manager.erl"},
                                          {line,257}]},
                                        {proc_lib,init_p,3,
                                         [{file,"proc_lib.erl"},{line,225}]}]}}}}.
      Rebalance Operation Id = bcc78b093a939e4bb3352ac168dc6fba
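
      The indexer error above asks the caller to retry later. A retry loop with backoff along those lines could look like the sketch below. Everything here is hypothetical: `RebalanceError`, `rebalance_with_retry`, and the `start_rebalance` callable stand in for whatever actually triggers the rebalance (in this setup, the Operator), and the delays are arbitrary.

```python
import time

# Substring of the indexer message quoted in the log above.
RETRYABLE_MARKER = "cleanup pending from previous"


class RebalanceError(Exception):
    """Hypothetical error carrying the failure text reported by the cluster."""


def rebalance_with_retry(start_rebalance, attempts=5, initial_delay=10.0,
                         backoff=2.0, sleep=time.sleep):
    """Call start_rebalance until it succeeds; return the attempt number.

    Only failures whose text contains RETRYABLE_MARKER are retried.
    Any other failure, or running out of attempts, re-raises the error.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            start_rebalance()
            return attempt
        except RebalanceError as err:
            if RETRYABLE_MARKER not in str(err) or attempt == attempts:
                raise
            sleep(delay)      # wait for the indexer to finish its cleanup
            delay *= backoff  # back off further before the next attempt
    raise AssertionError("unreachable")


# Usage with a stand-in that fails twice with the message from the log.
calls = {"count": 0}
waits = []


def fake_rebalance():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RebalanceError(
            "indexer rebalance failure - cleanup pending from previous  "
            "failed/aborted rebalance/failover/move index. "
            "please retry the request later."
        )


succeeded_on = rebalance_with_retry(fake_rebalance, sleep=waits.append)
print(succeeded_on, waits)
```

      Injecting `sleep` keeps the sketch testable without real waiting; a real caller would leave the default and tune the delays to how long indexer cleanup typically takes.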

      After this, a further rebalance was started and completed successfully.

          People

            amit.kulkarni Amit Kulkarni
            aryaan.bhaskar Aryaan Bhaskar