Couchbase Kubernetes / K8S-3527

[operator-122] Multiple Rebalance failures on Index+query during delta recovery upgrade.


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: 2.8.0
    • Affects Version/s: 2.6.4
    • Component/s: operator
    • Labels: None
    • Sprint: 19 - A Rock and a Hard Place
    • Story Points: 1

    Description

       

      Kubernetes Version: 1.25
      Couchbase Server:   7.2.5 → 7.6.1
      Upgrade Process:    DeltaRecovery
      Operator:           2.6.4-122

      Cluster Setup:-

      1. Each node is an m5.4xlarge instance. (16 vCPUs and 64GB RAM)
      2. 6 Data Service, 4 Index Service & Query Service Nodes.
      3. 10 Buckets (with 1 replica), Full Eviction and Auto-failover set to 5s.
      4. ~2TiB data loaded onto cluster before the beginning of upgrade.
      5. 50 Primary Indexes with 1 Replica each. (Total 100 Indexes)
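
      For reference, a minimal sketch of how a cluster and upgrade like this might be declared. This is an assumption-laden illustration: the CRD field names below (in particular upgradeProcess and autoFailoverTimeout) are recalled from the couchbase.com/v2 schema and may not match operator 2.6.4-122 exactly, and the bucket and index definitions are omitted.

      import yaml  # PyYAML

      # Hypothetical CouchbaseCluster manifest approximating the setup above.
      cluster = {
          "apiVersion": "couchbase.com/v2",
          "kind": "CouchbaseCluster",
          "metadata": {"name": "cb-example", "namespace": "default"},
          "spec": {
              "image": "couchbase/server:7.2.5",          # bumped to 7.6.1 to trigger the upgrade
              "upgradeProcess": "DeltaRecovery",          # assumed field name for the delta-recovery path
              "cluster": {"autoFailoverTimeout": "5s"},   # matches the 5s auto-failover above
              "buckets": {"managed": True},               # the 10 buckets are defined as separate resources
              "servers": [
                  {"name": "data", "size": 6, "services": ["data"]},
                  {"name": "index-query", "size": 4, "services": ["index", "query"]},
              ],
          },
      }

      print(yaml.safe_dump(cluster, sort_keys=False))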

      Observation:-

      1. Multiple "Rebalance exited" failures reported on nodes 0006 and 0009.

      Result:-

      1. The CB cluster was successfully upgraded from 7.2.5 to 7.6.1.

       

      Rebalance exited with reason {{badmatch,failed}, [{ns_rebalancer,rebalance_body,7, [{file,"src/ns_rebalancer.erl"}, {line,500}]}, {async,'-async_init/4-fun-1-',3, [{file,"src/async.erl"},{line,199}]}]}. Rebalance Operation Id = d11d06f80dbcb3932093d4187135162f (ns_orchestrator 000, ns_1@cb-example-0006.cb-example.default.svc, 6:27:56 AM 3 Jun, 2024)

      Rebalance exited with reason {{badmatch,failed}, [{ns_rebalancer,rebalance_body,7, [{file,"src/ns_rebalancer.erl"}, {line,500}]}, {async,'-async_init/4-fun-1-',3, [{file,"src/async.erl"},{line,199}]}]}. Rebalance Operation Id = ab5f43c47572c4d6e9ae65b267677f0a (ns_orchestrator 000, ns_1@cb-example-0006.cb-example.default.svc, 6:26:56 AM 3 Jun, 2024)

      Rebalance exited with reason {{badmatch,failed}, [{ns_rebalancer,rebalance_body,7, [{file,"src/ns_rebalancer.erl"}, {line,500}]}, {async,'-async_init/4-fun-1-',3, [{file,"src/async.erl"},{line,199}]}]}. Rebalance Operation Id = f2044a766208623faac2833cbaec1487 (ns_orchestrator 000, ns_1@cb-example-0006.cb-example.default.svc, 6:25:56 AM 3 Jun, 2024)

      Rebalance exited with reason {{badmatch,failed}, [{ns_rebalancer,rebalance_body,7, [{file,"src/ns_rebalancer.erl"}, {line,500}]}, {async,'-async_init/4-fun-1-',3, [{file,"src/async.erl"},{line,199}]}]}. Rebalance Operation Id = 9972220fb1f2980822c81804e8055638 (ns_orchestrator 000, ns_1@cb-example-0006.cb-example.default.svc, 6:24:56 AM 3 Jun, 2024)
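
      The same events can also be pulled programmatically from the cluster's standard /logs REST endpoint rather than copied from the UI. A rough sketch (the node address and credentials are placeholders, and response field names may vary slightly by server version):

      import requests

      BASE = "http://cb-example-0006.cb-example.default.svc:8091"  # placeholder node address
      AUTH = ("Administrator", "password")                         # placeholder credentials

      resp = requests.get(f"{BASE}/logs", auth=AUTH, timeout=30)
      resp.raise_for_status()

      # Print every "Rebalance exited" event with its node and server-side timestamp.
      for entry in resp.json().get("list", []):
          if "Rebalance exited" in entry.get("text", ""):
              print(entry.get("serverTime"), entry.get("node"), entry.get("text"))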

      CB logs - http://supportal.couchbase.com/snapshot/51994b67a1f45b0e99c2b08649ba8032::0

       

      Operator logs - cbopinfo-20240603T125514+0530.tar.gz

      Analysis:-

      The rebalance for the delta recovery of 0008 failed at 2024-06-03T06:13:56.089Z:

      2024-06-03T06:12:50.509Z, ns_orchestrator:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Starting rebalance, KeepNodes = ['ns_1@cb-example-0000.cb-example.default.svc', 'ns_1@cb-example-0001.cb-example.default.svc', 'ns_1@cb-example-0002.cb-example.default.svc', 'ns_1@cb-example-0003.cb-example.default.svc', 'ns_1@cb-example-0004.cb-example.default.svc', 'ns_1@cb-example-0005.cb-example.default.svc', 'ns_1@cb-example-0006.cb-example.default.svc', 'ns_1@cb-example-0007.cb-example.default.svc', 'ns_1@cb-example-0008.cb-example.default.svc', 'ns_1@cb-example-0009.cb-example.default.svc'], EjectNodes = [], Failed over and being ejected nodes = []; Delta recovery nodes = ['ns_1@cb-example-0008.cb-example.default.svc'], Delta recovery buckets = all;; Operation Id = 70958a95d46cb219ff9b4da9b04b0e02

      2024-06-03T06:13:56.089Z, ns_orchestrator:0:critical:message(ns_1@cb-example-0006.cb-example.default.svc) - Rebalance exited with reason {service_rebalance_failed,index, {agent_died,<0.2649.0>, {linked_process_died,<0.16481.1>, {'ns_1@cb-example-0006.cb-example.default.svc', {timeout, {gen_server,call, [<0.3793.0>, {call,"ServiceAPI.StartTopologyChange", #Fun<json_rpc_connection.0.36915653>, #{timeout => 60000}}, 60000]}}}}}}. Rebalance Operation Id = 70958a95d46cb219ff9b4da9b04b0e02

      Immediately after the failed rebalance of 0008, the operator failed over 0009 to continue the upgrade procedure:

      2024-06-03T06:14:02.340Z, failover:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Starting failing over ['ns_1@cb-example-0009.cb-example.default.svc'] 
       
      2024-06-03T06:14:02.340Z, ns_orchestrator:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Starting failover of nodes ['ns_1@cb-example-0009.cb-example.default.svc'] AllowUnsafe = false Operation Id = 95eab6d19bd916c005486555bb1f5382

      Immediately after the failover of 0009 was initiated, auto-failover tried to fail over 0006 itself, and the subsequent rebalances kept failing until the upgrade finished:

      2024-06-03T06:14:02.488Z, auto_failover:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Could not automatically fail over nodes (['ns_1@cb-example-0006.cb-example.default.svc']). Failover is running. 
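
      Auto-failover is configured at 5s in this setup. A quick way to confirm the configured window from the standard /settings/autoFailover endpoint; a sketch only, with placeholder host and credentials:

      import requests

      BASE = "http://cb-example-0006.cb-example.default.svc:8091"  # placeholder node address
      AUTH = ("Administrator", "password")                         # placeholder credentials

      settings = requests.get(f"{BASE}/settings/autoFailover", auth=AUTH, timeout=30).json()
      print("auto-failover enabled:", settings.get("enabled"))
      print("auto-failover timeout (s):", settings.get("timeout"))  # expected to be 5 for this setup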

      Operator logs:-

      The operator started upgrading 0008 at 2024-06-03T06:11:43Z and reported the rebalance as completed at 2024-06-03T06:13:58Z:

       

      {"level":"info","ts":"2024-06-03T06:11:42Z","logger":"cluster","msg":"Upgrading pods with DeltaRecovery","cluster":"default/cb-example","names":["cb-example-0008"],"target-version":"7.6.1"}
       
      {"level":"info","ts":"2024-06-03T06:11:42Z","logger":"cluster","msg":"Unable to perform graceful failover on node. Reverting to hard failover.","cluster":"default/cb-example","name":"cb-example-0008"}
       
      {"level":"info","ts":"2024-06-03T06:11:43Z","logger":"kubernetes","msg":"Creating pod","cluster":"default/cb-example","name":"cb-example-0008","image":"couchbase/server:7.6.1"}
       
      {"level":"info","ts":"2024-06-03T06:12:50Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"-{v2.ClusterStatus}.Conditions[1->?]:{Type:Balanced Status:True LastUpdateTime:2024-06-03T06:11:38Z LastTransitionTime:2024-06-03T06:11:38Z Reason:Balanced Message:Data is equally distributed across all nodes in the cluster};+{v2.ClusterStatus}.Conditions[?->1]:{Type:Balanced Status:False LastUpdateTime:2024-06-03T06:12:50Z LastTransitionTime:2024-06-03T06:12:50Z Reason:Unbalanced Message:The operator is attempting to rebalance the data to correct this issue}"}
      {"level":"info","ts":"2024-06-03T06:12:50Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":0}
      {"level":"info","ts":"2024-06-03T06:12:54Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":42}
      {"level":"info","ts":"2024-06-03T06:12:58Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":60}
      {"level":"info","ts":"2024-06-03T06:13:02Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":60}
      {"level":"info","ts":"2024-06-03T06:13:06Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":60}
      {"level":"info","ts":"2024-06-03T06:13:10Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":60}
      {"level":"info","ts":"2024-06-03T06:13:14Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":60}
      {"level":"info","ts":"2024-06-03T06:13:18Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":60}
      {"level":"info","ts":"2024-06-03T06:13:22Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":60}
      {"level":"info","ts":"2024-06-03T06:13:26Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":60}
      {"level":"info","ts":"2024-06-03T06:13:30Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":60}
      {"level":"info","ts":"2024-06-03T06:13:34Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":60}
      {"level":"info","ts":"2024-06-03T06:13:38Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":60}
      {"level":"info","ts":"2024-06-03T06:13:42Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":60}
      {"level":"info","ts":"2024-06-03T06:13:46Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":60}
      {"level":"info","ts":"2024-06-03T06:13:50Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":60}
      {"level":"info","ts":"2024-06-03T06:13:54Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":60}
      {"level":"info","ts":"2024-06-03T06:13:58Z","logger":"cluster","msg":"Rebalance completed successfully","cluster":"default/cb-example"}
       
      {"level":"info","ts":"2024-06-03T06:13:58Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"-{v2.ClusterStatus}.Conditions[1->?]:{Type:Balanced Status:False LastUpdateTime:2024-06-03T06:12:50Z LastTransitionTime:2024-06-03T06:12:50Z Reason:Unbalanced Message:The operator is attempting to rebalance the data to correct this issue};+{v2.ClusterStatus}.Conditions[?->1]:{Type:Balanced Status:True LastUpdateTime:2024-06-03T06:13:58Z LastTransitionTime:2024-06-03T06:13:58Z Reason:Balanced Message:Data is equally distributed across all nodes in the cluster}"}
       

      The operator initiated the failover of 0009 immediately after the 0008 upgrade, which is the right move from the operator's perspective:

      {"level":"info","ts":"2024-06-03T06:14:02Z","logger":"cluster","msg":"Upgrading pods with DeltaRecovery","cluster":"default/cb-example","names":["cb-example-0009"],"target-version":"7.6.1"}
       
      {"level":"info","ts":"2024-06-03T06:14:02Z","logger":"cluster","msg":"Unable to perform graceful failover on node. Reverting to hard failover.","cluster":"default/cb-example","name":"cb-example-0009"} 

      In the server logs, however, the 0008 rebalance failed and should have been retried before initiating 0009.
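
      A hedged sketch of the kind of cross-check this implies: before failing over the next node, ask the server itself (via the standard /pools/default/tasks endpoint) whether the previous rebalance is still running or ended in an error, rather than relying only on the client-side view. Host and credentials are placeholders, and field names such as errorMessage may vary by server version:

      import requests

      BASE = "http://cb-example-0006.cb-example.default.svc:8091"  # placeholder node address
      AUTH = ("Administrator", "password")                         # placeholder credentials

      tasks = requests.get(f"{BASE}/pools/default/tasks", auth=AUTH, timeout=30).json()
      for task in tasks:
          if task.get("type") == "rebalance":
              # "status" is typically "running" or "notRunning"; a failed run usually
              # also carries an error message that should block the next failover.
              print(task.get("status"), task.get("errorMessage", ""))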

       

      This is the same behaviour as in K8S-3529.

       

      Here, we didn't try any restarts in between the upgrades.

            People

              Assignee: justin.ashworth Justin Ashworth
              Reporter: manik.mahajan Manik Mahajan
              Votes: 0
              Watchers: 3
