Couchbase Kubernetes / K8S-3489

[operator 2.6.4-109] Failed rebalance is not retried before new CB node is considered for upgrade.


Details

    • Bug
    • Resolution: Fixed
    • Major
    • 2.6.4
    • 2.6.4
    • None
    • None
    • 0

    Description

      Couchbase Cluster Description

      • Set up the cluster as per the required specifications
      • Each node is an m5.4xlarge instance. (16 vCPUs and 64GB RAM)
      • 6 Data Service nodes and 4 Index and Query Service nodes.
      • 10 Buckets (with 1 replica), Full Eviction and Auto-failover set to 5s.
      • ~210GB data per bucket → ~2TB data loaded onto cluster.
      • 50 Primary Indexes with 1 Replica each. (Total 100 Indexes)
      • DeltaRecovery Upgrade to update Couchbase Server from 7.2.5 to 7.6.1
      • Continuous data and query workload on all buckets during the update process.
      • Interrupted upgrade by restarting a KV node.

       

      Rebooted cb-example-0001 while upgrading

       10:10:55 AM 17 May, 2024

      Node 'ns_1@cb-example-0004.cb-example.default.svc' saw that node 'ns_1@cb-example-0001.cb-example.default.svc' went down. Details: [{nodedown_reason, connection_closed}]
      ns_node_disco 005 ns_1@cb-example-0004.cb-example.default.svc

       
      10:46:27 AM 17 May, 2024

      Rebalance exited with reason {service_rebalance_failed,n1ql, {{badmatch, {error, {bad_nodes,n1ql,get_agent, [{'ns_1@cb-example-0009.cb-example.default.svc', {exit, {{nodedown, 'ns_1@cb-example-0009.cb-example.default.svc'}, {gen_server,call, [{'service_agent-n1ql', 'ns_1@cb-example-0009.cb-example.default.svc'}, get_agent,infinity]}}}}]}}}, [{service_manager,wait_for_agents,1, [{file,"src/service_manager.erl"}, {line,165}]}, {service_manager,run_op,1, ...
      ns_orchestrator 000 ns_1@cb-example-0001.cb-example.default.svc

      10:49:14 AM 17 May, 2024

      Starting rebalance, KeepNodes = ['ns_1@cb-example-0000.cb-example.default.svc', 'ns_1@cb-example-0001.cb-example.default.svc', 'ns_1@cb-example-0002.cb-example.default.svc', 'ns_1@cb-example-0003.cb-example.default.svc', 'ns_1@cb-example-0004.cb-example.default.svc', 'ns_1@cb-example-0005.cb-example.default.svc', 'ns_1@cb-example-0006.cb-example.default.svc', 'ns_1@cb-example-0007.cb-example.default.svc', 'ns_1@cb-example-0008.cb-example.default.svc', 'ns_1@cb-example-0009.cb-example.default.svc'], EjectNodes = [], Failed over and being ejected nodes = []; **Delta recovery nodes = ['ns_1@cb-example-0000.cb-example.default.svc']**, Delta recovery buckets = all;; Operation Id = 8a8ec0d43d1a1ae6586f5a...
      ns_orchestrator 000 ns_1@cb-example-0001.cb-example.default.svc
      

       
      11:03:05 AM 17 May, 2024

      Rebalance interrupted due to auto-failover of nodes ['ns_1@cb-example-0008.cb-example.default.svc']. Rebalance Operation Id = 8b558d90ef57438882ec1fbe6c0db75f
      ns_orchestrator 000 ns_1@cb-example-0001.cb-example.default.svc
      

       
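To make the timing concrete, the intervals between the events listed above can be computed with a short Python sketch. The timestamps are copied from the event entries; the short labels are paraphrases added here for illustration only.

```python
from datetime import datetime

# Timestamps copied from the events above; labels are paraphrased for brevity.
EVENTS = [
    ("10:10:55 AM 17 May, 2024", "cb-example-0001 reported down"),
    ("10:46:27 AM 17 May, 2024", "rebalance exited (n1ql service_rebalance_failed)"),
    ("10:49:14 AM 17 May, 2024", "new rebalance started, delta recovery node cb-example-0000"),
    ("11:03:05 AM 17 May, 2024", "rebalance interrupted by auto-failover of cb-example-0008"),
]

# 12-hour clock with AM/PM, day, month name, comma, year.
FMT = "%I:%M:%S %p %d %b, %Y"


def gaps(events=EVENTS):
    """Return (label, seconds since the previous event) for each event after the first."""
    times = [datetime.strptime(ts, FMT) for ts, _ in events]
    return [
        (events[i + 1][1], int((times[i + 1] - times[i]).total_seconds()))
        for i in range(len(times) - 1)
    ]


if __name__ == "__main__":
    for label, seconds in gaps():
        print(f"+{seconds // 60:>2}m{seconds % 60:02d}s  {label}")
```

By this calculation, the rebalance that names cb-example-0000 as the delta recovery node started 2m47s after the failed rebalance exited.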
      

      1. Why did the N1QL rebalance fail?
      2. Why did the operator treat the failed rebalance as a completed rebalance and proceed with upgrading the next node? Shouldn't the cluster be balanced before the operator upgrades the next node?
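The behaviour expected in question 2 can be sketched as a gate plus a bounded retry. This is a minimal illustration in Python, not the operator's actual implementation. It assumes the parsed JSON body of Couchbase Server's `GET /pools/default` exposes `balanced`, `rebalanceStatus`, and per-node `clusterMembership` and `status` fields. The callables `start_rebalance` and `fetch_pool` are hypothetical placeholders.

```python
from typing import Any, Callable, Dict


def safe_to_upgrade_next(pool: Dict[str, Any]) -> bool:
    """Decide whether the next node may be upgraded.

    `pool` is assumed to be the parsed JSON body of GET /pools/default.
    The field names used here are assumptions about that response.
    """
    if pool.get("rebalanceStatus") != "none":
        return False  # a rebalance is still running
    if not pool.get("balanced", False):
        return False  # the last rebalance did not leave the cluster balanced
    # Every node must be an active, healthy member.
    return all(
        node.get("clusterMembership") == "active" and node.get("status") == "healthy"
        for node in pool.get("nodes", [])
    )


def rebalance_with_retry(
    start_rebalance: Callable[[], None],
    fetch_pool: Callable[[], Dict[str, Any]],
    max_attempts: int = 3,
) -> bool:
    """Retry the rebalance until the cluster passes the gate or attempts run out.

    `start_rebalance` (hypothetical) triggers a rebalance and blocks until it
    finishes or fails; `fetch_pool` (hypothetical) returns the current
    /pools/default document.
    """
    for _ in range(max_attempts):
        start_rebalance()
        if safe_to_upgrade_next(fetch_pool()):
            return True
    return False  # caller should stop the upgrade instead of moving on
```

With a gate like this, a rebalance that exits with an error leaves `balanced` false, so the loop retries rather than treating the exit as completion.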

      CB logs - http://supportal.couchbase.com/snapshot/d98f94df31477f6c622956049790e725::1
      s3://cb-customers-secure/k8s-3485_delta_ugrade_before_rebalance_complete/2024-05-17/collectinfo-2024-05-17t110024-ns_1@cb-example-0000.cb-example.default.svc-d2b7a81f0e30fc2c.zip
      s3://cb-customers-secure/k8s-3485_delta_ugrade_before_rebalance_complete/2024-05-17/collectinfo-2024-05-17t110024-ns_1@cb-example-0001.cb-example.default.svc-c289c28a1bc81eb9.zip
      s3://cb-customers-secure/k8s-3485_delta_ugrade_before_rebalance_complete/2024-05-17/collectinfo-2024-05-17t110024-ns_1@cb-example-0002.cb-example.default.svc-b59a45b634abd3a7.zip
      s3://cb-customers-secure/k8s-3485_delta_ugrade_before_rebalance_complete/2024-05-17/collectinfo-2024-05-17t110024-ns_1@cb-example-0003.cb-example.default.svc-fd4119ace950d119.zip
      s3://cb-customers-secure/k8s-3485_delta_ugrade_before_rebalance_complete/2024-05-17/collectinfo-2024-05-17t110024-ns_1@cb-example-0004.cb-example.default.svc-506c0c19b7108d29.zip
      s3://cb-customers-secure/k8s-3485_delta_ugrade_before_rebalance_complete/2024-05-17/collectinfo-2024-05-17t110024-ns_1@cb-example-0005.cb-example.default.svc-60bf9c0169440d44.zip
      s3://cb-customers-secure/k8s-3485_delta_ugrade_before_rebalance_complete/2024-05-17/collectinfo-2024-05-17t110024-ns_1@cb-example-0006.cb-example.default.svc-d1cc5490d0deb786.zip
      s3://cb-customers-secure/k8s-3485_delta_ugrade_before_rebalance_complete/2024-05-17/collectinfo-2024-05-17t110024-ns_1@cb-example-0007.cb-example.default.svc-a22dcb24701a5cfc.zip
      s3://cb-customers-secure/k8s-3485_delta_ugrade_before_rebalance_complete/2024-05-17/collectinfo-2024-05-17t110024-ns_1@cb-example-0009.cb-example.default.svc-381d02376b7ab129.zip

      K8s Operator console logs during the upgrade: operator_logs.txt
      Operator logs during the upgrade: cbopinfo-20240517T163222+0530.tar.gz

      Attachments

        1. cbopinfo-20240517T163222+0530.tar.gz
          1.70 MB
        2. operator_logs.txt
          394 kB
        3. screenshot-1.png
          248 kB
        4. screenshot-2.png
          315 kB

            People

              manik.mahajan Manik Mahajan
              manik.mahajan Manik Mahajan