  1. Couchbase Kubernetes
  2. K8S-3504

[operator 2.6.4-115] Multiple graceful failover attempts before completing delta recovery, resulting in unnecessary rebalances and unpredictable upgrade time


Details

    • Bug
    • Resolution: Fixed
    • Critical
    • 2.6.4
    • 2.6.4
    • operator
    • None
    • 0

    Description

      Couchbase Cluster Description

      • Set up the cluster as per the required specifications
      • Each node is an m5.4xlarge instance (16 vCPUs and 64 GB RAM).
      • 6 Data Service nodes and 4 Index and Query Service nodes.
      • 10 buckets (1 replica each), Full Eviction, and auto-failover timeout set to 5s.
      • Total data of around 3 TiB across the 6 KV nodes.
      • 50 primary indexes with 1 replica each (100 indexes in total).
      • DeltaRecovery upgrade of Couchbase Server from 7.2.5 to 7.6.1.
      • Continuous data and query workload on all buckets during the update process.

      Observation:-

      1. Successful graceful failover upgrade of cb-example-0001.
      2. While upgrading cb-example-0000, the operator reported an "unexpected counter change" error and attempted a new graceful failover.
      3. The operator completed the delta recovery upgrade of cb-example-0000 in 2 attempts.
      4. The operator completed the delta recovery upgrade of cb-example-0002 in 2 attempts.
      5. The operator completed the delta recovery upgrade of cb-example-0003 in 4 attempts.
      6. We decided to stop the upgrade and remove the cluster to save AWS cost.

       
      Conclusion:-

      1. Multiple attempts to gracefully failover the same nodes lead to unnecessary rebalances, which put pressure on the Couchbase server and degrade performance.
      2. These redundant attempts can result in an unpredictable upgrade duration.

       
      Expectation:-

      1. Avoid multiple attempts for a graceful upgrade on the same Couchbase node.
      2. Ensure the upgrade process is completed within a known duration.

       
      Suggestion for fix:-
      I think gracefullyFailoverNode checks counters that it does not need to, and any update to one of them results in an unexpected counter change. The only counters required are "graceful_failover_start" and "graceful_failover_success"; there is no need to match the old and new values of every initial counter.

      ErrUnexpectedCounterChange is probably returned because we compare the old and new values of the other counters [rebalance_start, rebalance_success, failover and failover_complete], which are not needed to determine whether a graceful failover succeeded.

      for name, curVal := range clusterInfo.Counters {
          oldVal := initialCounters[name]
          switch name {
          case "graceful_failover_start", "graceful_failover_success":
              continue
          }

          if curVal != oldVal {
              return ErrUnexpectedCounterChange, false
          }
      }

      As the following JSON shows, other counters such as [rebalance_start, rebalance_success, failover and failover_complete] can change in the middle of a graceful failover because of auto-failover, which can lead to ErrUnexpectedCounterChange.

      "counters":{"rebalance_start":5,"graceful_failover_success":4,"failover":4,"failover_complete":4,"graceful_failover_start":4,"rebalance_success":4}
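
The suggestion can be illustrated with a minimal sketch. This is not the operator's code: `gracefulFailoverSucceeded` and `changedCounters` are hypothetical helpers, and the "initial" snapshot is invented for illustration; only the "current" values come from the JSON above.

```go
package main

import (
	"fmt"
	"sort"
)

// gracefulFailoverSucceeded reports whether the success counter has advanced
// since the failover was started. All other counters are ignored.
func gracefulFailoverSucceeded(initial, current map[string]int) bool {
	return current["graceful_failover_success"] > initial["graceful_failover_success"]
}

// changedCounters lists the counters, other than the graceful failover ones,
// whose values differ from the initial snapshot. It is meant for logging
// only, not for failing the operation.
func changedCounters(initial, current map[string]int) []string {
	var changed []string
	for name, cur := range current {
		switch name {
		case "graceful_failover_start", "graceful_failover_success":
			continue
		}
		if cur != initial[name] {
			changed = append(changed, name)
		}
	}
	sort.Strings(changed)
	return changed
}

func main() {
	// Hypothetical snapshot taken before the graceful failover started.
	initial := map[string]int{
		"rebalance_start": 4, "rebalance_success": 4,
		"failover": 3, "failover_complete": 3,
		"graceful_failover_start": 3, "graceful_failover_success": 3,
	}
	// Values taken from the JSON in this ticket.
	current := map[string]int{
		"rebalance_start": 5, "rebalance_success": 4,
		"failover": 4, "failover_complete": 4,
		"graceful_failover_start": 4, "graceful_failover_success": 4,
	}
	fmt.Println("succeeded:", gracefulFailoverSucceeded(initial, current))
	fmt.Println("other counters that moved:", changedCounters(initial, current))
}
```

Reporting the other counters instead of failing on them keeps the information available for debugging, although whether they can safely be ignored in every case is a question for the operator maintainers.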
       

       

      A git diff that could possibly fix this:

      ```
      diff --git a/pkg/cluster/nodereconcile.go b/pkg/cluster/nodereconcile.go
      index 0dc2e9f1..e5375b36 100644
      --- a/pkg/cluster/nodereconcile.go
      +++ b/pkg/cluster/nodereconcile.go
      @@ -1188,24 +1188,12 @@ func (r *ReconcileMachine) gracefullyFailoverNode(candidate couchbaseutil.Member
                              return nil, true
                      }
       
      -               // Compare the counters to see if anything has changed
      -               if len(initialCounters) > len(clusterInfo.Counters) {
      -                       return ErrUnexpectedCounterChange, false
      +               // If the rebalance is unknown state, may lead to fail attempt for graceful failover
      +               if clusterInfo.RebalanceStatus == couchbaseutil.RebalanceStatusUnknown {
      +                       log.Info("unknown rebalance status during graceful failover")
                      }
       
      -               for name, curVal := range clusterInfo.Counters {
      -                       oldVal := initialCounters[name]
      -                       switch name {
      -                       case "graceful_failover_start", "graceful_failover_success":
      -                               continue
      -                       }
      -
      -                       if curVal != oldVal {
      -                               return ErrUnexpectedCounterChange, false
      -                       }
      -               }
      -
      -               return nil, false
      +               return ErrFailoverSuccessCounterNotIncremented, false
              })
       
              if err != nil {
      ```
       
       
       
      Analysis:-
       
      cb-example-0000

      • First attempt at graceful failover of cb-example-0000 fails
         

        {"level":"info","ts":"2024-05-23T05:29:39Z","logger":"cluster","msg":"cb-example-0000"}
         
        {"level":"info","ts":"2024-05-23T05:29:39Z","logger":"cluster","msg":"Upgrading pods with DeltaRecovery","cluster":"default/cb-example","names":["cb-example-0000"],"target-version":"7.6.1"}   
         
        {"level":"error","ts":"2024-05-23T05:42:14Z","logger":"cluster","msg":"Reconciliation failed","cluster":"default/cb-example","error":"graceful failover failed: unexpected counter change","stacktrace":"github.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).runReconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:498\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).Update\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:535\ngithub.com/couchbase/couchbase-operator/pkg/controller.(*CouchbaseClusterReconciler).Reconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/controller/controller.go:90\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}

         
         

      • Second attempt at graceful failover of cb-example-0000 succeeds

        {"level":"info","ts":"2024-05-23T06:07:11Z","logger":"cluster","msg":"Upgrading pods with DeltaRecovery","cluster":"default/cb-example","names":["cb-example-0000"],"target-version":"7.6.1"}
         
        {"level":"info","ts":"2024-05-23T06:21:53Z","logger":"kubernetes","msg":"Creating pod","cluster":"default/cb-example","name":"cb-example-0000","image":"couchbase/server:7.6.1"}     
         
        {"level":"info","ts":"2024-05-23T06:23:03Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"-{v2.ClusterStatus}.Conditions[1->?]:{Type:Balanced Status:True LastUpdateTime:2024-05-23T06:07:09Z LastTransitionTime:2024-05-23T06:07:09Z Reason:Balanced Message:Data is equally distributed across all nodes in the cluster};+{v2.ClusterStatus}.Conditions[?->1]:{Type:Balanced Status:False LastUpdateTime:2024-05-23T06:23:03Z LastTransitionTime:2024-05-23T06:23:03Z Reason:Unbalanced Message:The operator is attempting to rebalance the data to correct this issue}"}

         
         
         
        cb-example-0002

      • First attempt at graceful failover of cb-example-0002 fails

      {"level":"info","ts":"2024-05-23T06:39:33Z","logger":"cluster","msg":"cb-example-0002"}     
       
      {"level":"info","ts":"2024-05-23T06:39:33Z","logger":"cluster","msg":"Upgrading pods with DeltaRecovery","cluster":"default/cb-example","names":["cb-example-0002"],"target-version":"7.6.1"}     
       
      {"level":"error","ts":"2024-05-23T06:50:21Z","logger":"cluster","msg":"Reconciliation failed","cluster":"default/cb-example","error":"graceful failover failed: unexpected counter change","stacktrace":"github.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).runReconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:498\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).Update\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:535\ngithub.com/couchbase/couchbase-operator/pkg/controller.(*CouchbaseClusterReconciler).Reconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/controller/controller.go:90\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}

       
       

      • Second attempt at graceful failover of cb-example-0002 succeeds

      {"level":"info","ts":"2024-05-23T07:15:05Z","logger":"cluster","msg":"cb-example-0002"}     
       
      {"level":"info","ts":"2024-05-23T07:15:05Z","logger":"cluster","msg":"Upgrading pods with DeltaRecovery","cluster":"default/cb-example","names":["cb-example-0002"],"target-version":"7.6.1"}
       
      {"level":"info","ts":"2024-05-23T07:27:02Z","logger":"kubernetes","msg":"Creating pod","cluster":"default/cb-example","name":"cb-example-0002","image":"couchbase/server:7.6.1"}
       
      {"level":"info","ts":"2024-05-23T07:30:26Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"-{v2.ClusterStatus}.Conditions[1->?]:{Type:Balanced Status:True LastUpdateTime:2024-05-23T07:15:02Z LastTransitionTime:2024-05-23T07:15:02Z Reason:Balanced Message:Data is equally distributed across all nodes in the cluster};+{v2.ClusterStatus}.Conditions[?->1]:{Type:Balanced Status:False LastUpdateTime:2024-05-23T07:30:26Z LastTransitionTime:2024-05-23T07:30:26Z Reason:Unbalanced Message:The operator is attempting to rebalance the data to correct this issue}"}

        
      cb-example-0003

      • First attempt at graceful failover of cb-example-0003 fails

      {"level":"info","ts":"2024-05-23T07:46:16Z","logger":"cluster","msg":"cb-example-0003"}     
       
      {"level":"info","ts":"2024-05-23T07:46:16Z","logger":"cluster","msg":"Upgrading pods with DeltaRecovery","cluster":"default/cb-example","names":["cb-example-0003"],"target-version":"7.6.1"}   
       
      {"level":"error","ts":"2024-05-23T07:50:05Z","logger":"cluster","msg":"Reconciliation failed","cluster":"default/cb-example","error":"graceful failover failed: unexpected counter change","stacktrace":"github.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).runReconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:498\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).Update\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:535\ngithub.com/couchbase/couchbase-operator/pkg/controller.(*CouchbaseClusterReconciler).Reconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/controller/controller.go:90\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}

       

      • Second attempt at graceful failover of cb-example-0003 fails

        {"level":"info","ts":"2024-05-23T08:00:57Z","logger":"cluster","msg":"cb-example-0003"}     
         
        {"level":"info","ts":"2024-05-23T08:00:57Z","logger":"cluster","msg":"Upgrading pods with DeltaRecovery","cluster":"default/cb-example","names":["cb-example-0003"],"target-version":"7.6.1"}    
         
        {"level":"error","ts":"2024-05-23T08:01:24Z","logger":"cluster","msg":"Reconciliation failed","cluster":"default/cb-example","error":"graceful failover failed: unexpected counter change","stacktrace":"github.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).runReconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:498\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).Update\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:535\ngithub.com/couchbase/couchbase-operator/pkg/controller.(*CouchbaseClusterReconciler).Reconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/controller/controller.go:90\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}

         

      • Third attempt at graceful failover of cb-example-0003 fails

      {"level":"info","ts":"2024-05-23T08:12:11Z","logger":"cluster","msg":"cb-example-0003"}
       
      {"level":"info","ts":"2024-05-23T08:12:11Z","logger":"cluster","msg":"Upgrading pods with DeltaRecovery","cluster":"default/cb-example","names":["cb-example-0003"],"target-version":"7.6.1"}
       
      {"level":"error","ts":"2024-05-23T08:12:38Z","logger":"cluster","msg":"Reconciliation failed","cluster":"default/cb-example","error":"graceful failover failed: unexpected counter change","stacktrace":"github.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).runReconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:498\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).Update\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:535\ngithub.com/couchbase/couchbase-operator/pkg/controller.(*CouchbaseClusterReconciler).Reconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/controller/controller.go:90\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}

      • Fourth attempt at graceful failover of cb-example-0003 succeeds

        {"level":"info","ts":"2024-05-23T08:23:21Z","logger":"cluster","msg":"cb-example-0003"}
         
        {"level":"info","ts":"2024-05-23T08:23:21Z","logger":"cluster","msg":"Upgrading pods with DeltaRecovery","cluster":"default/cb-example","names":["cb-example-0003"],"target-version":"7.6.1"}
         
        {"level":"info","ts":"2024-05-23T08:23:50Z","logger":"kubernetes","msg":"Creating pod","cluster":"default/cb-example","name":"cb-example-0003","image":"couchbase/server:7.6.1"}
        {"level":"info","ts":"2024-05-23T08:27:31Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"-{v2.ClusterStatus}.Conditions[1->?]:{Type:Balanced Status:True LastUpdateTime:2024-05-23T08:23:19Z LastTransitionTime:2024-05-23T08:23:19Z Reason:Balanced Message:Data is equally distributed across all nodes in the cluster};+{v2.ClusterStatus}.Conditions[?->1]:{Type:Balanced Status:False LastUpdateTime:2024-05-23T08:27:31Z LastTransitionTime:2024-05-23T08:27:31Z Reason:Unbalanced Message:The operator is attempting to rebalance the data to correct this issue}"}
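
The attempt counts above were read off the operator log by hand. A small sketch of how the same tally could be automated is shown below; `countAttempts` and `logEntry` are hypothetical names, and the only assumption about the log format is what is visible in the entries quoted in this ticket (one JSON object per line, with "msg" and "names" fields).

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// logEntry holds the two operator log fields this tally needs.
type logEntry struct {
	Msg   string   `json:"msg"`
	Names []string `json:"names"`
}

// countAttempts returns, per pod name, how many
// "Upgrading pods with DeltaRecovery" entries appear in the log text.
func countAttempts(logs string) map[string]int {
	attempts := make(map[string]int)
	scanner := bufio.NewScanner(strings.NewReader(logs))
	// Error entries carry long stack traces, so allow lines up to 1 MiB.
	scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" {
			continue
		}
		var entry logEntry
		if err := json.Unmarshal([]byte(line), &entry); err != nil {
			continue // skip anything that is not a JSON log entry
		}
		if entry.Msg != "Upgrading pods with DeltaRecovery" {
			continue
		}
		for _, name := range entry.Names {
			attempts[name]++
		}
	}
	return attempts
}

func main() {
	// The four upgrade entries and the pod creation entry for cb-example-0003,
	// copied from this ticket.
	logs := `
{"level":"info","ts":"2024-05-23T07:46:16Z","logger":"cluster","msg":"Upgrading pods with DeltaRecovery","cluster":"default/cb-example","names":["cb-example-0003"],"target-version":"7.6.1"}
{"level":"info","ts":"2024-05-23T08:00:57Z","logger":"cluster","msg":"Upgrading pods with DeltaRecovery","cluster":"default/cb-example","names":["cb-example-0003"],"target-version":"7.6.1"}
{"level":"info","ts":"2024-05-23T08:12:11Z","logger":"cluster","msg":"Upgrading pods with DeltaRecovery","cluster":"default/cb-example","names":["cb-example-0003"],"target-version":"7.6.1"}
{"level":"info","ts":"2024-05-23T08:23:21Z","logger":"cluster","msg":"Upgrading pods with DeltaRecovery","cluster":"default/cb-example","names":["cb-example-0003"],"target-version":"7.6.1"}
{"level":"info","ts":"2024-05-23T08:23:50Z","logger":"kubernetes","msg":"Creating pod","cluster":"default/cb-example","name":"cb-example-0003","image":"couchbase/server:7.6.1"}
`
	fmt.Println(countAttempts(logs))
}
```

Run against the entries for cb-example-0003, this reports four attempts, matching the analysis above.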

         

      Screenshots:-

      Multiple graceful failover attempts for 0001 and 0002.

      Multiple graceful failover attempts for 0003.

      Operator logs-
      cbopinfo-20240523T140321+0530.tar.gz

      CB logs - 
      https://cb-engineering.s3.amazonaws.com/K8S-3492-multiple-failovers/collectinfo-2024-05-23T083914-ns_1%40cb-example-0000.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/K8S-3492-multiple-failovers/collectinfo-2024-05-23T083914-ns_1%40cb-example-0001.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/K8S-3492-multiple-failovers/collectinfo-2024-05-23T083914-ns_1%40cb-example-0002.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/K8S-3492-multiple-failovers/collectinfo-2024-05-23T083914-ns_1%40cb-example-0003.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/K8S-3492-multiple-failovers/collectinfo-2024-05-23T083914-ns_1%40cb-example-0004.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/K8S-3492-multiple-failovers/collectinfo-2024-05-23T083914-ns_1%40cb-example-0005.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/K8S-3492-multiple-failovers/collectinfo-2024-05-23T083914-ns_1%40cb-example-0006.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/K8S-3492-multiple-failovers/collectinfo-2024-05-23T083914-ns_1%40cb-example-0007.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/K8S-3492-multiple-failovers/collectinfo-2024-05-23T083914-ns_1%40cb-example-0008.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/K8S-3492-multiple-failovers/collectinfo-2024-05-23T083914-ns_1%40cb-example-0009.cb-example.default.svc.zip


            People

              usamah.jassat Usamah Jassat
              manik.mahajan Manik Mahajan