Uploaded image for project: 'Couchbase Kubernetes'
  1. Couchbase Kubernetes
  2. K8S-3555

[CAO 2.7.0-197] Rebalance Failures after Delta Recovery Upgrade

    XMLWordPrintable

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • 2.8.0
    • 2.7.0
    • None
    • 18 -Lost to Eternity
    • 1

    Description

      Upgrade Process Description provided here:
      https://issues.couchbase.com/browse/K8S-3547?focusedId=783386&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-783386

       

      Kubernetes Version 1.28
      Couchbase Server 7.2.5 → 7.6.1
      Operator 2.7.0-197

      Cluster Setup

      • Each node is an m5.4xlarge instance. (16 vCPUs and 64GB RAM) 
      • 6 Data Service, 4 Index Service & Query Service Nodes.
      • 10 Buckets (with 1 replica), Full Eviction and Auto-failover set to 5s.
      • 85-120GB data per bucket → ~1TB data loaded onto cluster before beginning of upgrade.
      • 50 Primary Indexes with 1 Replica each. (Total 100 Indexes with Index Storage: Plasma)

      Upgrade Process

      • DeltaRecovery Upgrade to update Couchbase Server from 7.2.5 to 7.6.1.
      • Continuous query and data workload on the buckets during the update process.
      • Around 60% CPU load on all servers during the upgrade.

      Issue

      All the nodes got updated, after which there was a rebalance failure.

      • All 10 nodes get upgraded
      • Some issue happens with node cb-example-0006
      • Cluster state is seen as unbalanced
      • Rebalance is initiated, which fails. This is retried.
      • Rebalance succeeds yet we see an error message "ErrorEncountered Message:failed to rebalance"

      Logs

      Pre Delta Recovery Upgrade - 7.2.5

      Couchbase Server Logs

      http://supportal.couchbase.com/snapshot/e544f65397cb5e33922cc1a1ee4556f7::0

       

      http://supportal.couchbase.com/snapshot/e544f65397cb5e33922cc1a1ee4556f7::1

      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0000.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0001.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0002.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0003.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0004.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0005.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0006.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0007.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0008.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.2.5_delta_upgrade/collectinfo-2024-06-29T222152-ns_1%40cb-example-0009.cb-example.default.svc.zip

       

      Post Delta Recovery Upgrade - 7.6.1

      CAO Collect Logs: 

      Couchbase Server Logs

      http://supportal.couchbase.com/snapshot/e544f65397cb5e33922cc1a1ee4556f7::1

       

      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0000.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0001.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0002.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0003.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0004.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0005.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0006.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0007.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0008.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-06-30A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-06-30T044219-ns_1%40cb-example-0009.cb-example.default.svc.zip

      Bug / Issue

      After all 10 nodes get updated

      {"level":"info","ts":"2024-06-30T00:51:56Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"{v2.ClusterStatus}.CurrentVersion:7.2.5->7.6.1"}
      {"level":"info","ts":"2024-06-30T00:51:59Z","logger":"cluster","msg":"Cluster status","cluster":"default/cb-example","balance":"balanced","rebalancing":false}

       

      Then we see this in the logs

      {"level":"info","ts":"2024-06-30T00:52:07Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"-{v2.ClusterStatus}.Members.Ready[6->?]:cb-example-0006;+{v2.ClusterStatus}.Members.Unready:[cb-example-0006]"}
      {"level":"info","ts":"2024-06-30T00:52:07Z","logger":"cluster","msg":"Cluster status","cluster":"default/cb-example","balance":"unbalanced","rebalancing":false}

      Next, CAO logs the state as unbalanced and starts a rebalance. This rebalance fails.

       

      {"level":"info","ts":"2024-06-30T00:52:07Z","logger":"cluster","msg":"Setting recovery type","cluster":"default/cb-example","name":"cb-example-0006","type":"full"} {"level":"info","ts":"2024-06-30T00:52:09Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"-{v2.ClusterStatus}.Conditions[1->?]:{Type:Balanced Status:True LastUpdateTime:2024-06-30T00:51:55Z LastTransitionTime:2024-06-30T00:51:55Z Reason:Balanced Message:Data is equally distributed across all nodes in the cluster};+{v2.ClusterStatus}.Conditions[?->1]:{Type:Balanced Status:False LastUpdateTime:2024-06-30T00:52:09Z LastTransitionTime:2024-06-30T00:52:09Z Reason:Unbalanced Message:The operator is attempting to rebalance the data to correct this issue}"} {"level":"info","ts":"2024-06-30T00:52:10Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":0} {"level":"info","ts":"2024-06-30T00:52:14Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":36} {"level":"info","ts":"2024-06-30T00:52:28Z","logger":"cluster","msg":"Reconciliation failed","cluster":"default/cb-example","error":"failed to rebalance: timeout: unexpected rebalance error","stack":"github.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).verifyRebalance.func1\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/rebalance.go:299\ngithub.com/couchbase/couchbase-operator/pkg/util/retryutil.Retry\n\tgithub.com/couchbase/couchbase-operator/pkg/util/retryutil/retryutil.go:14\ngithub.com/couchbase/couchbase-operator/pkg/util/retryutil.RetryFor\n\tgithub.com/couchbase/couchbase-operator/pkg/util/retryutil/retryutil.go:30\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).verifyRebalance\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/rebalance.go:324\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).rebalanceWithRetriesOnVerifyFails\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/rebalance.go:258\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).rebalance\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/rebalance.go:185\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*ReconcileMachine).handleRebalance\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/nodereconcile.go:1631\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*ReconcileMachine).exec\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/nodereconcile.go:321\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).reconcileMembers\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/reconcile.go:264\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).reconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/reconcile.go:173\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).runReconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:511\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).Update\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:558\ngithub.com/couchbase/couchbase-operator/pkg/controller.(*CouchbaseClusterReconciler).Reconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/controller/controller.go:90\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"} 

       

      The rebalance is initiated again, this rebalance completed successfully yet the error message came that `ErrorEncountered Message:failed to rebalance: timeout: unexpected rebalance error`

      {"level":"info","ts":"2024-06-30T00:52:28Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"+{v2.ClusterStatus}.Conditions[?->2]:{Type:Error Status:True LastUpdateTime:2024-06-30T00:52:28Z LastTransitionTime:2024-06-30T00:52:28Z Reason:ErrorEncountered Message:failed to rebalance: timeout: unexpected rebalance error}"}
      {"level":"info","ts":"2024-06-30T00:52:28Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"+{v2.ClusterStatus}.Members.Ready[?->6]:cb-example-0006;-{v2.ClusterStatus}.Members.Unready:[cb-example-0006]"}
      {"level":"info","ts":"2024-06-30T00:52:31Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":0}
      {"level":"info","ts":"2024-06-30T00:52:35Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":18}
      {"level":"info","ts":"2024-06-30T00:52:39Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":48}
      {"level":"info","ts":"2024-06-30T00:52:43Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":61.99999999999999}
      ...
      {"level":"info","ts":"2024-06-30T01:06:56Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":80}
      {"level":"info","ts":"2024-06-30T01:07:00Z","logger":"cluster","msg":"Rebalance completed successfully","cluster":"default/cb-example"}
      {"level":"info","ts":"2024-06-30T01:07:00Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"-{v2.ClusterStatus}.Conditions[1->?]:{Type:Balanced Status:False LastUpdateTime:2024-06-30T00:52:09Z LastTransitionTime:2024-06-30T00:52:09Z Reason:Unbalanced Message:The operator is attempting to rebalance the data to correct this issue};+{v2.ClusterStatus}.Conditions[?->1]:{Type:Balanced Status:True LastUpdateTime:2024-06-30T01:07:00Z LastTransitionTime:2024-06-30T01:07:00Z Reason:Balanced Message:Data is equally distributed across all nodes in the cluster}"}
      {"level":"info","ts":"2024-06-30T01:07:01Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"-{v2.ClusterStatus}.Conditions[2->?]:{Type:Error Status:True LastUpdateTime:2024-06-30T00:52:28Z LastTransitionTime:2024-06-30T00:52:28Z Reason:ErrorEncountered Message:failed to rebalance: timeout: unexpected rebalance error};+{v2.ClusterStatus}.Autoscalers:[]"}
      {"level":"info","ts":"2024-06-30T01:07:03Z","logger":"cluster","msg":"Cluster status","cluster":"default/cb-example","balance":"balanced","rebalancing":false}
      

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            People

              aryaan.bhaskar Aryaan Bhaskar
              aryaan.bhaskar Aryaan Bhaskar
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty