Couchbase Kubernetes / K8S-3558

[CAO-2.7.0-197] Operator proceeds with the delta recovery upgrade of a CB data node while the CB cluster has one CB node fewer than required.

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • 2.8.0
    • 2.7.0
    • operator
    • 18 -Lost to Eternity
    • 1

    Description

      Find the details of the Delta Recovery Upgrade here:
      https://issues.couchbase.com/browse/K8S-3548?focusedId=784164&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-784164

      The operator proceeds to perform a delta recovery upgrade on a scaled-down cluster (9 CB nodes instead of 10). This could cause problems for clusters with resource constraints.

      Summary of Events

      `cb-example-0000` to `cb-example-0005` are data nodes; `cb-example-0006` to `cb-example-0009` are index-query nodes.

      1. The delta recovery upgrade of `cb-example-0001` proceeds without any interruptions.
      2. Next, `cb-example-0000` is rebalanced before upgrading. While this rebalance is in progress, I restart the K8s node.
          1. The rebalance fails. Failover happens.
          2. `cb-example-0000` rejoins the cluster with CB 7.6.1.
          3. Delta recovery warmup happens and the node is rebalanced successfully.
      3. `cb-example-0002` starts a rebalance before upgrading. The rebalance is successful.
          1. `cb-example-0002` is failed over and added back to the cluster with CB 7.6.1.
          2. Delta recovery warmup begins.
          3. During warmup, I restart the K8s node.
          4. `cb-example-0002` gets auto failed over.
          5. A rebalance then starts to eject CB node `cb-example-0002`.
          6. During this rebalance, the `cb-example-0002` pod comes up in the K8s cluster, but it is not added back to the CB cluster.
          7. Instead, the rebalance to eject `cb-example-0002` proceeds and completes successfully. `cb-example-0002` is removed from the CB cluster.
          8. The CB cluster is now a 9-node cluster.
      4. Next, the delta recovery upgrade of `cb-example-0003` starts and succeeds (in the 9-node CB cluster).
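
The sequence above suggests a precondition that appears to have been skipped before step 4: the cluster should be back at its desired size before the next node is picked for upgrade. A minimal sketch of such a check follows, in Python for illustration only. The function name `can_start_next_upgrade` and its parameters are hypothetical and are not taken from the operator's code.

```python
def can_start_next_upgrade(desired_size: int, active_members: list[str]) -> bool:
    """Return True only when every desired CB node is an active cluster member.

    Hypothetical guard: the name and signature are assumptions made for
    illustration, not the operator's actual API.
    """
    return len(active_members) >= desired_size


# State after `cb-example-0002` was ejected: 9 of the 10 desired nodes remain.
members = [f"cb-example-{i:04d}" for i in range(10) if i != 2]
print(can_start_next_upgrade(10, members))  # False: hold the upgrade of cb-example-0003
```

With a check like this, the upgrade of `cb-example-0003` would wait until a replacement data node has joined and the member count is back to 10.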

      Bug / Issue

      `cb-example-0002` is being upgraded. It is failed over and added back to the cluster with CB 7.6.1, and delta recovery warmup begins. During warmup, I restart the K8s node. The rebalance fails. Instead of `cb-example-0002` being added back, it is ejected from the cluster.

      {"level":"info","ts":"2024-07-04T12:12:15Z","logger":"cluster","msg":"Rebalance failed, reverting to full recovery: timeout: unexpected rebalance error: node cb-example-0002 not found in cluster"}
      {"level":"info","ts":"2024-07-04T12:12:15Z","logger":"cluster","msg":"Reconciliation failed","cluster":"default/cb-example","error":"request failed: unexpected status code POST http://cb-example-0004.cb-example.default.svc:8091/controller/setRecoveryType 400 Bad Request: {\"otpNode\":\"invalid node name or node can't be used for delta recovery\"}","stack":"github.com/couchbase/couchbase-operator/pkg/util/couchbaseutil.Client.doRequest\n\tgithub.com/couchbase/couchbase-operator/pkg/util/couchbaseutil/core.go:240\ngithub.com/couchbase/couchbase-operator/pkg/util/couchbaseutil.(*Client).Post\n\tgithub.com/couchbase/couchbase-operator/pkg/util/couchbaseutil/core.go:302\ngithub.com/couchbase/couchbase-operator/pkg/util/couchbaseutil.(*Request).On.func1\n\tgithub.com/couchbase/couchbase-operator/pkg/util/couchbaseutil/api.go:222\ngithub.com/couchbase/couchbase-operator/pkg/util/couchbaseutil.(*Request).On\n\tgithub.com/couchbase/couchbase-operator/pkg/util/couchbaseutil/api.go:249\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*ReconcileMachine).recreateAndRebalanceNode\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/nodereconcile.go:1332\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*ReconcileMachine).handleDeltaRecovery\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/nodereconcile.go:1404\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*ReconcileMachine).handleUpgradeNode\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/nodereconcile.go:1559\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*ReconcileMachine).exec\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/nodereconcile.go:321\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).reconcileMembers\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/reconcile.go:264\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).reconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/reconcile.go:173\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).runReconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:511\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).Update\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:558\ngithub.com/couchbase/couchbase-operator/pkg/controller.(*CouchbaseClusterReconciler).Reconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/controller/controller.go:90\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}
      {"level":"info","ts":"2024-07-04T12:12:15Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"+{v2.ClusterStatus}.Conditions[?->3]:{Type:Error Status:True LastUpdateTime:2024-07-04T12:12:15Z LastTransitionTime:2024-07-04T12:12:15Z Reason:ErrorEncountered Message:request failed: unexpected status code POST http://cb-example-0004.cb-example.default.svc:8091/controller/setRecoveryType 400 Bad Request: {\"otpNode\":\"invalid node name or node can't be used for delta recovery\"}}"}
      {"level":"info","ts":"2024-07-04T12:12:15Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"-{v2.ClusterStatus}.Members.Ready[2->?]:cb-example-0002;+{v2.ClusterStatus}.Members.Unready:[cb-example-0002]"}
      {"level":"info","ts":"2024-07-04T12:12:15Z","logger":"cluster","msg":"Deleted terminated pod","cluster":"default/cb-example","name":"cb-example-0002"}
      {"level":"info","ts":"2024-07-04T12:12:16Z","logger":"cluster","msg":"Cluster status","cluster":"default/cb-example","balance":"balanced","rebalancing":false}
      {"level":"info","ts":"2024-07-04T12:12:16Z","logger":"cluster","msg":"Node status","cluster":"default/cb-example","name":"cb-example-0000","version":"7.6.1","class":"data-only","managed":true,"status":"Active"}
      {"level":"info","ts":"2024-07-04T12:12:16Z","logger":"cluster","msg":"Node status","cluster":"default/cb-example","name":"cb-example-0001","version":"7.6.1","class":"data-only","managed":true,"status":"Active"}
      {"level":"info","ts":"2024-07-04T12:12:16Z","logger":"cluster","msg":"Node status","cluster":"default/cb-example","name":"cb-example-0002","version":"7.6.1","class":"data-only","managed":true,"status":""}
      {"level":"info","ts":"2024-07-04T12:12:16Z","logger":"cluster","msg":"Node status","cluster":"default/cb-example","name":"cb-example-0003","version":"7.2.5","class":"data-only","managed":true,"status":"Active"}
      {"level":"info","ts":"2024-07-04T12:12:16Z","logger":"cluster","msg":"Node status","cluster":"default/cb-example","name":"cb-example-0004","version":"7.2.5","class":"data-only","managed":true,"status":"Active"}
      {"level":"info","ts":"2024-07-04T12:12:16Z","logger":"cluster","msg":"Node status","cluster":"default/cb-example","name":"cb-example-0005","version":"7.2.5","class":"data-only","managed":true,"status":"Active"}
      {"level":"info","ts":"2024-07-04T12:12:16Z","logger":"cluster","msg":"Node status","cluster":"default/cb-example","name":"cb-example-0006","version":"7.2.5","class":"index-query","managed":true,"status":"Active"}
      {"level":"info","ts":"2024-07-04T12:12:16Z","logger":"cluster","msg":"Node status","cluster":"default/cb-example","name":"cb-example-0007","version":"7.2.5","class":"index-query","managed":true,"status":"Active"}
      {"level":"info","ts":"2024-07-04T12:12:16Z","logger":"cluster","msg":"Node status","cluster":"default/cb-example","name":"cb-example-0008","version":"7.2.5","class":"index-query","managed":true,"status":"Active"}
      {"level":"info","ts":"2024-07-04T12:12:16Z","logger":"cluster","msg":"Node status","cluster":"default/cb-example","name":"cb-example-0009","version":"7.2.5","class":"index-query","managed":true,"status":"Active"}
      {"level":"info","ts":"2024-07-04T12:12:16Z","logger":"scheduler","msg":"Scheduler status","cluster":"default/cb-example","name":"cb-example-0000","class":"data-only","group":"us-east-2a"}
      {"level":"info","ts":"2024-07-04T12:12:16Z","logger":"scheduler","msg":"Scheduler status","cluster":"default/cb-example","name":"cb-example-0003","class":"data-only","group":"us-east-2b"}
      {"level":"info","ts":"2024-07-04T12:12:16Z","logger":"scheduler","msg":"Scheduler status","cluster":"default/cb-example","name":"cb-example-0005","class":"data-only","group":"us-east-2b"}
      {"level":"info","ts":"2024-07-04T12:12:16Z","logger":"scheduler","msg":"Scheduler status","cluster":"default/cb-example","name":"cb-example-0001","class":"data-only","group":"us-east-2c"}
      {"level":"info","ts":"2024-07-04T12:12:16Z","logger":"scheduler","msg":"Scheduler status","cluster":"default/cb-example","name":"cb-example-0004","class":"data-only","group":"us-east-2c"}
      {"level":"info","ts":"2024-07-04T12:12:16Z","logger":"scheduler","msg":"Scheduler status","cluster":"default/cb-example","name":"cb-example-0007","class":"index-query","group":"us-east-2a"}
      {"level":"info","ts":"2024-07-04T12:12:16Z","logger":"scheduler","msg":"Scheduler status","cluster":"default/cb-example","name":"cb-example-0009","class":"index-query","group":"us-east-2a"}
      {"level":"info","ts":"2024-07-04T12:12:16Z","logger":"scheduler","msg":"Scheduler status","cluster":"default/cb-example","name":"cb-example-0006","class":"index-query","group":"us-east-2b"}
      {"level":"info","ts":"2024-07-04T12:12:16Z","logger":"scheduler","msg":"Scheduler status","cluster":"default/cb-example","name":"cb-example-0008","class":"index-query","group":"us-east-2c"}
      {"level":"info","ts":"2024-07-04T12:12:17Z","logger":"cluster","msg":"Pod deleted","cluster":"default/cb-example","name":"cb-example-0002"}
      {"level":"info","ts":"2024-07-04T12:12:17Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"{v2.ClusterStatus}.Size:10->9;-{v2.ClusterStatus}.Members.Unready:[cb-example-0002]"}
      {"level":"info","ts":"2024-07-04T12:12:17Z","logger":"cluster","msg":"Pod unclustered, deleting","cluster":"default/cb-example","name":"cb-example-0002"}
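
To find the moment the cluster shrank without reading the whole operator log, the `Resource updated` entries can be filtered for a change in `{v2.ClusterStatus}.Size`. The following is a rough sketch, assuming the operator log is available as one JSON object per line. The helper name `size_changes` is made up for this example, and the regular expression is inferred only from the diff string in the log excerpt above.

```python
import json
import re

# Pattern inferred from the excerpt, e.g. "{v2.ClusterStatus}.Size:10->9".
SIZE_CHANGE = re.compile(r"\{v2\.ClusterStatus\}\.Size:(\d+)->(\d+)")


def size_changes(log_lines):
    """Yield (timestamp, old_size, new_size) for each cluster size change."""
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip elided or non-JSON lines such as "......"
        if not isinstance(entry, dict):
            continue
        match = SIZE_CHANGE.search(entry.get("diff", ""))
        if match:
            yield entry["ts"], int(match.group(1)), int(match.group(2))


sample = [
    '{"level":"info","ts":"2024-07-04T12:12:17Z","logger":"cluster","msg":"Pod deleted","cluster":"default/cb-example","name":"cb-example-0002"}',
    '{"level":"info","ts":"2024-07-04T12:12:17Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"{v2.ClusterStatus}.Size:10->9;-{v2.ClusterStatus}.Members.Unready:[cb-example-0002]"}',
    "......",
]
print(list(size_changes(sample)))  # [('2024-07-04T12:12:17Z', 10, 9)]
```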

       

      Next, the delta recovery upgrade of `cb-example-0003` starts and succeeds (in the 9-node CB cluster).

      {"level":"info","ts":"2024-07-04T12:12:17Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"{v2.ClusterStatus}.Conditions[2].LastUpdateTime:2024-07-04T10:08:37Z->2024-07-04T12:12:17Z;{v2.ClusterStatus}.Conditions[2].Message:Cluster upgrading (progress 9/10)->Cluster upgrading (progress 8/9)"}
      {"level":"info","ts":"2024-07-04T12:12:18Z","logger":"cluster","msg":"Upgrading pods with DeltaRecovery","cluster":"default/cb-example","names":["cb-example-0003"],"target-version":"7.6.1"}
      {"level":"info","ts":"2024-07-04T12:25:00Z","logger":"kubernetes","msg":"Creating pod","cluster":"default/cb-example","name":"cb-example-0003","image":"couchbase/server:7.6.1"}
       
      ......
      ......
      ......
       
      {"level":"info","ts":"2024-07-04T12:48:17Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":48.75324855812661}
      {"level":"info","ts":"2024-07-04T12:48:21Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":72.22222222222223}
      {"level":"info","ts":"2024-07-04T12:48:25Z","logger":"cluster","msg":"Rebalance completed successfully","cluster":"default/cb-example"}
      {"level":"info","ts":"2024-07-04T12:48:25Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"-{v2.ClusterStatus}.Conditions[1->?]:{Type:Balanced Status:False LastUpdateTime:2024-07-04T10:59:31Z LastTransitionTime:2024-07-04T10:59:31Z Reason:Unbalanced Message:The operator is attempting to rebalance the data to correct this issue};+{v2.ClusterStatus}.Conditions[?->1]:{Type:Balanced Status:True LastUpdateTime:2024-07-04T12:48:25Z LastTransitionTime:2024-07-04T12:48:25Z Reason:Balanced Message:Data is equally distributed across all nodes in the cluster}"}
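
The first entry in the excerpt above shows the upgrade progress message changing from `9/10` to `8/9`, which suggests the operator is now tracking the upgrade against 9 nodes rather than 10. A small sketch for pulling those numbers out of a status diff is given below. The helper name `progress_pairs` is hypothetical, and the message format is assumed from this single log line.

```python
import re

# Format assumed from the log line: "Cluster upgrading (progress 9/10)".
PROGRESS = re.compile(r"Cluster upgrading \(progress (\d+)/(\d+)\)")


def progress_pairs(diff: str) -> list[tuple[int, int]]:
    """Return every (first, second) number pair from progress messages in a diff."""
    return [(int(a), int(b)) for a, b in PROGRESS.findall(diff)]


diff = (
    "{v2.ClusterStatus}.Conditions[2].Message:"
    "Cluster upgrading (progress 9/10)->Cluster upgrading (progress 8/9)"
)
before, after = progress_pairs(diff)
print(before, after)         # (9, 10) (8, 9)
print(after[1] < before[1])  # True: the second number dropped from 10 to 9
```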

       

      Now, `cb-example-0010` (a data node with CB 7.6.1) is brought up and added to the CB cluster.

      {"level":"info","ts":"2024-07-04T12:48:25Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"+{v2.ClusterStatus}.Conditions[?->4]:{Type:Scaling Status:True LastUpdateTime:2024-07-04T12:48:25Z LastTransitionTime:2024-07-04T12:48:25Z Reason:ClusterScaling Message:The operator is attempting to scale the cluster};+{v2.ClusterStatus}.Conditions[?->5]:{Type:ScalingUp Status:True LastUpdateTime:2024-07-04T12:48:25Z LastTransitionTime:2024-07-04T12:48:25Z Reason:ScalingUp Message:Scaling Server Class data-only from 5 to 6}"}
      {"level":"info","ts":"2024-07-04T12:48:26Z","logger":"kubernetes","msg":"Creating pod","cluster":"default/cb-example","name":"cb-example-0010","image":"couchbase/server:7.6.1"}
       
      ...
      ...
      ...
       
      {"level":"info","ts":"2024-07-04T14:33:12Z","logger":"cluster","msg":"Rebalancing","cluster":"default/cb-example","progress":60}
      {"level":"info","ts":"2024-07-04T14:33:16Z","logger":"cluster","msg":"Rebalance completed successfully","cluster":"default/cb-example"}
      {"level":"info","ts":"2024-07-04T14:33:16Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"-{v2.ClusterStatus}.Conditions[1->?]:{Type:Balanced Status:False LastUpdateTime:2024-07-04T12:49:19Z LastTransitionTime:2024-07-04T12:49:19Z Reason:Unbalanced Message:The operator is attempting to rebalance the data to correct this issue};+{v2.ClusterStatus}.Conditions[?->1]:{Type:Balanced Status:True LastUpdateTime:2024-07-04T14:33:16Z LastTransitionTime:2024-07-04T14:33:16Z Reason:Balanced Message:Data is equally distributed across all nodes in the cluster}"}
      {"level":"info","ts":"2024-07-04T14:33:17Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"+{v2.ClusterStatus}.Members.Ready[?->9]:cb-example-0010;-{v2.ClusterStatus}.Members.Unready:[cb-example-0010]"}
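
From the timestamps in the logs, the time the cluster spent one node short can be worked out directly. The sketch below takes the `Size:10->9` entry (12:12:17Z) as the start and the entry marking `cb-example-0010` as Ready (14:33:17Z) as the end. Treating those two entries as the boundaries is an assumption made for this example.

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%SZ"  # timestamp format used in the operator logs

size_dropped = datetime.strptime("2024-07-04T12:12:17Z", FMT)       # Size:10->9
replacement_ready = datetime.strptime("2024-07-04T14:33:17Z", FMT)  # cb-example-0010 Ready

undersized_for = replacement_ready - size_dropped
print(undersized_for)  # 2:21:00
```

So the cluster ran with 9 nodes for about 2 hours 21 minutes, and the delta recovery upgrade of `cb-example-0003` took place inside that window.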

       

      Logs

      Pre Delta Recovery Upgrade - 7.2.5

      CAO Collect Logs: 2024-07-04A_PreDeltaNodeRestart_cbopinfo-20240704T152126+0530.tar.gz

      Couchbase Server Logs

      Supportal: http://supportal.couchbase.com/snapshot/cd71b7064848a57cafabf26334cfbfa9::0

      https://cb-engineering.s3.amazonaws.com/2024-07-04A_k8s1.28_cb7.2.5_pre_delta_upgrade/collectinfo-2024-07-04T093902-ns_1%40cb-example-0000.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-07-04A_k8s1.28_cb7.2.5_pre_delta_upgrade/collectinfo-2024-07-04T093902-ns_1%40cb-example-0001.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-07-04A_k8s1.28_cb7.2.5_pre_delta_upgrade/collectinfo-2024-07-04T093902-ns_1%40cb-example-0002.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-07-04A_k8s1.28_cb7.2.5_pre_delta_upgrade/collectinfo-2024-07-04T093902-ns_1%40cb-example-0003.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-07-04A_k8s1.28_cb7.2.5_pre_delta_upgrade/collectinfo-2024-07-04T093902-ns_1%40cb-example-0004.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-07-04A_k8s1.28_cb7.2.5_pre_delta_upgrade/collectinfo-2024-07-04T093902-ns_1%40cb-example-0005.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-07-04A_k8s1.28_cb7.2.5_pre_delta_upgrade/collectinfo-2024-07-04T093902-ns_1%40cb-example-0006.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-07-04A_k8s1.28_cb7.2.5_pre_delta_upgrade/collectinfo-2024-07-04T093902-ns_1%40cb-example-0007.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-07-04A_k8s1.28_cb7.2.5_pre_delta_upgrade/collectinfo-2024-07-04T093902-ns_1%40cb-example-0008.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-07-04A_k8s1.28_cb7.2.5_pre_delta_upgrade/collectinfo-2024-07-04T093902-ns_1%40cb-example-0009.cb-example.default.svc.zip

       

      Post Delta Recovery Upgrade - 7.6.1

      CAO Collect Logs: 2024-07-04A_PostDeltaNodeRestart_cbopinfo-20240705T020256+0530.tar.gz

      Couchbase Server Logs

      Supportal: http://supportal.couchbase.com/snapshot/cd71b7064848a57cafabf26334cfbfa9::1

      https://cb-engineering.s3.amazonaws.com/2024-07-04A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-07-04T164137-ns_1%40cb-example-0000.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-07-04A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-07-04T164137-ns_1%40cb-example-0001.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-07-04A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-07-04T164137-ns_1%40cb-example-0003.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-07-04A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-07-04T164137-ns_1%40cb-example-0004.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-07-04A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-07-04T164137-ns_1%40cb-example-0005.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-07-04A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-07-04T164137-ns_1%40cb-example-0006.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-07-04A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-07-04T164137-ns_1%40cb-example-0007.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-07-04A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-07-04T164137-ns_1%40cb-example-0008.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-07-04A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-07-04T164137-ns_1%40cb-example-0009.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-07-04A_k8s1.28_cb7.6.1_post_delta_upgrade/collectinfo-2024-07-04T164137-ns_1%40cb-example-0010.cb-example.default.svc.zip

            People

              justin.ashworth Justin Ashworth
              aryaan.bhaskar Aryaan Bhaskar