Couchbase Kubernetes / K8S-3576

During a cluster upgrade, initiating a downgrade caused the operator to repeatedly create incompatible pods with the downgrade version, resulting in a continuous loop of failed additions to the cluster.

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Fix Version: 2.8.0
    • Affects Version: 2.7.0
    • Component: operator
    • Initial Couchbase Version : 7.2.5-7596
      Upgraded Couchbase Version : 7.6.1-3200
      Kubernetes Version : v1.30.0
      CAO and operator : 2.7.0 built locally
      Environment : Kind cluster
    • 3

    Description

      Cluster Setup

      • Kind cluster running locally on a Mac
      • 3 nodes, each running all services
      • 1 bucket
      • Initial cluster version: 7.2.5
      • Upgraded cluster version: 7.6.1

      Steps taken in the scenario

      • Created a cluster.
      • Issued an upgrade from 7.2.5-7596 to 7.6.1-3200.
      • The swap rebalance upgrade starts.
      • cb-example-0000 and cb-example-0001 are replaced by cb-example-0003 and cb-example-0004.
      • While cb-example-0002 was being swap rebalanced out and replaced by cb-example-0005, issued a downgrade back to 7.2.5.
      • The upgrade completes successfully.
      • After the upgrade, the operator tries to add a 7.2.5 pod to the cluster. The addition is not allowed and fails.
      • The operator retries the same procedure indefinitely, failing each time.

       

      {"level":"info","ts":"2024-07-16T09:46:07Z","logger":"cluster","msg":"cb-example-0004"}
      {"level":"info","ts":"2024-07-16T09:46:07Z","logger":"cluster","msg":"No persistent volumes in cluster. Reverting to SwapRebalance.","cluster":"default/cb-example"}
      {"level":"info","ts":"2024-07-16T09:46:07Z","logger":"cluster","msg":"Upgrading pods with SwapRebalance","cluster":"default/cb-example","names":["cb-example-0004"],"target-version":"7.2.5"}
      {"level":"info","ts":"2024-07-16T09:46:07Z","logger":"cluster","msg":"Swap-Rebalancing pod ","cluster":"default/cb-example","name":"cb-example-0004","source-version":"7.6.1"}
      {"level":"info","ts":"2024-07-16T09:46:07Z","logger":"kubernetes","msg":"Creating pod","cluster":"default/cb-example","name":"cb-example-0015","image":"couchbase/server:7.2.5"}
      {"level":"info","ts":"2024-07-16T09:46:19Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"{v2.ClusterStatus}.Size:3->4;+{v2.ClusterStatus}.Members.Unready:[cb-example-0015]"}
      {"level":"info","ts":"2024-07-16T09:49:20Z","logger":"cluster","msg":"Pod added to cluster","cluster":"default/cb-example","name":"cb-example-0015"}
      {"level":"error","ts":"2024-07-16T09:49:20Z","logger":"cluster","msg":"Pod addition to cluster failed","cluster":"default/cb-example","pod":"cb-example-0015","error":"timeout: request failed: unexpected status code POST http://cb-example-0004.cb-example.default.svc:8091/controller/addNode 400 Bad Request: [\"This node cannot add another node ('ns_1@cb-example-0015.cb-example.default.svc') because of cluster version compatibility mismatch. Cluster works in [7,6] mode and node only supports [7,2]\"]","stacktrace":"github.com/couchbase/couchbase-operator/pkg/cluster.(*ReconcileMachine).swapRebalanceMembers\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/nodereconcile.go:1855\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*ReconcileMachine).handleUpgradeNode\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/nodereconcile.go:1587\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*ReconcileMachine).exec\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/nodereconcile.go:323\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).reconcileMembers\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/reconcile.go:266\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).reconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/reconcile.go:173\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).runReconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:544\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).Update\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:591\ngithub.com/couchbase/couchbase-operator/pkg/controller.(*CouchbaseClusterReconciler).Reconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/controller/controller.go:90\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}
      {"level":"info","ts":"2024-07-16T09:49:20Z","logger":"cluster","msg":"Reconciliation failed","cluster":"default/cb-example","error":"swap rebalance failed to add new node to cluster: timeout: request failed: unexpected status code POST http://cb-example-0004.cb-example.default.svc:8091/controller/addNode 400 Bad Request: [\"This node cannot add another node ('ns_1@cb-example-0015.cb-example.default.svc') because of cluster version compatibility mismatch. Cluster works in [7,6] mode and node only supports [7,2]\"]","stack":"github.com/couchbase/couchbase-operator/pkg/util/couchbaseutil.Client.doRequest\n\tgithub.com/couchbase/couchbase-operator/pkg/util/couchbaseutil/core.go:240\ngithub.com/couchbase/couchbase-operator/pkg/util/couchbaseutil.(*Client).Post\n\tgithub.com/couchbase/couchbase-operator/pkg/util/couchbaseutil/core.go:302\ngithub.com/couchbase/couchbase-operator/pkg/util/couchbaseutil.(*Request).On.func1\n\tgithub.com/couchbase/couchbase-operator/pkg/util/couchbaseutil/api.go:222\ngithub.com/couchbase/couchbase-operator/pkg/util/couchbaseutil.(*Request).On.func2.1\n\tgithub.com/couchbase/couchbase-operator/pkg/util/couchbaseutil/api.go:240\ngithub.com/couchbase/couchbase-operator/pkg/util/retryutil.Retry\n\tgithub.com/couchbase/couchbase-operator/pkg/util/retryutil/retryutil.go:14\ngithub.com/couchbase/couchbase-operator/pkg/util/retryutil.RetryFor\n\tgithub.com/couchbase/couchbase-operator/pkg/util/retryutil/retryutil.go:30\ngithub.com/couchbase/couchbase-operator/pkg/util/couchbaseutil.(*Request).On.func2\n\tgithub.com/couchbase/couchbase-operator/pkg/util/couchbaseutil/api.go:243\ngithub.com/couchbase/couchbase-operator/pkg/util/couchbaseutil.(*Request).On\n\tgithub.com/couchbase/couchbase-operator/pkg/util/couchbaseutil/api.go:249\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).addMembers\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/member.go:328\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*ReconcileMachine).swapRebalanceMembers\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/nodereconcile.go:1834\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*ReconcileMachine).handleUpgradeNode\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/nodereconcile.go:1587\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*ReconcileMachine).exec\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/nodereconcile.go:323\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).reconcileMembers\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/reconcile.go:266\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).reconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/reconcile.go:173\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).runReconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:544\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).Update\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:591\ngithub.com/couchbase/couchbase-operator/pkg/controller.(*CouchbaseClusterReconciler).Reconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/controller/controller.go:90\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}
      {"level":"info","ts":"2024-07-16T09:49:20Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"{v2.ClusterStatus}.Conditions[3].LastUpdateTime:2024-07-16T09:46:05Z->2024-07-16T09:49:20Z;{v2.ClusterStatus}.Conditions[3].Message:swap rebalance failed to add new node to cluster: timeout: request failed: unexpected status code POST http://cb-example-0003.cb-example.default.svc:8091/controller/addNode 400 Bad Request: [\"This node cannot add another node ('ns_1@cb-example-0014.cb-example.default.svc') because of cluster version compatibility mismatch. Cluster works in [7,6] mode and node only supports [7,2]\"]->swap rebalance failed to add new node to cluster: timeout: request failed: unexpected status code POST http://cb-example-0004.cb-example.default.svc:8091/controller/addNode 400 Bad Request: [\"This node cannot add another node ('ns_1@cb-example-0015.cb-example.default.svc') because of cluster version compatibility mismatch. Cluster works in [7,6] mode and node only supports [7,2]\"]"}
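
The repeated failure can be pulled out of the operator log mechanically. The following is a minimal sketch, assuming the log holds one JSON object per line as in the excerpt above. The helper name find_version_mismatches and the shortened sample line are illustrative only and are not part of the operator or the cao tool.

```python
import json
import re

# Matches the compatibility error returned by /controller/addNode in the log above.
MISMATCH = re.compile(
    r"Cluster works in \[(\d+),(\d+)\] mode and node only supports \[(\d+),(\d+)\]"
)


def find_version_mismatches(log_lines):
    """Return (timestamp, pod, cluster_mode, node_mode) for each failed pod addition."""
    results = []
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip wrapped or truncated lines
        if entry.get("msg") != "Pod addition to cluster failed":
            continue
        match = MISMATCH.search(entry.get("error", ""))
        if match:
            c_major, c_minor, n_major, n_minor = (int(g) for g in match.groups())
            results.append(
                (entry.get("ts"), entry.get("pod"), (c_major, c_minor), (n_major, n_minor))
            )
    return results


# Shortened copy of the error entry above (stack trace and URL omitted for brevity).
sample = [
    '{"level":"error","ts":"2024-07-16T09:49:20Z","logger":"cluster",'
    '"msg":"Pod addition to cluster failed","cluster":"default/cb-example",'
    '"pod":"cb-example-0015","error":"400 Bad Request: Cluster works in '
    '[7,6] mode and node only supports [7,2]"}',
]

for ts, pod, cluster_mode, node_mode in find_version_mismatches(sample):
    print(ts, pod, cluster_mode, node_mode)
```

Counting the returned entries per pod name gives a quick view of how many times the operator has retried with a new pod.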
      

      Issue

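      • The operator should not attempt a downgrade once the upgrade has completed successfully: the cluster then runs in [7,6] compatibility mode and rejects nodes that only support [7,2].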
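
One way to express the expected behaviour is a pre-check that runs before a pod with the target version is created. The sketch below is hypothetical and is not the operator's code: the function names are made up for illustration, and it assumes that a node's supported compatibility mode equals the major.minor of its server version, which matches the [7,6] and [7,2] values in the addNode error.

```python
def parse_major_minor(version):
    """Turn '7.6.1-3200' or '7.2.5' into a (major, minor) tuple such as (7, 6)."""
    major, minor = version.split("-")[0].split(".")[:2]
    return int(major), int(minor)


def can_add_node(cluster_compat, target_version):
    """True if a node at target_version could join a cluster in cluster_compat mode.

    Assumption: a node cannot join a cluster whose compatibility mode is newer
    than the node's own major.minor version.
    """
    return parse_major_minor(target_version) >= tuple(cluster_compat)


def plan_version_change(cluster_compat, target_version):
    """Return 'proceed' or 'reject' instead of creating a pod that can never join."""
    if can_add_node(cluster_compat, target_version):
        return "proceed"
    return "reject"


# The situation in this report: the cluster is already in [7,6] mode and the
# requested image is 7.2.5.
print(plan_version_change((7, 6), "7.2.5"))
```

With a check of this kind, the request for 7.2.5 would be rejected once, rather than producing a new pod and a failed addNode call on every reconciliation.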

      Operator logs :

      https://cb-engineering.s3.amazonaws.com/K8S-3576/cbopinfo-20240716T151927+0530.tar.gz

      Cluster logs :
      https://cb-engineering.s3.amazonaws.com/K8S-3576/collectinfo-2024-07-16T095145-ns_1%40cb-example-0003.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/K8S-3576/collectinfo-2024-07-16T095145-ns_1%40cb-example-0004.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/K8S-3576/collectinfo-2024-07-16T095145-ns_1%40cb-example-0005.cb-example.default.svc.zip


      The cao tool and operator images were built locally from this commit:

      commit e00cf70597dbc0a7422c82f0efd0a1a28f75bfcd (HEAD -> master, origin/master, origin/HEAD)
      Author: usamah jassat <usamah.jassat@couchbase.com>
      Date:   Thu Jul 11 15:55:19 2024 +0100

          K8S-3564: fix TestServerGroupRescheduling when more SGs

          Change-Id: I13dabc775ad8f47e6f9f89b3445a19a4dd28112e
          Reviewed-on: https://review.couchbase.org/c/couchbase-operator/+/212585
          Reviewed-by: Justin Ashworth <justin.ashworth@couchbase.com>
          Tested-by: Build Bot <build@couchbase.com>


          People

            usamah.jassat Usamah Jassat
            raghav.sk Raghav S K