  1. Couchbase Kubernetes
  2. K8S-3584

Operator repeatedly attempts to add a pod that is already part of the cluster during upgrade

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • 2.8.0
    • 2.7.0
    • operator
    • Initial Couchbase Version : 7.6.0-2176
      Upgrade Couchbase Version : 7.6.1-3200
      Kubernetes Version : v1.30.0
      CAO and operator : 2.7.0 built locally
      Environment : Kind cluster
    • 15 - First Frontier, 16 - Killing Time, 17 - Timetrap
    • 2

    Description

      Cluster Setup

      • Kind cluster locally run on Mac
      • 5 nodes with all services
      • 2 buckets
      • Cluster version : 7.6.0-2176
      • Upgrade version : 7.6.1-3200

      Steps taken in the scenario

      • Created a cluster
      • Created 2 buckets
      • Changed the storage backend of one of the buckets from couchstore to magma
      • After the first pod had been swap rebalanced for the migration, started an upgrade
      • The upgrade was started before the migration had fully completed
      • From then on, migration and upgrade were completed in a single swap rebalance operation for each pod (tracked in K8S-3583)
      • After this, the operator repeatedly tried to add a pod that was already part of the cluster, and the operation failed each time

      {"level":"error","ts":"2024-07-17T14:48:03Z","logger":"cluster","msg":"Pod addition to cluster failed","cluster":"default/cb-example","pod":"cb-example-0008","error":"timeout: request failed: unexpected status code POST http://cb-example-0007.cb-example.default.svc:8091/controller/addNode 400 Bad Request: [\"Prepare join failed. Node is already part of cluster.\"]","stacktrace":"github.com/couchbase/couchbase-operator/pkg/cluster.(*ReconcileMachine).swapRebalanceMembers\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/nodereconcile.go:1855\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*ReconcileMachine).handleUpgradeNode\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/nodereconcile.go:1587\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*ReconcileMachine).exec\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/nodereconcile.go:323\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).reconcileMembers\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/reconcile.go:266\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).reconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/reconcile.go:173\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).runReconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:544\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).Update\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:591\ngithub.com/couchbase/couchbase-operator/pkg/controller.(*CouchbaseClusterReconciler).Reconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/controller/controller.go:90\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}
      {"level":"info","ts":"2024-07-17T14:48:03Z","logger":"cluster","msg":"Reconciliation failed","cluster":"default/cb-example","error":"swap rebalance failed to add new node to cluster: timeout: request failed: unexpected status code POST http://cb-example-0007.cb-example.default.svc:8091/controller/addNode 400 Bad Request: [\"Prepare join failed. Node is already part of cluster.\"]","stack":"github.com/couchbase/couchbase-operator/pkg/util/couchbaseutil.Client.doRequest\n\tgithub.com/couchbase/couchbase-operator/pkg/util/couchbaseutil/core.go:240\ngithub.com/couchbase/couchbase-operator/pkg/util/couchbaseutil.(*Client).Post\n\tgithub.com/couchbase/couchbase-operator/pkg/util/couchbaseutil/core.go:302\ngithub.com/couchbase/couchbase-operator/pkg/util/couchbaseutil.(*Request).On.func1\n\tgithub.com/couchbase/couchbase-operator/pkg/util/couchbaseutil/api.go:222\ngithub.com/couchbase/couchbase-operator/pkg/util/couchbaseutil.(*Request).On.func2.1\n\tgithub.com/couchbase/couchbase-operator/pkg/util/couchbaseutil/api.go:240\ngithub.com/couchbase/couchbase-operator/pkg/util/retryutil.Retry\n\tgithub.com/couchbase/couchbase-operator/pkg/util/retryutil/retryutil.go:14\ngithub.com/couchbase/couchbase-operator/pkg/util/retryutil.RetryFor\n\tgithub.com/couchbase/couchbase-operator/pkg/util/retryutil/retryutil.go:30\ngithub.com/couchbase/couchbase-operator/pkg/util/couchbaseutil.(*Request).On.func2\n\tgithub.com/couchbase/couchbase-operator/pkg/util/couchbaseutil/api.go:243\ngithub.com/couchbase/couchbase-operator/pkg/util/couchbaseutil.(*Request).On\n\tgithub.com/couchbase/couchbase-operator/pkg/util/couchbaseutil/api.go:249\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).addMembers\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/member.go:328\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*ReconcileMachine).swapRebalanceMembers\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/nodereconcile.go:1834\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*ReconcileMachine).handleUpgradeNode\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/nodereconcile.go:1587\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*ReconcileMachine).exec\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/nodereconcile.go:323\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).reconcileMembers\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/reconcile.go:266\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).reconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/reconcile.go:173\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).runReconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:544\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).Update\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:591\ngithub.com/couchbase/couchbase-operator/pkg/controller.(*CouchbaseClusterReconciler).Reconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/controller/controller.go:90\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}
      {"level":"info","ts":"2024-07-17T14:48:03Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"{v2.ClusterStatus}.Conditions[3].LastUpdateTime:2024-07-17T14:42:58Z->2024-07-17T14:48:03Z;{v2.ClusterStatus}.Conditions[3].Message:failed to rebalance: timeout: unexpected rebalance error->swap rebalance failed to add new node to cluster: timeout: request failed: unexpected status code POST http://cb-example-0007.cb-example.default.svc:8091/controller/addNode 400 Bad Request: [\"Prepare join failed. Node is already part of cluster.\"]"}
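      The failure above suggests the add step is not idempotent: once the pod has joined, every retry of /controller/addNode returns the same 400. One possible guard is sketched below in Go. This is only an illustration under stated assumptions, not the operator's actual code: the function names shouldAddNode and isAlreadyMember and the shape of the member list are hypothetical, and only the error text and host names are taken from the log.

```go
package main

import (
	"fmt"
	"strings"
)

// alreadyMemberMsg is the error text returned by /controller/addNode in the log above.
const alreadyMemberMsg = "Node is already part of cluster"

// shouldAddNode is a hypothetical pre-check: it returns false when the
// candidate host name already appears in the current cluster member list,
// so the caller can skip the addNode request entirely.
func shouldAddNode(hostname string, members []string) bool {
	for _, m := range members {
		if m == hostname {
			return false
		}
	}
	return true
}

// isAlreadyMember is a hypothetical fallback: it reports whether a failed
// addNode response means the node has in fact already joined, so the caller
// could treat the add as done instead of failing the reconcile.
func isAlreadyMember(statusCode int, body string) bool {
	return statusCode == 400 && strings.Contains(body, alreadyMemberMsg)
}

func main() {
	members := []string{
		"cb-example-0007.cb-example.default.svc",
		"cb-example-0008.cb-example.default.svc",
	}
	// cb-example-0008 is already a member, so no add should be attempted.
	fmt.Println(shouldAddNode("cb-example-0008.cb-example.default.svc", members)) // false

	// The 400 body from the log is recognised as "already joined".
	body := `["Prepare join failed. Node is already part of cluster."]`
	fmt.Println(isAlreadyMember(400, body)) // true
}
```

      Whether the real fix belongs in a membership pre-check, in the error handling, or in the upgrade state tracking is not established by this report.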

      The operator also reports the rebalance as failed because the join failed:

      {"level":"info","ts":"2024-07-17T15:06:22Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"{v2.ClusterStatus}.Conditions[3].LastUpdateTime:2024-07-17T14:48:03Z->2024-07-17T15:06:22Z;{v2.ClusterStatus}.Conditions[3].Message:swap rebalance failed to add new node to cluster: timeout: request failed: unexpected status code POST http://cb-example-0007.cb-example.default.svc:8091/controller/addNode 400 Bad Request: [\"Prepare join failed. Node is already part of cluster.\"]->failed to rebalance: timeout: unexpected rebalance error"} 
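      To confirm from the operator logs how often this happens and for which pods, the entries can be scanned for the "already part of cluster" error. The Go sketch below is a hypothetical triage helper, not part of the operator or the cao tool. It assumes the log is a stream of valid JSON objects with the level, msg, pod and error fields seen above; the sample input is a trimmed copy of those entries with the stack traces dropped.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// logEntry holds the subset of operator log fields used for triage.
type logEntry struct {
	Level string `json:"level"`
	Msg   string `json:"msg"`
	Pod   string `json:"pod"`
	Error string `json:"error"`
}

// countAlreadyMemberFailures returns, per pod, the number of log entries
// whose error mentions that the node is already part of the cluster.
// Entries without a pod field are ignored. Decoding stops at the first
// malformed object or at end of input.
func countAlreadyMemberFailures(logs string) map[string]int {
	counts := make(map[string]int)
	dec := json.NewDecoder(strings.NewReader(logs))
	for {
		var e logEntry
		if err := dec.Decode(&e); err != nil {
			break // io.EOF or invalid JSON
		}
		if e.Pod != "" && strings.Contains(e.Error, "Node is already part of cluster") {
			counts[e.Pod]++
		}
	}
	return counts
}

func main() {
	// Trimmed sample based on the entries above (stack traces removed).
	sample := `{"level":"error","msg":"Pod addition to cluster failed","pod":"cb-example-0008","error":"400 Bad Request: [\"Prepare join failed. Node is already part of cluster.\"]"}
{"level":"info","msg":"Reconciliation failed","error":"swap rebalance failed to add new node to cluster"}`

	counts := countAlreadyMemberFailures(sample)
	fmt.Println(counts["cb-example-0008"]) // 1
}
```

      A json.Decoder is used rather than line splitting so that entries written back to back without a newline are still read.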

       


      Operator logs:

      https://cb-engineering.s3.amazonaws.com/K8S-3583/cbopinfo-20240717T213656+0530.tar.gz

      Cluster logs:
      https://cb-engineering.s3.amazonaws.com/K8S-3583/collectinfo-2024-07-17T160636-ns_1%40cb-example-0006.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/K8S-3583/collectinfo-2024-07-17T160636-ns_1%40cb-example-0007.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/K8S-3583/collectinfo-2024-07-17T160636-ns_1%40cb-example-0008.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/K8S-3583/collectinfo-2024-07-17T160636-ns_1%40cb-example-0009.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/K8S-3583/collectinfo-2024-07-17T160636-ns_1%40cb-example-0010.cb-example.default.svc.zip


      The cao tool and operator images were built locally from this commit:

      commit e00cf70597dbc0a7422c82f0efd0a1a28f75bfcd (HEAD -> master, origin/master, origin/HEAD)
      Author: usamah jassat <usamah.jassat@couchbase.com>
      Date:   Thu Jul 11 15:55:19 2024 +0100

          K8S-3564: fix TestServerGroupRescheduling when more SGs

          Change-Id: I13dabc775ad8f47e6f9f89b3445a19a4dd28112e
          Reviewed-on: https://review.couchbase.org/c/couchbase-operator/+/212585
          Reviewed-by: Justin Ashworth <justin.ashworth@couchbase.com>
          Tested-by: Build Bot <build@couchbase.com>

          People

            raghav.sk Raghav S K