Couchbase Kubernetes / K8S-3575

Rebalance repeatedly fails due to timeouts, causing an infinite retry loop with no break point to halt the process.

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • 2.8.0
    • 2.7.0
    • operator
    • Couchbase Version : 7.6.0-2176
      Kubernetes Version : v1.30.0
      CAO and operator : 2.7.0 built locally
      Environment : Kind cluster
    • 19 - A Rock and a Hard Place
    • 1

    Description

      Cluster Setup

      • Kind cluster running locally on a Mac
      • 3 nodes with all services
      • 1 bucket

      Steps taken in the scenario

      • Created a cluster
      • On one pod, ran a bash script to kill memcached in a loop
      • The node is failed over in the cluster, and delta recovery rebalances fail continuously, as expected.
      • Stopped the memcached kill loop
      • After this, the rebalance still fails again and again, now because of a problem with the eventing service.

      The Couchbase Server issues are tracked under MB-62725 and MB-62724.

      Issue

      • The rebalance fails continuously, in a loop, for 2+ hours because of timeouts with the eventing service.
      • When a rebalance keeps failing with the same error, there should be a break point that stops the loop, so that the operator does not retry the rebalance indefinitely.
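
      The requested break point could look roughly like the sketch below. It is illustrative only and is not the operator's code: the names RebalanceBreaker, rebalance_with_breaker, and max_same_error are hypothetical, and the threshold value is an assumption. The idea is to count consecutive failures that report the same error and to stop retrying once that count reaches a limit.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class RebalanceBreaker:
    """Hypothetical break point: trips after N consecutive identical errors."""

    max_same_error: int = 5  # assumed threshold, not a real operator setting
    last_error: Optional[str] = None
    same_error_count: int = 0

    def record(self, error: str) -> bool:
        """Record a failed attempt; return True if another retry is allowed."""
        if error == self.last_error:
            self.same_error_count += 1
        else:
            # A different error starts a new streak.
            self.last_error = error
            self.same_error_count = 1
        return self.same_error_count < self.max_same_error


def rebalance_with_breaker(
    attempt: Callable[[], Optional[str]],
    breaker: RebalanceBreaker,
    max_attempts: int = 100,
) -> Tuple[str, int]:
    """Run attempt() until it succeeds, the breaker trips, or attempts run out.

    attempt() stands in for one rebalance try: it returns None on success
    or an error string on failure. Returns (status, attempts_made).
    """
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        error = attempt()
        if error is None:
            return "rebalanced", attempts
        if not breaker.record(error):
            # Same error too many times in a row: halt instead of looping.
            return "halted", attempts
    return "halted", attempts
```

      With a limit of 3, an attempt that always reports the same eventing timeout is halted after the third try instead of looping for hours. Whether the limit should be a count, a time window, or a user-configurable setting is left open here.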

       


      Cluster logs
      https://cb-engineering.s3.amazonaws.com/MB-62724/collectinfo-2024-07-15T093417-ns_1%40cb-example-0000.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/MB-62724/collectinfo-2024-07-15T093417-ns_1%40cb-example-0001.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/MB-62724/collectinfo-2024-07-15T093417-ns_1%40cb-example-0002.cb-example.default.svc.zip

      Operator logs
      https://cb-engineering.s3.amazonaws.com/MB-62724/cbopinfo-20240715T143931+0530.tar.gz

       


      The cao tool and operator images were built locally from this commit:

       

      commit 127d1f23932294386bf0375be927758a8dee282c (HEAD -> master, origin/master, origin/HEAD)
      Author: usamah jassat <usamah.jassat@couchbase.com>
      Date:   Mon Jul 1 18:24:20 2024 +0100

          K8S-3417: Allow rescheduling to different AZ

          Change-Id: I4194d211dabd7bb680a61930b5ac4d63ab4996f1
          Reviewed-on: https://review.couchbase.org/c/couchbase-operator/+/212115
          Reviewed-by: Justin Ashworth <justin.ashworth@couchbase.com>
          Tested-by: Build Bot <build@couchbase.com>

       


          People

            justin.ashworth Justin Ashworth
            raghav.sk Raghav S K