  1. Couchbase Kubernetes
  2. K8S-3528

[CAO 2.6.4-121] Rebalance starts before node added back to cluster

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • 2.8.0
    • 2.6.4
    • operator
    • None
    • 18 -Lost to Eternity
    • 1

    Description

      Swap Rebalance Upgrade

      Kubernetes Version 1.25
      Couchbase Server 7.2.5 (Pre Upgrade) → 7.6.1 (Post Upgrade)
      Operator 2.6.4-121

      Cluster Setup

      • Each node is an m5.4xlarge instance (16 vCPUs and 64GB RAM).
      • 6 Data Service nodes and 4 Index Service & Query Service nodes.
      • 10 Buckets (with 1 replica), Full Eviction and Auto-failover set to 5s.
      • 115GB data per bucket → ~1.15TB data loaded onto cluster before beginning of upgrade.
      • 50 Primary Indexes with 1 Replica each. (Total 100 Indexes with Index Storage: Plasma)
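As a quick sanity check on the totals quoted in the setup list, the arithmetic can be written out as below. This is only a sketch: the variable names are illustrative, decimal units (1 TB = 1000 GB) are assumed, and nothing is assumed about whether the per-bucket figure includes replica data.

```python
# Figures taken from the cluster setup list above; names are illustrative.
buckets = 10
data_per_bucket_gb = 115
primary_indexes = 50
replicas_per_index = 1

# Total data loaded before the upgrade, in GB and in decimal TB.
total_data_gb = buckets * data_per_bucket_gb
total_data_tb = total_data_gb / 1000

# Each primary index plus its replicas counts as a separate index.
total_indexes = primary_indexes * (1 + replicas_per_index)

print(f"{total_data_gb} GB (~{total_data_tb} TB), {total_indexes} indexes")
```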

      Upgrade Process

      • SwapRebalance Upgrade to update Couchbase Server from 7.2.5 to 7.6.1.
      • Continuous query and data workload on the buckets during the update process.
      • Around 35-50% CPU load on all servers during the upgrade.
      • Node Restarts

       

      Two nodes, cb-example-0000 and cb-example-0001, had already been swap-rebalance upgraded and replaced by cb-example-0010 and cb-example-0011.

      The third Data Service node, cb-example-0002, was being upgraded and ejected while cb-example-0012 was being added to the cluster, and a rebalance was initiated.

      I restarted the physical K8s node hosting cb-example-0012 during the rebalance. This caused the rebalance to fail, which was expected.

       

      {"level":"info","ts":"2024-05-31T14:09:10Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"+{v2.ClusterStatus}.Conditions[?->5]:{Type:Error Status:True LastUpdateTime:2024-05-31T14:09:10Z LastTransitionTime:2024-05-31T14:09:10Z Reason:ErrorEncountered Message:failed to rebalance: timeout: unexpected rebalance error}"}
      {"level":"info","ts":"2024-05-31T14:09:10Z","logger":"cluster","msg":"Cluster status","cluster":"default/cb-example","balance":"unbalanced","rebalancing":false} 

      Next, while cb-example-0012 had not yet rejoined the cluster, another rebalance was initiated, which eventually failed.

       

      {"level":"info","ts":"2024-05-31T14:10:12Z","logger":"cluster","msg":"reconciler","clustered":["cb-example-0003","cb-example-0004","cb-example-0005","cb-example-0006","cb-example-0007","cb-example-0008","cb-example-0009","cb-example-0010","cb-example-0011","cb-example-0012"],"running":["cb-example-0003","cb-example-0004","cb-example-0005","cb-example-0006","cb-example-0007","cb-example-0008","cb-example-0009","cb-example-0010","cb-example-0011","cb-example-0012"],"eject":["cb-example-0002"],"unclustered":[],"rebalance":true}
      {"level":"info","ts":"2024-05-31T14:10:12Z","logger":"cluster","msg":"External address collection failed","cluster":"default/cb-example","name":"cb-example-0012"}
      {"level":"info","ts":"2024-05-31T14:10:15Z","logger":"cluster","msg":"Pod add-back failed, forcing full recovery","cluster":"default/cb-example"}
      {"level":"info","ts":"2024-05-31T14:10:15Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"{v2.ClusterStatus}.Conditions[5].LastUpdateTime:2024-05-31T14:09:10Z->2024-05-31T14:10:15Z;{v2.ClusterStatus}.Conditions[5].Message:failed to rebalance: timeout: unexpected rebalance error->reconcile was blocked from running: rebalance failed, forcing full recovery"}
       
      {"level":"info","ts":"2024-05-31T14:11:18Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"{v2.ClusterStatus}.Conditions[5].LastUpdateTime:2024-05-31T14:11:00Z->2024-05-31T14:11:18Z;{v2.ClusterStatus}.Conditions[5].Message:failed to rebalance: timeout: unexpected rebalance error->reconcile was blocked from running: recovering pending add node cb-example-0012"}
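The operator log lines above are JSON objects, so the sequence of events can be condensed into a short timeline when triaging. The following is a minimal sketch, not part of any Couchbase tooling: the function name summarize_operator_log and the choice of fields to summarize are assumptions, and the sample reconciler line is trimmed (the "clustered" and "running" lists are omitted) to keep it short.

```python
import json

def summarize_operator_log(lines):
    """Return (timestamp, message, detail) tuples for each JSON log line."""
    events = []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        detail = ""
        if entry.get("msg") == "reconciler":
            # Show whether the operator decided to rebalance and what it ejects.
            detail = f"rebalance={entry.get('rebalance')} eject={entry.get('eject')}"
        elif "Message:" in entry.get("diff", ""):
            # Keep only the newest condition message from the status diff.
            message = entry["diff"].split("Message:", 1)[1]
            detail = message.rsplit("->", 1)[-1].rstrip("}")
        elif "name" in entry:
            detail = entry["name"]
        events.append((entry.get("ts"), entry.get("msg"), detail))
    return events

# Sample lines copied from the logs above; the first one is trimmed.
SAMPLE = [
    '{"level":"info","ts":"2024-05-31T14:10:12Z","logger":"cluster","msg":"reconciler","eject":["cb-example-0002"],"unclustered":[],"rebalance":true}',
    '{"level":"info","ts":"2024-05-31T14:10:12Z","logger":"cluster","msg":"External address collection failed","cluster":"default/cb-example","name":"cb-example-0012"}',
    '{"level":"info","ts":"2024-05-31T14:10:15Z","logger":"cluster","msg":"Pod add-back failed, forcing full recovery","cluster":"default/cb-example"}',
    '{"level":"info","ts":"2024-05-31T14:10:15Z","logger":"cluster","msg":"Resource updated","cluster":"default/cb-example","diff":"{v2.ClusterStatus}.Conditions[5].LastUpdateTime:2024-05-31T14:09:10Z->2024-05-31T14:10:15Z;{v2.ClusterStatus}.Conditions[5].Message:failed to rebalance: timeout: unexpected rebalance error->reconcile was blocked from running: rebalance failed, forcing full recovery"}',
]

for ts, msg, detail in summarize_operator_log(SAMPLE):
    print(ts, "|", msg, "|", detail)
```

Read this way, the reconciler entry at 14:10:12Z shows a rebalance being requested in the same second that external address collection for cb-example-0012 fails, which is the ordering this issue is about.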
      

       

       

      After cb-example-0012 came back, another rebalance started and completed successfully.

      The full task details and logs are in the comments of K8S-3487.

      Logs:

      CAO Collect: 2024-05-31A_DuringUpgrade_AfterKVRestart_cbopinfo-20240531T212108+0530.tar.gz

      Supportal: http://supportal.couchbase.com/snapshot/7476142a9b4960384458d1bb417decbe::1

      https://cb-engineering.s3.amazonaws.com/2024-05-31A_K8s1.25_CB7.2.5_DurSwapRebUpgrade/collectinfo-2024-05-31T155058-ns_1%40cb-example-0003.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-05-31A_K8s1.25_CB7.2.5_DurSwapRebUpgrade/collectinfo-2024-05-31T155058-ns_1%40cb-example-0004.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-05-31A_K8s1.25_CB7.2.5_DurSwapRebUpgrade/collectinfo-2024-05-31T155058-ns_1%40cb-example-0005.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-05-31A_K8s1.25_CB7.2.5_DurSwapRebUpgrade/collectinfo-2024-05-31T155058-ns_1%40cb-example-0006.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-05-31A_K8s1.25_CB7.2.5_DurSwapRebUpgrade/collectinfo-2024-05-31T155058-ns_1%40cb-example-0007.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-05-31A_K8s1.25_CB7.2.5_DurSwapRebUpgrade/collectinfo-2024-05-31T155058-ns_1%40cb-example-0008.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-05-31A_K8s1.25_CB7.2.5_DurSwapRebUpgrade/collectinfo-2024-05-31T155058-ns_1%40cb-example-0009.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-05-31A_K8s1.25_CB7.2.5_DurSwapRebUpgrade/collectinfo-2024-05-31T155058-ns_1%40cb-example-0010.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-05-31A_K8s1.25_CB7.2.5_DurSwapRebUpgrade/collectinfo-2024-05-31T155058-ns_1%40cb-example-0011.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/2024-05-31A_K8s1.25_CB7.2.5_DurSwapRebUpgrade/collectinfo-2024-05-31T155058-ns_1%40cb-example-0012.cb-example.default.svc.zip
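The collectinfo links above follow one pattern per pod (cb-example-0003 through cb-example-0012), so the list can be regenerated for scripting rather than copied by hand. This is a small sketch: the helper names are hypothetical, and it only builds the URL strings. Whether the bucket is reachable without credentials is not something the issue states.

```python
# Base path and timestamp copied from the links above.
BASE = "https://cb-engineering.s3.amazonaws.com/2024-05-31A_K8s1.25_CB7.2.5_DurSwapRebUpgrade"
STAMP = "2024-05-31T155058"

def collectinfo_url(pod_index):
    """Build the collectinfo archive URL for one pod, e.g. pod_index=3 -> cb-example-0003."""
    pod = f"cb-example-{pod_index:04d}"
    # "%40" is the URL-encoded "@" in the node name ns_1@<pod>.cb-example.default.svc
    return f"{BASE}/collectinfo-{STAMP}-ns_1%40{pod}.cb-example.default.svc.zip"

def collectinfo_urls(first=3, last=12):
    """Return the URLs for pods first..last inclusive."""
    return [collectinfo_url(i) for i in range(first, last + 1)]

for url in collectinfo_urls():
    print(url)
```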

       

            People

              usamah.jassat Usamah Jassat
              aryaan.bhaskar Aryaan Bhaskar