Couchbase Server: MB-60849

[Upgrade] : CBAS nodes are in bad state post upgrades and rebalances


Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • Fix Version/s: 7.6.0
    • Affects Version/s: 7.6.0
    • Component/s: analytics
    • Environment:
      Operating System : Debian
      Initial Version : Couchbase Enterprise Edition build 7.1.1-3175
      Upgrade Version : Couchbase Enterprise Edition build 7.6.0-2149

    Description

      Steps to reproduce

      1. Created a cluster on Couchbase Enterprise Edition build 7.1.1-3175 with the following setup 
        1. 172.23.122.233 - cbas
        2. 172.23.122.222 - index, kv, n1ql
        3. 172.23.122.195 - index, kv, n1ql
        4. 172.23.122.207 - cbas 
        5. 172.23.122.232 - cbas 
      2. Created a bucket called "bucket-0" 
      3. Loaded 10000 items onto it
      4. Created dataverses, links, datasets, synonyms, indexes 
      5. Upgraded the whole cluster to 7.6.0-2149 by swap rebalancing
      6. The cluster at the end of upgrade is
        1. 172.23.122.194 - index, kv, n1ql
        2. 172.23.122.222 - cbas
        3. 172.23.122.195 - cbas
        4. 172.23.122.207 - index, kv, n1ql
        5. 172.23.122.232 - cbas 
      7. Started a rebalance post upgrade - Rebalance succeeds
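The two node lists above can be compared mechanically to see which nodes were swapped and which changed services. The following is a minimal sketch that copies the topology exactly as written in this report; the function name `diff_topology` is hypothetical and not part of TAF or Couchbase.

```python
# Topology before and after the upgrade, copied from the steps above.
before = {
    "172.23.122.233": {"cbas"},
    "172.23.122.222": {"index", "kv", "n1ql"},
    "172.23.122.195": {"index", "kv", "n1ql"},
    "172.23.122.207": {"cbas"},
    "172.23.122.232": {"cbas"},
}
after = {
    "172.23.122.194": {"index", "kv", "n1ql"},
    "172.23.122.222": {"cbas"},
    "172.23.122.195": {"cbas"},
    "172.23.122.207": {"index", "kv", "n1ql"},
    "172.23.122.232": {"cbas"},
}


def diff_topology(before, after):
    """Return hosts removed, hosts added, and hosts whose services changed."""
    removed = sorted(set(before) - set(after))
    added = sorted(set(after) - set(before))
    changed = {
        host: (sorted(before[host]), sorted(after[host]))
        for host in sorted(set(before) & set(after))
        if before[host] != after[host]
    }
    return {"removed": removed, "added": added, "changed": changed}


result = diff_topology(before, after)
print("removed:", result["removed"])
print("added:  ", result["added"])
for host, (old, new) in result["changed"].items():
    print(f"{host}: {old} -> {new}")
```

On the data as listed, three of the five hosts end up running a different service set than they started with, including 172.23.122.195, which is the node named as missing in the error log. This is an observation from the listed topology, not a diagnosis.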

      After that, requests to the Analytics service are rejected with HTTP 503.

      Logs 172.23.122.222 - ns_server.analytics_access.log

      172.23.106.205 - Administrator [19/Feb/2024:02:16:50 -0800] "GET /analytics/cluster HTTP/1.1" 503 0 - "python-requests/2.24.0"
      172.23.106.205 - Administrator [19/Feb/2024:02:16:52 -0800] "POST /analytics/service HTTP/1.1" 503 448 - "python-requests/2.24.0"
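To pick out the rejected requests from a larger access log, the entries can be parsed with a regular expression. This is a rough sketch: the pattern is inferred only from the two sample entries above, not from a published log format, and the helper names `parse_access_log` and `rejected` are hypothetical.

```python
import re

# One access-log entry, as inferred from the two samples in this report.
ENTRY = re.compile(
    r'(?P<client>\d+\.\d+\.\d+\.\d+) - (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+) - "(?P<agent>[^"]*)"'
)


def parse_access_log(text):
    """Return one dict per entry; finditer also copes with entries that
    share a line, since it scans for each match in turn."""
    return [m.groupdict() for m in ENTRY.finditer(text)]


def rejected(entries):
    """Keep only entries answered with HTTP 503 (Service Unavailable)."""
    return [e for e in entries if e["status"] == "503"]


sample = (
    '172.23.106.205 - Administrator [19/Feb/2024:02:16:50 -0800] '
    '"GET /analytics/cluster HTTP/1.1" 503 0 - "python-requests/2.24.0"'
    '172.23.106.205 - Administrator [19/Feb/2024:02:16:52 -0800] '
    '"POST /analytics/service HTTP/1.1" 503 448 - "python-requests/2.24.0"'
)

for entry in rejected(parse_access_log(sample)):
    print(entry["time"], entry["method"], entry["path"], entry["status"])
```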

      Observing on 172.23.122.222 - ns_server.analytics_error.log

      2024-02-19T02:16:22.512-08:00 ERRO CBAS.rebalance.Rebalance [Rebalancer (3872218fb1c66d12a827c641947edc1f)] Rebalance 3872218fb1c66d12a827c641947edc1f failed
      java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [172.23.122.195:8091 (4d1ac180b136386c3a43feb317990027)], state: UNUSABLE)
          at com.couchbase.analytics.control.rebalance.Rebalance.ensureNodesClusterActive(Rebalance.java:580) ~[cbas-server-7.6.0-2149.jar:7.6.0-2149]
          at com.couchbase.analytics.control.rebalance.Rebalance.adjustClusterBeforeRebalance(Rebalance.java:748) ~[cbas-server-7.6.0-2149.jar:7.6.0-2149]
          at com.couchbase.analytics.control.rebalance.Rebalance.doRebalance(Rebalance.java:240) ~[cbas-server-7.6.0-2149.jar:7.6.0-2149]
          at com.couchbase.analytics.control.rebalance.Rebalance.doCall(Rebalance.java:201) [cbas-server-7.6.0-2149.jar:7.6.0-2149]
          at com.couchbase.analytics.control.rebalance.Rebalance.doCall(Rebalance.java:88) [cbas-server-7.6.0-2149.jar:7.6.0-2149]
          at com.couchbase.analytics.util.WriteLockCallable.call(WriteLockCallable.java:40) [cbas-common-7.6.0-2149.jar:7.6.0-2149]
          at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
          at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
          at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
          at java.base/java.lang.Thread.run(Thread.java:840) [?:?]
      2024-02-19T02:16:23.216-08:00 ERRO CBAS.servlet.RebalanceServlet [HttpExecutor(port:9111)-4] Rebalance 3872218fb1c66d12a827c641947edc1f failed
      java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [172.23.122.195:8091 (4d1ac180b136386c3a43feb317990027)], state: UNUSABLE)
          at com.couchbase.analytics.control.rebalance.Rebalance.ensureNodesClusterActive(Rebalance.java:580) ~[cbas-server-7.6.0-2149.jar:7.6.0-2149]
          at com.couchbase.analytics.control.rebalance.Rebalance.adjustClusterBeforeRebalance(Rebalance.java:748) ~[cbas-server-7.6.0-2149.jar:7.6.0-2149]
          at com.couchbase.analytics.control.rebalance.Rebalance.doRebalance(Rebalance.java:240) ~[cbas-server-7.6.0-2149.jar:7.6.0-2149]
          at com.couchbase.analytics.control.rebalance.Rebalance.doCall(Rebalance.java:201) ~[cbas-server-7.6.0-2149.jar:7.6.0-2149]
          at com.couchbase.analytics.control.rebalance.Rebalance.doCall(Rebalance.java:88) ~[cbas-server-7.6.0-2149.jar:7.6.0-2149]
          at com.couchbase.analytics.util.WriteLockCallable.call(WriteLockCallable.java:40) ~[cbas-common-7.6.0-2149.jar:7.6.0-2149]
          at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
          at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
          at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
          at java.base/java.lang.Thread.run(Thread.java:840) [?:?]
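When triaging several cbcollect bundles, it can help to extract the missing node and cluster state from this error automatically. The sketch below assumes the message wording shown in the log above (taken from this one build; other builds may differ), and `missing_nodes` is a hypothetical helper name.

```python
import re

# Message format inferred from the single sample in this report.
TIMEOUT = re.compile(
    r"timed out waiting for all nodes to join & cluster active "
    r"\(missing nodes: \[(?P<nodes>[^\]]*)\], state: (?P<state>\w+)\)"
)
# Each missing node appears as "host:port (node-id)".
NODE = re.compile(r"(?P<host>[\d.]+):(?P<port>\d+) \((?P<node_id>[0-9a-f]+)\)")


def missing_nodes(log_text):
    """Return unique (state, host, port, node_id) tuples in first-seen order."""
    seen = []
    for m in TIMEOUT.finditer(log_text):
        for n in NODE.finditer(m.group("nodes")):
            item = (
                m.group("state"),
                n.group("host"),
                int(n.group("port")),
                n.group("node_id"),
            )
            if item not in seen:
                seen.append(item)
    return seen


sample = (
    "java.lang.IllegalStateException: timed out waiting for all nodes to join "
    "& cluster active (missing nodes: [172.23.122.195:8091 "
    "(4d1ac180b136386c3a43feb317990027)], state: UNUSABLE)"
)

for state, host, port, node_id in missing_nodes(sample):
    print(f"state={state} missing={host}:{port} id={node_id}")
```

Because the same failure is logged twice here (once by the rebalancer and once by the servlet), the helper de-duplicates so each missing node is reported once.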
       

       

      We would also like to understand how the rebalance was reported as successful when the logs above show it failing.

      Marking this as a blocker because it affects all upgrade tests involving cbas, irrespective of the initial version. It is a regression: this behaviour was not observed during runs for RC4 (7.6.0-2119).

       


      TAF Script to reproduce

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /data/workspace/debian-p0-analytics-vset00-00-analytics_upgrade_from_7.1.1_with_collections/testexec.5802.ini -p GROUP=7_1_1;online_upgrade,kv_quota_percent=70,bucket_storage=couchstore,key=test_collections,get-cbcollect-info=True,upgrade_version=7.6.0-2149,aws_access_key=xxxxxxx,aws_secret_key=xxxxxx,sirius_url=http://172.23.120.103:4000 -t upgrade.cbas_upgrade.UpgradeTests.test_upgrade,upgrade_chain=7.1.1,upgrade_type=online_swap,update_nodes=kv;cbas,nodes_init=5,services_init=kv:index:n1ql-kv:index:n1ql-cbas-cbas-cbas,pre_update_no_of_dv=2,pre_update_ds_per_dv=4,pre_update_no_of_synonym=5,pre_update_no_of_index=3,replica_num=3,override_spec_params=num_buckets;num_scopes;num_collections;replicas;num_items,num_items=10000,num_buckets=3,num_scopes=5,num_collections=5,no_of_dv=10,ds_per_dv=3,no_of_synonym=10,no_of_index=5,GROUP=7_1_1;online_upgrade'

      Job : debian-analytics-analytics_upgrade_from_7.1.1_with_collections

      Job ref : http://cb-logs-qe.s3-website-us-west-2.amazonaws.com/7.6.0-2149/jenkins_logs/test_suite_executor-TAF/313211/

            People

              raghav.sk Raghav S K
              raghav.sk Raghav S K