MB-51642 (Couchbase Server)

[Upgrade Test] CBAS service keeps crashing when one of the CBAS nodes is failed over


Details

    Description

      Steps to reproduce:
      1. Set up a 5-node cluster as shown below:

      Node           | Services        | CPU_utilization | Mem_total | Mem_free | Swap_mem_used         | Active / Replica | Version
      172.23.105.19  | cbas            | 0.251256281407  | 3.91 GiB  | 3.06 GiB | 39.00 MiB / 3.50 GiB  | 0 / 0            | 6.6.4-9961-enterprise
      172.23.105.31  | index, kv, n1ql | 0.753768844221  | 3.91 GiB  | 3.38 GiB | 80.50 MiB / 3.50 GiB  | 0 / 0            | 6.6.4-9961-enterprise
      172.23.105.20  | cbas            | 0.501253132832  | 3.91 GiB  | 3.13 GiB | 114.57 MiB / 3.50 GiB | 0 / 0            | 6.6.4-9961-enterprise
      172.23.105.244 | index, kv, n1ql | 0.503778337531  | 3.91 GiB  | 3.41 GiB | 39.00 MiB / 3.50 GiB  | 0 / 0            | 6.6.4-9961-enterprise
      172.23.105.245 | cbas            | 1.00250626566   | 3.91 GiB  | 3.12 GiB | 94.25 MiB / 3.50 GiB  | 0 / 0            | 6.6.4-9961-enterprise

      2. Create a single KV bucket and load data into it.
      3. Create the following CBAS infra: 2 dataverses, 8 datasets and 3 indexes.
      4. Upgrade each node in the cluster using the "online swap upgrade" method. The cluster should then look something like this:

      Nodes          | Services        | Version               | CPU            | Status       | Membership / Recovery
      172.23.105.19  | cbas            | 7.1.0-2534-enterprise | 0.777331995988 | Cluster node | active / none
      172.23.105.20  | index, kv, n1ql | 7.1.0-2534-enterprise | 1.15577889447  | Cluster node | active / none
      172.23.105.244 | cbas            | 7.1.0-2534-enterprise | 0.57701956849  | Cluster node | active / none
      172.23.105.245 | index, kv, n1ql | 7.1.0-2534-enterprise | 1.15490836053  | Cluster node | active / none
      172.23.105.24  | cbas            | 7.1.0-2534-enterprise | 1.44927536232  | Cluster node | active / none

      5. Rebalance the cluster again to enable the CBAS service.
      6. Validate that the pre-upgrade CBAS infra is still intact and that no data was lost.
      7. Enable CBAS replicas and set the replica count to 3.
      8. Rebalance for the replica setting to take effect.
      9. Load more docs into the bucket that was created before the upgrade and verify that ingestion into the datasets happens as expected.
      10. Delete all the data from the bucket and verify that the data was flushed from the datasets.
      11. Create new scopes and collections in the existing bucket and load data.
      12. Create new buckets, scopes and collections and load data.
      13. Create new CBAS infra: 10 dataverses, 30 datasets, 10 synonyms and 5 indexes.
      14. Verify that ingestion completed for all the newly created datasets.
      15. Fail over one of the CBAS nodes (any node except the CBAS CC node).
      16. A CBAS service crash is observed, and CBAS does not come back up.
      17. Rebalancing after adding the failed-over node back also throws errors:

      Analytics Service unable to successfully rebalance 474e248ff3cee0964d49d60bdbb255d2 due to 'java.lang.InterruptedException: sleep interrupted'; see analytics_info.log for details
       
      Analytics Service unable to successfully rebalance bcd9bae69f337714db5221bfa407208d due to 'java.lang.Exception: replica 3@172.23.105.19:9120 failed'; see analytics_info.log for details
       
      Rebalance exited with reason {{badmatch,failed},
      [{ns_rebalancer,rebalance_body,5,
      [{file,"src/ns_rebalancer.erl"},
      {line,508}]},
      {async,'-async_init/4-fun-1-',3,
      [{file,"src/async.erl"},{line,191}]}]}.
      Rebalance Operation Id = 37586dd42292b66579d4bf0104f975ef
       
      Analytics Service unable to successfully rebalance adb71312dce14443412608f6b5731c08 due to 'java.lang.Exception: replica 3@172.23.105.19:9120 failed'; see analytics_info.log for details
      
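For triage, the Analytics rebalance failures quoted above can be pulled out of a log dump and grouped by cause. The following is only a minimal Python sketch, not part of the test framework: the function name `parse_rebalance_failures` is made up for illustration, and the pattern is derived solely from the three messages in this ticket, so other builds may word them differently.

```python
import re
from collections import Counter

# Pattern based only on the messages quoted in this ticket (assumption:
# the id is a 32-character hex string and the cause is single-quoted).
FAILURE_RE = re.compile(
    r"Analytics Service unable to successfully rebalance "
    r"(?P<rebalance_id>[0-9a-f]{32}) due to '(?P<cause>[^']*)'"
)


def parse_rebalance_failures(text):
    """Return a list of (rebalance_id, cause) tuples found in text."""
    return [
        (m.group("rebalance_id"), m.group("cause"))
        for m in FAILURE_RE.finditer(text)
    ]


# The three failure messages from the description, copied verbatim.
SAMPLE = """\
Analytics Service unable to successfully rebalance 474e248ff3cee0964d49d60bdbb255d2 due to 'java.lang.InterruptedException: sleep interrupted'; see analytics_info.log for details
Analytics Service unable to successfully rebalance bcd9bae69f337714db5221bfa407208d due to 'java.lang.Exception: replica 3@172.23.105.19:9120 failed'; see analytics_info.log for details
Analytics Service unable to successfully rebalance adb71312dce14443412608f6b5731c08 due to 'java.lang.Exception: replica 3@172.23.105.19:9120 failed'; see analytics_info.log for details
"""

if __name__ == "__main__":
    failures = parse_rebalance_failures(SAMPLE)
    for rebalance_id, cause in failures:
        print(rebalance_id, "->", cause)
    # Count how often each cause occurs across the rebalance attempts.
    print(Counter(cause for _, cause in failures).most_common())
```

Grouping by cause shows that two of the three attempts failed on the same replica (`3@172.23.105.19:9120`).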

      This issue was found after adding support for N2N encryption in upgrade tests.
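Steps 7 and 15 can also be driven over the cluster manager REST API when reproducing by hand. The sketch below only builds the requests and does not send them. It rests on assumptions to check against the Couchbase documentation for the build in use: that the replica count is set with `POST /settings/analytics` and a `numReplicas` parameter, that hard failover is `POST /controller/failOver` with an `otpNode` of the form `ns_1@<ip>`, and that the plain HTTP port 8091 is reachable (with N2N encryption enabled, the TLS port may be required instead). The helper names, credentials and node choice are hypothetical.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request


def build_post(host, path, params, user, password, port=8091):
    """Build (but do not send) a form-encoded POST with basic auth."""
    body = urlencode(params).encode("ascii")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = Request(f"http://{host}:{port}{path}", data=body, method="POST")
    req.add_header("Authorization", f"Basic {token}")
    req.add_header("Content-Type", "application/x-www-form-urlencoded")
    return req


def set_analytics_replicas(host, num_replicas, user, password):
    # Assumed endpoint and parameter name for step 7; a rebalance is still
    # needed afterwards (step 8) for the setting to take effect.
    return build_post(host, "/settings/analytics",
                      {"numReplicas": num_replicas}, user, password)


def hard_failover(host, node_ip, user, password):
    # Assumed endpoint for step 15; node_ip should be a CBAS node that is
    # not the CC node.
    return build_post(host, "/controller/failOver",
                      {"otpNode": f"ns_1@{node_ip}"}, user, password)


if __name__ == "__main__":
    # Placeholder credentials and an arbitrary CBAS node from the table above.
    for req in (
        set_analytics_replicas("172.23.105.20", 3, "Administrator", "password"),
        hard_failover("172.23.105.20", "172.23.105.244", "Administrator", "password"),
    ):
        print(req.get_method(), req.full_url, req.data.decode())
```

Passing a returned request to `urllib.request.urlopen` would actually execute it, which should only be done against a disposable test cluster.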


          Activity

            build-team Couchbase Build Team added a comment -
            Build couchbase-server-7.1.0-2546 contains cbas-core commit bd5ab94 with commit message:
            MB-51642: Increase replica inactivity timeout

            build-team Couchbase Build Team added a comment -
            Build couchbase-server-7.1.1-3006 contains cbas-core commit bd5ab94 with commit message:
            MB-51642: Increase replica inactivity timeout

            umang.agrawal Umang added a comment -
            Verified with build 7.1.0-2546

            build-team Couchbase Build Team added a comment -
            Build couchbase-server-7.2.0-1094 contains cbas-core commit bd5ab94 with commit message:
            MB-51642: Increase replica inactivity timeout

            build-team Couchbase Build Team added a comment -
            Build couchbase-server-7.2.0-1094 contains cbas-core commit a8611a0 with commit message:
            MB-51642: Ensure global recovery isn't running before failover

            People

              umang.agrawal Umang
              umang.agrawal Umang
              Votes: 0
              Watchers: 5
