  1. Couchbase Server
  2. MB-48421

[System Test] - Online upgrade with graceful failover fails with "Rebalance exited with reason {mover_crashed, {unexpected_exit, {'EXIT',<0.18147.69>, {failed_to_update_vbucket_map,"HISTORY",641, {error, [{'ns_1@172.23.120.81', {exit, {{{timeout"

Details

    • Untriaged
    • Centos 64-bit
    • 1
    • No

    Description

      Steps to Repro
      1. Run the following longevity script on 6.6.3 for 5 days.

      ./sequoia -client 172.23.104.254:2375 -provider file:centos_second_cluster.yml -test tests/integration/test_allFeatures_madhatter_durability.yml -scope tests/integration/scope_Xattrs_Madhatter.yml -scale 3 -repeat 0 -log_level 0 -version 6.6.3-9808 -skip_setup=true -skip_test=false -skip_teardown=true -skip_cleanup=false -continue=false -collect_on_error=false -stop_on_error=false -duration=604800 -show_topology=true
      

      At this point the cluster should have 27 nodes (9 KV, 6 index, 3 analytics, 3 FTS, 3 eventing and 3 N1QL).
      2. Create 10k metakv tombstones. This has been part of our testing since MB-44838 was fixed. We used to create around 25k in total for CC; here we have reduced it to around 12k.

      #!/bin/bash
      # Create and then delete each key so that every key leaves a metakv tombstone.
      for i in {0..10000}
          do
              curl -X PUT -u Administrator:password "http://localhost:8091/_metakv/key${i}" -d 'value=foo1'
              curl -X DELETE -v -u Administrator:password "http://localhost:8091/_metakv/key${i}"
          done
      

      3. Swap rebalance 6 nodes, 1 of each service, with 7.0.2 nodes. The rebalance goes through successfully.
      4. Fail over 6 of the 6.6.3 nodes, 1 of each service (graceful failover for KV), upgrade these nodes to 7.0.2, recover all 6 nodes (delta recovery for KV) and rebalance.

      ns_1@172.23.106.136 1:12:01 AM 13 Sep, 2021

      Starting rebalance, KeepNodes = ['ns_1@172.23.106.134','ns_1@172.23.106.136',
      'ns_1@172.23.106.137','ns_1@172.23.106.138',
      'ns_1@172.23.120.58','ns_1@172.23.120.73',
      'ns_1@172.23.120.74','ns_1@172.23.120.75',
      'ns_1@172.23.120.77','ns_1@172.23.120.81',
      'ns_1@172.23.120.86','ns_1@172.23.121.118',
      'ns_1@172.23.121.77','ns_1@172.23.123.24',
      'ns_1@172.23.123.25','ns_1@172.23.123.26',
      'ns_1@172.23.123.31','ns_1@172.23.123.32',
      'ns_1@172.23.123.33','ns_1@172.23.96.122',
      'ns_1@172.23.96.14','ns_1@172.23.96.243',
      'ns_1@172.23.97.105','ns_1@172.23.97.148',
      'ns_1@172.23.97.149','ns_1@172.23.97.150',
      'ns_1@172.23.97.151'], EjectNodes = [], Failed over and being ejected nodes = [], Delta recovery nodes = ['ns_1@172.23.96.14'], Delta recovery buckets = all; Operation Id = 8fa9cee395483fda91678362bea50af3
      

      The above rebalance fails, as shown in rebalance_report_20210913T082158.json. The rebalance failure output is very large. I believe this is a duplicate of MB-46805; if it is not, we should file a new one.

      cbcollect_info is attached. This is the first time we are running this system test upgrade on 7.0.2, so there is no baseline and no last working build.
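
      For reference, the KV part of step 4 (graceful failover, delta recovery, rebalance) can be sketched against the cluster REST API as below. This is only an illustration under assumptions: the report does not say how the test framework issues these operations, and HOST, CREDS, DRY_RUN and the function names are hypothetical. The node name is the delta recovery node from the log above.

```shell
#!/bin/bash
# Sketch only: graceful failover, delta recovery and rebalance for one KV node.
# HOST, CREDS and DRY_RUN are assumptions for illustration, not from the report.
HOST="${HOST:-http://localhost:8091}"
CREDS="${CREDS:-Administrator:password}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the request instead of sending it

# POST form fields to a cluster REST endpoint.
# $1 = endpoint path, remaining arguments = name=value form fields.
cb_post() {
    local path="$1"; shift
    local args=()
    local field
    for field in "$@"; do
        args+=(--data-urlencode "$field")
    done
    if [ "$DRY_RUN" = "1" ]; then
        echo "POST ${HOST}${path} $*"
    else
        curl -sS -u "$CREDS" -X POST "${HOST}${path}" "${args[@]}"
    fi
}

# $1 = otpNode name, e.g. ns_1@172.23.96.14
graceful_failover()  { cb_post /controller/startGracefulFailover "otpNode=$1"; }
set_delta_recovery() { cb_post /controller/setRecoveryType "otpNode=$1" "recoveryType=delta"; }

# $1 = comma-separated otpNode names of every node that stays in the cluster.
rebalance() { cb_post /controller/rebalance "knownNodes=$1" "ejectedNodes="; }
```

      With DRY_RUN=1, calling graceful_failover 'ns_1@172.23.96.14' only prints the request it would send. In a real run, the node upgrade to 7.0.2 happens between the failover and the set_delta_recovery call, and rebalance needs the full list of cluster nodes.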

          People

            Balakumaran.Gopal Balakumaran Gopal
            Balakumaran.Gopal Balakumaran Gopal