Couchbase Server / MB-46564

[System test] Online upgrade using graceful failover + full recovery + rebalance fails in eventing with "service_rebalance_failed,eventing, {worker_died,"


Details

    • Untriaged
    • Centos 64-bit
    • 1
    • Yes

    Description

      Steps to Repro
      1. Run the following longevity test on 6.6.2 for 3-4 days:

      ./sequoia -client 172.23.96.162:2375 -provider file:centos_third_cluster.yml -test tests/integration/test_allFeatures_madhatter_durability.yml -scope tests/integration/scope_Xattrs_Madhatter.yml -scale 3 -repeat 0 -log_level 0 -version 6.6.2-9588 -skip_setup=false -skip_test=false -skip_teardown=true -skip_cleanup=false -continue=false -collect_on_error=false -stop_on_error=false -duration=604800 -show_topology=true
      

      2. We have a 27-node cluster on 6.6.2.
      3. Add 6 nodes (1 of each service, 7.0.0-5226), remove 6 nodes (6.6.2), and do a swap rebalance to upgrade the cluster.
      4. Gracefully fail over 6 nodes (1 of each service, 6.6.2), upgrade them, then do a full recovery and rebalance (see the sketch after these steps).
      5. Tried to continue these steps for the rest of the nodes in the cluster, but one of the rebalances failed as shown below.
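
      For reference, step 4 can be driven per node with couchbase-cli roughly as follows. This is a minimal sketch, assuming default ports and Administrator credentials; the orchestrator address, node address, and password are illustrative placeholders, not values taken from this run:

      # Graceful failover of one 6.6.2 node (graceful is the default when --hard is not passed)
      couchbase-cli failover -c 172.23.106.70:8091 -u Administrator -p <password> \
        --server-failover 172.23.104.15:8091

      # After upgrading the failed-over node, mark it for full (not delta) recovery
      couchbase-cli recovery -c 172.23.106.70:8091 -u Administrator -p <password> \
        --server-recovery 172.23.104.15:8091 --recovery-type full

      # Rebalance the recovered node back into the cluster
      couchbase-cli rebalance -c 172.23.106.70:8091 -u Administrator -p <password>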

      ns_1@172.23.106.70 7:18:13 AM   26 May, 2021

      Starting rebalance, KeepNodes = ['ns_1@172.23.104.15','ns_1@172.23.104.214',
      'ns_1@172.23.104.232','ns_1@172.23.104.244',
      'ns_1@172.23.104.245','ns_1@172.23.105.102',
      'ns_1@172.23.105.109','ns_1@172.23.105.112',
      'ns_1@172.23.105.118','ns_1@172.23.105.206',
      'ns_1@172.23.105.210','ns_1@172.23.105.25',
      'ns_1@172.23.105.29','ns_1@172.23.105.61',
      'ns_1@172.23.105.86','ns_1@172.23.105.90',
      'ns_1@172.23.106.117','ns_1@172.23.106.191',
      'ns_1@172.23.106.207','ns_1@172.23.106.225',
      'ns_1@172.23.106.232','ns_1@172.23.106.239',
      'ns_1@172.23.106.246','ns_1@172.23.106.37',
      'ns_1@172.23.106.54','ns_1@172.23.106.70',
      'ns_1@172.23.110.75'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 57cca96fe563d50d27549ba664c85dfe
      

      ns_1@172.23.106.70 7:53:28 AM   26 May, 2021

      Rebalance exited with reason {service_rebalance_failed,eventing,
      {worker_died,
      {'EXIT',<0.15454.774>,
      {rebalance_failed,
      {service_error,
      <<"eventing rebalance hasn't made progress for past 1200 secs">>}}}}}.
      Rebalance Operation Id = 57cca96fe563d50d27549ba664c85dfe
      

      Attaching cbcollect logs shortly.
      This was not seen on upgrade from 6.6.2-9588 to 7.0.0-5141.

            People

              Assignee: Balakumaran Gopal
              Reporter: Balakumaran Gopal
