Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-46778

[Upgrade] - Rebalance fails during online upgrade with graceful failover + delta recovery + rebalance.

    XMLWordPrintable

Details

    • Untriaged
    • Centos 64-bit
    • 1
    • Yes

    Description

      Script to Repro
      1. Run the following 6.6.2 longevity test for 3-4 days. We will have 27 node cluster at the end of it.

      ./sequoia -client 172.23.96.162:2375 -provider file:centos_third_cluster.yml -test tests/integration/test_allFeatures_madhatter_durability.yml -scope tests/integration/scope_Xattrs_Madhatter.yml -scale 3 -repeat 0 -log_level 0 -version 6.6.2-9588 -skip_setup=false -skip_test=false -skip_teardown=true -skip_cleanup=false -continue=false -collect_on_error=false -stop_on_error=false -duration=604800 -show_topology=true
      

      2. Run the script create_drop.sh on 6.6.2 nodes on the cluster. And this was run on the 7.0.0 nodes as well that will be brought into the cluster using swap rebalance for upgrade
      3. Swap rebalance 6(1 of each service) 6.6.2 nodes with 7.0.0 nodes.
      4. Graceful failover 6 node (1 of each service), upgrade, do a recovery and start rebalance.
      5. Graceful failover 6 node (1 of each service), upgrade, do a recovery and start rebalance.

      Rebalance at step 5 failed.

      ns_1@172.23.110.76 8:55:11 AM   7 Jun, 2021

      Starting rebalance, KeepNodes = ['ns_1@172.23.104.15','ns_1@172.23.104.214',
      'ns_1@172.23.104.232','ns_1@172.23.104.244',
      'ns_1@172.23.104.245','ns_1@172.23.105.102',
      'ns_1@172.23.105.109','ns_1@172.23.105.112',
      'ns_1@172.23.105.118','ns_1@172.23.105.164',
      'ns_1@172.23.105.206','ns_1@172.23.105.210',
      'ns_1@172.23.105.25','ns_1@172.23.105.61',
      'ns_1@172.23.105.62','ns_1@172.23.105.90',
      'ns_1@172.23.105.93','ns_1@172.23.106.117',
      'ns_1@172.23.106.191','ns_1@172.23.106.207',
      'ns_1@172.23.106.225','ns_1@172.23.106.232',
      'ns_1@172.23.106.239','ns_1@172.23.106.246',
      'ns_1@172.23.106.37','ns_1@172.23.106.54',
      'ns_1@172.23.110.76'], EjectNodes = [], Failed over and being ejected nodes = [], Delta recovery nodes = ['ns_1@172.23.105.61'], Delta recovery buckets = all; Operation Id = aa7bf1ba9bd044e1f30ef2d87c58af56
      

      ns_1@172.23.110.76 9:10:51 AM   7 Jun, 2021

      Rebalance exited with reason {mover_crashed,
      {unexpected_exit,
      {'EXIT',<0.31594.230>,
      {failed_to_update_vbucket_map,"default",654,
      {error,
      [{'ns_1@172.23.106.239',
      {exit,
      {{nodedown,'ns_1@172.23.106.239'},
      {gen_server,call,
      [{ns_config_rep,'ns_1@172.23.106.239'},
      synchronize_everything,
      infinity]}}}}]}}}}}.
      Rebalance Operation Id = aa7bf1ba9bd044e1f30ef2d87c58af56
      

      ns_1@172.23.106.239 9:10:57 AM   7 Jun, 2021

      Service 'ns_server' exited with status 1. Restarting. Messages:
      working as port
      99626: Booted. Waiting for shutdown request
      working as port
      eheap_alloc: Cannot allocate 11753808384 bytes of memory (of type "old_heap").
       
      Crash dump is being written to: erl_crash.dump.1623048781.99514.ns_server...done
      [os_mon] cpu supervisor port (cpu_sup): Erlang has closed
      [os_mon] memory supervisor port (memsup): Erlang has closed
      

      Looking at the time of crash which is almost at the same time as the rebalance failure, looks like ns-serv crash is the reason for the rebalance failure. Assigning it to the ns_server team.

      Also , much later we notice one more ns_serv crash.
      ns_1@172.23.106.239 11:22:09 AM   7 Jun, 2021

      Service 'ns_server' exited with status 137. Restarting. Messages:
      working as port
      88751: Booted. Waiting for shutdown request
      [os_mon] memory supervisor port (memsup): Erlang has closed
      [os_mon] cpu supervisor port (cpu_sup): Erlang has closed
      

      cbcollect_info attached. This was not seen on last system test upgrade we had from 6.6.2-9588 -> 7.0.0-5226

      Attachments

        Issue Links

          For Gerrit Dashboard: MB-46778
          # Subject Branch Project Status CR V

          Activity

            People

              Balakumaran.Gopal Balakumaran Gopal
              Balakumaran.Gopal Balakumaran Gopal
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty