  Couchbase Server / MB-47491

[System Test] - Graceful failover done during upgrade fails with "Graceful failover exited with reason {mover_crashed,{unexpected_exit,{'EXIT',<0.1591.25>,{failed_to_update_vbucket_map,"


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Affects Version: 6.6.3
    • Fix Version: 6.6.3
    • Component: None
    • Environment: 6.6.2-9588 --> 6.6.3-9796
    • Triage: Untriaged
    • Operating System: Centos 64-bit
    • 1
    • No

    Description

      Steps to Repro
      1. Run the following 6.6.2 longevity test for 4 days.

      ./sequoia -client 172.23.96.162:2375 -provider file:centos_third_cluster.yml -test tests/integration/test_allFeatures_madhatter_durability.yml -scope tests/integration/scope_Xattrs_Madhatter.yml -scale 3 -repeat 0 -log_level 0 -version 6.6.2-9588 -skip_setup=true -skip_test=false -skip_teardown=true -skip_cleanup=false -continue=false -collect_on_error=false -stop_on_error=false -duration=604800 -show_topology=true
      

      2. At this point we have a 27-node cluster (3 analytics, 3 index, 3 fts, 3 query, 6 index, 9 data).
      3. Do a swap rebalance of 6 nodes (1 of each service type). Worked fine.
      4. Do a failover (graceful for kv) of 6 nodes (1 of each service type), then an upgrade, a recovery (delta for kv), and a rebalance. Worked fine.
      5. Do a failover (graceful for kv) of 6 nodes (1 of each service type). This graceful failover on the kv node (172.23.105.164) failed as shown below; a minimal REST sketch of this step follows the error output.

      "completionMessage": "Graceful failover exited with reason {mover_crashed,\n                                      {unexpected_exit,\n                                       {'EXIT',<0.1591.25>,\n                                        {failed_to_update_vbucket_map,\n                                         \"NEW_ORDER\",977,\n                                         {error,\n                                          [{'ns_1@172.23.106.54',\n                                            {exit,\n                                             {{nodedown,'ns_1@172.23.106.54'},\n                                              {gen_server,call,\n                                               [{ns_config_rep,\n                                                 'ns_1@172.23.106.54'},\n                                                synchronize_everything,\n                                                infinity]}}}}]}}}}}."
      

      This reminds me of the bugs I hit during the 6.6.2 -> 7.0.0 upgrade because of bloated metakv tombstones, notably MB-46778 and MB-46787. Not sure if it is the same issue though.

      Some important things to note.
      1. This is the first time we are doing a system-test upgrade from 6.6.2 -> 6.6.3, so there is no baseline to speak of. This test was first done against 7.0.0, as a 6.6.2 -> 7.0.0 upgrade.
      2. The number of metakv tombstones is shown below (a command to sample the tombstoned keys follows the count).

      [root@localhost ~]#  curl --silent -u Administrator:password http://localhost:8091/diag/eval -d 'ns_config:get()' | grep '_deleted' | wc -l
      17539
      [root@localhost ~]# 
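
      For reference, the same diag/eval dump can be used to sample the tombstoned entries (rather than only counting them) to see which key namespaces they belong to:

      curl --silent -u Administrator:password http://localhost:8091/diag/eval \
        -d 'ns_config:get()' | grep '_deleted' | head -20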
      

      Please note these are organically created tombstones, unlike the ones we used to create in 7.0.0 using a shell script for metakv purge testing in the system-test upgrade. No changes were made to the longevity test. These tombstones were written during the MH (Mad Hatter) time frame, possibly around 2 years ago.
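
      For background, a hypothetical illustration of how such a tombstone comes about (this is not the old 7.0.0 purge-test script; the key path below is made up, and it assumes the usual ns_config:set/ns_config:delete calls reachable through diag/eval): a metakv-backed key is created and then deleted, and in 6.x the delete leaves a '_deleted' marker behind in ns_config, which is what the count above picks up.

      # Create a metakv-backed ns_config key (hypothetical key path).
      curl -s -u Administrator:password http://localhost:8091/diag/eval \
        -d 'ns_config:set({metakv, <<"/example/tombstone-demo">>}, <<"v">>)'
      # The delete does not physically remove the key; it is marked '_deleted'.
      curl -s -u Administrator:password http://localhost:8091/diag/eval \
        -d 'ns_config:delete({metakv, <<"/example/tombstone-demo">>})'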

      cbcollect_info attached.

      Attachments



          People

            Assignee: Balakumaran Gopal
            Reporter: Balakumaran Gopal
            Votes: 0
            Watchers: 6


              Gerrit Reviews

                There are no open Gerrit changes
