Couchbase Server / MB-59397

[System Test][Fast failover]: Nodes trying to get auto failed over are prevented by a running rebalance.


Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Critical
    • Affects Version: 7.6.0
    • Fix Version: 7.6.0
    • Component: ns_server

    • Environment: Enterprise Edition 7.6.0 build 1728
    • Triage: Untriaged
    • Operating System: Linux x86_64

    Description

      Script to Repro

      ./sequoia -client 172.23.104.27:2375 -provider file:debian_pine.yml -test tests/integration/7.6/test_7.6.yml -scope tests/integration/7.6/scope_7.6_magma.yml -scale 3 -repeat 0 -log_level 0 -version 7.6.0-1728 -skip_setup=false -skip_test=false -skip_teardown=true -skip_cleanup=false -continue=false -collect_on_error=false -stop_on_error=false -duration=1209600 -show_topology=true
      

      This is one of our early runs with fast failover enabled (1s AF timeout), as requested by Abhijeeth Nuthan, on our longevity test.

      [2023-10-31T07:19:58-07:00, sequoiatools/couchbase-cli:7.6:d60f0f] setting-autofailover -c 172.23.106.108:8091 -u Administrator -p password --enable-auto-failover=1 --auto-failover-timeout=1 --max-failovers=10
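      For reference, the couchbase-cli command above corresponds to the /settings/autoFailover REST endpoint. Below is a minimal sketch (Python, no live cluster; the JSON is a hard-coded illustrative sample of the endpoint's response, not output captured from this run) of how one might sanity-check that the aggressive 1s timeout actually took effect:

```python
import json

# Illustrative sample of what GET /settings/autoFailover returns; field
# names follow the documented REST API, values mirror the settings
# applied above (enable=1, timeout=1, max-failovers=10).
SAMPLE_RESPONSE = json.dumps({
    "enabled": True,
    "timeout": 1,
    "count": 0,
    "maxCount": 10,
})

def check_af_settings(body: str) -> list:
    """Return a list of mismatches between actual and expected AF settings."""
    settings = json.loads(body)
    problems = []
    if not settings.get("enabled"):
        problems.append("auto-failover is disabled")
    if settings.get("timeout") != 1:
        problems.append("timeout is %ss, expected 1s" % settings.get("timeout"))
    if settings.get("maxCount") != 10:
        problems.append("maxCount is %s, expected 10" % settings.get("maxCount"))
    return problems

print(check_af_settings(SAMPLE_RESPONSE))  # [] when the settings took effect
```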
      

      The issue showed up during the following step of the longevity run, where we hard failover a node (172.23.106.109) and rebalance it out:

      [2023-10-31T18:28:58-07:00, sequoiatools/couchbase-cli:7.6:2c7c64] failover -c 172.23.106.108:8091 --server-failover 172.23.106.109:8091 -u Administrator -p password --hard
      [2023-10-31T18:29:08-07:00, sequoiatools/couchbase-cli:7.6:0aebf7] rebalance -c 172.23.106.108:8091 -u Administrator -p password
      

      ns_1@172.23.104.215 6:29:01 PM 31 Oct, 2023

      Failed over ['ns_1@172.23.106.109']: ok
      

      ns_1@172.23.104.215 6:29:02 PM 31 Oct, 2023

      Failover completed successfully.
      Rebalance Operation Id = d77efaca00c7b3dbc719aa54792ab014
      

      ns_1@172.23.104.215 6:29:11 PM 31 Oct, 2023

      Starting rebalance, KeepNodes = ['ns_1@172.23.104.213','ns_1@172.23.104.215',
      'ns_1@172.23.104.227','ns_1@172.23.105.237',
      'ns_1@172.23.105.238','ns_1@172.23.105.63',
      'ns_1@172.23.106.108','ns_1@172.23.106.110',
      'ns_1@172.23.106.121','ns_1@172.23.106.124',
      'ns_1@172.23.106.164','ns_1@172.23.120.59',
      'ns_1@172.23.121.72','ns_1@172.23.121.87',
      'ns_1@172.23.121.94','ns_1@172.23.124.27',
      'ns_1@172.23.96.170','ns_1@172.23.96.186',
      'ns_1@172.23.96.203','ns_1@172.23.96.251',
      'ns_1@172.23.96.252','ns_1@172.23.96.253',
      'ns_1@172.23.97.189','ns_1@172.23.97.229',
      'ns_1@172.23.97.242','ns_1@172.23.97.243',
      'ns_1@172.23.97.244','ns_1@172.23.97.245'], EjectNodes = [], Failed over and being ejected nodes = ['ns_1@172.23.106.109']; no delta recovery nodes; Operation Id = caf337cdcce2c0887390af6a11e6356a
      

      While the above rebalance was running, we saw that node 172.23.106.108 was trying to get auto failed over.

      ns_1@172.23.104.215 8:18:11 PM 31 Oct, 2023

      Could not automatically fail over nodes (['ns_1@172.23.106.108']). Rebalance is running. (repeated 1 times, last seen 45.471187 secs ago)
      

      This is interesting because the behaviour that prevented AF from aborting a running rebalance was removed through MB-49059. We saw the above message possibly hundreds of times, but AF was never triggered because of the ongoing rebalance.
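      To make the discrepancy concrete, here is a simplified, hypothetical sketch (not ns_server's actual Erlang code) of the two policies: before MB-49059 the orchestrator skipped AF while a rebalance ran; after MB-49059 (7.5+) AF is expected to interrupt the rebalance and proceed.

```python
def af_decision(rebalance_running: bool, interrupt_rebalance: bool) -> list:
    """Hypothetical simplification of the orchestrator's auto-failover choice.

    interrupt_rebalance=False models the pre-MB-49059 behaviour;
    True models the 7.5+ behaviour this test expected to see.
    """
    if not rebalance_running:
        return ["failover"]
    if interrupt_rebalance:
        # 7.5+ per MB-49059: stop the rebalance, then fail the node over.
        return ["stop_rebalance", "failover"]
    # Pre-MB-49059 branch; matches the "Could not automatically fail over
    # nodes ... Rebalance is running." message observed in this run.
    return ["log_could_not_failover"]

# What 7.6.0-1728 should do per MB-49059:
print(af_decision(rebalance_running=True, interrupt_rebalance=True))
# What the logs above actually show:
print(af_decision(rebalance_running=True, interrupt_rebalance=False))
```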

      Interestingly, a few hours later the orchestrator tried to auto fail over another 7 nodes, which again failed because of the ongoing rebalance.
      ns_1@172.23.104.215 10:35:20 PM 31 Oct, 2023

      Could not automatically fail over nodes (['ns_1@172.23.106.108',
      'ns_1@172.23.96.170',
      'ns_1@172.23.96.186',
      'ns_1@172.23.96.251',
      'ns_1@172.23.97.189',
      'ns_1@172.23.97.229',
      'ns_1@172.23.97.244']). Rebalance is running.
      

      I believe there are two important questions to be answered here.

      1. Why are multiple nodes trying to get auto failed over? (Possibly because, with the AF timeout reduced to 1s, we are now more prone to AFs from ordinary flakiness in the system, suggesting that a setup like ours might not be a good use case for fast failover.)
      2. Why is AF being prevented by the ongoing rebalance, given that this restriction was removed in 7.5 through MB-49059?
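      When correlating these AF attempts with the rebalance window, one way to tell whether a rebalance was in flight is the /pools/default/tasks endpoint. A minimal sketch (the task list here is a hard-coded illustrative sample trimmed to the fields used, not output from this cluster):

```python
import json

# Illustrative sample of GET /pools/default/tasks output; the real
# endpoint returns richer task objects, this keeps only type/status/progress.
SAMPLE_TASKS = json.dumps([
    {"type": "rebalance", "status": "running", "progress": 42.5},
])

def running_rebalance(body: str):
    """Return the running rebalance task from a tasks response, if any."""
    for task in json.loads(body):
        if task.get("type") == "rebalance" and task.get("status") == "running":
            return task
    return None

task = running_rebalance(SAMPLE_TASKS)
if task is not None:
    print("rebalance running, progress %.1f%%" % task["progress"])
```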

      cbcollect_info attached.

      Attachments


        Activity

          People

            Assignee: Balakumaran Gopal
            Reporter: Balakumaran Gopal
            Votes: 0
            Watchers: 3


              Gerrit Reviews

                There are no open Gerrit changes
