  1. Couchbase Server
  2. MB-56650

[System Test][CBBS] Backup service rebalance failed as 1 of the Backup nodes went down during rebalance operation

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • 7.2.0
    • tools
    • Enterprise Edition 7.2.0 build 5318

    Description

      QE TEST

      ./sequoia -client 172.23.104.254:2375 -provider file:centos_third_cluster.yml -test tests/integration/7.2/test_7.2.yml -scope tests/integration/7.2/scope_7.2_magma.yml -scale 3 -repeat 0 -log_level 0 -version 7.2.0-5318 -skip_setup=false -skip_test=false -skip_teardown=true -skip_cleanup=false -continue=false -collect_on_error=false -stop_on_error=false -duration=604800 -show_topology=true
      

      Day - 1
      Cycle - 1
      Scale - 3

      The cluster includes 2 Backup nodes:

      • 172.23.104.249
      • 172.23.105.0

      TEST STEP

      Rebalance in a single Analytics node.

      [2023-04-26T09:10:39-07:00, sequoiatools/couchbase-cli:7.1:82450b] server-add -c 172.23.108.139:8091 --server-add https://172.23.105.38 -u Administrator -p password --server-add-username Administrator --server-add-password password --services analytics
      [2023-04-26T09:10:53-07:00, sequoiatools/couchbase-cli:7.1:b2e0fb] rebalance -c 172.23.108.139:8091 -u Administrator -p password
      

      OBSERVATION
      While the rebalance operation was in progress, one of the Backup nodes (172.23.105.0) went down.

      [chronicle:info,2023-04-26T09:46:32.321-07:00,ns_1@172.23.108.139:chronicle_proposer<0.16773.1>:chronicle_proposer:handle_down:1142]Observed agent {chronicle_agent,'ns_1@172.23.105.0'} on peer 'ns_1@172.23.105.0' go down with reason noconnection
      

      Auto-failover did not happen because the rebalance was still running.

      [user:info,2023-04-26T09:48:34.885-07:00,ns_1@172.23.108.139:<0.16915.1>:auto_failover:report_failover_error:727]Could not automatically fail over nodes (['ns_1@172.23.105.0']). Rebalance is running.
      

      The rebalance operation failed about a minute later, when the orchestrator could not reach the backup service agent on the downed node.

      [user:error,2023-04-26T09:49:36.278-07:00,ns_1@172.23.108.139:<0.16912.1>:ns_orchestrator:log_rebalance_completion:1433]Rebalance exited with reason {service_rebalance_failed,backup,
                                    {{badmatch,
                                      {error,
                                       {bad_nodes,backup,get_agent,
                                        [{'ns_1@172.23.105.0',
                                          {exit,
                                           {{nodedown,'ns_1@172.23.105.0'},
                                            {gen_server,call,
                                             [{'service_agent-backup',
                                               'ns_1@172.23.105.0'},
                                              get_agent,infinity]}}}}]}}},
                                     [{service_rebalancer,wait_for_agents,1,
                                       [{file,"src/service_rebalancer.erl"},
                                        {line,80}]},
                                      {service_rebalancer,run_rebalance,1,
                                       [{file,"src/service_rebalancer.erl"},
                                        {line,59}]},
                                      {proc_lib,init_p,3,
                                       [{file,"proc_lib.erl"},{line,211}]}]}}.
      Rebalance Operation Id = 9125ca776adbab9a1801e9c479e82b34
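
For anyone triaging similar logs, the timing of the three events above can be checked with a short script. This is only a sketch: the header layout (`[category:level,timestamp,node:...]`) is inferred from the excerpts in this ticket, the message text is shortened, and the names `HEADER`, `parse_header` and `build_timeline` are hypothetical helpers, not part of any Couchbase tooling.

```python
import re
from datetime import datetime

# Header layout assumed from the excerpts in this ticket:
# [category:level,ISO-8601 timestamp,node:process:module:function:line]message
HEADER = re.compile(
    r"^\[(?P<category>\w+):(?P<level>\w+),(?P<ts>[^,]+),(?P<node>[^:]+):"
)

# First line of each excerpt quoted in the observation (messages shortened).
LINES = [
    "[chronicle:info,2023-04-26T09:46:32.321-07:00,ns_1@172.23.108.139:"
    "chronicle_proposer<0.16773.1>:chronicle_proposer:handle_down:1142]"
    "Observed agent (shortened) go down with reason noconnection",
    "[user:info,2023-04-26T09:48:34.885-07:00,ns_1@172.23.108.139:"
    "<0.16915.1>:auto_failover:report_failover_error:727]"
    "Could not automatically fail over nodes (shortened)",
    "[user:error,2023-04-26T09:49:36.278-07:00,ns_1@172.23.108.139:"
    "<0.16912.1>:ns_orchestrator:log_rebalance_completion:1433]"
    "Rebalance exited with reason (shortened)",
]


def parse_header(line):
    """Return category, level, timestamp and node, or None if no header matches."""
    match = HEADER.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # The timestamps carry a UTC offset, so fromisoformat yields aware datetimes.
    fields["ts"] = datetime.fromisoformat(fields["ts"])
    return fields


def build_timeline(lines):
    """Return (seconds since first event, level, category) for each parsable line."""
    parsed = [p for p in (parse_header(line) for line in lines) if p is not None]
    if not parsed:
        return []
    start = parsed[0]["ts"]
    return [
        ((p["ts"] - start).total_seconds(), p["level"], p["category"])
        for p in parsed
    ]


if __name__ == "__main__":
    for offset, level, category in build_timeline(LINES):
        print(f"+{offset:8.3f}s  {category}:{level}")
```

On the excerpts in this ticket, the auto-failover message lands roughly two minutes after the node was first seen down, and the rebalance failure roughly three minutes after it.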
      

          People

            sujay.gad Sujay Gad
            sujay.gad Sujay Gad