Couchbase Server / MB-60054

[System Test] FTS rebalance operation is still ongoing even though rebalance progress is marked as 100%


Details

    • Bug
    • Resolution: Unresolved
    • Critical
    • Morpheus
    • 7.6.0
    • fts
    • Enterprise Edition 7.6.0 build 18141

    Description

      QE Test

      ./sequoia -client 172.23.104.254:2375 -provider file:centos_third_cluster.yml -test tests/fts/cheshire-cat/test_fts_clusterops_coll_crud_magma.yml -scope tests/fts/cheshire-cat/scope_fts_magma.yml -scale 3 -repeat 0 -log_level 0 -version 7.6.0-18141 -skip_setup=false -skip_test=false -skip_teardown=true -skip_cleanup=false -continue=false -collect_on_error=false -stop_on_error=false -duration=604800 -show_topology=true
      

      Day - 1
      Cycle - 1
      Scale - 3

      Test Step

      Rebalance out single FTS node from the cluster.

      2023-12-08T07:44:56.547-08:00, ns_orchestrator:0:info:message(ns_1@172.23.107.25) - Starting rebalance, KeepNodes = ['ns_1@172.23.104.216','ns_1@172.23.107.236',
                                       'ns_1@172.23.107.25','ns_1@172.23.108.134',
                                       'ns_1@172.23.108.136','ns_1@172.23.108.138',
                                       'ns_1@172.23.108.139','ns_1@172.23.108.141',
                                       'ns_1@172.23.108.143','ns_1@172.23.108.146',
                                       'ns_1@172.23.108.148'], EjectNodes = ['ns_1@172.23.108.145'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 02c25254d0cddf2d5aac8b7a1dec8922
      

      Observation

      The output of the pools/default/rebalanceProgress endpoint shows progress 1 (complete) for every FTS node in the cluster, while the overall status is still "running".

      curl -u Administrator:password http://172.23.108.139:8091/pools/default/rebalanceProgress | jq
      {
        "status": "running",
        "ns_1@172.23.108.141": {
          "progress": 1
        },
        "ns_1@172.23.108.143": {
          "progress": 1
        },
        "ns_1@172.23.108.134": {
          "progress": 1
        },
        "ns_1@172.23.108.145": {
          "progress": 1
        },
        "ns_1@172.23.107.25": {
          "progress": 0
        },
        "ns_1@172.23.108.136": {
          "progress": 1
        },
        "ns_1@172.23.104.216": {
          "progress": 1
        },
        "ns_1@172.23.108.146": {
          "progress": 1
        },
        "ns_1@172.23.107.236": {
          "progress": 0
        },
        "ns_1@172.23.108.148": {
          "progress": 1
        },
        "ns_1@172.23.108.138": {
          "progress": 1
        },
        "ns_1@172.23.108.139": {
          "progress": 1
        }
      }
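The check described above can be scripted. The following is a minimal sketch, assuming the response keeps the shape shown in the output (a top-level "status" plus one "progress" entry per node); the function name incomplete_nodes and the trimmed sample are illustrative only and are not part of any Couchbase tooling.

```python
import json


def incomplete_nodes(rebalance_progress, nodes):
    """Return the nodes from `nodes` whose reported progress is below 1.

    A node missing from the response is treated as not complete.
    """
    return [
        node for node in nodes
        if rebalance_progress.get(node, {}).get("progress", 0) < 1
    ]


# Trimmed copy of the output above: two search nodes and one non-search node.
sample = json.loads("""
{
  "status": "running",
  "ns_1@172.23.108.145": {"progress": 1},
  "ns_1@172.23.108.148": {"progress": 1},
  "ns_1@172.23.107.25": {"progress": 0}
}
""")

search_nodes = ["ns_1@172.23.108.145", "ns_1@172.23.108.148"]
pending = incomplete_nodes(sample, search_nodes)

# The symptom in this report: status is still "running" although no
# search node is pending.
running_but_all_done = sample["status"] == "running" and not pending

print("pending search nodes:", pending)
print("running with all search nodes at 1:", running_but_all_done)
```

Passing the full list under Search Nodes instead of the two-node sample applies the same check to the whole cluster.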
      

      The output of pools/default/tasks shows totalProgress for the FTS rebalance as 100, but the completedTime field is false (not populated), which indicates that the rebalance is still ongoing.

      "search": {
              "totalProgress": 100,
              "perNodeProgress": {
                "ns_1@172.23.108.143": 1,
                "ns_1@172.23.108.145": 1,
                "ns_1@172.23.108.136": 1,
                "ns_1@172.23.104.216": 1,
                "ns_1@172.23.108.148": 1,
                "ns_1@172.23.108.138": 1
              },
              "startTime": "2023-12-08T07:45:02.023-08:00",
              "completedTime": false,
              "timeTaken": 136409396
            }
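The same condition can be expressed as a small predicate. This is a sketch only: it assumes that completedTime stays false until the service finishes and holds some truthy value afterwards, which is inferred from the output above rather than confirmed. The name looks_stuck is hypothetical.

```python
def looks_stuck(service_task):
    """True when totalProgress is 100 but no completion time is recorded.

    Assumption: "completedTime" is false (or absent) while the service is
    still rebalancing and truthy once it has finished.
    """
    return (
        service_task.get("totalProgress") == 100
        and not service_task.get("completedTime")
    )


# The "search" entry from the output above, without perNodeProgress.
search = {
    "totalProgress": 100,
    "startTime": "2023-12-08T07:45:02.023-08:00",
    "completedTime": False,
    "timeTaken": 136409396,
}

print("search rebalance looks stuck:", looks_stuck(search))
```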
      

      The UI also shows the FTS rebalance as still ongoing.
      Screenshot 2023-12-10 at 11.19.27 AM.png

      In ns_server.fts.log on 172.23.108.148 (the rebalance orchestrator node), progress is reported as 1.000000 from 2023-12-08T23:33:15 through at least 2023-12-10T00:44:16, i.e. for more than 25 hours.

      grep "progress: 1." ns_server.fts.log | head -5
      2023-12-08T23:33:15.728-08:00 [INFO] ctl/manager: revNum: 88793, progress: 1.000000
      2023-12-08T23:33:16.363-08:00 [INFO] ctl/manager: revNum: 88795, progress: 1.000000
      2023-12-08T23:33:26.429-08:00 [INFO] ctl/manager: revNum: 88797, progress: 1.000000
      2023-12-08T23:33:35.738-08:00 [INFO] ctl/manager: revNum: 88799, progress: 1.000000
      2023-12-08T23:33:36.917-08:00 [INFO] ctl/manager: revNum: 88801, progress: 1.000000
      

      grep "progress: 1." ns_server.fts.log | tail -5
      2023-12-10T00:43:52.306-08:00 [INFO] ctl/manager: revNum: 115949, progress: 1.000000
      2023-12-10T00:43:56.614-08:00 [INFO] ctl/manager: revNum: 115951, progress: 1.000000
      2023-12-10T00:44:06.111-08:00 [INFO] ctl/manager: revNum: 115953, progress: 1.000000
      2023-12-10T00:44:12.454-08:00 [INFO] ctl/manager: revNum: 115955, progress: 1.000000
      2023-12-10T00:44:16.318-08:00 [INFO] ctl/manager: revNum: 115957, progress: 1.000000
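To quantify how long progress has been reported at 100%, the log excerpts above can be parsed directly. The sketch below assumes the line format shown in the grep output (timestamp, [INFO], ctl/manager, revNum, progress); the regular expression and the name full_progress_span are illustrative and are not taken from the FTS code.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as:
# 2023-12-08T23:33:15.728-08:00 [INFO] ctl/manager: revNum: 88793, progress: 1.000000
LINE_RE = re.compile(
    r"^(\S+) \[INFO\] ctl/manager: revNum: (\d+), progress: ([\d.]+)$"
)


def full_progress_span(lines):
    """Return the time between the first and last line reporting progress >= 1.

    Returns None when no line reports full progress.
    """
    stamps = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and float(match.group(3)) >= 1.0:
            stamps.append(datetime.fromisoformat(match.group(1)))
    if not stamps:
        return None
    return max(stamps) - min(stamps)


# First and last matching lines from the grep output above.
log_lines = [
    "2023-12-08T23:33:15.728-08:00 [INFO] ctl/manager: revNum: 88793, progress: 1.000000",
    "2023-12-10T00:44:16.318-08:00 [INFO] ctl/manager: revNum: 115957, progress: 1.000000",
]

span = full_progress_span(log_lines)
print("time at full progress:", span)
print("in hours:", span / timedelta(hours=1))
```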
      

       

      Note

      This test is running on a toy build containing the vector search changes. Upcoming runs will be triggered on regular 7.6 builds, now that the code is available in mainstream Trinity builds.

      Search Nodes

      • 172.23.104.216
      • 172.23.108.136
      • 172.23.108.138
      • 172.23.108.143
      • 172.23.108.145
      • 172.23.108.148

          People

            aditi.ahuja Aditi Ahuja
            sujay.gad Sujay Gad