Couchbase Server / MB-58450

Failure to rebuild partitions for search indexes after node failover



    • Untriaged


      Build: 7.2.1-5928




      There is a bug in the node failover scenario. If a user fails over a node and rebalances without first adding a replacement node, the system never rebuilds the active or replica search index partitions that resided on the failed node. As a result, search requests return partial/incomplete results. In addition, existing indexes with defined replica(s) remain vulnerable to further node failures.
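To illustrate why results go partial: an index partition whose active and all replica copies lived on failed-over nodes has no surviving copy left to serve queries, and the bug is that the subsequent rebalance does not rebuild those copies. A toy sketch of the availability check follows; the node names and partition layout are hypothetical examples, not taken from this report, and the real placement logic lives in cbgt/cbft:

```python
# Toy model: which index partitions lose every copy after a failover?
# Node names and the partition layout below are hypothetical examples.

def unservable_partitions(partition_map, failed_nodes):
    """Return partitions with no surviving copy (active or replica)
    once the given nodes have been failed over."""
    failed = set(failed_nodes)
    return sorted(p for p, holders in partition_map.items()
                  if not (set(holders) - failed))

# Partitions spread across 6 FTS nodes, replicas=2 (3 copies each)
partition_map = {
    0: {"fts1", "fts3", "fts5"},
    1: {"fts2", "fts4", "fts6"},
    2: {"fts3", "fts4", "fts5"},
    3: {"fts1", "fts2", "fts6"},
}

# Fail over every FTS node in two of the three server groups
lost = unservable_partitions(partition_map, {"fts3", "fts4", "fts5", "fts6"})
print(lost)  # [2] -> queries silently omit partition 2's documents
```

With the bug, the follow-up rebalance never rebuilds the lost partition's copies, so every subsequent query keeps returning partial hits.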

      Test Steps:

      1. Create a cluster with 3 server groups, with the following config:
        • server_groups=sg1-D:F:F|sg2-D:F:F|sg3-D:F:F,replicas=2,partitions=12
          (x:y:z means the 1st node runs service x, the 2nd node runs service y, and the 3rd node runs service z)
      2. Create an FTS index on the default bucket
      3. Run an FTS query:

        [2023-08-30 01:07:22,871] - [fts_base:2679] INFO - Running query {"indexName": "fts_idx", "size": 10000000, "from": 0, "explain": false, "query":{"match": "emp", "field": "type"}, "fields": [], "ctl": {"consistency": {"level": "", "vectors": {}}, "timeout": 60000}} on node as : Administrator:

      4. Fail over the FTS nodes in server group 3
      5. Keep only the server group holding the maximum number of index partitions, and fail over the remaining FTS nodes
      6. Find the FTS node in that maximal server group holding the minimum number of index partitions
      7. To test that minimal FTS node individually, fail over all the other FTS nodes in the maximal server group
      8. Run the same query as in step 3

        [2023-08-30 01:08:14,163] - [fts_base:2679] INFO - Running query {"indexName": "fts_idx", "size": 10000000, "from": 0, "explain": false, "query":{"match": "emp", "field": "type"}, "fields": [], "ctl": {"consistency": {"level": "", "vectors": {}}, "timeout": 60000}} on node as : Administrator:

      9. Assert the same number of results <– FAILURE

      Seeing fewer query hits in step 9 than in step 3.

      FAIL: test_best_effort_distribution_max_group (fts.fts_server_groups.FTSServerGroups)
      Traceback (most recent call last):
      File "pytests/fts/fts_server_groups.py", line 290, in test_best_effort_distribution_max_group
      self.assertEqual(initial_hits, min_fts_node_hints, "Best effort distribution test is failed.")
      AssertionError: 1000 != 744 : Best effort distribution test is failed.
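The check the test automates can be sketched as follows: issue the same query through the search service's REST endpoint before (step 3) and after (step 8) the failovers, then compare the reported hit counts. This is a minimal sketch assuming the standard FTS query endpoint on port 8094; the host and credentials below are placeholders, not values from the report:

```python
import base64
import json

def build_fts_query_request(host, index_name, query,
                            user="Administrator", password="password"):
    """Build the URL, JSON body, and headers for an FTS query against
    the search REST API (port 8094), mirroring the payload logged in
    steps 3 and 8 above."""
    url = f"http://{host}:8094/api/index/{index_name}/query"
    body = json.dumps({
        "size": 10000000,
        "from": 0,
        "explain": False,
        "query": query,
        "fields": [],
        "ctl": {"consistency": {"level": "", "vectors": {}},
                "timeout": 60000},
    })
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    headers = {"Content-Type": "application/json",
               "Authorization": f"Basic {token}"}
    return url, body, headers

url, body, headers = build_fts_query_request(
    "127.0.0.1", "fts_idx", {"match": "emp", "field": "type"})
# POST this request before and after the failovers and compare the
# "total_hits" field of the two responses; with this bug the second
# count comes back lower (here, 1000 vs 744) because the lost
# partitions are never rebuilt.
```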

      Jenkins Log : http://qa.sc.couchbase.com/job/test_suite_executor/611447/consoleFull

      Job Logs: https://cb-engineering.s3.amazonaws.com/logs/MB-58450/server_group.zip





              sarthak.dua Sarthak Dua


