Couchbase Server · MB-58450

Failure to rebuild partitions for search indexes after node failover


Details


    Description

      Build: 7.2.1-5928

      test: 

      test_best_effort_distribution_max_group,GROUP=P1,cluster=D,D,D,F,F,F,F,F,F,fts_quota=3000,index_type=scorch,server_groups=sg1-D:F:F|sg2-D:F:F|sg3-D:F:F,replicas=2,partitions=12,eject_nodes=sg3-F:F,eject_type=failover 

      Description: 

      There is a bug in the node failover scenario. If a user rebalances after a failover without first bringing in a replacement node, the system fails to rebuild the lost active and replica partitions of any search index that had partitions on the failed node. As a result, search requests return partial/incomplete results. In addition, existing indexes with defined replica(s) become vulnerable to further node failures.
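      For illustration, the partition accounting implied by this configuration (partitions=12, replicas=2, 6 FTS nodes across 3 server groups) can be sketched as follows. The function names are hypothetical, not part of the test suite:

```python
# Hypothetical partition accounting for this test configuration
# (partitions=12, replicas=2, 6 FTS nodes across 3 server groups);
# illustration only, not part of the actual test suite.

PARTITIONS = 12  # active partitions per index
REPLICAS = 2     # replica copies of each partition

def total_partition_instances(partitions, replicas):
    """Each partition has one active copy plus `replicas` replica copies."""
    return partitions * (1 + replicas)

def instances_per_node(partitions, replicas, fts_nodes):
    """Even best-effort spread of all partition instances across FTS nodes."""
    return total_partition_instances(partitions, replicas) / fts_nodes

# All 6 FTS nodes up: 12 * (1 + 2) = 36 instances, 6 per node.
assert total_partition_instances(PARTITIONS, REPLICAS) == 36
assert instances_per_node(PARTITIONS, REPLICAS, 6) == 6.0

# After sg3's two FTS nodes are failed over and the cluster is rebalanced,
# the lost actives/replicas should be rebuilt on the 4 remaining nodes
# (9 instances each). The bug reported here is that this rebuild does not
# happen, so queries only see the surviving partitions.
assert instances_per_node(PARTITIONS, REPLICAS, 4) == 9.0
```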

      Test Steps:

      1. Create a cluster with 3 server groups with the following config:
        • server_groups=sg1-D:F:F|sg2-D:F:F|sg3-D:F:F,replicas=2,partitions=12
          (x:y:z means one node with service x, a second node with service y, and a third node with service z)
      2. Create an FTS index on the default bucket
      3. Run an FTS query

        [2023-08-30 01:07:22,871] - [fts_base:2679] INFO - Running query {"indexName": "fts_idx", "size": 10000000, "from": 0, "explain": false, "query":{"match": "emp", "field": "type"}, "fields": [], "ctl": {"consistency": {"level": "", "vectors": {}}, "timeout": 60000}} on node as 172.23.106.207 : Administrator:
        

      4. Fail over the FTS nodes in server group sg3
      5. Keep only the server group holding the maximum number of partitions, and fail over the remaining FTS nodes
      6. Find the FTS node in the maximal server group that holds the minimum number of index partitions
      7. To test that minimal node individually, fail over all the other FTS nodes in the maximal server group
      8. Run the same query as in step 3

        [2023-08-30 01:08:14,163] - [fts_base:2679] INFO - Running query {"indexName": "fts_idx", "size": 10000000, "from": 0, "explain": false, "query":{"match": "emp", "field": "type"}, "fields": [], "ctl": {"consistency": {"level": "", "vectors": {}}, "timeout": 60000}} on node as 172.23.106.207 : Administrator:
        

      9. Assert the same number of results ← FAILURE

      The query in step 8 returns fewer hits than the same query in step 3.

      ======================================================================
      FAIL: test_best_effort_distribution_max_group (fts.fts_server_groups.FTSServerGroups)
      ----------------------------------------------------------------------
      Traceback (most recent call last):
      File "pytests/fts/fts_server_groups.py", line 290, in test_best_effort_distribution_max_group
      self.assertEqual(initial_hits, min_fts_node_hints, "Best effort distribution test is failed.")
      AssertionError: 1000 != 744 : Best effort distribution test is failed.
      ----------------------------------------------------------------------
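      The step-9 check amounts to comparing the total hit counts of the two query runs. A minimal sketch of that comparison, using hypothetical canned hit counts in place of the real REST responses, might look like:

```python
# Minimal sketch of the step-9 assertion: compare hit counts from the
# query run before (step 3) and after (step 8) the failovers. The
# response dicts are hypothetical stand-ins; the real test drives a
# live cluster through the FTS REST API.

def hits(response):
    """Extract the total hit count from an FTS query response."""
    return response["total_hits"]

initial_response = {"total_hits": 1000}       # step 3: all partitions present
post_failover_response = {"total_hits": 744}  # step 8: partitions lost

initial_hits = hits(initial_response)
post_failover_hits = hits(post_failover_response)

# With a correct partition rebuild the two counts would be equal.
# Here they differ, reproducing the "1000 != 744" AssertionError above.
assert initial_hits != post_failover_hits
```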
      

      Jenkins Log : http://qa.sc.couchbase.com/job/test_suite_executor/611447/consoleFull

      Job Logs: https://cb-engineering.s3.amazonaws.com/logs/MB-58450/server_group.zip

            People

              sarthak.dua Sarthak Dua
              sarthak.dua Sarthak Dua
              Votes:
              0 Vote for this issue
              Watchers:
              13 Start watching this issue
