Details
- Type: Bug
- Resolution: Fixed
- Priority: Blocker
- Version: 7.2.1
- Triage: Untriaged
- 0
- Yes
Description
Build: 7.2.1-5928
Test:
test_best_effort_distribution_max_group,GROUP=P1,cluster=D,D,D,F,F,F,F,F,F,fts_quota=3000,index_type=scorch,server_groups=sg1-D:F:F|sg2-D:F:F|sg3-D:F:F,replicas=2,partitions=12,eject_nodes=sg3-F:F,eject_type=failover
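For reference, the server_groups parameter above decodes into three groups, each with one Data (D) node and two Search/FTS (F) nodes. The snippet below is a minimal, hypothetical helper (not part of the actual test framework) that unpacks the relevant test parameters:

# Minimal sketch (assumption: not part of the actual test framework) that unpacks
# the server_groups/replicas/partitions arguments from the test spec above.

def parse_server_groups(spec: str) -> dict:
    """Parse e.g. 'sg1-D:F:F|sg2-D:F:F|sg3-D:F:F' into {group: [services per node]}."""
    groups = {}
    for group_spec in spec.split("|"):
        name, services = group_spec.split("-", 1)
        # 'D:F:F' -> one Data node and two FTS (Search) nodes in this server group
        groups[name] = services.split(":")
    return groups

params = {
    "server_groups": "sg1-D:F:F|sg2-D:F:F|sg3-D:F:F",
    "replicas": 2,              # two index replicas in addition to the active copy
    "partitions": 12,           # index partitions per copy
    "fts_quota": 3000,
    "index_type": "scorch",
    "eject_nodes": "sg3-F:F",   # both FTS nodes of server group 3 are failed over
}

layout = parse_server_groups(params["server_groups"])
print(layout)
# {'sg1': ['D', 'F', 'F'], 'sg2': ['D', 'F', 'F'], 'sg3': ['D', 'F', 'F']}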
Description:
There is a bug in the node failover scenario. If a user does not bring in a replacement node before rebalancing after a failover, the lost active or replica partitions of search indexes that resided on the affected node are not rebuilt. As a result, search requests return partial/incomplete results. In addition, existing indexes with defined replica(s) become vulnerable to further node failures.
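For context on the expected behaviour: with partitions=12 and replicas=2 there are three copies of every partition (one active plus two replicas), and with server-group-aware placement each of the three server groups is expected to end up holding a complete copy, so losing all FTS nodes in one group should still leave every partition queryable. The sketch below only works through that expectation and assumes the idealised one-copy-per-group placement; it is an illustration, not the actual placement algorithm:

# Illustration only: assumes best-effort, server-group-aware placement puts one
# complete copy of the index (all 12 partitions) into each of the three groups.
partitions = 12
replicas = 2
copies = replicas + 1                       # 1 active + 2 replicas = 3 copies
total_instances = partitions * copies       # 36 partition instances cluster-wide
print(total_instances)

# Expected placement: every server group holds all 12 partitions once.
placement = {sg: set(range(partitions)) for sg in ("sg1", "sg2", "sg3")}

# Fail over every FTS node in sg3 (eject_nodes=sg3-F:F) and rebalance without
# adding a replacement node.
surviving = {sg: parts for sg, parts in placement.items() if sg != "sg3"}

covered = set().union(*surviving.values())
# With the assumed placement, all 12 partitions remain covered, so the query
# should keep returning 1000 hits rather than dropping to 744.
assert covered == set(range(partitions))

The observed 744 of 1000 hits is roughly what one would expect if about a quarter of the partitions (~3 of 12; 9/12 = 75%) were no longer being served after the failover and rebalance, assuming documents are spread evenly across partitions.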
Test Steps:
1. Create a cluster with 3 server groups with the following config:
   server_groups=sg1-D:F:F|sg2-D:F:F|sg3-D:F:F,replicas=2,partitions=12
   (x:y:z means the 1st node runs service x, the 2nd node runs service y, and the 3rd node runs service z)
2. Create an FTS index on the default bucket.
3. Run an FTS query:
[2023-08-30 01:07:22,871] - [fts_base:2679] INFO - Running query {"indexName": "fts_idx", "size": 10000000, "from": 0, "explain": false, "query":{"match": "emp", "field": "type"}, "fields": [], "ctl": {"consistency": {"level": "", "vectors": {}}, "timeout": 60000}} on node as 172.23.106.207 : Administrator:
4. Fail over the FTS nodes in server group 3.
5. Let only the server group holding the maximum number of index partitions stay, and fail over the FTS nodes of the rest.
6. Find the FTS node in the maximal server group that holds the minimum number of index partitions.
7. To test that minimal FTS node individually, fail over all the other FTS nodes in the maximal server group.
8. Run the same query as in step 3 (a standalone reproduction sketch of this query/compare flow follows after these steps):
[2023-08-30 01:08:14,163] - [fts_base:2679] INFO - Running query {"indexName": "fts_idx", "size": 10000000, "from": 0, "explain": false, "query":{"match": "emp", "field": "type"}, "fields": [], "ctl": {"consistency": {"level": "", "vectors": {}}, "timeout": 60000}} on node as 172.23.106.207 : Administrator:
9. Assert the same number of results. <-- FAILURE
Fewer query hits are seen in step 9 than in step 3.
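The query/compare portion of the steps can be reproduced outside the test framework. The sketch below is a hypothetical standalone reproduction: it assumes the standard FTS query REST endpoint (/api/index/<index>/query on port 8094) and compares total_hits before and after the failover. The node address and query body are taken from the log lines above; the password is a placeholder.

# Minimal sketch (assumption: standalone reproduction, not the actual pytest code)
# of the "run query, fail over sg3 FTS nodes, run query again, compare hits" flow.
import requests

FTS_NODE = "http://172.23.106.207:8094"   # FTS node from the log above
AUTH = ("Administrator", "password")      # password is a placeholder
QUERY = {
    "indexName": "fts_idx",
    "size": 10000000,
    "from": 0,
    "explain": False,
    "query": {"match": "emp", "field": "type"},
    "fields": [],
    "ctl": {"consistency": {"level": "", "vectors": {}}, "timeout": 60000},
}

def total_hits() -> int:
    resp = requests.post(f"{FTS_NODE}/api/index/fts_idx/query", json=QUERY, auth=AUTH)
    resp.raise_for_status()
    return resp.json()["total_hits"]

initial_hits = total_hits()               # step 3: 1000 hits in the failing run

# ... fail over both FTS nodes in sg3 and rebalance without a replacement node ...

post_failover_hits = total_hits()         # step 8: only 744 hits observed
assert initial_hits == post_failover_hits, (
    f"Best effort distribution test is failed: {initial_hits} != {post_failover_hits}"
)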
======================================================================
FAIL: test_best_effort_distribution_max_group (fts.fts_server_groups.FTSServerGroups)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "pytests/fts/fts_server_groups.py", line 290, in test_best_effort_distribution_max_group
    self.assertEqual(initial_hits, min_fts_node_hints, "Best effort distribution test is failed.")
AssertionError: 1000 != 744 : Best effort distribution test is failed.
----------------------------------------------------------------------
Jenkins Log : http://qa.sc.couchbase.com/job/test_suite_executor/611447/consoleFull
Job Logs: https://cb-engineering.s3.amazonaws.com/logs/MB-58450/server_group.zip