[FTS] Queries on fields analyzed with shingle token filter yield incorrect results

Description

Build : 4.7.0-990

Testcase :
./testrunner -i INI_FILE.ini get-cbcollect-info=True,get-coredumps=True,get-logs=False,stop-on-failure=False,cluster=D+F,GROUP=ALL -t fts.stable_topology_fts.StableTopFTS.index_query_custom_mapping,items=1000,custom_map=True,num_custom_analyzers=1,cm_id=22,num_queries=100,compare_es=True,GROUP=P0

Description:
If a field is analyzed with a custom analyzer that includes a shingle token filter, the match, prefix, wildcard, and inclusion/exclusion query types do not return the expected hits when compared to ES.

Sample Query of Prefix query type:

[2016-08-15 12:40:07,398] - [task:1112] INFO - ------------------------------------------------------------------ Query # 4 -----------------------------------------------------------------
[2016-08-15 12:40:07,417] - [fts_base:1173] INFO - Running query {"from": 0, "indexName": "custom_index", "fields": [], "explain": false, "ctl": {"timeout": 60000, "consistency": {"vectors": {}, "level": ""}}, "query": {"field": "dept", "prefix": "Pr"}, "size": 10000000} on node: 172.23.106.72:
[2016-08-15 12:40:07,453] - [task:1116] INFO - Status: {u'successful': 32, u'failed': 0, u'total': 32}
[2016-08-15 12:40:07,454] - [task:1140] INFO - FTS hits for query: {"field": "dept", "prefix": "Pr"} is 99 (took 5.217532ms)
[2016-08-15 12:40:07,457] - [task:1150] INFO - ES hits for query: {"prefix": {"dept": "Pr"}} on es_index is 0 (took 1ms)
[2016-08-15 12:40:07,458] - [task:1155] ERROR - FAIL: FTS hits: 99, while ES hits: 0
[2016-08-15 12:40:07,458] - [task:1170] ERROR - FAIL: Following 99 doc(s) were not returned by ES,but FTS, printing 50: [u'emp10000732', u'emp10000124', u'emp10000231', u'emp10000230', u'emp10000438', u'emp10000877', u'emp10000871', u'emp10000703', u'emp10000331', u'emp10000426', u'emp10000548', u'emp10000420', u'emp10000542', u'emp10000541', u'emp10000038', u'emp10000584', u'emp10000984', u'emp10000630', u'emp10000903', u'emp10000653', u'emp10000305', u'emp10000655', u'emp10000145', u'emp10000143', u'emp10000414', u'emp10000416', u'emp10000021', u'emp10000791', u'emp10000025', u'emp10000251', u'emp10000250', u'emp10000798', u'emp10000994', u'emp10000625', u'emp10000289', u'emp10000640', u'emp10000728', u'emp10000648', u'emp10000405', u'emp10000017', u'emp10000016', u'emp10000568', u'emp10000399', u'emp10000565', u'emp10000561', u'emp10000847', u'emp10000719', u'emp10000496', u'emp10000499', u'emp10000710']

Sample document not returned by ES, but returned by FTS:
{
"salary": 143829.85,
"name": "Treva Gerónimo",
"dept": "Finance",
"is_manager": false,
"mutated": 0,
"join_date": "1996-09-18T08:46:00",
"languages_known": [
"Arabic",
"Vietnamese",
"Romanian"
],
"emp_id": "10000732",
"type": "emp",
"email": "treva@mcdiabetes.com"
}

Sample Query of Wildcard query type:

[2016-08-15 12:40:07,582] - [task:1112] INFO - ------------------------------------------------------------------ Query # 7 -----------------------------------------------------------------
[2016-08-15 12:40:07,600] - [fts_base:1173] INFO - Running query {"from": 0, "indexName": "custom_index", "fields": [], "explain": false, "ctl": {"timeout": 60000, "consistency": {"vectors": {}, "level": ""}}, "query": {"field": "name", "wildcard": "Richardson*"}, "size": 10000000} on node: 172.23.106.72:
[2016-08-15 12:40:07,629] - [task:1116] INFO - Status: {u'successful': 32, u'failed': 0, u'total': 32}
[2016-08-15 12:40:07,629] - [task:1140] INFO - FTS hits for query: {"field": "name", "wildcard": "Richardson*"} is 20 (took 4.505438ms)
[2016-08-15 12:40:07,634] - [task:1150] INFO - ES hits for query: {"wildcard": {"name": "Richardson*"}} on es_index is 11 (took 1ms)
[2016-08-15 12:40:07,635] - [task:1155] ERROR - FAIL: FTS hits: 20, while ES hits: 11
[2016-08-15 12:40:07,635] - [task:1170] ERROR - FAIL: Following 9 doc(s) were not returned by ES,but FTS, printing 50: [u'emp10000083', u'emp10000673', u'emp10000725', u'emp10000147', u'emp10000133', u'emp10000623', u'emp10000776', u'emp10000203', u'emp10000375']

Sample document not returned by ES, but returned by FTS:
{
"salary": 77199.42,
"name": "Balandria Campbell",
"mutated": 0,
"is_manager": true,
"dept": "Engineering",
"join_date": "1975-05-11T19:11:00",
"manages": {
"team_size": 5,
"reports": [
"Solita Simón",
"Kerry Baker III",
"Basha Sr.",
"Araceli Turner",
"Treva Palmer"
]
},
"languages_known": [
"Malay",
"Dutch",
"Africans"
],
"emp_id": "10000083",
"type": "emp",
"email": "balandria@mcdiabetes.com"
}

Sample Query of Match query type:

[2016-08-15 12:40:08,106] - [task:1112] INFO - ------------------------------------------------------------------ Query # 11 -----------------------------------------------------------------
[2016-08-15 12:40:08,124] - [fts_base:1173] INFO - Running query {"from": 0, "indexName": "custom_index", "fields": [], "explain": false, "ctl": {"timeout": 60000, "consistency": {"vectors": {}, "level": ""}}, "query": {"disjuncts": [{"field": "dept", "match": "Pre-sales"}, {"field": "dept", "match": "Finance"}, {"field": "dept", "match": "Support"}]}, "size": 10000000} on node: 172.23.106.72:
[2016-08-15 12:40:08,153] - [task:1116] INFO - Status: {u'successful': 32, u'failed': 0, u'total': 32}
[2016-08-15 12:40:08,153] - [task:1140] INFO - FTS hits for query: {"disjuncts": [{"field": "dept", "match": "Pre-sales"}, {"field": "dept", "match": "Finance"}, {"field": "dept", "match": "Support"}]} is 9 (took 6.172896ms)
[2016-08-15 12:40:08,157] - [task:1150] INFO - ES hits for query: {"bool": {"should": [{"match": {"dept": "Pre-sales"}}, {"match": {"dept": "Finance"}}, {"match": {"dept": "Support"}}]}} on es_index is 0 (took 1ms)
[2016-08-15 12:40:08,157] - [task:1155] ERROR - FAIL: FTS hits: 9, while ES hits: 0
[2016-08-15 12:40:08,157] - [task:1170] ERROR - FAIL: Following 9 doc(s) were not returned by ES,but FTS, printing 50: [u'emp10000894', u'emp10000981', u'emp10000679', u'emp10000141', u'emp10000526', u'emp10000022', u'emp10000374', u'emp10000027', u'emp10000606']

Sample document not returned by ES, but returned by FTS:
{
"salary": 59826.13,
"name": "Safiya Jones",
"mutated": 0,
"is_manager": true,
"dept": "Finance",
"join_date": "1983-05-20T08:50:00",
"manages": {
"team_size": 9,
"reports": [
"Duvessa Lee",
"Treva White",
"Chatha Morgan",
"Jerica King Jr.",
"Caryssa Carter",
"Mia Williams",
"Callia Stewart",
"Kerry Lewis",
"Mia Moore"
]
},
"languages_known": [
"Vietnamese",
"Sinhalese",
"Malay"
],
"emp_id": "10000894",
"type": "emp",
"email": "safiya@mcdiabetes.com"
}

Sample Query of Inclusion query type:

[2016-08-15 12:40:07,635] - [task:1112] INFO - ------------------------------------------------------------------ Query # 8 -----------------------------------------------------------------
[2016-08-15 12:40:07,654] - [fts_base:1173] INFO - Running query {"from": 0, "indexName": "custom_index", "fields": [], "explain": false, "ctl": {"timeout": 60000, "consistency": {"vectors": {}, "level": ""}}, "query": {"query": "mutated:>4 +mutated:<=2444 salary:<147000.0 +dept:\"Finance\""}, "size": 10000000} on node: 172.23.106.72:
[2016-08-15 12:40:07,789] - [task:1116] INFO - Status: {u'successful': 32, u'failed': 0, u'total': 32}
[2016-08-15 12:40:07,789] - [task:1140] INFO - FTS hits for query: {"query": "mutated:>4 +mutated:<=2444 salary:<147000.0 +dept:\"Finance\""} is 0 (took 110.217102ms)
[2016-08-15 12:40:07,928] - [task:1150] INFO - ES hits for query: {"query_string": {"query": "mutated:>4 +mutated:<=2444 salary:<147000.0 +dept:\"Finance\""}} on es_index is 1000 (took 17ms)
[2016-08-15 12:40:07,929] - [task:1155] ERROR - FAIL: FTS hits: 0, while ES hits: 1000
[2016-08-15 12:40:07,930] - [task:1170] ERROR - FAIL: Following 1000 docs were not returned by FTS, but ES, printing 50: [u'emp10000538', u'emp10000539', u'emp10000536', u'emp10000537', u'emp10000534', u'emp10000535', u'emp10000532', u'emp10000533', u'emp10000530', u'emp10000531', u'emp10000125', u'emp10000436', u'emp10000435', u'emp10000126', u'emp10000433', u'emp10000120', u'emp10000431', u'emp10000430', u'emp10000129', u'emp10000128', u'emp10000439', u'emp10000438', u'emp10000509', u'emp10000508', u'emp10000192', u'emp10000471', u'emp10000721', u'emp10000417', u'emp10000222', u'emp10000223', u'emp10000549', u'emp10000221', u'emp10000226', u'emp10000227', u'emp10000224', u'emp10000225', u'emp10000543', u'emp10000542', u'emp10000228', u'emp10000229', u'emp10000547', u'emp10000546', u'emp10000545', u'emp10000544', u'emp10000983', u'emp10000982', u'emp10000981', u'emp10000980', u'emp10000987', u'emp10000986']

Sample document not returned by FTS, but returned by ES:
{
"salary": 100356.34,
"name": "Quella Green",
"mutated": 0,
"is_manager": true,
"dept": "HR",
"join_date": "2014-03-23T11:12:00",
"manages": {
"team_size": 6,
"reports": [
"Trista Lee",
"Casondrah Scott",
"Ambika Lee",
"Desdomna Campbell",
"Hedda Moore",
"Antonia Richardson IX"
]
},
"languages_known": [
"Vietnamese",
"English",
"Portuguese"
],
"emp_id": "10000538",
"type": "emp",
"email": "quella@mcdiabetes.com"
}

Sample Query of Exclusion query type:

[2016-08-15 12:40:13,809] - [task:1112] INFO - ------------------------------------------------------------------ Query # 64 -----------------------------------------------------------------
[2016-08-15 12:40:13,829] - [fts_base:1173] INFO - Running query {"from": 0, "indexName": "custom_index", "fields": [], "explain": false, "ctl": {"timeout": 60000, "consistency": {"vectors": {}, "level": ""}}, "query": {"query": "-dept:\"Sales\""}, "size": 10000000} on node: 172.23.106.72:
[2016-08-15 12:40:13,931] - [task:1116] INFO - Status: {u'successful': 32, u'failed': 0, u'total': 32}
[2016-08-15 12:40:13,932] - [task:1140] INFO - FTS hits for query: {"query": "-dept:\"Sales\""} is 1000 (took 60.117285ms)
[2016-08-15 12:40:13,936] - [task:1150] INFO - ES hits for query: {"query_string": {"query": "-dept:\"Sales\""}} on es_index is 0 (took 1ms)
[2016-08-15 12:40:13,936] - [task:1155] ERROR - FAIL: FTS hits: 1000, while ES hits: 0
[2016-08-15 12:40:13,937] - [task:1170] ERROR - FAIL: Following 1000 doc(s) were not returned by ES,but FTS, printing 50: [u'emp10000538', u'emp10000539', u'emp10000536', u'emp10000537', u'emp10000534', u'emp10000535', u'emp10000532', u'emp10000533', u'emp10000530', u'emp10000531', u'emp10000437', u'emp10000124', u'emp10000435', u'emp10000434', u'emp10000121', u'emp10000120', u'emp10000431', u'emp10000122', u'emp10000719', u'emp10000129', u'emp10000128', u'emp10000439', u'emp10000438', u'emp10000168', u'emp10000472', u'emp10000441', u'emp10000466', u'emp10000467', u'emp10000471', u'emp10000721', u'emp10000417', u'emp10000222', u'emp10000223', u'emp10000549', u'emp10000548', u'emp10000226', u'emp10000227', u'emp10000224', u'emp10000225', u'emp10000543', u'emp10000542', u'emp10000228', u'emp10000229', u'emp10000547', u'emp10000546', u'emp10000545', u'emp10000544', u'emp10000983', u'emp10000982', u'emp10000981']

Sample document returned by FTS, but not returned by ES:
{
"salary": 100356.34,
"name": "Quella Green",
"mutated": 0,
"is_manager": true,
"dept": "HR",
"join_date": "2014-03-23T11:12:00",
"manages": {
"team_size": 6,
"reports": [
"Trista Lee",
"Casondrah Scott",
"Ambika Lee",
"Desdomna Campbell",
"Hedda Moore",
"Antonia Richardson IX"
]
},
"languages_known": [
"Vietnamese",
"English",
"Portuguese"
],
"emp_id": "10000538",
"type": "emp",
"email": "quella@mcdiabetes.com"
}

Attaching the testrunner console output for this test. It contains the index definition and other queries that failed for this test.

Environment

None

Link to Log File, atop/blg, CBCollectInfo, Core dump

None

Release Notes Description

None

Attachments

2
  • 15 Sep 2016, 06:57 AM
  • 16 Aug 2016, 06:10 PM

Activity

Mihir Kamdar September 26, 2016 at 6:59 AM

Closing, as this issue is no longer seen in build 4.7.0-1142.

MartyM September 21, 2016 at 8:58 PM

The bleve SHA bump has happened; subsequent builds should contain the fix.

MartyM September 19, 2016 at 6:55 PM

I have broken out the second issue identified here into a separate issue: https://couchbasecloud.atlassian.net/browse/MB-20992#icft=MB-20992

This one can now be marked resolved once the bleve SHA is bumped.

MartyM September 15, 2016 at 2:16 PM

The remaining issue is actually quite different, and my preference would be to break it out as a separate issue for documentation purposes.

The query is:

2016-09-15 09:05:03 | INFO | MainProcess | Cluster_Thread | [task.execute] ------------------------------------------------------------------ Query # 57 -----------------------------------------------------------------
2016-09-15 09:05:03 | INFO | MainProcess | Cluster_Thread | [fts_base.run_fts_query] Running query {"from": 0, "indexName": "custom_index", "fields": [], "explain": false, "ctl": {"timeout": 60000, "consistency": {"vectors": {}, "level": ""}}, "query": {"query": "-languages_known:\"German\""}, "size": 10000000} on node: 127.0.0.1:9201
2016-09-15 09:05:03 | INFO | MainProcess | Cluster_Thread | [task.execute] Status: {u'successful': 32, u'failed': 0, u'total': 32}
2016-09-15 09:05:03 | INFO | MainProcess | Cluster_Thread | [task.execute] FTS hits for query: {"query": "-languages_known:\"German\""} is 1000 (took 26.677828ms)
2016-09-15 09:05:03 | INFO | MainProcess | Cluster_Thread | [task.execute] ES hits for query: {"query_string": {"query": "-languages_known:\"German\""}} on es_index is 0 (took 1ms)
2016-09-15 09:05:03 | ERROR | MainProcess | Cluster_Thread | [task.execute] FAIL: FTS hits: 1000, while ES hits: 0
2016-09-15 09:05:03 | ERROR | MainProcess | Cluster_Thread | [task.execute] FAIL: Following 1000 doc(s) were not returned by ES,but FTS, printing 50: [u'emp10000538', u'emp10000539', u'emp10000536', u'emp10000537', u'emp10000534', u'emp10000535', u'emp10000532', u'emp10000533', u'emp10000530', u'emp10000531', u'emp10000125', u'emp10000436', u'emp10000127', u'emp10000126', u'emp10000121', u'emp10000120', u'emp10000431', u'emp10000122', u'emp10000129', u'emp10000128', u'emp10000439', u'emp10000438', u'emp10000472', u'emp10000509', u'emp10000508', u'emp10000620', u'emp10000471', u'emp10000222', u'emp10000223', u'emp10000549', u'emp10000221', u'emp10000226', u'emp10000227', u'emp10000224', u'emp10000225', u'emp10000543', u'emp10000542', u'emp10000228', u'emp10000229', u'emp10000547', u'emp10000546', u'emp10000545', u'emp10000544', u'emp10000983', u'emp10000982', u'emp10000981', u'emp10000980', u'emp10000987', u'emp10000986', u'emp10000985']

The issue here is that a single term like 'German', when run through this particular analyzer (shingle min length 2), produces 0 tokens. So no documents have any indexed values at all for the "languages_known" field. At search time, analysis of the search term 'German' also produces 0 terms, so what do we do? Currently, if the analysis produces 0 tokens, we convert the query to a MatchNoneQuery. Here the MatchNoneQuery sits inside the MustNot of a boolean query, and thus we return everything.

Obviously our behavior differs slightly from ES in this case. There are 2 places the difference could be, and I'm still tracking that down.
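
To make the mechanism above concrete, here is a minimal Python sketch. It is a toy model, not bleve or ES code: the function names (`shingles`, `analyze`, `build_index`, `match_query`, `must_not_only`) are hypothetical, and it assumes whitespace tokenization, a shingle size of exactly 2, no original unigrams emitted, and each array element analyzed on its own.

```python
from typing import Dict, List, Set


def shingles(tokens: List[str], min_size: int = 2, max_size: int = 2) -> List[str]:
    """Toy shingle filter: join every run of min_size..max_size adjacent tokens.

    Assumption: the original single tokens are NOT emitted.
    """
    out: List[str] = []
    for size in range(min_size, max_size + 1):
        for start in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[start:start + size]))
    return out


def analyze(text: str) -> List[str]:
    """Hypothetical analyzer: whitespace tokenizer followed by the shingle filter."""
    return shingles(text.split())


def build_index(docs: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each doc id to the set of terms indexed for one field."""
    return {doc_id: {t for value in values for t in analyze(value)}
            for doc_id, values in docs.items()}


def match_query(index: Dict[str, Set[str]], text: str) -> Set[str]:
    """Match query; falls back to 'match none' when analysis yields 0 terms."""
    terms = analyze(text)
    if not terms:
        return set()  # behaves like a MatchNoneQuery
    return {doc_id for doc_id, indexed in index.items()
            if any(t in indexed for t in terms)}


def must_not_only(index: Dict[str, Set[str]], excluded: Set[str]) -> Set[str]:
    """Boolean query with only a must-not clause: every doc minus the excluded ones."""
    return set(index) - excluded


if __name__ == "__main__":
    docs = {
        "emp10000538": ["Vietnamese", "English", "Portuguese"],
        "emp10000083": ["Malay", "Dutch", "Africans"],
    }
    index = build_index(docs)
    print(analyze("German"))           # no shingles for a single token
    print(analyze("Kerry Baker III"))  # two 2-token shingles
    print(must_not_only(index, match_query(index, "German")))  # every doc id
```

In this model the excluded set is empty, so the must-not clause removes nothing and all documents come back, which matches the 1000 FTS hits in the log above.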

MartyM September 15, 2016 at 1:07 PM

I have pushed one fix to bleve here: https://github.com/blevesearch/bleve/commit/c5159251a90a0f2b3372ecb4a4604f8ac93d04db

However, another query is still failing.

Fixed

Details

Is this a Regression?

No

Triage

Untriaged

Created August 16, 2016 at 6:10 PM
Updated September 26, 2016 at 6:59 AM
Resolved September 21, 2016 at 8:58 PM