Description
Build
4.5.0-1740
Testcase
./testrunner -i INI_FILE.ini -p skip-cleanup=True,get-cbcollect-info=False,get-logs=False,stop-on-failure=False,cluster=D:F,GROUP=ALL -t fts.stable_topology_fts.StableTopFTS.index_query_custom_mapping,items=1000,custom_map=True,cm_id=10,num_queries=100,compare_es=True,GROUP=P0
1 of 100 queries fails. The failing query is:
2016-03-09 14:26:16 | INFO | MainProcess | Cluster_Thread | [task.execute] ------------------------------------------------------------------ Query # 20 -----------------------------------------------------------------
2016-03-09 14:26:16 | INFO | MainProcess | Cluster_Thread | [fts_base.run_fts_query] Running query {"from": 0, "indexName": "custom_index", "fields": [], "explain": false, "ctl": {"timeout": 0, "consistency": {"vectors": {}, "level": ""}}, "query": {"field": "email", "wildcard": "c*"}, "size": 10000000} on node: 172.23.106.175
2016-03-09 14:26:16 | INFO | MainProcess | Cluster_Thread | [task.execute] FTS hits for query: {"field": "email", "wildcard": "c*"} is 118 (took 10.685187ms)
2016-03-09 14:26:16 | INFO | MainProcess | Cluster_Thread | [task.execute] ES hits for query: {"wildcard": {"email": "c*"}} on es_index is 1000 (took 17ms)
2016-03-09 14:26:16 | ERROR | MainProcess | Cluster_Thread | [task.execute] FAIL: FTS hits: 118, while ES hits: 1000
2016-03-09 14:26:16 | ERROR | MainProcess | Cluster_Thread | [task.execute] FAIL: Following 882 docs were not returned by FTS, but ES, printing 50: [u'emp10000538', u'emp10000536', u'emp10000534', u'emp10000533', u'emp10000530', u'emp10000531', u'emp10000125', u'emp10000436', u'emp10000435', u'emp10000126', u'emp10000121', u'emp10000120', u'emp10000431', u'emp10000122', u'emp10000129', u'emp10000128', u'emp10000439', u'emp10000438', u'emp10000509', u'emp10000508', u'emp10000192', u'emp10000471', u'emp10000417', u'emp10000222', u'emp10000549', u'emp10000221', u'emp10000226', u'emp10000227', u'emp10000224', u'emp10000225', u'emp10000543', u'emp10000542', u'emp10000228', u'emp10000229', u'emp10000547', u'emp10000546', u'emp10000545', u'emp10000544', u'emp10000982', u'emp10000981', u'emp10000986', u'emp10000985', u'emp10000124', u'emp10000989', u'emp10000988', u'emp10000127', u'emp10000864', u'emp10000865', u'emp10000866', u'emp10000867']
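As a quick sanity check on the numbers in the log above, the sketch below parses the FAIL message and computes how many docs FTS would be missing if every FTS hit were also an ES hit. The result equals the 882 reported in the next log line, which suggests FTS returned a strict subset of the ES results. This is only an illustration, not testrunner code; the variable names are made up for this example.

```python
import re

# Log message copied from the failing run above.
fail_line = "FAIL: FTS hits: 118, while ES hits: 1000"

match = re.search(r"FTS hits: (\d+), while ES hits: (\d+)", fail_line)
fts_hits, es_hits = (int(group) for group in match.groups())

# If every FTS hit is also an ES hit, this is the number of docs FTS missed.
missing = es_hits - fts_hits
print(missing)  # 882
```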
ES returned the following doc (one of those missing from the FTS results) for the above query:
{
  _index: "es_index",
  _type: "emp",
  _id: "emp10000538",
  _version: 1,
  found: true,
  _source: {
    salary: 77923.86,
    name: "Dominique Josué",
    mutated: 0,
    is_manager: true,
    dept: "Finance",
    join_date: "1978-09-16T22:26:00",
    manages: {
      team_size: 10,
      reports: [
        "Desdomna Nicolás",
        "Cytheria Simón",
        "Araceli Johnson",
        "Callia Carter",
        "Callia Jackson",
        "Dabria Jones",
        "Safiya Cooper",
        "Solita Young",
        "Hedda Williams",
        "Antonia Cook"
      ]
    },
    languages_known: [
      "English",
      "Spanish",
      "German",
      "Italian",
      "French",
      "Arabic",
      "Africans",
      "Hindi",
      "Vietnamese",
      "Urdu",
      "Dutch",
      "Quechua",
      "Japanese",
      "Chinese",
      "Nepalese",
      "Thai",
      "Malay",
      "Sinhalese"
    ],
    emp_id: "10000538",
    type: "emp",
    email: "dominique@mcdiabetes.com"
  }
}
Adding a brief exchange between Marty and me about the above behavior:
[3/9/16, 2:50:01 PM] Aruna Piraviperumal: http://analysis.blevesearch.com/analysis tells me the terms in index for email are "dominique" and "mcdiabetes.com"
[3/9/16, 2:50:19 PM] Marty Schoch: possibly they tokenize it differently
[3/9/16, 2:50:26 PM] Marty Schoch: and the com is also a separate term
[3/9/16, 2:52:29 PM] Marty Schoch: ES has a REST api to let you see how it analyzes things
[3/9/16, 2:52:34 PM] Marty Schoch: very much like analyzer.blevesearch.com
[3/9/16, 2:52:37 PM] Aruna Piraviperumal: oh nice
[3/9/16, 2:52:42 PM] Marty Schoch: so i ran
[3/9/16, 2:52:43 PM] Marty Schoch: curl -XGET 'localhost:9900/_analyze?analyzer=simple' -d 'dominique@mcdiabetes.com'
[3/9/16, 2:52:54 PM] Marty Schoch: which is pointed at my local elasticsearch
[3/9/16, 2:52:59 PM] Aruna Piraviperumal: came back with 3 terms?
[3/9/16, 2:53:03 PM] Marty Schoch: you can change the name of the analyzer and the text to pass in
[3/9/16, 2:53:05 PM] Marty Schoch: yes it returns
[3/9/16, 2:53:13 PM] Marty Schoch: {
"tokens": [
,
,
{ "token": "com", "start_offset": 21, "end_offset": 24, "type": "word", "position": 3 } ]
}
[3/9/16, 2:53:19 PM] Aruna Piraviperumal: clean
[3/9/16, 2:53:35 PM] Marty Schoch: so, it would appear that our tokenizer behaves differently than there
[3/9/16, 2:53:39 PM] Aruna Piraviperumal: yes
[3/9/16, 2:53:47 PM] Aruna Piraviperumal: should it be similar?
[3/9/16, 2:54:16 PM] Marty Schoch: its designed to be the same, but the tokenizer is very complicated
[3/9/16, 2:54:22 PM] Marty Schoch: theres was using older unicode spec
[3/9/16, 2:54:30 PM] Marty Schoch: so i actually made ours with newer unicode spec
[3/9/16, 2:54:36 PM] Marty Schoch: but i cant say for sure thats the reason
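The effect of the tokenization difference on the `c*` wildcard query can be illustrated with a small sketch. This is not bleve or Elasticsearch code. The bleve terms are hard-coded from the chat above, `simple_like_terms` is a hypothetical approximation of the ES `simple` analyzer (lowercase, split at non-letters), and `wildcard_hits` is a hypothetical helper that matches a wildcard against whole terms. Whether the test's ES index actually uses the `simple` analyzer is an assumption.

```python
import re
from fnmatch import fnmatchcase

EMAIL = "dominique@mcdiabetes.com"

# Terms the bleve analysis page reported for this email (from the chat above).
BLEVE_TERMS = ["dominique", "mcdiabetes.com"]


def simple_like_terms(text):
    """Approximate ES's `simple` analyzer: lowercase, split at non-letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def wildcard_hits(terms, pattern):
    """Return the terms that match a wildcard pattern such as 'c*'."""
    return [term for term in terms if fnmatchcase(term, pattern)]


es_terms = simple_like_terms(EMAIL)
print(es_terms)                          # ['dominique', 'mcdiabetes', 'com']
print(wildcard_hits(es_terms, "c*"))     # ['com']
print(wildcard_hits(BLEVE_TERMS, "c*"))  # []
```

Under these assumptions, the ES-style term list contains `com`, which matches `c*`, so this doc is a hit in ES. With the bleve terms, `com` never appears as a term of its own, so the doc matches only if some other term happens to start with `c`. That would be consistent with ES returning all 1000 docs while FTS returns 118.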
Attachments
For Gerrit Dashboard: MB-18629

# | Subject | Branch | Project | Status | CR | V |
---|---|---|---|---|---|---|
62254,2 | bump bleve SHA for MB-18629 | master | manifest | MERGED | +2 | +1 |