Couchbase Server / MB-18631

[FTS] Bleve's standard analyzer performs stop word removal by default but ES' doesn't


Details

    • Bug
    • Resolution: Won't Fix
    • Major
    • 4.5.0
    • 4.5.0
    • cbft
    • None
    • Untriaged
    • Unknown

    Description

      Build
      4.5.0-1740

      Testcase
      test_27 in http://qa.sc.couchbase.com/view/FTS/job/cen006-p0-fts-vset00-00-custom-map-rqg/
      ./testrunner -i INI_FILE.ini get-cbcollect-info=False,get-logs=False,stop-on-failure=False,cluster=D+F,GROUP=ALL -t fts.stable_topology_fts.StableTopFTS.index_query_custom_mapping,items=1000,custom_map=True,cm_id=6,num_queries=100,compare_es=True,dataset=wiki,GROUP=P0

      [2016-03-09 14:42:52,900] - [fts_base:1085] INFO - Running query {"from": 0, "indexName": "custom_index", "fields": [], "explain": false, "ctl": {"timeout": 0, "consistency": {"vectors": {}, "level": ""}}, "query": {"disjuncts": [{"field": "title", "match": "George II of Great Britain"}, {"field": "title", "match": "AOC"}]}, "size": 10000000} on node: 172.23.106.72
      [2016-03-09 14:42:52,927] - [task:1071] INFO - FTS hits for query: {"disjuncts": [{"field": "title", "match": "George II of Great Britain"}, {"field": "title", "match": "AOC"}]} is 2 (took 9.423861ms)
      [2016-03-09 14:42:52,951] - [task:1081] INFO - ES hits for query: {"bool": {"should": [{"match": {"title": "George II of Great Britain"}}, {"match": {"title": "AOC"}}]}} on es_index is 53 (took 5ms)
      [2016-03-09 14:42:52,951] - [task:1086] ERROR - FAIL: FTS hits: 2, while ES hits: 53
      [2016-03-09 14:42:52,951] - [task:1101] ERROR - FAIL: Following 51 docs were not returned by FTS, but ES, printing 50: [u'wiki10000289', u'wiki10000288', u'wiki10000117', u'wiki10000507', u'wiki10000527', u'wiki10000285', u'wiki10000690', u'wiki10000287', u'wiki10000951', u'wiki10000661', u'wiki10000682', u'wiki10000666', u'wiki10000664', u'wiki10000705', u'wiki10000732', u'wiki10000725', u'wiki10000416', u'wiki10000695', u'wiki10000642', u'wiki10000318', u'wiki10000313', u'wiki10000899', u'wiki10000024', u'wiki10000218', u'wiki10000372', u'wiki10000276', u'wiki10000277', u'wiki10000007', u'wiki10000658', u'wiki10000659', u'wiki10000762', u'wiki10000654', u'wiki10000121', u'wiki10000650', u'wiki10000753', u'wiki10000392', u'wiki10000672', u'wiki10000673', u'wiki10000674', u'wiki10000675', u'wiki10000718', u'wiki10000758', u'wiki10000098', u'wiki10000903', u'wiki10000671', u'wiki10000355', u'wiki10000689', u'wiki10000685', u'wiki10000219', u'wiki10000290']
      

      Marty had noted once that ES' default analyzer was 'standard' with stop word removal disabled, but it turns out ES' standard analyzer does no stop word removal by default: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-standard-analyzer.html says the stopwords list is empty by default.
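
      A minimal sketch of why a stop word such as 'of' could account for the gap in hit counts, assuming both engines OR the analyzed terms of a match query. The titles, the one-word stop list, and the helper names terms and match_any are hypothetical and are not taken from the test run or from either engine's code.

```python
# Hypothetical titles standing in for the wiki dataset; not from the test run.
TITLES = {
    "doc1": "George II of Great Britain",
    "doc2": "History of Rome",
    "doc3": "Kingdom of Prussia",
    "doc4": "AOC",
    "doc5": "Jupiter",
}

# Abbreviated stop list for illustration only; not Bleve's actual list.
STOP_WORDS = {"of"}


def terms(text, stop_words=()):
    """Lowercase, split on whitespace, and drop any stop words."""
    return {word for word in text.lower().split() if word not in stop_words}


def match_any(queries, stop_words=()):
    """Return IDs of titles sharing at least one term with any query (OR semantics)."""
    query_terms = set().union(*(terms(q, stop_words) for q in queries))
    return sorted(
        doc_id
        for doc_id, title in TITLES.items()
        if terms(title, stop_words) & query_terms
    )


queries = ["George II of Great Britain", "AOC"]

# No stop word removal: 'of' stays in the query and matches every title containing it.
without_stop_removal = match_any(queries)

# Stop word removal on both index and query side: 'of' no longer contributes matches.
with_stop_removal = match_any(queries, STOP_WORDS)

print(without_stop_removal)
print(with_stop_removal)
```

      In this toy data set the titles that match only through 'of' drop out once the stop word is removed, which is the same shape as the FTS result being a small subset of the ES result.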

      Using Marty's ES rest endpoint for analysis:

      curl -XGET '172.23.106.53:9200/_analyze?analyzer=standard' -d 'George II of Great Britain'
      {  
         "tokens":[  
            {  
               "token":"george",
               "start_offset":0,
               "end_offset":6,
               "type":"<ALPHANUM>",
               "position":1
            },
            {  
               "token":"ii",
               "start_offset":7,
               "end_offset":9,
               "type":"<ALPHANUM>",
               "position":2
            },
            {  
               "token":"of",
               "start_offset":10,
               "end_offset":12,
               "type":"<ALPHANUM>",
               "position":3
            },
            {  
               "token":"great",
               "start_offset":13,
               "end_offset":18,
               "type":"<ALPHANUM>",
               "position":4
            },
            {  
               "token":"britain",
               "start_offset":19,
               "end_offset":26,
               "type":"<ALPHANUM>",
               "position":5
            }
         ]
      }
      

      By contrast, analysis.blevesearch.com returns only 4 terms (the stop word 'of' is removed):

       Analyze
      Text: George II of Great Britain
       
      Position	Term	Start	End
      1	george	0	6
      2	ii	7	9
      4	great	13	18
      5	britain	19	26
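
      The two analyses above can be reproduced with a rough sketch, assuming a simplified tokenizer (lowercased alphanumeric runs) and a hypothetical, abbreviated stop list. This is not the actual ES or Bleve implementation; the function name analyze and the stop list are illustrative only.

```python
import re

# Hypothetical, abbreviated stop list; not Bleve's actual English stop list.
STOP_WORDS = frozenset({"a", "an", "and", "of", "the", "to"})


def analyze(text, stop_words=frozenset()):
    """Return (position, term, start, end) tuples for each kept token.

    Positions are assigned before filtering, so a removed stop word
    leaves a gap in the position sequence.
    """
    tokens = []
    for position, match in enumerate(re.finditer(r"[A-Za-z0-9]+", text), start=1):
        term = match.group().lower()
        if term in stop_words:
            continue  # drop the stop word but keep counting positions
        tokens.append((position, term, match.start(), match.end()))
    return tokens


text = "George II of Great Britain"

# Empty stop list, as the ES documentation describes for the standard analyzer.
es_like = analyze(text)

# Stop words removed, as observed on analysis.blevesearch.com.
bleve_like = analyze(text, STOP_WORDS)

for row in bleve_like:
    print(*row, sep="\t")
```

      With the stop list applied, position 3 is skipped, matching the gap between 'ii' and 'great' in the Bleve output.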
      

          People

            apiravi Aruna Piravi (Inactive)