  1. Couchbase Server
  2. MB-18629

[FTS] Bleve tokenization is slightly different from ES


Details

    • Bug
    • Resolution: Fixed
    • Major
    • 4.5.0
    • 4.5.0
    • cbft
    • None
    • Untriaged
    • Unknown

    Description

      Build
      4.5.0-1740

      Testcase
      ./testrunner -i INI_FILE.ini -p skip-cleanup=True,get-cbcollect-info=False,get-logs=False,stop-on-failure=False,cluster=D:F,GROUP=ALL -t fts.stable_topology_fts.StableTopFTS.index_query_custom_mapping,items=1000,custom_map=True,cm_id=10,num_queries=100,compare_es=True,GROUP=P0

      1 of 100 queries fails; the failing query is:

      2016-03-09 14:26:16 | INFO | MainProcess | Cluster_Thread | [task.execute] ------------------------------------------------------------------ Query # 20 -----------------------------------------------------------------
      2016-03-09 14:26:16 | INFO | MainProcess | Cluster_Thread | [fts_base.run_fts_query] Running query {"from": 0, "indexName": "custom_index", "fields": [], "explain": false, "ctl": {"timeout": 0, "consistency": {"vectors": {}, "level": ""}}, "query": {"field": "email", "wildcard": "c*"}, "size": 10000000} on node: 172.23.106.175
      2016-03-09 14:26:16 | INFO | MainProcess | Cluster_Thread | [task.execute] FTS hits for query: {"field": "email", "wildcard": "c*"} is 118 (took 10.685187ms)
      2016-03-09 14:26:16 | INFO | MainProcess | Cluster_Thread | [task.execute] ES hits for query: {"wildcard": {"email": "c*"}} on es_index is 1000 (took 17ms)
      2016-03-09 14:26:16 | ERROR | MainProcess | Cluster_Thread | [task.execute] FAIL: FTS hits: 118, while ES hits: 1000
      2016-03-09 14:26:16 | ERROR | MainProcess | Cluster_Thread | [task.execute] FAIL: Following 882 docs were not returned by FTS, but ES, printing 50: [u'emp10000538', u'emp10000536', u'emp10000534', u'emp10000533', u'emp10000530', u'emp10000531', u'emp10000125', u'emp10000436', u'emp10000435', u'emp10000126', u'emp10000121', u'emp10000120', u'emp10000431', u'emp10000122', u'emp10000129', u'emp10000128', u'emp10000439', u'emp10000438', u'emp10000509', u'emp10000508', u'emp10000192', u'emp10000471', u'emp10000417', u'emp10000222', u'emp10000549', u'emp10000221', u'emp10000226', u'emp10000227', u'emp10000224', u'emp10000225', u'emp10000543', u'emp10000542', u'emp10000228', u'emp10000229', u'emp10000547', u'emp10000546', u'emp10000545', u'emp10000544', u'emp10000982', u'emp10000981', u'emp10000986', u'emp10000985', u'emp10000124', u'emp10000989', u'emp10000988', u'emp10000127', u'emp10000864', u'emp10000865', u'emp10000866', u'emp10000867']
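
The 882 missing docs in the log equal 1000 - 118, which is consistent with the FTS hits being a subset of the ES hits. A minimal sketch of that kind of comparison follows. The function name `missing_from_fts` and the sample hit lists are made up for illustration; this is not the testrunner's actual code.

```python
def missing_from_fts(fts_ids, es_ids):
    """Return doc IDs that ES returned but FTS did not, sorted for stable output."""
    return sorted(set(es_ids) - set(fts_ids))

# Tiny made-up hit lists (NOT the real result sets from the run above).
es_hits = ["emp10000538", "emp10000536", "emp10000001"]
fts_hits = ["emp10000001"]

missing = missing_from_fts(fts_hits, es_hits)
print(f"FTS hits: {len(fts_hits)}, ES hits: {len(es_hits)}, missing from FTS: {missing}")
```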
      

      ES returned the following doc for the above query (it is one of the docs FTS did not return):

      {
        _index: "es_index",
        _type: "emp",
        _id: "emp10000538",
        _version: 1,
        found: true,
        _source: {
          salary: 77923.86,
          name: "Dominique Josué",
          mutated: 0,
          is_manager: true,
          dept: "Finance",
          join_date: "1978-09-16T22:26:00",
          manages: {
            team_size: 10,
            reports: [
              "Desdomna Nicolás",
              "Cytheria Simón",
              "Araceli Johnson",
              "Callia Carter",
              "Callia Jackson",
              "Dabria Jones",
              "Safiya Cooper",
              "Solita Young",
              "Hedda Williams",
              "Antonia Cook"
            ]
          },
          languages_known: [
            "English",
            "Spanish",
            "German",
            "Italian",
            "French",
            "Arabic",
            "Africans",
            "Hindi",
            "Vietnamese",
            "Urdu",
            "Dutch",
            "Quechua",
            "Japanese",
            "Chinese",
            "Nepalese",
            "Thai",
            "Malay",
            "Sinhalese"
          ],
          emp_id: "10000538",
          type: "emp",
          email: "dominique@mcdiabetes.com"
        }
      }
      

      Adding a brief exchange between Marty and myself on the above behavior -
      [3/9/16, 2:50:01 PM] Aruna Piraviperumal: http://analysis.blevesearch.com/analysis tells me the terms in index for email are "dominique" and "mcdiabetes.com"
      [3/9/16, 2:50:19 PM] Marty Schoch: possibly they tokenize it differently
      [3/9/16, 2:50:26 PM] Marty Schoch: and the com is also a separate term
      [3/9/16, 2:52:29 PM] Marty Schoch: ES has a REST api to let you see how it analyzes things
      [3/9/16, 2:52:34 PM] Marty Schoch: very much like analyzer.blevesearch.com
      [3/9/16, 2:52:37 PM] Aruna Piraviperumal: oh nice
      [3/9/16, 2:52:42 PM] Marty Schoch: so i ran
      [3/9/16, 2:52:43 PM] Marty Schoch: curl -XGET 'localhost:9900/_analyze?analyzer=simple' -d 'dominique@mcdiabetes.com'
      [3/9/16, 2:52:54 PM] Marty Schoch: which is pointed at my local elasticsearch
      [3/9/16, 2:52:59 PM] Aruna Piraviperumal: came back with 3 terms?
      [3/9/16, 2:53:03 PM] Marty Schoch: you can change the name of the analyzer and the text to pass in
      [3/9/16, 2:53:05 PM] Marty Schoch: yes it returns
      [3/9/16, 2:53:13 PM] Marty Schoch: {
        "tokens": [
          { "token": "dominique", "start_offset": 0, "end_offset": 9, "type": "word", "position": 1 },
          { "token": "mcdiabetes", "start_offset": 10, "end_offset": 20, "type": "word", "position": 2 },
          { "token": "com", "start_offset": 21, "end_offset": 24, "type": "word", "position": 3 }
        ]
      }
      [3/9/16, 2:53:19 PM] Aruna Piraviperumal: clean
      [3/9/16, 2:53:35 PM] Marty Schoch: so, it would appear that our tokenizer behaves differently than theirs
      [3/9/16, 2:53:39 PM] Aruna Piraviperumal: yes
      [3/9/16, 2:53:47 PM] Aruna Piraviperumal: should it be similar?
      [3/9/16, 2:54:16 PM] Marty Schoch: it's designed to be the same, but the tokenizer is very complicated
      [3/9/16, 2:54:22 PM] Marty Schoch: theirs was using older unicode spec
      [3/9/16, 2:54:30 PM] Marty Schoch: so i actually made ours with newer unicode spec
      [3/9/16, 2:54:36 PM] Marty Schoch: but i can't say for sure that's the reason
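
To illustrate why the two tokenizations give different hit counts for the wildcard `c*`, here is a rough Python sketch. It assumes a wildcard query matches a document when any single indexed term matches the pattern. `letter_tokens` reproduces the three ES tokens quoted above by splitting on every non-letter. `dotted_tokens` is a hypothetical approximation that keeps a `.` between letters inside the token, which reproduces the two terms reported for Bleve; it is not Bleve's actual tokenizer. The second email address is made up.

```python
import re
from fnmatch import fnmatchcase

LETTERS = r"[^\W\d_]+"  # one or more Unicode letters

def letter_tokens(text):
    """Split on every non-letter and lowercase (matches the ES output quoted above)."""
    return [t.lower() for t in re.findall(LETTERS, text)]

def dotted_tokens(text):
    """Hypothetical approximation: a '.' between two letter runs stays inside the token."""
    return [t.lower() for t in re.findall(LETTERS + r"(?:\." + LETTERS + r")*", text)]

def wildcard_hit(terms, pattern):
    """A doc matches if any single term matches the wildcard pattern."""
    return any(fnmatchcase(term, pattern) for term in terms)

email = "dominique@mcdiabetes.com"
es_terms = letter_tokens(email)    # includes the separate term "com"
fts_terms = dotted_tokens(email)   # "mcdiabetes.com" stays one term

print(es_terms, wildcard_hit(es_terms, "c*"))
print(fts_terms, wildcard_hit(fts_terms, "c*"))

# A made-up address whose local part starts with "c" matches under both schemes.
print(wildcard_hit(dotted_tokens("callia@mcdiabetes.com"), "c*"))
```

Under these assumptions every email ending in `.com` matches `c*` in the ES-style index through the term `com`, while in the other index only addresses with a term that itself starts with `c` match. That would be consistent with 1000 ES hits against 118 FTS hits.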


          People

            apiravi Aruna Piravi (Inactive)
            apiravi Aruna Piravi (Inactive)