How to get an index item whose "name" is "McLaren" by searching for "mclaren" in Elasticsearch 1.7?

Here is the tokenizer -

"tokenizer": {
   "filename" : {
      "pattern" : "[^\\p{L}\\d]+",
      "type" : "pattern"
   }
},

Mapping -

"name": {
      "type": "string",
      "analyzer": "filename_index",
      "include_in_all": true,
      "fields": {
        "raw": {
          "type": "string",
          "index": "not_analyzed"
        },
        "lower_case_sort": {
          "type": "string",
          "analyzer": "naturalsort"
        }
      }
    },

Analyzer -

"filename_index" : {
         "tokenizer" : "filename",
         "filter" : [
          "word_delimiter", 
          "lowercase",
          "russian_stop", 
          "russian_keywords", 
          "russian_stemmer",
          "czech_stop",
          "czech_keywords",
          "czech_stemmer"
        ]
      },

I would like to find the indexed item by searching for mclaren, but the indexed name is McLaren. I would like to stick with query_string because a lot of other functionality is based on it. Here is the query that does not return the expected result -

{
"query": {
    "filtered": {
        "query": {
            "query_string" : {
                "query" : "mclaren",
                "default_operator" : "AND",
                "analyze_wildcard" : true,
            }
        }
    }
},
"size" :50,
"from" : 0,
"sort": {}
}

How could I accomplish this? Thank you!



Solution 1:[1]

I got it! The problem is almost certainly the word_delimiter token filter. By default it:

Splits tokens at letter case transitions. For example: PowerShot → Power, Shot

See the documentation.

So macLaren generates two tokens -> [mac, Laren], while maclaren generates only one token -> [maclaren].

analyze example :

POST _analyze
{
  "tokenizer": {
    "pattern": """[^\p{L}\d]+""",
    "type": "pattern"
  },
  "filter": [
    "word_delimiter"
  ],
  "text": ["macLaren", "maclaren"]
}

Response:

{
  "tokens" : [
    {
      "token" : "mac",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "Laren",
      "start_offset" : 3,
      "end_offset" : 8,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "maclaren",
      "start_offset" : 9,
      "end_offset" : 17,
      "type" : "word",
      "position" : 102
    }
  ]
}

So I think one option is to configure your word_delimiter with the option split_on_case_change set to false (see the parameters documentation).
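For example, that would mean defining a custom word_delimiter filter in the index settings and referencing it from your analyzer. A sketch of what the analysis settings could look like (the filter name filename_word_delimiter is just an illustration, not from your mapping):

```json
"settings": {
  "analysis": {
    "filter": {
      "filename_word_delimiter": {
        "type": "word_delimiter",
        "split_on_case_change": false
      }
    },
    "analyzer": {
      "filename_index": {
        "tokenizer": "filename",
        "filter": [
          "filename_word_delimiter",
          "lowercase",
          "russian_stop",
          "russian_keywords",
          "russian_stemmer",
          "czech_stop",
          "czech_keywords",
          "czech_stemmer"
        ]
      }
    }
  }
}
```

Note that on an existing index you need to close and reopen the index (or reindex) for analyzer changes to take effect, and documents indexed with the old analyzer must be reindexed to pick up the new tokens.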

PS: remember to remove the settings you previously added (cf. comments), since with that setting your query_string query would only target a name field that does not exist.
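To check the fix, you can re-run the same inline _analyze call with the option set (inline filter definitions as shown here work in the modern console syntax; on 1.7 you would instead reference the named filter configured in your index settings). With split_on_case_change disabled and lowercase applied, macLaren should come out as a single token that matches the query term mclaren's analyzed form:

```json
POST _analyze
{
  "tokenizer": {
    "pattern": """[^\p{L}\d]+""",
    "type": "pattern"
  },
  "filter": [
    {
      "type": "word_delimiter",
      "split_on_case_change": false
    },
    "lowercase"
  ],
  "text": ["macLaren"]
}
```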

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Pierre Mallet