Elasticsearch synonym completely eliminated by analyzer
I am using a synonym file to create synonyms in Elasticsearch. My requirement is to show photo frames of different sizes.
For example:
6x9, 6 x 9 => 6x9
But when I close and re-open the index, I get the following error:
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "failed to build synonyms"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "failed to build synonyms",
    "caused_by": {
      "type": "parse_exception",
      "reason": "Invalid synonym rule at line 107",
      "caused_by": {
        "type": "illegal_argument_exception",
        "reason": "term: 6 x 9 was completely eliminated by analyzer"
      }
    }
  },
  "status": 400
}
It works fine for
8x10, 8 x 10 => 8x10
which means it only works when there are at least 2 digits after the x (i.e. the 10 in 8 x 10). 6x9 on its own works fine; the only issue is with 6 x 9, because it has spaces and the last number is a single digit. It also works fine if I change it to 6 x 09.
Here are the settings:
"analysis": {
  "filter": {
    "synonym_filter": {
      "type": "synonym",
      "synonyms_path": "analysis/synonyms.txt"
    },
    "suggestions_shingle": {
      "max_shingle_size": "4",
      "min_shingle_size": "2",
      "type": "shingle"
    },
    "english_stemmer_filter": {
      "name": "minimal_english",
      "type": "stemmer"
    },
    "edgeNGram_filter": {
      "min_gram": "2",
      "side": "front",
      "type": "edgeNGram",
      "max_gram": "20"
    }
  },
  "analyzer": {
    "whitespace_punc_analyzer": {
      "filter": [
        "lowercase",
        "asciifolding",
        "word_delimiter"
      ],
      "type": "custom",
      "tokenizer": "whitespace"
    },
    "edge_nGram_analyzer": {
      "filter": [
        "lowercase",
        "asciifolding",
        "synonym_filter"
      ],
      "type": "custom",
      "tokenizer": "edge_ngram_tokenizer"
    },
    "path_analyzer_lc": {
      "filter": [
        "lowercase"
      ],
      "tokenizer": "path_tokenizer"
    },
    "stemmer_synonym_analyzer": {
      "filter": [
        "synonym_filter",
        "lowercase",
        "english_stemmer_filter"
      ],
      "tokenizer": "standard"
    },
    "whitespace_analyzer": {
      "filter": [
        "lowercase",
        "asciifolding"
      ],
      "type": "custom",
      "tokenizer": "whitespace"
    },
    "synonym_analyzer": {
      "filter": [
        "synonym_filter",
        "lowercase",
        "edgeNGram_filter"
      ],
      "tokenizer": "standard"
    },
    "edge_nGram_shingle_analyzer": {
      "filter": [
        "lowercase",
        "asciifolding",
        "synonym_filter",
        "suggestions_shingle"
      ],
      "type": "custom",
      "tokenizer": "edge_ngram_tokenizer"
    },
    "path_analyzer": {
      "tokenizer": "path_tokenizer"
    }
  },
  "tokenizer": {
    "edge_ngram_tokenizer": {
      "token_chars": [
        "letter",
        "digit"
      ],
      "min_gram": "2",
      "type": "edgeNGram",
      "max_gram": "6"
    },
    "path_tokenizer": {
      "ignore_case": "true",
      "type": "path_hierarchy",
      "delimiter": ">"
    }
  }
}
Thanks in advance!
Solution 1:[1]
It's because the edge_ngram_tokenizer has min_gram set to 2, and hence it cannot produce any tokens for single-character input.
POST _analyze
{
  "text": "6 x 9",
  "tokenizer": {
    "token_chars": [
      "letter",
      "digit"
    ],
    "min_gram": "2",
    "type": "edgeNGram",
    "max_gram": "6"
  }
}
=> tokens: []
For 8 x 10, only the token 10 is produced, which is probably not what you want either:
POST _analyze
{
  "text": "8 x 10",
  "tokenizer": {
    "token_chars": [
      "letter",
      "digit"
    ],
    "min_gram": "2",
    "type": "edgeNGram",
    "max_gram": "6"
  }
}
=> tokens: [10]
So the reason you're getting this error message is that the tokenizer doesn't produce any tokens for 6 x 9, and the token filters then have nothing to chew on.
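The behaviour above can be reproduced outside Elasticsearch with a short Python sketch. This is an approximation of what the edge n-gram tokenizer does, not the actual Lucene implementation: the input is split on characters outside token_chars (here, anything that isn't a letter or digit), and each resulting chunk is expanded into front edge n-grams of length min_gram to max_gram. A chunk shorter than min_gram yields no grams at all.

```python
import re

def edge_ngram_tokenize(text, min_gram=2, max_gram=6):
    """Rough simulation of an edgeNGram tokenizer with
    token_chars = [letter, digit] (not Lucene's real code)."""
    tokens = []
    # Split on anything that is not a letter or digit (token_chars).
    for chunk in re.findall(r"[A-Za-z0-9]+", text):
        # Emit front edge n-grams; chunks shorter than min_gram yield nothing.
        for n in range(min_gram, min(max_gram, len(chunk)) + 1):
            tokens.append(chunk[:n])
    return tokens

print(edge_ngram_tokenize("6 x 9"))   # []     -- every chunk is a single character
print(edge_ngram_tokenize("8 x 10"))  # ['10'] -- only '10' reaches min_gram
print(edge_ngram_tokenize("6 x 09"))  # ['09'] -- padding to 2 digits is why 6 x 09 works
```

This also explains the questioner's observations: 8 x 10 and 6 x 09 survive because one chunk reaches the 2-character minimum, while 6 x 9 produces an empty token stream, so the synonym rule fails to build.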
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Val |
