Missing text from Elasticsearch highlighted text when field contains exclamation mark

When I run a search with highlighting enabled and the matched document field contains an exclamation mark, the returned highlighted text omits the part of the text that contains the exclamation mark.

Elasticsearch version 7.1.1

Document: { "name" : "Yahoo! Inc [Please refer to Altaba Inc and Verizon Communications Inc]"}

Search: a wildcard query for "inc*" with highlighting requested on the "name" field.

Expected: the highlighted text should be:

"Yahoo! <em>Inc</em> [Please refer to Altaba <em>Inc</em> and Verizon Communications <em>Inc</em>]"

actual: "Yahoo!" is missing from the response. Got:

"<em>Inc</em> [Please refer to Altaba <em>Inc</em> and Verizon Communications <em>Inc</em>]"

I think this has something to do with the ! mark. If I remove it, everything is OK.

Steps to reproduce:

Add document to a new index

POST test/_doc/
{
  "name" : "Yahoo! Inc [Please refer to Altaba Inc and Verizon Communications Inc]"
}

No other settings or mappings are defined.

Run the query

GET test/_search
{
  "query": {
    "bool": {
      "should": [
        { "wildcard": { "name": { "wildcard": "inc*" } } }
      ]
    }
  },
  "highlight": {
    "fields": {
      "name" : {}
    }
  }
}

Got following results:

"hits" : [
  {
    "_index" : "test",
    "_type" : "_doc",
    "_id" : "511tP3ABoqekxkoUshVf",
    "_score" : 1.0,
    "_source" : {
      "name" : "Yahoo! Inc [Please refer to Altaba Inc and Verizon Communications Inc]"
    },
    "highlight" : {
      "name" : [
        "<em>Inc</em> [Please refer to Altaba <em>Inc</em> and Verizon Communications <em>Inc</em>]"
      ]
    }
  }
]

expecting highlight:

"Yahoo! <em>Inc</em> [Please refer to Altaba <em>Inc</em> and Verizon Communications <em>Inc</em>]"


Solution 1:[1]

This is expected behavior: by default, the Elasticsearch highlighter returns only fragments of the field text that contain a match, not the whole field. See: https://www.elastic.co/guide/en/elasticsearch/reference/7.1/search-request-highlighting.html#unified-highlighter

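The default (unified) highlighter breaks the text into sentences, and ! and . are treated as the end of a sentence. "Yahoo!" therefore becomes a sentence of its own, and because it contains no match, the highlighter does not return it as part of the fragment.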
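To make the sentence-splitting effect concrete, here is a small Python simulation. It is only a rough illustration, not the Lucene code Elasticsearch actually runs: the regex-based splitter and the helper names `sentence_fragments` and `simulate_highlight` are assumptions made up for this sketch.

```python
import re

NAME = "Yahoo! Inc [Please refer to Altaba Inc and Verizon Communications Inc]"

def sentence_fragments(text):
    """Rough stand-in for a sentence boundary scanner:
    split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def simulate_highlight(text, prefix):
    """Keep only the sentences containing a word that starts with `prefix`
    (case-insensitive) and wrap each such word in <em> tags."""
    pattern = re.compile(rf"\b({re.escape(prefix)}\w*)", re.IGNORECASE)
    return [
        pattern.sub(r"<em>\1</em>", sentence)
        for sentence in sentence_fragments(text)
        if pattern.search(sentence)
    ]

# "Yahoo!" ends up as its own sentence and has no match, so it is dropped.
print(sentence_fragments(NAME))
print(simulate_highlight(NAME, "inc"))
```

In this sketch the first print shows two sentences, and the second shows that only the sentence with matches survives, which mirrors the output reported in the question.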

In my case the searched field held a short name, so I added "number_of_fragments" : 0, which forces the highlighter to return the entire contents of the field.

"highlight": {
  "fields": {
     "name" : {"number_of_fragments" : 0}
  }
}

This is the same issue as: https://github.com/elastic/elasticsearch/issues/52333
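If you build the request in application code rather than in the console, the same setting can be assembled as a plain dictionary. This is a minimal sketch using only the standard library; `build_search_body` is a hypothetical helper name, and sending the body to the cluster is left to whatever client you already use.

```python
import json

def build_search_body(field, pattern, number_of_fragments=None):
    """Build the wildcard query from the question, optionally setting
    number_of_fragments on the highlighted field (hypothetical helper)."""
    highlight_options = {}
    if number_of_fragments is not None:
        # 0 asks the highlighter for the whole field instead of fragments
        highlight_options["number_of_fragments"] = number_of_fragments
    return {
        "query": {
            "bool": {
                "should": [{"wildcard": {field: {"wildcard": pattern}}}]
            }
        },
        "highlight": {"fields": {field: highlight_options}},
    }

body = build_search_body("name", "inc*", number_of_fragments=0)
print(json.dumps(body, indent=2))
```

Leaving `number_of_fragments` unset reproduces the original request from the question, which makes it easy to compare the two responses.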

Solution 2:[2]

As andreyro says, this is expected behavior for the unified (default) Elasticsearch highlighter. I had the same issue, and reducing the number of fragments only made it worse. Fortunately, you can change which highlighter is used. Adding the following fixed the issue for me.

"highlight": {
    "fields": {
        "*": {
            "type": "plain"
        }
    }
}

Replace the wildcard "*" as needed for whatever fields you are searching. See the same documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/highlighting.html#set-highlighter-type
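For explicit field names instead of the wildcard, the highlight clause could be generated along these lines. This is only a sketch: `plain_highlight` is a hypothetical helper, and the field list is an assumption based on the "name" field from the question.

```python
import json

def plain_highlight(fields):
    """Return a highlight clause that selects the plain highlighter
    for each listed field (hypothetical helper)."""
    return {"fields": {name: {"type": "plain"} for name in fields}}

clause = plain_highlight(["name"])
print(json.dumps({"highlight": clause}, indent=2))
```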

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 andreyro
Solution 2