Solr search t-shirt returns shirt

When I search for t-shirts in my Solr index, it returns shirts first. I configured my field type as follows:

    <fieldType name="my_test" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
      <analyzer type="index">        
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" splitOnNumerics="1" preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.NGramFilterFactory" maxGramSize="20" minGramSize="2"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">        
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" splitOnNumerics="1" preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

Do you see anything that could be improved? Thanks.



Solution 1:[1]

Your field uses StandardTokenizerFactory, which splits "t-shirt" at the hyphen and produces the token "shirt", so documents containing "shirt" match.

StandardTokenizerFactory tokenizes on whitespace and also strips punctuation characters.

The documentation for StandardTokenizerFactory describes it as follows:

Splits words at punctuation characters, removing punctuations. However, a dot that's not followed by whitespace is considered part of a token. Splits words at hyphens, unless there's a number in the token. In that case, the whole token is interpreted as a product number and is not split. Recognizes email addresses and Internet hostnames as one token.

If you want to search for "t-shirt" as a whole, the term should not be split during tokenization. I would suggest using KeywordTokenizerFactory.

The keyword tokenizer does not split the input provided to it. It does no processing on the string and returns the entire original text as a single token.

KeywordTokenizerFactory is typically used for sorting and faceting, where an exact match is required.

You can add another field that uses KeywordTokenizerFactory and run your search against it.
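
As a minimal sketch of that suggestion, the fragment below defines a second field type built on KeywordTokenizerFactory and copies an existing field into it. The names `my_test_exact`, `name`, and `name_exact` are hypothetical placeholders, not taken from the question's schema, and the lowercase filter is an added assumption so that matching ignores case:

    <!-- Hypothetical exact-match type: the whole value stays one token -->
    <fieldType name="my_test_exact" class="solr.TextField" omitNorms="true">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <!-- Assumption: lowercase so "T-Shirt" and "t-shirt" match -->
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <!-- "name" stands in for the existing field of type my_test -->
    <field name="name_exact" type="my_test_exact" indexed="true" stored="false"/>
    <copyField source="name" dest="name_exact"/>

Because the whole value becomes one token, a query on `name_exact` matches only documents whose entire field value is "t-shirt" (ignoring case), so this field would usually be searched alongside the original field rather than instead of it.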

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Abhijit Bashetti