'Postgres Full-Text Search with Hyphen and Numerals

I have observed what seems to me an odd behavior of Postgres' to_tsvector function.

SELECT to_tsvector('english', 'abc-xyz');

returns

'abc':2 'abc-xyz':1 'xyz':3

However,

SELECT to_tsvector('english', 'abc-001');

returns

'-001':2 'abc':1

Why not something like this?

'abc':2 'abc-001':1 '001':3

And what should I do to be able to search by the numeric portion alone, without the hyphen?



Solution 1:[1]

Seems the text search parser identifies the hyphen followed by digits to be the sign of a signed integer. Debug with ts_debug():

SELECT * FROM ts_debug('english', 'abc-001');

   alias   |   description   | token | dictionaries | dictionary | lexemes 
-----------+-----------------+-------+--------------+------------+---------
 asciiword | Word, all ASCII | abc   | {simple}     | simple     | {abc}
 int       | Signed integer  | -001  | {simple}     | simple     | {-001}

Other text search configurations (like 'simple' instead of 'english') won't help as the parser itself is "at fault" here (debatable).

A simple way around it (other than modifying the parser, which I never tried) would to pre-process strings and replace hyphens with m-dash () or just blanks to make sure those are identified as "Space symbols". (Actual signed integers lose their negative sign in the process.)

SELECT to_tsvector('english', translate('abc-001', '-', '—'))
    @@ to_tsquery ('english', '001');  -- true now

db<>fiddle here

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Erwin Brandstetter