'How can I match things like "C#" and "C++" in PostgreSQL's full text search?
Unfortunately the parser discards all punctuation, so that 'C', 'C++' and 'C#' all end up as 'c'.
The only thing I can think of (aside from replacing the parser which isn't really an option right now), would be to implement my own miniature synonym dictionary before using any of the FTS functions, replacing "C++" with "Cpp" and "C#" with "Csharp" either in the application or in the queries, for example. Is there any way to do this with FTS functions/features?
Solution 1:[1]
The dictionary search filters out a number of symbols (spaces, punctuation) and irrelevant words (aka stopwords: a, theā¦). However, some programming languages contain one or the other, or even both!
To overcome this, we need to create a custom thesaurus, bring it to the attention of PostgreSQL, link it to a dictionary that is also customised because it must not be aware of stopwords, and finally alter the configuration for ASCII characters and symbols.
Our thesaurus should be located in /usr/local/share/postgresql/tsearch_data/
and we will name it prog_thesaurus.ths
. It'll look like this:
c + + : cplusplus
c # : csharp
. net : dotnet
f # : fsharp
Notice the spaces between the words and symbols!
Now we need to create a custom dictionary without stopwords. It will be based on the english one:
CREATE TEXT SEARCH DICTIONARY english_nostop_stem (
TEMPLATE = snowball,
Language = english
);
Then we tell PostgreSQL about our thesaurus:
CREATE TEXT SEARCH DICTIONARY prog_thesaurus (
TEMPLATE = thesaurus,
DictFile = prog_thesaurus,
Dictionary = english_nostop_stem
);
Finally, we must map our thesaurus with some token types. Here we'll map asciiword (c
, f
, net
, etc.) and blank (+
, #
, .
, etc.):
ALTER TEXT SEARCH CONFIGURATION english
ALTER MAPPING FOR asciiword, blank
WITH prog_thesaurus, english_stem;
If we test our new configuration:
SELECT * FROM ts_debug('english', 'c++');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+-------+------------------------------+-------------+---------
asciiword | Word, all ASCII | c | {prog_thesaurus,english_stem} | english_stem | {}
blank | Space symbols | + | {prog_thesaurus,english_stem} | english_stem | {+}
blank | Space symbols | + | {prog_thesaurus,english_stem} | english_stem | {+}
(3 lignes)
Yay! Our thesaurus is used! And with a little query:
SELECT * FROM plainto_tsquery('english','c++');
plainto_tsquery
-----------------
'cplusplus'
(1 ligne)
Tada! c++
is found as an occurrence of cplusplus
.
Solution 2:[2]
you can create thesaurus dictionary with lines
c + + : cpp
c # : csharp
and then add mapping for text search configuration for "blank" token type with your new dictionary mentioned first
Solution 3:[3]
Transliterate any relevant punctuation to words. C++ -> Cplusplus, C# -> Csharp, PL/SQL -> PLslashSQL
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | akarzim |
Solution 2 | cobets |
Solution 3 | Ants Aasma |