'How to compare a tsvector against another tsvector?

I'm trying to get "possibly related" records of a given record.

There's a tsvector (tsv) on the table, so I'm thinking how to convert the source tsv to a tsquery format to then find the most closely related matches like a normal ranked search.

SELECT title,
    link,
    IMAGE,
    intro,
    created_at,
    updated_at,
    ts_rank_cd(tsv, q.match::tsquery) AS rank
FROM items,
    (
        SELECT tsv AS match
        FROM items
        WHERE id = 1234
        ) AS q
WHERE id <> 1234
ORDER BY rank DESC LIMIT 10;

Is there a nice way to achieve this?



Solution 1:[1]

I did some poking around and it didn't seem like there was an easy way of doing this. I think to do it effectively you would probably need your own C functions which could provide a distance from one tsvector to another (then you could use KNN searches).

Either way there is not a very easy way to do this and it is likely a significant amount of work, but it seems like it should be a generally applicable problem so the general community might be interested in a solution.

Note this is not as trivial as it sounds. Suppose I write a book about Albert Lord's the Singer of Tales and his emphasis on poetic formulas. Suppose I call it "Albert Lord and the Ring of Words." This would create a tsvector of Albert:1 Lord:2 Ring:5 Words:7, The Lord of the Rings is Lord:2 Ring:5 which would create a very false sense of similarity. If you have any categorization involved, you would want to leverage that as well.

Solution 2:[2]

You could perhaps compare tsvector with similarity from the pg_trgm extension. Something like this:

SELECT title, similarity(STRIP(to_tsvector('english', title))::text, STRIP(to_tsvector('english', 'The Lord of the Rings'))::text) sim
FROM (VALUES
    ('Albert Lord and the Ring of Words'),
    ('The Ring of Words'),
    ('Albert Lord')
) t(title)
ORDER BY sim DESC

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Chris Travers
Solution 2 PPH