'Check if a word is a part of another word

I have two columns with some text:

text_1 text_2
astro lumen cosm planet microcosm astronomy planet magnitude

I need to remove a word from column text_1 if this word occurs in text_2 column (i.e., is a complete duplicate) or is a part of some word in text_2 column.

Desired output:

text_1 text_2
lumen microcosm astronomy planet magnitude

How can I do this in PostgreSQL and/or PySpark?



Solution 1:[1]

Here is a way to do it in SQL:

WITH data AS (
   SELECT 'astro lumen cosm planet' AS needles,
          'microcosm astronomy planet magnitude' AS haystack
)
SELECT string_agg(needle.n, ' ')
FROM data
   CROSS JOIN LATERAL regexp_split_to_table(data.needles, ' +') AS needle(n)
WHERE strpos(data.haystack, needle.n) = 0;

 string_agg 
????????????
 lumen
(1 row)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Laurenz Albe