'Check if a word is a part of another word
I have two columns with some text:
| text_1 | text_2 |
|---|---|
| astro lumen cosm planet | microcosm astronomy planet magnitude |
I need to remove a word from column text_1 if this word occurs in text_2 column (i.e., is a complete duplicate) or is a part of some word in text_2 column.
Desired output:
| text_1 | text_2 |
|---|---|
| lumen | microcosm astronomy planet magnitude |
How can I do this in PostgreSQL and/or PySpark?
Solution 1:[1]
Here is a way to do it in SQL:
WITH data AS (
SELECT 'astro lumen cosm planet' AS needles,
'microcosm astronomy planet magnitude' AS haystack
)
SELECT string_agg(needle.n, ' ')
FROM data
CROSS JOIN LATERAL regexp_split_to_table(data.needles, ' +') AS needle(n)
WHERE strpos(data.haystack, needle.n) = 0;
string_agg
????????????
lumen
(1 row)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Laurenz Albe |
