'How get a field contains which at least two matching words from other table in clickhouse
I would like to get results from the fields (path) in table URL where the last directory in the path field from URL table contains at least two matching words from the same row in the Locations table.
Table URL
WITH arrayJoin([
('/directory1/paris-oslo-london.htm', 0 ),
('/directory1/paris-oslo-granada.htm', 1 ),
('/directory1/london-moscow.htm', 2),
('/directory1/barcelona-moscow.htm', 3),
('/directory2/keyword1-keyword2-lisboa-london.htm', 4 ),
('/directory3/ny-anykeyword-milan.htm', 5),
('/directory3/anykeyword-berlin-milan-moscow', 6)] )as data
SELECT data.1 as path,data.2 as value
Table
| path | value |
|---|---|
| /directory1/paris-oslo-london.htm | 0 |
| /directory1/paris-oslo-granada.htm | 1 |
| /directory1/london-moscow.htm | 2 |
| /directory1/barcelona-moscow.htm | 3 |
| /directory1/keyword1-keyword2-lisboa-london.htm | 4 |
| /directory3/ny-anykeyword-milan.htm | 5 |
Table Locations
WITH arrayJoin([
('paris','row 1'),
('oslo','row 2'),
('granada','row 3'),
('ny-paris-milan-granada','row 4'),
('london-moscow','row 5'),
('lisboa','row 6'),
('ny-aaaa-milan','row 7'),
('berlin-moscow-keyword3','row 8')] )as data
SELECT data.1 as location_name,data.2 as row_number
Table Locations
| location_name | row_number |
|---|---|
| paris | 1 |
| oslo | 2 |
| granada | 3 |
| ny-paris-milan-granada | 4 |
| london-moscow | 5 |
| lisboa | 6 |
| ny-aaaa-milan | 7 |
| berlin-moscow-keyword3 | 8 |
My sql
I try this SQL but dowcounts each keyword if exists in any rows, not in the same row
WITH arrayFlatten(groupUniqArray(splitByNonAlpha(arrayJoin([
('paris','row 1'),
('oslo','row 2'),
('granada','row 3'),
('ny-paris-milan-granada','row 4'),
('london-moscow','row 5'),
('lisboa','row 6'),
('ny-aaaa-milan','row 7'),
('berlin-moscow-keyword3','row 8')] ).1))) as location_namee
SELECT arrayJoin([
('/directory1/paris-oslo-london.htm', 0 ),
('/directory1/paris-oslo-granada.htm', 1 ),
('/directory1/london-moscow.htm', 2),
('/directory1/barcelona-moscow.htm', 3),
('/directory2/keyword1-keyword2-lisboa-london.htm', 4 ),
('/directory3/ny-anykeyword-milan.htm', 5),
('/directory3/anykeyword-berlin-milan-moscow', 6)] ).1 as path,
splitByNonAlpha(replaceRegexpAll(path,'^\/(.*)\/(.*)\.htm$', '\\2')) as words_in_directory,
arrayFilter(x -> has(location_namee,x),words_in_directory) as matchedWords
Result of my sql (not result desired)
| path | words_in_directory | matchedWords |
|---|---|---|
| /directory1/paris-oslo-london.htm | ['paris','oslo','london'] | ['paris','oslo','london'] |
| /directory1/paris-oslo-granada.htm | ['paris','oslo','granada'] | ['paris','oslo','granada'] |
| /directory1/london-moscow.htm | ['london','moscow'] | ['london','moscow'] |
| /directory1/barcelona-moscow.htm | ['barcelona','moscow'] | ['moscow'] |
| /directory2/keyword1-keyword2-lisboa-london.htm | ['keyword1','keyword2','lisboa','london'] | ['lisboa','london'] |
| /directory3/ny-anykeyword-milan.htm | ['ny','anykeyword','milan'] | ['ny','milan'] |
| /directory3/anykeyword-berlin-milan-moscow | ['directory3','anykeyword','berlin','milan','moscow'] | ['berlin','milan','moscow'] |
Results desired
| url | match | number_of_elements_matching_in_url | matchedWords | row_matched |
|---|---|---|---|---|
| directory1/paris-oslo-granada.htm | yes | 2 | 'paris','granada' | row 4 |
| directory3/ny-anykeyword-otherkeyword-milan.htm | yes | 2 | 'ny','milan' | row 7 |
| directory3/anykeyword-berlin-milan-moscow | yes | 2 | 'berlin','moscow' | row 8 |
| directory1/barcelona-moscow.htm | not | 1 | '-' | - |
| directory1/london-moscow.htm | yes | 2 | 'london','moscow' | row 5 |
| directory1/paris-oslo-london. | not | 1 | '-' | - |
| directory1/keyword1-keyword2-lisboa-london.htm | not | 1 | '-' | - |
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
