'in AWS Athena, how to left join that matches only 1 row on the right with high performance?
I have 2 tables:
regex_table: contains the regex patterns
| id | regex_pattern |
|---|---|
| 1 | 'hel.*' |
| 2 | '1.*' |
string_table: contains strings
| id | string |
|---|---|
| 1 | hello |
| 2 | hell |
The expected result should contain 1 matched string for each regex pattern, like
| regex_pattern | string |
|---|---|
| 'hel.*' | hello |
| '1.*' |
each regex_pattern only needs one string, so in this example, both hell and hello can be a match for 'hel.*'
I need to do this on AWS Athena (presto under the hood) for a giant data set
I've tried:
- use row_number()
select *
from (
select *,
row_number() over (
partition by regex_pattern
order by string
) as rn
from regex_table
left join string_table on regexp_like(string, regex_pattern)
)
where rn = 1
this works, but too slow because the join has to check all the strings of string table while only one matched string is needed, the rest can be skipped, just like a break loop in ruby or python
- Use Correlated subquery
select regex_pattern,
(
select string
from string_table
where regexp_like(string, regex_pattern)
limit 1
)
from regex_table
but received error Correlated subquery in given context is not supported
Is there a better way than left join to solve this problem in athena?
Currently I have to do this matching in python to take advantage of the break loop to break out of looping string_table when a matched string is found. Is it possible to do the same in sql?
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
