'in AWS Athena, how to left join that matches only 1 row on the right with high performance?

I have 2 tables:

regex_table: contains the regex patterns

id regex_pattern
1 'hel.*'
2 '1.*'

string_table: contains strings

id string
1 hello
2 hell

The expected result should contain 1 matched string for each regex pattern, like

regex_pattern string
'hel.*' hello
'1.*'

each regex_pattern only needs one string, so in this example, both hell and hello can be a match for 'hel.*'

I need to do this on AWS Athena (presto under the hood) for a giant data set

I've tried:

  1. use row_number()
select * 
from (
  select *, 
    row_number() over (
      partition by regex_pattern
      order by string
    ) as rn
  from regex_table
    left join string_table on regexp_like(string, regex_pattern)
)
where rn = 1

this works, but too slow because the join has to check all the strings of string table while only one matched string is needed, the rest can be skipped, just like a break loop in ruby or python

  1. Use Correlated subquery
select regex_pattern, 
  (
    select string 
    from string_table 
    where regexp_like(string, regex_pattern) 
    limit 1
  )
from regex_table

but received error Correlated subquery in given context is not supported

Is there a better way than left join to solve this problem in athena?

Currently I have to do this matching in python to take advantage of the break loop to break out of looping string_table when a matched string is found. Is it possible to do the same in sql?



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source