In AWS Athena, how do I write a left join that matches only 1 row on the right, with high performance?

I have 2 tables:

regex_table: contains the regex patterns

id	regex_pattern
1	'hel.*'
2	'1.*'

string_table: contains strings

id	string
1	hello
2	hell

The expected result should contain one matched string for each regex pattern (or nothing when no string matches), like:

regex_pattern	string
'hel.*'	hello
'1.*'

Each regex_pattern needs only one string, so in this example either hell or hello is an acceptable match for 'hel.*'.

I need to do this on AWS Athena (Presto under the hood) for a giant data set.

I've tried:

Use row_number()

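-- Rank every match per pattern, then keep only the first one.
select regex_pattern, string
from (
  select r.regex_pattern,
    s.string,
    row_number() over (
      partition by r.regex_pattern
      order by s.string
    ) as rn
  from regex_table r
    left join string_table s on regexp_like(s.string, r.regex_pattern)
)
where rn = 1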

This works, but it is too slow: the join tests every string in string_table against every pattern, even though only one matching string per pattern is needed. The rest could be skipped, just like breaking out of a loop in Ruby or Python.

Use a correlated subquery

select regex_pattern, 
  (
    select string 
    from string_table 
    where regexp_like(string, regex_pattern) 
    limit 1
  )
from regex_table

But this fails with the error "Correlated subquery in given context is not supported".

Is there a better way than a left join to solve this problem in Athena?

Currently I have to do this matching in Python, where a break statement lets me stop scanning string_table as soon as a matching string is found. Is it possible to do the same in SQL?
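
For illustration, here is a minimal sketch of what such a Python loop might look like. It assumes both tables fit in memory as plain lists, and the function name first_match_per_pattern is hypothetical rather than taken from the actual code. It uses re.search because regexp_like in Presto only requires the pattern to be contained in the string. Python's re syntax is not identical to the Java regex syntax that Presto uses, so complex patterns may behave differently.

```python
import re


def first_match_per_pattern(patterns, strings):
    """Map each pattern to the first string it matches, or None if none match.

    Hypothetical helper: the inner loop stops at the first hit, which is
    the early exit the SQL join does not provide.
    """
    result = {}
    for pattern in patterns:
        compiled = re.compile(pattern)
        result[pattern] = None  # default when no string matches (the "left join" miss)
        for s in strings:
            # search() matches anywhere in the string, like regexp_like
            if compiled.search(s):
                result[pattern] = s
                break  # skip the remaining strings for this pattern
    return result


# Sample data from the two tables above
patterns = ["hel.*", "1.*"]
strings = ["hello", "hell"]

matches = first_match_per_pattern(patterns, strings)
print(matches)  # {'hel.*': 'hello', '1.*': None}
```

Patterns that match nothing still scan the whole list, so the early exit only helps for patterns that do match.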

sql amazon-athena presto

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
