'Hive - How to extract dats from a string with different formats

Hi iam trying to extract dates from the string using regex function.How to extract dates dynamically when the date format changes from mm/dd/yyyy to mm/dd/yy. The below regexp works for mm/dd/yyyy format.What if my string has both the formats and i want to extract dates dynamically.

string1 = '2021 Flight 5 Accessory Fixtures: DELIVER BETWEEN 7/25/2021 and 8/7/2021' below function works fine.

string1 = '2021 Flight 5 Accessory Fixtures: DELIVER BETWEEN 7/25/21 and 8/7/21' how to extract if 2 dates in both the scenarios.

regexp_extract(string1, '^.*(\\d{1,2}/\\d{1,2}/\\d{4}).*(\\d{1,2}/\\d{1,2}/\\d{4})',1) date1,
regexp_extract(string1, '^.*(\\d{1,2}/\\d{1,2}/\\d{4}).*(\\d{1,2}/\\d{1,2}/\\d{4})',2) date2


Solution 1:[1]

Use CASE expression to check different formats and parse accordingly. Two digits year is a bad practice, but if you already have such data, then nothing to do, this is common approach for parsing different formats - CASE expressions:

Demo:

with mytable as (

select '2021 Flight 5 Accessory Fixtures: DELIVER BETWEEN 7/25/2021 and 8/7/2021' string1 union all
select '2021 Flight 5 Accessory Fixtures: DELIVER BETWEEN 7/25/21 and 8/7/21'
)

select case when string1 rlike '(?i)BETWEEN (\\d{1,2}/\\d{1,2}/\\d{4})'
            then date(from_unixtime (unix_timestamp(regexp_extract(string1, '(?i)BETWEEN (\\d{1,2}/\\d{1,2}/\\d{4})',1),'MM/dd/yyyy')
                       ))
            
            when string1 rlike '(?i)BETWEEN \\d{1,2}/\\d{1,2}/\\d{2} '
            then date(from_unixtime (unix_timestamp(regexp_extract(string1, '(?i)BETWEEN (\\d{1,2}/\\d{1,2}/\\d{2})',1),'MM/dd/yy')
                       ))
        end date1,
        case when string1 rlike '(?i)AND (\\d{1,2}/\\d{1,2}/\\d{4})'
            then date(from_unixtime (unix_timestamp(regexp_extract(string1, '(?i)AND (\\d{1,2}/\\d{1,2}/\\d{4})',1),'MM/dd/yyyy')
                       ))
            
            when string1 rlike '(?i)AND \\d{1,2}/\\d{1,2}/\\d{2}'
            then date(from_unixtime (unix_timestamp(regexp_extract(string1, '(?i)AND (\\d{1,2}/\\d{1,2}/\\d{2})',1),'MM/dd/yy')
                       ))
        end date2
from mytable

Result:

date1       date2   
2021-07-25  2021-08-07
2021-07-25  2021-08-07

In case of two digits year, unix_timestamp function will guess missing digits: 7/25/21 is parsed as 2021-07-25 and 7/25/98 parsed as 1998-07-25. There can be errors in such guess, mentioned by @PanagiotisKanavos in the comment.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1