'Hive - How to extract dats from a string with different formats
Hi iam trying to extract dates from the string using regex function.How to extract dates dynamically when the date format changes from mm/dd/yyyy to mm/dd/yy. The below regexp works for mm/dd/yyyy format.What if my string has both the formats and i want to extract dates dynamically.
string1 = '2021 Flight 5 Accessory Fixtures: DELIVER BETWEEN 7/25/2021 and 8/7/2021' below function works fine.
string1 = '2021 Flight 5 Accessory Fixtures: DELIVER BETWEEN 7/25/21 and 8/7/21' how to extract if 2 dates in both the scenarios.
regexp_extract(string1, '^.*(\\d{1,2}/\\d{1,2}/\\d{4}).*(\\d{1,2}/\\d{1,2}/\\d{4})',1) date1,
regexp_extract(string1, '^.*(\\d{1,2}/\\d{1,2}/\\d{4}).*(\\d{1,2}/\\d{1,2}/\\d{4})',2) date2
Solution 1:[1]
Use CASE expression to check different formats and parse accordingly. Two digits year is a bad practice, but if you already have such data, then nothing to do, this is common approach for parsing different formats - CASE expressions:
Demo:
with mytable as (
select '2021 Flight 5 Accessory Fixtures: DELIVER BETWEEN 7/25/2021 and 8/7/2021' string1 union all
select '2021 Flight 5 Accessory Fixtures: DELIVER BETWEEN 7/25/21 and 8/7/21'
)
select case when string1 rlike '(?i)BETWEEN (\\d{1,2}/\\d{1,2}/\\d{4})'
then date(from_unixtime (unix_timestamp(regexp_extract(string1, '(?i)BETWEEN (\\d{1,2}/\\d{1,2}/\\d{4})',1),'MM/dd/yyyy')
))
when string1 rlike '(?i)BETWEEN \\d{1,2}/\\d{1,2}/\\d{2} '
then date(from_unixtime (unix_timestamp(regexp_extract(string1, '(?i)BETWEEN (\\d{1,2}/\\d{1,2}/\\d{2})',1),'MM/dd/yy')
))
end date1,
case when string1 rlike '(?i)AND (\\d{1,2}/\\d{1,2}/\\d{4})'
then date(from_unixtime (unix_timestamp(regexp_extract(string1, '(?i)AND (\\d{1,2}/\\d{1,2}/\\d{4})',1),'MM/dd/yyyy')
))
when string1 rlike '(?i)AND \\d{1,2}/\\d{1,2}/\\d{2}'
then date(from_unixtime (unix_timestamp(regexp_extract(string1, '(?i)AND (\\d{1,2}/\\d{1,2}/\\d{2})',1),'MM/dd/yy')
))
end date2
from mytable
Result:
date1 date2
2021-07-25 2021-08-07
2021-07-25 2021-08-07
In case of two digits year, unix_timestamp function will guess missing digits: 7/25/21 is parsed as 2021-07-25 and 7/25/98 parsed as 1998-07-25. There can be errors in such guess, mentioned by @PanagiotisKanavos in the comment.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
