Parsing messy date strings in Python
R has a very nice workflow that lets the user set the day/month/year order but otherwise copes with the messiness of user-input date strings:
date_str = c('05/03/2022', '14/03/2022', '14.03.2022', '14/03.2022')
lubridate::parse_date_time(date_str, orders = 'dmy')
#> [1] "2022-03-05 UTC" "2022-03-14 UTC" "2022-03-14 UTC" "2022-03-14 UTC"
The closest I've found in Python is:
from dateparser import parse
date_str = ['05/03/2022', '14/03/2022', '14.03.2022', '14/03.2022']
list(map(lambda l: parse(l, date_formats = ['dmy']), date_str))
[datetime.datetime(2022, 5, 3, 0, 0),
datetime.datetime(2022, 3, 14, 0, 0),
datetime.datetime(2022, 3, 14, 0, 0),
datetime.datetime(2022, 3, 14, 0, 0)]
which handles the messiness but transposes day and month in the first observation. I think this is because date_formats prioritises explicitly defined formats and otherwise reverts to the (silly) default US month-day-year order?
Is there a nice implementation in Python that can be relied upon to handle the messiness while also assuming a day/month ordering?
Solution 1:[1]
Well, if dateparser otherwise does what you like, you can gently wrap it to prioritize the format you like:
import dateparser
import datetime
import re
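# Group names match the keyword arguments of datetime.datetime (day, month, year).
dmy_re = re.compile(r"^(?P<day>\d+)/(?P<month>\d+)/(?P<year>\d+)$")

def parse_with_dmy_priority(ds):
    dmy_match = dmy_re.match(ds)
    if dmy_match:
        # Unambiguous day/month/year string: build the datetime directly.
        return datetime.datetime(**{k: int(v) for (k, v) in dmy_match.groupdict().items()})
    # Anything else is left to dateparser.
    return dateparser.parse(ds)
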
in_data = ['05/03/2022', '14/03/2022', '14.03.2022', '14/03.2022']
print([parse_with_dmy_priority(d) for d in in_data])
[
    datetime.datetime(2022, 3, 5, 0, 0),
    datetime.datetime(2022, 3, 14, 0, 0),
    datetime.datetime(2022, 3, 14, 0, 0),
    datetime.datetime(2022, 3, 14, 0, 0),
]
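Note that only the slash-separated strings take the regex path in the wrapper above; the dotted and mixed ones fall through to dateparser. If you would rather keep all three on the day-first path, a minimal sketch could look like the following. It uses only the standard library; the names LOOSE_DMY_RE and parse_dmy_loose are made up for illustration, and returning None for impossible dates such as 31/02/2022 is an assumption about what you want, not something dateparser or the original wrapper does.

```python
import datetime
import re

# Hypothetical pattern: day, month and four-digit year, with "/", "." or "-"
# accepted (and freely mixed) as separators.
LOOSE_DMY_RE = re.compile(
    r"^\s*(?P<day>\d{1,2})[/.\-](?P<month>\d{1,2})[/.\-](?P<year>\d{4})\s*$"
)

def parse_dmy_loose(ds):
    """Return a datetime for day-first strings, or None if ds does not fit."""
    match = LOOSE_DMY_RE.match(ds)
    if match is None:
        return None
    parts = {k: int(v) for k, v in match.groupdict().items()}
    try:
        return datetime.datetime(**parts)
    except ValueError:
        # The shape matched but the values are not a real date (e.g. 31 February).
        return None

in_data = ['05/03/2022', '14/03/2022', '14.03.2022', '14/03.2022', '31/02/2022']
parsed = [parse_dmy_loose(d) for d in in_data]
print(parsed)
```

Here the first entry comes back as 5 March 2022, the next three as 14 March 2022, and the last as None, so a caller can decide whether to hand the leftovers to dateparser or reject them.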
This generalizes nicely too:
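def parse_date(ds, regexps=()):
    # Try each pattern in order; the first one that matches wins.
    for regexp in regexps:
        match = regexp.match(ds)
        if match:
            return datetime.datetime(**{k: int(v) for (k, v) in match.groupdict().items()})
    # No pattern matched: fall back to dateparser.
    return dateparser.parse(ds)
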
print([parse_date(d, regexps=[dmy_re]) for d in in_data])
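To show how the order of the patterns decides priority, here is a small sketch of the same idea with two patterns. It is a variation rather than the function above: the fallback is passed in as a parameter (so the example runs without dateparser installed), and YMD_RE, DMY_RE and parse_date_with are hypothetical names chosen for illustration.

```python
import datetime
import re

# Hypothetical patterns; the group names must match datetime.datetime's
# keyword arguments (year, month, day).
YMD_RE = re.compile(r"^(?P<year>\d{4})-(?P<month>\d{1,2})-(?P<day>\d{1,2})$")
DMY_RE = re.compile(r"^(?P<day>\d{1,2})/(?P<month>\d{1,2})/(?P<year>\d{4})$")

def parse_date_with(ds, regexps=(), fallback=None):
    """Try each regexp in order; hand anything unmatched to `fallback`."""
    for regexp in regexps:
        match = regexp.match(ds)
        if match:
            return datetime.datetime(**{k: int(v) for k, v in match.groupdict().items()})
    # Nothing matched: defer to the fallback parser if one was supplied.
    return fallback(ds) if fallback is not None else None

samples = ["2022-03-05", "05/03/2022", "March 5th"]
results = [parse_date_with(s, regexps=[YMD_RE, DMY_RE]) for s in samples]
print(results)
```

The first two samples both resolve to 5 March 2022 and the third gives None because no fallback was supplied; passing fallback=dateparser.parse would restore the behaviour of parse_date above for unmatched strings.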
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | AKX |
