How to extract non-standard dates from text in Python?
I have a dataframe similar to the following one:
import pandas as pd
df = pd.DataFrame({'Text': ['Hello I would like to get only the date which is 12-13 December 2018 amid this text.', 'Ciao, what I would like to do is to keep dates, e.g. 11-14 October 2019, and remove all the rest.','Hi, SO can you help me delete everything but 10 January 2011. I found it hard doing it myself.']})
I would like to extract only dates from the text. The problem is that it is hard to find patterns. The only rule I can find there is: keep 2/3 objects before a four-digit number (i.e. the year).
I tried many convoluted solutions but I am not able to get what I need.
The result should look like this:
["12-13 December 2018",
 "11-14 October 2019",
 "10 January 2011"]
Can anyone help me?
Thanks!
Solution 1:[1]
If "keep 2/3 objects before a four-digit number (i.e. the year)" is a reliable rule, then you could use the following:
import re
data = {'Text': ['Hello I would like to get only the date which is 12-13 December 2018 amid this text.', 'Ciao, what I would like to do is to keep dates, e.g. 11-14 October 2019, and remove all the rest.','Hi, SO can you help me delete everything but 10 January 2011. I found it hard doing it myself.']}
date_strings = []
for string in data['Text']:  # loop through each string
    words = string.split()  # split the string on whitespace
    for index in range(len(words)):
        if re.search(r'\d{4}', words[index]):  # if the 'word' contains 4 consecutive digits
            # extract that word and the preceding 2 (clamped so the slice never wraps around)
            date_strings.append(' '.join(words[max(index - 2, 0):index + 1]))
            break  # stop at the first match in this string
print(date_strings)
To get:
['12-13 December 2018', '11-14 October 2019,', '10 January 2011.']
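Note that the second and third results still carry the punctuation that followed the year in the original sentence ("2019," and "2011."). A minimal sketch of one way to clean that up, assuming the unwanted characters only ever appear at the end of the extracted string (the name `cleaned` is just illustrative):

```python
import string

# The list produced by the loop above
date_strings = ['12-13 December 2018', '11-14 October 2019,', '10 January 2011.']

# Strip any trailing punctuation characters from each extracted date
cleaned = [s.rstrip(string.punctuation) for s in date_strings]
print(cleaned)
# ['12-13 December 2018', '11-14 October 2019', '10 January 2011']
```

Using `rstrip` rather than `strip` leaves the start of the string alone, which should not matter here since each date begins with a digit.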
Some assumptions:
- the dates are always 3 'words' long
- the years are always at the end of the dates
- as pointed out in the comments, the only 4-digit number in the text is the year
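If those assumptions feel too fragile, a regular expression that describes the date itself may be a reasonable alternative. The following is only a sketch, not part of the original answer: it assumes every date is a one- or two-digit day (optionally a day range such as 12-13), followed by a capitalised English word for the month, followed by a four-digit year. The names `DATE_PATTERN` and `extract_dates` are hypothetical.

```python
import re

texts = [
    'Hello I would like to get only the date which is 12-13 December 2018 amid this text.',
    'Ciao, what I would like to do is to keep dates, e.g. 11-14 October 2019, and remove all the rest.',
    'Hi, SO can you help me delete everything but 10 January 2011. I found it hard doing it myself.',
]

# Assumed date shape: day or day range, capitalised month word, 4-digit year
DATE_PATTERN = re.compile(r'\b\d{1,2}(?:-\d{1,2})?\s+[A-Z][a-z]+\s+\d{4}\b')

def extract_dates(texts):
    """Return the first date-like match from each text, skipping texts with none."""
    dates = []
    for text in texts:
        match = DATE_PATTERN.search(text)
        if match:
            dates.append(match.group())
    return dates

print(extract_dates(texts))
# ['12-13 December 2018', '11-14 October 2019', '10 January 2011']
```

Because the pattern ends at the year, trailing commas and full stops are not captured, and a stray four-digit number elsewhere in the text is ignored unless it happens to follow a day and a capitalised word.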
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | PangolinPaws |
