How to extract non-standard dates from text in Python?

I have a dataframe similar to the following one:

import pandas as pd

df = pd.DataFrame({'Text': [
    'Hello I would like to get only the date which is 12-13 December 2018 amid this text.',
    'Ciao, what I would like to do is to keep dates, e.g. 11-14 October 2019, and remove all the rest.',
    'Hi, SO can you help me delete everything but 10 January 2011. I found it hard doing it myself.',
]})

I would like to extract only dates from the text. The problem is that it is hard to find patterns. The only rule I can find there is: keep 2/3 objects before a four-digit number (i.e. the year).

I tried many convoluted solutions but I am not able to get what I need.

The result should look like this:

["12-13 December 2018",
 "11-14 October 2019",
 "10 January 2011"]

Can anyone help me?

Thanks!



Solution 1:[1]

If "keep 2/3 objects before a four-digit number (i.e. the year)" is a reliable rule, then you could use the following:

import re

data = {'Text': ['Hello I would like to get only the date which is 12-13 December 2018 amid this text.', 'Ciao, what I would like to do is to keep dates, e.g. 11-14 October 2019, and remove all the rest.','Hi, SO can you help me delete everything but 10 January 2011. I found it hard doing it myself.']}

date_strings = []
for string in data['Text']:     # loop through each string
    words = string.split()      # split the string on whitespace
    for index in range(len(words)):
        if re.search(r'\d{4}', words[index]):     # if the 'word' contains 4 consecutive digits
            start = max(index - 2, 0)             # avoid a negative slice start near the beginning of the text
            date_strings.append(' '.join(words[start:index + 1]))     # extract that word & the preceding 2
            break

print(date_strings)

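This prints the following (note that punctuation attached to the year, such as a trailing comma or full stop, is kept):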

['12-13 December 2018', '11-14 October 2019,', '10 January 2011.']
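
The trailing ',' and '.' differ from the output requested in the question. One possible way to remove them, assuming a date never legitimately ends in a punctuation character, is to strip punctuation from the right-hand side of each extracted string. The names `raw_dates` and `cleaned_dates` below are illustrative only:

```python
import string

# Output of the loop above, including the punctuation that was attached to the year
raw_dates = ['12-13 December 2018', '11-14 October 2019,', '10 January 2011.']

# rstrip removes any run of the given characters from the end of the string only,
# so the hyphen inside '12-13' is left untouched
cleaned_dates = [date.rstrip(string.punctuation) for date in raw_dates]

print(cleaned_dates)
# ['12-13 December 2018', '11-14 October 2019', '10 January 2011']
```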

Some assumptions:

  • the dates are always 3 'words' long
  • the years are always at the end of the dates
  • as pointed out in the comments, the only 4-digit number in the text is the year
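
If those assumptions hold, a single regular expression is an alternative to splitting on whitespace. The sketch below is not part of the original answer. It assumes every date has the form "day (or day-day) month-name year", with a one- or two-digit day and an English month name written in letters. `DATE_PATTERN` and `extract_date` are hypothetical names chosen for illustration:

```python
import re

# Assumed shape: 1-2 digit day, optional "-day" range, a word for the month, a 4-digit year
DATE_PATTERN = re.compile(r"\b\d{1,2}(?:-\d{1,2})?\s+[A-Za-z]+\s+\d{4}\b")

def extract_date(text):
    """Return the first date-like substring in text, or None if nothing matches."""
    match = DATE_PATTERN.search(text)
    return match.group(0) if match else None

texts = [
    'Hello I would like to get only the date which is 12-13 December 2018 amid this text.',
    'Ciao, what I would like to do is to keep dates, e.g. 11-14 October 2019, and remove all the rest.',
    'Hi, SO can you help me delete everything but 10 January 2011. I found it hard doing it myself.',
]

dates = [extract_date(text) for text in texts]
print(dates)
# ['12-13 December 2018', '11-14 October 2019', '10 January 2011']
```

Because the pattern ends at the year, surrounding punctuation is never captured. With the dataframe from the question, the same function could be applied per row with df['Text'].apply(extract_date).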

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 PangolinPaws