Problems in converting ".to_datetime" in Python
I have the following list:
l = [<div class="date">8 December 2004</div>,
<div class="date">6 December 2004</div>,
<div class="date">18 October 2004</div>,
<div class="date">9 October 2004</div>,
<div class="date">8 August 2004</div>,
<div class="date">18 June 2004</div>,
<div class="date">23 December 2005</div>,
<div class="date">19 December 2005</div>,
<div class="date">19 December 2005</div>,
<div class="date">15 December 2005</div>]
I would like to convert it into a dataframe with a Date column in datetime format (the kind pd.to_datetime produces).
I tried many solutions (see one below) but I couldn't get my head around it.
pd.to_datetime(pd.DataFrame({'Date':l}), format = '%d %B %Y')
Can anyone help me?
Thanks!
Solution 1:[1]
If you're scraping it with BeautifulSoup, the elements of l are tags rather than strings, so you should be able to take each element's text and build your series from that:
pd.to_datetime(pd.Series([e.text for e in l]))
But if each element is actually a string already, you'll need to extract the date from the div markup first. To remove the div tags, you might want something like this:
import re
pd.to_datetime(pd.Series([re.sub(r'<\/?div.*?>', '', s) for s in l]))
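To end up with the dataframe the question asks for, the cleaned strings can be parsed and placed in a Date column. This is only a minimal sketch: it assumes the elements of l are plain strings (the list here is shortened from the question), and the names cleaned and df are illustrative.

```python
import re
import pandas as pd

# Assumption: the list holds plain strings; shortened from the question.
l = [
    '<div class="date">8 December 2004</div>',
    '<div class="date">18 October 2004</div>',
    '<div class="date">23 December 2005</div>',
]

# Strip the opening and closing div tags, leaving only the date text.
cleaned = [re.sub(r'</?div.*?>', '', s) for s in l]

# Parse with an explicit format and store the result in a Date column.
df = pd.DataFrame({'Date': pd.to_datetime(pd.Series(cleaned), format='%d %B %Y')})
print(df.dtypes)
```

Passing format explicitly is optional here, but it makes the parse fail loudly if an element does not look like "8 December 2004".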
Alternatively, you could extract the dates themselves using a regular expression perhaps like \d{1,2} \w+ \d{4}.
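A rough sketch of that extraction approach, using only the standard library so the matching step is easy to see. It assumes plain-string input (shortened from the question) and English month names; the names pattern and dates are illustrative, and the resulting list could equally be handed to pd.to_datetime.

```python
import re
from datetime import datetime

# Assumption: the list holds plain strings; shortened from the question.
l = [
    '<div class="date">8 December 2004</div>',
    '<div class="date">18 October 2004</div>',
    '<div class="date">23 December 2005</div>',
]

# Day (1-2 digits), month name, four-digit year.
pattern = r'\d{1,2} \w+ \d{4}'

dates = []
for s in l:
    m = re.search(pattern, s)
    if m is None:
        continue  # skip elements that contain no recognisable date
    dates.append(datetime.strptime(m.group(0), '%d %B %Y'))

print(dates)
```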
Note that calling re.compile() yourself is not necessary. For short scripts, like most Pandas scripts, regular expressions are compiled and cached automatically, according to the re module documentation:
The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn’t worry about compiling regular expressions.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
