Getting the date modified of files - web scraping with BeautifulSoup in Python

I am trying to download all CSV files from the following website: https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices . I have managed to do that with the following code:

from bs4 import BeautifulSoup
from io import StringIO
import requests
import pandas as pd

url = 'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

# Build absolute URLs for every CSV link in the table
csv_links = ['https://emi.ea.govt.nz' + a['href'] for a in soup.select('td.csv a')]

contents = []
for link in csv_links:
    req = requests.get(link)
    s = str(req.content, 'utf-8')
    data = StringIO(s)
    df = pd.read_csv(data)
    contents.append(df)

final_price = pd.concat(contents)

If possible, I'd like to streamline this process. The files on the website are updated every day, and I don't want to re-download everything each time; instead, I only want to fetch the files modified yesterday and append them to the files already in my folder. To achieve this, I need to scrape the Date Modified column along with the file URLs. I'd be grateful if someone could tell me how to get the dates on which the files were updated.
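For what it's worth, this is the kind of date check I'm after, as a minimal sketch. It assumes the Date Modified column keeps its "22 Mar 2022" text format; `is_from_yesterday` is a hypothetical helper name, not anything from the site:

```python
from datetime import datetime, timedelta

def is_from_yesterday(date_text, today=None):
    # Hypothetical helper: parse a "Date Modified" cell such as "22 Mar 2022"
    # and report whether it refers to the day before `today`.
    today = today or datetime.now().date()
    modified = datetime.strptime(date_text.strip(), '%d %b %Y').date()
    return modified == today - timedelta(days=1)

# With a fixed reference date, only the previous day's entry passes:
print(is_from_yesterday('22 Mar 2022', today=datetime(2022, 3, 23).date()))  # True
print(is_from_yesterday('20 Dec 2021', today=datetime(2022, 3, 23).date()))  # False
```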



Solution 1:[1]

You can apply a list-comprehension technique:

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

# Column 1 holds the CSV links, column 2 the "Date Modified" text
csv_links = ['https://emi.ea.govt.nz' + a['href'] for a in soup.select('td[class="expand-column csv"] a')]
modified_date = [date.text for date in soup.select('td[class="two"] a')[1:]]

df = pd.DataFrame(data=list(zip(csv_links, modified_date)), columns=['csv_links', 'modified_date'])
print(df)

Output:

                                      csv_links         modified_date
0    https://emi.ea.govt.nz/Wholesale/Datasets/Fina...   22 Mar 2022
1    https://emi.ea.govt.nz/Wholesale/Datasets/Fina...   22 Mar 2022
2    https://emi.ea.govt.nz/Wholesale/Datasets/Fina...   22 Mar 2022
3    https://emi.ea.govt.nz/Wholesale/Datasets/Fina...   22 Mar 2022
4    https://emi.ea.govt.nz/Wholesale/Datasets/Fina...   22 Mar 2022
..                                                 ...           ...
107  https://emi.ea.govt.nz/Wholesale/Datasets/Fina...   20 Dec 2021
108  https://emi.ea.govt.nz/Wholesale/Datasets/Fina...   20 Dec 2021
109  https://emi.ea.govt.nz/Wholesale/Datasets/Fina...   20 Dec 2021
110  https://emi.ea.govt.nz/Wholesale/Datasets/Fina...   20 Dec 2021
111  https://emi.ea.govt.nz/Wholesale/Datasets/Fina...   20 Dec 2021

[112 rows x 2 columns]
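With that DataFrame in hand, the asker's incremental goal can be sketched by filtering on `modified_date` before downloading anything. The sample rows and the fixed reference date below are placeholders, not real links from the site:

```python
import pandas as pd
from datetime import date, timedelta

# Placeholder rows standing in for the scraped DataFrame above
df = pd.DataFrame({
    'csv_links': ['https://emi.ea.govt.nz/a.csv', 'https://emi.ea.govt.nz/b.csv'],
    'modified_date': ['22 Mar 2022', '20 Dec 2021'],
})

# Parse the "22 Mar 2022"-style strings into real dates
df['modified_date'] = pd.to_datetime(df['modified_date'], format='%d %b %Y').dt.date

# Keep only the links modified yesterday (fixed "today" for illustration)
yesterday = date(2022, 3, 23) - timedelta(days=1)
to_fetch = df.loc[df['modified_date'] == yesterday, 'csv_links'].tolist()
print(to_fetch)  # ['https://emi.ea.govt.nz/a.csv']
```

Each URL in `to_fetch` can then be downloaded and appended as in the question's loop.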

Solution 2:[2]

You can use an nth-child range to filter for columns 1 and 2 of the table, along with the appropriate row offset within the table initially matched by class.

Then extract the URL or date (as text) within list comprehensions over slices of the returned list (which alternates column 1, column 2, column 1, and so on). Complete the URL, or convert the text to actual dates, within the respective list comprehensions, then zip the resulting lists and convert to a DataFrame.

import requests
from datetime import datetime
from bs4 import BeautifulSoup as bs
import pandas as pd

r = requests.get(
    'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices')
soup = bs(r.content, 'lxml')
selected_columns = soup.select('.table tr:nth-child(n+3) td:nth-child(-n+2)')
df = pd.DataFrame(zip(['https://emi.ea.govt.nz' + i.a['href'] for i in selected_columns[0::2]],
                      [datetime.strptime(i.text, '%d %b %Y').date() for i in selected_columns[1::2]]), columns=['name', 'date_modified'])

print(df)
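Once yesterday's files are downloaded, appending them to an existing local store can be a simple concat plus de-duplication, so re-running the script never doubles up rows. The frames and column names below are made up for illustration:

```python
import pandas as pd

# Hypothetical frames: `existing` stands for data already saved locally,
# `new` for freshly downloaded rows (one of which overlaps).
existing = pd.DataFrame({'TradingDate': ['2022-03-21'], 'Price': [100.0]})
new = pd.DataFrame({'TradingDate': ['2022-03-21', '2022-03-22'],
                    'Price': [100.0, 105.5]})

# Drop exact duplicates so repeated runs are idempotent
combined = pd.concat([existing, new]).drop_duplicates().reset_index(drop=True)
print(combined['TradingDate'].tolist())  # ['2022-03-21', '2022-03-22']
```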


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution sources:
[1] Solution 1
[2] Solution 2