How to scrape a table from a webpage and convert it to a pandas DataFrame via looping in Python when pd.read_html() throws "HTTPError: Forbidden"?
I'm trying to build a program that takes this Economic Calendar Table and converts it to a pandas DataFrame for later use.
Initially, I thought of using the common pd.read_html() method in this piece of code:
import pandas as pd
#(the Selenium imports and driver setup are the same as in the full script below)

#function to wait for the element to be located by its XPATH
def wait_xpath(code):
    WebDriverWait(driver, 8).until(EC.presence_of_element_located((By.XPATH, code)))
#go to investing.com to check the economic calendar
driver.get('https://www.investing.com/economic-calendar/')
#wait for the economic calendar table to be located
wait_xpath('/html/body/div[5]/section/div[6]/table')
economic_calendar_for_today = pd.read_html('https://www.investing.com/economic-calendar/')
print(economic_calendar_for_today)
But after running it, it threw the following error:
HTTPError: Forbidden
This means the page is coded to play tough (i.e., it doesn't want its information to be scraped so easily).
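One way to sidestep the block might be to hand pd.read_html() the HTML the browser has already rendered, instead of the URL, so pandas never issues its own (blocked) request. A minimal, unverified sketch, assuming the driver from the full script below is already attached to the page; I haven't checked which list index the calendar table ends up at:

from io import StringIO
import pandas as pd

#let the browser fetch and render the page, then parse its HTML locally
driver.get('https://www.investing.com/economic-calendar/')
wait_xpath('/html/body/div[5]/section/div[6]/table')
#pd.read_html returns a list of DataFrames, one per <table> found in the HTML
tables = pd.read_html(StringIO(driver.page_source))
print(tables)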
Fortunately, I managed to work around it: I realized the rows of the Economic Calendar table can be iterated with Selenium and their text stored in a tuple, with this code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
#the variable that will store the selenium options
opt = Options()
#this allows selenium to take control of your Chrome Browser in DevTools mode.
opt.add_experimental_option("debuggerAddress", "localhost:9222")
#Use the chrome driver located at the corresponding path
s = Service(r'C:\Users\ResetStoreX\AppData\Local\Programs\Python\Python39\Scripts\chromedriver.exe')
#execute the chrome driver with the previous conditions
driver = webdriver.Chrome(service=s, options=opt)
def wait_xpath(code): #function to wait for the element to be located by its XPATH
    WebDriverWait(driver, 8).until(EC.presence_of_element_located((By.XPATH, code)))
#a tuple for later use
the_tuple = ()
#go to investing.com to check the economic calendar
driver.get('https://www.investing.com/economic-calendar/')
#wait for the economic calendar table to be located
wait_xpath('/html/body/div[5]/section/div[6]/table')
#store the table body information
table_body = driver.find_element(By.XPATH, '/html/body/div[5]/section/div[6]/table/tbody')
#iterate over the tr elements in the economic calendar table and store them in the tuple
rows = table_body.find_elements(By.TAG_NAME, 'tr')
for row in rows:
    the_tuple = the_tuple + (row.text,)
print(the_tuple)
Output: a tuple with one text string per table row (omitted here for brevity).
The thing is, I don't know how to include the expected volatility (represented with stars) in the method above, since the stars are icons rather than text, so row.text doesn't capture them. For instance:
- the XPATH of the volatility cell in the second row is /html/body/div[5]/section/div[6]/table/tbody/tr[2]/td[3]
- the XPATH of the volatility cell in the third row is /html/body/div[5]/section/div[6]/table/tbody/tr[3]/td[3]
- and so on...
And
- 1 star = Low Volatility Expected
- 2 stars = Moderate Volatility Expected
- 3 stars = High Volatility Expected
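If each star is rendered as its own icon element inside that td[3] cell (an assumption worth confirming in DevTools; the cell may instead expose the label via a title attribute), something like this could recover the label:

#map the star count to the label used on the site
VOLATILITY = {1: 'Low Volatility Expected',
              2: 'Moderate Volatility Expected',
              3: 'High Volatility Expected'}

def volatility_of(row):
    #assumption: each star is an <i> element inside the row's third cell
    cell = row.find_element(By.XPATH, './td[3]')
    stars = cell.find_elements(By.TAG_NAME, 'i')
    #if the cell carries the label in a title attribute instead, try:
    #return cell.get_attribute('title')
    return VOLATILITY.get(len(stars), 'Unknown')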
In the end, I would like some help improving the code above to get a df like this one (based on today):
| Time  | Currency | Volatility expected     | Event                                  | Actual | Forecast | Previous |
|-------|----------|-------------------------|----------------------------------------|--------|----------|----------|
| 02:00 | CHF      | Low Volatility Expected | SECO Economic Forecasts                |        |          |          |
| 04:00 | EUR      | Low Volatility Expected | Italian PPI (MoM) (Jan)                | 9.7%   |          | 1.1%     |
| 04:00 | ZAR      | Low Volatility Expected | Current Account (Q4)                   | 120.0B | 150.0B   | 216.0B   |
| ...   | ...      | ...                     | ...                                    | ...    | ...      | ...      |
| 21:00 | CNY      | Low Volatility Expected | China Thomson Reuters IPSOS PCSI (Mar) |        |          | 72.04    |
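For reference, this is the direction I have in mind, as an untested sketch: the td positions per column and the volatility_of() helper sketched above are assumptions that need checking against the live markup (rows with fewer cells, e.g. day headers, are skipped):

import pandas as pd

records = []
for row in table_body.find_elements(By.TAG_NAME, 'tr'):
    cells = row.find_elements(By.TAG_NAME, 'td')
    if len(cells) < 7: #skip day-header and filler rows
        continue
    records.append({'Time': cells[0].text,
                    'Currency': cells[1].text,
                    'Volatility expected': volatility_of(row),
                    'Event': cells[3].text,
                    'Actual': cells[4].text,
                    'Forecast': cells[5].text,
                    'Previous': cells[6].text})

df = pd.DataFrame(records)
print(df)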
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow