How to scrape a table from a webpage and convert it to a pandas df by looping in Python when pd.read_html() throws "HTTPError: Forbidden"?

I'm trying to build a program that takes this Economic Calendar Table and converts it to a pandas DataFrame for later use.

Initially, I thought of using the common pd.read_html() method in this piece of code:

#function to wait for the element to be located by its XPATH
def wait_xpath(code): 
    WebDriverWait(driver, 8).until(EC.presence_of_element_located((By.XPATH, code)))

#go to investing.com to check the economic calendar
driver.get('https://www.investing.com/economic-calendar/')

#wait for the economic calendar table to be located
wait_xpath('/html/body/div[5]/section/div[6]/table')

economic_calendar_for_today = pd.read_html('https://www.investing.com/economic-calendar/')
print(economic_calendar_for_today)

But after running it, it threw the following error:

HTTPError: Forbidden

Which means that this page was coded to play tough (i.e. it doesn't want its information scraped so easily).
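(For context: `pd.read_html(url)` performs its own HTTP request with pandas' default user agent, which the site rejects with a 403. One workaround worth sketching is to feed `pd.read_html()` HTML that was already fetched, e.g. `driver.page_source` from the Selenium session that rendered the page, instead of the URL. The snippet below demonstrates the idea with a small stand-in table so it runs without a browser; passing `driver.page_source` in place of `html` is the assumed real-world use.)

```python
from io import StringIO
import pandas as pd

# Stand-in for driver.page_source: HTML that has already been fetched,
# so pd.read_html never makes its own (blocked) HTTP request.
html = """
<table>
  <tr><th>Time</th><th>Currency</th><th>Event</th></tr>
  <tr><td>02:00</td><td>CHF</td><td>SECO Economic Forecasts</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> found
dfs = pd.read_html(StringIO(html))
print(dfs[0])
```

Note that `pd.read_html` still needs an HTML parser (lxml or html5lib/BeautifulSoup) installed, and the table must exist in the HTML at the moment `page_source` is read, hence the explicit wait in the code above.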

Fortunately, I managed to work around it: I realized that the information in the Economic Calendar Table on that page can be iterated over and stored in a tuple by looping, with this code:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

#the variable that will store the selenium options
opt = Options()
#this allows selenium to take control of your Chrome Browser in DevTools mode.
opt.add_experimental_option("debuggerAddress", "localhost:9222") 
#Use the chrome driver located at the corresponding path
s = Service(r'C:\Users\ResetStoreX\AppData\Local\Programs\Python\Python39\Scripts\chromedriver.exe')
#execute the chrome driver with the previous conditions
driver = webdriver.Chrome(service=s, options=opt) 

def wait_xpath(code): #function to wait for the element to be located by its XPATH
    WebDriverWait(driver, 8).until(EC.presence_of_element_located((By.XPATH, code)))

#a tuple for later use
the_tuple = () 

#go to investing.com to check the economic calendar
driver.get('https://www.investing.com/economic-calendar/')

#wait for the economic calendar table to be located
wait_xpath('/html/body/div[5]/section/div[6]/table')

#store the table body information
table_body = driver.find_element(By.XPATH, '/html/body/div[5]/section/div[6]/table/tbody')

#iterate over the tr elements in the economic calendar table and store them in the tuple
rows = table_body.find_elements(By.TAG_NAME, 'tr')
for row in rows:
    the_tuple = the_tuple + (row.text,)
print(the_tuple)

Output:

(screenshot of the printed tuple)

The thing is, I don't know how to include the volatility expected (represented with stars) in the method above, where for instance:

  • the XPATH of the second row is /html/body/div[5]/section/div[6]/table/tbody/tr[2]/td[3]
  • the XPATH of the third row is /html/body/div[5]/section/div[6]/table/tbody/tr[3]/td[3]
  • And so on...

And

  • 1 star = Low Volatility Expected
  • 2 stars = Moderate Volatility Expected
  • 3 stars = High Volatility Expected

In the end, I would like some help improving the code above to get a df like this one (based on today):

| Time  | Currency | Volatility expected     | Event                                  | Actual | Forecast | Previous |
|-------|----------|-------------------------|----------------------------------------|--------|----------|----------|
| 02:00 | CHF      | Low Volatility Expected | SECO Economic Forecasts                |        |          |          |
| 04:00 | EUR      | Low Volatility Expected | Italian PPI (MoM) (Jan)                | 9.7%   |          | 1.1%     |
| 04:00 | ZAR      | Low Volatility Expected | Current Account (Q4)                   | 120.0B | 150.0B   | 216.0B   |
| ...   | ...      | ...                     | ...                                    | ...    | ...      | ...      |
| 21:00 | CNY      | Low Volatility Expected | China Thomson Reuters IPSOS PCSI (Mar) |        |          | 72.04    |
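(For the final assembly step, a minimal sketch: assuming each row's cell texts have been collected into a list — for instance by iterating each `<tr>`'s `<td>` elements with `find_elements(By.TAG_NAME, 'td')` rather than using `row.text`, which flattens the cells into one string — the target DataFrame is just a `pd.DataFrame` call with explicit column names. The sample rows below mirror the table above:)

```python
import pandas as pd

# Column layout of the desired DataFrame
columns = ['Time', 'Currency', 'Volatility expected', 'Event',
           'Actual', 'Forecast', 'Previous']

# Stand-in for per-row cell texts scraped with Selenium; in the real
# script each inner list would come from one <tr>'s <td> elements.
rows = [
    ['02:00', 'CHF', 'Low Volatility Expected', 'SECO Economic Forecasts', '', '', ''],
    ['04:00', 'EUR', 'Low Volatility Expected', 'Italian PPI (MoM) (Jan)', '9.7%', '', '1.1%'],
]

df = pd.DataFrame(rows, columns=columns)
print(df)
```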


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
