Python - scraping files that are in JavaScript objects

I'm trying to download files from a VA dataset website with a Python scraper, but I can't figure out how to parse the JavaScript in the page's HTML, which appears to contain the file links. This is the source for the page (view-source:https://www.data.va.gov/dataset/Air-Force-Veterans-2017-Living-Only/9u8y-zaby). I want to download the ".xlsx" files, which (judging by a Command+F search on my Mac) seem to sit inside JavaScript objects. I've searched this site and others but haven't found how to scrape links from within JavaScript. How should I go about this? Any help would be greatly appreciated.



Solution 1:[1]

That website is dynamically generated, so you can use Selenium to render the page and download the desired files.

Here is working code using wget, selenium, and webdriver_manager.

It locates the download link and saves the xlsx file in a user-defined directory.

import os
import time

import wget
from selenium import webdriver
from selenium.webdriver import ChromeOptions
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

options = ChromeOptions()
# optional flags to try out, e.g. for headless or containerized runs
# options.binary_location = '/opt/headless-chromium'
# options.add_argument("--headless")
# options.add_argument("--disable-gpu")
# options.add_argument("--no-sandbox")
# options.add_argument('--disable-dev-shm-usage')
# options.add_argument('--disable-gpu-sandbox')
# options.add_argument("--single-process")
# present a regular desktop browser and hide the automation switches
options.add_argument(
    "user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36")
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_experimental_option("useAutomationExtension", False)
options.add_experimental_option("excludeSwitches", ["enable-automation"])

# webdriver_manager downloads a matching chromedriver binary
s = Service(ChromeDriverManager().install())
# s = Service(GeckoDriverManager().install())
driver = webdriver.Chrome(service=s, options=options)

driver.get('https://www.data.va.gov/dataset/Air-Force-Veterans-2017-Living-Only/9u8y-zaby')
time.sleep(3)  # give the page's JavaScript time to render the link

# get the link
download_link = driver.find_element(By.XPATH, '//*[@id="app"]/div/div[2]/section/div/div/div[2]/a').get_attribute(
    'href')

# download the file
output_directory = 'data'  # it will download the file to data directory
# wget treats `out` as a file name unless the directory already exists
os.makedirs(output_directory, exist_ok=True)
filename = wget.download(download_link, out=output_directory)

# quit() closes every window and ends the driver session
driver.quit()
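
If the file links really do appear verbatim in the page source, as the question suspects, a browser may not be needed at all. The following is only a minimal sketch under that assumption: it supposes the links are absolute http(s) URLs ending in .xlsx inside the script text, possibly with JSON-escaped slashes. The helper name extract_xlsx_links, the regular expression, and the example.org sample are hypothetical and not taken from the VA page, whose actual markup was not checked.

```python
import re

# Absolute http(s) URLs ending in .xlsx, optionally followed by a query string.
# (Assumed shape of the links; adjust if the page uses relative paths.)
XLSX_URL = re.compile(r"""https?://[^\s"'<>\\]+?\.xlsx(?:\?[^\s"'<>\\]*)?""", re.IGNORECASE)


def extract_xlsx_links(page_source):
    """Return the unique .xlsx URLs found in the page source, in order of appearance."""
    # JSON embedded in <script> tags often escapes "/" as "\/"; undo that first.
    text = page_source.replace("\\/", "/")
    links = []
    for url in XLSX_URL.findall(text):
        if url not in links:
            links.append(url)
    return links


# Hypothetical sample standing in for the real page source.
sample = (
    '<script>var initialState = {"attachments":['
    '{"href":"https:\\/\\/example.org\\/files\\/report.xlsx"},'
    '{"href":"https:\\/\\/example.org\\/files\\/report.xlsx"}]};</script>'
)
print(extract_xlsx_links(sample))
```

The page source passed in could come from requests.get(url).text or from driver.page_source in the Selenium session above, and each returned URL can then be handed to wget.download as before. If the links are only built at runtime by JavaScript, this search finds nothing and the Selenium approach remains the fallback.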

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 ahmedshahriar