I'm trying to figure out how to use Beautiful Soup to parse multiple sub-URLs from one main URL.
Here is the main URL: https://github.com/vsoch/hospital-chargemaster/tree/0.0.2/data
I can collect strings with this structure into a list: /vsoch/hospital-chargemaster/0.0.2/data/baptist-health-system-(san-antonio)
The full file path looks something like this: https://raw.githubusercontent.com/vsoch/hospital-chargemaster/0.0.2/data/baptist-health-system-(san-antonio)/data-latest.tsv
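To make the mapping between the two concrete, here is a minimal sketch of the transformation, assuming the scraped href still contains the tree/ segment:

# A scraped href from the main URL looks like this:
href = "/vsoch/hospital-chargemaster/tree/0.0.2/data/baptist-health-system-(san-antonio)"
# Dropping 'tree/' and prefixing the raw-content host yields the direct file URL.
raw_url = "https://raw.githubusercontent.com" + href.replace("tree/", "") + "/data-latest.tsv"
# -> https://raw.githubusercontent.com/vsoch/hospital-chargemaster/0.0.2/data/baptist-health-system-(san-antonio)/data-latest.tsv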
My question is: how can I download a bunch of TSV files to my desktop in one go? I know some TSV files are pretty hard to parse, and I don't want to invest a lot of time getting at things that are hard to reach. I just want code that downloads some/most of the TSV files to a folder on my desktop.
# main URL
# https://github.com/vsoch/hospital-chargemaster/tree/0.0.2/data
import requests
from bs4 import BeautifulSoup

all_links = []
url = "https://github.com/vsoch/hospital-chargemaster/tree/0.0.2/data"

# Getting the webpage, creating a Response object.
response = requests.get(url)

# Extracting the source code of the page.
data = response.text

# Passing the source code to BeautifulSoup to create a BeautifulSoup object for it.
soup = BeautifulSoup(data, 'lxml')

# Extracting all the <a> tags into a list.
tags = soup.find_all('a')

# Extracting URLs from the href attribute of the <a> tags,
# skipping tags that carry no href.
for tag in tags:
    href = tag.get('href')
    if href is not None:
        all_links.append(href)

for item in all_links:
    # Only the hospital folders under data/ are of interest.
    if '/tree/0.0.2/data/' not in item:
        continue
    item = item.replace('tree/', '')
    print(item)
    try:
        DOWNLOAD_URL = 'https://raw.githubusercontent.com' + item + '/data-latest.tsv'
        print(DOWNLOAD_URL)
        r = requests.get(DOWNLOAD_URL)
        print(r)
        # The response body is the TSV itself, so there is nothing to
        # parse with BeautifulSoup here. The hospital name is the
        # second-to-last path segment of the download URL.
        hospital = DOWNLOAD_URL.split('/')[-2]
        print(hospital)
    except Exception as e:
        print(e)
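For the actual "download to a folder on my desktop" step, here is a minimal sketch of what could follow the scraping above, reusing the all_links list it builds. The output folder name chargemaster-data and the desktop path via os.path.expanduser are my assumptions, not something given in the original post:

import os
import requests

# Hypothetical output folder on the desktop (assumption); adjust as needed.
out_dir = os.path.join(os.path.expanduser("~"), "Desktop", "chargemaster-data")
os.makedirs(out_dir, exist_ok=True)

for item in all_links:
    # Keep only the hospital folders under data/.
    if '/tree/0.0.2/data/' not in item:
        continue
    item = item.replace('tree/', '')
    download_url = 'https://raw.githubusercontent.com' + item + '/data-latest.tsv'
    r = requests.get(download_url)
    # Skip hospitals whose folder has no data-latest.tsv (e.g. a 404),
    # rather than spending time on files that are hard to reach.
    if r.status_code != 200:
        continue
    hospital = item.rstrip('/').split('/')[-1]
    out_path = os.path.join(out_dir, hospital + '.tsv')
    with open(out_path, 'wb') as f:
        f.write(r.content)
    print('saved', out_path)

Writing r.content in binary mode avoids any newline or encoding mangling, and the status-code check quietly drops the hospitals whose folders don't contain a data-latest.tsv, which matches the "some/most files" goal.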
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
