I'm trying to figure out how to use Beautiful Soup to parse multiple sub-URLs from one main URL.
Here is the main URL: https://github.com/vsoch/hospital-chargemaster/tree/0.0.2/data
I can collect strings with this structure into a list: /vsoch/hospital-chargemaster/0.0.2/data/baptist-health-system-(san-antonio)
The full file path looks something like this: https://raw.githubusercontent.com/vsoch/hospital-chargemaster/0.0.2/data/baptist-health-system-(san-antonio)/data-latest.tsv
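To make the mapping between the two concrete, here is a minimal sketch of the transformation, assuming the scraped href still contains the tree/ segment:

# A scraped href from the main URL looks like this:
href = "/vsoch/hospital-chargemaster/tree/0.0.2/data/baptist-health-system-(san-antonio)"
# Dropping 'tree/' and prefixing the raw-content host yields the direct file URL.
raw_url = "https://raw.githubusercontent.com" + href.replace("tree/", "") + "/data-latest.tsv"
# -> https://raw.githubusercontent.com/vsoch/hospital-chargemaster/0.0.2/data/baptist-health-system-(san-antonio)/data-latest.tsv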
My question is: how can I download a bunch of TSV files to my desktop in one go? I know some TSV files are pretty hard to parse, and I don't want to invest a lot of time getting at things that are hard to reach. I just want code that downloads some/most of the TSV files to a folder on my desktop.
# main URL
# https://github.com/vsoch/hospital-chargemaster/tree/0.0.2/data
import requests
from bs4 import BeautifulSoup

all_links = []
url = "https://github.com/vsoch/hospital-chargemaster/tree/0.0.2/data"

# Getting the webpage, creating a Response object.
response = requests.get(url)

# Extracting the source code of the page.
data = response.text

# Passing the source code to BeautifulSoup to create a BeautifulSoup object for it.
soup = BeautifulSoup(data, 'lxml')

# Extracting all the <a> tags into a list.
tags = soup.find_all('a')

# Extracting URLs from the href attribute of the <a> tags,
# skipping tags that carry no href.
for tag in tags:
    href = tag.get('href')
    if href is not None:
        all_links.append(href)

for item in all_links:
    # Only the hospital folders under data/ are of interest.
    if '/tree/0.0.2/data/' not in item:
        continue
    item = item.replace('tree/', '')
    print(item)
    try:
        DOWNLOAD_URL = 'https://raw.githubusercontent.com' + item + '/data-latest.tsv'
        print(DOWNLOAD_URL)
        r = requests.get(DOWNLOAD_URL)
        print(r)
        # The response body is the TSV itself, so there is nothing to
        # parse with BeautifulSoup here. The hospital name is the
        # second-to-last path segment of the download URL.
        hospital = DOWNLOAD_URL.split('/')[-2]
        print(hospital)
    except Exception as e:
        print(e)
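For the actual "download to a folder on my desktop" step, here is a minimal sketch of what could follow the scraping above, reusing the all_links list it builds. The output folder name chargemaster-data and the desktop path via os.path.expanduser are my assumptions, not something given in the original post:

import os
import requests

# Hypothetical output folder on the desktop (assumption); adjust as needed.
out_dir = os.path.join(os.path.expanduser("~"), "Desktop", "chargemaster-data")
os.makedirs(out_dir, exist_ok=True)

for item in all_links:
    # Keep only the hospital folders under data/.
    if '/tree/0.0.2/data/' not in item:
        continue
    item = item.replace('tree/', '')
    download_url = 'https://raw.githubusercontent.com' + item + '/data-latest.tsv'
    r = requests.get(download_url)
    # Skip hospitals whose folder has no data-latest.tsv (e.g. a 404),
    # rather than spending time on files that are hard to reach.
    if r.status_code != 200:
        continue
    hospital = item.rstrip('/').split('/')[-1]
    out_path = os.path.join(out_dir, hospital + '.tsv')
    with open(out_path, 'wb') as f:
        f.write(r.content)
    print('saved', out_path)

Writing r.content in binary mode avoids any newline or encoding mangling, and the status-code check quietly drops the hospitals whose folders don't contain a data-latest.tsv, which matches the "some/most files" goal.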
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
