How can we parse tab-delimited data as it's being downloaded from the web and also parse a URL that contains an apostrophe character?
I put together some scrappy code that downloads data from a few URLs. I have two problems that I am trying to overcome.
- I need to parse this tab-delimited data before it is written to a CSV file, so the final saved version is a CSV (not TSV)
- I need to download data from a link that has an apostrophe in the URL (the apostrophe is not handled correctly so the download fails)
Here is my hacked-together code:

    import requests
    from bs4 import BeautifulSoup
    import urllib

    all_links = ['/vsoch/hospital-chargemaster/tree/0.0.2/data/ochsner-clinic-foundation',
                 '/vsoch/hospital-chargemaster/tree/0.0.2/data/ohio-state-university-hospital',
                 '/vsoch/hospital-chargemaster/tree/0.0.2/data/orlando-health',
                 'vsoch/hospital-chargemaster/blob/0.0.2/data/st.-joseph\'s-hospital-(tampa)']

    for item in all_links:
        #print(item)
        item = item.replace('tree/','')
        #print(item)
        try:
            length = len(item)
            last_slash = item.rfind('/') + 1
            file_name = (length-last_slash)
            file_name = item[-file_name:]
            print(file_name)
            DOWNLOAD_URL = 'https://raw.githubusercontent.com' + item + '/data-latest.tsv'
            r = requests.get(DOWNLOAD_URL)
            soup = BeautifulSoup(r.text, "html.parser")
            DOWNLOAD_PATH = 'C:\\Users\\ryans\\Desktop\\hospital_data\\' + file_name + '.csv'
            urllib.request.urlretrieve(DOWNLOAD_URL,DOWNLOAD_PATH)
        except Exception as e: print(e)

So, how can I parse a TSV into a CSV? Also, how can I download the data from the last URL in the list of four URLs?
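One possible approach, sketched under a few assumptions: the response body is UTF-8 text, the TSV follows the `csv` module's default quoting rules, and every folder really contains a `data-latest.tsv` (not verified here). Note that the last entry in `all_links` differs from the others in two ways besides the apostrophe: it has no leading slash and it uses `blob/` rather than `tree/`, so the URL built by the loop is malformed regardless of the apostrophe. The helper names `build_raw_url`, `tsv_to_csv`, and `download_as_csv` are made up for this illustration, and the sketch uses the standard library's `urllib.request.urlopen` instead of `requests` only to avoid a third-party dependency.

```python
import csv
import io
import urllib.request
from pathlib import Path
from urllib.parse import quote

RAW_BASE = "https://raw.githubusercontent.com"


def build_raw_url(link, filename="data-latest.tsv"):
    """Turn a GitHub tree/blob path into a raw-content URL (hypothetical helper)."""
    # Normalise to exactly one leading slash and no trailing slash.
    path = "/" + link.strip("/")
    # The raw host has no "tree" or "blob" segment, so drop whichever is present.
    for marker in ("/tree/", "/blob/"):
        path = path.replace(marker, "/", 1)
    # Percent-encode everything except "/", so "'" becomes %27 and "(" becomes %28.
    return RAW_BASE + quote(path + "/" + filename, safe="/")


def tsv_to_csv(tsv_text):
    """Re-serialise tab-delimited text as comma-delimited text."""
    reader = csv.reader(io.StringIO(tsv_text, newline=""), delimiter="\t")
    out = io.StringIO(newline="")
    # The default writer is comma-delimited and quotes any field containing a comma.
    csv.writer(out).writerows(reader)
    return out.getvalue()


def download_as_csv(link, out_dir):
    """Download one TSV and save it as <folder name>.csv inside out_dir."""
    url = build_raw_url(link)
    with urllib.request.urlopen(url) as resp:
        tsv_text = resp.read().decode("utf-8")  # assumption: the file is UTF-8
    name = link.rstrip("/").rsplit("/", 1)[-1]
    out_path = Path(out_dir) / (name + ".csv")
    # newline="" keeps the line endings produced by csv.writer unchanged.
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        fh.write(tsv_to_csv(tsv_text))
    return out_path
```

With these in place, the loop body could shrink to a single `download_as_csv(item, out_dir)` call inside the existing `try`/`except`, where `out_dir` is the target folder. Going through the `csv` module rather than replacing tabs with commas matters when a field itself contains a comma. The percent-encoding may not be strictly required for an apostrophe, but it is harmless and rules it out as a cause. `BeautifulSoup` is probably unnecessary here, since the raw file is plain text rather than HTML.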
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
