How can I download all files in a web directory using python?
I am writing a python script to download all files in a directory.
Example indir:
https://data.chc.ucsb.edu/products/EWX/data/forecasts/CHIRPS-GEFS_precip_v12/daily_16day/2016/01/02
This directory is programmatically generated in my loop due to a specific reason.
tmptime = stime
while tmptime < etime:
    tmptime = tmptime + timedelta(days=1) # increase timestamp daily
    tmppath = os.path.join(str(tmptime.year), str(tmptime.strftime("%m")), str(tmptime.strftime("%d")))
    indirtmp = os.path.join(indir, tmppath)
    outdir = os.path.join(outdir, tmppath)
Now, how can I download all files at that link and move them to another directory, outdir, which I have created in my script? I am okay with a library or with offloading it to a Linux process.
I will basically be doing this for every day over a 20-year period.
Solution 1:[1]
I suggest the following wget command, which downloads a superset of the files you need:
wget --force-html --base=https://data.chc.ucsb.edu/products/EWX/data/forecasts/CHIRPS-GEFS_precip_v12/daily_16day/2016/01/02/ -i https://data.chc.ucsb.edu/products/EWX/data/forecasts/CHIRPS-GEFS_precip_v12/daily_16day/2016/01/02/
Explanation: I used the -i option with an external file (the directory index URL itself). --force-html makes GNU Wget treat that input as HTML and look for links inside it, and --base=https://data.chc.ucsb.edu/products/EWX/data/forecasts/CHIRPS-GEFS_precip_v12/daily_16day/2016/01/02/ is required because the page uses relative links. Note that this downloads every file the page references, so you might need to remove non-TIFF files after the download finishes. Files are saved in the current working directory.
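If you would rather filter before downloading instead of cleaning up afterwards, the same idea (read the index page, resolve its relative links against a base URL) can be sketched with the Python standard library. This is only a sketch under assumptions: the names `LinkCollector`, `list_files` and `download_all` are hypothetical, and the `.tif`/`.tiff` suffixes are a guess based on the mention of TIFF files above.

```python
import os
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def list_files(index_html, base_url, suffixes=(".tif", ".tiff")):
    """Return absolute URLs of the links whose name ends in one of `suffixes`.

    The suffixes are an assumption; adjust them to the files you need.
    """
    parser = LinkCollector()
    parser.feed(index_html)
    wanted = tuple(suffixes)
    # urljoin resolves relative links against base_url, like wget's --base.
    return [urljoin(base_url, h) for h in parser.hrefs if h.lower().endswith(wanted)]


def download_all(base_url, outdir):
    """Fetch the index page at base_url and save every matching file into outdir."""
    os.makedirs(outdir, exist_ok=True)
    with urllib.request.urlopen(base_url) as response:
        index_html = response.read().decode("utf-8", errors="replace")
    for url in list_files(index_html, base_url):
        target = os.path.join(outdir, url.rsplit("/", 1)[-1])
        urllib.request.urlretrieve(url, target)
```

Calling `download_all(indirtmp, outdir)` with a URL that ends in `/` would then save only the matching files, so no cleanup step is needed.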
Solution 2:[2]
Since you say you're okay with shelling out to a program, you can spare yourself the trouble of parsing that index HTML by using wget's mirror mode:
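import os
import shlex
from datetime import timedelta

tmptime = stime
while tmptime < etime:
    tmptime = tmptime + timedelta(days=1)  # increase timestamp daily
    tmppath = os.path.join(str(tmptime.year), tmptime.strftime("%m"), tmptime.strftime("%d"))
    # assumes `indir` is the internet URL; the trailing slash makes wget treat it as a directory
    indirtmp = os.path.join(indir, tmppath) + "/"
    # use a separate name so `outdir` does not grow on every iteration
    outdirtmp = os.path.join(outdir, tmppath)
    # -m: mirror, -np: never ascend to the parent, -nd: no directory tree, -P: target directory
    os.system(shlex.join(["wget", "-m", "-np", "-nd", "-P", outdirtmp, indirtmp]))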
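One way to make the loop easier to check is to separate building the per-day URL and output directory from the download call. The sketch below is one possible arrangement, not part of the original answer: `daily_targets` and `mirror_day` are hypothetical names, the URL is joined with `/` rather than `os.path.join` so it also works on Windows, and, unlike the loop above, iteration starts at `start` itself rather than one day later.

```python
import os
import subprocess
from datetime import date, timedelta


def daily_targets(base_url, out_root, start, end):
    """Yield (url, local_dir) for each day from start up to, but not including, end."""
    day = start
    while day < end:
        parts = (f"{day.year:04d}", f"{day.month:02d}", f"{day.day:02d}")
        # URLs always use "/", and the trailing slash marks a directory for wget -np.
        url = base_url.rstrip("/") + "/" + "/".join(parts) + "/"
        yield url, os.path.join(out_root, *parts)
        day += timedelta(days=1)


def mirror_day(url, local_dir):
    """Run wget in mirror mode for one day's directory (assumes wget is on PATH)."""
    os.makedirs(local_dir, exist_ok=True)
    # Passing a list avoids the shell entirely; check=True raises if wget fails.
    subprocess.run(["wget", "-m", "-np", "-nd", "-P", local_dir, url], check=True)
```

A driver would then be `for url, local_dir in daily_targets(indir, outdir, stime, etime): mirror_day(url, local_dir)`, with `stime` and `etime` as `date` or `datetime` objects.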
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | AKX |
