How can I download all files in a web directory using Python?

I am writing a Python script to download all files in a directory.

Example indir:

https://data.chc.ucsb.edu/products/EWX/data/forecasts/CHIRPS-GEFS_precip_v12/daily_16day/2016/01/02

For a specific reason, this directory path is generated programmatically inside my loop:

import os
from datetime import timedelta

tmptime = stime
while tmptime < etime:
    tmptime = tmptime + timedelta(days=1)  # increase timestamp daily
    tmppath = os.path.join(str(tmptime.year), tmptime.strftime("%m"), tmptime.strftime("%d"))
    indirtmp = os.path.join(indir, tmppath)
    # separate name, so that outdir itself does not grow on every iteration
    outdirtmp = os.path.join(outdir, tmppath)

Now, how can I download all files at that link and move them to another directory, outdir, which I have created in my script? I am okay with using a library or offloading the work to a Linux process.

I will basically be doing this for every day in a 20-year period.



Solution 1:[1]

I suggest the following wget command, which downloads a superset of the files you need:

wget --force-html --base=https://data.chc.ucsb.edu/products/EWX/data/forecasts/CHIRPS-GEFS_precip_v12/daily_16day/2016/01/02/ -i https://data.chc.ucsb.edu/products/EWX/data/forecasts/CHIRPS-GEFS_precip_v12/daily_16day/2016/01/02/

Explanation: the -i option reads the list of URLs from a file, which here is an external one (the directory index itself). --force-html tells GNU Wget to treat that file as HTML and look for links inside it. --base=https://data.chc.ucsb.edu/products/EWX/data/forecasts/CHIRPS-GEFS_precip_v12/daily_16day/2016/01/02/ is required because the index uses relative links. Note that this downloads every file the index references, so you might need to remove non-TIFF files after the download finishes. Files are saved in the current working directory.
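
If you want to run this command from the Python loop, a minimal sketch could look like the following. It is not part of the original answer. It assumes wget is on the PATH and that the files you want end in .tif or .tiff; the helper names build_wget_command, remove_non_tiff and download_directory are hypothetical. The sketch adds wget's --directory-prefix option so that files land in the target directory instead of the current working directory.

```python
import subprocess
from pathlib import Path


def build_wget_command(url, outdir):
    """Build the wget argument list for one directory index URL."""
    if not url.endswith("/"):
        url += "/"  # relative links must resolve against the directory itself
    return [
        "wget",
        "--force-html",
        f"--base={url}",
        "-i", url,
        f"--directory-prefix={outdir}",
    ]


def remove_non_tiff(outdir, keep_suffixes=(".tif", ".tiff")):
    """Delete regular files in outdir whose suffix is not in keep_suffixes.

    Returns the sorted names of the files that were removed.
    """
    removed = []
    for path in Path(outdir).iterdir():
        if path.is_file() and path.suffix.lower() not in keep_suffixes:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)


def download_directory(url, outdir):
    """Run wget for one directory, then remove the non-TIFF leftovers."""
    Path(outdir).mkdir(parents=True, exist_ok=True)
    # check=False: wget exits non-zero if any single link fails to download
    subprocess.run(build_wget_command(url, outdir), check=False)
    return remove_non_tiff(outdir)
```

Calling download_directory once per day, with that day's URL and output directory, would cover the whole date range.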

Solution 2:[2]

Since you say you're okay with shelling out to a program, you can spare yourself the trouble of parsing that index HTML by using wget's mirror mode:

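import os
import shlex  # shlex.join requires Python 3.8+
from datetime import timedelta

tmptime = stime
while tmptime < etime:
    tmptime = tmptime + timedelta(days=1)  # increase timestamp daily
    tmppath = os.path.join(str(tmptime.year), tmptime.strftime("%m"), tmptime.strftime("%d"))
    # assumes `indir` is the base internet URL; the trailing slash keeps -np scoped to this day's directory
    indirtmp = os.path.join(indir, tmppath) + "/"
    # separate name, so that `outdir` itself does not grow on every iteration
    outdirtmp = os.path.join(outdir, tmppath)

    # -m: mirror, -np: never ascend to the parent, -nd: no local directory tree, -P: target directory
    os.system(shlex.join(["wget", "-m", "-np", "-nd", "-P", outdirtmp, indirtmp]))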
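
For comparison, if wget is not available, the index-parsing route could be sketched in pure Python using only the standard library. This is not part of the original answer. It assumes the server returns a plain HTML index whose file links are relative href attributes without query strings, and that the wanted files end in .tif or .tiff; the names LinkCollector, list_files and download_all are hypothetical.

```python
import os
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML index page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def list_files(index_html, base_url, suffixes=(".tif", ".tiff")):
    """Return absolute URLs of links in index_html ending in one of suffixes."""
    parser = LinkCollector()
    parser.feed(index_html)
    wanted = tuple(suffixes)  # str.endswith needs a tuple, not a list
    return [
        urljoin(base_url, href)
        for href in parser.hrefs
        if href.lower().endswith(wanted)
    ]


def download_all(base_url, outdir):
    """Download every matching file linked from base_url into outdir."""
    if not base_url.endswith("/"):
        base_url += "/"  # relative links must resolve against the directory itself
    os.makedirs(outdir, exist_ok=True)
    with urllib.request.urlopen(base_url) as response:
        index_html = response.read().decode("utf-8", errors="replace")
    for file_url in list_files(index_html, base_url):
        target = os.path.join(outdir, os.path.basename(file_url))
        urllib.request.urlretrieve(file_url, target)
```

Inside the loop, download_all would be called once per day with that day's URL and output directory. Filtering by suffix before downloading avoids the cleanup step, at the cost of more code than the wget call.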

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 AKX