Not all threads finish their work (downloaded files are corrupted or not downloaded entirely)

I want to download files from the web using Python's threading module. All URLs are stored in a text file, urls.txt. My Python script runs, but not all files are downloaded entirely; most are not and end up "corrupt".

It looks like not all threads finish their work.

How can I fix this problem?

Here is the Python script:


import threading
import requests
from os import path


DIRECTORY = "DATA"
urls = None


def download(link, idx):
    file_name = "my_file-" + str(idx) + ".pdf"
    file_name = path.join(DIRECTORY, file_name)
    print("\n URL -> ", link)
    print("\n FN -> ", file_name)
    try:
        r = requests.get(link, stream=True)
        with open(file_name, 'wb') as f:
            for chunk in r.iter_content(1024):
                if chunk:
                    f.write(chunk)
    except Exception as e:
        print(e)


def create_new_download_thread(link, idx):
    download_thread = threading.Thread(target=download, args=(link, idx,))
    return download_thread


with open('urls.txt', 'r') as f:
    urls = f.readlines()

print(len(urls))
threads = list()
for i, url in enumerate(urls):
    print("\n i= ", i)
    t = create_new_download_thread(url, i)
    threads.append(t)

for t in threads:
    t.start()

for t in threads:
    t.join()


I am also including my txt file, to make it quicker to try this out:

urls.txt

http://www.open3d.org/paper.pdf
http://www.bodden.de/pubs/kps+21qualitative.pdf
http://proceedings.esri.com/library/userconf/proc17/tech-workshops/tw_447-391.pdf
http://www.ijitee.org/wp-content/uploads/papers/v8i12S/L114510812S19.pdf
http://www.unifr.ch/appecon/en/assets/public/Lehrstuhl/Winter%20School%202022/Unboxing%20the%20Machine%20Learning%20Algorithms%20in%20Python_Christian%20Kauth.pdf
http://englishonlineclub.com/pdf/Data%20Structures%20and%20Algorithms%20in%20Python%20[EnglishOnlineClub.com].pdf
http://link.springer.com/content/pdf/bfm%3A978-1-4842-0055-1%2F1.pdf
http://www.astro.sunysb.edu/steinkirch/reviews/algorithms_in_python.pdf
http://nptel.ac.in/content/syllabus_pdf/106106145.pdf
http://www.cs.auckland.ac.nz/compsci105s1c/resources/ProblemSolvingwithAlgorithmsandDataStructures.pdf
http://www.leesbrook.co.uk/wp-content/uploads/sites/19/2021/08/Computer-Science-KO-Y10-Algorithms-and-Python.pdf
http://www.leesbrook.co.uk/wp-content/uploads/sites/19/2021/12/Computer-Science-3.pdf
http://jict.ilmauniversity.edu.pk/journal/jict/14.1/4.pdf
http://d1rkab7tlqy5f1.cloudfront.net/TUDelft/Onderwijs/Opleidingen/Master/MSc_Life_Science_and_Technology/Application%20and%20admission/Indicative%20entry%20level%20Algorithms%20and%20Programming.pdf
http://www.theoj.org/joss-papers/joss.03917/10.21105.joss.03917.pdf
http://cbseacademic.nic.in/web_material/Curriculum21/publication/secondary/Python_Content_Manual.pdf
http://www.ijert.org/research/python-libraries-development-frameworks-and-algorithms-for-machine-learning-applications-IJERTV7IS040173.pdf
http://raw.githubusercontent.com/mdipierro/nlib/master/docs/book_numerical.pdf
http://hal.inria.fr/hal-03100076/file/jmlr20.pdf
http://www.jmlr.org/papers/volume13/fortin12a/fortin12a.pdf
http://www-s3-live.kent.edu/s3fs-root/s3fs-public/file/CS%2061002%20%20ALGORITHMS%20AND%20PROGRAMMING%20I.pdf
http://rmd.ac.in/onlinecourses/2018/oct7ecwr.pdf
http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-0001-introduction-to-computer-science-and-programming-in-python-fall-2016/lecture-slides-code/MIT6_0001F16_Lec12.pdf
http://cs-cmuq.github.io/110-www/lectures/04-algo-to-python.pdf
http://www.purdue.edu/hla/sites/varalalab/wp-content/uploads/sites/20/2018/04/Lecture_13.pdf


Solution 1:

urls.txt contains exactly what you show in your question. The code:

import requests
from concurrent.futures import ThreadPoolExecutor
from http import HTTPStatus
from requests.exceptions import HTTPError

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.2 Safari/605.1.15'}

def change_scheme(url):
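    # Flip the URL scheme (http <-> https); used below to retry a request
    # that came back 403 Forbidden under the original scheme.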
    return url.replace('http:', 'https:') if url.startswith('http:') else url.replace('https:', 'http:')

def do_get(url):
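    # Try the URL as given; if the server answers 403 Forbidden, swap the
    # scheme once and retry. Any other HTTP error is re-raised so the
    # failure is not silently ignored.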
    for _ in range(2):
        try:
            (r := requests.get(url.strip(), headers=headers, stream=True)).raise_for_status()
            # process content
            break
        except HTTPError as e:
            if r.status_code != HTTPStatus.FORBIDDEN:
                raise e
            url = change_scheme(url)


with open('urls.txt', encoding='utf-8') as urls:
    with ThreadPoolExecutor() as executor:
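        # Iterating over map()'s results re-raises, in the main thread, any
        # exception raised inside do_get.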
        for _ in executor.map(do_get, urls):
            pass

There is no attempt in this code to save any content; it is only meant to demonstrate the importance of checking the HTTP response code. If a server rejects a request (for example with 403 Forbidden) and the response body is written straight to a .pdf file, you get exactly the kind of "corrupt" file described in the question.
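
For completeness, one way to fold the file-writing logic from the question into this pattern might look like the sketch below. It is only an illustration under the same assumptions as above: the my_file-<idx>.pdf naming and the DATA directory are carried over from the question, and a URL that still fails with 403 after the scheme swap is skipped with a message instead of being written to disk.

import os
import requests
from concurrent.futures import ThreadPoolExecutor
from http import HTTPStatus
from requests.exceptions import HTTPError

DIRECTORY = "DATA"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.2 Safari/605.1.15'}

def change_scheme(url):
    # Swap http <-> https so a request rejected under one scheme can be retried.
    return url.replace('http:', 'https:') if url.startswith('http:') else url.replace('https:', 'http:')

def download(idx_and_url):
    idx, url = idx_and_url
    url = url.strip()
    file_name = os.path.join(DIRECTORY, "my_file-" + str(idx) + ".pdf")
    for _ in range(2):
        try:
            r = requests.get(url, headers=headers, stream=True)
            r.raise_for_status()          # do not save the body of a 4xx/5xx response
            with open(file_name, 'wb') as f:
                for chunk in r.iter_content(1024):
                    f.write(chunk)
            return file_name
        except HTTPError:
            if r.status_code != HTTPStatus.FORBIDDEN:
                raise
            url = change_scheme(url)      # retry once with the other scheme
    print("giving up on", url)

os.makedirs(DIRECTORY, exist_ok=True)

with open('urls.txt', encoding='utf-8') as f:
    urls = f.readlines()

with ThreadPoolExecutor() as executor:
    for _ in executor.map(download, enumerate(urls)):
        pass

As in the original answer, consuming executor.map()'s results ensures every download has finished (and re-raises any unexpected error) before the script exits.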

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
