Not all threads finish their work (downloaded files are corrupted or incomplete)
I want to download files from the web using Python's threading module. All URLs are stored in a text file, urls.txt. My Python script runs, but not all files are downloaded entirely; most of them are "corrupt".
It looks as if not all threads finish their work.
How can I fix this problem?
Here is the Python script:
import threading
import requests
from os import path

DIRECTORY = "DATA"
urls = None

def download(link, idx):
    file_name = "my_file-" + str(idx) + ".pdf"
    file_name = path.join(DIRECTORY, file_name)
    print("\n URL -> ", link)
    print("\n FN -> ", file_name)
    try:
        r = requests.get(link, stream=True)
        with open(file_name, 'wb') as f:
            for chunk in r.iter_content(1024):
                if chunk:
                    f.write(chunk)
    except Exception as e:
        print(e.with_traceback())

def create_new_download_thread(link, idx):
    download_thread = threading.Thread(target=download, args=(link, idx,))
    return download_thread

with open('urls.txt', 'r') as f:
    urls = f.readlines()
print(len(urls))

threads = list()
for i, url in enumerate(urls):
    print("\n i= ", i)
    t = create_new_download_thread(url, i)
    threads.append(t)
for t in threads:
    t.start()
for t in threads:
    t.join()
I am also adding my urls.txt file so you can try it quickly:
urls.txt
http://www.open3d.org/paper.pdf
http://www.bodden.de/pubs/kps+21qualitative.pdf
http://proceedings.esri.com/library/userconf/proc17/tech-workshops/tw_447-391.pdf
http://www.ijitee.org/wp-content/uploads/papers/v8i12S/L114510812S19.pdf
http://www.unifr.ch/appecon/en/assets/public/Lehrstuhl/Winter%20School%202022/Unboxing%20the%20Machine%20Learning%20Algorithms%20in%20Python_Christian%20Kauth.pdf
http://englishonlineclub.com/pdf/Data%20Structures%20and%20Algorithms%20in%20Python%20[EnglishOnlineClub.com].pdf
http://link.springer.com/content/pdf/bfm%3A978-1-4842-0055-1%2F1.pdf
http://www.astro.sunysb.edu/steinkirch/reviews/algorithms_in_python.pdf
http://nptel.ac.in/content/syllabus_pdf/106106145.pdf
http://www.cs.auckland.ac.nz/compsci105s1c/resources/ProblemSolvingwithAlgorithmsandDataStructures.pdf
http://www.leesbrook.co.uk/wp-content/uploads/sites/19/2021/08/Computer-Science-KO-Y10-Algorithms-and-Python.pdf
http://www.leesbrook.co.uk/wp-content/uploads/sites/19/2021/12/Computer-Science-3.pdf
http://jict.ilmauniversity.edu.pk/journal/jict/14.1/4.pdf
http://d1rkab7tlqy5f1.cloudfront.net/TUDelft/Onderwijs/Opleidingen/Master/MSc_Life_Science_and_Technology/Application%20and%20admission/Indicative%20entry%20level%20Algorithms%20and%20Programming.pdf
http://www.theoj.org/joss-papers/joss.03917/10.21105.joss.03917.pdf
http://cbseacademic.nic.in/web_material/Curriculum21/publication/secondary/Python_Content_Manual.pdf
http://www.ijert.org/research/python-libraries-development-frameworks-and-algorithms-for-machine-learning-applications-IJERTV7IS040173.pdf
http://raw.githubusercontent.com/mdipierro/nlib/master/docs/book_numerical.pdf
http://hal.inria.fr/hal-03100076/file/jmlr20.pdf
http://www.jmlr.org/papers/volume13/fortin12a/fortin12a.pdf
http://www-s3-live.kent.edu/s3fs-root/s3fs-public/file/CS%2061002%20%20ALGORITHMS%20AND%20PROGRAMMING%20I.pdf
http://rmd.ac.in/onlinecourses/2018/oct7ecwr.pdf
http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-0001-introduction-to-computer-science-and-programming-in-python-fall-2016/lecture-slides-code/MIT6_0001F16_Lec12.pdf
http://cs-cmuq.github.io/110-www/lectures/04-algo-to-python.pdf
http://www.purdue.edu/hla/sites/varalalab/wp-content/uploads/sites/20/2018/04/Lecture_13.pdf
Solution 1:
Assuming urls.txt contains exactly what you show in your question, try this code:
import requests
from concurrent.futures import ThreadPoolExecutor
from http import HTTPStatus
from requests.exceptions import HTTPError

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.2 Safari/605.1.15'}

def change_scheme(url):
    return url.replace('http:', 'https:') if url.startswith('http:') else url.replace('https:', 'http:')

def do_get(url):
    for _ in range(2):
        try:
            (r := requests.get(url.strip(), headers=headers, stream=True)).raise_for_status()
            # process content
            break
        except HTTPError as e:
            if r.status_code != HTTPStatus.FORBIDDEN:
                raise e
            url = change_scheme(url)

with open('urls.txt', encoding='utf-8') as urls:
    with ThreadPoolExecutor() as executor:
        for _ in executor.map(do_get, urls):
            pass
This code makes no attempt to save any content; it simply demonstrates the importance of checking the HTTP response code before writing anything to disk.
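As a rough sketch (not part of the original answer), the same status-check idea can be combined with the question's file naming to actually write the PDFs. The `target_path` helper and the `timeout` and `max_workers` values here are illustrative assumptions; note that the URL must be stripped, because `readlines()` keeps the trailing newline on every line:

```python
import requests
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

DIRECTORY = Path("DATA")
DIRECTORY.mkdir(exist_ok=True)

def target_path(idx):
    # Hypothetical helper mirroring the question's "my_file-<idx>.pdf" naming.
    return DIRECTORY / f"my_file-{idx}.pdf"

def download(args):
    idx, url = args
    url = url.strip()  # readlines() leaves a trailing '\n' on every URL
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # fail loudly on 4xx/5xx instead of saving an error page
        target_path(idx).write_bytes(r.content)
        return url, len(r.content)
    except requests.RequestException as e:
        return url, e  # report the failure instead of leaving a corrupt file behind

if Path('urls.txt').exists():
    with open('urls.txt', encoding='utf-8') as f:
        urls = [u for u in f if u.strip()]
    with ThreadPoolExecutor(max_workers=8) as executor:
        # map() preserves input order, so the printed report lines up with urls.txt
        for url, result in executor.map(download, enumerate(urls)):
            print(url, '->', result)
```

A failed request here produces a report line instead of a silently truncated file, which is what made the original script's output look "corrupt".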
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
