Python Requests-futures slow
I'm currently using requests-futures for faster web scraping. The problem is that it's still very slow: about one request every two seconds. Here's what the ThreadPoolExecutor code looks like:
import logging
import multiprocessing
import random
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests
from requests_futures.sessions import FuturesSession

# proxy_list, url_list, ua and main_func are defined elsewhere in the script.
with FuturesSession(executor=ThreadPoolExecutor(max_workers=8)) as session:
    futures = {session.get(url, proxies={
        'http': str(random.choice(proxy_list).replace("https:/", "http:/")),
        'https': str(random.choice(proxy_list).replace("https:/", "http:/")),
    }, headers={
        'User-Agent': str(ua.chrome),
        'Accept': '*/*',
        'Accept-Encoding': 'gzip, deflate, br',
        'Connection': 'keep-alive',
        'Content-Type': 'text/plain;charset=UTF-8',
    }): url for url in url_list}
    # ---
    for future in as_completed(futures):
        del futures[future]
        try:
            resp = future.result()
        except Exception:
            print("Error getting result from thread. Ignoring")
            continue  # no response to process for this URL
        try:
            # A Process does nothing until start() is called.
            multiprocessing.Process(target=main_func, args=(resp,)).start()
            del resp
            del future
        except requests.exceptions.JSONDecodeError:
            logging.warning(
                "[requests.custom.debug]: requests.exceptions.JSONDecodeError: [Error] print(resp.json())")
I believe it's slow because of the as_completed for loop, since that loop isn't concurrent. main_func, which I pass the response to, is the function that parses the site's data with bs4. If the as_completed loop were concurrent, it would be faster than this.

I'd really like the scraper to be faster, and I'd prefer to keep using requests-futures, but if something is significantly faster I'm happy to switch. If anyone knows of something much faster than requests-futures, please share it.
Is anyone able to help with this? Thank you
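One way to check whether the as_completed loop itself is the bottleneck is to time the same pattern with a stand-in fetch. This is only a minimal sketch using the standard library: fake_fetch, run, and the example.com URLs are hypothetical placeholders that are not part of the script above, and time.sleep stands in for network latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_fetch(url, delay=0.2):
    """Hypothetical stand-in for session.get(): sleeps instead of doing network I/O."""
    time.sleep(delay)
    return f"body of {url}"


def run(urls, max_workers=8):
    """Submit every fetch up front, then consume results as they finish."""
    start = time.perf_counter()
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # All fetches are queued here; up to max_workers run at the same time.
        futures = {pool.submit(fake_fetch, url): url for url in urls}
        # as_completed yields each future as soon as it finishes, so the loop
        # never waits on one specific request while others are already done.
        for future in as_completed(futures):
            results.append((futures[future], future.result()))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(8)]
    results, elapsed = run(urls)
    print(f"{len(results)} responses in {elapsed:.2f}s")
```

If eight 0.2-second fetches finish in roughly the time of one rather than the sum of all eight, that would suggest the loop is not serializing the downloads. In that case the time is more likely going to slow proxies, the target server, or the work done inside the loop body.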
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
