BeautifulSoup parsing very slow for scraped Amazon pages (using both html.parser and lxml)
I have been scraping Amazon.com using the BeautifulSoup library in Python. The scraping itself works fine with proxy rotation, but parsing the received response is very slow: the single statement BeautifulSoup(content, 'html.parser') usually takes 14 to 50 seconds.
In some answers on Stack Overflow, I found that this can be mitigated by using lxml instead of html.parser and by importing the cchardet library. I have tried that, but it doesn't seem to help: parsing still takes around 15-30 seconds for some responses.
Is there any other workaround to make this faster?
What I have done so far: I am using multithreading to scrape multiple pages with proxy rotation, so fetching the page with the requests library is not the problem; I am getting the response fine. E.g.:
import requests
from bs4 import BeautifulSoup

r = requests.get("--- the url of the web page ---")
content = r.content
my_soup = BeautifulSoup(content, 'html.parser')

# Also tried the variant below, as this is one of the solutions suggested
# all over the internet to make it faster:
import cchardet

r = requests.get("--- the url of the web page ---")
content = r.content
my_soup = BeautifulSoup(content, 'lxml')
But when I run the above code and time the line my_soup = BeautifulSoup(content, 'lxml'), this line alone takes around 15-30 seconds, sometimes even 50.
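For reference, a minimal, self-contained way to time the parse step in isolation (the HTML below is a synthetic stand-in for the real Amazon response body, which is not reproducible here; the real pages are large, so a reasonably big document is built):

```python
import time

from bs4 import BeautifulSoup, FeatureNotFound

# Synthetic stand-in for the scraped page content (assumption: the real
# Amazon pages are large HTML documents).
html = "<html><body>" + "<p class='row'>item</p>" * 20000 + "</body></html>"

for parser in ("html.parser", "lxml"):
    try:
        start = time.perf_counter()
        soup = BeautifulSoup(html, parser)
        elapsed = time.perf_counter() - start
    except FeatureNotFound:
        # lxml is an optional dependency and may not be installed.
        print(f"{parser}: not installed, skipped")
        continue
    print(f"{parser}: {elapsed:.3f}s, parsed {len(soup.find_all('p'))} <p> tags")
```

Timing the BeautifulSoup constructor like this, separately from the network request, confirms which part is actually slow.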
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow