BeautifulSoup parsing very slow for scraped Amazon pages (with both html.parser and lxml)

I have been scraping Amazon.com with the BeautifulSoup library in Python. The scraping itself works, using proxy rotation, but parsing the received response is very slow: the statement BeautifulSoup(content, 'html.parser') alone usually takes 14 to 50 seconds.
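For reference, this is roughly how I am timing the parse (a simplified sketch; the content here is a tiny placeholder, not the real Amazon page):

```python
import time
from bs4 import BeautifulSoup

# Placeholder for r.content from a real scraped page.
content = b"<html><body><p>placeholder for the scraped Amazon page</p></body></html>"

start = time.perf_counter()
soup = BeautifulSoup(content, "html.parser")
elapsed = time.perf_counter() - start
print(f"parse took {elapsed:.2f}s")
```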

In some answers on Stack Overflow, I found that this can be mitigated by using lxml instead of html.parser and importing the cchardet library. I have tried that, but it doesn't seem to help; some responses still take around 15-30 seconds to parse.

Is there any other workaround to make this faster?

What I have done so far: I am using multithreading to scrape multiple pages with proxy rotation, so fetching the pages with the requests library is not the problem; I get the responses fine. For example:

import requests
from bs4 import BeautifulSoup

r = requests.get("--- the url of the web page ---")
content = r.content
my_soup = BeautifulSoup(content, 'html.parser')

# Also tried the following, as it is the suggestion found all over
# the internet for making parsing faster:

import cchardet  # BeautifulSoup uses cchardet automatically for faster encoding detection
r = requests.get("--- the url of the web page ---")
content = r.content
my_soup = BeautifulSoup(content, 'lxml')

But when I run the code above and time the line my_soup = BeautifulSoup(content, 'lxml'), that line alone takes around 15-30 seconds, sometimes even 50.
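One thing I am considering trying next (I have not yet tested it on the real Amazon markup, so this is a sketch with a synthetic page): BeautifulSoup's SoupStrainer lets the parser build a tree for only the tags you need, which can cut parse time on large pages since most of the document is never turned into tree nodes.

```python
import time
from bs4 import BeautifulSoup, SoupStrainer

# Synthetic stand-in for a large scraped page (the real Amazon
# markup is obviously much bigger and messier).
html = ("<html><body>"
        + "<p>filler</p>" * 20000
        + "<span class='price'>$10</span>"
        + "</body></html>")

# Full parse: builds tree nodes for every tag.
start = time.perf_counter()
full = BeautifulSoup(html, "html.parser")
full_time = time.perf_counter() - start

# Restricted parse: only 'span' tags are kept in the tree.
only_spans = SoupStrainer("span")
start = time.perf_counter()
partial = BeautifulSoup(html, "html.parser", parse_only=only_spans)
partial_time = time.perf_counter() - start

print(partial.find("span").get_text())
print(f"full={full_time:.3f}s partial={partial_time:.3f}s")
```

Note that the parser still has to scan the whole document either way, so this only helps with tree-building cost, and SoupStrainer does not work with the html5lib parser.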



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow