undetected_chromedriver is too slow

I am trying to scrape many pages from a single domain. The URLs have the following structure:

URL = 'https://somewebsite.eu/id/{}'.format(ID), where the variable ID takes many values. The website is protected by Cloudflare, so I decided to use selenium with undetected_chromedriver to get past it. Other approaches, such as requests with sessions and cfscrape, do not work with this website.

Since I need to parse many pages with the same URL structure, I loop over all values of the ID variable.

import pandas as pd
import numpy as np
import requests
import selenium

from undetected_chromedriver import Chrome 
from selenium.webdriver.chrome.options import Options
import time

def extracting_html_files_v11(ids):
    options = Options()
    options.add_argument("start-maximized")
    for x in ids:
        start_time = time.time()
        browser = Chrome(options=options)  # the keyword argument is "options", not "option"
        print('initialization of the browser')
        url = 'https://somewebsite.eu/id/{}/'.format(x)
        print(url)
        browser.get(url) 
        print('the page was downloaded')
        
        time_to_wait = np.random.uniform(low = 7, high = 10)
        time.sleep(time_to_wait)

        file_name = 'data_8000_9000/case_{}.html'.format(x)
        with open(file_name, 'w', encoding="utf-8") as f:
            f.write(browser.page_source)
        print('the file was saved')
        browser.quit()
        print('the browser was quited')
        print("--- %s seconds ---" % (time.time() - start_time))
        for i in range(3):
            print('_____')

However, this process takes too long. After each launch of the browser I need to wait roughly 5 seconds for Cloudflare to let me download the page (that is why I call time.sleep(time_to_wait)). Can the code be optimized? Should I look into parallel processing or something similar? (I am a complete beginner with parallel processes.)
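One possible optimization, shown here only as a minimal sketch, is to launch the browser once and reuse it for every ID instead of starting and quitting Chrome on each iteration. The names save_pages, make_browser, out_dir, wait_seconds and sleep are hypothetical and introduced for illustration. Whether Cloudflare keeps accepting later requests from the same session, so that the wait could be shortened, is an assumption that would need to be checked against the real site.

```python
import time
from pathlib import Path


def save_pages(ids, make_browser, out_dir, wait_seconds=8.0, sleep=time.sleep):
    """Save the page source for each ID using ONE browser for the whole loop.

    make_browser is a hypothetical zero-argument callable that returns a
    selenium-style driver (an object with get(), page_source and quit()).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    browser = make_browser()  # launched once, not once per ID
    saved = []
    try:
        for x in ids:
            url = 'https://somewebsite.eu/id/{}/'.format(x)
            browser.get(url)
            # Fixed pause, as in the original loop; the right value is site-specific.
            sleep(wait_seconds)
            path = out / 'case_{}.html'.format(x)
            path.write_text(browser.page_source, encoding='utf-8')
            saved.append(path)
    finally:
        # Quit the browser even if one of the pages raises an exception.
        browser.quit()
    return saved


# Possible usage with the names from the code above (not executed here):
#   from undetected_chromedriver import Chrome
#   save_pages(ids, lambda: Chrome(options=options), 'data_8000_9000')
```

Passing the browser in as a factory keeps the loop independent of the driver, so it can be tried with a stub object before pointing it at the real site.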

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
