How to scrape all Google Images results for any search and download the HTML pages containing those returned results using Scrapy and Selenium?

I am working on a web application that will let the user get image search results from keywords. For this, I would like to send the user's search query to Google Images and retrieve the results webpage to display to the user on my site. However, in Scrapy I can't automatically create a file in my project into which the content of the results HTML page gets written. Having learned after some research that Scrapy does not support the basic JSON format, I was advised to use Selenium together with Scrapy to perform this type of task with files containing JSON. So I watched this video and I read this Scrapy documentation page. Here is my code:

from requests import options

import scrapy
from scrapy.utils.project import get_project_settings
from selenium.webdriver import Chrome, ChromeOptions

class ImageSpider(scrapy.Spider):
    name = "imageSearch"

    def start_requests(self):
        settings = get_project_settings()
        driver_path = settings.get()
        options = ChromeOptions()
        options.headless = True
        driver = Chrome(executable_path=driver_path, options=options)
        driver.get("https://www.google.com/search?q=money&sxsrf=ALiCzsY8zansk7q3trz5BEHZr-NeKtDHJQ:1652478338282&source=lnms&tbm=isch&sa=X&ved=2ahUKEwjeu-jCud33AhUTgs4BHRBWCHMQ_AUoAXoECAIQAw")
        xpath = '//*'  # here is the modification
        link_elements = driver.find_elements_by_xpath(xpath)  # here is the modification

        for link_el in link_elements:
            href = link_el.get_attribute('href')
            yield scrapy.Request(href)
            driver.quit()
            
    def parse(self, response):
        filename = 'imageSearch.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
            self.log(f'Saved file {filename}')
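One thing worth noting about the loop in `start_requests` above: `driver.quit()` sits inside the `for` loop, so the driver is shut down right after the first yielded request, and elements whose `href` attribute is `None` are passed to `scrapy.Request` as-is. A Selenium-free sketch of what the loop's filtering logic should do (the plain dicts stand in for scraped elements and are purely illustrative):

```python
def collect_hrefs(elements):
    """Keep only elements that actually carry an href value;
    cleanup (driver.quit) belongs once, after the loop, not inside it."""
    hrefs = []
    for el in elements:
        href = el.get("href")  # stand-in for el.get_attribute('href')
        if href:               # most '//*' elements have no href -> None
            hrefs.append(href)
    return hrefs

# hypothetical scraped elements, modeled as plain dicts for illustration
elements = [{"href": "https://example.com/a"}, {}, {"href": None},
            {"href": "https://example.com/b"}]
print(collect_hrefs(elements))
```

With this shape, only real links become requests, and the browser is closed exactly once after all of them have been yielded.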

Here is my settings.py:

BOT_NAME = 'searchEngine'

SPIDER_MODULES = ['searchEngine.spiders']
NEWSPIDER_MODULE = 'searchEngine.spiders'
CHROME_DRIVER_PATH = "C:\Windows"

#scrapeOps
## Add Your ScrapeOps API key
SCRAPEOPS_API_KEY = 'cc7988ca-72cc-4456-86ec-ed8c6e5fbdaf'

## Add In The ScrapeOps Extension
EXTENSIONS = {
    'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500,
}

## Update The Download Middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}
#scrapeOps

Unfortunately my program does not work as hoped: it does not create a file containing the HTML code of the page. Yet according to the console, Scrapy started fine; I just get the following error:

  • TypeError: BaseSettings.get() missing 1 required positional argument: 'name'

'log_count/DEBUG': 5,

'log_count/ERROR': 1,

'log_count/INFO': 10,

I am a beginner in web scraping with Selenium and Scrapy and I don't really know what my mistakes are. Thank you, I look forward to your responses.
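For reference, the TypeError in the log above points at the `settings.get()` call in the spider: Scrapy's `BaseSettings.get(name, default=None)` requires the name of the setting to read as its first argument, e.g. `settings.get('CHROME_DRIVER_PATH')`. A minimal stand-alone sketch of that signature (`SettingsSketch` is a toy class for illustration, not part of Scrapy):

```python
class SettingsSketch:
    """Toy stand-in mimicking scrapy.settings.BaseSettings.get(name, default=None)."""

    def __init__(self, values):
        self._values = dict(values)

    def get(self, name, default=None):
        # Without a name there is nothing to look up, so calling
        # get() with no arguments raises exactly the TypeError above.
        return self._values.get(name, default)


settings = SettingsSketch({"CHROME_DRIVER_PATH": "C:\\drivers\\chromedriver.exe"})

try:
    settings.get()  # reproduces: missing 1 required positional argument: 'name'
except TypeError as exc:
    print(exc)

print(settings.get("CHROME_DRIVER_PATH"))  # pass the setting's name instead
```

The same pattern applies in the spider: `settings.get('CHROME_DRIVER_PATH')` returns the value configured in settings.py.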



Solution 1:[1]

This is a suggestion. Store your API key in DynamoDB and fetch it whenever required. You will need to write an extra piece of code that reads from DynamoDB using the boto3 library and then stores the key in your variable.

import logging


def get_apikey_db():
    # read the key from the database here
    ...


GOOGLE_API_KEY = get_apikey_db()
GOOGLE_GEOCODE_API_PATH = "https://maps.googleapis.com/maps/api/geocode/json?"


def get_lat_lon(address):
    ...

I hope this solves your problem.
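A rough sketch of what `get_apikey_db` could look like with boto3, assuming a table named `api_keys` with partition key `name` and the key stored in a string attribute `value` (all of these table and attribute names are hypothetical, and boto3 needs configured AWS credentials for the call to succeed):

```python
def extract_api_key(response, attribute="value"):
    """Pull the key string out of a DynamoDB get_item response.

    Low-level DynamoDB responses wrap each attribute in a type
    descriptor, e.g. {"S": "..."} for strings.
    """
    item = response.get("Item")
    if not item or attribute not in item:
        raise KeyError("API key not found in DynamoDB response")
    return item[attribute]["S"]


def get_apikey_db():
    # Hypothetical table/attribute names for illustration only.
    import boto3
    client = boto3.client("dynamodb")
    response = client.get_item(
        TableName="api_keys",
        Key={"name": {"S": "GOOGLE_API_KEY"}},
    )
    return extract_api_key(response)
```

Splitting the response parsing into `extract_api_key` keeps the AWS call itself thin and makes the parsing easy to test without network access.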

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Ashish Yadav