'Scrapy Playwright: execute CrawlSpider using scrapy playwright
Is it possible to execute CrawlSpider using Playwright integration for Scrapy? I am trying the following script to execute a CrawlSpider but it does not scrape anything. It also does not show any error!
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
class GumtreeCrawlSpider(CrawlSpider):
name = 'gumtree_crawl'
allowed_domains = ['www.gumtree.com']
def start_requests(self):
yield scrapy.Request(
url='https://www.gumtree.com/property-for-sale/london/page',
meta={"playwright": True}
)
return super().start_requests()
rules = (
Rule(LinkExtractor(restrict_xpaths="//div[@class='grid-col-12']/ul[1]/li/article/a"), callback='parse_item', follow=False),
)
async def parse_item(self, response):
yield {
'Title': response.xpath("//div[@class='css-w50tn5 e1pt9h6u11']/h1/text()").get(),
'Price': response.xpath("//h3[@itemprop='price']/text()").get(),
'Add Posted': response.xpath("//dl[@class='css-16xsajr elf7h8q4'][1]/dd/text()").get(),
'Links': response.url
}
Solution 1:[1]
Requests extracted from the rule do not have the playwright=True meta key, that's a problem if they need to be rendered by the browser to have useful content. You could solve that by using Rule.process_request, something like:
def set_playwright_true(request, response):
request.meta["playwright"] = True
return request
class MyCrawlSpider(CrawlSpider):
...
rules = (
Rule(LinkExtractor(...), callback='parse_item', follow=False, process_request=set_playwright_true),
)
Update after comment
Make sure your URL is correct, I get no results for that particular one (remove
/page?).Bring back your
start_requestsmethod, seems like the first page also needs to be downloaded using the browserUnless marked explicitly (e.g.
@classmethod,@staticmethod) Python instance methods receive the calling object as implicit first argument. The convention is to call thisself(e.g.def set_playwright_true(self, request, response)). However, if you do this, you will need to change the way you create the rule, either:Rule(..., process_request=self.set_playwright_true)Rule(..., process_request="set_playwright_true")
From the docs: "process_request is a callable (or a string, in which case a method from the spider object with that name will be used)"
My original example defines the processing function outside of the spider, so it's not an instance method.
Solution 2:[2]
As suggested by elacuesta, I'd only add change your "parse_item" def from an async to a standard def.
def parse_item(self, response):
It defies what all I've read too, but that got me through.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 |
