How to check for a broken link in Scrapy?
I have an array of links; how can I check in the method whether a link is broken or not? In general, I need to implement something like this construction:
def parse(self, response, **cb_kwargs):
    for link in links:
        # if response is HTTP 404 -> callback=self.parse_data...
        # elif response is HTTP 200 -> callback=self.parse_product...
        pass

def parse_data(self, response, **cb_kwargs):
    pass

def parse_product(self, response, **cb_kwargs):
    pass
The fact is that I need to know the status already in the first method (parse). Is this possible?
Solution 1:[1]
You could add the links to start_urls, and in parse() you can check response.status (and get response.url) and run the code that processes this URL directly; there is no need to send it again with a Request. Besides, Scrapy by default skips duplicate requests.
But Scrapy skips parse() for URLs that return errors, so you have to set the handle_httpstatus_list attribute.
import scrapy


class MySpider(scrapy.Spider):

    name = 'myspider'

    start_urls = [
        'http://httpbin.org/get',    # 200
        'http://httpbin.org/error',  # 404
        'http://httpbin.org/post',   # 405
    ]

    handle_httpstatus_list = [404, 405]

    def parse(self, response):
        print('url:', response.url)
        print('status:', response.status)

        if response.status == 200:
            self.process_200(response)
        if response.status == 404:
            self.process_404(response)
        if response.status == 405:
            self.process_405(response)

    def process_200(self, response):
        print('Process 200:', response.url)

    def process_404(self, response):
        print('Process 404:', response.url)

    def process_405(self, response):
        print('Process 405:', response.url)


# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    # 'USER_AGENT': 'Mozilla/5.0',
    # 'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
})

c.crawl(MySpider)
c.start()
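To map this onto the construction from the question (an array of links plus separate parse_data / parse_product methods), you can yield the requests yourself in start_requests() and branch on response.status inside parse(). The following is only an untested sketch; the spider name, the links list and the httpbin URLs are placeholders:

import scrapy


class LinkCheckSpider(scrapy.Spider):
    # hypothetical name and link list, used only for illustration
    name = 'link_check'

    links = [
        'http://httpbin.org/get',         # expected 200
        'http://httpbin.org/status/404',  # expected 404
    ]

    # let 404 responses reach parse() instead of being dropped
    # by the HttpError middleware
    handle_httpstatus_list = [404]

    def start_requests(self):
        for link in self.links:
            yield scrapy.Request(link, callback=self.parse)

    def parse(self, response, **cb_kwargs):
        # the status is already known here, so dispatch to the matching handler
        if response.status == 404:
            return self.parse_data(response, **cb_kwargs)
        elif response.status == 200:
            return self.parse_product(response, **cb_kwargs)

    def parse_data(self, response, **cb_kwargs):
        self.logger.info('Broken link: %s', response.url)

    def parse_product(self, response, **cb_kwargs):
        self.logger.info('Working link: %s', response.url)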
EDIT:
I didn't test it, but in the documentation you can also see
Using errbacks to catch exceptions in request processing,
which shows how to use errback=function to send the failure to that function when the request gets an error.
import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError


class ErrbackSpider(scrapy.Spider):
    name = "errback_example"

    start_urls = [
        "http://www.httpbin.org/",            # HTTP 200 expected
        "http://www.httpbin.org/status/404",  # Not found error
        "http://www.httpbin.org/status/500",  # server issue
        "http://www.httpbin.org:12345/",      # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",     # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                    errback=self.errback_httpbin,
                                    dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:

        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)
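In that errback approach a broken link shows up as an HttpError, and failure.value.response still carries the status code, so you could hand it over to your own handler. A minimal sketch, assuming a parse_data() method like the one named in the question:

# variant of errback_httpbin, to be placed inside the spider class
def errback_httpbin(self, failure):
    if failure.check(HttpError):
        # the HttpError spider middleware raises this for non-2xx responses
        response = failure.value.response
        if response.status == 404:
            # hypothetical handler reused from the question
            return self.parse_data(response)
    # everything else is just logged
    self.logger.error(repr(failure))

Note that handle_httpstatus_list and errback are alternatives here: statuses listed in handle_httpstatus_list (or allowed with the handle_httpstatus_all meta key) are delivered to the normal callback instead of raising HttpError.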
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Stack Overflow |
