How to check for a broken link in Scrapy?

I have a list of links; how can I check inside a method whether a link is broken or not? In general, I need to implement something like this construction:

def parse(self, response, **cb_kwargs):
    for link in links:
        # if the link responds with HTTP 404 -> callback=self.parse_data...
        # elif the link responds with HTTP 200 -> callback=self.parse_product...
        ...

def parse_data(self, response, **cb_kwargs):
    pass

def parse_product(self, response, **cb_kwargs):
    pass

The point is that I need to know the status in the first method (parse). Is this possible?



Solution 1:[1]

You could add the links to start_urls, and in parse() you can check response.status (and get response.url) and run the code that processes this URL directly; there is no need to send it again with a Request. Besides, Scrapy skips duplicate requests by default.

But Scrapy skips parse() for URLs that return errors, so you have to add those status codes to handle_httpstatus_list.

import scrapy

class MySpider(scrapy.Spider):

    name = 'myspider'

    start_urls = [
        'http://httpbin.org/get',    # 200
        'http://httpbin.org/error',  # 404
        'http://httpbin.org/post',   # 405
    ]

    handle_httpstatus_list = [404, 405]
    
    def parse(self, response):
        print('url:', response.url)
        print('status:', response.status)

        if response.status == 200:
            self.process_200(response)
        
        if response.status == 404:
            self.process_404(response)

        if response.status == 405:
            self.process_405(response)

    def process_200(self, response):
        print('Process 200:', response.url)

    def process_404(self, response):
        print('Process 404:', response.url)

    def process_405(self, response):
        print('Process 405:', response.url)
        
# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    # 'USER_AGENT': 'Mozilla/5.0',
    # 'FEEDS': {'output.csv': {'format': 'csv'}},  # new in Scrapy 2.1
})
c.crawl(MySpider)
c.start()
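
If the links come from an earlier response rather than start_urls (as in the question's "for link in links" loop), the same idea still works: yield one Request per link and route on response.status inside a single callback. Below is a minimal sketch, not tested against the question's site; the start URL, the CSS selector and the item fields are assumptions. The handle_httpstatus_list spider attribute is what lets the 404 responses reach the callback instead of being dropped.

import scrapy

class LinkCheckSpider(scrapy.Spider):
    name = 'link_check'
    start_urls = ['http://httpbin.org/links/5']   # page containing links (assumption)
    handle_httpstatus_list = [404]                # let 404 responses reach the callback

    def parse(self, response, **cb_kwargs):
        # collect links from the page (selector is an assumption)
        links = response.css('a::attr(href)').getall()
        for link in links:
            yield response.follow(
                link,
                callback=self.check_link,
                cb_kwargs={'source': response.url},
            )

    def check_link(self, response, source):
        # route by status inside one callback, as asked in the question
        if response.status == 404:
            yield from self.parse_data(response, source=source)
        else:
            yield from self.parse_product(response, source=source)

    def parse_data(self, response, source):
        # broken link: report where it was found (hypothetical item fields)
        yield {'broken_url': response.url, 'found_on': source}

    def parse_product(self, response, source):
        # working link: parse the page here
        yield {'ok_url': response.url, 'found_on': source}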

EDIT:

I didn't test it, but in the documentation you can also see

Using errbacks to catch exceptions in request processing

which shows how to use errback=function to send the failure to a function when a request gets an error.

import scrapy

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",              # HTTP 200 expected
        "http://www.httpbin.org/status/404",    # Not found error
        "http://www.httpbin.org/status/500",    # server issue
        "http://www.httpbin.org:12345/",        # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",       # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                    errback=self.errback_httpbin,
                                    dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:

        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)

There is also

Accessing additional data in errback functions
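
That part of the documentation covers passing extra data to an errback via Request.cb_kwargs and reading it back from failure.request.cb_kwargs. A minimal sketch of the pattern, where the main_url kwarg and the example URLs are assumptions, not from the question:

import scrapy

class CbKwargsErrbackSpider(scrapy.Spider):
    name = 'cb_kwargs_errback'

    def start_requests(self):
        # pass extra data alongside the request (main_url is a made-up example kwarg)
        yield scrapy.Request(
            'http://www.httpbin.org/status/404',
            callback=self.parse_page,
            errback=self.errback_page,
            cb_kwargs={'main_url': 'http://www.httpbin.org/'},
        )

    def parse_page(self, response, main_url):
        # cb_kwargs arrive as keyword arguments in the callback
        yield {'main_url': main_url, 'other_url': response.url}

    def errback_page(self, failure):
        # the same cb_kwargs are available on the failed request
        main_url = failure.request.cb_kwargs['main_url']
        self.logger.error('Error when following a link from %s', main_url)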

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1