Scrapy SitemapSpider - How to yield entry from sitemap_filter to parse

I'm building a SitemapSpider. I'm trying to filter sitemap entries to exclude entries that contain this substring '/p/' in the link:

<url>
       <loc>https://example.co.za/product-name/p/product-id</loc>
       <lastmod>2019-08-27</lastmod>
       <changefreq>daily</changefreq>
</url>

According to the Scrapy docs, we can define a sitemap_filter method on the spider:

    def sitemap_filter(self, entries):
        for entry in entries:
            date_time = datetime.strptime(entry['lastmod'], '%Y-%m-%d')
            if date_time.year >= 2005:
                yield entry

In my case I'm filtering on entry['loc'] instead of entry['lastmod'].

Unfortunately I haven't found an example that uses sitemap_filter besides the above.

from scrapy.spiders import SitemapSpider

class mySpider(SitemapSpider):
    name = 'spiderName'
    sitemap_urls = ['https://example.co.za/medias/sitemap']
    # sitemap_rules = [('donut/c', 'parse')]

    def sitemap_filter(self, entries):
        for entry in entries:
            if '/p/' not in entry['loc']:
                print(entry)
                yield entry

    def parse(self, response):
        ...

The spider runs fine without the sitemap_filter method, but defining a sitemap_rules pattern for every URL type I want to include isn't feasible.

When I run the code above it prints the correct sitemap entries, but the parse callback is never reached. The log file shows no errors:

2022-05-10 17:02:00 [scrapy.core.engine] INFO: Spider opened
2022-05-10 17:02:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-05-10 17:02:00 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-05-10 17:02:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.co.za/robots.txt> (referer: None)
2022-05-10 17:02:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.co.za/medias/sitemap.xml> (referer: None)
2022-05-10 17:02:05 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force 
2022-05-10 17:02:06 [scrapy.crawler] INFO: Received SIGINT twice, forcing unclean shutdown
2022-05-10 17:02:09 [scrapy.core.engine] INFO: Closing spider (shutdown)
2022-05-10 17:02:09 [scrapy.statscollectors] INFO: Dumping Scrapy stats:

I'm looking for a way to send the entries yielded by sitemap_filter to the parse function, or alternatively, a way to filter sitemap entries before scrapy opens the links.
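One alternative worth noting (a sketch, not from the original thread): each entry in sitemap_rules is a regular expression that Scrapy matches against the URL, so a negative lookahead such as r'^(?!.*/p/)' should exclude product URLs before any request is made. The pattern itself behaves like this:

```python
import re

# Candidate pattern for sitemap_rules: matches only URLs that do NOT contain '/p/'
pattern = re.compile(r'^(?!.*/p/)')

urls = [
    'https://example.co.za/product-name/p/product-id',  # product URL, should be excluded
    'https://example.co.za/donut/c/123',                # category URL, should be kept
]
kept = [u for u in urls if pattern.search(u)]
print(kept)  # only the category URL survives
```

In the spider this would be sitemap_rules = [(r'^(?!.*/p/)', 'parse')]; whether that fits depends on the site's actual URL scheme.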



Solution 1:[1]

Thanks for the suggestions everyone. Based on @Georgiy's comments and an older answer, replacing entry['loc'] with entry.get('loc') is what worked.

from scrapy.spiders import SitemapSpider

class mySpider(SitemapSpider):
    name = 'spiderName'
    sitemap_urls = ['https://example.co.za/medias/sitemap']
    # sitemap_rules = [('donut/c', 'parse')]

    def sitemap_filter(self, entries):
        for entry in entries:
            # .get() avoids a KeyError on entries without a 'loc' key;
            # the '' default keeps the membership test from failing on None
            if '/p/' not in entry.get('loc', ''):
                yield entry

    def parse(self, response):
        ...
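A minimal standalone sketch (with made-up entries) of why the .get() lookup is more forgiving: a direct entry['loc'] raises KeyError on any entry missing that key, which kills the generator mid-iteration, while .get() with a default keeps the loop alive. The '' default is my addition, not from the original answer:

```python
# Hypothetical sitemap entries; the last one lacks a 'loc' key entirely
entries = [
    {'loc': 'https://example.co.za/donut/c/123', 'lastmod': '2019-08-27'},
    {'loc': 'https://example.co.za/product-name/p/product-id'},
    {'lastmod': '2019-08-27'},
]

def sitemap_filter(entries):
    for entry in entries:
        # entry['loc'] would raise KeyError on the last entry;
        # the '' default also guards the `in` test against None
        if '/p/' not in entry.get('loc', ''):
            yield entry

filtered = list(sitemap_filter(entries))
```

Note that an entry without 'loc' passes this filter, since '' never contains '/p/'; add an explicit check if such entries should be dropped instead.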

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Clemence Padya