Scrapy SitemapSpider - How to yield entries from sitemap_filter to parse
I'm building a SitemapSpider, and I'm trying to filter out sitemap entries whose link contains the substring '/p/':
<url>
<loc>https://example.co.za/product-name/p/product-id</loc>
<lastmod>2019-08-27</lastmod>
<changefreq>daily</changefreq>
</url>
According to the Scrapy docs, we can define a sitemap_filter method:
def sitemap_filter(self, entries):
    for entry in entries:
        date_time = datetime.strptime(entry['lastmod'], '%Y-%m-%d')
        if date_time.year >= 2005:
            yield entry
In my case I'm filtering on entry['loc'] instead of entry['lastmod'].
Unfortunately I haven't found any example that uses sitemap_filter besides the one above. Here is my spider:
from scrapy.spiders import SitemapSpider

class mySpider(SitemapSpider):
    name = 'spiderName'
    sitemap_urls = ['https://example.co.za/medias/sitemap']
    # sitemap_rules = [('donut/c', 'parse')]

    def sitemap_filter(self, entries):
        for entry in entries:
            if '/p/' not in entry['loc']:
                print(entry)
                yield entry

    def parse(self, response):
        ...
The code runs fine without the sitemap_filter method, but it's not feasible to define sitemap_rules for every pattern I want to exclude.
When I run the code above it prints the correct sitemap entries, but it never seems to reach the parse method. The log shows no errors:
2022-05-10 17:02:00 [scrapy.core.engine] INFO: Spider opened
2022-05-10 17:02:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-05-10 17:02:00 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-05-10 17:02:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.co.za/robots.txt> (referer: None)
2022-05-10 17:02:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.co.za/medias/sitemap.xml> (referer: None)
2022-05-10 17:02:05 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force
2022-05-10 17:02:06 [scrapy.crawler] INFO: Received SIGINT twice, forcing unclean shutdown
2022-05-10 17:02:09 [scrapy.core.engine] INFO: Closing spider (shutdown)
2022-05-10 17:02:09 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
I'm looking for a way to pass the entries yielded by sitemap_filter on to the parse method, or alternatively, a way to filter sitemap entries before Scrapy opens the links.
Solution 1:[1]
Thanks for the suggestions, everyone. Based on @Georgiy's comments and an older answer, replacing entry['loc'] with entry.get('loc') is what worked:
from scrapy.spiders import SitemapSpider

class mySpider(SitemapSpider):
    name = 'spiderName'
    sitemap_urls = ['https://example.co.za/medias/sitemap']
    # sitemap_rules = [('donut/c', 'parse')]

    def sitemap_filter(self, entries):
        for entry in entries:
            if '/p/' not in entry.get('loc'):
                # print(entry)
                yield entry

    def parse(self, response):
        ...
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Clemence Padya |
