Variable number of crawled items each run using Scrapy
I am using Scrapy to crawl a website that contains a category menu with several levels of nested categories (category, subcategory, sub-subcategory, and so on, depending on the category).
For example:
--Category 1
    Subcategory 11
        Subsubcategory 111
        Subsubcategory 112
    Subcategory 12
        Subsubcategory 121
            Subsubsubcategory 1211
            Subsubsubcategory 1212
--Category 2
    Subcategory 21
...
There are approximately 30,000 categories, subcategories, sub-subcategories, etc., and I am only scraping this section, following a single Rule:
rules = [
    Rule(
        LinkExtractor(
            restrict_xpaths=['//div[@class="treecategories"]//a'],
        ),
        follow=True,
        callback='parse_categories',
    ),
]
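For reference, a minimal sketch of how a rule like this might sit inside a CrawlSpider (the spider name, start URL and yielded item below are placeholders, not from my actual project):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class CategoriesSpider(CrawlSpider):
    name = 'categories'                        # assumed name
    start_urls = ['https://www.example.com/']  # placeholder URL

    rules = [
        Rule(
            LinkExtractor(restrict_xpaths=['//div[@class="treecategories"]//a']),
            follow=True,
            callback='parse_categories',
        ),
    ]

    def parse_categories(self, response):
        # One item per visited category page; the real fields are omitted here.
        yield {'url': response.url}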
The rule seems to work fine. The problem is that each time I run the scraper I get a different number of crawled items, even though I know the website is not being updated. What could be the reason for this behaviour?
These are the settings I am using:
settings = {
    'BOT_NAME': 'crawler',
    'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36',
    'CONCURRENT_REQUESTS': 64,
    'COOKIES_ENABLED': False,
    'LOG_LEVEL': 'DEBUG',
}
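A minimal sketch, assuming the CategoriesSpider sketch above, of how the crawl can be run from a script with these settings and the final stats printed; the stat keys shown are standard Scrapy stat names, and comparing them between two runs (retries, dupefilter drops, error counts) is one way to see where the difference in item totals comes from:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings=settings)          # the settings dict above
crawler = process.create_crawler(CategoriesSpider)   # spider sketched earlier
process.crawl(crawler)
process.start()                                      # blocks until the crawl finishes

# Stats worth diffing across runs.
stats = crawler.stats.get_stats()
for key in ('item_scraped_count', 'dupefilter/filtered',
            'retry/max_reached', 'log_count/ERROR'):
    print(key, stats.get(key))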
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow