Scrapy - change settings based on value of scraped item during runtime
I need to change the FEEDS parameter for exporting CSVs to AWS S3 depending on the value of scraped items. I tried to put a condition in settings.py, but that doesn't work because I cannot import the item there (I get "cannot import name 'item' from ..."). I also tried from the pipeline and the spider:
if item.get('meta_source') is not None:
    FEEDS = {
        's3://ghr-crawler-ops/crawler_holding/meta.csv': {
            'format': 'csv',
        },
    }
else:
    FEEDS = {
        's3://ghr-crawler-ops/crawler_holding/results.csv': {
            'format': 'csv',
        },
    }
Basically, I need to export two CSVs to AWS S3 from the same spider depending on the value of the scraped data. Exporting to my local computer works fine, but not to S3 (all the data ends up in one CSV).
Solution 1:[1]
This should be achievable with Item filtering, available since Scrapy 2.6. Something like the following:
from scrapy.extensions.feedexport import ItemFilter

class MetaItemFilter(ItemFilter):
    def accepts(self, item) -> bool:
        return item.get("meta_source") is not None

class ResultsItemFilter(ItemFilter):
    def accepts(self, item) -> bool:
        return item.get("meta_source") is None

FEEDS = {
    "s3://ghr-crawler-ops/crawler_holding/meta.csv": {
        "format": "csv",
        "item_filter": MetaItemFilter,
    },
    "s3://ghr-crawler-ops/crawler_holding/results.csv": {
        "format": "csv",
        "item_filter": ResultsItemFilter,
    },
}
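The routing logic of the two filters can be checked outside Scrapy. This minimal sketch stands in for Scrapy's `ItemFilter` with plain classes (so it runs without Scrapy installed; the `accepts()` contract is the same), and the hypothetical `route` helper just shows which feed URI would accept a given item:

```python
# Stand-ins for the two filter classes; accepts() mirrors the solution above.
class MetaItemFilter:
    def accepts(self, item) -> bool:
        return item.get("meta_source") is not None

class ResultsItemFilter:
    def accepts(self, item) -> bool:
        return item.get("meta_source") is None

def route(item):
    """Return the feed URIs whose filter accepts this item (illustration only;
    in a real crawl Scrapy's feed export extension does this per feed)."""
    feeds = {
        "s3://ghr-crawler-ops/crawler_holding/meta.csv": MetaItemFilter(),
        "s3://ghr-crawler-ops/crawler_holding/results.csv": ResultsItemFilter(),
    }
    return [uri for uri, f in feeds.items() if f.accepts(item)]

print(route({"meta_source": "sitemap"}))  # routed to meta.csv
print(route({"title": "plain result"}))   # routed to results.csv
```

Each item is accepted by exactly one of the two filters, so every scraped item lands in exactly one of the two S3 feeds.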
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | elacuesta |
