Scrapy image pipeline: filename using other crawled info
Is there any way to name a crawled image using other info (text) that the spider collects? For example, in this case I want to name the images with the article title and the article published date that I got in the spider:
spider file
# lines of code
def parse(self, response):
    # lines of code
    yield {
        'date': date,
        'title': article_title,
        'image_urls': clean_urls
    }
pipelines.py
from scrapy.pipelines.images import ImagesPipeline


class customImagesPipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        return f"images/{request.url.split('/')[-1]}"
Solution 1:[1]
One way to go about this is to override the get_media_requests method and set the image name there on the image request's meta attribute, so that you can access it in the file_path method.
The following example works if you pass a single image URL as a string in image_urls:
from scrapy.http import Request
from scrapy.pipelines.images import ImagesPipeline


class customImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # item["image_urls"] is expected to be a single URL string here
        return Request(
            item["image_urls"],
            meta={
                "image_name": f"{item['title']}_{item['date']}",
            },
        )

    def file_path(self, request, response=None, info=None, *, item=None) -> str:
        # Reuse the name that get_media_requests stored on the request
        return f"images/{request.meta['image_name']}.jpg"
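If image_urls holds a list of URLs (as clean_urls in the question probably does), every image needs its own name, otherwise they would all be written to the same path. A minimal sketch of how the names could be built is shown below. The helpers safe_name and image_names are hypothetical and not part of Scrapy, and replacing awkward characters with underscores is only an assumption about what makes a reasonable file name:

```python
import re
from typing import List, Tuple


def safe_name(text: str) -> str:
    """Hypothetical helper: make a string usable as a file name."""
    # Keep word characters, dots and hyphens; replace every other run with "_"
    return re.sub(r"[^\w.-]+", "_", str(text)).strip("_")


def image_names(item: dict) -> List[Tuple[str, str]]:
    """Hypothetical helper: pair each image URL with '<title>_<date>_<index>'."""
    urls = item["image_urls"]
    if isinstance(urls, str):  # tolerate a single URL passed as a string
        urls = [urls]
    base = safe_name(f"{item['title']}_{item['date']}")
    # The index keeps names unique when one item has several images
    return [(url, f"{base}_{index}") for index, url in enumerate(urls)]
```

Inside get_media_requests you could then loop over image_names(item) and yield Request(url, meta={"image_name": name}) for each pair, while file_path keeps reading request.meta["image_name"].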
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
