How to scrape multiple URLs and store the data in one Item
I am creating a Scrapy project to compare the prices of products from different sellers on the same site. The URLs of the pages I want to scrape take the format of:
I want to scrape the name and price of each product on the pages and store that information in an item for each page. I then want to combine these items into another item to pass through a pipeline and store in a MongoDB database.
The items containing the results of scraping each page should contain the name of the seller, the scraped data and the average price of all the products. The final item should contain the date and time stamp and a list containing the 2 previously scraped items.
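    import scrapy

    # Result of scraping a single seller page.
    class ScrapingResultItem(scrapy.Item):
        name = scrapy.Field()
        scraped_items = scrapy.Field()
        price_avg = scrapy.Field()

    # Combined result: one timestamp plus the list of per-page results.
    class AllScrapedDataItem(scrapy.Item):
        datetime = scrapy.Field()
        data = scrapy.Field()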
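As a rough illustration of the structure described above, the combined record could look like the following. This is a minimal sketch using plain dictionaries instead of the Item classes, so it runs without Scrapy; `build_combined_record` and the sample seller data are hypothetical and exist only for illustration.

```python
from datetime import datetime, timezone


def build_combined_record(page_results, scraped_at=None):
    """Wrap per-page results in one record sharing a single timestamp.

    Hypothetical helper: mirrors the fields of AllScrapedDataItem
    (datetime, data) using a plain dict.
    """
    if scraped_at is None:
        scraped_at = datetime.now(timezone.utc)
    return {"datetime": scraped_at, "data": list(page_results)}


# Made-up sample data in the shape of ScrapingResultItem.
seller1 = {
    "name": "seller1",
    "scraped_items": [
        {"title": "Mug", "price": 10.0},
        {"title": "Cup", "price": 5.0},
    ],
    "price_avg": 7.5,
}
seller2 = {
    "name": "seller2",
    "scraped_items": [{"title": "Mug", "price": 12.0}],
    "price_avg": 12.0,
}

record = build_combined_record([seller1, seller2])
```

Because both per-page results sit under one `datetime` key, they would end up in a single document with a shared timestamp.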
Currently I am able to scrape both pages separately by having multiple start URLs, store the results of the scraping in an item via the parse method, and then store the items in the database. But they are being stored in separate documents.
Is there a way to combine these items to store all the data in one document in the database? The reason I want to do this is to have them timestamped with the same date and time.
Spider:
    import scrapy

    # ScrapingResultItem comes from the project's items module (import omitted here).

    class SscSpider(scrapy.Spider):
        name = 'ssc'
        allowed_domains = ['store.com']
        # Scrapy requires a scheme (http:// or https://) on each start URL.
        start_urls = ['https://www.store.com/seller1', 'https://www.store.com/seller2']

        def parse(self, response):
            titles = response.css('.v2-listing-card__title::text').extract()
            prices = response.css('.currency-value::text').extract()
            scraped_items = []
            price_count = 0
            for item in zip(titles, prices):
                scraped_info = {
                    'title': item[0].strip(),
                    'price': float(item[1].strip()),
                }
                price_count += float(item[1].strip())
                scraped_items.append(scraped_info)
            price_avg = round(price_count / (len(scraped_items)), 2)
            scrapingResult = ScrapingResultItem(
                name=self.name,
                scraped_items=scraped_items,
                price_avg=price_avg,
            )
            yield scrapingResult
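One possible way to get both pages into a single item is to request the pages one after another instead of listing them all in start_urls, carry the results collected so far along with each request, and yield the combined item only once no URLs remain. The sketch below shows that control flow in plain Python so it runs without Scrapy; `summarize_page`, `next_step`, and the sample data are hypothetical. In a real spider, the pending URLs and accumulated results could be handed to the next callback through the `cb_kwargs` argument of `scrapy.Request`.

```python
from datetime import datetime, timezone


def summarize_page(seller, titles, prices):
    """Build one per-page result from parallel lists of raw strings."""
    items = [
        {"title": t.strip(), "price": float(p.strip())}
        for t, p in zip(titles, prices)
    ]
    total = sum(item["price"] for item in items)
    # Guard against pages with no products to avoid dividing by zero.
    avg = round(total / len(items), 2) if items else None
    return {"name": seller, "scraped_items": items, "price_avg": avg}


def next_step(page_result, pending_urls, results):
    """Decide what a callback should do after one page is parsed.

    Returns ("request", url, remaining, results) while pages remain,
    or ("item", combined) once every page has been collected.
    """
    results = results + [page_result]
    if pending_urls:
        return ("request", pending_urls[0], pending_urls[1:], results)
    combined = {"datetime": datetime.now(timezone.utc), "data": results}
    return ("item", combined)


# Simulated crawl over two made-up pages.
first = summarize_page("seller1", [" Mug "], ["10.00"])
step = next_step(first, ["https://www.store.com/seller2"], [])
second = summarize_page("seller2", ["Cup", "Bowl"], ["4.00", "5.00"])
final = next_step(second, step[2], step[3])
```

A trade-off of this approach is that the pages are fetched sequentially, and a failed request would stop the chain unless it is handled separately.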
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow