How to output nested JSON in Scrapy?
I am building a Scrapy project and have realised that I need to nest the JSON to get the output I want for further use. Until now, I have been saving the JSON as a flat list of items, without any nesting:
[{"Title": "Cukrus RIMI, 1 kg", "Price": "0.89", "Image": "https://rimibaltic-res.cloudinary.com/image/upload/b_white,c_fit,f_auto,h_480,q_auto,w_480/d_ecommerce:backend-fallback.png/MAT_801045_PCE_LT", "Link": "https://www.rimi.lt/e-parduotuve/lt/produktai/bakaleja/cukrus-ir-saldikliai/baltasis-cukrus-/cukrus-rimi-1-kg/p/801045"},
{"Title": "Pomidorų padažas su bazilikais BARILLA, 400 g", "Price": "2.69", "Image": "https://rimibaltic-res.cloudinary.com/image/upload/b_white,c_fit,f_auto,h_480,q_auto,w_480/d_ecommerce:backend-fallback.png/MAT_106498_PCE_LT", "Link": "https://www.rimi.lt/e-parduotuve/lt/produktai/bakaleja/padazai-garstycios-krienai/padazai-maisto-ruosimui-ir-makaronams/pomidoru-padazas-su-bazilikais-barilla-400g/p/106498"},
{"Title": "Padažas makaronams RIMI su bazilikais, 390 g", "Price": "1.65", "Image": "https://rimibaltic-res.cloudinary.com/image/upload/b_white,c_fit,f_auto,h_480,q_auto,w_480/d_ecommerce:backend-fallback.png/MAT_810787_PCE_LT", "Link": "https://www.rimi.lt/e-parduotuve/lt/produktai/bakaleja/padazai-garstycios-krienai/padazai-maisto-ruosimui-ir-makaronams/padazas-makaronams-su-bazilikais-rimi-390-g/p/810787"},
{"Title": "Ekologiški raudonieji lęšiai I LOVE ECO, 400 g", "Price": "2.79", "Image": "https://rimibaltic-res.cloudinary.com/image/upload/b_white,c_fit,f_auto,h_480,q_auto,w_480/d_ecommerce:backend-fallback.png/MAT_141700_PCE_LT", "Link": "https://www.rimi.lt/e-parduotuve/lt/produktai/bakaleja/ankstiniai/lesiai/ekologiski-raudonieji-lesiai-i-love-eco-400g/p/141700"}]
Now I am trying to nest it by adding the details of the shop I am scraping at the top of the JSON.
Example (desired output):
{
    "shop" : {
        "sid" : 1,
        "name" : "Barbora",
        "domain" : "https://barbora.lt",
        "image_url" : ""
    },
    "products" : [
        {
            "Image" : "https://cdn.barbora.lt/products/1d747537-6760-4098-ab24-8c658d1f9491_m.png",
            "Link" : "/produktai/bananai-1-kg",
            "Price" : "€1,39",
            "Title" : "Bananai, 1 kg"
        },
        {
            "Image" : "https://cdn.barbora.lt/products/9d38e2e4-8106-4e8e-9b26-dec28b4eed96_m.png",
            "Link" : "/produktai/suris-rokiskio-ekstra-45-proc-rieb-s-m-1-kg",
            "Price" : "€8,79",
            "Title" : "Sūris ROKIŠKIO EKSTRA, 45% rieb. s. m., 1 kg"
        },...
I have tried putting the items into a list, but I get an error (it asks me to return an item, not a list). So I then tried combining Scrapy items to build the structure myself. Here is what I have tried so far, but it does not seem to be working:
import scrapy
from pbl.items import PblSpider
from pbl.items import ShopCard

SHOP_ID = 1
SHOP_NAME = 'Asorti'

shop = ShopCard()
shop['id'] = SHOP_ID
shop['name'] = SHOP_NAME
shop['domain'] = 'https://www.assorti.lt'
#shop['imageurl'] = response.xpath()

class SpiderasortiSpider(scrapy.Spider):
    name = 'spiderAsorti'
    allowed_domains = ['www.assorti.lt']
    start_urls = ['https://www.assorti.lt/katalogas/maistas/']

    def __init__(self):
        self.declare_xpath()

    def declare_xpath(self):
        self.getAllItemsXpath = '//*[@id="products_wrapper"]/div[2]/div/a/@href'
        self.TitleXpath = '//*[@id="products_detailed"]/div[1]/div/div/div[2]/h1/text()'
        self.ImageXpath = '//*[@id="products_photos"]/div[1]/img/@src'
        self.PriceXpath = '//*[@id="products_add2cart"]/form/div/div[1]/div/div[1]/span/span[1]/text()'

    def parse(self, response):
        for href in response.xpath(self.getAllItemsXpath):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url,callback=self.parse_item)
        next_page = response.xpath('//*[@id="products_wrapper"]/div[3]/div[2]/ul/li/a[contains(@class, "pagination_link")]/@href').extract()
        if next_page[1] is not '#':
            print('-' * 70)
            print(next_page[1])
            print('-' * 70)
            url = response.urljoin(next_page[1])
            yield scrapy.Request(url, callback=self.parse)

    def parse_item(self, response):
        shop = ShopCard()
        shop['product'] = PblSpider()
        Title = response.xpath(self.TitleXpath).extract_first()
        Link = response.url
        Image = response.xpath(self.ImageXpath).extract_first()
        Price = response.xpath(self.PriceXpath).extract_first()
        shop['product']['Title'] = Title
        shop['product']['Link'] = Link
        shop['product']['Image'] = Image
        shop['product']['Price'] = Price
        return shop
What am I doing wrong? Is there another way to build nested JSON files in Scrapy, or can it only produce flat, non-nested output like the first example?
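One possible approach, sketched here rather than taken from the original post, is to let the spider keep yielding flat product items and have an item pipeline assemble the nested document once the crawl ends. The class name `NestedJsonPipeline`, the `output.json` path, and the hard-coded shop values below are assumptions for illustration; only the `open_spider`, `process_item`, and `close_spider` hook names are the ones Scrapy calls on a pipeline.

```python
import json


class NestedJsonPipeline:
    """Hypothetical pipeline: collects every product item and writes
    a single {"shop": ..., "products": [...]} document at the end."""

    def __init__(self, path="output.json"):
        # Output path and shop details are illustrative assumptions.
        self.path = path
        self.shop = {
            "sid": 1,
            "name": "Asorti",
            "domain": "https://www.assorti.lt",
            "image_url": "",
        }
        self.products = []

    def open_spider(self, spider):
        # Start each crawl with an empty product list.
        self.products = []

    def process_item(self, item, spider):
        # dict() works for both plain dicts and scrapy.Item instances.
        self.products.append(dict(item))
        return item

    def close_spider(self, spider):
        # Wrap the collected products together with the shop metadata.
        document = {"shop": self.shop, "products": self.products}
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(document, f, ensure_ascii=False, indent=2)
```

With this sketch, `parse_item` would return only the flat product item (Title, Link, Image, Price), and the pipeline would be enabled through the `ITEM_PIPELINES` setting, for example `{"pbl.pipelines.NestedJsonPipeline": 300}`, where the module path is assumed. Because all products are held in memory until the spider closes, this suits modest crawls better than very large ones.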
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
