Write processed results in JSON files
I am using Scrapy for broad crawling and have the following requirements:
1. Scrapy will scrape the URL;
2. Scrapy will parse the response from the URL and write the parsed results to a file, say file1.json, if and only if the size of file1.json is less than 2 GB. Otherwise, Scrapy will create a new file, say file2.json, and write the response to this new file;
3. Once the response is returned, Scrapy will extract the URLs from the response, follow the extracted URLs, and then start again from point 2.
Below is my code. I am able to perform steps 1 and 3, but I couldn't figure out where to place the logic for creating the new file, checking its size, and writing the parsed response to it.
```python
# Module-level imports used by parse()
import urllib.parse
from urllib.parse import urlparse

from bs4 import BeautifulSoup


def parse(self, response):
    url = response.request.url
    soup = BeautifulSoup(response.text, 'lxml')

    # Collect the visible text of each element into a dictionary
    d = {}
    for element in soup.find_all():
        if element.name in ["html", "body", "script", "footer"]:
            pass
        else:
            x = element.find_all(text=True, recursive=False)
            if x:
                d[element.name] = x
    yield d  # Step 2: I want to write this dictionary to a file as per the logic above

    # Extract links, record them in the Neo4j graph, and follow them
    for link in soup.find_all('a', href=True):
        absoluteUrl = urllib.parse.urljoin(url, link['href'])
        parsedUrl = urlparse(absoluteUrl)
        if parsedUrl.scheme.strip().lower() != 'https' and parsedUrl.scheme.strip().lower() != 'http':
            pass
        else:
            url = url.replace("'", r"\'")
            absoluteUrl = absoluteUrl.replace("'", r"\'")
            self.graph.run(
                "MERGE (child:page{page_url:'" + url + "'}) " +
                "ON CREATE " +
                "SET child.page_url='" + url + "', child.page_rank = 1.0 " +
                "MERGE (parent:page{page_url:'" + absoluteUrl + "'}) " +
                "ON CREATE " +
                "SET parent.page_url = '" + absoluteUrl + "', parent.page_rank = 1.0 " +
                "MERGE (child)-[:FOLLOWS]->(parent)"
            )
            yield response.follow(absoluteUrl, callback=self.parse)  # Step 3 (all good)
```
My question is: where should I write the logic for creating the file, checking the file size, and writing the spider's output into that file? Should it go in a pipeline, a middleware, or the spider's init method?
Any help would be appreciated. I tried learning about middlewares, pipelines, etc., but couldn't figure out how to implement this functionality.
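For reference, one place where the step 2 logic could live is an item pipeline, since a pipeline receives every item the spider yields. Below is a minimal sketch of that idea; the class name, the fileN.json naming pattern, and the JSON Lines output format are illustrative assumptions rather than code from the question:

```python
# pipelines.py -- a sketch of step 2 (write until 2 GB, then roll over to a
# new file) as a Scrapy item pipeline. Names and output format are assumptions.
import json
import os


class SizeLimitedJsonWriterPipeline:
    MAX_BYTES = 2 * 1024 ** 3  # 2 GB per output file

    def open_spider(self, spider):
        self.file_index = 1
        self._open_current_file()

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        encoded = (json.dumps(dict(item), ensure_ascii=False) + "\n").encode("utf8")
        # Roll over to the next file once the current one would exceed the limit.
        if self.current_size + len(encoded) > self.MAX_BYTES:
            self.file.close()
            self.file_index += 1
            self._open_current_file()
        self.file.write(encoded)
        self.current_size += len(encoded)
        return item

    def _open_current_file(self):
        path = f"file{self.file_index}.json"
        self.file = open(path, "ab")  # append bytes; file is created if missing
        self.current_size = os.path.getsize(path)
```

Such a pipeline would be enabled through the ITEM_PIPELINES setting, e.g. ITEM_PIPELINES = {"yourproject.pipelines.SizeLimitedJsonWriterPipeline": 300} (the module path depends on your project). Note that it writes one JSON object per line (JSON Lines) rather than a single JSON array, which keeps the size check simple.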
Solution 1:[1]
If you know the approximate number of items each file can hold without exceeding the 2 GB size limit, then out of the box you can use the FEED_EXPORT_BATCH_ITEM_COUNT setting, and Scrapy will automatically create a new file whenever the number of items in the current file reaches that limit. Read more about this setting on the FEEDS documentation page.
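A minimal settings sketch of that approach (the batch size here is an illustrative guess at how many items stay below 2 GB for this crawl; the %(batch_id)d placeholder in the feed URI is required when batching and is what numbers the output files):

```python
# settings.py -- batch-based file rotation via Scrapy feed exports.
FEEDS = {
    # %(batch_id)d expands to 1, 2, 3, ..., producing file1.json, file2.json, ...
    "file%(batch_id)d.json": {
        "format": "json",
        "encoding": "utf8",
    },
}

# Illustrative guess: start a new output file every 100 000 items.
FEED_EXPORT_BATCH_ITEM_COUNT = 100_000
```

With this in place, the `yield d` in parse() is all step 2 needs: the feed exporter serializes each yielded dictionary and handles the file rollover, so no extra pipeline or middleware code is required for writing.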
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | msenior_ |
