Python Scrapy: need assistance. I want to save the scraped data to a (.csv) file. How can I do this?
I'm using Debian Bullseye (11.2).
from scrapy.spiders import CSVFeedSpider


class CsSpiderSpider(CSVFeedSpider):
    name = 'cs_spider'
    allowed_domains = ['ocw.mit.edu/courses/electrical-engineering-and-computer-science/']
    start_urls = ['http://ocw.mit.edu/courses/electrical-engineering-and-computer-science//feed.csv']
    # headers = ['id', 'name', 'description', 'image_link']
    # delimiter = '\t'

    # Do any adaptations you need here
    #def adapt_response(self, response):
    #    return response

    def parse_row(self, response, row):
        i = {}
        #i['url'] = row['url']
        #i['name'] = row['name']
        #i['description'] = row['description']
        return i
Solution 1:[1]
Here's an example of using the FEEDS export from scrapy.
import scrapy
from scrapy.crawler import CrawlerProcess


class CsspiderSpider(scrapy.Spider):
    name = 'cs_spider'
    start_urls = ['http://ocw.mit.edu/courses/electrical-engineering-and-computer-science']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url, callback=self.parse_row
            )

    def parse_row(self, response):
        yield {
            'test': response.text
        }


process = CrawlerProcess(
    settings={
        'FEEDS': {
            'data.csv': {
                'format': 'csv'
            }
        }
    }
)
process.crawl(CsspiderSpider)
process.start()
This will save the output to data.csv in CSV format. Furthermore, to specify which columns to export and in what order, use FEED_EXPORT_FIELDS. You can read more about this in the docs.
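For example, a minimal sketch of the same settings with FEED_EXPORT_FIELDS added. The field names url, name and description are placeholders here and assume your items actually yield those keys:

process = CrawlerProcess(
    settings={
        'FEEDS': {
            'data.csv': {
                'format': 'csv'
            }
        },
        # Columns to export, in this order; any item keys not listed are dropped.
        # 'url', 'name', 'description' are placeholder field names.
        'FEED_EXPORT_FIELDS': ['url', 'name', 'description'],
    }
)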
Alternatively, from the command line you can run:
scrapy crawl cs_spider -o output.csv
However, when running the spider from the command line, make sure to comment out all the code from the process = CrawlerProcess(...) line downward, otherwise the crawl would be started twice.
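If you would rather keep the CSVFeedSpider from the question, the FEEDS setting can also go into the spider's custom_settings, so that plain scrapy crawl cs_spider writes the CSV with no extra options. This is only a sketch: it assumes the feed URL from the question is actually reachable and that the CSV has url, name and description columns.

from scrapy.spiders import CSVFeedSpider


class CsSpiderSpider(CSVFeedSpider):
    name = 'cs_spider'
    # URL taken from the question; assumed to point at a real CSV feed
    start_urls = ['http://ocw.mit.edu/courses/electrical-engineering-and-computer-science//feed.csv']
    headers = ['url', 'name', 'description']  # placeholder column names
    delimiter = ','

    # Write every yielded row to data.csv, same idea as the FEEDS setting above
    custom_settings = {
        'FEEDS': {
            'data.csv': {'format': 'csv'},
        },
    }

    def parse_row(self, response, row):
        # row is a dict keyed by the headers; yielding it sends it to the feed export
        yield {
            'url': row.get('url'),
            'name': row.get('name'),
            'description': row.get('description'),
        }

Run it with scrapy crawl cs_spider and data.csv is created in the directory you run the command from.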
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | dollar bill |
