How to separate contents of item containers?
I am in the process of building an email scraper and am having trouble when it comes to yielding items. My yield prints as:
{'email': ['[email protected]', '[email protected]', '[email protected]']}
When I export this to CSV, I get an email header and all three emails end up in the same cell. How can I separate them into individual cells?
    class EmailSpider(CrawlSpider):
        name = 'emails'
        start_urls = ['https://example.com']
        parsed_url = urlparse(start_urls[0])
        rules = [Rule(LinkExtractor(allow_domains=parsed_url), callback='parse', follow=True)]

        def parse(self, response):
            # Scrape page for email links
            items = EmailscrapeItem()
            hrefs = [response.xpath("//a[starts-with(@href, 'mailto')]/text()").getall()]
            # Removes hrefs that are empty or None
            hrefs = [d for d in hrefs if d]
            # TODO: Add code to capture non-mailto emails as well
            # hrefs.append(response.xpath("//*[contains(text(), '@')]/text()"))
            for href in hrefs:
                items['email'] = href
                yield items
Solution 1:[1]
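I figured out what I did wrong. Wrapping the .getall() call in square brackets made hrefs a list containing a single list, so the loop ran only once and assigned every email to one item.
I changed my parse to: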
    for res in response.xpath("//a[starts-with(@href, 'mailto')]/text()"):
        item = EmailscrapeItem()
        item['email'] = res.get()
        yield item
This yielded the proper results: one item per email, so each address is exported as its own CSV row.
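The difference between the two loops can be reproduced without Scrapy. The following is only a minimal sketch using plain Python lists and dicts in place of the selector results and EmailscrapeItem. The addresses and the helper names items_nested, items_flat, and to_csv are hypothetical, and the standard csv module stands in for Scrapy's own feed exporter, so the exact cell formatting may differ from a real export.

```python
import csv
import io

# Hypothetical stand-in for response.xpath(...).getall()
emails = ["a@example.com", "b@example.com", "c@example.com"]


def items_nested(found):
    """Mimic the original parse: the result list is wrapped in another list."""
    hrefs = [found]                    # one element, which is itself a list
    hrefs = [d for d in hrefs if d]    # the filter keeps that single element
    for href in hrefs:                 # so the loop body runs only once
        yield {"email": href}          # the whole list lands in one item


def items_flat(found):
    """Mimic the fixed parse: build a fresh item for every match."""
    for email in found:
        yield {"email": email}


def to_csv(items):
    """Write items to an in-memory CSV, one row per item."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["email"])
    writer.writeheader()
    for item in items:
        writer.writerow(item)
    return buf.getvalue()


nested = list(items_nested(emails))
flat = list(items_flat(emails))

print(len(nested), len(flat))
print(to_csv(flat))
```

Because an exporter writes one row per yielded item, the number of items is what decides how many cells the emails occupy. Creating the item inside the loop, as in the fix, also avoids reusing and mutating a single item object across iterations.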
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | howshotwebs |
