Scrapy file output missing for a specific spider when using CrawlerProcess/pipeline
I have a CrawlProcess.py script that runs my spiders, and items are written to a .jsonl file via a pipeline. This works perfectly for every spider I have except one, which produces a blank .jsonl file instead. The console output shows 5 items scraped, and if I run the spider with scrapy crawl faultySpider -o output.jl, those items are present in the output file.
I can't tell what is different about this spider, and I can't find an explanation for this behaviour in other threads. Any help identifying the fault is appreciated.
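For context, CrawlProcess.py is essentially the standard CrawlerProcess runner; a simplified sketch of how the faulty spider is launched (the real script queues all of my spiders, details trimmed):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Simplified runner -- the real CrawlProcess.py adds every spider in the project.
process = CrawlerProcess(get_project_settings())
process.crawl("faultySpider")  # other spiders are queued with additional crawl() calls
process.start()  # blocks until all crawls have finished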
My spider code is:
import scrapy
from Crawlers.items import CrawlersItem
import os
import unicodedata


class CrawlersSpider(scrapy.Spider):
    name = "faultySpider"
    allowed_domains = ["####"]
    start_urls = ["####"]

    def parse(self, response):
        # One record per <h3> heading on the page.
        for index in range(len(response.css("h3::text").getall())):
            record = CrawlersItem()
            groupName = response.css("h3:nth-of-type(%d)::text" % (index + 1)).get()
            detail = response.css("h3:nth-of-type(%d) ~ p *::text" % (index + 1)).getall()
            # Normalise and strip the raw text fragments for this heading.
            cleanDetail = []
            for x in detail:
                cleanDetail.append(unicodedata.normalize("NFKD", x).lstrip())
            detailIterator = iter(cleanDetail)
            email = website = phone = postalAddress = None
            for x in detailIterator:
                if x.endswith("Register of Members' Interests"):
                    break
                elif x.startswith("Email:"):
                    email = next(detailIterator)
                elif x.startswith("Website:"):
                    website = next(detailIterator)
                elif x.startswith("Tel:"):
                    phone = x.replace("Tel: ", "")
                elif x.startswith("Contact:"):
                    postalAddress = x.replace("Contact: ", "").lstrip()
            # The spider's filename encodes the product target and group type.
            record["productTarget"] = os.path.basename(__file__).split("_")[0]
            record["groupType"] = os.path.basename(__file__).split("_")[1]
            record["groupName"] = groupName
            record["contactPerson"] = None
            record["email"] = email
            record["website"] = website
            record["phone"] = phone
            record["postalAddress"] = postalAddress
            yield record
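For completeness, CrawlersItem in Crawlers/items.py just declares one Field per key assigned above; roughly:

import scrapy

class CrawlersItem(scrapy.Item):
    # One Field per key used in the spiders.
    productTarget = scrapy.Field()
    groupType = scrapy.Field()
    groupName = scrapy.Field()
    contactPerson = scrapy.Field()
    email = scrapy.Field()
    website = scrapy.Field()
    phone = scrapy.Field()
    postalAddress = scrapy.Field()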
I do wonder if my pipeline, which works fine the rest of the time, might help identify the issue. Its code is:
import threading
from datetime import datetime
import json
from collections import OrderedDict


class CrawlersPipeline(object):
    # One shared output file and lock across all spiders run in the same process.
    lock = threading.Lock()
    filename = "CombinedScrapes_" + datetime.today().strftime('%d%m%Y')
    datafile = open(filename + ".jsonl", 'w', encoding="utf-8")

    def process_item(self, item, spider):
        # Serialise each item as one JSON line, preserving field order.
        line = json.dumps(OrderedDict(item), ensure_ascii=False, sort_keys=False) + "\n"
        CrawlersPipeline.lock.acquire()
        CrawlersPipeline.datafile.write(line)
        CrawlersPipeline.lock.release()
        return item
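The pipeline is enabled in settings.py in the usual way (the module path and priority value here are just illustrative):

# settings.py
ITEM_PIPELINES = {
    "Crawlers.pipelines.CrawlersPipeline": 300,
}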
