Scrapy file output missing for a specific spider when using CrawlerProcess/pipeline
I have a CrawlProcess.py script that runs my spiders, and items are written to a .jsonl file via a pipeline. This works perfectly for every spider I have except one, which produces a blank .jsonl file instead. The console output shows 5 items scraped, and if I run the spider with scrapy crawl faultySpider -o output.jl, those items are present in the output file.
I can't tell what is different about this spider, and I can't find an explanation for this behaviour in other threads. Any help identifying the fault is appreciated.
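For context, CrawlProcess.py is essentially the standard CrawlerProcess runner; a simplified sketch of how the faulty spider is launched (the real script queues all of my spiders, details trimmed):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Simplified runner -- the real CrawlProcess.py adds every spider in the project.
process = CrawlerProcess(get_project_settings())
process.crawl("faultySpider")  # other spiders are queued with additional crawl() calls
process.start()  # blocks until all crawls have finished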
My spider code is:
import scrapy
from Crawlers.items import CrawlersItem
import os
import unicodedata


class CrawlersSpider(scrapy.Spider):
    name = "faultySpider"
    allowed_domains = ["####"]
    start_urls = ["####"]

    def parse(self, response):
        # One record per <h3> heading on the page.
        for index in range(len(response.css("h3::text").getall())):
            record = CrawlersItem()
            groupName = response.css("h3:nth-of-type(%d)::text" % (index + 1)).get()
            detail = response.css("h3:nth-of-type(%d) ~ p *::text" % (index + 1)).getall()
            # Normalise and strip the raw text fragments for this heading.
            cleanDetail = []
            for x in detail:
                cleanDetail.append(unicodedata.normalize("NFKD", x).lstrip())
            detailIterator = iter(cleanDetail)
            email = website = phone = postalAddress = None
            for x in detailIterator:
                if x.endswith("Register of Members' Interests"):
                    break
                elif x.startswith("Email:"):
                    email = next(detailIterator)
                elif x.startswith("Website:"):
                    website = next(detailIterator)
                elif x.startswith("Tel:"):
                    phone = x.replace("Tel: ", "")
                elif x.startswith("Contact:"):
                    postalAddress = x.replace("Contact: ", "").lstrip()
            # The spider's filename encodes the product target and group type.
            record["productTarget"] = os.path.basename(__file__).split("_")[0]
            record["groupType"] = os.path.basename(__file__).split("_")[1]
            record["groupName"] = groupName
            record["contactPerson"] = None
            record["email"] = email
            record["website"] = website
            record["phone"] = phone
            record["postalAddress"] = postalAddress
            yield record
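For completeness, CrawlersItem in Crawlers/items.py just declares one Field per key assigned above; roughly:

import scrapy

class CrawlersItem(scrapy.Item):
    # One Field per key used in the spiders.
    productTarget = scrapy.Field()
    groupType = scrapy.Field()
    groupName = scrapy.Field()
    contactPerson = scrapy.Field()
    email = scrapy.Field()
    website = scrapy.Field()
    phone = scrapy.Field()
    postalAddress = scrapy.Field()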
I do wonder if my pipeline, which works fine the rest of the time, might help identify the issue. Its code is:
import threading
from datetime import datetime
import json
from collections import OrderedDict


class CrawlersPipeline(object):
    # One shared output file and lock across all spiders run in the same process.
    lock = threading.Lock()
    filename = "CombinedScrapes_" + datetime.today().strftime('%d%m%Y')
    datafile = open(filename + ".jsonl", 'w', encoding="utf-8")

    def process_item(self, item, spider):
        # Serialise each item as one JSON line, preserving field order.
        line = json.dumps(OrderedDict(item), ensure_ascii=False, sort_keys=False) + "\n"
        CrawlersPipeline.lock.acquire()
        CrawlersPipeline.datafile.write(line)
        CrawlersPipeline.lock.release()
        return item
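The pipeline is enabled in settings.py in the usual way (the module path and priority value here are just illustrative):

# settings.py
ITEM_PIPELINES = {
    "Crawlers.pipelines.CrawlersPipeline": 300,
}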
