Scrapy and crochet

I know this question has already been discussed, but I have not found a working solution yet.

I am working on a Django + Celery + Scrapy project in which users issue the scraping tasks.

Initially I had a task similar to this:

from scrapy.crawler import CrawlerProcess

@app.task(bind=True)
def run_task(self):
    ...
    process = CrawlerProcess(crawler_settings)
    process.crawl('quotes')
    process.start()  # the script will block here until the crawling is finished

Here crawler_settings contained a single setting that saved the execution output to a log file. This worked for one execution, but calling the task multiple times raised a ReactorNotRestartable error. This happens because process.start() runs the Twisted reactor and stops it when the crawl finishes, and a Twisted reactor cannot be started again once it has stopped.
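For context, one way to sidestep the restart problem is to give every crawl its own process, so each run gets a fresh reactor. The following is only a minimal sketch under assumptions, not a confirmed fix for this setup: the helper names build_crawl_command and run_spider_in_subprocess, the project_dir argument, and the log file name are all hypothetical. It relies on the scrapy command-line tool (scrapy crawl with its --logfile and -s options) being on the PATH and being run from the Scrapy project directory.

```python
import subprocess


def build_crawl_command(spider_name, log_file=None, settings=None):
    """Build the argument list for one `scrapy crawl` invocation.

    Hypothetical helper: `log_file` maps to Scrapy's --logfile option and
    each entry of `settings` becomes a `-s NAME=VALUE` override.
    """
    command = ["scrapy", "crawl", spider_name]
    if log_file is not None:
        command += ["--logfile", log_file]
    for key, value in (settings or {}).items():
        command += ["-s", f"{key}={value}"]
    return command


def run_spider_in_subprocess(spider_name, project_dir, timeout=200, **options):
    """Run one crawl in a child process and return its exit code.

    The Twisted reactor lives and dies inside the child, so the calling
    process (for example a Celery worker) never has to restart a reactor.
    Raises subprocess.TimeoutExpired if the crawl exceeds `timeout` seconds.
    """
    command = build_crawl_command(spider_name, **options)
    completed = subprocess.run(command, cwd=project_dir, timeout=timeout, check=False)
    return completed.returncode
```

Inside the Celery task this could be called as run_spider_in_subprocess('quotes', project_dir, log_file='quotes.log'), where project_dir is assumed to be the directory containing scrapy.cfg. The trade-off is the cost of starting a new Python process for every crawl.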

Afterwards I tried starting the reactor with process.start(stop_after_crawl=False) so that it would not be stopped. However, this kept the Celery task running even after the crawl had finished.

I started searching for ways to reuse the created Twisted reactor and found crochet.

Then I created a function using crochet's wait_for decorator, as suggested in their docs:

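from crochet import wait_for
from scrapy.crawler import CrawlerRunner

@wait_for(timeout=200)
def run_spider(spider_name, crawler_settings):
    runner = CrawlerRunner(crawler_settings)
    # crawl() returns a Deferred; wait_for blocks the caller until it fires
    deferred = runner.crawl(spider_name)
    return deferred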

And used it in my celery task:

@app.task(bind=True)
def run_task(self):
    ...
    run_spider('quotes', crawler_settings)

Unfortunately, this isn't working. Not only is no log file written, but a TimeoutError is also raised once the timeout expires, which suggests that the crawl never ran at all (it took about 15 seconds using the 'regular' approach).

Can a kind soul help me? Thanks!

PS: In this explanation I am using Scrapy's introductory spider ('quotes'), which isn't exactly what I use in my project. However, I have tried the 'quotes' example and the result is the same, so I hope it helps to reproduce the problem.



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
