Scrapy - ReactorAlreadyInstalledError when using TwistedScheduler

I have the following Python code to start APScheduler/TwistedScheduler cronjob to start the spider.

Using one spider worked great, but using two spiders results in the error: twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed.

I did find a related question that uses CrawlerRunner as the solution. However, I'm using a TwistedScheduler object, so I don't know how to get this working with multiple cron jobs (multiple add_job() calls).

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from apscheduler.schedulers.twisted import TwistedScheduler

from myprojectscraper.spiders.my_homepage_spider import MyHomepageSpider
from myprojectscraper.spiders.my_spider import MySpider

process = CrawlerProcess(get_project_settings())
# Start the crawler in a scheduler
scheduler = TwistedScheduler(timezone="Europe/Amsterdam")
# Use a cron job; runs the 'homepage' spider every 4 hours (e.g. 12:10, 16:10, 20:10)
scheduler.add_job(process.crawl, 'cron', args=[MyHomepageSpider], hour='*/4', minute=10)
# Use a cron job; runs the full spider every Monday, Thursday and Saturday at 04:35
scheduler.add_job(process.crawl, 'cron', args=[MySpider], day_of_week='mon,thu,sat', hour=4, minute=35)
scheduler.start()
process.start(False)


Solution 1:[1]

I'm now using the BlockingScheduler in combination with Process and CrawlerRunner, and enabling logging via configure_logging().

from multiprocessing import Process

from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from scrapy.utils.log import configure_logging
from apscheduler.schedulers.blocking import BlockingScheduler

from myprojectscraper.spiders.my_homepage_spider import MyHomepageSpider
from myprojectscraper.spiders.my_spider import MySpider

from twisted.internet import reactor

# Create Process around the CrawlerRunner
class CrawlerRunnerProcess(Process):
    def __init__(self, spider):
        Process.__init__(self)
        self.runner = CrawlerRunner(get_project_settings())
        self.spider = spider

    def run(self):
        deferred = self.runner.crawl(self.spider)
        deferred.addBoth(lambda _: reactor.stop())
        reactor.run(installSignalHandlers=False)

# The wrapper to make it run multiple spiders, multiple times
def run_spider(spider):
    crawler = CrawlerRunnerProcess(spider)
    crawler.start()
    crawler.join()

# Enable logging when using CrawlerRunner
configure_logging()

# Start the crawler in a scheduler
scheduler = BlockingScheduler(timezone="Europe/Amsterdam")
# Use a cron job; runs the 'homepage' spider every 4 hours (e.g. 12:10, 16:10, 20:10)
scheduler.add_job(run_spider, 'cron', args=[MyHomepageSpider], hour='*/4', minute=10)
# Use a cron job; runs the full spider every Monday, Thursday and Saturday at 04:35
scheduler.add_job(run_spider, 'cron', args=[MySpider], day_of_week='mon,thu,sat', hour=4, minute=35)
scheduler.start()

The script at least doesn't exit immediately (it blocks). I now get the following output, as expected:

2022-03-31 22:50:24 [apscheduler.scheduler] INFO: Adding job tentatively -- it will be properly scheduled when the scheduler starts
2022-03-31 22:50:24 [apscheduler.scheduler] INFO: Adding job tentatively -- it will be properly scheduled when the scheduler starts
2022-03-31 22:50:24 [apscheduler.scheduler] INFO: Added job "run_spider" to job store "default"
2022-03-31 22:50:24 [apscheduler.scheduler] INFO: Added job "run_spider" to job store "default"
2022-03-31 22:50:24 [apscheduler.scheduler] INFO: Scheduler started
2022-03-31 22:50:24 [apscheduler.scheduler] DEBUG: Looking for jobs to run
2022-03-31 22:50:24 [apscheduler.scheduler] DEBUG: Next wakeup is due at 2022-04-01 00:10:00+02:00 (in 4775.280995 seconds)

Since BlockingScheduler's start() is a blocking call, the script does not exit; the scheduler keeps running the scheduled jobs indefinitely.
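Why does running each crawl in its own Process help? Twisted's default reactor can be started only once per process, and every new process starts with fresh interpreter state. A minimal stand-in sketch of that idea (using plain subprocess instead of multiprocessing.Process, and a toy start() flag instead of the real reactor):

```python
import subprocess
import sys

# A toy "reactor": a module-level flag that refuses to be started twice,
# mimicking twisted.internet.reactor's one-start-per-process restriction.
code = """
started = False

def start():
    global started
    if started:
        raise RuntimeError("reactor not restartable")
    started = True

start()
print("ok")
"""

# Each run gets a fresh interpreter, so start() succeeds every time,
# just as each CrawlerRunnerProcess gets a fresh reactor.
results = []
for _ in range(2):
    out = subprocess.run([sys.executable, "-c", code],
                         capture_output=True, text=True, check=True)
    results.append(out.stdout.strip())
print(results)  # ['ok', 'ok']
```

Running the same start() twice in a single interpreter would raise, which is the toy equivalent of the ReactorAlreadyInstalledError from the question.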

Solution 2:[2]

https://docs.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script

There’s another Scrapy utility that provides more control over the crawling process: scrapy.crawler.CrawlerRunner. This class is a thin wrapper that encapsulates some simple helpers to run multiple crawlers, but it won’t start or interfere with existing reactors in any way.
It’s recommended you use CrawlerRunner instead of CrawlerProcess if your application is already using Twisted and you want to run Scrapy in the same reactor.

https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process

By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process using the internal API.


from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from scrapy.utils.log import configure_logging
from twisted.internet import reactor
from apscheduler.schedulers.twisted import TwistedScheduler

from myprojectscraper.spiders.my_homepage_spider import MyHomepageSpider
from myprojectscraper.spiders.my_spider import MySpider

configure_logging()

runner = CrawlerRunner(get_project_settings())
scheduler = TwistedScheduler(timezone="Europe/Amsterdam")
# Use a cron job; runs the 'homepage' spider every 4 hours (e.g. 12:10, 16:10, 20:10)
scheduler.add_job(runner.crawl, 'cron', args=[MyHomepageSpider], hour='*/4', minute=10)
# Use a cron job; runs the full spider every Monday, Thursday and Saturday at 04:35
scheduler.add_job(runner.crawl, 'cron', args=[MySpider], day_of_week='mon,thu,sat', hour=4, minute=35)

deferred = runner.join()
deferred.addBoth(lambda _: reactor.stop())

scheduler.start()
reactor.run()  # the script will block here until all crawling jobs are finished
scheduler.shutdown()
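For intuition, reactor.run() here plays the role of a blocking event loop: the script parks inside the loop while the scheduler fires jobs. A rough stdlib analogy of that shape using sched (a stand-in for illustration, not APScheduler or Twisted; delays shortened to milliseconds):

```python
import sched
import time

ran = []
s = sched.scheduler(time.time, time.sleep)
# Two queued "jobs", like the two add_job() calls above.
s.enter(0.01, 1, lambda: ran.append("homepage"))
s.enter(0.02, 1, lambda: ran.append("full"))
s.run()      # blocks, like reactor.run(), until the event queue is drained
print(ran)   # ['homepage', 'full']
```

The difference is that a cron trigger never runs out of future fire times, so the real script stays in reactor.run() indefinitely instead of returning when the queue empties.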

Solution 3:[3]

You could try deleting the installed reactor before starting the process:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from apscheduler.schedulers.twisted import TwistedScheduler

from myprojectscraper.spiders.my_homepage_spider import MyHomepageSpider
from myprojectscraper.spiders.my_spider import MySpider
import sys  # <-- import sys here

process = CrawlerProcess(get_project_settings())
# Start the crawler in a scheduler
scheduler = TwistedScheduler(timezone="Europe/Amsterdam")
# Use a cron job; runs the 'homepage' spider every 4 hours (e.g. 12:10, 16:10, 20:10)
scheduler.add_job(process.crawl, 'cron', args=[MyHomepageSpider], hour='*/4', minute=10)
# Use a cron job; runs the full spider every Monday, Thursday and Saturday at 04:35
scheduler.add_job(process.crawl, 'cron', args=[MySpider], day_of_week='mon,thu,sat', hour=4, minute=35)
scheduler.start()

if "twisted.internet.reactor" in sys.modules:
    del sys.modules["twisted.internet.reactor"]  # <-- delete the Twisted reactor if one is already installed

process.start(False)


This was what worked for me.
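The trick works because deleting an entry from sys.modules forces the next import to re-execute the module and bind a brand-new module object, which is what lets Scrapy install a fresh reactor afterwards. A stdlib illustration, using json as a harmless stand-in for twisted.internet.reactor:

```python
import sys
import json

first = sys.modules["json"]
del sys.modules["json"]    # same move as deleting "twisted.internet.reactor"
import json                # re-executes the module: a brand-new module object

print(json is first)  # False
```

Note that any other references to the old module object keep working; only future imports see the fresh copy.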

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1:
Solution 2:
Solution 3: richardec