Airflow Scheduler stops after MySQL host is rebooted

I am running Airflow 2.1.2 with the LocalExecutor and MySQL as the metadata store. MySQL runs on a different machine, and the Airflow scheduler runs as a systemd unit. Periodically the scheduler stops executing tasks, although systemctl status reports the service as up. The webserver shows a warning that it hasn't received a heartbeat in some time, and no jobs are running.

Looking at the logs, this is the last entry before the scheduler goes quiet:

sqlalchemy.exc.OperationalError: (MySQLdb._exceptions.OperationalError) (2003, "Can't connect to MySQL server on 'XXXXXXXXXXXXX' (111)")

The timestamp on the log coincides with the time the MySQL server was rebooted. A different team manages that server, and they have a regular maintenance window for applying patches and rebooting. It probably takes 5-10 minutes for the server to come back, but it seems that during this time Airflow stops trying to reconnect.

Is there a setting in Airflow I can change, or does anyone have experience implementing something to recover from this situation?

I had an idea to create another service that checks for the heartbeat and restarts the scheduler if a certain amount of time passes without one, but I haven't gone down that route yet in case there is a better way to handle this.
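
For illustration, a minimal sketch of what such a watchdog service could look like. It polls the webserver's /health endpoint, which reports the latest scheduler heartbeat, and restarts the scheduler unit when the heartbeat is too old. The URL, the unit name airflow-scheduler, and both thresholds are assumptions made for the example, not values from the setup described above.

```python
import json
import subprocess
import time
import urllib.request
from datetime import datetime, timezone

# All four values below are assumptions for this sketch; adjust to your setup.
HEALTH_URL = "http://localhost:8080/health"  # hypothetical webserver address
SCHEDULER_UNIT = "airflow-scheduler"         # hypothetical systemd unit name
MAX_HEARTBEAT_AGE = 300                      # seconds without a heartbeat before restarting
CHECK_INTERVAL = 60                          # seconds between checks


def heartbeat_is_stale(health, now, max_age=MAX_HEARTBEAT_AGE):
    """Return True if the /health payload shows no recent scheduler heartbeat."""
    raw = (health.get("scheduler") or {}).get("latest_scheduler_heartbeat")
    if not raw:
        # No heartbeat recorded at all (or the metadata DB could not be read).
        return True
    try:
        # Assumes an ISO 8601 timestamp such as 2021-08-01T12:00:00+00:00.
        last = datetime.fromisoformat(raw)
    except ValueError:
        return True
    if last.tzinfo is None:
        # Treat naive timestamps as UTC.
        last = last.replace(tzinfo=timezone.utc)
    return (now - last).total_seconds() > max_age


def fetch_health(url=HEALTH_URL):
    """Fetch and decode the JSON document served by the webserver."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def main():
    while True:
        try:
            stale = heartbeat_is_stale(fetch_health(), datetime.now(timezone.utc))
        except (OSError, ValueError):
            # Webserver unreachable or bad JSON: no basis for a decision, so skip.
            stale = False
        if stale:
            # Needs privileges to manage the unit (e.g. run the watchdog as root).
            subprocess.run(["systemctl", "restart", SCHEDULER_UNIT], check=False)
        time.sleep(CHECK_INTERVAL)
```

Calling main() from a small systemd service of its own would keep the check running. If MySQL is still down when the restart fires, the scheduler will fail again and the next check will simply retry, so MAX_HEARTBEAT_AGE is best kept shorter than the maintenance window but long enough to avoid restarting a scheduler that is merely slow.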



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
