Airflow Scheduler stops after MySQL host is rebooted
I am running Airflow 2.1.2 with the LocalExecutor and MySQL as the metadata store. MySQL runs on a different machine, and the Airflow scheduler runs as a systemd unit. Periodically the scheduler stops executing tasks, even though systemctl status reports the service as up. The webserver shows a warning that it hasn't received a scheduler heartbeat in some time, and no jobs are running.
Looking at the logs, this is the last entry before the scheduler goes quiet:
sqlalchemy.exc.OperationalError: (MySQLdb._exceptions.OperationalError) (2003, "Can't connect to MySQL server on 'XXXXXXXXXXXXX' (111)")
The timestamp on the log entry coincides with the time the MySQL server was rebooted. A different team manages that server, and they have a regular maintenance window for applying patches and rebooting. It takes roughly 5-10 minutes for the server to come back, and it seems that during this time Airflow stops trying to reconnect.
Is there a setting in Airflow I can change, or does anyone have experience implementing something to recover from this situation?
One idea I had was to create another service that checks for the heartbeat and restarts the scheduler if a certain amount of time passes without one, but I haven't gone down that route yet in case there is a better way to handle this.
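As a rough illustration of that watchdog idea, here is a minimal sketch in Python. It assumes the webserver's `/health` endpoint is reachable at `http://localhost:8080/health` and that the scheduler's systemd unit is named `airflow-scheduler`; both of those, along with the 5-minute threshold, are placeholders for illustration and not values from my setup. It is meant as a starting point, not a tested solution.

```python
import json
import subprocess
import urllib.request
from datetime import datetime, timedelta, timezone

# All three values below are assumptions for illustration; adjust to your setup.
HEALTH_URL = "http://localhost:8080/health"   # webserver health endpoint
SCHEDULER_UNIT = "airflow-scheduler"          # hypothetical systemd unit name
MAX_HEARTBEAT_AGE = timedelta(minutes=5)      # arbitrary staleness threshold


def scheduler_is_stale(health, now, max_age=MAX_HEARTBEAT_AGE):
    """Return True if the health payload shows no recent scheduler heartbeat."""
    heartbeat = health.get("scheduler", {}).get("latest_scheduler_heartbeat")
    if not heartbeat:
        # No heartbeat recorded at all: treat as stale.
        return True
    last_seen = datetime.fromisoformat(heartbeat)
    if last_seen.tzinfo is None:
        # Assume UTC when the timestamp carries no offset.
        last_seen = last_seen.replace(tzinfo=timezone.utc)
    return now - last_seen > max_age


def main():
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=10) as response:
            health = json.load(response)
    except (OSError, ValueError):
        # Webserver unreachable or response not JSON: skip this round.
        return

    # Restarting is pointless while the metadata database is still down.
    if health.get("metadatabase", {}).get("status") != "healthy":
        return

    if scheduler_is_stale(health, datetime.now(timezone.utc)):
        subprocess.run(["systemctl", "restart", SCHEDULER_UNIT], check=True)


if __name__ == "__main__":
    main()
```

Run from cron or a systemd timer every few minutes, this would only restart the scheduler once MySQL is reachable again and the heartbeat is older than the threshold. The process needs permission to run `systemctl restart` for that unit.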
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
