Airflow using ExternalTaskSensor Operator caused MySQL InnoDB deadlock
I use the ExternalTaskSensor operator in Airflow to manage dependencies between DAGs. My ExternalTaskSensor code looks like this:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.sensors.external_task import ExternalTaskSensor  # Airflow 2.x import path

# InitConf is my own configuration helper (not shown here).
dag = DAG(
    dag_id='sushi.batch.load.application.detail.1d',
    default_args=InitConf.getArgs(start_date=datetime(2021, 12, 9)),
    description='Load Application Detail Data',
    schedule_interval='00 */3 * * *',
    tags=['sushi', 'develop']
)
monitor_handleApplicationData = ExternalTaskSensor(
    task_id='wait_for_application_handle_end_detail',
    # Look for the external task run whose execution date is 35 minutes later.
    execution_date_fn=lambda dt: dt + timedelta(minutes=35),
    external_dag_id='sushi.batch.handle.application.1d',
    external_task_id='application_handle_end',
    timeout=7200,
    allowed_states=['success'],
    mode='reschedule',
    poke_interval=60,
    check_existence=True,
    dag=dag,
)
The sensor runs in reschedule mode, so it takes up a worker slot only while it is checking and sleeps for a set duration between checks.
But I found that the Airflow scheduler sometimes crashes because of a MySQL InnoDB deadlock, so I have to restart the scheduler often. Here is some log output that I collected in the Airflow scheduler Docker container:
sqlalchemy.exc.OperationalError: (MySQLdb._exceptions.OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction')
[SQL: UPDATE task_instance SET external_executor_id=%s WHERE task_instance.task_id = %s AND task_instance.dag_id = %s AND task_instance.execution_date = %s]
[parameters: (('2b14b7a2-46ef-4ec1-b16b-5f6b1f0610d2', 'wait_for_application_handle_end_detail', 'sushi.batch.load.application.detail.1d', datetime.datetime(2022, 5, 20, 0, 0)), ('4e878253-f0dd-4465-a0d1-39dbc444b882', 'wait_for_application_handle_end_dict', 'sushi.batch.application.dict.handle.1d', datetime.datetime(2022, 5, 20, 0, 0)), ('4bafb4a2-c614-41e0-bd1b-5c47dd5334aa', 'wait_for_application_handle_end_dict_test', 'sushi.batch.application.dict.handle.test.1d', datetime.datetime(2022, 5, 20, 0, 0)))]
It shows that one UPDATE statement hit the deadlock. I'll call it SQL 1:
UPDATE task_instance SET external_executor_id='2b14b7a2-46ef-4ec1-b16b-5f6b1f0610d2'
WHERE task_instance.task_id = 'wait_for_application_handle_end_detail'
AND task_instance.dag_id = 'sushi.batch.load.application.detail.1d'
AND task_instance.execution_date = '2022-05-20 00:00:00'
In the MySQL task_instance table schema, the primary key is (task_id, dag_id, execution_date). When updating, the InnoDB engine first locks the rows that satisfy the condition on the task_id column, so a deadlock is indeed possible if two tasks in two different DAGs share the same task_id. But my dag_id and task_id values are unique across all DAGs and tasks, so there should be no reason for a deadlock. So I checked the MySQL transaction log and found another UPDATE statement, which I'll call SQL 2:
UPDATE task_instance SET state='scheduled'
WHERE task_instance.dag_id='sushi.batch.load.application.detail.1d'
AND task_instance.execution_date='2022-05-20 00:00:00'
AND task_instance.task_id IN ('wait_for_application_handle_end_detail')
I think I know why the deadlock happened: SQL 1 and SQL 2 might execute at the same time, and both target the task_id wait_for_application_handle_end_detail. I know why SQL 2 is executed: my ExternalTaskSensor runs in reschedule mode with a poke interval of 60s, which means SQL 2 executes every 60 seconds to change the task's current state. But I don't know why SQL 1 is executed. What is external_executor_id used for?
I know that changing the ExternalTaskSensor mode to poke might solve this problem, but then it would take up a worker slot for its entire runtime. Is there any other solution besides this?
Solution 1:[1]
Both of those UPDATEs will run faster with this composite index:
INDEX(dag_id, execution_date, task_id)
By being indexed and running faster, most (or maybe all) deadlocks will be prevented.
Even so, you should replay the query if it does encounter a deadlock.
Do you have any explicit transactions (e.g., with BEGIN and COMMIT)?
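To illustrate the "replay on deadlock" advice, here is a minimal Python sketch. It only applies to code you control; the scheduler's own statements are issued by Airflow itself, so this does not patch them. The names `run_with_deadlock_retry`, `is_deadlock`, and the `txn` callable are hypothetical, and the `.orig` lookup assumes the error arrives wrapped by SQLAlchemy, as in the log above.

```python
import random
import time

MYSQL_DEADLOCK_ERRNO = 1213  # error code shown in the scheduler log above


def is_deadlock(exc):
    """Return True if exc (or the driver error it wraps) carries errno 1213."""
    # SQLAlchemy's DBAPIError subclasses expose the driver exception as `.orig`;
    # fall back to the exception itself when there is no wrapper.
    inner = getattr(exc, "orig", None) or exc
    return bool(inner.args) and inner.args[0] == MYSQL_DEADLOCK_ERRNO


def run_with_deadlock_retry(txn, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call txn() and replay it if it fails with a deadlock.

    txn is a hypothetical callable that runs one whole transaction
    (begin, statements, commit) and rolls back on failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except Exception as exc:
            # Re-raise anything that is not a deadlock, and give up on the last attempt.
            if not is_deadlock(exc) or attempt == max_attempts:
                raise
            # Exponential backoff with jitter so the competing sessions
            # are less likely to collide again on the replay.
            sleep(base_delay * 2 ** (attempt - 1) * (1 + random.random()))
```

The whole transaction is replayed, not just the failing statement, because InnoDB rolls back the transaction it picks as the deadlock victim.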
Solution 2:[2]
The query will be something (but not exactly) like this:
SELECT id, name, case when rn = 1 then 'yes' else 'no' end as IsFirst
FROM (
SELECT *, row_number() over (partition by id order by id, name) as rn
FROM `MyTable`
) AS t
ORDER BY id, IsFirst
The reason for the "not exactly" is this won't match your results for id 2, because Cassava follows logically after Carrot in that group.
The problem here is tables are not ordered by definition. Insert order doesn't matter. Primary key doesn't matter. At least, not enough for full determinism. The database is free to re-order the table as needed, and it is very possible to get different results from one run to the next unless you can give us a reference within the actual data that specifically determines what the order is within each id value.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Rick James |
| Solution 2 | Joel Coehoorn |
