MWAA in production - tasks queued for unknown reasons

Does anyone use MWAA in production?

We currently have around 500 DAGs running, and we are seeing unexpected behavior: tasks stay in a "queued" state for no apparent reason.

Task is in the 'queued' state which is not a valid state for execution. The task must be cleared in order to be run.

It happens randomly: everything can run perfectly for a day, and then a few tasks get stuck in the queued state. They stay there forever unless we manually mark them as failed.

A DAG run can stay in this "queued" state even when the pool is empty, and I don't see any reason that would explain it.

It affects roughly 5% of the tasks; all the others run smoothly.

Have you ever encountered this behavior?



Solution 1:[1]

This was happening to me in MWAA as well. It only started after I upgraded to Airflow version 2.2.2. What worked for me was adding the following Airflow configuration option via the web UI:

Configuration option: celery.pool
Custom value: solo
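
If you would rather script this than click through the web UI, here is a minimal sketch of the same change using boto3. It is untested against a live environment and rests on a few assumptions: that boto3 is installed with credentials allowed to read and update the environment, and that your environment name is passed in by you (the name in the usage note is made up). The helper names `merge_airflow_options` and `apply_celery_pool_override` are hypothetical, written only for this illustration. The sketch reads the current options first and merges the new one in, so that any overrides you already have are sent along with it.

```python
from typing import Dict, Optional

# The single override from the answer above: celery.pool = solo
CELERY_POOL_OVERRIDE = {"celery.pool": "solo"}


def merge_airflow_options(
    existing: Optional[Dict[str, str]], overrides: Dict[str, str]
) -> Dict[str, str]:
    """Return a new dict with `overrides` applied on top of `existing`."""
    merged = dict(existing or {})  # copy, so the caller's dict is not mutated
    merged.update(overrides)
    return merged


def apply_celery_pool_override(environment_name: str) -> dict:
    """Hypothetical helper: set celery.pool=solo on an MWAA environment.

    Assumes boto3 is installed and AWS credentials are configured.
    """
    import boto3  # imported here so the merge helper works without boto3

    client = boto3.client("mwaa")
    # Read the options currently set on the environment.
    current = client.get_environment(Name=environment_name)["Environment"]
    options = merge_airflow_options(
        current.get("AirflowConfigurationOptions"), CELERY_POOL_OVERRIDE
    )
    # Submit the merged options; this starts an environment update.
    return client.update_environment(
        Name=environment_name,
        AirflowConfigurationOptions=options,
    )
```

Usage would look like `apply_celery_pool_override("my-mwaa-env")`, where `my-mwaa-env` is a placeholder. The environment goes through an update after the call, so expect a delay before the new setting applies.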

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Kevin Vo