How do I stop Apache Airflow running a task the first time when I unpause it?

I have a DAG. Here is a sample of the parameters.

dag = DAG(
    'My Dag',
    default_args=default_args,
    description='Cron Job : My Dag',
    schedule_interval='45 07 * * *',
    # start_date=days_ago(0),
    start_date = datetime(2021, 4, 6, 10, 45),
    tags=['My Dag Tag'],
    concurrency = 1,
    is_paused_upon_creation=True,
    catchup=False # Don’t run previous and backfill; run only latest
)

From reading the Apache Airflow documentation, I think I have set the DAG to run at 07:45 every day. However, if I pause the DAG and unpause it a couple of days later, it still runs as soon as I unpause it (only once, for that day, since catchup=False prevents backfills).

That is not the expected behaviour, right?

I mean, I scheduled it for 07:45. When I unpause it at 10:00, it should not run at all until the next 07:45.

What am I missing here?



Solution 1:[1]

I assume that you are familiar with the scheduling mechanism of Airflow. If this is not the case, please read "Problem with start date and scheduled date in Apache Airflow" before reading the rest of the answer.

As for your case:

You had one or more runs, as expected, when you deployed the DAG. You paused the DAG on 2021-04-07 and unpaused it today (2021-04-19). Airflow then executed a DAG run with execution_date='2021-04-18'.

This is expected.

The reason lies in Airflow's scheduling mechanism.

Your last run was on 2021-04-07 and the interval is 45 07 * * * (every day at 07:45). Since you paused the DAG, the runs for 2021-04-08, 2021-04-09, ..., 2021-04-17 were never created. When you unpaused the DAG, Airflow didn't create these runs because of catchup=False. However, today's run (2021-04-19) isn't part of the catchup: the interval that starts at execution_date=2021-04-18 had already reached its end, so its run was scheduled and started immediately.
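The reasoning above can be sketched in plain Python. This is only a simplified model of the behaviour described here, not Airflow's scheduler code: the function name pending_intervals is hypothetical, the schedule is assumed to be exactly one run per day, and the last run is assumed to have execution_date 2021-04-07 07:45.

```python
from datetime import datetime, timedelta


def pending_intervals(last_execution_date, now, catchup):
    """Return the execution dates of daily intervals that completed since the last run.

    Hypothetical helper for illustration only. An interval is identified by
    its start (the execution_date) and becomes eligible to run only once its
    end (start + 1 day) has passed.
    """
    step = timedelta(days=1)
    starts = []
    start = last_execution_date + step
    while start + step <= now:  # the interval has fully elapsed
        starts.append(start)
        start += step
    # With catchup disabled, only the most recent completed interval is kept.
    return starts if catchup else starts[-1:]


last_run = datetime(2021, 4, 7, 7, 45)    # assumed execution_date of the last run
unpaused_at = datetime(2021, 4, 19, 10, 0)

runs = pending_intervals(last_run, unpaused_at, catchup=False)
print([d.strftime("%Y-%m-%d") for d in runs])  # ['2021-04-18']
```

With catchup=True the same call would return eleven intervals (2021-04-08 through 2021-04-18). Disabling catchup drops the first ten but still leaves the latest completed one, which is why a run starts right after unpausing.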

The behavior that you are experiencing is no different from deploying this fresh DAG:

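from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2020, 1, 1),  # far in the past on purpose
}

# catchup=False: skip the missed intervals, schedule only the latest one
with DAG(dag_id='stackoverflow_question',
         default_args=default_args,
         schedule_interval='45 07 * * *',
         catchup=False
         ) as dag:
    DummyOperator(task_id='some_task')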

As soon as you deploy it, a single run is created.

The DAG's start_date is 2020-01-01 with catchup=False. I deployed the DAG today (19/Apr/2021), so it created a run with execution_date='2021-04-18' that started to run today, 2021-04-19.
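The fresh-DAG case can be checked with a similar back-of-the-envelope calculation. The sketch below is again a simplified model rather than Airflow's implementation: first_run_without_catchup is a hypothetical name, and the schedule is assumed to be daily at 07:45.

```python
from datetime import datetime, time, timedelta


def first_run_without_catchup(start_date, now, at=time(7, 45)):
    """Return (execution_date, trigger_time) of the first run, or None.

    Hypothetical helper for a once-a-day schedule at `at` with catchup
    disabled. Returns None if no interval starting on or after `start_date`
    has completed by `now`.
    """
    # Most recent schedule tick that is not in the future: the interval end.
    end = datetime.combine(now.date(), at)
    if end > now:
        end -= timedelta(days=1)
    start = end - timedelta(days=1)  # interval start = execution_date
    if start < start_date:
        return None
    return start, end


deployed_at = datetime(2021, 4, 19, 10, 0)
print(first_run_without_catchup(datetime(2020, 1, 1), deployed_at))
```

For a start_date of 2020-01-01 and a deployment on 2021-04-19 at 10:00, this yields an execution date of 2021-04-18 07:45 whose interval ended at 2021-04-19 07:45. That end is already in the past, so the run starts immediately.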

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Peter Mortensen