I have the following airflow DAG:
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
#Runs every 1 minute
dag = DAG(dag_id='example_dag', start_date=datetime(2020, 1, 1), schedule_interval='*/1 * * * *')
t1 = BashOperator(task_id='bash_task', bash_command='echo Hello!', dag=dag)
Problem here is that Airflow is scheduling and executing tasks from past dates like the first minute of 2020, the second minute of 2020, the third minute of 2020 and so on.
I want Airflow to schedule and execute only the tasks that occur after the dag deploy (i.e. if I deploy today, I want the first task to be executed in the next minute) and not to execute expired tasks.
Any advice? Thanks!
I found the answer here. Read the "Catchup and Idempotent DAG" section.
Related
I have an Airflow DAG set up to run monthly (with the #monthly time_interal). The next dag runs seem to be scheduled but they don't appear as "queued" in the Airflow UI. I don't understant because everythink seems good otherwise. Here is how my DAG is configured :
with DAG(
"dag_name",
start_date=datetime(2023, 1, 1),
schedule_interval="#monthly",
catchup=True,
default_args={"retries": 5, "retry_delay": timedelta(minutes=1)},
) as dag:
Do you get no runs at all when you unpause your DAG or one that is being backfilled and it says Last run 2023-01-01, 00:00:00?
In the latter case Airflow is behaving as intended, the run that just happened was the one that would have actually been queued and ran at midnight on 2023-02-01. :)
I used your configuration on a new simple DAG and it gave me one backfilled successful run with the run ID scheduled__2023-01-01T00:00:00+00:00 so running for the data interval 2023-01-01 (logical_date) to 2023-02-01, which means the Run that would have actually been queued at midnight on 2023-02-01.
The next run is scheduled for the logical date 2023-02-01 which means for the data from 2023-02-01 to 2023-03-01. This run will only actually be queued and happen at midnight 2023-03-01 as the Run After date shows:
This guide might help with terminology Airflow uses around schedules.
I'm assuming you wanted the DAG to backfill two runs, one that would have happened on 2023-01-01 and one that would have happened on 2023-02-01. This DAG should do that:
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.empty import EmptyOperator
with DAG(
"dag_name_3",
start_date=datetime(2022, 12, 1),
schedule_interval="#monthly",
catchup=True,
default_args={"retries": 5, "retry_delay": timedelta(minutes=1)},
) as dag:
t1 = EmptyOperator(task_id="t1")
I wanted to have my dags to have first run on 2:00 AM on 25th and then onwards Tuesday to Sat daily run at 2:00 am.
following is how my scheduling look like.
with DAG(
dag_id='in__xxx__agnt_brk_com',
schedule_interval='0 2 * * 2-6',
start_date=datetime(2022, 10, 24),
catchup=False,
) as dag:
And on Airflow UI also it shows that my first run should be on 25th 2:00 AM. But unfortunately, dags didn't execute on time.
What I am missing here ?
Airflow is scheduling at 2am in your local time, that is 6am in UTC.
Take a look at this link on how to specify the timezone in your dag: https://airflow.apache.org/docs/apache-airflow/stable/timezone.html
You should consult the documentation on dag intervals.
Your dag did not run on the 25th:
Your start date is 2022-10-24
This creates a dag interval of 2022-10-24 - 2022-10-25.
You set catchup=False
You created the dag after midnight on the 25th. The dag interval for the 24th has passed and you've denied catchup.
The next dag is scheduled for 2022-10-25
This creates a dag interval of 2022-10-25 - 2022-10-26
Your dag will run at 2am UTC on the 26th.
I have this simple Airflow DAG:
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.bash import BashOperator
with DAG("Second Dag",
start_date=datetime(2022,1,1),
schedule_interval="0 5 * * *",
catchup=False,
max_active_runs=1
) as dag:
task_a = BashOperator(
task_id="ToRepeat",
bash_command="cd /home/xdf/local/ && (env/bin/python workflow/test1.py)",
retries =1,
)
The task takes a variable amount of time between one run and the other, and I don't have any guarantee that it will be finished within the 5 A.M of the next day.
If the task is still running when a new task is scheduled to start, I need to kill the old one before it starts running.
How can I design Airflow DAG to automatically kill the old task if it's still running when a new task is scheduled to start?
More details:
I am looking for something dynamic. The old DAG should be killed only when the new DAG is starting. If, for any reason, the new DAG does not start for one week, then old DAG should be able to run for an entire week. That's why using a timeout is sub-optimal
You should set dagrun_timeout for your DAG.
dagrun_timeout: specify how long a DagRun should be up before
timing out / failing, so that new DagRuns can be created. The timeout
is only enforced for scheduled DagRuns.
Since your DAG runs daily you can set 24 hours for timeout.
with DAG("Second Dag",
start_date=datetime(2022,1,1),
schedule_interval="0 5 * * *",
catchup=False,
max_active_runs=1
dagrun_timeout=timedelta(hours=24)
) as dag:
If you want to set timeout on a specific task in your DAG you should use execution_timeout on your operator.
execution_timeout: max time allowed for the execution of this task instance, if it goes beyond it will raise and fail
Example:
MyOperator(task_id='task', execution_timeout=timedelta(hours=24))
If you really are looking for a dynamic solution; you can take help of Airflow DAGRun APIs and Xcoms; you can push your current dag run_id to Xcom and for subsequent runs you can pull this Xcom to consume with airflow API to check and kill the dag run with that run_id.
check_previous_dag_run_id >> kill_previous_dag_run >> push_current_run_id >> your_main_task
and your API call task should be something like
...
kill_previous_dag_run = BashOperator(
task_id="kill_previous_dag_run",
bash_command="curl -X 'DELETE' \
'http://<<your_webserver_dns>>/api/v1/dags/<<your_dag_name>>/dagRuns/<<url_encoded_run_id>>' \
-H 'accept: */*' --user <<api_username>>:<<api_user_password>>",
dag=dag
)
...
Situation:
Airflow 1.10.6
it's November, 18th, 8.pm
airflow.cfg.default_timezone = system (i.e. Europe/Berlin)
I want to run my new "sample_job" every day at 8.05 p.m.
My configuration:
default_args = {
'owner': 'Airflow',
'start_date': datetime.datetime(year=2019,month=11,day=18,hour=20,minute=0),
'execution_timeout' : timedelta(hours=13)
}
dag = DAG(
'sample_job',
default_args=default_args,
catchup=False,
max_active_runs=1,
schedule_interval='05 20 * * *')
Now when I activate the job at 8.03 pm I realize that the job is executed immediately with yesterday's date as last_run date.
How do I have to change my settings so that the job is not executed before 8.05 pm?
The very first DAG run is triggered soon after start_date + schedule_interval [1]. Your schedule interval is one day and you want the first DAG run to start after 2019-11-18 20:05, so your start_date should be 2019-11-17 20:05.
As to why a DAG run is started as soon as you turns the DAG on, I suspect the reason for this is that you scheduled this DAG with a different start_date or schedule_interval before. If start_date or schedule_interval is changed it is recommended to change the dag_id too [2], because then a fresh set of metadata (and a new schedule) is created for the renamed DAG.
I have scheduled a DAG with a simple bash task to run every 5th minute:
# bash_dag.py
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
default_args = {
'owner': 'airflow',
'start_date' : datetime(2019, 5, 30)
}
dag = DAG(
'bash_count',
default_args=default_args,
schedule_interval='*/5 * * * *',
catchup = False
)
t1 = BashOperator(
task_id='print_date',
bash_command='date',
dag=dag
)
Scheduling works fine, DAG is executing every 5th minute threshold. However, I have noticed that there is a significant delay between the 5th minute threshold and task queueing time. For the examples shown in the image, task queueing takes in between 3 to 50 seconds. For example, last DAG execution in the image was supposed to be triggered after 20:05:00 but task instance was queued 28 seconds later (20:05:28).
I'm surprised this is the case, since the DAG being scheduled has a single very simple task. Is this a normal airflow delay? Should I expect further delays when dealing with more complex DAGs?
I'm running a local airflow server with Postgres as db on a 16 GB Mac with OS Mojave. Machine is not resource constrained.