Schedule a DAG in airflow to run every 5 minutes - airflow

I have a DAG in Airflow that currently runs every hour (@hourly).
Is it possible to run it every 5 minutes?

Yes, here's an example of a DAG that I have running every 5 min:
dag = DAG(dag_id='eth_rates',
          default_args=args,
          schedule_interval='*/5 * * * *',
          dagrun_timeout=timedelta(seconds=5))
schedule_interval accepts a CRON expression: https://en.wikipedia.org/wiki/Cron#CRON_expression

The documentation states:
Each DAG may or may not have a schedule, which informs how DAG Runs
are created. schedule_interval is defined as a DAG arguments, and
receives preferably a cron expression as a str, or a
datetime.timedelta object.
Following the linked description of CRON expressions, it appears you can specify */5 * * * * to run the DAG every 5 minutes.
I'm not familiar with the matter, but this is what the documentation states.

Airflow 2 (I'm using 2.4.2) supports timedelta for scheduling DAGs on a particular cadence (hourly, every 5 minutes, etc.) so you can put:
schedule_interval = timedelta(minutes=5)
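One thing worth keeping in mind when choosing between the two forms: a cron expression like */5 * * * * describes wall-clock times (minutes 0, 5, 10, ...), while a timedelta only describes the gap between runs, so its runs need not land on round minutes. Below is a minimal stdlib-only sketch of that difference. The helper names are hypothetical, the anchor time is an assumption chosen for illustration, and this is not how Airflow computes its schedule internally.

```python
from datetime import datetime, timedelta


def next_cron_every_5_min(now):
    """Next wall-clock time matching '*/5 * * * *' strictly after `now`."""
    base = now.replace(second=0, microsecond=0)
    # Minutes until the next minute value divisible by 5.
    minutes_to_add = 5 - (base.minute % 5)
    return base + timedelta(minutes=minutes_to_add)


def next_interval_tick(anchor, interval, now):
    """Next time of the form anchor + n * interval strictly after `now`."""
    n = (now - anchor) // interval + 1
    return anchor + n * interval


now = datetime(2022, 11, 1, 12, 3, 30)
# Assumed anchor: an interval-based schedule that happened to begin at 12:02.
anchor = datetime(2022, 11, 1, 12, 2, 0)

print(next_cron_every_5_min(now))                            # 2022-11-01 12:05:00
print(next_interval_tick(anchor, timedelta(minutes=5), now))  # 2022-11-01 12:07:00
```

If the runs must line up with the clock (12:00, 12:05, ...), the cron form is the safer choice.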

Related

Airflow External Task Sensor with unscheduled upstream DAG

We use airflow in a hybrid ETL system. By this I mean that some of our DAGs are not scheduled but externally triggered using the Airflow API.
We are trying to do the following: Have a sensor in a scheduled DAG (DAG1) that senses that a task inside an externally triggered DAG (DAG2) has run.
For example, DAG1 runs at 11 am, and we want to be sure that DAG2 has run (due to an external trigger) at least once since 00:00. I tried setting execution_delta = timedelta(hours=11), but the sensor detects nothing. I think the problem is that the sensor looks for a run scheduled at exactly 00:00, which won't exist, because DAG2 can be triggered at any time between 00:00 and 11:00.
Is there any solution that can serve the purpose we need? I think we might need to create a custom Sensor, but it feels strange to me that the native Airflow Sensor does not solve this issue.
This is the sensor I'm defining:
from datetime import timedelta
from airflow.sensors import external_task
sensor = external_task.ExternalTaskSensor(
    task_id='sensor',
    dag=dag,
    external_dag_id='DAG2',
    external_task_id='sensed_task',
    mode='reschedule',
    check_existence=True,
    execution_delta=timedelta(hours=int(execution_type)),
    poke_interval=10 * 60,  # Check every 10 minutes
    timeout=1 * 60 * 60,  # Allow for 1 hour of delay in execution
)
I had the same problem and used the execution_date_fn parameter:
ExternalTaskSensor(
    task_id="sensor",
    external_dag_id="dag_id",
    execution_date_fn=get_most_recent_dag_run,
    mode="reschedule",
)
where the get_most_recent_dag_run function looks like this:
from airflow.models import DagRun
def get_most_recent_dag_run(dt):
    # Return the execution date of the latest run of the upstream DAG, if any.
    dag_runs = DagRun.find(dag_id="dag_id")
    dag_runs.sort(key=lambda x: x.execution_date, reverse=True)
    if dag_runs:
        return dag_runs[0].execution_date
This works because the ExternalTaskSensor needs both the dag_id and the exact execution date of the upstream run to resolve a cross-DAG dependency.
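Note that picking the most recent run does not by itself enforce the "at least once since 00:00" requirement from the question: the latest run could be from a previous day. A minimal sketch of the extra filtering step, written against plain datetimes so it can be tried without Airflow. The function name is hypothetical; inside a real execution_date_fn the list would presumably be built from the execution_date of each run returned by DagRun.find, and all datetimes are assumed to share the same timezone handling.

```python
from datetime import datetime


def latest_run_since_midnight(run_dates, now):
    """Return the most recent date in `run_dates` that falls between
    midnight of `now`'s day and `now` (inclusive), or None if there is none."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    candidates = [d for d in run_dates if midnight <= d <= now]
    return max(candidates, default=None)


runs = [
    datetime(2022, 1, 1, 23, 0),   # yesterday: must be ignored
    datetime(2022, 1, 2, 3, 0),
    datetime(2022, 1, 2, 9, 30),
]
print(latest_run_since_midnight(runs, datetime(2022, 1, 2, 11, 0)))  # 2022-01-02 09:30:00
```

What the sensor should do when nothing qualifies (the None case) is left open here and would need to be decided for the actual pipeline.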

How to force an Airflow task to restart at the new scheduling date?

I have this simple Airflow DAG:
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.bash import BashOperator
with DAG("Second Dag",
         start_date=datetime(2022, 1, 1),
         schedule_interval="0 5 * * *",
         catchup=False,
         max_active_runs=1
         ) as dag:
    task_a = BashOperator(
        task_id="ToRepeat",
        bash_command="cd /home/xdf/local/ && (env/bin/python workflow/test1.py)",
        retries=1,
    )
The task takes a variable amount of time from one run to the next, and I have no guarantee that it will finish by 5 A.M. the next day.
If the task is still running when a new run is scheduled to start, I need to kill the old one before the new one starts.
How can I design the Airflow DAG to automatically kill the old task if it's still running when a new one is scheduled to start?
More details:
I am looking for something dynamic. The old DAG run should be killed only when the new DAG run is starting. If, for any reason, the new run does not start for a week, the old run should be able to run for the entire week. That's why a fixed timeout is suboptimal.
You should set dagrun_timeout for your DAG.
dagrun_timeout: specify how long a DagRun should be up before
timing out / failing, so that new DagRuns can be created. The timeout
is only enforced for scheduled DagRuns.
Since your DAG runs daily you can set 24 hours for timeout.
with DAG("Second Dag",
         start_date=datetime(2022, 1, 1),
         schedule_interval="0 5 * * *",
         catchup=False,
         max_active_runs=1,
         dagrun_timeout=timedelta(hours=24)
         ) as dag:
If you want to set timeout on a specific task in your DAG you should use execution_timeout on your operator.
execution_timeout: max time allowed for the execution of this task instance, if it goes beyond it will raise and fail
Example:
MyOperator(task_id='task', execution_timeout=timedelta(hours=24))
If you really need a dynamic solution, you can combine the Airflow DAGRun API with XComs: push the current DAG run's run_id to XCom, and in each subsequent run pull that value and use it with the Airflow API to check for and kill the DAG run with that run_id.
check_previous_dag_run_id >> kill_previous_dag_run >> push_current_run_id >> your_main_task
and your API call task should be something like
...
kill_previous_dag_run = BashOperator(
    task_id="kill_previous_dag_run",
    bash_command=(
        "curl -X 'DELETE' "
        "'http://<<your_webserver_dns>>/api/v1/dags/<<your_dag_name>>/dagRuns/<<url_encoded_run_id>>' "
        "-H 'accept: */*' --user <<api_username>>:<<api_user_password>>"
    ),
    dag=dag
)
...
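The run_id has to be URL-encoded because run ids usually contain characters such as ":" and "+". A minimal sketch of building that URL with the standard library; the function name, the localhost base URL, the dag id, and the example run_id are all placeholders chosen for illustration, not values from the setup above.

```python
from urllib.parse import quote


def dag_run_delete_url(base_url, dag_id, run_id):
    """Build the REST endpoint URL for one DAG run, percent-encoding the ids.

    safe='' makes quote() encode every reserved character, including '/'.
    """
    return (
        f"{base_url.rstrip('/')}/api/v1/dags/{quote(dag_id, safe='')}"
        f"/dagRuns/{quote(run_id, safe='')}"
    )


url = dag_run_delete_url(
    "http://localhost:8080",                 # placeholder webserver address
    "second_dag",                            # placeholder dag id
    "scheduled__2022-01-01T05:00:00+00:00",  # placeholder run_id
)
print(url)
# http://localhost:8080/api/v1/dags/second_dag/dagRuns/scheduled__2022-01-01T05%3A00%3A00%2B00%3A00
```

The resulting string can be substituted for the URL in the curl command, or passed to any HTTP client together with the API credentials.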

Avoid expired dates in Airflow

I have the following airflow DAG:
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
#Runs every 1 minute
dag = DAG(dag_id='example_dag', start_date=datetime(2020, 1, 1), schedule_interval='*/1 * * * *')
t1 = BashOperator(task_id='bash_task', bash_command='echo Hello!', dag=dag)
Problem here is that Airflow is scheduling and executing tasks from past dates like the first minute of 2020, the second minute of 2020, the third minute of 2020 and so on.
I want Airflow to schedule and execute only the tasks that occur after the dag deploy (i.e. if I deploy today, I want the first task to be executed in the next minute) and not to execute expired tasks.
Any advice? Thanks!
I found the answer here. Read the "Catchup and Idempotent DAG" section.
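In short, that section is about the catchup setting: with catchup enabled, the scheduler creates a run for every interval that has elapsed since start_date, and with catchup=False (a DAG argument) it only creates a run for the latest interval. The sketch below is a deliberately simplified stdlib model of that behaviour, just to show the scale of the backlog in the question. The function name is hypothetical, the deploy date is an assumption, and the real scheduler logic has more cases than this.

```python
from datetime import datetime, timedelta


def intervals_to_run(start_date, now, interval, catchup):
    """Simplified model: how many DAG runs get created when the DAG is
    first enabled at `now`, given its `start_date` and schedule `interval`."""
    completed = max((now - start_date) // interval, 0)
    if catchup:
        return completed      # one run per elapsed interval
    return min(completed, 1)  # only the most recent interval, if any


start = datetime(2020, 1, 1)
deploy = datetime(2020, 1, 2)  # assumed: DAG enabled one day after start_date

print(intervals_to_run(start, deploy, timedelta(minutes=1), catchup=True))   # 1440
print(intervals_to_run(start, deploy, timedelta(minutes=1), catchup=False))  # 1
```

So for the DAG in the question, passing catchup=False to DAG(...) should stop the runs for early 2020 from being created.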

How can I schedule a DAG Airflow to run in 5 minutes from now for the first time?

Situation:
Airflow 1.10.6
It's November 18th, 8 pm
airflow.cfg.default_timezone = system (i.e. Europe/Berlin)
I want to run my new "sample_job" every day at 8:05 pm.
My configuration:
default_args = {
    'owner': 'Airflow',
    'start_date': datetime.datetime(year=2019, month=11, day=18, hour=20, minute=0),
    'execution_timeout': timedelta(hours=13)
}
dag = DAG(
    'sample_job',
    default_args=default_args,
    catchup=False,
    max_active_runs=1,
    schedule_interval='05 20 * * *')
Now when I activate the job at 8.03 pm I realize that the job is executed immediately with yesterday's date as last_run date.
How do I have to change my settings so that the job is not executed before 8.05 pm?
The very first DAG run is triggered soon after start_date + schedule_interval [1]. Your schedule interval is one day and you want the first DAG run to start after 2019-11-18 20:05, so your start_date should be 2019-11-17 20:05.
As to why a DAG run starts as soon as you turn the DAG on, I suspect the reason is that you previously scheduled this DAG with a different start_date or schedule_interval. If start_date or schedule_interval is changed, it is recommended to change the dag_id too [2], because then a fresh set of metadata (and a new schedule) is created for the renamed DAG.
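The start_date arithmetic from the first paragraph can be checked with plain datetimes. This is a minimal sketch of the "start_date + schedule_interval" rule only; the function names are hypothetical and it models a fixed one-day interval rather than parsing the cron string.

```python
from datetime import datetime, timedelta


def first_trigger_time(start_date, schedule_interval):
    """The first DAG run is triggered once the first interval has ended."""
    return start_date + schedule_interval


def start_date_for_first_trigger(desired_first_trigger, schedule_interval):
    """Inverse: the start_date needed for the first run to fire at a given time."""
    return desired_first_trigger - schedule_interval


one_day = timedelta(days=1)

print(first_trigger_time(datetime(2019, 11, 17, 20, 5), one_day))
# 2019-11-18 20:05:00
print(start_date_for_first_trigger(datetime(2019, 11, 18, 20, 5), one_day))
# 2019-11-17 20:05:00
```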

Airflow scheduler not scheduling simple DAG task immediately

I have scheduled a DAG with a simple bash task to run every 5th minute:
# bash_dag.py
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
default_args = {
    'owner': 'airflow',
    'start_date': datetime(2019, 5, 30)
}
dag = DAG(
    'bash_count',
    default_args=default_args,
    schedule_interval='*/5 * * * *',
    catchup=False
)
t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag
)
Scheduling works fine: the DAG executes at every 5-minute mark. However, I have noticed a significant delay between the 5-minute mark and the task queueing time. In the examples shown in the image, queueing takes between 3 and 50 seconds. For example, the last DAG execution in the image should have been triggered just after 20:05:00, but the task instance was queued 28 seconds later (20:05:28).
I'm surprised this is the case, since the DAG being scheduled has a single very simple task. Is this a normal airflow delay? Should I expect further delays when dealing with more complex DAGs?
I'm running a local airflow server with Postgres as db on a 16 GB Mac with OS Mojave. Machine is not resource constrained.

Resources