What's the proper sequence of Airflow commands to run to schedule a DAG?

I don't understand which command(s) I need to run to get a DAG scheduled. Let's say I tested the first task with airflow test dag_name task_id_1 2017-06-22 and the second task with airflow test dag_name task_id_2 2017-06-22.
I ran airflow trigger_dag dag_name, but does that only instantiate the DAG for that one moment?
Let's say I want the dag_name's timing/scheduling to look like:
'start_date': datetime.datetime(2017, 6, 22, 18),
'end_date': datetime.datetime(2017, 6, 23, 20),
schedule_interval = datetime.timedelta(1)
So I just want to schedule and run it today and tomorrow, starting at 18:00 UTC today and again 24 hours after that.
Now what command or list of commands am I supposed to run? Do I have to run airflow scheduler every time I want to add and schedule a DAG?

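trigger_dag triggers a single DAG run immediately; it does not put the DAG on a schedule. To schedule the DAG, just put it in the DAG folder, go to the Airflow UI and enable the DAG. The airflow scheduler process has to be running for scheduled runs to happen, but it is a long-running service: you start it once and it picks up new DAG files from the DAG folder, so you do not need to run it again for every new DAG.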
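To see when the DAG from the question would actually run once it is enabled, here is a rough sketch in plain Python. It assumes Airflow's documented rule that a scheduled run is triggered once the interval it covers has ended, and it assumes that end_date bounds the logical date rather than the trigger time. planned_runs is a hypothetical helper for illustration, not an Airflow API.

```python
from datetime import datetime, timedelta


def planned_runs(start_date, end_date, interval):
    """Return (logical_date, earliest_trigger_time) pairs for a timedelta schedule."""
    runs = []
    logical_date = start_date
    # Assumption: end_date limits the logical date, not the trigger time.
    while logical_date <= end_date:
        # A run is triggered only after the interval it covers has ended.
        runs.append((logical_date, logical_date + interval))
        logical_date += interval
    return runs


runs = planned_runs(
    start_date=datetime(2017, 6, 22, 18),
    end_date=datetime(2017, 6, 23, 20),
    interval=timedelta(days=1),
)
for logical_date, trigger_time in runs:
    print(f"logical date {logical_date} -> triggered at or after {trigger_time}")
```

Under these assumptions the first run does not start at 18:00 on the start_date itself but one interval later, so a start_date one interval earlier would be needed to get a run "today".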

Related

Airflow DAG scheduled monthly not queued

I have an Airflow DAG set up to run monthly (with the @monthly schedule interval). The next DAG runs seem to be scheduled, but they don't appear as "queued" in the Airflow UI. I don't understand why, because everything else seems fine. Here is how my DAG is configured:
with DAG(
    "dag_name",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@monthly",
    catchup=True,
    default_args={"retries": 5, "retry_delay": timedelta(minutes=1)},
) as dag:
When you unpause your DAG, do you get no runs at all, or do you get one backfilled run with Last run shown as 2023-01-01, 00:00:00?
In the latter case Airflow is behaving as intended: the run that just happened is the one that would have been queued and run at midnight on 2023-02-01. :)
I used your configuration on a new, simple DAG and it gave me one successful backfilled run with the run ID scheduled__2023-01-01T00:00:00+00:00. It covers the data interval from 2023-01-01 (the logical_date) to 2023-02-01, so it is the run that would have been queued at midnight on 2023-02-01.
The next run is scheduled for the logical date 2023-02-01, which means it covers the data from 2023-02-01 to 2023-03-01. That run will only be queued and executed at midnight on 2023-03-01, as the Run After date in the UI shows.
This guide might help with terminology Airflow uses around schedules.
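The dates above can be reproduced with a small sketch in plain Python. It only mimics the calendar arithmetic of a monthly schedule, assuming every run covers one calendar month and starts after that month has ended; next_month and monthly_runs are hypothetical helpers, not part of Airflow.

```python
from datetime import datetime


def next_month(dt):
    """Return the start of the month after dt (dt is assumed to be a month start)."""
    if dt.month == 12:
        return dt.replace(year=dt.year + 1, month=1)
    return dt.replace(month=dt.month + 1)


def monthly_runs(start_date, count):
    """Return (logical_date, run_after) pairs for the first `count` monthly runs."""
    runs = []
    logical_date = start_date
    for _ in range(count):
        run_after = next_month(logical_date)  # end of the data interval
        runs.append((logical_date, run_after))
        logical_date = run_after
    return runs


monthly = monthly_runs(datetime(2023, 1, 1), 2)
for logical_date, run_after in monthly:
    print(f"logical date {logical_date:%Y-%m-%d} -> run after {run_after:%Y-%m-%d}")
```

Calling monthly_runs(datetime(2022, 12, 1), 2) instead shifts everything back by one month, which is why an earlier start_date yields one more backfilled run.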
I'm assuming you wanted the DAG to backfill two runs, one that would have happened on 2023-01-01 and one that would have happened on 2023-02-01. This DAG should do that:
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.empty import EmptyOperator

with DAG(
    "dag_name_3",
    start_date=datetime(2022, 12, 1),
    schedule_interval="@monthly",
    catchup=True,
    default_args={"retries": 5, "retry_delay": timedelta(minutes=1)},
) as dag:
    t1 = EmptyOperator(task_id="t1")

Apache Airflow does not enforce dagrun_timeout

I am using Apache Airflow version 1.10.3 with the sequential executor, and I would like the DAG to fail after a certain amount of time if it has not finished. I tried setting dagrun_timeout in the example code
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
default_args = {
    'owner': 'me',
    'depends_on_past': False,
    'start_date': datetime(2019, 6, 1),
    'retries': 0,
}
dag = DAG(
    'min_timeout',
    default_args=default_args,
    schedule_interval=timedelta(minutes=5),
    dagrun_timeout=timedelta(seconds=30),
    max_active_runs=1,
)
t1 = BashOperator(
    task_id='fast_task',
    bash_command='date',
    dag=dag)
t2 = BashOperator(
    task_id='slow_task',
    bash_command='sleep 45',
    dag=dag)
t2.set_upstream(t1)
slow_task alone takes more than the time limit set by dagrun_timeout, so my understanding is that airflow should stop DAG execution. However, this does not happen, and slow_task is allowed to run for its entire duration. After this occurs, the run is marked as failed, but this does not kill the task or DAG as desired. Using execution_timeout for slow_task does cause the task to be killed at the specified time limit, but I would prefer to use an overall time limit for the DAG rather than specifying execution_timeout for each task.
Is there anything else I should try to achieve this behavior, or any mistakes I can fix?
The Airflow scheduler runs a loop at least every SCHEDULER_HEARTBEAT_SEC (the default is 5 seconds).
Note the "at least" here: the scheduler performs some actions that may delay the next cycle of its loop.
These actions include:
parsing the dags
filling up the DagBag
checking the DagRun and updating their state
scheduling next DagRun
In your example, the delayed task isn't terminated at the dagrun_timeout because the scheduler performs its next cycle after the task completes.
According to Airflow documentation:
dagrun_timeout (datetime.timedelta) – specify how long a DagRun should be up before timing out / failing, so that new DagRuns can be created. The timeout is only enforced for scheduled DagRuns, and only once the # of active DagRuns == max_active_runs.
So dagrun_timeout is not enforced for non-scheduled DagRuns (e.g. manually triggered ones), nor while the number of active DagRuns is below the max_active_runs parameter.
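The two conditions from the quoted documentation can be written down as a small predicate. This is only a reading of that sentence in plain Python, not the scheduler's actual code; dagrun_timeout_applies and its parameter names are hypothetical.

```python
def dagrun_timeout_applies(is_scheduled_run, active_runs, max_active_runs):
    """Return True when, per the quoted docs, dagrun_timeout would be enforced."""
    # Enforced only for scheduled runs, and only once the number of active
    # DagRuns has reached max_active_runs.
    return is_scheduled_run and active_runs == max_active_runs


# The DAG in the question uses max_active_runs=1.
print(dagrun_timeout_applies(True, active_runs=1, max_active_runs=1))   # scheduled run
print(dagrun_timeout_applies(False, active_runs=1, max_active_runs=1))  # manual trigger
print(dagrun_timeout_applies(True, active_runs=1, max_active_runs=3))   # below the limit
```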

Avoid expired dates in Airflow

I have the following airflow DAG:
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
#Runs every 1 minute
dag = DAG(dag_id='example_dag', start_date=datetime(2020, 1, 1), schedule_interval='*/1 * * * *')
t1 = BashOperator(task_id='bash_task', bash_command='echo Hello!', dag=dag)
Problem here is that Airflow is scheduling and executing tasks from past dates like the first minute of 2020, the second minute of 2020, the third minute of 2020 and so on.
I want Airflow to schedule and execute only the tasks that occur after the dag deploy (i.e. if I deploy today, I want the first task to be executed in the next minute) and not to execute expired tasks.
Any advice? Thanks!
I found the answer here. Read the "Catchup and Idempotent DAG" section.
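In short, the setting in question is the catchup argument of the DAG: with catchup=False the scheduler creates a run only for the most recent completed interval instead of for every interval since start_date. The difference can be sketched in plain Python; runs_to_schedule is a hypothetical helper that imitates this behaviour for a timedelta schedule and is not Airflow code.

```python
from datetime import datetime, timedelta


def runs_to_schedule(start_date, now, interval, catchup):
    """Return the logical dates that would get a run when the DAG is enabled at `now`."""
    completed = []
    logical_date = start_date
    # Only intervals that have fully ended by `now` are eligible to run.
    while logical_date + interval <= now:
        completed.append(logical_date)
        logical_date += interval
    # With catchup disabled, keep just the latest completed interval.
    return completed if catchup else completed[-1:]


start = datetime(2020, 1, 1)
now = datetime(2020, 1, 1, 0, 10)  # pretend the DAG is enabled ten minutes in
every_minute = timedelta(minutes=1)

with_catchup = runs_to_schedule(start, now, every_minute, catchup=True)
without_catchup = runs_to_schedule(start, now, every_minute, catchup=False)
print(len(with_catchup), len(without_catchup))
```

For the DAG in the question this means passing catchup=False to DAG(...), or moving start_date close to the deployment time.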

How can I schedule a DAG Airflow to run in 5 minutes from now for the first time?

Situation:
Airflow 1.10.6
It's November 18th, 8 p.m.
airflow.cfg.default_timezone = system (i.e. Europe/Berlin)
I want to run my new "sample_job" every day at 8.05 p.m.
My configuration:
default_args = {
    'owner': 'Airflow',
    'start_date': datetime.datetime(year=2019, month=11, day=18, hour=20, minute=0),
    'execution_timeout': timedelta(hours=13)
}
dag = DAG(
    'sample_job',
    default_args=default_args,
    catchup=False,
    max_active_runs=1,
    schedule_interval='05 20 * * *')
Now when I activate the job at 8.03 pm I realize that the job is executed immediately with yesterday's date as last_run date.
How do I have to change my settings so that the job is not executed before 8.05 pm?
The very first DAG run is triggered soon after start_date + schedule_interval [1]. Your schedule interval is one day and you want the first DAG run to start after 2019-11-18 20:05, so your start_date should be 2019-11-17 20:05.
As to why a DAG run is started as soon as you turn the DAG on, I suspect the reason is that you scheduled this DAG with a different start_date or schedule_interval before. If start_date or schedule_interval is changed, it is recommended to change the dag_id too [2], because then a fresh set of metadata (and a new schedule) is created for the renamed DAG.
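The start_date arithmetic from this answer, as a minimal sketch in plain Python. It assumes a fixed one-day interval and a start_date that is aligned with the schedule; first_trigger and start_date_for are hypothetical helper names.

```python
from datetime import datetime, timedelta


def first_trigger(start_date, interval):
    """The first run is triggered soon after start_date + schedule_interval."""
    return start_date + interval


def start_date_for(first_desired_run, interval):
    """Work backwards: which start_date gives the desired first trigger time?"""
    return first_desired_run - interval


one_day = timedelta(days=1)
wanted = datetime(2019, 11, 18, 20, 5)

start = start_date_for(wanted, one_day)
print(start)                          # start_date to configure
print(first_trigger(start, one_day))  # when the first run is triggered
```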

Airflow backfill clarification

I'm just getting started with Airbnb's airflow, and I'm still not clear on how/when backfilling is done.
Specifically, there are 2 use-cases that confuse me:
1. If I run airflow scheduler for a few minutes, stop it for a minute, then restart it again, my DAG seems to run extra tasks for the first 30 seconds or so, then it continues as normal (runs every 10 sec). Are these extra tasks "backfilled" tasks that weren't able to complete in an earlier run? If so, how would I tell airflow not to backfill those tasks?
2. If I run airflow scheduler for a few minutes, then run airflow clear MY_tutorial, then restart airflow scheduler, it seems to run a TON of extra tasks. Are these tasks also somehow "backfilled" tasks? Or am I missing something?
Currently, I have a very simple dag:
default_args = {
    'owner': 'me',
    'depends_on_past': False,
    'start_date': datetime(2016, 10, 4),
    'email': ['airflow@airflow.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}
dag = DAG(
    'MY_tutorial', default_args=default_args, schedule_interval=timedelta(seconds=10))
# t1, t2 and t3 are examples of tasks created by instantiating operators
t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)
t2 = BashOperator(
    task_id='sleep',
    bash_command='sleep 5',
    retries=3,
    dag=dag)
templated_command = """
{% for i in range(5) %}
echo "{{ ds }}"
echo "{{ macros.ds_add(ds, 8)}}"
echo "{{ params.my_param }}"
{% endfor %}
"""
t3 = BashOperator(
    task_id='templated',
    bash_command=templated_command,
    params={'my_param': 'Parameter I passed in'},
    dag=dag)
second_template = """
touch ~/airflow/logs/test
echo $(date) >> ~/airflow/logs/test
"""
t4 = BashOperator(
    task_id='write_test',
    bash_command=second_template,
    dag=dag)
t1.set_upstream(t4)
t2.set_upstream(t1)
t3.set_upstream(t1)
The only two things I've changed in my airflow config are
I changed from using a sqlite db to using a postgres db
I'm using a CeleryExecutor instead of a SequentialExecutor
Thanks so much for your help!
When you change the scheduler toggle to "on" for a DAG, the scheduler will trigger a backfill of all dag run instances for which it has no status recorded, starting with the start_date you specify in your "default_args".
For example: If the start date was "2017-01-21" and you turned on the scheduling toggle at "2017-01-22T00:00:00" and your dag was configured to run hourly, then the scheduler will backfill 24 dag runs and then start running on the scheduled interval.
This is essentially what is happening in both of your cases. In #1, it is filling in the 3 runs missed during the 30 seconds for which the scheduler was turned off. In #2, it is filling in all of the DAG runs from start_date until "now".
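Both run counts mentioned above (24 hourly runs, and 3 runs for a 30-second gap on a 10-second schedule) follow from dividing the gap by the schedule interval. A minimal sketch in plain Python, assuming a fixed timedelta interval; missed_runs is a hypothetical helper, not an Airflow function.

```python
from datetime import datetime, timedelta


def missed_runs(last_recorded, now, interval):
    """Count the whole schedule intervals that ended between last_recorded and now."""
    return (now - last_recorded) // interval


# Start date 2017-01-21, toggle switched on at 2017-01-22T00:00:00, hourly schedule.
hourly_gap = missed_runs(datetime(2017, 1, 21), datetime(2017, 1, 22), timedelta(hours=1))

# Scheduler off for 30 seconds on a 10-second schedule (arbitrary example timestamps).
stopped_at = datetime(2016, 10, 4, 12, 0, 0)
short_gap = missed_runs(stopped_at, stopped_at + timedelta(seconds=30), timedelta(seconds=10))

print(hourly_gap, short_gap)
```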
There are 2 ways around this:
1. Set the start_date to a date in the future so that it will only start scheduling dag runs once that date is reached. Note that if you change the start_date of a DAG, you must change the name of the DAG as well due to the way the start date is stored in airflow's DB.
2. Manually run backfill from the command line with the "-m" (--mark-success) flag, which tells airflow not to actually run the DAG, but rather just mark it as successful in the DB.
e.g.
airflow backfill MY_tutorial -m -s 2016-10-04 -e 2017-01-22T14:28:30
Please note that since version 1.8, Airflow lets you control this behaviour using catchup. Either set catchup_by_default=False in airflow.cfg or catchup=False in your DAG definition.
See https://airflow.apache.org/scheduler.html#backfill-and-catchup
The On/Off toggle in Airflow's UI only pauses and unpauses the DAG: while the DAG is paused no new runs are scheduled, and when it is unpaused scheduling continues from the date at which it was paused.
