Airflow: Prioritize the Dags - airflow

I'm fetching categories from a metadata table and creating a dynamic DAG for each category using a Python script. Right now we have around 15 categories, so each category has its own DAG. My DAG file has 3 tasks, which run sequentially.
I'm using LocalExecutor. All 15 DAGs (DAG runs) are triggered in parallel, and we don't have enough resources (the tasks are heavy) to run all 15 of them at once.
Is there any way to prioritize the DAG runs? Five DAGs should run first, then the next five, and so on. Jobs should run based on available resources while the others wait in a queue. This should be dynamic.
What is the best way to fix this? Kindly help.
Sample dag:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
default_args = {
    # Note: datetime(2019, 6, 03) is a syntax error in Python 3 (leading zero)
    'start_date': datetime(2019, 6, 3),
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}
dag = DAG('test', catchup=False, default_args=default_args, schedule_interval='*/5 * * * *')
t1 = BashOperator(
    task_id='print_start_date',
    bash_command='date',
    dag=dag)
t2 = BashOperator(
    task_id='sleep',
    bash_command='sleep 50s',
    retries=3,
    dag=dag)
t3 = BashOperator(
    task_id='print_end_date',
    bash_command='date',
    dag=dag)
t1 >> t2 >> t3

There isn't a good way to do this if you are running on LocalExecutor and all the DAGs are scheduled to run at the same time.
If you were to move to the CeleryExecutor with multiple worker machines, you could use Airflow queues to create a "priority" queue that serves the DAGs you mark as high priority.
Another option is to use SubDAGs. Each of the 15 DAGs can be structured as a SubDAG and run in the order you want. Here is an example of what that could look like:
start ----> Sub Dag 1 --> Sub Dag 6  --> Sub Dag 11
       |--> Sub Dag 2 --> Sub Dag 7  --> Sub Dag 12
       |--> Sub Dag 3 --> Sub Dag 8  --> Sub Dag 13
       |--> Sub Dag 4 --> Sub Dag 9  --> Sub Dag 14
       |--> Sub Dag 5 --> Sub Dag 10 --> Sub Dag 15
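The layout above amounts to splitting the 15 categories into 5 lanes, where the lanes run in parallel and each lane runs its members one after another. A minimal sketch of that grouping in plain Python follows. The function name `build_lanes` and the `category_N` names are hypothetical, chosen only for illustration; wiring each chain into SubDAGs or tasks is left out.

```python
from typing import List, Sequence


def build_lanes(categories: Sequence[str], lanes: int) -> List[List[str]]:
    """Split categories into at most `lanes` chains; each chain runs sequentially."""
    if lanes < 1:
        raise ValueError("lanes must be at least 1")
    # Lane i receives items i, i + lanes, i + 2 * lanes, ...
    return [list(categories[i::lanes]) for i in range(min(lanes, len(categories)))]


# Hypothetical category names standing in for the rows of the metadata table.
categories = [f"category_{n}" for n in range(1, 16)]
chains = build_lanes(categories, lanes=5)

for chain in chains:
    # Each printed line is one lane, e.g. category_1 --> category_6 --> category_11
    print(" --> ".join(chain))
```

With 5 lanes, at most 5 heavy jobs are active at any time, which is the constraint described in the question.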

Related

How to force an Airflow Task to restart at the new scheduling date?

I have this simple Airflow DAG:
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.bash import BashOperator
with DAG("Second Dag",
         start_date=datetime(2022, 1, 1),
         schedule_interval="0 5 * * *",
         catchup=False,
         max_active_runs=1
         ) as dag:
    task_a = BashOperator(
        task_id="ToRepeat",
        bash_command="cd /home/xdf/local/ && (env/bin/python workflow/test1.py)",
        retries=1,
    )
The task takes a variable amount of time from one run to the next, and I have no guarantee that it will be finished by 5 A.M. the next day.
If the task is still running when a new run is scheduled to start, I need to kill the old one before the new one starts.
How can I design the Airflow DAG to automatically kill the old task if it's still running when a new task is scheduled to start?
More details:
I am looking for something dynamic. The old DAG run should be killed only when the new DAG run is starting. If, for any reason, the new DAG run does not start for one week, then the old DAG run should be able to run for the entire week. That's why using a timeout is sub-optimal.
You should set dagrun_timeout for your DAG.
dagrun_timeout: specify how long a DagRun should be up before
timing out / failing, so that new DagRuns can be created. The timeout
is only enforced for scheduled DagRuns.
Since your DAG runs daily you can set 24 hours for timeout.
with DAG("Second Dag",
         start_date=datetime(2022, 1, 1),
         schedule_interval="0 5 * * *",
         catchup=False,
         max_active_runs=1,  # the comma here was missing in the original snippet
         dagrun_timeout=timedelta(hours=24)
         ) as dag:
    ...  # tasks go here
If you want to set timeout on a specific task in your DAG you should use execution_timeout on your operator.
execution_timeout: max time allowed for the execution of this task instance, if it goes beyond it will raise and fail
Example:
MyOperator(task_id='task', execution_timeout=timedelta(hours=24))
If you really are looking for a dynamic solution, you can use the Airflow DAG run API together with XComs: push the current run_id to XCom, and in subsequent runs pull that XCom and call the Airflow API to check for and kill the DAG run with that run_id.
check_previous_dag_run_id >> kill_previous_dag_run >> push_current_run_id >> your_main_task
and your API call task should be something like
...
kill_previous_dag_run = BashOperator(
    task_id="kill_previous_dag_run",
    bash_command=(
        "curl -X 'DELETE' "
        "'http://<<your_webserver_dns>>/api/v1/dags/<<your_dag_name>>/dagRuns/<<url_encoded_run_id>>' "
        "-H 'accept: */*' --user <<api_username>>:<<api_user_password>>"
    ),
    dag=dag
)
...
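The `<<url_encoded_run_id>>` placeholder matters because run ids usually contain characters such as `:` and `+` that are not safe in a URL path. A minimal sketch of building that URL with the standard library is shown below. The helper name `dag_run_url`, the host, the DAG id, and the example run_id are all assumptions made for illustration, not values from the question.

```python
from urllib.parse import quote


def dag_run_url(base_url: str, dag_id: str, run_id: str) -> str:
    """Build the DAG run endpoint URL with percent-encoded path segments."""
    # safe="" also encodes "/", so each id stays a single path segment.
    return (
        f"{base_url.rstrip('/')}/api/v1/dags/"
        f"{quote(dag_id, safe='')}/dagRuns/{quote(run_id, safe='')}"
    )


url = dag_run_url(
    "http://airflow.example.com:8080",       # hypothetical webserver address
    "second_dag",                            # hypothetical DAG id
    "scheduled__2022-01-01T05:00:00+00:00",  # assumed shape of a scheduled run_id
)
print(url)
```

The resulting string can then be passed to whatever HTTP client issues the DELETE request, in place of the hand-written curl URL.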

Airflow backfill only scheduling for START_DATE

I just started using airflow and I basically want to run my dag to load historical data. So I'm running this command
airflow backfill my_dag -s 2018-07-30 -e 2018-08-01
But Airflow runs my DAG only for 2018-07-30. I expected it to run for 2018-07-30, 2018-07-31 and 2018-08-01.
Here's part of my dag's code:
import airflow
import configparser
import os
from airflow import DAG
from airflow.contrib.operators.databricks_operator import DatabricksSubmitRunOperator
from airflow.models import Variable
from datetime import datetime
def getConfFileFullPath(fileName):
    return os.path.join(os.path.abspath(os.path.dirname(__file__)), fileName)

config = configparser.ConfigParser(interpolation=configparser.ExtendedInterpolation())
config.read([getConfFileFullPath('pipeline.properties')])
args = {
    'owner': 'airflow',
    'depends_on_past': True,
    'start_date': datetime(2018, 7, 25),
    'end_date': airflow.utils.dates.days_ago(1)
}
dag_id = 'my_dag'
dag = DAG(
    dag_id=dag_id, default_args=args,
    schedule_interval=None, catchup=False)
...
So am I doing anything wrong with my dag configuration?
Problem: schedule_interval=None
To initiate multiple runs within your defined date range, you need to set a schedule interval for the DAG. For example, try:
schedule_interval='@daily'
The start date, end date and schedule interval together define how many runs the scheduler initiates when a backfill is executed.
See also: Airflow scheduling and presets
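To make the relationship between the date range and the interval concrete, here is a simplified model in plain Python. It is only a sketch of the behaviour described in this thread, not Airflow's actual scheduling code, and the function name `backfill_dates` is hypothetical. It assumes a fixed-length interval and treats a missing interval as "one run at the start date", which is what the question observed.

```python
from datetime import date, timedelta
from typing import List, Optional


def backfill_dates(start: date, end: date, interval: Optional[timedelta]) -> List[date]:
    """Return the run dates a backfill over [start, end] would cover (simplified)."""
    if interval is None:
        # No schedule: modelled as a single run for the start date only.
        return [start]
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    dates = []
    current = start
    while current <= end:
        dates.append(current)
        current += interval
    return dates


daily = backfill_dates(date(2018, 7, 30), date(2018, 8, 1), timedelta(days=1))
unscheduled = backfill_dates(date(2018, 7, 30), date(2018, 8, 1), None)
print([d.isoformat() for d in daily])
print([d.isoformat() for d in unscheduled])
```

With a daily interval the model yields the three dates the question expected; without an interval it yields only the start date.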

Avoid expired dates in Airflow

I have the following airflow DAG:
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
#Runs every 1 minute
dag = DAG(dag_id='example_dag', start_date=datetime(2020, 1, 1), schedule_interval='*/1 * * * *')
t1 = BashOperator(task_id='bash_task', bash_command='echo Hello!', dag=dag)
Problem here is that Airflow is scheduling and executing tasks from past dates like the first minute of 2020, the second minute of 2020, the third minute of 2020 and so on.
I want Airflow to schedule and execute only the tasks that occur after the dag deploy (i.e. if I deploy today, I want the first task to be executed in the next minute) and not to execute expired tasks.
Any advice? Thanks!
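I found the answer here. Read the "Catchup and Idempotent DAG" section. In short, setting catchup=False on the DAG stops Airflow from scheduling runs for the past intervals.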

Airflow scheduler not scheduling simple DAG task immediately

I have scheduled a DAG with a simple bash task to run every 5th minute:
# bash_dag.py
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
default_args = {
    'owner': 'airflow',
    'start_date': datetime(2019, 5, 30)
}
dag = DAG(
    'bash_count',
    default_args=default_args,
    schedule_interval='*/5 * * * *',
    catchup=False
)
t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag
)
Scheduling works fine, and the DAG executes at every 5-minute mark. However, I have noticed a significant delay between the 5-minute mark and the time the task is queued. For the examples shown in the image, task queueing takes between 3 and 50 seconds. For example, the last DAG execution in the image was supposed to be triggered right after 20:05:00, but the task instance was queued 28 seconds later (20:05:28).
I'm surprised this is the case, since the DAG being scheduled has a single very simple task. Is this a normal airflow delay? Should I expect further delays when dealing with more complex DAGs?
I'm running a local airflow server with Postgres as db on a 16 GB Mac with OS Mojave. Machine is not resource constrained.

Airflow backfill clarification

I'm just getting started with Airbnb's airflow, and I'm still not clear on how/when backfilling is done.
Specifically, there are 2 use-cases that confuse me:
1. If I run airflow scheduler for a few minutes, stop it for a minute, then restart it again, my DAG seems to run extra tasks for the first 30 seconds or so, then it continues as normal (runs every 10 sec). Are these extra tasks "backfilled" tasks that weren't able to complete in an earlier run? If so, how would I tell airflow not to backfill those tasks?
2. If I run airflow scheduler for a few minutes, then run airflow clear MY_tutorial, then restart airflow scheduler, it seems to run a TON of extra tasks. Are these tasks also somehow "backfilled" tasks? Or am I missing something?
Currently, I have a very simple dag:
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'me',
    'depends_on_past': False,
    'start_date': datetime(2016, 10, 4),
    'email': ['airflow@airflow.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}
dag = DAG(
    'MY_tutorial', default_args=default_args, schedule_interval=timedelta(seconds=10))
# t1, t2 and t3 are examples of tasks created by instantiating operators
t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)
t2 = BashOperator(
    task_id='sleep',
    bash_command='sleep 5',
    retries=3,
    dag=dag)
templated_command = """
{% for i in range(5) %}
echo "{{ ds }}"
echo "{{ macros.ds_add(ds, 8)}}"
echo "{{ params.my_param }}"
{% endfor %}
"""
t3 = BashOperator(
    task_id='templated',
    bash_command=templated_command,
    params={'my_param': 'Parameter I passed in'},
    dag=dag)
second_template = """
touch ~/airflow/logs/test
echo $(date) >> ~/airflow/logs/test
"""
t4 = BashOperator(
    task_id='write_test',
    bash_command=second_template,
    dag=dag)
t1.set_upstream(t4)
t2.set_upstream(t1)
t3.set_upstream(t1)
The only two things I've changed in my airflow config are:
- I changed from using a sqlite db to using a postgres db
- I'm using a CeleryExecutor instead of a SequentialExecutor
Thanks so much for your help!
When you change the scheduler toggle to "on" for a DAG, the scheduler will trigger a backfill of all dag run instances for which it has no status recorded, starting with the start_date you specify in your "default_args".
For example: If the start date was "2017-01-21" and you turned on the scheduling toggle at "2017-01-22T00:00:00" and your dag was configured to run hourly, then the scheduler will backfill 24 dag runs and then start running on the scheduled interval.
This is essentially what is happening in both of your questions. In #1, it is filling in the 3 missing runs from the 30 seconds during which the scheduler was turned off. In #2, it is filling in all of the DAG runs from start_date until "now".
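The run counts quoted above follow from simple interval arithmetic. The sketch below is a simplified model, not Airflow's scheduler logic: it assumes a fixed-length interval and counts only intervals that have fully elapsed. The function name `missed_runs` and the concrete timestamps used for the 30-second case are illustrative assumptions.

```python
from datetime import datetime, timedelta


def missed_runs(start: datetime, now: datetime, interval: timedelta) -> int:
    """Count schedule intervals that have fully elapsed between start and now."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    if now <= start:
        return 0
    # Floor division of two timedeltas yields an int.
    return (now - start) // interval


# Hourly DAG, start date 2017-01-21, toggle switched on at 2017-01-22T00:00:00.
hourly = missed_runs(datetime(2017, 1, 21), datetime(2017, 1, 22), timedelta(hours=1))

# DAG running every 10 seconds, scheduler stopped for 30 seconds (times are illustrative).
stopped_at = datetime(2016, 10, 4, 12, 0, 0)
short_gap = missed_runs(stopped_at, stopped_at + timedelta(seconds=30), timedelta(seconds=10))

print(hourly, short_gap)
```

Under these assumptions the model gives 24 runs for the hourly example and 3 runs for the 30-second gap, matching the numbers in the answer.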
There are 2 ways around this:
1. Set the start_date to a date in the future so that it will only start scheduling dag runs once that date is reached. Note that if you change the start_date of a DAG, you must change the name of the DAG as well due to the way the start date is stored in airflow's DB.
2. Manually run backfill from the command line with the "-m" (--mark-success) flag, which tells airflow not to actually run the DAG, but rather just mark it as successful in the DB.
e.g.
airflow backfill MY_tutorial -m -s 2016-10-04 -e 2017-01-22T14:28:30
Please note that since version 1.8, Airflow lets you control this behaviour using catchup. Either set catchup_by_default=False in airflow.cfg or catchup=False in your DAG definition.
See https://airflow.apache.org/scheduler.html#backfill-and-catchup
The On/Off toggle in Airflow's UI is really just a "PAUSE" switch: pausing a DAG only stops scheduling at the point where it was paused, and when the DAG is switched back on, scheduling continues from that date.

Resources