How to use the same DAG with multiple schedule_intervals in airflow? - airflow

I have one DAG that I pass a variety of configurations, and one of the settings I want to pass is how often it should run.
For example, using the same DAG, I have two different RUNS. RUN A I want to run daily. RUN B I want to run weekly. Both of these use the exact same DAG code but have different configurations passed.
So far as I can see, there is no way to easily pass the schedule within the configuration. The only solution I have is to make multiple DAGs with the exact same code but different schedules, which results in a lot of redundant code duplication.
Is there any better options?
ex: As an example, I have a dag that is a web crawler, and I pass urls for it to crawl. i need to modify the frequency of the crawling for different sets of urls, basically. The urls I am passing can change and there is no way to identify what schedule to use other than the run parameters

In this case since daily contains weekly it's best to just have a daily run and use branch operator to decide what logic to use based on day of the week.
import pendulum
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.weekday import BranchDayOfWeekOperator
with DAG(
dag_id="my_dag",
start_date=pendulum.datetime(2022, 1, 1, tz="UTC"),
catchup=False,
schedule_interval="#daily",
) as dag:
task_a = EmptyOperator(task_id='logic_a') # Replace with your actual operators for 1st configuration/logic
task_b = EmptyOperator(task_id='logic_b') # Replace with your actual operators for 2nd configuration/logic
branch = BranchDayOfWeekOperator(
task_id="make_choice",
follow_task_ids_if_true="logic_a",
follow_task_ids_if_false="logic_b",
week_day="Monday",
)
branch >> [task_a, task_b]
In this example the DAG is running every day. On Monday it will follow task_a the rest of the week it will follow task_b.

Related

run next tasks in dag if another dag is complete

dag1:
start >> clean >> end
I have a dag where i run a few tasks. But I want to modify it such that the clean steps only runs if another dag "dag2" is not running at the moment.
Is there any way I can import information regarding my "dag2", check its status and if it is in success mode, I can proceed to the clean step
Something like this:
start >> wait_for_dag2 >> clean >> end
How can I achieve the wait_for_dag2 part?
There are some different answers depends on what you want to do:
if you have two dags with the same schedule interval, and you want to make the run of second dag waits the same run of first one, you can use ExternalTaskSensor on the last task of the first dag
if you want to run a dag2, after each run of a dag1 even if it's triggered manually, in this case you need to update dag1 and add a TriggerDagRunOperator and set schedule interval of the second to None
I want to modify it such that the clean steps only runs if another dag "dag2" is not running at the moment.
if you have two dags and you don't want to run them in same time to avoid a conflict on an external server/service, you can use one of the first two propositions or just use higher priority for the task of the first dag, and use the same pool (with 1 slot) for the tasks which lead to the conflict, but you will lose the parallelism on these tasks.
Hossein's Approach is the way people usually go. However if you want to get info about any dag run data, you can use the airlfow functionality to get that info. The following appraoch is good when you do not want(or are not allowed) to modify another dag:
from airflow.models.dagrun import DagRun
from airflow.utils.state import DagRunState
dag_runs = DagRun.find(dag_id='the_dag_id_you_want_to_check')
last_run = dag_runs[-1]
if last_run.state == DagRunState.SUCCESS:
print('the dag run was successfull!')
else:
print('the dag state is -->: ', last_run.state)

How to find out the delayed jobs in airflow

Some of my DAG are waiting to get scheduled, and some are waiting in the queue. I suspect there are reasons for this delay but not sure how I can start to debug this problem. Majority of the pipelines are running Spark jobs.
Can someone help to give me some directions in terms of where to look at to 1) anaylse which DAGs were delayed (did not start at the scheduled time) 2) where are the places I should look at to find out if the resources are enough. I'm quite new to scheduling in Airflow. Many thanks. Please let me know if I can describe the question better.
If you are looking for code that takes advantage of Airflows' wider capabilities.
There are three modules within airflow.models which can be harnessed.
To programmatically retrieve all DAGs which your Airflow is away of, we import DagBag. From the docs "A dagbag is a collection of dags, parsed out of a folder tree and has high"
We utilise DagModel and the method get_current, to initialise each dag_id present in our bag
We check if any DAG is active using the DagModel property is_paused
We retrieve the latest DAG run using the DagRun.find
Sort the individual dag runs by latest to earliest
Here you could just subset [0] to get 1, however, for your debugging purposes I just loop through them all
DagRun returns a lot of information for us to use. In my loop I have output print(i, run.state, run.execution_date, run.start_date). So you can see what is going on under the hood.
id
state
dag_id
queued_at
execution_date
start_date
end_date
run_id
data_interval_start
data_interval_end
last_scheduling_decision
I have commented out an if check for any queued Dags for you to uncomment. Additionally you can do some arithmetic on dates if you desire, to add further conditional functionality.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.models import DagBag, DagModel, DagRun
from airflow.operators.python import PythonOperator
# make a function that returns if a DAG is set to active or paused
def check_dag_active():
bag = DagBag()
for dag_id in bag.dags:
in_bag = DagModel.get_current(dag_id)
if not in_bag.is_paused:
latest = DagRun.find(dag_id=dag_id)
latest.sort(key=lambda x: x.execution_date, reverse=True)
for i, run in enumerate(latest):
print(i, run.state, run.execution_date, run.start_date)
# if run.state == 'queued':
# return [run.dag_id, run.execution_date, run.start_date]
with DAG(
'stack_overflow_ans_3',
tags = ['SO'],
start_date = datetime(2022, 1, 1),
schedule_interval = None,
catchup = False,
is_paused_upon_creation = False
) as dag:
t1 = PythonOperator(
task_id = 'task_that_will_fail',
python_callable = check_dag_active
)
Depending on your version of Airflow and your setup, you should be able to query the Airflow DB directly to get this information.
If you're using Airflow 1.x, there should be an "Ad Hoc Query" executor in the Data Profiling tab in the UI. This was disabled in 2.x though, so if you're running 2.x you'll need to connect directly to your Airflow DB using psql or something similar (this differs from Google to AWS to Docker).
Once you're in, check out this link for some queries on DAG runtime.

Is there any TriggerRule for airflow operator with no Trigger by time (need to

I have use case to create 2 tasks of BigqueryOperator that have same destination table but I need one to run daily, and the second one to be run manually just when I need.
Below are the illustration of Tree View
| task_3rd_adhoc
| task_3rd
|---- task_2nd
|---- task_1st_a
|---- task_1st_b
From example above, DAG are run daily. And I aim to the task will be:
task_1st_a and task_1st_b run first. Target table are:
project.dataset.table_1st_a with _PARTITIONTIME = execution date, and
project.dataset.table_1st_b with _PARTITIONTIME = execution date.
then task_2nd_a will run after task_1st_a and task_1st_b finish. BigQueryOperator use TriggerRule.ALL_SUCCESS. Target table is:
project.dataset.table_2nd with _PARTITIONTIME = execution date.
then task_3rd will run after task_2nd success. BigQueryOperator use TriggerRule.ALL_SUCCESS. Target table is:
project.dataset.table_3rd with PARTITIONTIME = D-2 from execution date.
task_3rd_adhoc will not run in daily job. I need this when I want to backfill table project.dataset.table_3rd. With target table:
project.dataset.table_3rd with _PARTITIONTIME = execution_date
But I still can't find what is the correct TriggerRule for step #4 above. I tried TriggerRule.DUMMY because I thought it can be used to set no Trigger, but task_3rd_adhoc also run in daily job when I tried create DAG above.
(based on this doc dependencies are just for show, trigger at will)
First of all, you've misunderstood TriggerRule.DUMMY.
Usually, when you wire tasks together task_a >> task_b, B would run only after A is complete (success / failed, based on B's trigger_rule).
TriggerRule.DUMMY means that even after wiring tasks A & B together as before, B would run independently of A (run at will). It doesn't mean run at your will, rather it runs at Airflow's will (it will trigger it whenever it feels like). So clearly tasks having dummy trigger rule will pretty much ALWAYS run, albeit, at an unpredictable time
What you need here (to have a particular task in DAG always but run it only when manually specified) is a combination of
AirflowSkipException
Variable
Here's roughly how you can do
A Variable should hold the command for this task (whether or not it should run). This Variable, of course, you can edit anytime from UI (thereby controlling whether or not that task runs in next DagRun)
In the Operator's code (execute() method for custom-operator or just python_callable in case of PythonOperator), you'll check value of Variable (whether or not the task is supposed to run)
Based on the Variable value, if the task is NOT supposed to run, you must throw an AirflowSkipException, so that the task will be marked at skipped. Or else, it will just run as usual

Can an Airflow task dynamically generate a DAG at runtime?

I have an upload folder that gets irregular uploads. For each uploaded file, I want to spawn a DAG that is specific to that file.
My first thought was to do this with a FileSensor that monitors the upload folder and, conditional on presence of new files, triggers a task that creates the separate DAGs. Conceptually:
Sensor_DAG (FileSensor -> CreateDAGTask)
|-> File1_DAG (Task1 -> Task2 -> ...)
|-> File2_DAG (Task1 -> Task2 -> ...)
In my initial implementation, CreateDAGTask was a PythonOperator that created DAG globals, by placing them in the global namespace (see this SO answer), like so:
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.sensors.file_sensor import FileSensor
from datetime import datetime, timedelta
from pathlib import Path
UPLOAD_LOCATION = "/opt/files/uploaded"
# Dynamic DAG generation task code, for the Sensor_DAG below
def generate_dags_for_files(location=UPLOAD_LOCATION, **kwargs):
dags = []
for filepath in Path(location).glob('*'):
dag_name = f"process_{filepath.name}"
dag = DAG(dag_name, schedule_interval="#once", default_args={
"depends_on_past": True,
"start_date": datetime(2020, 7, 15),
"retries": 1,
"retry_delay": timedelta(hours=12)
}, catchup=False)
dag_task = DummyOperator(dag=dag, task_id=f"start_{dag_name}")
dags.append(dag)
# Try to place the DAG into globals(), which doesn't work
globals()[dag_name] = dag
return dags
The main DAG then invokes this logic via a PythonOperator:
# File-sensing DAG
default_args = {
"depends_on_past" : False,
"start_date" : datetime(2020, 7, 16),
"retries" : 1,
"retry_delay" : timedelta(hours=5),
}
with DAG("Sensor_DAG", default_args=default_args,
schedule_interval= "50 * * * *", catchup=False, ) as sensor_dag:
start_task = DummyOperator(task_id="start")
stop_task = DummyOperator(task_id="stop")
sensor_task = FileSensor(task_id="my_file_sensor_task",
poke_interval=60,
filepath=UPLOAD_LOCATION)
process_creator_task = PythonOperator(
task_id="process_creator",
python_callable=generate_dags_for_files,
)
start_task >> sensor_task >> process_creator_task >> stop_task
But that doesn't work, because by the time process_creator_task runs, the globals have already been parsed by Airflow. New globals after parse time are irrelevant.
Interim solution
Per Airflow dynamic DAG and task Ids, I can achieve what I'm trying to do by omitting the FileSensor task altogether and just letting Airflow generate the per-file task at each scheduler heartbeat, replacing the Sensor_DAG with just executing generate_dags_for_files: Update: Nevermind -- while this does create a DAG in the dashboard, actual execution runs into the "DAG seems to be missing" issue:
generate_dags_for_files()
This does mean that I can no longer regulate the frequency of folder polling with the poke_interval parameter of FileSensor; instead, Airflow will poll the folder every time it collects DAGs.
Is that the best pattern here?
Other related StackOverflow threads
Run Airflow DAG for each file and Airflow: Proper way to run DAG for each file: identical use case, but the accepted answer uses two static DAGs, presumably with different parameters.
Proper way to create dynamic workflows in Airflow - accepted answer dynamically creates tasks, not DAGs, via a complicated XCom setup.
In short: if the task writes where the DagBag reads from, yes, but it's best to avoid a pattern that requires this. Any DAG you're tempted to custom-create in a task should probably instead be a static, heavily parametrized, conditionally-triggered DAG. y2k-shubham provides an excellent example of such a setup, and I'm grateful for his guidance in the comments on this question.
That said, here are the approaches that would accomplish what the question is asking, no matter how bad of an idea it is, in the increasing degree of ham-handedness:
If you dynamically generate DAGs from a Variable (like so), modify the Variable.
If you dynamically generate DAGs from a list of config files, add a new config file to wherever you're pulling config files from, so that a new DAG gets generated on the next DAG collection.
Use something like Jinja templating to write a new Python file in the dags/ folder.
To retain access to the task after it runs, you'd have to keep the new DAG definition stable and accessible on future dashboard updates / DagBag collection. Otherwise, the Airflow dashboard won't be able to render much about it.
Airflow is suited for building DAGs dynamically; as pointed it out by its creator:
https://youtu.be/Fvu2oFyFCT0?t=411 p.s. thanks to #Yiannis for the video reference
Here is an example of how this could be accomplished:
https://docs.astronomer.io/learn/dynamically-generating-dags

How to configure Airflow dag start_date to run tasks like in cron

I’m new to Airflow and I’m trying to understand how to use the scheduler correctly. Basically I want to schedule tasks the same way as I use cron. There’s a task that needs to be run every 5 minutes and I want it to start at the dag run next even 5 min slot after I add the DAG file to dags directory or after I have made some changes to the dag file.
I know that the DAG is run at the end of the schedule_interval. If I add a new DAG and use start_date=days_ago(0) then I will get the unnecessary runs starting from the beginning of the day. It also feels stupid to hardcode some specific start date on the dag file i.e. start_date=datetime(2019, 9, 4, 10, 1, 0, 818988). Is my approach wrong or is there some specific reason why the start_date needs to be set?
I think I found an answer to my own question from the official documentation: https://airflow.apache.org/scheduler.html#backfill-and-catchup
By turning off the catchup, DAG run is created only for the most recent interval. So then I can set the start_date to anything in the past and define the dag like this:
dag = DAG('good-dag', catchup=False, default_args=default_args, schedule_interval='*/5 * * * *')

Resources