I have an upload folder that gets irregular uploads. For each uploaded file, I want to spawn a DAG that is specific to that file.
My first thought was to do this with a FileSensor that monitors the upload folder and, conditional on presence of new files, triggers a task that creates the separate DAGs. Conceptually:
Sensor_DAG (FileSensor -> CreateDAGTask)
|-> File1_DAG (Task1 -> Task2 -> ...)
|-> File2_DAG (Task1 -> Task2 -> ...)
In my initial implementation, CreateDAGTask was a PythonOperator that created the DAGs by placing them in the global namespace (see this SO answer), like so:
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.sensors.file_sensor import FileSensor
from datetime import datetime, timedelta
from pathlib import Path
UPLOAD_LOCATION = "/opt/files/uploaded"
# Dynamic DAG generation task code, for the Sensor_DAG below
def generate_dags_for_files(location=UPLOAD_LOCATION, **kwargs):
    dags = []
    for filepath in Path(location).glob('*'):
        dag_name = f"process_{filepath.name}"
        dag = DAG(dag_name, schedule_interval="@once", default_args={
            "depends_on_past": True,
            "start_date": datetime(2020, 7, 15),
            "retries": 1,
            "retry_delay": timedelta(hours=12)
        }, catchup=False)
        dag_task = DummyOperator(dag=dag, task_id=f"start_{dag_name}")
        dags.append(dag)
        # Try to place the DAG into globals(), which doesn't work
        globals()[dag_name] = dag
    return dags
The main DAG then invokes this logic via a PythonOperator:
# File-sensing DAG
default_args = {
    "depends_on_past": False,
    "start_date": datetime(2020, 7, 16),
    "retries": 1,
    "retry_delay": timedelta(hours=5),
}

with DAG("Sensor_DAG", default_args=default_args,
         schedule_interval="50 * * * *", catchup=False) as sensor_dag:

    start_task = DummyOperator(task_id="start")
    stop_task = DummyOperator(task_id="stop")
    sensor_task = FileSensor(task_id="my_file_sensor_task",
                             poke_interval=60,
                             filepath=UPLOAD_LOCATION)
    process_creator_task = PythonOperator(
        task_id="process_creator",
        python_callable=generate_dags_for_files,
    )

    start_task >> sensor_task >> process_creator_task >> stop_task
But that doesn't work: by the time process_creator_task runs, Airflow has already parsed the module and collected its globals. Globals added after parse time never get picked up.
Interim solution
Per Airflow dynamic DAG and task Ids, I can achieve what I'm trying to do by omitting the FileSensor task altogether and just letting Airflow generate the per-file DAGs at each scheduler heartbeat, replacing the Sensor_DAG with a bare call to generate_dags_for_files. Update: never mind -- while this does create a DAG in the dashboard, actual execution runs into the "DAG seems to be missing" issue:
generate_dags_for_files()
This does mean that I can no longer regulate the frequency of folder polling with the poke_interval parameter of FileSensor; instead, Airflow will poll the folder every time it collects DAGs.
Is that the best pattern here?
Other related StackOverflow threads
Run Airflow DAG for each file and Airflow: Proper way to run DAG for each file: identical use case, but the accepted answer uses two static DAGs, presumably with different parameters.
Proper way to create dynamic workflows in Airflow - accepted answer dynamically creates tasks, not DAGs, via a complicated XCom setup.
In short: if the task writes where the DagBag reads from, yes, but it's best to avoid a pattern that requires this. Any DAG you're tempted to custom-create in a task should probably instead be a static, heavily parametrized, conditionally-triggered DAG. y2k-shubham provides an excellent example of such a setup, and I'm grateful for his guidance in the comments on this question.
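For reference, a rough sketch of what such a static, parametrized, conditionally-triggered DAG could look like (the DAG id and the conf key are my own placeholders, not y2k-shubham's exact setup): the processing DAG is unscheduled, and each file is handled by triggering a run with the file path in that run's conf, e.g. via a TriggerDagRunOperator or airflow dags trigger -c '{"filepath": "..."}'.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def process_file(**context):
    # The per-file parameter arrives through the triggering run's conf
    filepath = context["dag_run"].conf["filepath"]
    print(f"processing {filepath}")

with DAG("process_uploaded_file", start_date=datetime(2020, 7, 15),
         schedule_interval=None, catchup=False) as dag:
    PythonOperator(task_id="process", python_callable=process_file)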
That said, here are the approaches that would accomplish what the question is asking, no matter how bad an idea it is, in increasing degree of ham-handedness:
If you dynamically generate DAGs from a Variable (like so), modify the Variable.
If you dynamically generate DAGs from a list of config files, add a new config file to wherever you're pulling config files from, so that a new DAG gets generated on the next DAG collection.
Use something like Jinja templating to write a new Python file in the dags/ folder.
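As a minimal sketch of the first approach (the Variable key "file_list" is hypothetical, not from the question): the DAG file reads a Variable holding a JSON list of uploaded file names at parse time and registers one DAG per entry, so the sensing task only has to update that Variable instead of creating DAGs itself.
from datetime import datetime
from airflow import DAG
from airflow.models import Variable
from airflow.operators.dummy_operator import DummyOperator

# "file_list" is a hypothetical Variable key holding e.g. ["file1.csv", "file2.csv"]
for file_name in Variable.get("file_list", default_var=[], deserialize_json=True):
    dag_id = f"process_{file_name}"
    dag = DAG(dag_id, schedule_interval="@once",
              start_date=datetime(2020, 7, 15), catchup=False)
    DummyOperator(dag=dag, task_id=f"start_{dag_id}")
    # This runs at parse time, so the DagBag actually sees the new DAG
    globals()[dag_id] = dag
The sensing task then only calls something like Variable.set("file_list", new_list, serialize_json=True), and the new DAGs appear on the next DAG collection.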
To retain access to the task after it runs, you'd have to keep the new DAG definition stable and accessible on future dashboard updates / DagBag collection. Otherwise, the Airflow dashboard won't be able to render much about it.
Airflow is well suited for building DAGs dynamically, as pointed out by its creator:
https://youtu.be/Fvu2oFyFCT0?t=411
P.S. Thanks to @Yiannis for the video reference.
Here is an example of how this could be accomplished:
https://docs.astronomer.io/learn/dynamically-generating-dags
Related
Some of my DAGs are waiting to get scheduled, and some are waiting in the queue. I suspect there are reasons for this delay, but I'm not sure how to start debugging this problem. The majority of the pipelines are running Spark jobs.
Can someone give me some direction in terms of where to look to 1) analyse which DAGs were delayed (did not start at the scheduled time) and 2) find out whether the resources are enough? I'm quite new to scheduling in Airflow. Many thanks. Please let me know if I can describe the question better.
If you are looking for code that takes advantage of Airflow's wider capabilities, there are three classes within airflow.models which can be harnessed:
To programmatically retrieve all DAGs which your Airflow is aware of, we import DagBag. From the docs: "A dagbag is a collection of dags, parsed out of a folder tree and has high level configuration settings."
We utilise DagModel and its method get_current to initialise each dag_id present in our bag.
We check whether a DAG is active using the DagModel property is_paused.
We retrieve the latest DAG runs using DagRun.find.
We sort the individual DAG runs from latest to earliest.
Here you could just take index [0] to get the latest run; however, for your debugging purposes I loop through them all.
DagRun gives us a lot of information to use. In my loop I output print(i, run.state, run.execution_date, run.start_date) so you can see what is going on under the hood. A DagRun exposes, among other fields:
id
state
dag_id
queued_at
execution_date
start_date
end_date
run_id
data_interval_start
data_interval_end
last_scheduling_decision
I have commented out an if check for any queued DAGs for you to uncomment. Additionally, you can do some arithmetic on the dates if you desire, to add further conditional functionality.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.models import DagBag, DagModel, DagRun
from airflow.operators.python import PythonOperator


# make a function that returns if a DAG is set to active or paused
def check_dag_active():
    bag = DagBag()
    for dag_id in bag.dags:
        in_bag = DagModel.get_current(dag_id)
        if not in_bag.is_paused:
            latest = DagRun.find(dag_id=dag_id)
            latest.sort(key=lambda x: x.execution_date, reverse=True)
            for i, run in enumerate(latest):
                print(i, run.state, run.execution_date, run.start_date)
                # if run.state == 'queued':
                #     return [run.dag_id, run.execution_date, run.start_date]


with DAG(
    'stack_overflow_ans_3',
    tags=['SO'],
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
    is_paused_upon_creation=False
) as dag:

    t1 = PythonOperator(
        task_id='task_that_will_fail',
        python_callable=check_dag_active
    )
Depending on your version of Airflow and your setup, you should be able to query the Airflow DB directly to get this information.
If you're using Airflow 1.x, there should be an "Ad Hoc Query" executor in the Data Profiling tab of the UI. This was disabled in 2.x though, so if you're running 2.x you'll need to connect directly to your Airflow DB using psql or something similar (how you connect differs between Google Cloud, AWS, Docker, etc.).
Once you're in, check out this link for some queries on DAG runtime.
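If you'd rather stay in Python than open psql, here is a rough sketch of the same idea using Airflow's own ORM session (the dag_id below is a hypothetical placeholder):
from airflow.models import DagRun
from airflow.settings import Session

session = Session()
runs = (session.query(DagRun)
        .filter(DagRun.dag_id == "my_spark_pipeline")   # hypothetical dag_id
        .order_by(DagRun.execution_date.desc())
        .limit(20))
for run in runs:
    # A large gap between execution_date and start_date suggests scheduling delay;
    # a long start_date -> end_date span points at the work itself (e.g. the Spark job)
    duration = (run.end_date - run.start_date) if run.start_date and run.end_date else None
    print(run.dag_id, run.execution_date, run.start_date, run.state, duration)
session.close()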
Does Airflow automatically detect changed variables that are used by already-deployed DAGs, and apply the change immediately, or is a DAG manual restart or refresh required to apply the new value of a changed variable?
If the Airflow Variable is used in the DAG code, as in the example below, the change will be picked up (into the Python variable interval) the next time the scheduler parses the file. The scheduler re-parses DAG files periodically, at a short interval.
# experimental_dag.py
import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python import PythonOperator

default_args = {
    'owner': 'anonymous',
    'start_date': datetime.datetime(2021, 12, 1),
}

interval = Variable.get('interval', '@daily')

dag = DAG(
    'experimental_dag',
    default_args=default_args,
    schedule_interval=interval
)

def write_log(ts):
    with open('./output.txt', 'a') as f:
        f.write(f'{ts}\n')

py_task = PythonOperator(
    task_id='load_yesterday_data',
    python_callable=write_log,
    op_kwargs={
        'ts': '{{ ts }}'
    },
    dag=dag
)
If the DAG code itself is changed (for example the above experimental_dag.py), then the updated file definitely needs to be copied to the DAG folder (configured in airflow.cfg).
Airflow Variables are stored in the database. Airflow does not maintain DAG <-> Variable relationships. Variables are not bound to a specific DAG.
The value of a variable is populated when Variable.get() is called in your code.
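To make the parse-time vs. run-time distinction concrete, here is a small sketch (the Variable keys interval and sql_path are just illustrative): a Variable read at the top level of the file is refreshed whenever the scheduler re-parses it, while a Variable read inside the callable is refreshed on every task run.
from datetime import datetime
from airflow import DAG
from airflow.models import Variable
from airflow.operators.python import PythonOperator

# Parse-time read: picked up the next time the scheduler re-parses this file
interval = Variable.get('interval', '@daily')

def use_variable_at_runtime():
    # Run-time read: evaluated fresh on every task execution
    sql_path = Variable.get('sql_path', '/usr/local/airflow/sql')
    print(sql_path)

with DAG('variable_refresh_demo', start_date=datetime(2021, 12, 1),
         schedule_interval=interval, catchup=False) as dag:
    PythonOperator(task_id='use_variable', python_callable=use_variable_at_runtime)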
I'm trying to understand how to pass values via Airflow's XCom functionality. The specific use case I am trying to build is to write a file, then move it, then run another command. The idea is that I pass the file name from one operator to the next.
Here is what I have:
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
import datetime as dt

DAG = DAG(
    dag_id='xcom_test_dag',
    start_date=dt.datetime.now(),
    schedule_interval='@once'
)

def push_function(**context):
    file_name = 'test_file_{date}'.format(date=dt.datetime.now())
    return context['task_instance'].xcom_push(key='filename', value=file_name)

def pull_function(**context):
    dir(context['task_instance'].xcom_pull())

push_task = PythonOperator(
    task_id='push_task',
    python_callable=push_function,
    provide_context=True,
    dag=DAG)

pull_task = PythonOperator(
    task_id='pull_task',
    python_callable=pull_function,
    provide_context=True,
    dag=DAG)

push_task >> pull_task
If I want to reference the file name in the pull_task so I can read the file - how should I call that? Trying to access context['task_instance'] does not give me the value. Further - is it best practice to reference a file name like this from task to task / operator to operator?
When pulling data from XCOM, you want to provide the task ID of the task where you push the data. In your example, the task_id of your push task is push_task, so you'd want to do something like:
value = context['task_instance'].xcom_pull(task_ids='push_task')
However, from the airflow documentation, note that:
By default, xcom_pull() filters for the keys that are automatically given to XComs when they are pushed by being returned from execute functions (as opposed to XComs that are pushed manually).
If you're pushing data to XCOM manually with specific keys, you may need to include that key when calling xcom_pull. In your example, you push a key called filename in your push task, so you'd likely need to do something like this in your pull task:
value = context['task_instance'].xcom_pull(task_ids='push_task', key='filename')
This information is outlined in further detail in the Airflow documentation: https://airflow.apache.org/docs/stable/concepts.html?highlight=xcom#concepts-xcom
As for your question regarding "best practices" - for communicating between Airflow Tasks/Operators, XCOM is the best way to go. However, if you're wanting to read a file from disk across multiple operators, you would need to ensure that all your workers have access to where the file is stored. If that isn't possible, an alternative could be to have the push task store that file remotely (e.g. in AWS S3) and push the S3 URL to XCOM. The pull task could then read the S3 URL from XCOM, and download the file from S3.
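A rough sketch of that last suggestion, using boto3 directly (the bucket name and key prefix are made-up placeholders; you could equally use Airflow's S3 hook):
import datetime as dt
import boto3

BUCKET = 'my-exchange-bucket'   # hypothetical bucket

def push_function(**context):
    file_name = 'test_file_{date}'.format(date=dt.datetime.now())
    local_path = '/tmp/' + file_name
    with open(local_path, 'w') as f:
        f.write('payload')
    s3_key = 'exchange/' + file_name
    boto3.client('s3').upload_file(local_path, BUCKET, s3_key)
    # Push only the small S3 reference through XCom, not the file contents
    context['task_instance'].xcom_push(key='s3_key', value=s3_key)

def pull_function(**context):
    s3_key = context['task_instance'].xcom_pull(task_ids='push_task', key='s3_key')
    local_path = '/tmp/' + s3_key.split('/')[-1]
    boto3.client('s3').download_file(BUCKET, s3_key, local_path)
    with open(local_path) as f:
        print(f.read())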
Suppose I have an airflow dag file that creates a graph like so...
def get_current_info(filename):
    current_info = {}
    <fill in info in current_info relevant for today's date for given file>
    return current_info

files = [
    get_current_info("file_001"),
    get_current_info("file_002"),
    ....
]

for f in files:
    <some BashOperator bo1 using f's current info dict>
    <some BashOperator bo2 using f's current info dict>
    ....
    bo1 >> bo2
    ....
Since the values in the current_info dict used to define the DAG change periodically (here, daily), I would like to know by what process / on what schedule the DAG definition gets updated. (I print the current_info values each run and the values appear to be updating, but I'm curious as to how and when exactly this happens.)
When does an Airflow DAG definition get evaluated? Is this referenced anywhere in the docs?
The DAGs are evaluated in every run of the scheduler.
This article describes how the scheduler works and at what stage the DAG files are picked up for evaluation.
After some discussion on the Airflow mailing list, it turns out that Airflow builds the DAG for each task when that task is run (so each task includes the overhead of building the DAG again, which in my case was very significant).
See more details on this here: https://stackoverflow.com/a/59995882/8236733
To set up the connections and variables in Airflow I use a DAG; we do this in order to be able to set Airflow up again quickly if we ever have to. It does work: my connections and variables show up, but the task "fails". The error says that there is already an sql_path variable:
[2018-03-30 19:42:48,784] {{models.py:1595}} ERROR - (psycopg2.IntegrityError) duplicate key value violates unique constraint "variable_key_key"
DETAIL: Key (key)=(sql_path) already exists.
[SQL: 'INSERT INTO variable (key, val, is_encrypted) VALUES (%(key)s, %(val)s, %(is_encrypted)s) RETURNING variable.id'] [parameters: {'key': 'sql_path', 'val': 'gAAAAABavpM46rWjISLZRRKu4hJRD7HFKMuXMpmJ5Z3DyhFbFOQ91cD9NsQsYyFof_pdPn116d6yNoNoOAqx_LRqMahjbYKUqrhNRiYru4juPv4JEGAv2d0=', 'is_encrypted': True}] (Background on this error at: http://sqlalche.me/e/gkpj)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1193, in _execute_context
context)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 507, in do_execute
cursor.execute(statement, parameters)
psycopg2.IntegrityError: duplicate key value violates unique constraint "variable_key_key"
DETAIL: Key (key)=(sql_path) already exists.
However, I checked: before I run the DAG, the ad hoc query SELECT * FROM variable returns nothing, and afterwards it returns my two variables.
I checked whether I create the variable twice, but I don't think I do.
Here is the part of the DAG that creates the path variables:
import airflow
from datetime import datetime, timedelta
from airflow.operators.python_operator import PythonOperator
from airflow import models
from airflow.settings import Session
import logging

args = {
    'owner': 'airflow',
    'start_date': airflow.utils.dates.days_ago(1),
    'provide_context': True
}

def init_staging_airflow():
    logging.info('Creating connections, pool and sql path')
    session = Session()

    new_var = models.Variable()
    new_var.key = "sql_path"
    new_var.set_val("/usr/local/airflow/sql")
    session.add(new_var)
    session.commit()

    new_var = models.Variable()
    new_var.key = "conf_path"
    new_var.set_val("/usr/local/airflow/conf")
    session.add(new_var)
    session.commit()

    # new_pool is created in a part of the DAG not shown here
    session.add(new_pool)
    session.commit()

    session.close()

dag = airflow.DAG(
    'init_staging_airflow',
    schedule_interval="@once",
    default_args=args,
    max_active_runs=1)

t1 = PythonOperator(task_id='init_staging_airflow',
                    python_callable=init_staging_airflow,
                    provide_context=False,
                    dag=dag)
I ran into the same problem when trying to do Variable.set() inside a DAG. I believe the scheduler will constantly poll the DagBag to refresh any changes dynamically. That's why you see a ton of these when running the webserver:
[2018-04-02 11:28:41,531] [45914] {models.py:168} INFO - Filling up the DagBag from /Users/jasontang/XXX/data-server/dags
Sooner or later you'll hit the key constraint.
What I did was to put all the variables I need to set at runtime into a global dictionary ("VARIABLE_DICT" in the example below), and just allow all my DAGs and sub-DAGs to access it.
import os

from airflow.models import Variable

# Module-level dictionary shared by the DAGs/sub-DAGs defined in this file
VARIABLE_DICT = {}

def initialize(dag_run_obj):
    global VARIABLE_DICT
    if dag_run_obj.external_trigger:
        VARIABLE_DICT.update(dag_run_obj.conf)
        values = (dag_run_obj.conf['client'],
                  dag_run_obj.conf['vertical'],
                  dag_run_obj.conf['frequency'],
                  dag_run_obj.conf.get('snapshot'))
        config_file = '{0}-{1}/{0}-{1}-{2}.json'.format(*values)
        path = os.path.join(Variable.get('repo_root'), 'conf', config_file)
        VARIABLE_DICT.update(read_config(path))  # read_config is the author's own helper
You could ignore the dag_run_obj part, since I specifically look for any additional configuration values provided to the DAG Run when it is created. In your other DAGs and subDAGs just import the dictionary.
justang is correct; the reason this is happening is that the scheduler executes the top level of your DAG file every time it runs (the scheduler runs frequently to check whether your DAGs have changed, whether they need to be started, etc.).
I fixed this one by calling Variable.delete() every time before Variable.set().
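For what it's worth, a hedged alternative sketch: in the Airflow versions I've seen, Variable.set() itself removes any existing row with the same key before inserting, so building the rows by hand with models.Variable (and then needing a Variable.delete() beforehand) can be avoided and the init task stays idempotent.
from airflow.models import Variable

def init_staging_airflow():
    # Upserts: safe to re-run without hitting the unique key constraint
    Variable.set("sql_path", "/usr/local/airflow/sql")
    Variable.set("conf_path", "/usr/local/airflow/conf")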