I'm trying to understand how to pass values via Airflow's XCom functionality. The specific use case I am trying to build is to write a file, then move it, then run another command. The idea is that I pass the file name from one operator to the next.
Here is what I have:
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
import datetime as dt
DAG = DAG(
dag_id='xcom_test_dag',
start_date=dt.datetime.now(),
schedule_interval='@once'
)
def push_function(**context):
    file_name = 'test_file_{date}'.format(date=dt.datetime.now())
    return context['task_instance'].xcom_push(key='filename', value=file_name)
def pull_function(**context):
    dir(context['task_instance'].xcom_pull())
push_task = PythonOperator(
    task_id='push_task',
    python_callable=push_function,
    provide_context=True,
    dag=DAG)
pull_task = PythonOperator(
    task_id='pull_task',
    python_callable=pull_function,
    provide_context=True,
    dag=DAG)
push_task >> pull_task
If I want to reference the file name in the pull_task so I can read the file - how should I call that? Trying to access context['task_instance'] does not contain a value. Further - is it best practice to reference a file name like this from task to task/operator to operator?
When pulling data from XCOM, you want to provide the task ID of the task where you push the data. In your example, the task_id of your push task is push_task, so you'd want to do something like:
value = context['task_instance'].xcom_pull(task_ids='push_task')
However, from the airflow documentation, note that:
By default, xcom_pull() filters for the keys that are automatically given to XComs when they are pushed by being returned from execute functions (as opposed to XComs that are pushed manually).
If you're pushing data to XCOM manually with specific keys, you may need to include that key when calling xcom_pull. In your example, you push a key called filename in your push task, so you'd likely need to do something like this in your pull task:
value = context['task_instance'].xcom_pull(task_ids='push_task', key='filename')
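For example, a minimal sketch of what the pull task's callable could look like with that call (the file read at the end is illustrative only, since the push task as written just builds a file name string, and the worker running the pull task would need access to wherever the file actually lives):
def pull_function(**context):
    # Pull the value pushed by push_task under the explicit key 'filename'
    file_name = context['task_instance'].xcom_pull(
        task_ids='push_task', key='filename')
    print('Pulled file name from XCom: {}'.format(file_name))
    # Illustrative only: read the file if push_task actually wrote it
    # to a path that this worker can also reach.
    with open(file_name) as f:
        print(f.read())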
This information is outlined in further detail in the Airflow documentation: https://airflow.apache.org/docs/stable/concepts.html?highlight=xcom#concepts-xcom
As for your question regarding "best practices" - for communicating between Airflow Tasks/Operators, XCOM is the best way to go. However, if you're wanting to read a file from disk across multiple operators, you would need to ensure that all your workers have access to where the file is stored. If that isn't possible, an alternative could be to have the push task store that file remotely (e.g. in AWS S3) and push the S3 URL to XCOM. The pull task could then read the S3 URL from XCOM, and download the file from S3.
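A rough sketch of that S3-based alternative (the connection id aws_default, the bucket my-bucket, the local path, and the key are placeholder assumptions, and the S3Hook import path shown is the Airflow 1.10-era one):
from airflow.hooks.S3_hook import S3Hook

def push_function(**context):
    local_path = '/tmp/test_file.txt'   # hypothetical local file
    s3_key = 'uploads/test_file.txt'    # hypothetical S3 key
    with open(local_path, 'w') as f:
        f.write('some payload')
    # Upload the file and push only its S3 key to XCom
    S3Hook(aws_conn_id='aws_default').load_file(
        filename=local_path, key=s3_key,
        bucket_name='my-bucket', replace=True)
    context['task_instance'].xcom_push(key='s3_key', value=s3_key)

def pull_function(**context):
    s3_key = context['task_instance'].xcom_pull(
        task_ids='push_task', key='s3_key')
    # Read the object back on whichever worker this task lands on
    contents = S3Hook(aws_conn_id='aws_default').read_key(
        key=s3_key, bucket_name='my-bucket')
    print(contents)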
dag1:
start >> clean >> end
I have a dag where I run a few tasks. But I want to modify it so that the clean step only runs if another dag, "dag2", is not running at the moment.
Is there any way I can import information regarding my "dag2", check its status, and, if it is in a success state, proceed to the clean step?
Something like this:
start >> wait_for_dag2 >> clean >> end
How can I achieve the wait_for_dag2 part?
There are different answers depending on what you want to do:
If you have two dags with the same schedule interval and you want a run of the second dag to wait for the corresponding run of the first one, you can use an ExternalTaskSensor pointed at the last task of the first dag (see the sketch after this list).
If you want dag2 to run after each run of dag1, even when dag1 is triggered manually, you need to update dag1 to add a TriggerDagRunOperator and set the schedule interval of the second dag to None.
I want to modify it such that the clean steps only runs if another dag "dag2" is not running at the moment.
If you have two dags and you don't want them running at the same time (to avoid a conflict on an external server/service), you can use one of the first two suggestions, or give the tasks of the first dag a higher priority and put the conflicting tasks of both dags in the same pool (with 1 slot); the trade-off is that you lose parallelism on those tasks.
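A minimal sketch of the ExternalTaskSensor idea, adapted to the question's layout (dag1 waits for dag2); it assumes dag2's final task is literally named end, that both dags share the same schedule interval, and it uses the Airflow 1.10-era import path, which may differ in newer versions:
from airflow.sensors.external_task_sensor import ExternalTaskSensor

# Placed inside dag1: blocks until dag2's 'end' task (an assumed task name)
# has finished for the same execution date.
wait_for_dag2 = ExternalTaskSensor(
    task_id='wait_for_dag2',
    external_dag_id='dag2',
    external_task_id='end',
    poke_interval=60,
    dag=dag)

start >> wait_for_dag2 >> clean >> end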
Hossein's approach is the way people usually go. However, if you want information about any dag run, you can use Airflow's own models to fetch it. The following approach is useful when you do not want (or are not allowed) to modify the other dag:
from airflow.models.dagrun import DagRun
from airflow.utils.state import DagRunState
dag_runs = DagRun.find(dag_id='the_dag_id_you_want_to_check')
last_run = dag_runs[-1]
if last_run.state == DagRunState.SUCCESS:
    print('the dag run was successful!')
else:
    print('the dag state is -->: ', last_run.state)
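To tie this back to the question, one possible wiring (a sketch, not the only option) is to wrap that check in a ShortCircuitOperator so the clean step is skipped whenever dag2 currently has a run in progress; the dag_id 'dag2' and the start/clean/end tasks are taken from the question and otherwise assumed to exist:
from airflow.models.dagrun import DagRun
from airflow.operators.python import ShortCircuitOperator  # airflow.operators.python_operator on older versions
from airflow.utils.state import DagRunState

def dag2_not_running():
    # Continue (return True) only if dag2 has no run in the RUNNING state.
    running = DagRun.find(dag_id='dag2', state=DagRunState.RUNNING)
    return len(running) == 0

wait_for_dag2 = ShortCircuitOperator(
    task_id='wait_for_dag2',
    python_callable=dag2_not_running,
    dag=dag)

start >> wait_for_dag2 >> clean >> end
Note that this skips the clean step rather than waiting for dag2 to finish; if you need to actually wait, the ExternalTaskSensor approach above (or a custom sensor built around the same DagRun.find check) is the better fit.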
I have an upload folder that gets irregular uploads. For each uploaded file, I want to spawn a DAG that is specific to that file.
My first thought was to do this with a FileSensor that monitors the upload folder and, conditional on presence of new files, triggers a task that creates the separate DAGs. Conceptually:
Sensor_DAG (FileSensor -> CreateDAGTask)
|-> File1_DAG (Task1 -> Task2 -> ...)
|-> File2_DAG (Task1 -> Task2 -> ...)
In my initial implementation, CreateDAGTask was a PythonOperator that created DAGs by placing them in the global namespace (see this SO answer), like so:
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.sensors.file_sensor import FileSensor
from datetime import datetime, timedelta
from pathlib import Path
UPLOAD_LOCATION = "/opt/files/uploaded"
# Dynamic DAG generation task code, for the Sensor_DAG below
def generate_dags_for_files(location=UPLOAD_LOCATION, **kwargs):
    dags = []
    for filepath in Path(location).glob('*'):
        dag_name = f"process_{filepath.name}"
        dag = DAG(dag_name, schedule_interval="@once", default_args={
            "depends_on_past": True,
            "start_date": datetime(2020, 7, 15),
            "retries": 1,
            "retry_delay": timedelta(hours=12)
        }, catchup=False)
        dag_task = DummyOperator(dag=dag, task_id=f"start_{dag_name}")
        dags.append(dag)
        # Try to place the DAG into globals(), which doesn't work
        globals()[dag_name] = dag
    return dags
The main DAG then invokes this logic via a PythonOperator:
# File-sensing DAG
default_args = {
    "depends_on_past": False,
    "start_date": datetime(2020, 7, 16),
    "retries": 1,
    "retry_delay": timedelta(hours=5),
}
with DAG("Sensor_DAG", default_args=default_args,
         schedule_interval="50 * * * *", catchup=False) as sensor_dag:
    start_task = DummyOperator(task_id="start")
    stop_task = DummyOperator(task_id="stop")
    sensor_task = FileSensor(task_id="my_file_sensor_task",
                             poke_interval=60,
                             filepath=UPLOAD_LOCATION)
    process_creator_task = PythonOperator(
        task_id="process_creator",
        python_callable=generate_dags_for_files,
    )
    start_task >> sensor_task >> process_creator_task >> stop_task
But that doesn't work, because by the time process_creator_task runs, the globals have already been parsed by Airflow. New globals after parse time are irrelevant.
Interim solution
Per Airflow dynamic DAG and task Ids, I can get close to what I'm trying to do by omitting the FileSensor task altogether and just letting Airflow generate the per-file DAGs at each scheduler heartbeat, i.e. replacing the Sensor_DAG with a plain module-level call to generate_dags_for_files. Update: never mind -- while this does create a DAG in the dashboard, actual execution runs into the "DAG seems to be missing" issue:
generate_dags_for_files()
This does mean that I can no longer regulate the frequency of folder polling with the poke_interval parameter of FileSensor; instead, Airflow will poll the folder every time it collects DAGs.
Is that the best pattern here?
Other related StackOverflow threads
Run Airflow DAG for each file and Airflow: Proper way to run DAG for each file: identical use case, but the accepted answer uses two static DAGs, presumably with different parameters.
Proper way to create dynamic workflows in Airflow - accepted answer dynamically creates tasks, not DAGs, via a complicated XCom setup.
In short: if the task writes where the DagBag reads from, yes, but it's best to avoid a pattern that requires this. Any DAG you're tempted to custom-create in a task should probably instead be a static, heavily parametrized, conditionally-triggered DAG. y2k-shubham provides an excellent example of such a setup, and I'm grateful for his guidance in the comments on this question.
That said, here are the approaches that would accomplish what the question is asking, however bad an idea it may be, in increasing order of ham-handedness:
If you dynamically generate DAGs from a Variable (like so), modify the Variable (a sketch of this approach follows after this list).
If you dynamically generate DAGs from a list of config files, add a new config file to wherever you're pulling config files from, so that a new DAG gets generated on the next DAG collection.
Use something like Jinja templating to write a new Python file in the dags/ folder.
To retain access to the task after it runs, you'd have to keep the new DAG definition stable and accessible on future dashboard updates / DagBag collection. Otherwise, the Airflow dashboard won't be able to render much about it.
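As a rough sketch of that Variable-driven approach (the variable name files_to_process and its shape are assumptions; the point is just that the scheduler re-reads the Variable on every parse and materializes one DAG per entry, while some other task updates the Variable instead of trying to create DAGs directly):
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.dummy_operator import DummyOperator

# e.g. Variable "files_to_process" = ["file_001", "file_002"], maintained elsewhere
files = Variable.get("files_to_process", default_var=[], deserialize_json=True)

for file_name in files:
    dag_id = f"process_{file_name}"
    dag = DAG(dag_id, schedule_interval="@once",
              start_date=datetime(2020, 7, 15), catchup=False)
    start = DummyOperator(task_id=f"start_{dag_id}", dag=dag)
    # Registering the DAG object in globals() at parse time is what makes
    # the DagBag pick it up.
    globals()[dag_id] = dag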
Airflow is well suited to building DAGs dynamically, as pointed out by its creator:
https://youtu.be/Fvu2oFyFCT0?t=411 p.s. thanks to @Yiannis for the video reference
Here is an example of how this could be accomplished:
https://docs.astronomer.io/learn/dynamically-generating-dags
Suppose I have an airflow dag file that creates a graph like so...
def get_current_info(filename):
    current_info = {}
    <fill in info in current_info relevant for today's date for given file>
    return current_info

files = [
    get_current_info("file_001"),
    get_current_info("file_002"),
    ....
]

for f in files:
    <some BashOperator bo1 using f's current info dict>
    <some BashOperator bo2 using f's current info dict>
    ....
    bo1 >> bo2
    ....
Since the values in the current_info dict used to define the dag change periodically (here, daily), I would like to know by what process / schedule the dag definition gets updated. (I print the current_info values each run and the values appear to be updating, but I'm curious as to how and when exactly this happens.)
When does an Airflow dag definition get evaluated? Is this referenced anywhere in the docs?
The DAGs are evaluated in every run of the scheduler.
This article describes how the scheduler works and at what stage the DAG files are picked up for evaluation.
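If you want to observe this yourself, a crude but harmless trick is to log at module level, since the scheduler re-executes the whole DAG file on each parse; how frequently that happens is governed by the scheduler's file-processing settings (e.g. min_file_process_interval in airflow.cfg):
import logging
from datetime import datetime

# Top-level code in a DAG file runs every time the scheduler (re)parses it,
# so this line appears once per evaluation of the DAG definition.
logging.getLogger(__name__).info(
    "DAG file evaluated at %s", datetime.utcnow().isoformat())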
After some discussion on the Airflow email list, it turns out that Airflow builds the dag anew for each task when it is run (so each task includes the overhead of building the dag again, which in my case was very significant).
See more details on this here: https://stackoverflow.com/a/59995882/8236733
I have an airflow DAG that works perfectly when files are present, but errors out and fails when the source files are not there.
I receive files irregularly from a given source, and my DAG picks them up and processes them. While I need to run the DAG daily, files are not necessarily there daily; they could arrive Monday, Wednesday, or even Sunday evening.
I'm not worried about days with no new files. I worry about days when new files come and it breaks.
How do I tell the DAG to gracefully exit with success when no files exist?
My DAG below (please ignore schedule setting. I'm still in development mode):
import airflow
from airflow import models
from airflow.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator
from airflow.operators.gcs_to_gcs import GoogleCloudStorageToGoogleCloudStorageOperator
args = {
    'owner': 'Airflow',
    'start_date': airflow.utils.dates.days_ago(2),
    'email': ['email@gmail.com'],
    'email_on_failure': True,
    'schedule_interval': 'None',
}
dag = models.DAG(
    dag_id='Source1_Ingestion',
    default_args=args
)
# [START load ATTOM File to STAGING]
load_File_to_Source1_RAW = GoogleCloudStorageToBigQueryOperator(
    task_id='Source1_GCS_to_GBQ_Raw',
    bucket='Source1_files',
    source_objects=['To_Process/*.txt'],
    destination_project_dataset_table='Source1.Source1_RAW',
    schema_fields=[
        {'name': 'datarow', 'type': 'STRING', 'mode': 'NULLABLE'},
    ],
    field_delimiter='§',
    write_disposition='WRITE_TRUNCATE',
    google_cloud_storage_conn_id='GCP_EDW_Staging',
    bigquery_conn_id='GCP_EDW_Staging',
    dag=dag)
# [END howto_operator_gcs_to_bq]
# [START move files to Archive]
archive_Source1_files = GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id='Archive_Source1_Files',
    source_bucket='Source1_files',
    source_object='To_Process/*.txt',
    destination_bucket='Source1_files',
    destination_object='Archive/',
    move_object=True,
    google_cloud_storage_conn_id='GCP_EDW_Staging',
    dag=dag
)
# [END move files to archive]
load_File_to_Source1_RAW.set_downstream(archive_Source1_files)
One way to approach this would be to add a Sensor Operator to the workflow.
Nehil Jain describes sensors nicely:
Sensors are a special kind of airflow operator that will keep running until a certain criterion is met. For example, you know a file will arrive at your S3 bucket during certain time period, but the exact time when the file arrives is inconsistent.
For your use case, it looks like there's a Google Cloud Sensor, which "checks for the existence of a file in Google Cloud Storage." The reason you'd incorporate a sensor is that you're decoupling the operation "determine if a file exists" from the operation "get the file (and do something with it)".
By default, sensors have two methods (source):
poke: the code to run at poke_interval times, which tests to see if the condition is true
execute: use the poke method to test for a condition on a schedule defined by the poke_interval; fails out when the timeout argument is reached
In a common file-detection sensor, the operator receives instructions to check a source for a file on a schedule (e.g. check every 5 minutes for up to 3 hours to see if the file exists). If the sensor succeeds in meeting its test condition, it succeeds and allows the DAG to continue downstream to the next operator(s). If it fails to find the file, it times out and the sensor operator is marked failed.
With just a sensor operator, you've already succeeded in separating the error cases - the DAG fails at the GoogleCloudStorageObjectSensor instead of the GoogleCloudStorageToBigQueryOperator when the file doesn't exist, and fails at the GoogleCloudStorageToBigQueryOperator when something is wrong with the transfer logic. Importantly for your use case, Airflow supports a soft_fail argument, which "mark[s] the task as SKIPPED on failure".
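A sketch of what that could look like in front of the load task (a sketch only: the import path and argument names below are the Airflow 1.10 contrib ones and may differ in your version, and the object name is a placeholder because this sensor checks a single object rather than a wildcard - which is exactly the limitation discussed next):
from airflow.contrib.sensors.gcs_sensor import GoogleCloudStorageObjectSensor

file_exists_check = GoogleCloudStorageObjectSensor(
    task_id='Check_Source1_File_Exists',
    bucket='Source1_files',
    object='To_Process/some_known_file.txt',  # placeholder: one specific object
    google_cloud_conn_id='GCP_EDW_Staging',   # conn argument name can vary by version
    poke_interval=300,        # check every 5 minutes
    timeout=60 * 60 * 3,      # give up after 3 hours
    soft_fail=True,           # mark SKIPPED instead of FAILED on timeout
    dag=dag)

file_exists_check.set_downstream(load_File_to_Source1_RAW)
With soft_fail=True, a timed-out sensor is marked SKIPPED, and the downstream load and archive tasks should be skipped as well, so the run ends without a failure on days when nothing arrives.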
I'll caveat this next part by explicitly stating that I'm not intimately familiar with the GoogleCloudStorage operators. If the operator doesn't allow wildcarding in the sensor, you may need to rewire the sensor's poke method to allow for more complex, pattern-based file detection. This is where Airflow's plug-in architecture can really shine, allowing you to modify and extend existing operators to meet your exact needs.
The example I'll give you here is that the SFTPSensor only supports poking for a specific file out of the box. I needed wildcard based poking, so I wrote a plugin that modifies the SFTPSensor to support regular expressions in file identification. In my case, it was just modifying the poke to switch from polling for the existence of a single file to polling a list of files and then passing it through a regular expression to filter the list.
At a cursory glance, it looks like the way that the GoogleCloudStorageSensor pokes for an object is with the hook.exists method. I can't speak to whether a wildcard would work there, but if it doesn't, it looks like there's a hook.list method which would allow you to implement a similar workflow to what I did for the SFTPRegexSensor.
I've included some of the source code for the SFTPRegexSensor Plugin's poke method, modified for how I think it'd work with GCS in case it's helpful:
def poke(self, context):
    # create a hook (removed some of the SSH/SFTP intricacies for simplicity)
    # get the list of file(s) matching the prefix; you need to define operator
    # parameters for the choices that are dynamic in the operator's poke
    # (e.g. which bucket, what the file prefix is); swapped in the GCS args
    files = hook.list(self.bucket, prefix=self.prefix)
    regex = re.compile(self.remote_filename)
    files = list(filter(regex.search, files))
    if not files:
        return False
    return True
I want to run an airflow dag like so ->
I have 2 airflow workers W1 and W2.
In W1 I have scheduled a single task (W1-1) but in W2, I want to create X number of tasks (W2-1, W2-2 ... W2-X).
The number X and the bash command for each task will be derived from a DB call.
All tasks for worker W2 should run in parallel after W1 completes.
This is my code
dag = DAG('deploy_single', catchup=False, default_args=default_args, schedule_interval='16 15 * * *')
t1 = BashOperator(
    task_id='dummy_task',
    bash_command='echo hi > /tmp/hi',
    queue='W1_queue',
    dag=dag)
get_all_engines = "select full_command, queue_name from internal_airflow_hosts where logical_group = 'live_engines';"
db_creds = json.loads(open('/opt/airflow/db_creds.json').read())
conn_dict = db_creds["airflowdb_local"]
connection = psycopg2.connect(**conn_dict)
cursor = connection.cursor()
cursor.execute(get_all_engines)
records = cursor.fetchall()
i = 1
for record in records:
    t = BashOperator(
        task_id='script_test_' + str(i),
        bash_command="{full_command} ".format(full_command=str(record[0])),
        queue=str(record[1]),
        dag=dag)
    t.set_upstream(t1)
    i += 1
cursor.close()
connection.close()
However, when I ran this, the task on W1 completed successfully but all tasks on W2 failed. In the airflow UI, I could see that it resolved the correct number of tasks (10 in this case), but each of these 10 failed.
Looking at the logs, I saw that on W2 (which is on a different machine), airflow could not find the db_creds.json file.
I do not want to provide the DB creds file to W2.
My question is how can an airflow task be created dynamically in this case?
Basically, I want to run a DB query on the airflow server and assign tasks to one or more workers based on the results of that query. The DB will contain updated info about which engines are active etc., and I want the DAG to reflect this. From the logs, it looks like each worker runs the DB query. Providing DB access to each worker is not an option.
Thank you @viraj-parekh and @cwurtz.
After much trial and error, I found the correct way to use airflow variables for this case.
Step 1) We create another script called gen_var.py and place it in the dag folder. This way, the scheduler will pick it up and generate the variables. If the code for generating variables were inside the deploy_single dag, we would run into the same dependency issue, as the worker would try to process that dag too.
"""
Code that goes along with the Airflow tutorial located at:
https://github.com/airbnb/airflow/blob/master/airflow/example_dags/tutorial.py
"""
import json
import psycopg2
from airflow.models import Variable
from psycopg2.extensions import AsIs
get_all_engines = "select full_command, queue_name from internal_airflow_hosts where logical_group = 'live_engines';"
db_creds = json.loads(open('/opt/airflow/db_creds.json').read())
conn_dict = db_creds["airflowdb_local"]
connection = psycopg2.connect(**conn_dict)
cursor = connection.cursor()
cursor.execute(get_all_engines)
records = cursor.fetchall()
hosts = {}
i = 1
for record in records:
    comm_dict = {}
    comm_dict['full_command'] = str(record[0])
    comm_dict['queue_name'] = str(record[1])
    hosts[i] = comm_dict
    i += 1
cursor.close()
connection.close()
Variable.set("hosts",hosts,serialize_json=True)
Note the call to serialize_json. By default, Airflow tries to store the variable as a plain string. If you want it treated as a dict, use serialize_json=True; under the hood, Airflow will still store it as a string via json.dumps.
Step 2) Simplify the dag and read this "hosts" variable (now deserialized to get the dict back) like so -
hosts = Variable.get("hosts", deserialize_json=True)
for key in hosts:
    host = hosts.get(key)
    t = BashOperator(
        task_id='script_test_' + str(key),
        bash_command="{full_command} ".format(full_command=str(host.get('full_command'))),
        queue=str(host.get('queue_name')),
        dag=dag)
    t.set_upstream(t1)
Hope it helps someone else.
One way to do this would be to store the information in an Airflow Variable.
You can fetch the information needed to dynamically generate the DAG (and necessary configs) in a Variable and have W2 access it from there.
Variables are an airflow model that can be used to store static information (information that does not have an associated timestamp) that all tasks can access.