How to add data to Airflow DAG Context? [duplicate]

This question already has answers here:
Pass other arguments to on_failure_callback
(2 answers)
Closed 11 months ago.
We are working with Airflow and have something like 1000+ DAGs.
To manage DAG errors, we use the same on_failure_callback function everywhere to trigger alerts.
Each DAG is supposed to carry context information, which could be expressed as constants, that I would like to share with the alerting stack.
Currently I am only able to send the dag_id, which I retrieve from the context via context['ti'].dag_id, and possibly the conf (parameters).
Is there a way to add other data (constants) to the context when declaring/creating the DAG?
Thanks.

You can pass params into the context.
dag = DAG(
    dag_id='example_dag',
    default_args=default_args,
    params={
        "param1": "value1",
        "param2": "value2"
    }
)
These are available in the context:
# example task that quickly prints the context
start = PythonOperator(
    task_id='start',
    python_callable=lambda **context: print(context)
)
# in the printed context you will see, among other entries:
# 'params': {'param1': 'value1', 'param2': 'value2'}
or via Jinja templates:
{{ params.param1 }}
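For instance, a minimal sketch (the task and echo command are illustrative, not from the original answer) of using a param in a templated operator field:
# hypothetical task: params are rendered in any templated field,
# such as BashOperator's bash_command
print_param = BashOperator(
    task_id='print_param',
    bash_command='echo {{ params.param1 }}',
    dag=dag,
)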

You can also define variables within a DAG via user_defined_macros and reference them with {{ }} in templated fields.
Example (with a PROJECT_ID variable):
with models.DAG(
    ...
    user_defined_macros={"PROJECT_ID": PROJECT_ID}
) as dag:
    projectId = "{{ PROJECT_ID }}"
Note that such a string is only rendered when it is passed to a templated operator field.
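A minimal sketch (the task is illustrative, not from the original answer) showing where the macro actually gets rendered, since plain Python strings at module level are not templated on their own:
# hypothetical task inside the same `with models.DAG(...)` block:
# the macro is expanded only when the templated field is rendered at run time
print_project = BashOperator(
    task_id='print_project',
    bash_command='echo {{ PROJECT_ID }}',
)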

Related

How to access dag run config inside dag code

I am using a DAG (s3_sensor_dag) to trigger another DAG (advanced_dag) and I pass the tag_names configurations to the triggered DAG (advanced_dag) using the conf argument. It looks something like this:
s3_sensor_dag.py:
trigger_advanced_dag = TriggerDagRunOperator(
    task_id="trigger_advanced_dag",
    trigger_dag_id="advanced_dag",
    wait_for_completion=True,
    conf={"tag_names": "{{ task_instance.xcom_pull(key='tag_names', task_ids='get_file_paths') }}"}
)
In the advanced_dag, I am trying to access the dag_conf (tag_names) like this:
advanced_dag.py:
with DAG(
    dag_id="advanced_dag",
    start_date=datetime(2020, 12, 23),
    schedule_interval=None,
    is_paused_upon_creation=False,
    catchup=False,
    dagrun_timeout=timedelta(minutes=60),
) as dag:
    dag_parser = DagParser(
        home=HOME,
        env=env,
        global_cli_flags=GLOBAL_CLI_FLAGS,
        tag=dag_run.conf["tag_names"]
    )
But I get an error stating that dag_run does not exist. I realized that this is a runtime variable, from Accessing configuration parameters passed to Airflow through CLI.
So I tried a solution mentioned in the comments there, which uses dag.get_dagrun(execution_date=dag.latest_execution_date).conf and goes something like:
dag_parser = DagParser(
    home=HOME,
    env=env,
    global_cli_flags=GLOBAL_CLI_FLAGS,
    tag=dag.get_dagrun(execution_date=dag.latest_execution_date).conf['tag_names']
)
But it doesn't fetch the value either.
I was able to work around this by using Airflow Variables, but I wanted to know if there is a way to use the dag_conf (which obviously only gets data at runtime) inside the DAG code and get the value.
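One pattern that avoids the parse-time problem (a sketch of my own, not from the thread; DagParser, HOME, env and GLOBAL_CLI_FLAGS are the poster's objects, and the Airflow 2.x import path is assumed) is to defer the conf lookup to runtime by doing it inside a task callable, where the dag_run object actually exists:
from airflow.operators.python import PythonOperator

# hypothetical sketch: read dag_run.conf inside a task at runtime,
# instead of at module parse time where no DagRun exists yet
def build_parser(**context):
    tag_names = context["dag_run"].conf["tag_names"]
    return DagParser(
        home=HOME,
        env=env,
        global_cli_flags=GLOBAL_CLI_FLAGS,
        tag=tag_names,
    )

build_parser_task = PythonOperator(
    task_id="build_parser",
    python_callable=build_parser,
    dag=dag,
)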

Apache Airflow - use python result in the next steps

I am working on a simple Apache Airflow DAG. My goal is to:
1. calculate the date parameter based on the DAG run date - I try to achieve that with the PythonOperator;
2. pass the parameter calculated above as a BigQuery query parameter.
Any ideas are welcome.
My code is below - I have marked the two points I am struggling with using 'TODO' labels.
...
def set_date_param(dag_run_time):
    # a business logic applied here
    ....
    return "2020-05-28"  # example result

# --------------------------------------------------------
# DAG definition below
# --------------------------------------------------------

# Python operator
set_data_param = PythonOperator(
    task_id='set_data_param',
    python_callable=set_date_param,
    provide_context=True,
    op_kwargs={
        "dag_run_time": # TODO - how to pass the DAG run date as a function input parameter
    },
    dag=dag
)

# bq operator
load_data_to_bq_table = BigQueryOperator(
    task_id='load_data_to_bq_table',
    sql="""SELECT ccustomer_id, sales
        FROM `my_project.dataset1.table1`
        WHERE date_key = {date_key_param}
        """.format(
        date_key_param=  # TODO - how to get the python operator result from the previous step
    ),
    use_legacy_sql=False,
    destination_dataset_table="my_project.dataset2.table2",
    trigger_rule='all_success',
    dag=dag
)

set_data_param >> load_data_to_bq_table
For PythonOperator to pass the execution date to the python_callable, you only need to set provide_context=True (as has already been done in your example). This way, Airflow automatically passes a collection of keyword arguments to the python callable, such that the names and values of these arguments are equivalent to the template variables described here. That is, if you define the python callable as set_date_param(ds, **kwargs): ..., the ds parameter will automatically get the execution date as a string value in the format YYYY-MM-DD.
XCOM allows task instances to exchange messages. To use the date returned by set_date_param() inside the sql query string of BigQueryOperator, you can combine XCOM with Jinja templating:
sql="""SELECT ccustomer_id, sales
FROM `my_project.dataset1.table1`
WHERE date_key = {{ task_instance.xcom_pull(task_ids='set_data_param') }}
"""
The following complete example puts all pieces together. In the example, the get_date task creates a date string based on the execution date. After that, the use_date task uses XCOM and Jinja templating to retrieve the date string and writes it to a log.
import logging

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago

default_args = {'start_date': days_ago(1)}

def calculate_date(ds, execution_date, **kwargs):
    return f'{ds} ({execution_date.strftime("%m/%d/%Y")})'

def log_date(date_string):
    logging.info(date_string)

with DAG(
    'a_dag',
    schedule_interval='*/5 * * * *',
    default_args=default_args,
    catchup=False,
) as dag:
    get_date = PythonOperator(
        task_id='get_date',
        python_callable=calculate_date,
        provide_context=True,
    )
    use_date = PythonOperator(
        task_id='use_date',
        python_callable=log_date,
        op_args=['Date: {{ task_instance.xcom_pull(task_ids="get_date") }}'],
    )
    get_date >> use_date

Apache airflow macro to get last dag run execution time

I thought the macro prev_execution_date listed here would get me the execution date of the last DAG run, but looking at the source code it seems to only get the last date based on the DAG schedule.
prev_execution_date = task.dag.previous_schedule(self.execution_date)
Is there any way via macros to get the execution date of the DAG when it doesn't run on a schedule?
Yes, you can define your own custom macro for this as follows:
# custom macro function
def get_last_dag_run(dag):
    last_dag_run = dag.get_last_dagrun()
    if last_dag_run is None:
        return "no prev run"
    else:
        return last_dag_run.execution_date.strftime("%Y-%m-%d")

# add macro in user_defined_macros in dag definition
dag = DAG(
    dag_id="my_test_dag",
    schedule_interval='@daily',
    user_defined_macros={
        'last_dag_run_execution_date': get_last_dag_run
    }
)

# example of using it in practice
print_vals = BashOperator(
    task_id='print_vals',
    bash_command='echo {{ last_dag_run_execution_date(dag) }}',
    dag=dag
)
Note that dag.get_last_dagrun() is just one of the many functions available on the DAG object. Here's where I found it: https://github.com/apache/incubator-airflow/blob/v1-10-stable/airflow/models.py#L3396
You can also tweak the formatting of the string for the date format, and what you want output if there is no previous run.
You can also write your own custom macro function that uses the Airflow models to query the metadata database:
def get_last_dag_run(dag_id):
    # TODO search the metadata DB
    return xxx

dag = DAG(
    'example',
    schedule_interval='0 1 * * *',
    user_defined_macros={
        'last_dag_run_execution_date': get_last_dag_run,
    }
)
Then use the macro key in your template.
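For completeness, here is one possible way to fill in that TODO; a sketch of my own (not part of the original answer), assuming an Airflow 1.10-style metadata database, the DagRun model, and the provide_session helper:
from airflow.models import DagRun
from airflow.utils.db import provide_session

@provide_session
def get_last_dag_run(dag_id, session=None):
    # query the metadata DB for the most recent run of the given DAG
    last_run = (session.query(DagRun)
                .filter(DagRun.dag_id == dag_id)
                .order_by(DagRun.execution_date.desc())
                .first())
    if last_run is None:
        return "no prev run"
    return last_run.execution_date.strftime("%Y-%m-%d")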

How to force Jinja templating on Airflow variable?

The Airflow docs say: You can use Jinja templating with every parameter that is marked as “templated” in the documentation. It makes sense that specific parameters in the Airflow world (such as certain parameters to PythonOperator) get templated by Airflow automatically. I'm wondering what the best/correct way is to get a non-Airflow variable templated. My specific use case is something similar to:
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from somewhere import export_votes_data, export_queries_data
from elsewhere import ApiCaucus, ApiQueries

dag = DAG('export_training_data',
          description='Export training data for all active orgs to GCS',
          schedule_interval=None,
          start_date=datetime(2018, 3, 26), catchup=False)

HOST = "http://api-00a.dev0.solvvy.co"
BUCKET = "gcs://my-bucket-name/{{ ds }}/"  # I'd like this to get templated

votes_api = ApiCaucus.get_votes_api(HOST)
queries_api = ApiQueries.get_queries_api(HOST)

export_votes = PythonOperator(task_id="export_votes", python_callable=export_votes_data,
                              op_args=[BUCKET, votes_api], dag=dag)

export_queries = PythonOperator(task_id="export_queries", python_callable=export_queries_data,
                                op_args=[BUCKET, queries_api, export_solutions.task_id], dag=dag,
                                provide_context=True)
The provide_context argument for the PythonOperator will pass along the arguments that are used for templating. From the documentation:
provide_context (bool) – if set to true, Airflow will pass a set of
keyword arguments that can be used in your function. This set of
kwargs correspond exactly to what you can use in your jinja templates.
For this to work, you need to define **kwargs in your function header.
With the context provided to your callable, you can then do the interpolation in your function:
def your_callable(bucket, api, **kwargs):
    bucket = bucket.format(**kwargs)
    [...]
Inside an operator's methods (execute/pre_execute/post_execute, or anywhere else you can get at the Airflow context), you can render the template yourself:
BUCKET = "gcs://my-bucket-name/{{ ds }}/"  # I'd like this to get templated
jinja_context = context['ti'].get_template_context()
rendered_content = self.render_template('', BUCKET, jinja_context)
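Another common pattern, shown here as a sketch of my own (assuming Airflow 2.x; the operator name and fields are hypothetical), is to subclass BaseOperator and declare the attribute in template_fields, so Airflow renders it automatically before execute() is called:
from airflow.models import BaseOperator

class ExportOperator(BaseOperator):
    # any attribute named in template_fields is rendered with the
    # task's Jinja context before execute() runs
    template_fields = ('bucket',)

    def __init__(self, bucket, api, **kwargs):
        super().__init__(**kwargs)
        self.bucket = bucket
        self.api = api

    def execute(self, context):
        # self.bucket already has {{ ds }} and friends resolved here
        print(f"exporting to {self.bucket}")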

Generating dynamic tasks in airflow based on output of an upstream task

How can I generate tasks dynamically based on the list returned from an upstream task?
I have tried the following:
1. Using an external file to write and read the list - this option works, but I am looking for a more elegant solution.
2. XCom pull inside a subdag factory - it does not work. I am able to pass a list from the upstream task to a subdag, but that XCom is only accessible inside the subdag's tasks and cannot be used to loop/iterate over the returned list and generate tasks.
For example, the subdag factory method:
def subdag1(parent_dag_name, child_dag_name, default_args, **kwargs):
    dag_subdag = DAG(
        dag_id='%s.%s' % (parent_dag_name, child_dag_name),
        default_args=default_args,
        schedule_interval="@once",
    )
    list_files = '{{ task_instance.xcom_pull(dag_id="qqq", task_ids="push") }}'
    for newtask in list_files:
        BashOperator(
            task_id='%s-task-%s' % (child_dag_name, 'a'),
            default_args=default_args,
            bash_command='echo ' + list_files + newtask,
            dag=dag_subdag,
        )
    return dag_subdag
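For reference, recent Airflow releases address exactly this with dynamic task mapping (added in Airflow 2.3; the sketch below uses the 2.4+ schedule argument). A minimal sketch, with DAG and task names that are illustrative rather than taken from the question:
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2023, 1, 1), catchup=False)
def mapped_example():
    @task
    def push():
        # the upstream task returns the list at runtime
        return ["file_a.txt", "file_b.txt"]

    @task
    def process(file_name):
        print(file_name)

    # one task instance is expanded per list element at runtime
    process.expand(file_name=push())

mapped_example()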
