Apache airflow macro to get last dag run execution time - airflow

I thought the macro prev_execution_date listed here would get me the execution date of the last DAG run, but looking at the source code it seems to only get the last date based on the DAG schedule.
prev_execution_date = task.dag.previous_schedule(self.execution_date)
Is there any way via macros to get the execution date of the DAG when it doesn't run on a schedule?

Yes, you can define your own custom macro for this as follows:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
# custom macro function
def get_last_dag_run(dag):
    last_dag_run = dag.get_last_dagrun()
    if last_dag_run is None:
        return "no prev run"
    else:
        return last_dag_run.execution_date.strftime("%Y-%m-%d")
# add macro in user_defined_macros in dag definition
dag = DAG(dag_id="my_test_dag",
    schedule_interval='@daily',
    user_defined_macros={
        'last_dag_run_execution_date': get_last_dag_run
    }
)
# example of using it in practice
print_vals = BashOperator(
    task_id='print_vals',
    bash_command='echo {{ last_dag_run_execution_date(dag) }}',
    dag=dag
)
Note that dag.get_last_dagrun() is just one of the many functions available on the DAG object. Here's where I found it: https://github.com/apache/incubator-airflow/blob/v1-10-stable/airflow/models.py#L3396
You can also tweak the date format string and what is returned when there is no previous run.
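As a sketch of the tweaks mentioned above, the macro can take the date format and the fallback text as parameters. The helper name `format_last_run` and its parameters are hypothetical; it only assumes that the object returned by `dag.get_last_dagrun()` has an `execution_date` attribute, as in the answer's code.

```python
from datetime import datetime
from types import SimpleNamespace


def format_last_run(last_dag_run, fmt="%Y-%m-%d", default="no prev run"):
    """Format the execution date of a DAG run, or return `default` if there is none.

    Hypothetical helper: `last_dag_run` is whatever dag.get_last_dagrun() returned.
    """
    if last_dag_run is None:
        return default
    return last_dag_run.execution_date.strftime(fmt)


# Stand-in for a DagRun object, used only to exercise the helper outside Airflow.
fake_run = SimpleNamespace(execution_date=datetime(2020, 5, 28, 13, 45))

print(format_last_run(fake_run))                    # date only
print(format_last_run(fake_run, "%Y-%m-%dT%H:%M"))  # date and time
print(format_last_run(None, default="never"))       # fallback text
```

Inside a DAG, the macro function would call `format_last_run(dag.get_last_dagrun())` with whatever format and fallback suit the template.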

You can write your own custom macro function and use the Airflow models to query the metadata database.
def get_last_dag_run(dag_id):
    # TODO: search DB
    return xxx
dag = DAG(
    'example',
    schedule_interval='0 1 * * *',
    user_defined_macros={
        'last_dag_run_execution_date': get_last_dag_run,
    }
)
Then use the key (here last_dag_run_execution_date) in your template.
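The "search DB" step is left open in this answer. One way to fill it, sketched here under assumptions, is to fetch the runs for the DAG and pick the one with the latest execution date. The selection logic below is plain Python; `latest_execution_date` is a hypothetical name, and the `DagRun.find(dag_id=...)` call mentioned in the comment is only a suggestion for where the runs might come from, to be checked against your Airflow version.

```python
from datetime import datetime
from types import SimpleNamespace


def latest_execution_date(runs, fmt="%Y-%m-%d", default="no prev run"):
    """Return the formatted execution date of the most recent run in `runs`."""
    runs = list(runs)
    if not runs:
        return default
    newest = max(runs, key=lambda run: run.execution_date)
    return newest.execution_date.strftime(fmt)


# Assumption: inside Airflow the runs could come from the metadata DB, e.g.
#   from airflow.models import DagRun
#   runs = DagRun.find(dag_id=dag_id)
# Stand-in objects are used here so the sketch runs without Airflow.
fake_runs = [
    SimpleNamespace(execution_date=datetime(2020, 5, 26)),
    SimpleNamespace(execution_date=datetime(2020, 5, 28)),
    SimpleNamespace(execution_date=datetime(2020, 5, 27)),
]

print(latest_execution_date(fake_runs))
print(latest_execution_date([]))
```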

Related

How to add data to Airflow DAG Context? [duplicate]

We are working with Airflow and have 1000+ DAGs.
To manage DAG errors we use the same on_error_callback function to trigger alerts.
Each DAG has context information, expressible as constants, that I would like to share with the alerting stack.
Currently, I am only able to send the dag_id I retrieve from the context, via context['ti'].dag_id, and eventually the conf (parameters).
Is there a way to add other data (constants) to the context when declaring/creating the DAG?
Thanks.
You can pass params into the context.
dag = DAG(
    dag_id='example_dag',
    default_args=default_args,
    params={
        "param1": "value1",
        "param2": "value2"
    }
)
These are available in the context:
# example task that quickly outputs the context
start = PythonOperator(
    task_id='start',
    python_callable=lambda **context: print(context)
)
# outputs the context; in the printed dictionary you will see
# 'params': {'param1': 'value1'},
or via Jinja templates:
{{ params.param1 }}
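Since the question is about a shared failure callback, here is a minimal sketch of how such a callback could read the DAG-level params from the context. The function name `build_alert_payload` and the payload shape are hypothetical; the only things taken from the thread are `context['ti'].dag_id` and the `params` entry of the context.

```python
from types import SimpleNamespace


def build_alert_payload(context):
    """Collect the DAG id and the DAG-level params for the alerting stack."""
    params = context.get("params") or {}
    return {
        "dag_id": context["ti"].dag_id,
        # Copy so later changes to the payload do not touch the context.
        "constants": dict(params),
    }


# Fake context, only to show the shape outside Airflow.
fake_context = {
    "ti": SimpleNamespace(dag_id="example_dag"),
    "params": {"param1": "value1", "param2": "value2"},
}

payload = build_alert_payload(fake_context)
print(payload)
```

The shared callback would then forward this payload to the alerting system instead of the bare dag_id.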
Alternatively, define variables within the DAG via user_defined_macros and reference them with {{ }}.
Example (with a PROJECT_ID variable):
with models.DAG(
    ...
    user_defined_macros={"PROJECT_ID": PROJECT_ID}
) as dag:
    projectId = "{{ PROJECT_ID }}"

Airflow Get value from table in Bigquery and use the return value to pass into next task

I am trying to take data from a BigQuery dataset and pass the result value to bash_command so that it executes commands to remove files in Cloud Storage.
When I execute 'SELECT commandshell_variable_1 FROM sys_process.outbound_flat_file_config where file_number=1' the result is: gsutil rm -r gs://A/loop_member_*.csv
I want to pass the result of the query below to bash_command in the next task.
Thank you.
DAG code:
with DAG(
        dag_id='kodz_Automation',
        description='kodz_Automation',
        schedule_interval=None,
        catchup=False,
        default_args=DEFAULT_DAG_ARGS) as dag:
    def get_data_from_bq(**kwargs):
        hook = BigQueryHook(bigquery_conn_id='bigquery_default', delegate_to=None, use_legacy_sql=False)
        conn = hook.get_conn()
        cursor = conn.cursor()
        cursor.execute('SELECT commandshell_variable_1 FROM sys_process.outbound_flat_file_config where file_number=1')
        result = cursor.fetchall()
        print('result', result)
        return result
    fetch_data = PythonOperator(
        task_id='fetch_data_public_dataset',
        provide_context=True,
        python_callable=get_data_from_bq,
        dag=dag
    )
    also_run_this = bash_operator.BashOperator(
        task_id='also_run_this',
        bash_command=python_callable
    )
    fetch_data >> also_run_this
To send data from one task to another you can use Airflow's XCom feature.
With PythonOperator, the returned value is stored in XCom by default, so all you need to do is add an xcom_pull to the BashOperator, something like this:
also_run_this = bash_operator.BashOperator(
    task_id='also_run_this',
    bash_command="<your command> {{ ti.xcom_pull(task_ids='fetch_data_public_dataset') }}"
)
You can learn more about XCom in the Airflow documentation.
If you need to return a lot of data, though, I recommend saving it to some storage (S3, GCS, etc.) and passing only its address to the bash command.
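One detail to watch in the question's code: `cursor.fetchall()` returns a list of rows, so the value stored in XCom is a nested list rather than the command string itself. A small sketch of unwrapping it before returning follows; `first_cell` is a hypothetical helper, and it assumes DB-API-style rows (a sequence of sequences).

```python
def first_cell(rows):
    """Return the first column of the first row, or None if there are no rows."""
    if not rows:
        return None
    return rows[0][0]


# Shape assumed for the result of cursor.fetchall() on a one-column query.
rows = [["gsutil rm -r gs://A/loop_member_*.csv"]]
command = first_cell(rows)
print(command)
```

If `get_data_from_bq` ended with `return first_cell(result)`, a template such as `{{ ti.xcom_pull(task_ids='fetch_data_public_dataset') }}` would render to the bare command.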

Apache Airflow - use python result in the next steps

I am working on a simple Apache Airflow DAG. My goal is to:
1. calculate the date parameter based on the DAG run date - I am trying to achieve that with the Python operator.
2. pass the parameter calculated above as a bq query parameter.
Any ideas are welcome.
My code is below - I have marked the two points I am struggling with using the 'TODO' label.
...
def set_date_param(dag_run_time):
    # a business logic applied here
    ....
    return "2020-05-28"  # example result
# --------------------------------------------------------
# DAG definition below
# --------------------------------------------------------
# Python operator
set_data_param = PythonOperator(
    task_id='set_data_param',
    python_callable=set_date_param,
    provide_context=True,
    op_kwargs={
        "dag_run_date":  # TODO - how to pass the DAG running date as a function input parameter
    },
    dag=dag
)
# bq operator
load_data_to_bq_table = BigQueryOperator(
    task_id='load_data_to_bq_table',
    sql="""SELECT ccustomer_id, sales
    FROM `my_project.dataset1.table1`
    WHERE date_key = {date_key_param}
    """.format(
        date_key_param =
    ),  # TODO - how to get the python operator results from the previous step
    use_legacy_sql=False,
    destination_dataset_table="my_project.dataset2.table2",
    trigger_rule='all_success',
    dag=dag
)
set_data_param >> load_data_to_bq_table
For PythonOperator to pass the execution date to the python_callable, you only need to set provide_context=True (as your example already intends). Airflow then passes a collection of keyword arguments to the Python callable, whose names and values match the template variables listed in the Airflow documentation. That is, if you define the Python callable as set_date_param(ds, **kwargs): ..., the ds parameter automatically receives the execution date as a string in the format YYYY-MM-DD.
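A minimal sketch of such a callable, assuming a made-up business rule (use the day before the execution date). Only the `ds` argument and its `YYYY-MM-DD` format come from the explanation above; the rule itself is a placeholder for the real logic.

```python
from datetime import datetime, timedelta


def set_date_param(ds, **kwargs):
    """Return the day before the execution date as a YYYY-MM-DD string."""
    execution_day = datetime.strptime(ds, "%Y-%m-%d")
    return (execution_day - timedelta(days=1)).strftime("%Y-%m-%d")


# Airflow would supply `ds` and the other context values; here they are passed by hand.
print(set_date_param("2020-05-29", some_other_context_value=None))
```

Because the extra context values are absorbed by `**kwargs`, the function only needs to name the arguments it actually uses.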
XCOM allows task instances to exchange messages. To use the date returned by set_date_param() inside the sql query string of BigQueryOperator, you can combine XCOM with Jinja templating:
sql="""SELECT ccustomer_id, sales
FROM `my_project.dataset1.table1`
WHERE date_key = {{ task_instance.xcom_pull(task_ids='set_data_param') }}
"""
The following complete example puts all pieces together. In the example, the get_date task creates a date string based on the execution date. After that, the use_date task uses XCOM and Jinja templating to retrieve the date string and writes it to a log.
import logging
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago
default_args = {'start_date': days_ago(1)}
def calculate_date(ds, execution_date, **kwargs):
    return f'{ds} ({execution_date.strftime("%m/%d/%Y")})'
def log_date(date_string):
    logging.info(date_string)
with DAG(
    'a_dag',
    schedule_interval='*/5 * * * *',
    default_args=default_args,
    catchup=False,
) as dag:
    get_date = PythonOperator(
        task_id='get_date',
        python_callable=calculate_date,
        provide_context=True,
    )
    use_date = PythonOperator(
        task_id='use_date',
        python_callable=log_date,
        op_args=['Date: {{ task_instance.xcom_pull(task_ids="get_date") }}'],
    )
    get_date >> use_date

Airflow is taking jinja template as string

I'm trying to use a Jinja template in Airflow, but it is not being parsed and is treated as a plain string instead. Please see my code:
from datetime import datetime
from airflow.operators.python_operator import PythonOperator
from airflow.models import DAG
def test_method(dag, network_id, schema_name):
    print("Schema_name in test_method", schema_name)
    third_task = PythonOperator(
        task_id='first_task_' + network_id,
        provide_context=True,
        python_callable=print_context2,
        dag=dag)
    return third_task
dag = DAG('testing_xcoms_pull', description='Testing Xcoms',
          schedule_interval='0 12 * * *',
          start_date=datetime.today(),
          catchup=False)
def print_context(ds, **kwargs):
    return 'Returning from print_context'
def print_context2(ds, **kwargs):
    return 'Returning from print_context2'
def get_schema(ds, **kwargs):
    # Returning schema name based on network_id
    schema_name = "my_schema"
    return schema_name
first_task = PythonOperator(
    task_id='first_task',
    provide_context=True,
    python_callable=print_context,
    dag=dag)
second_task = PythonOperator(
    task_id='second_task',
    provide_context=True,
    python_callable=get_schema,
    dag=dag)
network_id = '{{ dag_run.conf["network_id"]}}'
first_task >> second_task >> test_method(
    dag=dag,
    network_id=network_id,
    schema_name='{{ ti.xcom_pull("second_task")}}')
The DAG creation is failing because '{{ dag_run.conf["network_id"]}}' is taken as a plain string by Airflow. Can anyone help me find the problem in my code?
Airflow operators have a class attribute called template_fields. It is usually declared at the top of the operator class; check out any of the operators in the GitHub code base.
If the field you are passing Jinja template syntax into is not in the template_fields list, the Jinja syntax is left as a literal string.
A DAG object, and its definition code, isn't parsed within the context of an execution; it is parsed in whatever environment is available when Python loads the file.
The network_id variable, which you use to build the task_id in your function, isn't templated prior to execution, and it can't be, since no execution is active at that point. Even with templating you still need a valid, static, non-templated task_id value to instantiate a task in the DAG.
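To make the template_fields behaviour concrete, here is a toy model in plain Python. It is not Airflow's implementation: `ToyOperator` and its `render` method are hypothetical, and the regex substitution merely stands in for Jinja. It only illustrates that attributes listed in template_fields get rendered while everything else, such as task_id, stays a literal string.

```python
import re


class ToyOperator:
    """Toy stand-in for an operator; NOT Airflow's implementation."""

    # Only attributes named here are rendered.
    template_fields = ("bash_command",)

    def __init__(self, task_id, bash_command):
        self.task_id = task_id
        self.bash_command = bash_command

    def render(self, context):
        """Replace {{ name }} placeholders in the templated fields only."""
        pattern = re.compile(r"\{\{\s*(\w+)\s*\}\}")
        for field in self.template_fields:
            raw = getattr(self, field)
            rendered = pattern.sub(lambda m: str(context[m.group(1)]), raw)
            setattr(self, field, rendered)


task = ToyOperator(task_id="task_{{ ds }}", bash_command="echo {{ ds }}")
task.render({"ds": "2020-05-28"})
print(task.task_id)       # left untouched, still contains the braces
print(task.bash_command)  # placeholder replaced with the context value
```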

Generating dynamic tasks in airflow based on output of an upstream task

How can I generate tasks dynamically based on the list returned from an upstream task?
I have tried the following:
Using an external file to write and read the list - this option works, but I am looking for a more elegant solution.
An XCom pull inside a subdag factory - this does not work.
I am able to pass a list from the upstream task to a subdag, but that XCom is only accessible inside a subdag's task and cannot be used to loop over the returned list and generate tasks.
For example, the subdag factory method:
def subdag1(parent_dag_name, child_dag_name, default_args, **kwargs):
    dag_subdag = DAG(
        dag_id='%s.%s' % (parent_dag_name, child_dag_name),
        default_args=default_args,
        schedule_interval="@once",
    )
    list_files = '{{ task_instance.xcom_pull( dag_id="qqq",task_ids="push")}}'
    for newtask in list_files:
        BashOperator(
            task_id='%s-task-%s' % (child_dag_name, 'a'),
            default_args=default_args,
            bash_command='echo ' + list_files + newtask,
            dag=dag_subdag,
        )
    return dag_subdag
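A plain-Python illustration of why the loop in this factory cannot produce one task per file: when the DAG file is parsed, list_files is still the literal template string, so iterating over it yields single characters rather than file names. Nothing here is Airflow-specific.

```python
# At parse time this is just a string; no XCom value has been pulled yet.
list_files = '{{ task_instance.xcom_pull( dag_id="qqq",task_ids="push")}}'

# Iterating over a string walks through its characters one by one.
pieces = [newtask for newtask in list_files]

print(pieces[:3])
print(len(pieces) == len(list_files))
```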
