Generating dynamic tasks in airflow based on output of an upstream task - airflow

How to generate tasks dynamically based on the list returned from an upstream task.
I have tried the following:
Using an external file to write and read from the list - this option works but I am looking for a more elegant solution.
Xcom pull inside a subdag factory. It does not work.
I am able to pass a list from the upstream task to a subdag but that xcom is only accessible inside of a subdag's task and cannot be used to loop/iterate over the returned list and generate tasks.
for e.g. subdag factory method.
def subdag1(parent_dag_name, child_dag_name, default_args,**kwargs):
dag_subdag = DAG(
dag_id='%s.%s' % (parent_dag_name, child_dag_name),
default_args=default_args,
schedule_interval="#once",
)
list_files='{{ task_instance.xcom_pull( dag_id="qqq",task_ids="push")}}'
for newtask in list_files:
BashOperator(
task_id='%s-task-%s' % (child_dag_name, 'a'),
default_args=default_args,
bash_command= 'echo '+ list_files + newtask,
dag=dag_subdag,
)
return dag_subdag

Related

Airflow Get value from table in Bigquery and use the return value to pass into next task

I am trying to Take data from BigQuery Dataset and pass the result value to bash_command so that it will execute commands to remove files in Cloud storage.
When I execute 'SELECT commandshell_variable_1 FROM sys_process.outbound_flat_file_config where file_number=1' the result is ..... gsutil rm -r gs://A/loop_member_*.csv
I want to use the result of below and Pass it to bash_command in next task ...
Thank you.
DAG Code
with DAG(
dag_id='kodz_Automation',
description='kodz_Automation',
schedule_interval=None,
catchup= False,
default_args=DEFAULT_DAG_ARGS) as dag:
def get_data_from_bq(**kwargs):
hook = BigQueryHook(bigquery_conn_id='bigquery_default', delegate_to=None, use_legacy_sql=False)
conn = hook.get_conn()
cursor = conn.cursor()
cursor.execute('SELECT commandshell_variable_1 FROM sys_process.outbound_flat_file_config where file_number=1')
result = cursor.fetchall()
print('result', result)
return result
fetch_data = PythonOperator(
task_id='fetch_data_public_dataset',
provide_context=True,
python_callable=get_data_from_bq,
dag=dag
)
also_run_this = bash_operator.BashOperator(
task_id='also_run_this',
bash_command=python_callable
)
fetch_data >> also_run_this
To send data from one task to another you can use Airflow XCOM feature.
Using PythonOperator, the returned value will be stored in XCOM by default, so all you need to do is add a xcom_pull in the BashOperator, something like this:
also_run_this = bash_operator.BashOperator(
task_id='also_run_this',
bash_command="<you command> {{ ti.xcom_pull(task_ids=[\'fetch_data_public_dataset\']) }}'"
)
Learn more of XCOM here
But if you will return a lot of data, I recommend saving this in some storage (like S3, GCS, etc) and then sending the link address to the bash command.

Apache Airflow - use python result in the next steps

I am working on some simple Apache Airflow DAG. My goal is to:
1. calculate the data parameter based on the DAG run date - I try achieve that by the Python operator.
2. pass the parameter calculated above as a bq query parameter.
Any ideas are welcom.
My code below - I have marked the two points with I am struggling with by the 'TODO' label.
...
def set_date_param(dag_run_time):
# a business logic applied here
....
return "2020-05-28" # example result
# --------------------------------------------------------
# DAG definition below
# --------------------------------------------------------
# Python operator
set_data_param = PythonOperator(
task_id='set_data_param',
python_callable=set_data_param,
provide_cotext=True,
op_kwargs={
"dag_run_date": #TODO - how to pass the DAG running date as a function input parameter
},
dag=dag
)
# bq operator
load_data_to_bq_table = BigQueryOperator(
task_id='load_data_to_bq_table',
sql="""SELECT ccustomer_id, sales
FROM `my_project.dataset1.table1`
WHERE date_key = {date_key_param}
""".format(
date_key_param =
), #TODO - how to get the python operator results from the previous step
use_legacy_sql=False,
destination_dataset_table="my_project.dataset2.table2}",
trigger_rule='all_success',
dag=dag
)
set_data_param >> load_data_to_bq_table
For PythonOperator to pass the execution date to the python_callable, you only need to set provide_cotext=True (as it has been already done in your example). This way, Airflow automatically passes a collection of keyword arguments to the python callable, such that the names and values of these arguments are equivalent to the template variables described here. That is, if you define the python callable as set_data_param(ds, **kwargs): ..., the ds parameter will automatically get the execution date as a string value in the format YYYY-MM-DD.
XCOM allows task instances to exchange messages. To use the date returned by set_date_param() inside the sql query string of BigQueryOperator, you can combine XCOM with Jinja templating:
sql="""SELECT ccustomer_id, sales
FROM `my_project.dataset1.table1`
WHERE date_key = {{ task_instance.xcom_pull(task_ids='set_data_param') }}
"""
The following complete example puts all pieces together. In the example, the get_date task creates a date string based on the execution date. After that, the use_date task uses XCOM and Jinja templating to retrieve the date string and writes it to a log.
import logging
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago
default_args = {'start_date': days_ago(1)}
def calculate_date(ds, execution_date, **kwargs):
return f'{ds} ({execution_date.strftime("%m/%d/%Y")})'
def log_date(date_string):
logging.info(date_string)
with DAG(
'a_dag',
schedule_interval='*/5 * * * *',
default_args=default_args,
catchup=False,
) as dag:
get_date = PythonOperator(
task_id='get_date',
python_callable=calculate_date,
provide_context=True,
)
use_date = PythonOperator(
task_id='use_date',
python_callable=log_date,
op_args=['Date: {{ task_instance.xcom_pull(task_ids="get_date") }}'],
)
get_date >> use_date

How to runn external DAG as part of my DAG?

I'm new to Airflow and I'm trying to run an external DAG (developed and owned by another team), as part of my DAG flow.
I was looking at SubDagOperator, but it seems that for some reason it enforces the name of the subdag to be . which I cannot do as the child dag is owned by a different team.
here is my code sample:
parent_dag = DAG(
dag_id='parent_dag', default_args=args,
schedule_interval=None)
external_dag = SubDagOperator(
subdag=another_teams_dag,
task_id='external_dag',
dag=parent_dag,
trigger_rule=TriggerRule.ALL_DONE
)
and the other team's dag is defined like this:
another_teams_dag = DAG(
dag_id='another_teams_dag', default_args=args,
schedule_interval=None)
but I'm getting this error:
The subdag's dag_id should have the form
'{parent_dag_id}.{this_task_id}'. Expected 'parent_dag.external_dag';
received 'another_teams_dag'.
Any ideas?
What am I missing?
Use TriggerDagRunOperator
More info: https://airflow.apache.org/code.html#airflow.operators.dagrun_operator.TriggerDagRunOperator
Example:
Dag that triggers: https://github.com/apache/incubator-airflow/blob/master/airflow/example_dags/example_trigger_controller_dag.py
Dag that is triggered: https://github.com/apache/incubator-airflow/blob/master/airflow/example_dags/example_trigger_target_dag.py
For your case, you can use something like:
trigger = TriggerDagRunOperator(task_id='external_dag',
trigger_dag_id="another_teams_dag",
dag=dag)

Apache airflow macro to get last dag run execution time

I thought the macro prev_execution_date listed here would get me the execution date of the last DAG run, but looking at the source code it seems to only get the last date based on the DAG schedule.
prev_execution_date = task.dag.previous_schedule(self.execution_date)
Is there any way via macros to get the execution date of the DAG when it doesn't run on a schedule?
Yes, you can define your own custom macro for this as follows:
# custom macro function
def get_last_dag_run(dag):
last_dag_run = dag.get_last_dagrun()
if last_dag_run is None:
return "no prev run"
else:
return last_dag_run.execution_date.strftime("%Y-%m-%d")
# add macro in user_defined_macros in dag definition
dag = DAG(dag_id="my_test_dag",
schedule_interval='#daily',
user_defined_macros={
'last_dag_run_execution_date': get_last_dag_run
}
)
# example of using it in practice
print_vals = BashOperator(
task_id='print_vals',
bash_command='echo {{ last_dag_run_execution_date(dag) }}',
dag=dag
)
Note that the dag.get_last_run() is just one of the many functions available on the Dag object. Here's where I found it: https://github.com/apache/incubator-airflow/blob/v1-10-stable/airflow/models.py#L3396
You can also tweak the formatting of the string for the date format, and what you want output if there is no previous run.
You can make your own user custom macro function, use airflow model to search meta-database.
def get_last_dag_run(dag_id):
//TODO search DB
return xxx
dag = DAG(
'example',
schedule_interval='0 1 * * *',
user_defined_macros={
'last_dag_run_execution_date': get_last_dag_run,
}
)
Then use the KEY in your template.

Add downstream task to every task without downstream in a DAG in Airflow 1.9

Problem: I've been trying to find a way to get tasks from a DAG that have no downstream tasks following them.
Why I need it: I'm building an "on success" notification for DAGs. Airflow DAGs have an on_success_callback argument, but problem with that is that it gets triggered after every task success instead of just DAG. I've seen other people approach this problem by creating notification task and appending it to the end. Problem I have with this approach is that many DAGs we're using have multiple ends, and some are auto-generated.
Making sure that all ends are caught manually is tedious.
I've spent hours digging for a way to access data I need to build this.
Sample DAG setup:
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from datetime import datetime
default_args = {
'owner': 'airflow',
'start_date': datetime(2018, 7, 29)}
dag = DAG(
'append_to_end',
description='append a tast to all tasks without downstream',
default_args=default_args,
schedule_interval='* * * * *',
catchup=False)
task_1 = DummyOperator(dag=dag, task_id='task_1')
task_2 = DummyOperator(dag=dag, task_id='task_2')
task_3 = DummyOperator(dag=dag, task_id='task_3')
task_1 >> task_2
task_1 >> task_3
This produces following DAG:
What I want to achieve is an automated way to include a new task to a DAG that connects to all ends, like in an image below.
I know it's an old post, but I've had a similar need as the above posted.
You can add to your return function a statement that doesn't return your "final_task" id, and so it won't be added to the get_leaf_task return, something like:
def get_leaf_tasks(dag):
return [task for task_id, task in dag.task_dict.items() if len(task.downstream_list) == 0 and task_ids != 'final_task']
Additionally, you can change this part:
for task in leaf_tasks:
task >> final_task
to:
get_leaf_tasks(dag) >> final_task
Since it already gives you a list of task instances and the bitwise operator ">>" will do the loop for you.
What I've got to so far is code below:
def get_leaf_tasks(dag):
return [task for task_id, task in dag.task_dict.items() if len(task.downstream_list) == 0]
leaf_tasks = get_leaf_tasks(dag)
final_task = DummyOperator(dag=dag, task_id='final_task')
for task in leaf_tasks:
task >> final_task
It produces the result I want, but what I don't like about this solution is that get_leaf_tasks must be executed before final_task is created, or it will be included in leaf_tasks list and I'll have to find ways to exclude it.
I could wrap assignment in another function:
def append_to_end(dag, task):
leaf_tasks = get_leaf_tasks(dag)
dag.add_task(task)
for task in leaf_tasks:
task >> final_task
final_task = DummyOperator(task_id='final_task')
append_to_end(dag, final_task)
This is not ideal either, as caller must ensure they've created a final_task without DAG assigned to it.

Resources