Airflow BashOperator Parameter From XCom Value

I am having a problem assigning an XCom value to the BashOperator.
All the parameters are retrieved properly except tmp_dir, which is an XCom value generated during init_dag. I was able to retrieve the value in my custom operator, but not in the BashOperator. I have added the outputs of the three different ways I tried.
I think one way could be to store that value in a variable, but I was not able to figure out how to do that either.
Any help will be highly appreciated.
Here is my DAG code:
import airflow
from airflow.models import DAG
from airflow.utils.dates import days_ago
from airflow.models import Variable
from utility import util
import os
from airflow.operators.bash_operator import BashOperator
from operators.mmm_operator import MMMOperator #it is a custom operator
from operators.iftp_operator import IFTPOperator #it is another custom operator
AF_DATAMONTH = util.get_date_by_format(deltaMth=2,deltaDay=0,ft='%b_%Y').lower() #it gives a date in required format
AF_FILENM_1 = 'SOME_FILE_' + AF_DATAMONTH + '.zip' #required filename for ftp
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago(0),
}
dag = DAG(dag_id='my_dag', default_args=default_args, schedule_interval=None)
init_dag = MMMOperator(
    task_id='init_dag',
    provide_context=True,
    mmm_oracle_conn_id=Variable.get('SOME_VARIABLE'),
    mmm_view="{0}.{1}".format(Variable.get('ANOTHER_VARIABLE'), AF_DAG_MMM_VIEW_NM),
    mmm_view_filter=None,
    mmm_kv_type=True,
    mmm_af_env_view="{0}.{1}".format(Variable.get('ANOTHER_VARIABLE_1'), Variable.get('ANOTHER_VARIABLE_2')),
    dag=dag
)  # local_tmp_folder is generated here and pushed via XCom

download_ftp_files = IFTPOperator(
    task_id='download_ftp_files',
    ftp_conn_id=util.getFromConfig("nt_conn_id"),        # value properly retrieved by xcom_pull
    operation='GET',
    source_path=util.getFromConfig("nt_remote_folder"),  # value properly retrieved by xcom_pull
    dest_path=util.getFromConfig("local_tmp_folder"),    # value properly retrieved by xcom_pull
    filenames=AF_FILENM,
    dag=dag
)
bash_cmd_template = "cd /vagrant/ && python3 hello_print.py {{params.client}} {{params.task}} {{params.environment}} {{params.tmp_dir}} {{params.af_file_nm}}"
#try 1 rendered value for params.tmp_dir: the literal string {{ ti.xcom_pull(task_ids="init_dag")["local_tmp_folder"] }} instead of the actual tmp folder location
#try 2 and try 3 output: Broken DAG: [/home/vagrant/airflow/dags/my_dag.py] name 'ti' is not defined - message in the UI
execute_main_py_script = BashOperator(
    task_id='execute_main_py_script',
    bash_command=bash_cmd_template,
    params={'client': 'some_client',
            'task': 'load_some_task',
            'environment': 'environment_name',
            #'tmp_dir': util.getFromConfig("local_tmp_folder"),                        #try 1
            #'tmp_dir': {{ ti.xcom_pull(task_ids="init_dag")["local_tmp_folder"] }},   #try 2
            #'tmp_dir': ti.xcom_pull(task_ids="init_dag")["local_tmp_folder"],         #try 3
            'af_file_nm': AF_FILENM_1
            },
    provide_context=True,
    dag=dag
)
init_dag >> download_ftp_files >> execute_main_py_script

The params argument of the BashOperator is not Jinja-templated, hence any value you pass in params is rendered "as-is".
You should directly pass the value of tmp_dir in bash_cmd_template as follows:
bash_cmd_template = """
cd /vagrant/ && python3 hello_print.py {{params.client}} {{params.task}} {{params.environment}} {{ ti.xcom_pull(task_ids="init_dag")["local_tmp_folder"] }} {{params.af_file_nm}}
"""
execute_main_py_script = BashOperator(
task_id='execute_main_py_script',
bash_command=bash_cmd_template,
params={'client' : 'some_client',
'task' : 'load_some_task',
'environment' : 'environment_name',
'af_file_nm' : AF_FILENM_1
},
provide_context=True,
dag=dag
)
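If you prefer not to embed the long xcom_pull expression in the command itself, another option (a sketch, not part of the original answer) is to pass the value through the env argument of the BashOperator, which is also a templated field. Note that when env is given it replaces, rather than extends, the environment variables the command sees:

bash_cmd_template = "cd /vagrant/ && python3 hello_print.py {{params.client}} {{params.task}} {{params.environment}} $TMP_DIR {{params.af_file_nm}}"

execute_main_py_script = BashOperator(
    task_id='execute_main_py_script',
    bash_command=bash_cmd_template,
    # env is listed in BashOperator.template_fields, so the XCom value is rendered at run time
    env={'TMP_DIR': '{{ ti.xcom_pull(task_ids="init_dag")["local_tmp_folder"] }}'},
    params={'client': 'some_client',
            'task': 'load_some_task',
            'environment': 'environment_name',
            'af_file_nm': AF_FILENM_1},
    dag=dag
)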

Related

How to set airflow `http_conn_id` with a param?

Running Airflow 2.2.2
I would like to parametrize the http_conn_id using the DAG input parameters as such:
with DAG(params={'api': 'my-api-id'}) as dag:
    post_op = SimpleHttpOperator(
        task_id='post_op',
        endpoint='custom-end-point',
        http_conn_id='{{ params.api }}',  # <- this doesn't get filled correctly
        dag=dag)
Where my-api-id is set in the Airflow Connections.
However, when executing, the operator evaluates http_conn_id as '{{ params.api }}'.
I'm suspecting this is not possible - or is an anti-pattern?
Airflow operators do not render all of their fields; they only render the fields listed in the template_fields attribute. For SimpleHttpOperator, you only have the fields:
template_fields: Sequence[str] = (
    'endpoint',
    'data',
    'headers',
)
To get around the problem, you can create a new class which extends the official operator and just adds the extra fields you want to render:
from datetime import datetime

from airflow import DAG
from airflow.providers.http.operators.http import SimpleHttpOperator


class MyHttpOperator(SimpleHttpOperator):
    template_fields = (
        *SimpleHttpOperator.template_fields,
        'http_conn_id',
    )


with DAG(
    dag_id='http_dag',
    start_date=datetime.today(),
    params={'api': 'my-api-id'}
) as dag:
    post_op = MyHttpOperator(
        task_id='post_op',
        endpoint='custom-end-point',
        http_conn_id='{{ params.api }}',
        dag=dag
    )
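A quick sanity check (a sketch): the subclass should now list the extra field, and at run time the rendered value must match the id of an existing Airflow Connection (my-api-id in this case):

# Confirms http_conn_id is now picked up for Jinja rendering
print(MyHttpOperator.template_fields)
# ('endpoint', 'data', 'headers', 'http_conn_id')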

Airflow | How DAG got started

Does anyone know how to find out how a DAG got started (whether it was started by the scheduler or manually)? I'm using Airflow 2.1.
I have a DAG that runs on an hourly basis, but there are times that I run it manually to test something. I want to capture how the DAG got started and pass that value to a column in a table where I'm saving some data. This will allow me to filter based on scheduled or manual starts and filter test information.
Thanks!
From an execution context, such as a python_callable provided to a PythonOperator, you can access the DagRun object related to the current execution:
def _print_dag_run(**kwargs):
    dag_run: DagRun = kwargs["dag_run"]
    print(f"Run type: {dag_run.run_type}")
    print(f"Externally triggered ?: {dag_run.external_trigger}")
Logs output:
[2021-09-08 18:53:52,188] {taskinstance.py:1300} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=airflow
AIRFLOW_CTX_DAG_ID=example_dagRun_info
AIRFLOW_CTX_TASK_ID=python_task
AIRFLOW_CTX_EXECUTION_DATE=2021-09-07T00:00:00+00:00
AIRFLOW_CTX_DAG_RUN_ID=backfill__2021-09-07T00:00:00+00:00
Run type: backfill
Externally triggered ?: False
dag_run.run_type would be: "manual", "scheduled" or "backfill". (not sure if there are others)
external_trigger docs:
external_trigger (bool) -- whether this dag run is externally triggered
You could also use Jinja to access default variables in templated fields; there is a variable representing the dag_run object:
bash_task = BashOperator(
    task_id="bash_task",
    bash_command="echo dag_run type is: {{ dag_run.run_type }}",
)
Full DAG:
from airflow import DAG
from airflow.models.dagrun import DagRun
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago

default_args = {
    "owner": "airflow",
}


def _print_dag_run(**kwargs):
    dag_run: DagRun = kwargs["dag_run"]
    print(f"Run type: {dag_run.run_type}")
    print(f"Externally triggered ?: {dag_run.external_trigger}")


dag = DAG(
    dag_id="example_dagRun_info",
    default_args=default_args,
    start_date=days_ago(1),
    schedule_interval="@once",
    tags=["example_dags", "params"],
    catchup=False,
)

with dag:
    python_task = PythonOperator(
        task_id="python_task",
        python_callable=_print_dag_run,
    )

    bash_task = BashOperator(
        task_id="bash_task",
        bash_command="echo dag_run type is: {{ dag_run.run_type }}",
    )
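Since dag_run is available in any templated field, you can also write the trigger type straight into the table you are loading. A minimal sketch, assuming a Postgres connection and a run_log table that are not part of the original question:

from airflow.providers.postgres.operators.postgres import PostgresOperator

# Hypothetical task: the connection id and the run_log table are placeholders
record_run_type = PostgresOperator(
    task_id="record_run_type",
    postgres_conn_id="my_postgres",
    sql="""
        INSERT INTO run_log (dag_id, execution_date, run_type)
        VALUES ('{{ dag.dag_id }}', '{{ ts }}', '{{ dag_run.run_type }}');
    """,
)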

How to pass output data between airflow SSHOperator tasks using xcom?

Problem summary:
I need to get stdout from one SSHOperator using xcom
Filter some rows and get output values for passing them to another SSHOperator
Unfortunately, I have not found anything helpful in the Airflow documentation.
Code example:
import airflow
from airflow.operators.dummy_operator import DummyOperator
from airflow.contrib.operators.ssh_operator import SSHOperator
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime.datetime(2020, 1, 1, 0, 0),
}
dag = airflow.DAG(
'example',
default_args=default_args,
)
task_dummy = DummyOperator(
task_id='task_dummy',
dag=dag
)
cmd_ssh = """
for f in "file1" "file2"
do
if $(hdfs dfs -test -d /data/$f)
then hdfs dfs -rm -r -skipTrash /data/$f
else echo "doesn't exists"
fi
done
"""
task_1 = SSHOperator(
ssh_conn_id='server_connection',
task_id='task_ssh',
command=cmd_ssh,
do_xcom_push=True,
dag=dag
)
My question is - how do I access the stdout from task_1 when I set do_xcom_push=True?
You can access the XCom data in templated fields or callables which receive the Airflow context, such as the PythonOperator (and its child classes) -- from the documentation:
# inside a PythonOperator called 'pushing_task'
def push_function():
    return value

# inside another PythonOperator where provide_context=True
def pull_function(**context):
    value = context['task_instance'].xcom_pull(task_ids='pushing_task')
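Applied to the SSHOperator example above, one possible shape (a sketch; the filtering logic and the second command are placeholders, and the base64 decoding is only needed on installations where enable_xcom_pickling is disabled, since the SSHOperator then pushes its stdout base64-encoded):

import base64

from airflow.operators.python_operator import PythonOperator

def filter_output(**context):
    raw = context['task_instance'].xcom_pull(task_ids='task_ssh')
    stdout = base64.b64decode(raw).decode('utf-8')
    # placeholder filtering: keep only the lines that mention "exist"
    wanted = [line for line in stdout.splitlines() if 'exist' in line]
    return ' '.join(wanted)  # the return value is pushed to XCom for the next task

filter_task = PythonOperator(
    task_id='filter_task',
    python_callable=filter_output,
    provide_context=True,
    dag=dag
)

task_2 = SSHOperator(
    ssh_conn_id='server_connection',
    task_id='task_ssh_2',
    # command is a templated field, so the filtered value can be pulled with Jinja
    command='echo "filtered output: {{ ti.xcom_pull(task_ids=\'filter_task\') }}"',
    dag=dag
)

task_1 >> filter_task >> task_2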

Jinja Template Variable Email ID not rendering when using ON_FAILURE_CALLBACK

Need help rendering the Jinja-templated email ID in the on_failure_callback.
I understand that template rendering works fine in a SQL file or with an operator that has template_fields. How do I get the code below to render the Jinja template variable?
It works fine with Variable.get('email_edw_alert'), but I don't want to use the Variable method, to avoid hitting the DB.
Below is the DAG file:
import datetime
import os
from functools import partial
from datetime import timedelta

from airflow.models import DAG, Variable
from airflow.contrib.operators.snowflake_operator import SnowflakeOperator
from alerts.email_operator import dag_failure_email


def get_db_dag(
    *,
    dag_id,
    start_date,
    schedule_interval,
    max_taskrun,
    max_dagrun,
    proc_nm,
    load_sql
):
    default_args = {
        'owner': 'airflow',
        'start_date': start_date,
        'provide_context': True,
        'execution_timeout': timedelta(minutes=max_taskrun),
        'retries': 0,
        'retry_delay': timedelta(minutes=3),
        'retry_exponential_backoff': True,
        'email_on_retry': False,
    }

    dag = DAG(
        dag_id=dag_id,
        schedule_interval=schedule_interval,
        dagrun_timeout=timedelta(hours=max_dagrun),
        template_searchpath=tmpl_search_path,
        default_args=default_args,
        max_active_runs=1,
        catchup='{{var.value.dag_catchup}}',
        on_failure_callback=partial(dag_failure_email, config={'email_address': '{{var.value.email_edw_alert}}'}),
    )

    load_table = SnowflakeOperator(
        task_id='load_table',
        sql=load_sql,
        snowflake_conn_id=CONN_ID,
        autocommit=True,
        dag=dag,
    )

    load_table

    return dag


# ======== DAG DEFINITIONS ========
edw_table_A = get_db_dag(
    dag_id='edw_table_A',
    start_date=datetime.datetime(2020, 5, 21),
    schedule_interval='0 5 * * *',
    max_taskrun=3,  # Minutes
    max_dagrun=1,   # Hours
    load_sql='recon/extract.sql',
)
Below is the Python code in alerts.email_operator:
import logging

from airflow.utils.email import send_email
from airflow.models import Variable

logger = logging.getLogger(__name__)

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def dag_failure_email(context, config=None):
    config = {} if config is None else config
    task_id = context.get('task_instance').task_id
    dag_id = context.get("dag").dag_id
    execution_time = context.get("execution_date").strftime(TIME_FORMAT)
    reason = context.get("exception")
    alerting_email_address = config.get('email_address')

    dag_failure_html_body = f"""<html>
    <header><title>The following DAG has failed!</title></header>
    <body>
    <b>DAG Name</b>: {dag_id}<br/>
    <b>Task Id</b>: {task_id}<br/>
    <b>Execution Time (UTC)</b>: {execution_time}<br/>
    <b>Reason for Failure</b>: {reason}<br/>
    </body>
    </html>
    """

    try:
        if reason != 'dagrun_timeout':
            send_email(
                to=alerting_email_address,
                subject=f"Airflow alert: <DagInstance: {dag_id} - {execution_time} [failed]",
                html_content=dag_failure_html_body,
            )
    except Exception as e:
        logger.error(
            f'Error in sending email to address {alerting_email_address}: {e}',
            exc_info=True,
        )
I have also tried another way, but the one below is not working either:
try:
    if reason != 'dagrun_timeout':
        send_email = EmailOperator(
            to=alerting_email_address,
            task_id='email_task',
            subject=f"Airflow alert: <DagInstance: {dag_id} - {execution_time} [failed]",
            params={'content1': 'random'},
            html_content=dag_failure_html_body,
        )
        send_email.dag = context['dag']
        # send_email.to = send_email.get_template_env().from_string(send_email.to).render(**context)
        send_email.to = send_email.render_template(alerting_email_address, send_email.to, context)
        send_email.execute(context)
except Exception as e:
    logger.error(
        f'Error in sending email to address {alerting_email_address}: {e}',
        exc_info=True,
    )
I don't think templates will work this way - something has to explicitly render the template. Usually Jinja templates in Airflow are used to pass templated fields through to operators, and they are rendered using the render_template function (https://airflow.apache.org/docs/stable/_modules/airflow/models/baseoperator.html#BaseOperator.render_template).
Since your callback function isn't an operator, it won't have this method by default.
I think the best thing to do here would be to either explicitly call Variable.get during runtime of the callback function itself, rather than in the DAG definition, or implement some version of that render_template_fields function in your callback. Both of these solutions would result only in hitting the DB during runtime of this task, rather than whenever the DAG is created.
Edit: Just saw your attempt to do the rendering explicitly via the operator. Are the fields that you want templated specified in template_fields of the EmailOperator?
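A minimal sketch of the first suggestion, moving the Variable lookup into the callback body so the metadata DB is only queried when the callback actually fires, not on every DAG parse:

def dag_failure_email(context, config=None):
    config = {} if config is None else config
    # The template string passed from the DAG file is never rendered here,
    # so look the address up at callback run time instead.
    alerting_email_address = Variable.get('email_edw_alert')
    # ... rest of the function unchanged

In the DAG file you can then pass on_failure_callback=dag_failure_email directly, without the partial and the '{{var.value.email_edw_alert}}' template string.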

Apache Airflow - use python result in the next steps

I am working on a simple Apache Airflow DAG. My goal is to:
1. calculate the date parameter based on the DAG run date - I try to achieve that with the PythonOperator.
2. pass the parameter calculated above as a BigQuery query parameter.
Any ideas are welcome.
My code is below - I have marked the two points I am struggling with using 'TODO' labels.
...

def set_date_param(dag_run_time):
    # a business logic applied here
    ....
    return "2020-05-28"  # example result


# --------------------------------------------------------
# DAG definition below
# --------------------------------------------------------

# Python operator
set_data_param = PythonOperator(
    task_id='set_data_param',
    python_callable=set_date_param,
    provide_context=True,
    op_kwargs={
        "dag_run_date":  # TODO - how to pass the DAG running date as a function input parameter
    },
    dag=dag
)

# bq operator
load_data_to_bq_table = BigQueryOperator(
    task_id='load_data_to_bq_table',
    sql="""SELECT ccustomer_id, sales
           FROM `my_project.dataset1.table1`
           WHERE date_key = {date_key_param}
        """.format(
        date_key_param=
    ),  # TODO - how to get the python operator results from the previous step
    use_legacy_sql=False,
    destination_dataset_table="my_project.dataset2.table2",
    trigger_rule='all_success',
    dag=dag
)

set_data_param >> load_data_to_bq_table
For the PythonOperator to pass the execution date to the python_callable, you only need to set provide_context=True (as has already been done in your example). This way, Airflow automatically passes a collection of keyword arguments to the python callable, such that the names and values of these arguments are equivalent to the template variables described here. That is, if you define the python callable as set_date_param(ds, **kwargs): ..., the ds parameter will automatically get the execution date as a string value in the format YYYY-MM-DD.
XCOM allows task instances to exchange messages. To use the date returned by set_date_param() inside the sql query string of BigQueryOperator, you can combine XCOM with Jinja templating:
sql="""SELECT ccustomer_id, sales
FROM `my_project.dataset1.table1`
WHERE date_key = {{ task_instance.xcom_pull(task_ids='set_data_param') }}
"""
The following complete example puts all pieces together. In the example, the get_date task creates a date string based on the execution date. After that, the use_date task uses XCOM and Jinja templating to retrieve the date string and writes it to a log.
import logging

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago

default_args = {'start_date': days_ago(1)}


def calculate_date(ds, execution_date, **kwargs):
    return f'{ds} ({execution_date.strftime("%m/%d/%Y")})'


def log_date(date_string):
    logging.info(date_string)


with DAG(
    'a_dag',
    schedule_interval='*/5 * * * *',
    default_args=default_args,
    catchup=False,
) as dag:
    get_date = PythonOperator(
        task_id='get_date',
        python_callable=calculate_date,
        provide_context=True,
    )

    use_date = PythonOperator(
        task_id='use_date',
        python_callable=log_date,
        op_args=['Date: {{ task_instance.xcom_pull(task_ids="get_date") }}'],
    )

    get_date >> use_date
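Applying the same idea back to the BigQueryOperator from the question: sql is a templated field, so the xcom_pull expression is rendered at run time (a sketch; quoting the pulled value as a string literal assumes date_key is stored as a string):

load_data_to_bq_table = BigQueryOperator(
    task_id='load_data_to_bq_table',
    sql="""SELECT ccustomer_id, sales
           FROM `my_project.dataset1.table1`
           WHERE date_key = '{{ task_instance.xcom_pull(task_ids="set_data_param") }}'
        """,
    use_legacy_sql=False,
    destination_dataset_table="my_project.dataset2.table2",
    trigger_rule='all_success',
    dag=dag
)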
