How to use BigQueryOperator with execution_date? - airflow

This is my code:
EXEC_TIMESTAMP = "{{ execution_date.strftime('%Y-%m-%d %H:%M') }}"

query = """
select ... where date_purchased between TIMESTAMP_TRUNC(cast({{ params.run_timestamp }} as TIMESTAMP), HOUR, 'UTC') ...
"""

generate_op = BigQueryOperator(
    bql=query,
    destination_dataset_table=table_name,
    task_id='generate',
    bigquery_conn_id=CONNECTION_ID,
    use_legacy_sql=False,
    write_disposition='WRITE_TRUNCATE',
    create_disposition='CREATE_IF_NEEDED',
    query_params={'run_timestamp': EXEC_TIMESTAMP},
    dag=dag)
This should work but it doesn't.
The render tab shows me:
between TIMESTAMP_TRUNC(cast ( as TIMESTAMP), HOUR, 'UTC')
The date is missing. It's being rendered into nothing.
How can I fix this? There is no provide_context=True for this operator. I don't know what to do.

Luis, the query_params are not the params you can refer to in the templating context; they are not added to it. And since params is empty, your {{ params.run_timestamp }} is either "" or None. If you changed that to params={'run_timestamp':…} it would still have a problem, because params values are not themselves templated. So when you use a templated field like bql to include {{ params.run_timestamp }}, you get exactly what's in params: {'run_timestamp': …str… }, filled in WITHOUT any recursive expansion of that value. You would get the literal string {{ execution_date.strftime('%Y-%m-%d %H:%M') }}.
Let me try re-writing this for you (but I may have got the parens around cast incorrectly, not sure):
generate_op = BigQueryOperator(
    sql="""
        select ...
        where date_purchased between
            TIMESTAMP_TRUNC(cast('{{ execution_date.strftime('%Y-%m-%d %H:%M') }}' as TIMESTAMP), HOUR, 'UTC')
        ...
    """,
    destination_dataset_table=table_name,
    task_id='generate',
    bigquery_conn_id=CONNECTION_ID,
    use_legacy_sql=False,
    write_disposition='WRITE_TRUNCATE',
    create_disposition='CREATE_IF_NEEDED',
    dag=dag,
)
You can see the bql and sql fields are templated. However, the bql field is deprecated and has been removed in later versions.
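If you want to see what a templated field renders to without executing the task, the Rendered tab in the UI shows it, and the CLI should be able to as well; assuming an Airflow 1.x-era install (matching the contrib imports used in this thread), something like:

airflow render my_dag generate 2018-12-27T11:15:00

(my_dag is a placeholder DAG id here; in Airflow 2.x the equivalent subcommand is airflow tasks render.)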

The issue is that you are using query_params, which is not a templated field, as @dlamblin mentioned.
Use the following code, which uses execution_date directly inside bql:
from airflow.models import DAG, Variable
from airflow.operators.dummy_operator import DummyOperator
from datetime import datetime, timedelta
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

CONNECTION_ID = Variable.get("Your_Connection")

args = {
    'owner': 'airflow',
    'start_date': datetime(2018, 12, 27, 11, 15),
    'retries': 4,
    'retry_delay': timedelta(minutes=10),
}

dag = DAG(
    dag_id='My_Test_DAG',
    default_args=args,
    schedule_interval='15 * * * *',
    max_active_runs=1,
    catchup=False,
)

query = """select customers_email_address as email
from mytable
where date_purchased = TIMESTAMP_SUB(TIMESTAMP_TRUNC(cast('{{ execution_date.strftime('%Y-%m-%d %H:%M') }}' as TIMESTAMP), HOUR, 'UTC'), INTERVAL 1 HOUR)"""

create_orders_temp_table_op = BigQueryOperator(
    bql=query,
    destination_dataset_table='some table',
    task_id='create_orders_temp_table',
    bigquery_conn_id=CONNECTION_ID,
    use_legacy_sql=False,
    write_disposition='WRITE_TRUNCATE',
    create_disposition='CREATE_IF_NEEDED',
    dag=dag)
start_task_op = DummyOperator(task_id='start_task', dag=dag)
start_task_op >> create_orders_temp_table_op

Related

DatabricksRunOperator Execution date

opr_run_now = DatabricksRunNowOperator(
    task_id='run_now',
    databricks_conn_id='databricks_default',
    job_id=754377,
    notebook_params=meta_data,
    dag=dag
)
Is there a way to pass the execution date using the Databricks run operator?
What do you want to pass the execution_date to? What are you trying to achieve in the end? The following doc was helpful for me:
https://www.astronomer.io/guides/airflow-databricks
And here is an example where I am passing execution_date to be used in a python file run in Databricks. I'm capturing the execution_date using sys.argv.
from airflow import DAG
from airflow.providers.databricks.operators.databricks import (
    DatabricksRunNowOperator,
)
from datetime import datetime, timedelta

spark_python_task = {
    "python_file": "dbfs:/FileStore/sandbox/databricks_test_python_task.py"
}

# Define params for Run Now Operator
python_params = [
    "{{ execution_date }}",
    "{{ execution_date.subtract(hours=1) }}",
]

default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=2),
}

with DAG(
    dag_id="databricks_dag",
    start_date=datetime(2022, 3, 11),
    schedule_interval="@hourly",
    catchup=False,
    default_args=default_args,
    max_active_runs=1,
) as dag:
    opr_run_now = DatabricksRunNowOperator(
        task_id="run_now",
        databricks_conn_id="Databricks",
        job_id=2060,
        python_params=python_params,
    )

    opr_run_now
There are two ways to set up the Databricks run operators. One is with named arguments (as you did), which doesn't support templating. The second is to pass the JSON payload that you would typically use to call api/2.0/jobs/run-now; this way also gives you the ability to pass execution_date, since the json parameter is templated.
notebook_task_params = {
    'new_cluster': new_cluster,
    'notebook_task': {
        'notebook_path': '/test-{{ ds }}',
    },
}

DatabricksSubmitRunOperator(task_id='notebook_task', json=notebook_task_params)
For more information see the operator docs.
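A slightly fuller sketch of the same idea; the cluster spec and the base_parameters value below are illustrative assumptions, not taken from the answer above:

from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

# Hypothetical cluster spec, for illustration only.
new_cluster = {
    'spark_version': '10.4.x-scala2.12',
    'node_type_id': 'i3.xlarge',
    'num_workers': 2,
}

notebook_task_params = {
    'new_cluster': new_cluster,
    'notebook_task': {
        # Rendered because the whole json payload is a templated field.
        'notebook_path': '/test-{{ ds }}',
        # base_parameters are passed to the notebook as widget values.
        'base_parameters': {'run_ts': '{{ execution_date }}'},
    },
}

notebook_task = DatabricksSubmitRunOperator(
    task_id='notebook_task',
    json=notebook_task_params,
    dag=dag,
)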

Jinja Template Variable Email ID not rendering when using ON_FAILURE_CALLBACK

I need help rendering the Jinja-templated email ID in the on_failure_callback.
I understand that rendering templates works fine in SQL files or with operators that have template_fields. How do I get the code below to render the Jinja template variable?
It works fine with Variable.get('email_edw_alert'), but I don't want to use the Variable method, to avoid hitting the DB.
Below is the DAG file:
import datetime
import os
from functools import partial
from datetime import timedelta
from airflow.models import DAG, Variable
from airflow.contrib.operators.snowflake_operator import SnowflakeOperator
from alerts.email_operator import dag_failure_email

def get_db_dag(
    *,
    dag_id,
    start_date,
    schedule_interval,
    max_taskrun,
    max_dagrun,
    proc_nm,
    load_sql
):
    default_args = {
        'owner': 'airflow',
        'start_date': start_date,
        'provide_context': True,
        'execution_timeout': timedelta(minutes=max_taskrun),
        'retries': 0,
        'retry_delay': timedelta(minutes=3),
        'retry_exponential_backoff': True,
        'email_on_retry': False,
    }
    dag = DAG(
        dag_id=dag_id,
        schedule_interval=schedule_interval,
        dagrun_timeout=timedelta(hours=max_dagrun),
        template_searchpath=tmpl_search_path,
        default_args=default_args,
        max_active_runs=1,
        catchup='{{var.value.dag_catchup}}',
        on_failure_callback=partial(dag_failure_email, config={'email_address': '{{var.value.email_edw_alert}}'}),
    )
    load_table = SnowflakeOperator(
        task_id='load_table',
        sql=load_sql,
        snowflake_conn_id=CONN_ID,
        autocommit=True,
        dag=dag,
    )
    load_table
    return dag

# ======== DAG DEFINITIONS #
edw_table_A = get_db_dag(
    dag_id='edw_table_A',
    start_date=datetime.datetime(2020, 5, 21),
    schedule_interval='0 5 * * *',
    max_taskrun=3,  # Minutes
    max_dagrun=1,   # Hours
    load_sql='recon/extract.sql',
)
Below is the Python code in alerts.email_operator:
import logging
from airflow.utils.email import send_email
from airflow.models import Variable

logger = logging.getLogger(__name__)
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"

def dag_failure_email(context, config=None):
    config = {} if config is None else config
    task_id = context.get('task_instance').task_id
    dag_id = context.get("dag").dag_id
    execution_time = context.get("execution_date").strftime(TIME_FORMAT)
    reason = context.get("exception")
    alerting_email_address = config.get('email_address')
    dag_failure_html_body = f"""<html>
    <header><title>The following DAG has failed!</title></header>
    <body>
    <b>DAG Name</b>: {dag_id}<br/>
    <b>Task Id</b>: {task_id}<br/>
    <b>Execution Time (UTC)</b>: {execution_time}<br/>
    <b>Reason for Failure</b>: {reason}<br/>
    </body>
    </html>
    """
    try:
        if reason != 'dagrun_timeout':
            send_email(
                to=alerting_email_address,
                subject=f"Airflow alert: <DagInstance: {dag_id} - {execution_time} [failed]",
                html_content=dag_failure_html_body,
            )
    except Exception as e:
        logger.error(
            f'Error in sending email to address {alerting_email_address}: {e}',
            exc_info=True,
        )
I have also tried another way, but the one below is not working either:
try:
    if reason != 'dagrun_timeout':
        send_email = EmailOperator(
            to=alerting_email_address,
            task_id='email_task',
            subject=f"Airflow alert: <DagInstance: {dag_id} - {execution_time} [failed]",
            params={'content1': 'random'},
            html_content=dag_failure_html_body,
        )
        send_email.dag = context['dag']
        # send_email.to = send_email.get_template_env().from_string(send_email.to).render(**context)
        send_email.to = send_email.render_template(alerting_email_address, send_email.to, context)
        send_email.execute(context)
except Exception as e:
    logger.error(
        f'Error in sending email to address {alerting_email_address}: {e}',
        exc_info=True,
    )
I don't think templates work this way; you'd have to have something specifically parse the template. Usually Jinja templates in Airflow are used to pass templated fields through to operators, and they are rendered using the render_template function (https://airflow.apache.org/docs/stable/_modules/airflow/models/baseoperator.html#BaseOperator.render_template).
Since your callback function isn't an operator, it won't have this method by default.
I think the best thing to do here would be either to call Variable.get explicitly during runtime of the callback function itself, rather than in the DAG definition, or to implement some version of that render_template_fields function in your callback. Both of these solutions would result in hitting the DB only during runtime of this task, rather than every time the DAG file is parsed.
Edit: Just saw your attempt to do the rendering explicitly via the operator. Are the fields that you want templated specified in the template_fields of the EmailOperator?
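A minimal sketch of the first suggestion, deferring the Variable.get to the body of the callback (reusing the dag_failure_email signature from the question):

from airflow.models import Variable
from airflow.utils.email import send_email

def dag_failure_email(context, config=None):
    config = {} if config is None else config
    # Variable.get runs at callback time, so the DB is hit only when a
    # task actually fails, not on every parse of the DAG file.
    alerting_email_address = config.get('email_address') or Variable.get('email_edw_alert')
    dag_id = context.get('dag').dag_id
    send_email(
        to=alerting_email_address,
        subject=f'Airflow alert: DAG {dag_id} failed',
        html_content=f'<b>DAG Name</b>: {dag_id}',
    )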

Airflow sql_path not able to read the sql files when passed as Jinja Template Variable

I am trying to use a Jinja template variable instead of Variable.get('sql_path'), so as to avoid hitting the DB for every scan of the DAG file.
Original code:
import datetime
import os
from functools import partial
from datetime import timedelta
from airflow.models import DAG, Variable
from airflow.contrib.operators.snowflake_operator import SnowflakeOperator
from alerts.email_operator import dag_failure_email

SNOWFLAKE_CONN_ID = 'etl_conn'
tmpl_search_path = []
for subdir in ['business/', 'audit/', 'business/transform/']:
    tmpl_search_path.append(os.path.join(Variable.get('sql_path'), subdir))

def get_db_dag(
    *,
    dag_id,
    start_date,
    schedule_interval,
    max_taskrun,
    max_dagrun,
    proc_nm,
    load_sql
):
    default_args = {
        'owner': 'airflow',
        'start_date': start_date,
        'provide_context': True,
        'execution_timeout': timedelta(minutes=max_taskrun),
        'retries': 0,
        'retry_delay': timedelta(minutes=3),
        'retry_exponential_backoff': True,
        'email_on_retry': False,
    }
    dag = DAG(
        dag_id=dag_id,
        schedule_interval=schedule_interval,
        dagrun_timeout=timedelta(hours=max_dagrun),
        template_searchpath=tmpl_search_path,
        default_args=default_args,
        max_active_runs=1,
        catchup='{{var.value.dag_catchup}}',
        on_failure_callback=alert_email_callback,
    )
    load_table = SnowflakeOperator(
        task_id='load_table',
        sql=load_sql,
        snowflake_conn_id=SNOWFLAKE_CONN_ID,
        autocommit=True,
        dag=dag,
    )
    load_table
    return dag

# ======== DAG DEFINITIONS #
edw_table_A = get_db_dag(
    dag_id='edw_table_A',
    start_date=datetime.datetime(2020, 5, 21),
    schedule_interval='0 5 * * *',
    max_taskrun=3,  # Minutes
    max_dagrun=1,   # Hours
    load_sql='recon/extract.sql',
)
When I replaced Variable.get('sql_path') with the Jinja template '{{var.value.sql_path}}' as below and ran the DAG, it threw the error shown further down; it was not able to find the file in the subdirectory of the SQL folder.
tmpl_search_path = []
for subdir in ['bus/', 'audit/', 'business/snflk/']:
    tmpl_search_path.append(os.path.join('{{var.value.sql_path}}', subdir))
I got the error below:
jinja2.exceptions.TemplateNotFound: extract.sql
Templates are not rendered everywhere in a DAG script; usually they are rendered in the templated parameters of operators. So, unless you pass the elements of tmpl_search_path to some templated parameter, {{var.value.sql_path}} will not be rendered.
The template_searchpath of a DAG is not templated. That is why you cannot pass Jinja templates to it.
The options I can think of are:
Hardcode the value of Variable.get('sql_path') in the pipeline script.
Save the value of Variable.get('sql_path') in a configuration file and read it from there in the pipeline script.
Move the Variable.get() call out of the for-loop (see the sketch below); this results in three times fewer requests to the database.
More info about templating in Airflow.
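A minimal sketch of the third option; Variable.get is called once per DAG-file parse instead of once per subdirectory:

import os
from airflow.models import Variable

# One DB request per parse instead of three.
sql_path = Variable.get('sql_path')

tmpl_search_path = [
    os.path.join(sql_path, subdir)
    for subdir in ['business/', 'audit/', 'business/transform/']
]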

Airflow BashOperator Parameter From XCom Value

I am having a problem assigning an XCom value to the BashOperator.
All the parameters are properly retrieved except tmp_dir, which is an XCom value generated during init_dag. I was able to retrieve the value in my custom operator, but not in the BashOperator. I have added the outputs of the three different ways I tried that came to mind.
I think one way could be to store that value in a variable, but I was also not able to figure out how.
Any help will be highly appreciated.
Here is my DAG code:
import airflow
from airflow.models import DAG
from airflow.utils.dates import days_ago
from airflow.models import Variable
from utility import util
import os
from airflow.operators.bash_operator import BashOperator
from operators.mmm_operator import MMMOperator  # a custom operator
from operators.iftp_operator import IFTPOperator  # another custom operator

AF_DATAMONTH = util.get_date_by_format(deltaMth=2, deltaDay=0, ft='%b_%Y').lower()  # gives a date in the required format
AF_FILENM_1 = 'SOME_FILE_' + AF_DATAMONTH + '.zip'  # required filename for ftp

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago(0),
}

dag = DAG(dag_id='my_dag', default_args=default_args, schedule_interval=None)

init_dag = MMMOperator(
    task_id='init_dag',
    provide_context=True,
    mmm_oracle_conn_id=Variable.get('SOME_VARIABLE'),
    mmm_view="{0}.{1}".format(Variable.get('ANOTHER_VARIABLE'), AF_DAG_MMM_VIEW_NM),
    mmm_view_filter=None,
    mmm_kv_type=True,
    mmm_af_env_view="{0}.{1}".format(Variable.get('ANOTHER_VARIABLE_1'), Variable.get('ANOTHER_VARIABLE_2')),
    dag=dag
)  # local_tmp_folder is generated here and pushed via XCom

download_ftp_files = IFTPOperator(
    task_id='download_ftp_files',
    ftp_conn_id=util.getFromConfig("nt_conn_id"),  # value properly retrieved by xcom_pull
    operation='GET',
    source_path=util.getFromConfig("nt_remote_folder"),  # value properly retrieved by xcom_pull
    dest_path=util.getFromConfig("local_tmp_folder"),  # value properly retrieved by xcom_pull
    filenames=AF_FILENM,
    dag=dag
)

bash_cmd_template = "cd /vagrant/ && python3 hello_print.py {{params.client}} {{params.task}} {{params.environment}} {{params.tmp_dir}} {{params.af_file_nm}}"
# try 1 output value for params.tmp_dir: {{ ti.xcom_pull(task_ids="init_dag")["local_tmp_folder"] }} - instead of the actual tmp folder location
# try 2 and try 3 output: Broken DAG: [/home/vagrant/airflow/dags/my_dag.py] name 'ti' is not defined - message in UI

execute_main_py_script = BashOperator(
    task_id='execute_main_py_script',
    bash_command=bash_cmd_template,
    params={
        'client': 'some_client',
        'task': 'load_some_task',
        'environment': 'environment_name',
        # 'tmp_dir': util.getFromConfig("local_tmp_folder"),  # try 1
        # 'tmp_dir': {{ ti.xcom_pull(task_ids="init_dag")["local_tmp_folder"] }},  # try 2
        # 'tmp_dir': ti.xcom_pull(task_ids="init_dag")["local_tmp_folder"],  # try 3
        'af_file_nm': AF_FILENM_1
    },
    provide_context=True,
    dag=dag
)
init_dag >> download_ftp_files >> execute_main_py_script
The params argument of the BashOperator is not Jinja-templated, hence any values you pass in params are rendered "as-is".
You should put the value of tmp_dir directly in bash_cmd_template as follows:
bash_cmd_template = """
cd /vagrant/ && python3 hello_print.py {{params.client}} {{params.task}} {{params.environment}} {{ ti.xcom_pull(task_ids="init_dag")["local_tmp_folder"] }} {{params.af_file_nm}}
"""

execute_main_py_script = BashOperator(
    task_id='execute_main_py_script',
    bash_command=bash_cmd_template,
    params={
        'client': 'some_client',
        'task': 'load_some_task',
        'environment': 'environment_name',
        'af_file_nm': AF_FILENM_1
    },
    dag=dag
)
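Alternatively, since env (unlike params) is a templated field of the BashOperator, the XCom value could be passed in as an environment variable; a sketch, assuming the same init_dag task:

execute_main_py_script = BashOperator(
    task_id='execute_main_py_script',
    # $TMP_DIR is expanded by bash at runtime; the Jinja expression in env
    # has already been rendered by Airflow at that point.
    bash_command='cd /vagrant/ && python3 hello_print.py "$TMP_DIR"',
    # Note: when env is set it replaces the task's environment rather than
    # extending it, so add any other variables the script needs.
    env={'TMP_DIR': '{{ ti.xcom_pull(task_ids="init_dag")["local_tmp_folder"] }}'},
    dag=dag,
)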

Airflow - how to make EmailOperator html_content dynamic?

I'm looking for a method that will allow the content of the emails sent by a given EmailOperator task to be set dynamically. Ideally I would like to make the email contents dependent on the results of an xcom call, preferably through the html_content argument.
alert = EmailOperator(
    task_id=alertTaskID,
    to='please@dontreply.com',
    subject='Airflow processing report',
    html_content='raw content #2',
    dag=dag
)
I notice that the Airflow docs say that XCom calls can be embedded in templates. Perhaps there is a way to formulate an XCom pull using a template on a specified task ID and then pass the result in as html_content? Thanks.
Use PythonOperator + send_email instead:
from airflow.operators.python_operator import PythonOperator
from airflow.utils.email import send_email

def email_callback(**kwargs):
    with open('/path/to.html') as f:
        content = f.read()
    send_email(
        to=[
            # emails
        ],
        subject='subject',
        html_content=content,
    )

email_task = PythonOperator(
    task_id='task_id',
    python_callable=email_callback,
    provide_context=True,
    dag=dag,
)
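If the body should depend on an XCom value rather than a file, you can pull it inside the callable; a sketch, assuming a hypothetical upstream task with task_id 'generate_report' that pushed the value:

def email_callback(**kwargs):
    # With provide_context=True the task instance arrives as 'ti'.
    report = kwargs['ti'].xcom_pull(task_ids='generate_report')
    send_email(
        to=['someone@example.com'],  # placeholder address
        subject='Airflow processing report',
        html_content=f'<p>{report}</p>',
    )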
For those looking for an exact example of using a Jinja template with the EmailOperator, here is one:
from airflow.operators.email_operator import EmailOperator
from datetime import timedelta, datetime

email_task = EmailOperator(
    to='some@email.com',
    task_id='email_task',
    subject='Templated Subject: start_date {{ ds }}',
    params={'content1': 'random'},
    html_content="Templated Content: content1 - {{ params.content1 }} task_key - {{ task_instance_key_str }} test_mode - {{ test_mode }} task_owner - {{ task.owner }} hostname - {{ ti.hostname }}",
    dag=dag)
You can test run the above code snippet using
airflow test dag_name email_task 2017-05-10
Might as well answer this myself. It turns out to be fairly straightforward using the template + XCom route. This code snippet works in the context of an already defined DAG. It uses the BashOperator instead of the EmailOperator because that's easier to test.
def pushparam(param, ds, **kwargs):
    kwargs['ti'].xcom_push(key='specificKey', value=param)
    return

loadxcom = PythonOperator(
    task_id='loadxcom',
    python_callable=pushparam,
    provide_context=True,
    op_args=['your_message_here'],
    dag=dag)

template2 = """
echo "{{ params.my_param }}"
echo "{{ task_instance.xcom_pull(task_ids='loadxcom', key='specificKey') }}"
"""

t5 = BashOperator(
    task_id='tt2',
    bash_command=template2,
    params={'my_param': 'PARAMETER1'},
    dag=dag)
It can be tested on the command line using something like this:
airflow test dag_name loadxcom 2015-12-31
airflow test dag_name tt2 2015-12-31
I will eventually test with EmailOperator and add something here if it doesn't work...
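For what it's worth, the same XCom pull should also work directly in the EmailOperator, since html_content is one of its templated fields; a sketch building on the loadxcom task above:

send_report = EmailOperator(
    task_id='send_report',
    to='some@email.com',
    subject='Airflow processing report for {{ ds }}',
    # html_content is templated, so the pull is rendered exactly as it is
    # in the BashOperator example above.
    html_content='Result: {{ task_instance.xcom_pull(task_ids="loadxcom", key="specificKey") }}',
    dag=dag,
)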
