Airflow set DAG options conditionally - airflow

I'm implementing a python script to create a bunch of Airflow dag based on json config files. One json config file contains all the fields to be used in DAG(), and the last three fields are optional(will use global default if not set).
{
"owner": "Mike",
"start_date": "2022-04-10",
"schedule_interval": "0 0 * * *",
"on_failure_callback": "slack",
"is_paused_upon_creation": False,
"catchup": True
}
Now, my question is how to create the DAG conditionally? Since the last three option on_failure_callback, is_paused_upon_creation and catchup is optional, wonder what's the best way to use them in DAG()?
One solution_1 I tried is to use default_arg=optional_fields, and add optional fields into it with an if statement. However, it doesn't work. The DAG is not taking these three optional fields' values.
def create_dag(name, config):
# config is a dict that generate from the json config file
optional_fields = {
'owner': config['owner']
}
if 'on_failure_callback' in config:
optional_fields['on_failure_callback'] = partial(xxx(config['on_failure_callback']))
if 'is_paused_upon_creation' in config:
optional_fields['is_paused_upon_creation'] = config['is_paused_upon_creation']
dag = DAG(
dag_id=name,
start_date=datetime.strptime(config['start_date'], '%Y-%m-%d'),
schedule_interval=config['schedule_interval'],
default_args=optional_fields
)
Then, I tried solution_2 with **optional_fields, but got an error TypeError: __init__() got an unexpected keyword argument 'owner'
dag = DAG(
dag_id=name,
start_date=datetime.strptime(config['start_date'], '%Y-%m-%d'),
schedule_interval=config['schedule_interval'],
**optional_fields
)
Then solution_3 works as the following.
def create_dag(name, config):
# config is a dict that generate from the json config file
default_args = {
'owner': config['owner']
}
optional_fields = {}
if 'on_failure_callback' in config:
optional_fields['on_failure_callback'] = partial(xxx(config['on_failure_callback']))
if 'is_paused_upon_creation' in config:
optional_fields['is_paused_upon_creation'] = config['is_paused_upon_creation']
dag = DAG(
dag_id=name,
start_date=datetime.strptime(config['start_date'], '%Y-%m-%d'),
schedule_interval=config['schedule_interval'],
default_args=default_args
**optional_fields
)
However, I'm confused 1) which fields should be set in optional_fields vs default_args? 2) is there any other way to achieve it?

Related

how to pass default values for run time input variable in airflow for scheduled execution

I come across one issue while running DAG in airflow. my code is working in two scenarios where is failing for one.
below are my scenarios,
Manual trigger with input - Running Fine
Manual trigger without input - Running Fine
Scheduled Run - Failing
Below is my code:
def decide_the_flow(**kwargs):
cleanup=kwargs['dag_run'].conf.get('cleanup','N')
print("IP is :",cleanup)
return cleanup
I am getting below error,
cleanup=kwargs['dag_run'].conf.get('cleanup','N')
AttributeError: 'NoneType' object has no attribute 'get'
I tried to define default variables like,
default_dag_args = {
'start_date':days_ago(0),
'params': {
"cleanup": "N"
},
'retries': 0
}
but it wont work.
I am using BranchPythonOperator to call this function.
Scheduling : enter image description here
Can anyone please guide me here. what I am missing ?
For workaround i am using below code,
try:
cleanup=kwargs['dag_run'].conf.get('cleanup','N')
except:
cleanup="N"
You can access the parameters from the context dict params, because airflow defines the default values on this dict after copying the dict dag_run.conf and checking if there is something missing:
from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator
def decide_the_flow(**kwargs):
cleanup = kwargs['params']["cleanup"]
print(f"IP is : {cleanup}")
return cleanup
with DAG(
dag_id='airflow_params',
start_date=datetime(2022, 8, 25),
schedule_interval="* * * * *",
params={
"cleanup": "N",
},
catchup=False
) as dag:
branch_task = BranchPythonOperator(
task_id='test_param',
python_callable=decide_the_flow
)
task_n = EmptyOperator(task_id="N")
task_m = EmptyOperator(task_id="M")
branch_task >> [task_n, task_m]
I just tested it in scheduled and manual (with and without conf) runs, it works fine.

Airflow Xcom not getting resolved return task_instance string

I am facing an odd issue with xcom_pull where it is always returning back a xcom_pull string
"{{ task_instance.xcom_pull(dag_id = 'cf_test',task_ids='get_config_val',key='http_con_id') }}"
My requirement is simple I have pushed an xcom using python operator and with xcom_pull I am trying to retrieve the value and pass it as an http_conn_id for SimpleHttpOperator but the variable is returning a string instead of resolving xcom_pull value.
Python Operator is successfully able to push XCom.
Code:
from datetime import datetime
import simplejson as json
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.providers.http.operators.http import SimpleHttpOperator
from google.auth.transport.requests import Request
default_airflow_args = {
"owner": "divyaansh",
"depends_on_past": False,
"start_date": datetime(2022, 5, 18),
"retries": 0,
"schedule_interval": "#hourly",
}
project_configs = {
"project_id": "test",
"conn_id": "google_cloud_storage_default",
"bucket_name": "test-transfer",
"folder_name": "processed-test-rdf",
}
def get_config_vals(**kwargs) -> dict:
"""
Get config vals from airlfow variable and store it as xcoms
"""
task_instance = kwargs["task_instance"]
task_instance.xcom_push(key="http_con_id", value="gcp_cloud_function")
def generate_api_token(cf_name: str):
"""
generate token for api request
"""
import google.oauth2.id_token
request = Request()
target_audience = f"https://us-central1-test-a2h.cloudfunctions.net/{cf_name}"
return google.oauth2.id_token.fetch_id_token(
request=request, audience=target_audience
)
with DAG(
dag_id="cf_test",
default_args=default_airflow_args,
catchup=False,
render_template_as_native_obj=True,
) as dag:
start = DummyOperator(task_id="start")
config_vals = PythonOperator(
task_id="get_config_val", python_callable=get_config_vals, provide_context=True
)
ip_data = json.dumps(
{
"bucket_name": project_configs["bucket_name"],
"file_name": "dummy",
"target_location": "/valid",
}
)
conn_id = "{{ task_instance.xcom_pull(dag_id = 'cf_test',task_ids='get_config_val',key='http_con_id') }}"
api_token = generate_api_token("new-cp")
cf_task = SimpleHttpOperator(
task_id="file_decrypt_and_validate_cf",
http_conn_id=conn_id,
method="POST",
endpoint="new-cp",
data=json.dumps(
json.dumps(
{
"bucket_name": "test-transfer",
"file_name": [
"processed-test-rdf/dummy_20220501.txt",
"processed-test-rdf/dummy_20220502.txt",
],
"target_location": "/valid",
}
)
),
headers={
"Authorization": f"bearer {api_token}",
"Content-Type": "application/json",
},
do_xcom_push=True,
log_response=True,
)
print("task new-cp", cf_task)
check_flow = DummyOperator(task_id="check_flow")
end = DummyOperator(task_id="end")
start >> config_vals >> cf_task >> check_flow >> end
Error Message:
raise AirflowNotFoundException(f"The conn_id `{conn_id}` isn't defined") airflow.exceptions.AirflowNotFoundException: The conn_id `"{{ task_instance.xcom_pull(dag_id = 'cf_test',task_ids='get_config_val',key='http_con_id') }}"` isn't defined
I have tried several different days but nothing seems to be working.
Can someone point me to the right direction here.
Airflow-version : 2.2.3
Composer-version : 2.0.11
In SimpleHttpOperator the http_conn_id parameter is not templated field thus you can not use Jinja engine with it. This means that this parameter can not be rendered. So when you pass "{{ task_instance.xcom_pull(dag_id = 'cf_test',task_ids='get_config_val',key='http_con_id') }}" to the operator you expect it to be replaced during runtime with the value stored in Xcom by previous task but in fact Airflow consider it just as a regular string this is also what the exception tells you. It actually try to search a connection with the name of your very long string but couldn't find it so it tells you that the connection is not defined.
To solve it you can create a custom operator:
class MySimpleHttpOperator(SimpleHttpOperator):
template_fields = SimpleHttpOperator.template_fields + ("http_conn_id",)
Then you should replace SimpleHttpOperator with MySimpleHttpOperator in your DAG.
This change makes the string that you set in http_conn_id to be passed via the Jinja engine. So in your case the string will be replaced with the Xcom value as you expect.

DatabricksRunOperator Execution date

opr_run_now = DatabricksRunNowOperator(
task_id = 'run_now',
databricks_conn_id = 'databricks_default',
job_id = 754377,
notebook_params = meta_data,
dag = dag
) here
Is there way to pass execution date using databricks run operator.
What do you want to pass the execution_date to? What are you trying to achieve in the end? The following doc was helpful for me:
https://www.astronomer.io/guides/airflow-databricks
And here is an example where I am passing execution_date to be used in a python file run in Databricks. I'm capturing the execution_date using sys.argv.
from airflow import DAG
from airflow.providers.databricks.operators.databricks import (
DatabricksRunNowOperator,
)
from datetime import datetime, timedelta
spark_python_task = {
"python_file": "dbfs:/FileStore/sandbox/databricks_test_python_task.py"
}
# Define params for Run Now Operator
python_params = [
"{{ execution_date }}",
"{{ execution_date.subtract(hours=1) }}",
]
default_args = {
"owner": "airflow",
"depends_on_past": False,
"email_on_failure": False,
"email_on_retry": False,
"retries": 1,
"retry_delay": timedelta(minutes=2),
}
with DAG(
dag_id="databricks_dag",
start_date=datetime(2022, 3, 11),
schedule_interval="#hourly",
catchup=False,
default_args=default_args,
max_active_runs=1,
) as dag:
opr_run_now = DatabricksRunNowOperator(
task_id="run_now",
databricks_conn_id="Databricks",
job_id=2060,
python_params=python_params,
)
opr_run_now
There are two ways to set DatabricksRunOperator. One with named arguments (as you did) - which doesn't support templating. The second way is to use JSON payload that you typically use to call the api/2.0/jobs/run-now - This way also gives you the ability to pass execution_date as the json parameter is templated.
notebook_task_params = {
'new_cluster': new_cluster,
'notebook_task': {
'notebook_path': '/test-{{ ds }}',
}
DatabricksSubmitRunOperator(task_id='notebook_task', json=notebook_task_params)
For more information see the operator docs.

Jinja Template Variable Email ID not rendering when using ON_FAILURE_CALLBACK

Need help on rendering the jinja template email ID in the On_failure_callback.
I understand that rendering templates work fine in the SQL file or with the operator having template_fields .How do I get below code rendered the jinja template variable
It works fine with Variable.get('email_edw_alert'), but I don't want to use Variable method to avoid hitting DB
Below is the Dag file
import datetime
import os
from functools import partial
from datetime import timedelta
from airflow.models import DAG,Variable
from airflow.contrib.operators.snowflake_operator import SnowflakeOperator
from alerts.email_operator import dag_failure_email
def get_db_dag(
*,
dag_id,
start_date,
schedule_interval,
max_taskrun,
max_dagrun,
proc_nm,
load_sql
):
default_args = {
'owner': 'airflow',
'start_date': start_date,
'provide_context': True,
'execution_timeout': timedelta(minutes=max_taskrun),
'retries': 0,
'retry_delay': timedelta(minutes=3),
'retry_exponential_backoff': True,
'email_on_retry': False,
}
dag = DAG(
dag_id=dag_id,
schedule_interval=schedule_interval,
dagrun_timeout=timedelta(hours=max_dagrun),
template_searchpath=tmpl_search_path,
default_args=default_args,
max_active_runs=1,
catchup='{{var.value.dag_catchup}}',
on_failure_callback=partial(dag_failure_email, config={'email_address': '{{var.value.email_edw_alert}}'}),
)
load_table = SnowflakeOperator(
task_id='load_table',
sql=load_sql,
snowflake_conn_id=CONN_ID,
autocommit=True,
dag=dag,
)
load_table
return dag
# ======== DAG DEFINITIONS #
edw_table_A = get_db_dag(
dag_id='edw_table_A',
start_date=datetime.datetime(2020, 5, 21),
schedule_interval='0 5 * * *',
max_taskrun=3, # Minutes
max_dagrun=1, # Hours
load_sql='recon/extract.sql',
)
Below is the python code alerts.email_operator
import logging
from airflow.utils.email import send_email
from airflow.models import Variable
logger = logging.getLogger(__name__)
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"
def dag_failure_email(context, config=None):
config = {} if config is None else config
task_id = context.get('task_instance').task_id
dag_id = context.get("dag").dag_id
execution_time = context.get("execution_date").strftime(TIME_FORMAT)
reason = context.get("exception")
alerting_email_address = config.get('email_address')
dag_failure_html_body = f"""<html>
<header><title>The following DAG has failed!</title></header>
<body>
<b>DAG Name</b>: {dag_id}<br/>
<b>Task Id</b>: {task_id}<br/>
<b>Execution Time (UTC)</b>: {execution_time}<br/>
<b>Reason for Failure</b>: {reason}<br/>
</body>
</html>
"""
try:
if reason != 'dagrun_timeout':
send_email(
to=alerting_email_address,
subject=f"Airflow alert: <DagInstance: {dag_id} - {execution_time} [failed]",
html_content=dag_failure_html_body,
)
except Exception as e:
logger.error(
f'Error in sending email to address {alerting_email_address}: {e}',
exc_info=True,
)
I have also tried other way too , even below one is not working
try:
if reason != 'dagrun_timeout':
send_email = EmailOperator(
to=alerting_email_address,
task_id='email_task',
subject=f"Airflow alert: <DagInstance: {dag_id} - {execution_time} [failed]",
params={'content1': 'random'},
html_content=dag_failure_html_body,
)
send_email.dag = context['dag']
#send_email.to = send_email.get_template_env().from_string(send_email.to).render(**context)
send_email.to = send_email.render_template(alerting_email_address, send_email.to, context)
send_email.execute(context)
except Exception as e:
logger.error(
f'Error in sending email to address {alerting_email_address}: {e}',
exc_info=True,
)
I don't think templates would work in this way - you'll have to have something specifically parse the template. Usually jinja templates in Airflow are used to pass templated fields through to operators, and rendered using the render_template function (https://airflow.apache.org/docs/stable/_modules/airflow/models/baseoperator.html#BaseOperator.render_template)
Since your callback function isn't an operator, it won't have this method by default.
I think the best thing to do here would be to either explicitly call Variable.get during runtime of the callback function itself, rather than in the DAG definition, or implement some version of that render_template_fields function in your callback. Both of these solutions would result only in hitting the DB during runtime of this task, rather than whenever the DAG is created.
Edit: Just saw your attempt to do the rendering explicitly via the operator. Are the fields that you want to be templated specified as templated_fields within email operator?

Airflow sql_path not able to read the sql files when passed as Jinja Template Variable

I am trying to use Jinja template variable as against using Variable.get('sql_path'), So as to avoid hitting DB for every scan of the dag file
Original code
import datetime
import os
from functools import partial
from datetime import timedelta
from airflow.models import DAG,Variable
from airflow.contrib.operators.snowflake_operator import SnowflakeOperator
from alerts.email_operator import dag_failure_email
SNOWFLAKE_CONN_ID = 'etl_conn'
tmpl_search_path = []
for subdir in ['business/', 'audit/', 'business/transform/']:
tmpl_search_path.append(os.path.join(Variable.get('sql_path'), subdir))
def get_db_dag(
*,
dag_id,
start_date,
schedule_interval,
max_taskrun,
max_dagrun,
proc_nm,
load_sql
):
default_args = {
'owner': 'airflow',
'start_date': start_date,
'provide_context': True,
'execution_timeout': timedelta(minutes=max_taskrun),
'retries': 0,
'retry_delay': timedelta(minutes=3),
'retry_exponential_backoff': True,
'email_on_retry': False,
}
dag = DAG(
dag_id=dag_id,
schedule_interval=schedule_interval,
dagrun_timeout=timedelta(hours=max_dagrun),
template_searchpath=tmpl_search_path,
default_args=default_args,
max_active_runs=1,
catchup='{{var.value.dag_catchup}}',
on_failure_callback=alert_email_callback,
)
load_table = SnowflakeOperator(
task_id='load_table',
sql=load_sql,
snowflake_conn_id=SNOWFLAKE_CONN_ID,
autocommit=True,
dag=dag,
)
load_vcc_svc_recon
return dag
# ======== DAG DEFINITIONS #
edw_table_A = get_db_dag(
dag_id='edw_table_A',
start_date=datetime.datetime(2020, 5, 21),
schedule_interval='0 5 * * *',
max_taskrun=3, # Minutes
max_dagrun=1, # Hours
load_sql='recon/extract.sql',
)
When I have replaced Variable.get('sql_path') with Jinja Template '{{var.value.sql_path}}' as below and ran the Dag, it threw an error as below, it was not able to get the file from the subdirectory of the SQL folder
tmpl_search_path = []
for subdir in ['bus/', 'audit/', 'business/snflk/']:
tmpl_search_path.append(os.path.join('{{var.value.sql_path}}', subdir))
Got below error as
inja2.exceptions.TemplateNotFound: extract.sql
Templates are not rendered everywhere in a DAG script. Usually they are rendered in the templated parameters of Operators. So, unless you pass the elements of tmpl_search_path to some templated parameter {{var.value.sql_path}} will not be rendered.
The template_searchpath of DAG is not templated. That is why you cannot pass Jinja templates to it.
The options of which I can think are
Hardcode the value of Variable.get('sql_path') in the pipeline script.
Save the value of Variable.get('sql_path') in a configuration file and read it from there in the pipeline script.
Move the Variable.get() call out of the for-loop. This will result in three times fewer requests to the database.
More info about templating in Airflow.

Resources