Airflow: AirflowException: Issues in reading JSON template variable - airflow

Requirement: I am trying to avoid Variable.get() and use the Jinja template {{var.json.variable}} instead.
I have defined the variables in JSON format, as in the example below, and stored them in the secret manager as snflk_json.
snflk_json
{
    "snwflke_acct_request_memory": "4000Mi",
    "snwflke_acct_limit_memory": "4000Mi",
    "schedule_interval_snwflke_acct": "0 12 * * *",
    "LIST": [
        "ABC.DEV", "CDD.PROD"
    ]
}
Issue 1: I am unable to retrieve the schedule interval from the JSON variable.
Error: Invalid timetable expression: Exactly 5 or 6 columns has to be specified for iterator expression.
I tried to use it in the DAG as below:
schedule_interval = '{{var.json.snflk_json.schedule_interval_snwflke_acct}}',
Issue 2:
I am trying to loop over LIST to create one task per entry. I tried the code below, but in vain:
with DAG(
    dag_id=dag_id,
    default_args=default_args,
    schedule_interval='{{var.json.usage_snwflk_acct_admin_config.schedule_interval_snwflke_acct}}',
    dagrun_timeout=timedelta(hours=3),
    max_active_runs=1,
    catchup=False,
    params={},
    tags=tags
) as dag:
    shares = '{{var.json.snflk_json.LIST}}'
    for s in shares:
        sf_tasks = SnowflakeOperator(
            task_id=f"{s}",
            snowflake_conn_id=snowflake_conn_id,
            sql=sqls,
            params={"sf_env": s},
        )
Error
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/baseoperator.py", line 754, in __init__
validate_key(task_id)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/helpers.py", line 63, in validate_key
raise AirflowException(
airflow.exceptions.AirflowException: The key '{' has to be made of alphanumeric characters, dashes, dots and underscores exclusively

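Airflow parses the DAG file repeatedly (every 30 seconds by default), and Jinja templates are not rendered at parse time. The for loop therefore runs over the literal string {{var.json.snflk_json.LIST}}, character by character, so the first task_id is '{', and that is why you get that error. The same applies to schedule_interval, which is not a templated field.
You should use dynamic task mapping (available from version 2.3), or put the code in a Python task that creates the new tasks and executes them.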
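To see why both errors appear, it can help to run the parse-time logic outside Airflow. The sketch below is plain Python with no Airflow imports; the regular expression only approximates the task_id rule quoted in the error message (alphanumerics, dashes, dots, underscores) and is an assumption, not Airflow's own code.

```python
import re

# At parse time the Jinja expression is just a Python string; nothing has rendered it.
shares = "{{var.json.snflk_json.LIST}}"
schedule = "{{var.json.snflk_json.schedule_interval_snwflke_acct}}"

# Approximation of the task_id rule from the error message (an assumption,
# not Airflow's actual implementation).
KEY_PATTERN = re.compile(r"^[A-Za-z0-9_.\-]+$")

# Iterating over a string yields single characters, so the first "share" is "{".
candidate_task_ids = list(shares)
first_task_id = candidate_task_ids[0]
first_is_valid = bool(KEY_PATTERN.match(first_task_id))

# A cron expression needs 5 or 6 whitespace-separated fields; the raw template has 1.
cron_field_count = len(schedule.split())

print(first_task_id, first_is_valid, cron_field_count)
```

Values that shape the DAG itself (the schedule, the number of tasks) generally have to be plain Python values when the file is parsed, or be handled at run time by dynamic task mapping.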

Related

how to pass default values for run time input variable in airflow for scheduled execution

I came across an issue while running a DAG in Airflow. My code works in two scenarios but fails in one.
Below are my scenarios:
Manual trigger with input - Running Fine
Manual trigger without input - Running Fine
Scheduled Run - Failing
Below is my code:
def decide_the_flow(**kwargs):
    cleanup = kwargs['dag_run'].conf.get('cleanup', 'N')
    print("IP is :", cleanup)
    return cleanup
I am getting below error,
cleanup=kwargs['dag_run'].conf.get('cleanup','N')
AttributeError: 'NoneType' object has no attribute 'get'
I tried to define default values like this:
default_dag_args = {
    'start_date': days_ago(0),
    'params': {
        "cleanup": "N"
    },
    'retries': 0
}
but it does not work.
I am using a BranchPythonOperator to call this function.
Can anyone please guide me here? What am I missing?
As a workaround I am using the code below:
try:
    cleanup = kwargs['dag_run'].conf.get('cleanup', 'N')
except AttributeError:
    # dag_run.conf is None on the scheduled run
    cleanup = "N"
You can read the parameter from the params dict in the task context: Airflow fills this dict with the DAG-level default values and then applies whatever was supplied in dag_run.conf, so a value is always present:
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator


def decide_the_flow(**kwargs):
    cleanup = kwargs['params']["cleanup"]
    print(f"IP is : {cleanup}")
    return cleanup


with DAG(
    dag_id='airflow_params',
    start_date=datetime(2022, 8, 25),
    schedule_interval="* * * * *",
    params={
        "cleanup": "N",
    },
    catchup=False
) as dag:
    branch_task = BranchPythonOperator(
        task_id='test_param',
        python_callable=decide_the_flow
    )
    task_n = EmptyOperator(task_id="N")
    task_m = EmptyOperator(task_id="M")
    branch_task >> [task_n, task_m]
I just tested it in scheduled and manual (with and without conf) runs, it works fine.
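If you still need to read dag_run.conf directly, the AttributeError from the question can be avoided by guarding against a missing conf. The following is a minimal sketch in plain Python: the SimpleNamespace objects are hypothetical stand-ins for a DAG run, and read_cleanup is an illustrative helper name, not an Airflow function.

```python
from types import SimpleNamespace


def read_cleanup(dag_run, params):
    """Return the cleanup flag, preferring run-time conf over DAG-level params."""
    # conf may be None when no configuration was supplied (as in the traceback above).
    conf = dag_run.conf or {}
    return conf.get("cleanup", params.get("cleanup", "N"))


# Stand-ins for the three scenarios in the question (hypothetical, not Airflow objects).
scheduled_run = SimpleNamespace(conf=None)
manual_without_input = SimpleNamespace(conf={})
manual_with_input = SimpleNamespace(conf={"cleanup": "M"})

dag_params = {"cleanup": "N"}
results = [
    read_cleanup(run, dag_params)
    for run in (scheduled_run, manual_without_input, manual_with_input)
]
print(results)
```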

Airflow set DAG options conditionally

I'm implementing a Python script to create a bunch of Airflow DAGs based on JSON config files. Each JSON config file contains all the fields to be used in DAG(), and the last three fields are optional (the global default is used if they are not set).
{
    "owner": "Mike",
    "start_date": "2022-04-10",
    "schedule_interval": "0 0 * * *",
    "on_failure_callback": "slack",
    "is_paused_upon_creation": false,
    "catchup": true
}
Now, my question is how to create the DAG conditionally. Since the last three options (on_failure_callback, is_paused_upon_creation and catchup) are optional, what is the best way to use them in DAG()?
The first solution (solution_1) I tried is to pass default_args=optional_fields and add the optional fields to it with if statements. However, it doesn't work: the DAG does not pick up the values of these three optional fields.
def create_dag(name, config):
    # config is a dict generated from the JSON config file
    optional_fields = {
        'owner': config['owner']
    }
    if 'on_failure_callback' in config:
        optional_fields['on_failure_callback'] = partial(xxx(config['on_failure_callback']))
    if 'is_paused_upon_creation' in config:
        optional_fields['is_paused_upon_creation'] = config['is_paused_upon_creation']
    dag = DAG(
        dag_id=name,
        start_date=datetime.strptime(config['start_date'], '%Y-%m-%d'),
        schedule_interval=config['schedule_interval'],
        default_args=optional_fields
    )
Then I tried solution_2 with **optional_fields, but got the error TypeError: __init__() got an unexpected keyword argument 'owner'
    dag = DAG(
        dag_id=name,
        start_date=datetime.strptime(config['start_date'], '%Y-%m-%d'),
        schedule_interval=config['schedule_interval'],
        **optional_fields
    )
Then solution_3 works, as follows.
def create_dag(name, config):
    # config is a dict generated from the JSON config file
    default_args = {
        'owner': config['owner']
    }
    optional_fields = {}
    if 'on_failure_callback' in config:
        optional_fields['on_failure_callback'] = partial(xxx(config['on_failure_callback']))
    if 'is_paused_upon_creation' in config:
        optional_fields['is_paused_upon_creation'] = config['is_paused_upon_creation']
    dag = DAG(
        dag_id=name,
        start_date=datetime.strptime(config['start_date'], '%Y-%m-%d'),
        schedule_interval=config['schedule_interval'],
        default_args=default_args,
        **optional_fields
    )
However, I'm confused: 1) which fields should be set in optional_fields and which in default_args? 2) Is there any other way to achieve this?
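One way to think about the first question: default_args is forwarded to every operator in the DAG, so only task-level arguments such as owner belong there, while is_paused_upon_creation and catchup are arguments of DAG() itself and must be passed to it directly. The sketch below separates the two groups in plain Python. The helper name split_config and the two key sets are illustrative assumptions that cover only the fields from the question; on_failure_callback is left out because the string "slack" first has to be mapped to a callable, which the question's xxx helper does.

```python
from datetime import datetime

# Keys from the question's config, grouped by where they are passed.
# These sets are illustrative, not a complete list of Airflow arguments.
DAG_LEVEL_KEYS = {"schedule_interval", "is_paused_upon_creation", "catchup"}
TASK_LEVEL_KEYS = {"owner"}


def split_config(config):
    """Split a config dict into DAG keyword arguments and task-level default_args."""
    dag_kwargs = {k: v for k, v in config.items() if k in DAG_LEVEL_KEYS}
    default_args = {k: v for k, v in config.items() if k in TASK_LEVEL_KEYS}
    if "start_date" in config:
        dag_kwargs["start_date"] = datetime.strptime(config["start_date"], "%Y-%m-%d")
    return dag_kwargs, default_args


config = {
    "owner": "Mike",
    "start_date": "2022-04-10",
    "schedule_interval": "0 0 * * *",
    "catchup": True,
}
dag_kwargs, default_args = split_config(config)
print(dag_kwargs)
print(default_args)
```

The result could then be used as DAG(dag_id=name, default_args=default_args, **dag_kwargs). Optional keys that are absent from the config are simply not passed, so the defaults apply.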

How to add data to Airflow DAG Context? [duplicate]

We are working with airflow. We have something like 1000+ DAGS.
To manage DAG errors we use the same on_failure_callback function to trigger alerts.
Each DAG is supposed to have context information, that could be expressed as constants, that I would like to share with the alerting stack.
Currently, I am only able to send the dag_id I retrieve from the context, via context['ti'].dag_id, and eventually the conf (parameters).
Is there a way to add other data (constants) to the context when declaring/creating the DAG?
Thanks.
You can pass params into the context.
dag = DAG(
    dag_id='example_dag',
    default_args=default_args,
    params={
        "param1": "value1",
        "param2": "value2"
    }
)
These are available in the context:
# example task that quickly outputs the context
start = PythonOperator(
    task_id='start',
    python_callable=lambda **context: print(context)
)
# outputs the context; in the printed dictionary you will see
# 'params': {'param1': 'value1', 'param2': 'value2'},
or via jinja templates
{{params.param1}}
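Since the question is about a shared failure callback, here is a minimal sketch of how such a callback might read those constants from the context it receives. The function name build_alert, the FakeTaskInstance class and the plain dict used as a context are hypothetical stand-ins so the example runs without Airflow.

```python
def build_alert(context):
    """Build an alert payload from a task context (here a plain dict stand-in)."""
    params = context.get("params", {})
    return {
        "dag_id": context["ti"].dag_id,
        # Constants declared via DAG(params=...) travel with the context.
        "param1": params.get("param1"),
        "param2": params.get("param2"),
    }


class FakeTaskInstance:
    """Minimal stand-in exposing only the attribute used above."""
    dag_id = "example_dag"


fake_context = {
    "ti": FakeTaskInstance(),
    "params": {"param1": "value1", "param2": "value2"},
}
alert = build_alert(fake_context)
print(alert)
```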
You can also define variables within the DAG through user_defined_macros and access them with a {{ }} reference.
Example (with a PROJECT_ID variable):
with models.DAG(
    ...
    user_defined_macros={"PROJECT_ID": PROJECT_ID}
) as dag:
    projectId = "{{PROJECT_ID}}"

Airflow task setup with execution date

I want to customize the task to be weekday dependent in the dag file. It seems the airflow macros like {{ next_execution_date }} are not directly available in the python dag file. This is my dag definition:
RG_TASKS = {
    'asia': {
        # pendulum.datetime() takes the timezone as tz=
        'start_date': pendulum.datetime(2021, 1, 1, 16, 0, tz='Asia/Tokyo'),
        'tz': 'Asia/Tokyo',
        'files': [
            '/path/%Y%m%d/asia_file1.%Y%m%d.csv',
            '/path/%Y%m%d/asia_file2.%Y%m%d.csv',
            ...],
    },
    'euro': {
        'start_date': pendulum.datetime(2021, 1, 1, 16, 0, tz='Europe/London'),
        'tz': 'Europe/London',
        'files': [
            '/path/%Y%m%d/euro_file1.%Y%m%d.csv',
            '/path/%Y%m%d/euro_file2.%Y%m%d.csv',
            ...],
    },
}
dag = DAG(..., start_date=pendulum.datetime(2021, 1, 1, 16, 0, tz='Asia/Tokyo'),
          schedule='00 16 * * 0-6')
for rg, t in RG_TASKS.items():
    tz = t['tz']
    h = t['start_date'].hour
    m = t['start_date'].minute
    target_time = f'{{{{ next_execution_date.replace(tzinfo="{tz}", hour={h}, minute={m}) }}}}'
    time_sensor = DateTimeSensor(dag=dag, task_id=f'wait_for_{rg}', target_time=target_time)
    bash_task = BashOperator(dag=dag, task_id=f'load_{rg}', trigger_rule='all_success', depends_on_past=True, bash_command=...)
    for fname in t['files']:
        fpath = f'{{{{ next_execution_date.strftime("{fname}") }}}}'
        task_id = os.path.basename(fname).split('.')[0]
        file_sensor = FileSensor(dag=dag, task_id=task_id, filepath=fpath, ...)
        file_sensor.set_upstream(time_sensor)
        file_sensor.set_downstream(bash_task)
The above works: the bash_task is triggered once all files are available, and it is set with depends_on_past=True. However, the files have slightly different schedules. {rg}_file1 is available 6 days a week (every day except Saturday), while the rest are available 7 days a week.
One option is to create 2 DAGs, one scheduled to run Sun-Fri and the other scheduled to run on Sat only. But with this option, depends_on_past=True is broken on Saturday.
Is there a better way to keep depends_on_past=True 7 days a week? Ideally, in the files loop, I could do something like:
for fname in t['files']:
    dt = ...
    if dt.weekday() == 5 and task_id == f'{rg}_file1':
        continue
Generally I think it's better to accomplish things in a single task when it is easy enough to do, and in this case it seems to me you can.
I'm not entirely sure why you are using a datetime sensor, but it does not seem necessary. As far as I can tell, you just want your process to run every day (ideally after the file is there) and skip once per week.
I think we can do away with file sensor too.
Option 1: everything in bash
Check for the file's existence in your bash script and fail (with retries) if it is missing: just return a non-zero exit code when the file is missing.
Then in your bash script you could silently do nothing on the skip day.
On skip days, your bash task will be green even though it did nothing.
Option 2: subclass bash operator
Subclass BashOperator and add a skip_day parameter. Then your execute is like this:
def execute(self, context):
    next_execution_date = context['next_execution_date']
    if next_execution_date.day_of_week == self.skip_day:
        raise AirflowSkipException(f'we skip on day {self.skip_day}')
    super().execute(context)
With this option your bash script still needs to fail if file missing, but doesn't need to deal with the skip logic. And you'll be able to see that the task skipped in the UI.
Either way, no sensors.
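Whichever option you pick, it is worth checking the weekday numbering before hard-coding a skip day. The sketch below uses only the standard library, where datetime.weekday() returns 0 for Monday and 5 for Saturday; pendulum's day_of_week may use a different numbering depending on the version, so treat the constant as an assumption to verify. The helper name should_skip is illustrative.

```python
from datetime import datetime

SATURDAY = 5  # datetime.weekday(): Monday is 0, Sunday is 6


def should_skip(run_date, skip_day=SATURDAY):
    """Return True when the run falls on the weekday that should be skipped."""
    return run_date.weekday() == skip_day


# 2021-01-01 was a Friday, so 2021-01-02 is a Saturday.
friday = datetime(2021, 1, 1, 16, 0)
saturday = datetime(2021, 1, 2, 16, 0)
print(should_skip(friday), should_skip(saturday))
```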
Other note
You can simplify your filename templating.
files = [
    '/path/{{ next_ds_nodash }}/euro_file2.{{ next_ds_nodash }}.csv',
    ...
]
Then you don't need to mess with strftime.

Apache airflow macro to get last dag run execution time

I thought the macro prev_execution_date listed here would get me the execution date of the last DAG run, but looking at the source code it seems to only get the last date based on the DAG schedule.
prev_execution_date = task.dag.previous_schedule(self.execution_date)
Is there any way via macros to get the execution date of the DAG when it doesn't run on a schedule?
Yes, you can define your own custom macro for this as follows:
# custom macro function
def get_last_dag_run(dag):
    last_dag_run = dag.get_last_dagrun()
    if last_dag_run is None:
        return "no prev run"
    else:
        return last_dag_run.execution_date.strftime("%Y-%m-%d")

# add macro in user_defined_macros in dag definition
dag = DAG(dag_id="my_test_dag",
    schedule_interval='@daily',
    user_defined_macros={
        'last_dag_run_execution_date': get_last_dag_run
    }
)

# example of using it in practice
print_vals = BashOperator(
    task_id='print_vals',
    bash_command='echo {{ last_dag_run_execution_date(dag) }}',
    dag=dag
)
Note that dag.get_last_dagrun() is just one of the many functions available on the DAG object. Here's where I found it: https://github.com/apache/incubator-airflow/blob/v1-10-stable/airflow/models.py#L3396
You can also tweak the formatting of the string for the date format, and what you want output if there is no previous run.
You can also write your own custom macro function that uses the Airflow models to search the metadata database.
def get_last_dag_run(dag_id):
    # TODO: search the metadata DB
    return xxx

dag = DAG(
    'example',
    schedule_interval='0 1 * * *',
    user_defined_macros={
        'last_dag_run_execution_date': get_last_dag_run,
    }
)
Then use the key (last_dag_run_execution_date) in your template.

Resources