Airflow: how to set default values for dag_run.conf

I'm trying to set up an Airflow DAG that provides default values available from dag_run.conf. This works great when running the DAG from the web UI, using the "Run w/ Config" option. However, when running on the schedule, the dag_run.conf dict is not present, and the task will fail, e.g.
jinja2.exceptions.UndefinedError: 'dict object' has no attribute 'key1'
Below is an example job.
Is it possible to make it so that dag_run.conf always contains the dict defined by params here?
from airflow import DAG
from airflow.utils.dates import hours_ago
from airflow.operators.bash import BashOperator
from datetime import timedelta


def do_something(val1: str, val2: str) -> str:
    return f'echo vars are: "{val1}, {val2}"'


params = {
    'key1': 'def1',
    'key2': 'def2',
}

default_args = {
    'retries': 0,
}

with DAG(
    'template_test',
    default_args=default_args,
    schedule_interval=timedelta(minutes=1),
    start_date=hours_ago(1),
    params=params,
) as dag:

    hello_t = BashOperator(
        task_id='example-command',
        bash_command=do_something('{{dag_run.conf["key1"]}}', '{{dag_run.conf["key2"]}}'),
        config=params,
    )
The closest I've seen is in "For Apache Airflow, How can I pass the parameters when manually trigger DAG via CLI?"; however, there they leverage Jinja and if/else, which would require defining these default parameters twice. I'd like to define them only once.

You could use DAG params to achieve what you are looking for:
params (dict) – a dictionary of DAG level parameters that are made accessible in templates, namespaced under params. These params can be overridden at the task level.
You can define params at DAG or Task levels and also add or modify them from the UI in the Trigger DAG w/ config section.
Example DAG:
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

default_args = {
    "owner": "airflow",
}

dag = DAG(
    dag_id="example_dag_params",
    default_args=default_args,
    schedule_interval="*/5 * * * *",
    start_date=days_ago(1),
    params={"param1": "first_param"},
    catchup=False,
)

with dag:
    bash_task = BashOperator(
        task_id="bash_task", bash_command="echo bash_task: {{ params.param1 }}"
    )
Output log:
[2021-10-02 20:23:25,808] {logging_mixin.py:104} INFO - Running <TaskInstance: example_dag_params.bash_task 2021-10-02T23:15:00+00:00 [running]> on host worker_01
[2021-10-02 20:23:25,867] {taskinstance.py:1302} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=***
AIRFLOW_CTX_DAG_ID=example_dag_params
AIRFLOW_CTX_TASK_ID=bash_task
AIRFLOW_CTX_EXECUTION_DATE=2021-10-02T23:15:00+00:00
AIRFLOW_CTX_DAG_RUN_ID=scheduled__2021-10-02T23:15:00+00:00
[2021-10-02 20:23:25,870] {subprocess.py:52} INFO - Tmp dir root location:
/tmp
[2021-10-02 20:23:25,871] {subprocess.py:63} INFO - Running command: ['bash', '-c', 'echo bash_task: first_param']
[2021-10-02 20:23:25,884] {subprocess.py:74} INFO - Output:
[2021-10-02 20:23:25,886] {subprocess.py:78} INFO - bash_task: first_param
[2021-10-02 20:23:25,887] {subprocess.py:82} INFO - Command exited with return code 0
From the logs, notice that the dag_run is scheduled and the params are still there.
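Applied to the DAG in the question, this means templating against params instead of dag_run.conf. Here is a minimal sketch of that change, keeping the question's do_something helper, on the assumption that values supplied via "Trigger DAG w/ config" override the params defaults while scheduled runs fall back to them:

from datetime import timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import hours_ago


def do_something(val1: str, val2: str) -> str:
    return f'echo vars are: "{val1}, {val2}"'


with DAG(
    'template_test',
    schedule_interval=timedelta(minutes=1),
    start_date=hours_ago(1),
    # Defaults live in params; a manual run's conf can override them.
    params={'key1': 'def1', 'key2': 'def2'},
) as dag:
    hello_t = BashOperator(
        task_id='example-command',
        # params is always populated, on scheduled as well as manual runs.
        bash_command=do_something('{{ params.key1 }}', '{{ params.key2 }}'),
    )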
You can find a more extensive example on using parameters in this answer.
Hope that works for you!

Related

how to pass default values for run time input variable in airflow for scheduled execution

I came across an issue while running a DAG in Airflow. My code works in two scenarios but fails in one.
Below are my scenarios:
Manual trigger with input - Running Fine
Manual trigger without input - Running Fine
Scheduled Run - Failing
Below is my code:
def decide_the_flow(**kwargs):
    cleanup = kwargs['dag_run'].conf.get('cleanup', 'N')
    print("IP is :", cleanup)
    return cleanup
I am getting the error below:
cleanup=kwargs['dag_run'].conf.get('cleanup','N')
AttributeError: 'NoneType' object has no attribute 'get'
I tried to define default variables like this:
default_dag_args = {
    'start_date': days_ago(0),
    'params': {
        "cleanup": "N"
    },
    'retries': 0
}
but it doesn't work.
I am using BranchPythonOperator to call this function.
Can anyone please guide me here? What am I missing?
As a workaround I am using the code below:
try:
    cleanup = kwargs['dag_run'].conf.get('cleanup', 'N')
except:
    cleanup = "N"
You can access the parameters from the context dict params, because Airflow builds that dict by copying dag_run.conf and filling in the default values for anything that is missing:
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator


def decide_the_flow(**kwargs):
    cleanup = kwargs['params']["cleanup"]
    print(f"IP is : {cleanup}")
    return cleanup


with DAG(
    dag_id='airflow_params',
    start_date=datetime(2022, 8, 25),
    schedule_interval="* * * * *",
    params={
        "cleanup": "N",
    },
    catchup=False
) as dag:
    branch_task = BranchPythonOperator(
        task_id='test_param',
        python_callable=decide_the_flow
    )

    task_n = EmptyOperator(task_id="N")
    task_m = EmptyOperator(task_id="M")

    branch_task >> [task_n, task_m]
I just tested it in scheduled and manual (with and without conf) runs, it works fine.
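As a usage note, a manual run can still override the default from the CLI as well as from the UI; with the Airflow 2 CLI, something like this should pass a conf that ends up in params:

airflow dags trigger airflow_params --conf '{"cleanup": "Y"}'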

How do we set OS environment variables in Airflow

I have an Airflow DAG, and what I'm trying to do is read my variables stored in the Airflow UI (username and password) and export their values as OS environment variables. The reason is that I am using a dbt YAML file which requires me to read the environment variable 'dbt_user' (the only other way is to set the password in the YAML file, which is not secure):
default:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: xxxx
      user: "{{ env_var('dbt_user') }}"
I tried to write a DAG which does the BashOperator export, but it doesn't seem to set the environment variable.
from airflow import DAG
from airflow.operators.python import PythonOperator, BranchPythonOperator
from airflow.operators.bash import BashOperator
from airflow.operators.dummy_operator import DummyOperator
from datetime import datetime
from airflow.models import Variable
import os

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 8, 1),
    'retries': 0
}

with DAG('sample', default_args=default_args, schedule_interval='@once') as dag:
    task_1 = BashOperator(
        task_id='get_variables',
        bash_command='export dbt_user={{ var.value.dbt_user }} ',
        env=os.environ.copy(),
        dag=dag
    )

    task_2 = BashOperator(
        task_id='load_seed_data_once',
        bash_command='echo $dbt_user',
        dag=dag
    )

    task_1 >> task_2
When I echo it, we can see nothing is being set. Does anyone know how to set an environment variable using the BashOperator?
[2021-11-04 12:00:34,452] {subprocess.py:63} INFO - Running command: ['bash', '-c', 'echo $dbt_user']
[2021-11-04 12:00:34,463] {subprocess.py:74} INFO - Output:
[2021-11-04 12:00:34,464] {subprocess.py:78} INFO -
[2021-11-04 12:00:34,465] {subprocess.py:82} INFO - Command exited with return code 0
[2021-11-04 12:00:34,494] {taskinstance.py:1211} INFO - Marking task as SUCCESS. dag_id=sample, task_id=load_seed_data_once, execution_date=20211104T120032, start_date=20211104T120034, end_date=20211104T120034
[2021-11-04 12:00:34,517] {taskinstance.py:1265} INFO - 0 downstream tasks scheduled from follow-on schedule check
[2021-11-04 12:00:34,555] {local_task_job.py:149} INFO - Task exited with return code 0
Update:
I also tried doing it via the PythonOperator, but it didn't work either. It gave me:
    raise KeyError(key) from None
KeyError: 'variable_1'
from airflow import DAG
from airflow.operators.python import PythonOperator, BranchPythonOperator
from airflow.operators.bash import BashOperator
from airflow.operators.dummy_operator import DummyOperator
from datetime import datetime
from airflow.models import Variable
import os

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 8, 1),
    'retries': 0
}


def set_env():
    os.environ["variable_1"] = "value_1"


def print_env_var():
    print(os.environ["variable_1"])


with DAG('sample', default_args=default_args, schedule_interval='@once') as dag:
    set_env_task = PythonOperator(
        task_id='python_task',
        python_callable=set_env,
        dag=dag
    )

    print_env_task = PythonOperator(
        task_id='load_seed_data_once',
        python_callable=print_env_var,
        dag=dag
    )

    set_env_task >> print_env_task
The BashOperator and PythonOperator (any operator, I think) start a new child process and will only inherit the environment set in the container image, at runtime (e.g. compose or k8s deploy), or by a script run before starting Airflow (e.g. the entrypoint).
That's why you have the env parameter in BashOperator to pass whatever dict of env vars you want to set for the script.
You can pass dbt_user and the password from the Airflow Variables there too, since env is templated:
env={'dbt_user': '{{ var.value.dbt_user }}'}
You can set env in the DAG's default_args too, to make it available to all tasks, so you don't need to set it individually.
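For example, here is a minimal sketch of that approach (it assumes Airflow Variables named dbt_user and dbt_password exist in the UI; the echo stands in for the real dbt invocation):

import os
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2020, 8, 1),
    'retries': 0,
}

with DAG('sample_dbt_env', default_args=default_args, schedule_interval='@once') as dag:
    run_dbt = BashOperator(
        task_id='run_dbt',
        # env is templated, so the Airflow Variables are resolved at runtime and
        # exposed only to this task's child process. os.environ is merged in because
        # a provided env otherwise replaces the inherited environment (PATH, etc.).
        env={
            **os.environ,
            'dbt_user': '{{ var.value.dbt_user }}',
            'dbt_password': '{{ var.value.dbt_password }}',
        },
        bash_command='echo "dbt_user is set to $dbt_user"',  # replace with your dbt command
    )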
Lastly, if you use the LocalExecutor, you can instead do the following in the first bash task:
echo "export dbt_user={{ var.value.dbt_user }} >> ~/.bashrc
It will make the exported vars accessible in any new shell.
Note this would not work with KubernetesExecutor as a new container is started - but there are ways around it.

Airflow triggering the "on_failure_callback" when the "dagrun_timeout" is exceeded

I'm currently working on setting up alerts for long-running tasks in Airflow. To cancel/fail the Airflow DAG I've put "dagrun_timeout" in the default_args, and it does what I need: it fails/errors the DAG when it has been running for too long (usually stuck). The only problem is that the function in "on_failure_callback" doesn't get called when the dagrun_timeout is exceeded, because the "on_failure_callback" is at the task level (I think) while the dagrun_timeout is at the DAG level.
How can I execute the "on_failure_callback" when the dagrun_timeout is exceeded, or how can I specify a function to be called when a DAG fails? Or should I rethink my approach?
Try setting on_failure_callback during DAG declaration:
with DAG(
    dag_id="failure_callback_example",
    on_failure_callback=_on_dag_run_fail,
    ...
) as dag:
    ...
The explanation is that on_failure_callback defined in default_args will get passed only to the Tasks being created and not to the DAG object.
Here is an example to try this behaviour:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.models import TaskInstance
from airflow.operators.bash import BashOperator


def _on_dag_run_fail(context):
    print("***DAG failed!! do something***")
    print(f"The DAG failed because: {context['reason']}")
    print(context)


def _alarm(context):
    print("** Alarm Alarm!! **")
    task_instance: TaskInstance = context.get("task_instance")
    print(f"Task Instance: {task_instance} failed!")


default_args = {
    "owner": "mi_empresa",
    "email_on_failure": False,
    "on_failure_callback": _alarm,
}

with DAG(
    dag_id="failure_callback_example",
    start_date=datetime(2021, 9, 7),
    schedule_interval=None,
    default_args=default_args,
    catchup=False,
    on_failure_callback=_on_dag_run_fail,
    dagrun_timeout=timedelta(seconds=45),
) as dag:

    delayed = BashOperator(
        task_id="delayed",
        bash_command='echo "waiting..";sleep 60; echo "Done!!"',
    )

    will_fail = BashOperator(
        task_id="will_fail",
        bash_command="exit 1",
        # on_failure_callback=_alarm,
    )

    delayed >> will_fail
You can find the logs of the callbacks' execution in the Scheduler logs at AIRFLOW_HOME/logs/scheduler/date/failure_callback_example:
[2021-09-24 13:12:34,285] {logging_mixin.py:104} INFO - [2021-09-24 13:12:34,285] {dag.py:862} INFO - Executing dag callback function: <function _on_dag_run_fail at 0x7f83102e8670>
[2021-09-24 13:12:34,336] {logging_mixin.py:104} INFO - ***DAG failed!! do something***
[2021-09-24 13:12:34,345] {logging_mixin.py:104} INFO - The DAG failed because: timed_out
Edit:
Within the context dict, the key reason is passed in order to specify the cause of the DAG run failure. Some values are 'reason': 'timed_out' or 'reason': 'task_failure'. This could be used to perform specific behaviour in the callback based on the reason for the DAG run failure, as in the sketch below.
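For example, a minimal sketch of a DAG-level callback that branches on that key (plain print calls stand in for whatever alerting you actually use):

def _on_dag_run_fail(context):
    reason = context.get("reason")
    dag_id = context["dag"].dag_id
    if reason == "timed_out":
        # The dagrun_timeout was exceeded (a long-running or stuck run).
        print(f"ALERT: {dag_id} exceeded its dagrun_timeout")
    elif reason == "task_failure":
        # A task inside the run failed.
        print(f"ALERT: a task in {dag_id} failed")
    else:
        print(f"{dag_id} failed for reason: {reason}")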

Airflow | How DAG got started

Does anyone know how to get the way a DAG got started (whether it was started by the scheduler or manually)? I'm using Airflow 2.1.
I have a DAG that runs on an hourly basis, but there are times that I run it manually to test something. I want to capture how the DAG got started and pass that value to a column in a table where I'm saving some data. This will allow me to filter based on scheduled or manual starts and filter test information.
Thanks!
From an execution context, such as a python_callable provided to a PythonOperator, you can access the DagRun object related to the current execution:
def _print_dag_run(**kwargs):
    dag_run: DagRun = kwargs["dag_run"]

    print(f"Run type: {dag_run.run_type}")
    print(f"Externally triggered ?: {dag_run.external_trigger}")
Logs output:
[2021-09-08 18:53:52,188] {taskinstance.py:1300} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=airflow
AIRFLOW_CTX_DAG_ID=example_dagRun_info
AIRFLOW_CTX_TASK_ID=python_task
AIRFLOW_CTX_EXECUTION_DATE=2021-09-07T00:00:00+00:00
AIRFLOW_CTX_DAG_RUN_ID=backfill__2021-09-07T00:00:00+00:00
Run type: backfill
Externally triggered ?: False
dag_run.run_type would be: "manual", "scheduled" or "backfill". (not sure if there are others)
external_trigger docs:
external_trigger (bool) -- whether this dag run is externally triggered
Also, you could use Jinja to access default variables in templated fields; there is a variable representing the dag_run object:
bash_task = BashOperator(
    task_id="bash_task",
    bash_command="echo dag_run type is: {{ dag_run.run_type }}",
)
Full DAG:
from airflow import DAG
from airflow.models.dagrun import DagRun
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago

default_args = {
    "owner": "airflow",
}


def _print_dag_run(**kwargs):
    dag_run: DagRun = kwargs["dag_run"]

    print(f"Run type: {dag_run.run_type}")
    print(f"Externally triggered ?: {dag_run.external_trigger}")


dag = DAG(
    dag_id="example_dagRun_info",
    default_args=default_args,
    start_date=days_ago(1),
    schedule_interval="@once",
    tags=["example_dags", "params"],
    catchup=False,
)

with dag:
    python_task = PythonOperator(
        task_id="python_task",
        python_callable=_print_dag_run,
    )

    bash_task = BashOperator(
        task_id="bash_task",
        bash_command="echo dag_run type is: {{ dag_run.run_type }}",
    )
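Tying this back to the question: since run_type comes from the context, you can forward it to whatever writes your table rows and filter on it later. A minimal sketch, where the row-writing part is only a placeholder for your own save logic:

def save_results(**kwargs):
    dag_run: DagRun = kwargs["dag_run"]
    started_by = dag_run.run_type  # e.g. "scheduled", "manual" or "backfill"

    # Placeholder: attach the run source to whatever you persist.
    row = {"some_metric": 42, "started_by": started_by}
    print(f"Would insert: {row}")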

apache airflow - Cannot load the dag bag to handle failure

I have created an on_failure_callback function (referring to Airflow default on_failure_callback) to handle task failures.
It works well when there is only one task in a DAG; however, if there are 2 or more tasks, a task randomly fails because the operator is null. It can be resumed manually later. In airflow-scheduler.out the log is:
[2018-05-08 14:24:21,237] {models.py:1595} ERROR - Executor reports task instance %s finished (%s) although the task says its %s. Was the task killed externally? NoneType
[2018-05-08 14:24:21,238] {jobs.py:1435} ERROR - Cannot load the dag bag to handle failure for . Setting task to FAILED without callbacks or retries. Do you have enough resources?
The DAG code is:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import timedelta
import airflow
from devops.util import WechatUtil
from devops.util import JiraUtil


def on_failure_callback(context):
    ti = context['task_instance']
    log_url = ti.log_url
    owner = ti.task.owner
    ti_str = str(context['task_instance'])
    wechat_msg = "%s - Owner:%s" % (ti_str, owner)
    WechatUtil.notify_team(wechat_msg)

    jira_desc = "Please check log from url %s" % (log_url)
    JiraUtil.create_incident("DW", ti_str, jira_desc, owner)


args = {
    'queue': 'default',
    'start_date': airflow.utils.dates.days_ago(1),
    'retry_delay': timedelta(minutes=1),
    'on_failure_callback': on_failure_callback,
    'owner': 'user1',
}

dag = DAG(dag_id='test_dependence1', default_args=args, schedule_interval='10 16 * * *')

load_crm_goods = BashOperator(
    task_id='crm_goods_job',
    bash_command='date',
    dag=dag)

load_crm_memeber = BashOperator(
    task_id='crm_member_job',
    bash_command='date',
    dag=dag)

load_crm_order = BashOperator(
    task_id='crm_order_job',
    bash_command='date',
    dag=dag)

load_crm_eur_invt = BashOperator(
    task_id='crm_eur_invt_job',
    bash_command='date',
    dag=dag)

crm_member_cohort_analysis = BashOperator(
    task_id='crm_member_cohort_analysis_job',
    bash_command='date',
    dag=dag)

crm_member_cohort_analysis.set_upstream(load_crm_goods)
crm_member_cohort_analysis.set_upstream(load_crm_memeber)
crm_member_cohort_analysis.set_upstream(load_crm_order)
crm_member_cohort_analysis.set_upstream(load_crm_eur_invt)

crm_member_kpi_daily = BashOperator(
    task_id='crm_member_kpi_daily_job',
    bash_command='date',
    dag=dag)

crm_member_kpi_daily.set_upstream(crm_member_cohort_analysis)
I had tried to update airflow.cfg by increasing the default memory from 512 to even 4096, but no luck. Would anyone have any advice?
I also tried updating my JiraUtil and WechatUtil as follows, encountering the same error.
WechatUtil:
import requests


class WechatUtil:
    @staticmethod
    def notify_trendy_user(user_ldap_id, message):
        return None

    @staticmethod
    def notify_bigdata_team(message):
        return None
JiraUtil:
import json
import requests


class JiraUtil:
    @staticmethod
    def execute_jql(jql):
        return None

    @staticmethod
    def create_incident(projectKey, summary, desc, assignee=None):
        return None
(I'm shooting tracer bullets a bit here, so bear with me if this answer doesn't get it right on the first try.)
The null operator issue with multiple task instances is weird... it would help with troubleshooting if you could boil the current code down to an MCVE, e.g., 1–2 operators, excluding the JiraUtil and WechatUtil parts if they're not related to the callback failure.
Here are 2 ideas:
1. Can you try changing the line that fetches the task instance out of the context to see if this makes a difference?
Before:
def on_failure_callback(context):
    ti = context['task_instance']
    ...
After:
def on_failure_callback(context):
    ti = context['ti']
    ...
I saw this usage in the Airflow repo (https://github.com/apache/incubator-airflow/blob/c1d583f91a0b4185f760a64acbeae86739479cdb/airflow/contrib/hooks/qubole_check_hook.py#L88). It's possible it can be accessed both ways.
2. Can you try adding provide_context=True on the operators either as a kwarg or in default_args?
