Airflow | How DAG got started - airflow

Does anyone know how to get the way a DAG got started (whether it was on a scheduler or manually)? I'm using Airflow 2.1.
I have a DAG that runs on an hourly basis, but there are times that I run it manually to test something. I want to capture how the DAG got started and pass that value to a column in a table where I'm saving some data. This will allow me to filter based on scheduled or manual starts and filter test information.
Thanks!

From an execution context, such as a python_callable provided to a PythonOperator, you can access the DagRun object related to the current execution:
def _print_dag_run(**kwargs):
    dag_run: DagRun = kwargs["dag_run"]
    print(f"Run type: {dag_run.run_type}")
    print(f"Externally triggered ?: {dag_run.external_trigger}")
Logs output:
[2021-09-08 18:53:52,188] {taskinstance.py:1300} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=airflow
AIRFLOW_CTX_DAG_ID=example_dagRun_info
AIRFLOW_CTX_TASK_ID=python_task
AIRFLOW_CTX_EXECUTION_DATE=2021-09-07T00:00:00+00:00
AIRFLOW_CTX_DAG_RUN_ID=backfill__2021-09-07T00:00:00+00:00
Run type: backfill
Externally triggered ?: False
dag_run.run_type would be: "manual", "scheduled" or "backfill". (not sure if there are others)
external_trigger docs:
external_trigger (bool) -- whether this dag run is externally triggered
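For the original goal of saving the start type in a table column, the two attributes can be collected into a small record. This is only a minimal sketch: `start_info` is a hypothetical helper, and `SimpleNamespace` objects stand in for real DagRun objects so the snippet runs without Airflow.

```python
from types import SimpleNamespace


def start_info(dag_run):
    """Collect how a run was started, ready to be written to a table row.

    Hypothetical helper: it only reads run_type and external_trigger,
    the two DagRun attributes discussed in the answer.
    """
    return {
        "run_type": str(dag_run.run_type),
        "external_trigger": bool(dag_run.external_trigger),
    }


# Stand-ins for DagRun objects (not real Airflow objects).
scheduled_run = SimpleNamespace(run_type="scheduled", external_trigger=False)
manual_run = SimpleNamespace(run_type="manual", external_trigger=True)

print(start_info(scheduled_run))
print(start_info(manual_run))
```

Inside a real task, the same function could be called with kwargs["dag_run"] and the resulting values written alongside the rest of the data, so that test runs can be filtered out later.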
You could also use Jinja to access the default variables in templated fields; one of them represents the dag_run object:
bash_task = BashOperator(
    task_id="bash_task",
    bash_command="echo dag_run type is: {{ dag_run.run_type }}",
)
Full DAG:
from airflow import DAG
from airflow.models.dagrun import DagRun
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
default_args = {
    "owner": "airflow",
}
def _print_dag_run(**kwargs):
    # The DagRun of the current execution is available in the task context.
    dag_run: DagRun = kwargs["dag_run"]
    print(f"Run type: {dag_run.run_type}")
    print(f"Externally triggered ?: {dag_run.external_trigger}")
dag = DAG(
    dag_id="example_dagRun_info",
    default_args=default_args,
    start_date=days_ago(1),
    schedule_interval="@once",
    tags=["example_dags", "params"],
    catchup=False,
)
with dag:
    python_task = PythonOperator(
        task_id="python_task",
        python_callable=_print_dag_run,
    )
    bash_task = BashOperator(
        task_id="bash_task",
        bash_command="echo dag_run type is: {{ dag_run.run_type }}",
    )

Related

How do we set OS environment variables in Airflow

I have an Airflow DAG, and what I'm trying to do is read the variables stored in the Airflow UI (username and password) and pass their values as exported environment variables in the OS. The reason is that I am using a dbt YAML file that reads the environment variable 'dbt_user' (the only other way is to set the password in the YAML file, which is not secure).
default:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: xxxx
      user: "{{ env_var('dbt_user') }}"
I tried to write a DAG that does the export with a BashOperator, but it doesn't seem to set the environment variable.
from airflow import DAG
from airflow.operators.python import PythonOperator, BranchPythonOperator
from airflow.operators.bash import BashOperator
from airflow.operators.dummy_operator import DummyOperator
from datetime import datetime
from airflow.models import Variable
import os
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2020,8,1),
    'retries': 0
}
with DAG('sample', default_args=default_args, schedule_interval='@once') as dag:
    task_1 = BashOperator(
        task_id='get_variables',
        bash_command='export dbt_user={{ var.value.dbt_user }} ',
        env = os.environ.copy(),
        dag=dag
    )
    task_2 = BashOperator(
        task_id='load_seed_data_once',
        bash_command='echo $dbt_user',
        dag=dag
    )
    task_1 >> task_2
When I echo the variable, we can see that nothing has been set. Does anyone know how to set an environment variable using the BashOperator?
[2021-11-04 12:00:34,452] {subprocess.py:63} INFO - Running command: ['bash', '-c', 'echo $dbt_user']
[2021-11-04 12:00:34,463] {subprocess.py:74} INFO - Output:
[2021-11-04 12:00:34,464] {subprocess.py:78} INFO -
[2021-11-04 12:00:34,465] {subprocess.py:82} INFO - Command exited with return code 0
[2021-11-04 12:00:34,494] {taskinstance.py:1211} INFO - Marking task as SUCCESS. dag_id=sample, task_id=load_seed_data_once, execution_date=20211104T120032, start_date=20211104T120034, end_date=20211104T120034
[2021-11-04 12:00:34,517] {taskinstance.py:1265} INFO - 0 downstream tasks scheduled from follow-on schedule check
[2021-11-04 12:00:34,555] {local_task_job.py:149} INFO - Task exited with return code 0
Update:
I also tried doing it via the PythonOperator, but that didn't work either. It gave me:
raise KeyError(key) from None
KeyError: 'variable_1'
from airflow import DAG
from airflow.operators.python import PythonOperator, BranchPythonOperator
from airflow.operators.bash import BashOperator
from airflow.operators.dummy_operator import DummyOperator
from datetime import datetime
from airflow.models import Variable
import os
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2020,8,1),
    'retries': 0
}
def set_env():
    os.environ["variable_1"] = "value_1"
def print_env_var():
    print(os.environ["variable_1"])
with DAG('sample', default_args=default_args, schedule_interval='@once') as dag:
    set_env_task = PythonOperator(
        task_id='python_task',
        python_callable=set_env,
        dag=dag
    )
    print_env_task = PythonOperator(
        task_id='load_seed_data_once',
        python_callable=print_env_var,
        dag=dag
    )
    set_env_task >> print_env_task
The BashOperator and the PythonOperator (any operator, I think) run each task in a new child process, which only inherits the environment set in the container image, at runtime (e.g. compose or k8s deploy), or by a script that runs before Airflow starts (e.g. an entrypoint). Anything a task exports disappears when that task's process exits.
That's why the BashOperator has the env parameter: you pass it a dict of whatever env vars you want set for the script.
You can pass the dbt_user and password from the Airflow Variables there too, since env is templated.
env={'dbt_user': '{{ var.value.dbt_user }}'}
You can set env in the dag defaults too to make it available to all tasks so you don't need to set individually.
Lastly, if you use the LocalExecutor, you can instead do the following in the first bash task:
echo "export dbt_user={{ var.value.dbt_user }}" >> ~/.bashrc
It will make the exported vars available to any new shell that sources ~/.bashrc.
Note this would not work with KubernetesExecutor as a new container is started - but there are ways around it.
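The process isolation described above can be reproduced without Airflow. The following is only a sketch using the standard library: `run_child` is a hypothetical helper, and the value 'alice' is made up for the demo. One child process sets the variable for itself, a second child does not see it, and a third child sees it only because it is passed explicitly, which is what the env parameter does.

```python
import os
import subprocess
import sys

# Small child program: print the value of dbt_user, or "<unset>".
CHILD = "import os; print(os.environ.get('dbt_user', '<unset>'))"


def run_child(extra_env=None):
    """Run the child program in a fresh process and return what it printed."""
    env = os.environ.copy()
    env.pop("dbt_user", None)  # start from a clean slate for the demo
    if extra_env:
        env.update(extra_env)
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# First "task": sets the variable inside its own process only.
subprocess.run(
    [sys.executable, "-c", "import os; os.environ['dbt_user'] = 'alice'"],
    check=True,
)

without_env = run_child()                    # nothing was inherited
with_env = run_child({"dbt_user": "alice"})  # passed explicitly, like env=
print(without_env, with_env)
```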

Airflow triggering the "on_failure_callback" when the "dagrun_timeout" is exceeded

Currently working on setting up alerts for long-running tasks in Airflow. To cancel/fail the Airflow DAG I've put "dagrun_timeout" in the default_args, and it does what I need: it fails/errors the DAG when it's been running for too long (usually stuck). The only problem is that the function in "on_failure_callback" doesn't get called when the dagrun_timeout is exceeded, because the "on_failure_callback" is on the task level (I think) while the dagrun_timeout is on the DAG level.
How can I execute the "on_failure_callback" when the dagrun_timeout is exceeded, or how can I specify a function to be called when a dag fails? Or should I re-think my approach?
Try setting on_failure_callback during DAG declaration:
with DAG(
    dag_id="failure_callback_example",
    on_failure_callback=_on_dag_run_fail,
    ...
) as dag:
    ...
The explanation is that on_failure_callback defined in default_args will get passed only to the Tasks being created and not to the DAG object.
Here is an example to try this behaviour:
from datetime import datetime, timedelta
from airflow import DAG
from airflow.models import TaskInstance
from airflow.operators.bash import BashOperator
def _on_dag_run_fail(context):
    # DAG-level callback: called when the DAG run itself fails.
    print("***DAG failed!! do something***")
    print(f"The DAG failed because: {context['reason']}")
    print(context)
def _alarm(context):
    # Task-level callback: called when a single task fails.
    print("** Alarm Alarm!! **")
    task_instance: TaskInstance = context.get("task_instance")
    print(f"Task Instance: {task_instance} failed!")
default_args = {
    "owner": "mi_empresa",
    "email_on_failure": False,
    "on_failure_callback": _alarm,
}
with DAG(
    dag_id="failure_callback_example",
    start_date=datetime(2021, 9, 7),
    schedule_interval=None,
    default_args=default_args,
    catchup=False,
    on_failure_callback=_on_dag_run_fail,
    dagrun_timeout=timedelta(seconds=45),
) as dag:
    delayed = BashOperator(
        task_id="delayed",
        bash_command='echo "waiting..";sleep 60; echo "Done!!"',
    )
    will_fail = BashOperator(
        task_id="will_fail",
        bash_command="exit 1",
        # on_failure_callback=_alarm,
    )
    delayed >> will_fail
You can find the logs of the callbacks execution in the Scheduler logs AIRFLOW_HOME/logs/scheduler/date/failure_callback_example :
[2021-09-24 13:12:34,285] {logging_mixin.py:104} INFO - [2021-09-24 13:12:34,285] {dag.py:862} INFO - Executing dag callback function: <function _on_dag_run_fail at 0x7f83102e8670>
[2021-09-24 13:12:34,336] {logging_mixin.py:104} INFO - ***DAG failed!! do something***
[2021-09-24 13:12:34,345] {logging_mixin.py:104} INFO - The DAG failed because: timed_out
Edit:
Within the context dict, the key reason is passed to specify the cause of the DAG run failure. Some values are: 'reason': 'timed_out' or 'reason': 'task_failure'. This could be used to perform specific behaviour in the callback based on the reason for the DAG run failure.
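A callback that reacts differently to each reason could look roughly like this. It is a sketch only: `describe_failure` is a hypothetical helper, the message texts are made up, and the only context key it relies on is "reason" with the two values quoted in the answer.

```python
def describe_failure(context):
    """Build an alert message from the DAG-level callback context.

    Only the "reason" key is read. The two reasons handled here are the
    ones quoted in the answer; any other value falls through to a
    generic message.
    """
    reason = context.get("reason")
    if reason == "timed_out":
        return "DAG run exceeded dagrun_timeout"
    if reason == "task_failure":
        return "DAG run failed because a task failed"
    return f"DAG run failed (reason: {reason})"


print(describe_failure({"reason": "timed_out"}))
print(describe_failure({"reason": "task_failure"}))
```

In the DAG-level callback, the returned message could be passed to whatever alerting mechanism is in use.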

Airflow 2.1.0 passing variable to another DAG using TriggerDagRunOperator

We're using Airflow 2.1.0 and want to trigger a DAG and pass a variable to it (an S3 file name) using TriggerDagRunOperator.
I've found examples of this and can pass a static JSON to the next DAG using conf:
@task()
def trigger_target_dag_task(context):
    TriggerDagRunOperator(
        task_id="trigger_target_dag",
        trigger_dag_id="target_dag",
        conf={"file_name": "test.txt"}
    ).execute(context)
However, I cannot find current examples where the conf is dynamically created without using python_callable - this seems close:
Airflow 2.0.0+ - Pass a Dynamically Generated Dictionary to DAG Triggered by TriggerDagRunOperator
https://github.com/apache/airflow/pull/6317#issuecomment-859556243
Is this possible?
Updated question:
This method did not work when I used:
@task()
def trigger_dag_task(context):
    TriggerDagRunOperator(
        task_id="trigger_dag_task",
        trigger_dag_id="target_dag",
        conf={"payload": "{{ ti.xcom_pull(task_ids='extract_rss') }}"},
    ).execute(context)
The target_dag received the conf as a string:
{logging_mixin.py:104} INFO - Remotely received value of {{ ti.xcom_pull(task_ids='extract_rss') }}
conf is a templated field, so you could use Jinja to pass in any variable. Consider this example, based on the official TriggerDagRunOperator example.
If the variable (object_name) is within your scope you could do:
Controller DAG:
dag = DAG(
    dag_id="example_trigger_controller_dag",
    default_args={"owner": "airflow"},
    start_date=days_ago(2),
    schedule_interval="@once",
    tags=['example'],
)
object_name = "my-object-s3-aws"
trigger = TriggerDagRunOperator(
    task_id="test_trigger_dagrun",
    trigger_dag_id="example_trigger_target_dag",
    conf={"s3_object": object_name},
    dag=dag,
)
Target DAG:
dag = DAG(
    dag_id="example_trigger_target_dag",
    default_args={"owner": "airflow"},
    start_date=days_ago(2),
    schedule_interval=None,
    tags=['example'],
)
def run_this_func(**context):
    print("Remotely received value of {} for key=message".format(
        context["dag_run"].conf["s3_object"]))
run_this = PythonOperator(
    task_id="run_this", python_callable=run_this_func, dag=dag)
bash_task = BashOperator(
    task_id="bash_task",
    bash_command='echo "Here is the message: $message"',
    env={'message': '{{ dag_run.conf["s3_object"] if dag_run else "" }}'},
    dag=dag,
)
If the variable is stored as an Airflow Variable you could retrieve it like this:
conf={"s3_object": "{{var.json.s3_object}}"}
If it were an XCom from a previous task, you could do:
conf={"s3_object": "{{ ti.xcom_pull(task_ids='previous_task_id', key='return_value') }}"}
Let me know if that worked for you!
docs
Edit:
This is a working example, tested in version 2.0.1, using xcom_pull in conf param:
Controller DAG:
from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
def _do_something():
    # The return value is pushed to XCom under the key "return_value".
    return "my-object-s3-aws"
dag = DAG(
    dag_id="example_trigger_controller_dag",
    default_args={"owner": "airflow"},
    start_date=days_ago(2),
    schedule_interval="@once",
    tags=['example'],
)
task_1 = PythonOperator(
    task_id='previous_task_id',
    python_callable=_do_something,
    dag=dag,
)
trigger = TriggerDagRunOperator(
    task_id="test_trigger_dagrun",
    trigger_dag_id="example_trigger_target_dag",
    conf={
        "s3_object":
            "{{ ti.xcom_pull(task_ids='previous_task_id', key='return_value') }}"},
    dag=dag,
)
task_1 >> trigger
Target DAG:
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
dag = DAG(
    dag_id="example_trigger_target_dag",
    default_args={"owner": "airflow"},
    start_date=days_ago(2),
    schedule_interval=None,
    tags=['example'],
)
def run_this_func(**context):
    print("Remotely received value of {} ".format(
        context["dag_run"].conf["s3_object"]))
run_this = PythonOperator(
    task_id="run_this", python_callable=run_this_func, dag=dag)
bash_task = BashOperator(
    task_id="bash_task",
    bash_command='echo "Here is the message: $s3_object"',
    env={'s3_object': '{{ dag_run.conf["s3_object"] if dag_run else "" }}'},
    dag=dag,
)
Logs from run_this task:
[2021-07-15 19:24:11,410] {logging_mixin.py:104} INFO - Remotely received value of my-object-s3-aws
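On the target side, the template above guards against a missing dag_run, while the Python callable indexes conf directly. A slightly more defensive lookup might look like the sketch below. It assumes conf may be missing or empty when the target DAG is started some other way; `conf_value` is a hypothetical helper, and `SimpleNamespace` objects stand in for real DagRun objects.

```python
from types import SimpleNamespace


def conf_value(dag_run, key, default=""):
    """Read one key from dag_run.conf, tolerating a missing run or conf."""
    conf = getattr(dag_run, "conf", None) or {}
    return conf.get(key, default)


# Stand-ins for DagRun objects (not real Airflow objects).
triggered = SimpleNamespace(conf={"s3_object": "my-object-s3-aws"})
no_conf = SimpleNamespace(conf=None)

print(conf_value(triggered, "s3_object"))
print(repr(conf_value(no_conf, "s3_object")))
print(repr(conf_value(None, "s3_object")))
```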

Using dag_run variables in airflow Dag

I am trying to use airflow variables to determine whether to execute a task or not. I have tried this and it's not working:
if '{{ params.year }}' == '{{ params.message }}':
    run_this = DummyOperator (
        task_id = 'dummy_dag'
    )
I was hoping to get some help making it work. Also is there a better way of doing something like this in airflow?
I think a good way to solve this, is with BranchPythonOperator to branch dynamically based on the provided DAG parameters. Consider this example:
Use params to provide the parameters to the DAG (could be also done from the UI), in this example: {"enabled": True}
from airflow.decorators import dag, task
from airflow.utils.dates import days_ago
from airflow.operators.python import get_current_context, BranchPythonOperator
default_args = {"owner": "airflow"}
@dag(
    default_args=default_args,
    schedule_interval=None,
    start_date=days_ago(1),
    catchup=False,
    tags=["example"],
    params={"enabled": True},
)
def branch_from_dag_params():
    def _print_enabled():
        context = get_current_context()
        enabled = context["params"].get("enabled", False)
        print(f"Task id: {context['ti'].task_id}")
        print(f"Enabled is: {enabled}")
    @task
    def task_a():
        _print_enabled()
    @task
    def task_b():
        _print_enabled()
Define a callable for the BranchPythonOperator in which you perform your conditionals and return the next task to be executed. You can access the execution context variables from **kwargs. Also keep in mind that this callable should return a single task_id or a list of task_ids to follow downstream. Those resulting tasks should always be directly downstream from the branch task. The following lines continue inside branch_from_dag_params:
    def _get_task_run(ti, **kwargs):
        custom_param = kwargs["params"].get("enabled", False)
        if custom_param:
            return "task_a"
        else:
            return "task_b"
    branch_task = BranchPythonOperator(
        task_id="branch_task",
        python_callable=_get_task_run,
    )
    task_a_exec = task_a()
    task_b_exec = task_b()
    branch_task >> [task_a_exec, task_b_exec]
# Calling the decorated function creates the DAG.
example_dag = branch_from_dag_params()
The result is that task_a gets executed and task_b is skipped :
AIRFLOW_CTX_DAG_OWNER=airflow
AIRFLOW_CTX_DAG_ID=branch_from_dag_params
AIRFLOW_CTX_TASK_ID=task_a
Task id: task_a
Enabled is: True
Let me know if that worked for you.
Docs

DAG marked as "success" if one task fails, because of trigger rule ALL_DONE

I have the following DAG with 3 tasks:
start --> special_task --> end
The task in the middle can succeed or fail, but end must always be executed (imagine this is a task for cleanly closing resources). For that, I used the trigger rule ALL_DONE:
end.trigger_rule = trigger_rule.TriggerRule.ALL_DONE
Using that, end is properly executed if special_task fails. However, since end is the last task and succeeds, the DAG is always marked as SUCCESS.
How can I configure my DAG so that if one of the tasks failed, the whole DAG is marked as FAILED?
Example to reproduce
import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils import trigger_rule
dag = DAG(
    dag_id='my_dag',
    start_date=datetime.datetime.today(),
    schedule_interval=None
)
start = BashOperator(
    task_id='start',
    bash_command='echo start',
    dag=dag
)
special_task = BashOperator(
    task_id='special_task',
    bash_command='exit 1', # force failure
    dag=dag
)
end = BashOperator(
    task_id='end',
    bash_command='echo end',
    dag=dag
)
end.trigger_rule = trigger_rule.TriggerRule.ALL_DONE
start.set_downstream(special_task)
special_task.set_downstream(end)
This post seems to be related, but the answer does not suit my needs, since the downstream task end must be executed (hence the mandatory trigger_rule).
I thought it was an interesting question and spent some time figuring out how to achieve it without an extra dummy task. It became a bit of a superfluous task, but here's the end result:
This is the full DAG:
import airflow
from airflow.exceptions import AirflowException
from airflow.models import DAG, TaskInstance, BaseOperator
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.utils.db import provide_session
from airflow.utils.state import State
from airflow.utils.trigger_rule import TriggerRule
default_args = {"owner": "airflow", "start_date": airflow.utils.dates.days_ago(3)}
dag = DAG(
    dag_id="finally_task_set_end_state",
    default_args=default_args,
    schedule_interval="0 0 * * *",
    description="Answer for question https://stackoverflow.com/questions/51728441",
)
start = BashOperator(task_id="start", bash_command="echo start", dag=dag)
failing_task = BashOperator(task_id="failing_task", bash_command="exit 1", dag=dag)
@provide_session
def _finally(task, execution_date, dag, session=None, **_):
    # Look up the task instances directly upstream of this task for this run.
    upstream_task_instances = (
        session.query(TaskInstance)
        .filter(
            TaskInstance.dag_id == dag.dag_id,
            TaskInstance.execution_date == execution_date,
            TaskInstance.task_id.in_(task.upstream_task_ids),
        )
        .all()
    )
    upstream_states = [ti.state for ti in upstream_task_instances]
    fail_this_task = State.FAILED in upstream_states
    print("Do logic here...")
    if fail_this_task:
        raise AirflowException("Failing task because one or more upstream tasks failed.")
finally_ = PythonOperator(
    task_id="finally",
    python_callable=_finally,
    trigger_rule=TriggerRule.ALL_DONE,
    provide_context=True,
    dag=dag,
)
succesful_task = DummyOperator(task_id="succesful_task", dag=dag)
start >> [failing_task, succesful_task] >> finally_
Look at the _finally function, which is called by the PythonOperator. There are a few key points here:
Annotate with @provide_session and add argument session=None, so you can query the Airflow DB with session.
Query all upstream task instances for the current task:
upstream_task_instances = (
    session.query(TaskInstance)
    .filter(
        TaskInstance.dag_id == dag.dag_id,
        TaskInstance.execution_date == execution_date,
        TaskInstance.task_id.in_(task.upstream_task_ids),
    )
    .all()
)
From the returned task instances, get the states and check if State.FAILED is in there:
upstream_states = [ti.state for ti in upstream_task_instances]
fail_this_task = State.FAILED in upstream_states
Perform your own logic:
print("Do logic here...")
And finally, fail the task if fail_this_task=True:
if fail_this_task:
    raise AirflowException("Failing task because one or more upstream tasks failed.")
As @JustinasMarozas explained in a comment, a solution is to create a dummy task like:
dummy = DummyOperator(
    task_id='test',
    dag=dag
)
and bind it downstream of special_task:
special_task.set_downstream(dummy)
Thus, the DAG is marked as failed, and the dummy task is marked as upstream_failed.
Hope there is an out-of-the-box solution, but waiting for that, this solution does the job.
To expand on Bas Harenslak's answer, a simpler _finally function, which checks the state of all tasks (not only the upstream ones), can be:
from airflow.utils.state import State
def _finally(**kwargs):
    for task_instance in kwargs['dag_run'].get_task_instances():
        if task_instance.current_state() != State.SUCCESS and \
                task_instance.task_id != kwargs['task_instance'].task_id:
            raise Exception("Task {} failed. Failing this DAG run".format(task_instance.task_id))
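The check in this last _finally can be tried out without a running Airflow instance. The sketch below assumes a hypothetical helper `first_unsuccessful` and uses `SimpleNamespace` stand-ins with a plain state attribute in place of real task instances; the string "success" is the value of State.SUCCESS.

```python
from types import SimpleNamespace

SUCCESS = "success"  # same string value as airflow.utils.state.State.SUCCESS


def first_unsuccessful(task_instances, current_task_id):
    """Return the task_id of the first other task that did not succeed."""
    for ti in task_instances:
        if ti.task_id != current_task_id and ti.state != SUCCESS:
            return ti.task_id
    return None


# Stand-ins for the task instances of one DAG run (not real Airflow objects).
run = [
    SimpleNamespace(task_id="start", state="success"),
    SimpleNamespace(task_id="special_task", state="failed"),
    SimpleNamespace(task_id="end", state="running"),
]

print(first_unsuccessful(run, current_task_id="end"))
```

Note that with this comparison any state other than success, including skipped, counts as a reason to fail the run.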
