I have created a task using PythonOperator. It calls a function in another folder and passes it an argument. But the operator does not seem to accept the argument dag=dag, even though it is required to attach the task to the DAG.
dags/
- my_dag.py
sub_folder/
- __init__.py
- my_functions.py
My DAG contains task1 and task2. They will call the function from a sub folder, and pass an argument to print.
my_dag.py
import datetime as dt
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from sub_folder.my_functions import task1, task2
args = {
'owner': 'hello',
'start_date': dt.datetime(2019, 1, 1),
'retries': 1,
'retry_delay': dt.timedelta(minutes=2)
}
dag = DAG(
'try',
default_args = args,
schedule_interval = dt.timedelta(minutes=2))
task1 = PythonOperator(
task_id='task1',
python_callable=task1,
provide_context=True,
op_kwargs={'idx': "Hello "},
dag=dag,
)
task2 = PythonOperator(
task_id='task2',
python_callable=task2,
provide_context=True,
op_kwargs={'idx': "World!"},
dag=dag,
)
task1 >> task2
The callables are simple functions that print the argument passed to them.
my_functions.py
def task1(idx):
    print(f"Task 1! {idx}")

def task2(idx):
    print(f"Task 2! {idx}")
My task1 keeps retrying and eventually fails. I looked into the logs to find out what's going on and found that it raises a
TypeError: task1() got an unexpected keyword argument 'dag'
I don't know what is happening here. Obviously I have to pass dag=dag, since it is the argument that tells the operator which DAG it belongs to.
There is a name conflict between my_functions.task1 and the PythonOperator variable named task1.
Try:
import sub_folder.my_functions as mf # changed
task1 = PythonOperator(
task_id='task1',
python_callable=mf.task1, # changed
provide_context=True,
op_kwargs={'idx': "Hello "},
dag=dag,
)
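A second possible cause is worth checking, assuming this is Airflow 1.x: with provide_context=True the operator also passes the task context (which includes a dag entry) to the callable as keyword arguments, so a function that accepts only idx can raise exactly this TypeError. The sketch below reproduces the effect in plain Python without Airflow; call_like_operator and fake_context are hypothetical stand-ins, not Airflow APIs.

```python
def task1(idx):
    return f"Task 1! {idx}"


def task1_with_context(idx, **context):
    # **context absorbs any extra keyword arguments such as 'dag'.
    return f"Task 1! {idx}"


def call_like_operator(python_callable, op_kwargs, context):
    """Hypothetical stand-in: merge context and op_kwargs, then call."""
    merged = dict(context)
    merged.update(op_kwargs)
    return python_callable(**merged)


# The real context has many more keys; one is enough to show the effect.
fake_context = {"dag": "<DAG: try>"}

try:
    call_like_operator(task1, {"idx": "Hello "}, fake_context)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

result = call_like_operator(task1_with_context, {"idx": "Hello "}, fake_context)
print(error_message)
print(result)
```

If that turns out to be the cause, either add **kwargs to the functions in my_functions.py or drop provide_context=True.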
Let's take an example DAG.
Here is the code for it.
import logging
from airflow import DAG
from datetime import datetime, timedelta
from airflow.models import TaskInstance
from airflow.operators.python import PythonOperator
from airflow.operators.dummy import DummyOperator
def task_failure_notification_alert(context):
    logging.info("Task context details: %s", str(context))

def dag_failure_notification_alert(context):
    logging.info("DAG context details: %s", str(context))

def red_exception_task(ti: TaskInstance, **kwargs):
    raise Exception('red')
default_args = {
"owner": "analytics",
"start_date": datetime(2021, 12, 12),
'retries': 0,
'retry_delay': timedelta(),
"schedule_interval": "@daily"
}
dag = DAG('logger_dag',
default_args=default_args,
catchup=False,
on_failure_callback=dag_failure_notification_alert
)
start_task = DummyOperator(task_id="start_task", dag=dag, on_failure_callback=task_failure_notification_alert)
red_task = PythonOperator(
dag=dag,
task_id='red_task',
python_callable=red_exception_task,
provide_context=True,
on_failure_callback=task_failure_notification_alert
)
end_task = DummyOperator(task_id="end_task", dag=dag, on_failure_callback=task_failure_notification_alert)
start_task >> red_task >> end_task
Two callbacks, task_failure_notification_alert and dag_failure_notification_alert, are called on failure.
The output of the task-level callback shows up in the task's log in the UI, but I am unable to find the logs for the DAG-level on_failure_callback anywhere in the UI. Where can we see them?
Under airflow/logs, find the "scheduler" folder. Inside it, open the folder for the date on which the DAG ran (for example 2022-12-03); there you will see a log file named after the DAG file (dag_file.log).
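As a rough sketch of that lookup, the helper below lists the scheduler log files for one DAG file on one date. It assumes the layout described above (a scheduler folder with one sub-folder per date); find_scheduler_logs and its parameters are hypothetical names, not part of Airflow.

```python
from pathlib import Path


def find_scheduler_logs(logs_root, run_date, dag_file_name):
    """Return scheduler log files for one DAG file on one date.

    Assumes the layout <logs_root>/scheduler/<YYYY-MM-DD>/.../<dag_file_name>*.log
    """
    day_dir = Path(logs_root) / "scheduler" / run_date
    if not day_dir.is_dir():
        return []
    return sorted(
        path for path in day_dir.rglob("*.log")
        if path.name.startswith(dag_file_name)
    )
```

For example, find_scheduler_logs("airflow/logs", "2022-12-03", "logger_dag") would return the matching files, or an empty list if the date folder does not exist.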
I have a DAG with 3 tasks in it. I would like to hide the 2nd task (middle_name) in the DAG based on a condition. For example, if middle_name_var == 'false', I don't want the middle_name task to be displayed in the DAG. Is there an elegant way to achieve this?
from airflow import DAG
from airflow.operators import PythonOperator
from airflow.operators.python_operator import BranchPythonOperator
from airflow.operators.dummy_operator import DummyOperator
from datetime import datetime, timedelta
from airflow.models import Variable
middle_name_var = Variable.get('middle_name')
default_args = {
'owner': 'test',
'depends_on_past': False,
'start_date': datetime(2018, 6, 18),
'email': ['tes@abc.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=1)
}
dag = DAG(
'name',
default_args=default_args,
schedule_interval="@once")
def first_name():
    print('John')

def middle_name():
    print('Smith')

def last_name():
    print('Doe')
first_name_task = PythonOperator(
task_id='first_name',
provide_context=False,
python_callable=first_name,
dag=dag
)
middle_name_task = PythonOperator(
task_id='middle_name',
provide_context=False,
python_callable=middle_name,
dag=dag
)
last_name_task = PythonOperator(
task_id='last_name',
provide_context=False,
python_callable=last_name,
dag=dag
)
if middle_name_var == 'true':
    first_name_task >> middle_name_task >> last_name_task
else:
    first_name_task >> last_name_task
My DAG looks like this with the middle_name task... But I would like to leave out the middle_name task based on middle_name_var, which is set to false in this case.
With reference to that last set of chaining statements
# note: Variable.get returns a string by default, so comparing against
# the string 'true' (rather than the boolean True) is the right check here
if middle_name_var == 'true':
    first_name_task >> middle_name_task >> last_name_task
else:
    first_name_task >> last_name_task
Let me ask you: what would happen if you remove these chaining statements? Would the tasks disappear from DAG?
Not quite.
Chaining merely establishes a dependency relationship between tasks. Even without chaining, your task would remain part of your DAG (like in the screenshot that you've posted).
Here's the secret bit: a task becomes part of your dag as soon as you declare it
middle_name_task = PythonOperator(
task_id='middle_name',
provide_context=False,
python_callable=middle_name,
dag=dag
)
And whether or not you set that task upstream or downstream of some other tasks, it will continue to 'appear' in your DAG. Quoting the docs in this regard
Operators do not have to be assigned to DAGs immediately (previously
dag was a required argument). However, once an operator is assigned to
a DAG, it can not be transferred or unassigned. DAG assignment can be
done explicitly when the operator is created, through deferred
assignment, or even inferred from other operators.
Q So what should you do to 'not display' the task?
A Simply not declare (instantiate) it.
Q And how would you go about doing that?
A Just move task declaration inside your if-else clause
if middle_name_var == 'true':
    middle_name_task = PythonOperator(
        task_id='middle_name',
        provide_context=False,
        python_callable=middle_name,
        dag=dag
    )
    first_name_task >> middle_name_task >> last_name_task
else:
    first_name_task >> last_name_task
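To see why moving the declaration matters, here is a toy model in plain Python, not Airflow code. ToyDag and ToyTask are hypothetical stand-ins that only mimic one behaviour: a task registers itself with its DAG the moment it is instantiated.

```python
class ToyDag:
    """Hypothetical stand-in for a DAG: it only records task ids."""

    def __init__(self):
        self.task_ids = []


class ToyTask:
    """Hypothetical stand-in for an operator: registers itself on creation."""

    def __init__(self, task_id, dag):
        self.task_id = task_id
        dag.task_ids.append(task_id)


def build_dag(middle_name_var):
    dag = ToyDag()
    ToyTask("first_name", dag)
    if middle_name_var == "true":
        # Only instantiated, and therefore only registered, when enabled.
        ToyTask("middle_name", dag)
    ToyTask("last_name", dag)
    return dag


print(build_dag("true").task_ids)
print(build_dag("false").task_ids)
```

With "false", the middle_name task is never created, so there is nothing for the DAG to display.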
I have a dag as below:
ingest_excel.py:
from __future__ import print_function
import time
from builtins import range
from datetime import timedelta
from pprint import pprint
import airflow
from airflow.models import DAG
#from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
args = {
'owner': 'rxie',
'start_date': airflow.utils.dates.days_ago(2),
}
dag = DAG(
dag_id='ingest_excel',
default_args=args,
schedule_interval='0 0 * * *',
dagrun_timeout=timedelta(minutes=60),
)
def print_context(**kwargs):
    pprint("DAG info below:")
    pprint(kwargs)
    return 'Whatever you return gets printed in the logs'
t11_extract_excel_to_csv = PythonOperator(
task_id='t1_extract_excel_to_csv',
provide_context=True,
python_callable=print_context(),
op_kwargs=None,
dag=dag,
)
t12_upload_csv_to_hdfs_parquet = PythonOperator(
task_id='t12_upload_csv_to_hdfs_parquet',
provide_context=True,
python_callable=print_context(),
op_kwargs=None,
dag=dag,
)
t13_register_parquet_to_impala = PythonOperator(
task_id='t13_register_parquet_to_impala',
provide_context=True,
python_callable=print_context(),
op_kwargs=None,
dag=dag,
)
t21_text_to_parquet = PythonOperator(
task_id='t21_text_to_parquet',
provide_context=True,
python_callable=print_context(),
op_kwargs=None,
dag=dag,
)
t22_register_parquet_to_impala = PythonOperator(
task_id='t22_register_parquet_to_impala',
provide_context=True,
python_callable=print_context(),
op_kwargs=None,
dag=dag,
)
t31_verify_completion = PythonOperator(
task_id='t31_verify_completion',
provide_context=True,
python_callable=print_context(),
op_kwargs=None,
dag=dag,
)
t32_send_notification = PythonOperator(
task_id='t32_send_notification',
provide_context=True,
python_callable=print_context(),
op_kwargs=None,
dag=dag,
)
t11_extract_excel_to_csv >> t12_upload_csv_to_hdfs_parquet
t12_upload_csv_to_hdfs_parquet >> t13_register_parquet_to_impala
t21_text_to_parquet >> t22_register_parquet_to_impala
t13_register_parquet_to_impala >> t31_verify_completion
t22_register_parquet_to_impala >> t31_verify_completion
t31_verify_completion >> t32_send_notification
#if __name__ == "__main__":
# dag.cli()
In DAG GUI it prompts:
Broken DAG: [/root/airflow/dags/ingest_excel.py] python_callable
param must be callable
This is my first DAG and I am pretty new to Airflow. It would be greatly appreciated if anyone could shed some light on this and help me sort it out.
Thank you in advance.
To elaborate on your issue: your process is broken because you're not passing the function print_context to the PythonOperator, you're passing the result of calling print_context:
[...]
t32_send_notification = PythonOperator(
task_id='t32_send_notification',
provide_context=True,
python_callable=print_context(), # <-- This is the issue.
op_kwargs=None,
dag=dag,
)
[...]
Your function is returning the string 'Whatever you return gets printed in the logs' which is, in turn, being provided to the PythonOperator in the python_callable keyword argument. Airflow is essentially attempting to do the following:
your_return = 'Whatever you return gets printed in the logs'
your_return()
...and you're receiving the error you see. The other contributor is correct in stating that you should change your PythonOperator.python_callable keyword argument to simply print_context, without the parentheses.
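The difference between the function object and the result of calling it can be checked in plain Python, without Airflow. This is only a minimal sketch; the variable names are illustrative.

```python
def print_context(**kwargs):
    return 'Whatever you return gets printed in the logs'


# Passing the function object: this is what python_callable expects.
function_is_callable = callable(print_context)

# Passing the result of calling it: a plain string, which is not callable.
returned_value = print_context()
result_is_callable = callable(returned_value)

try:
    returned_value()
    call_error = None
except TypeError as exc:
    call_error = str(exc)

print(function_is_callable, result_is_callable, call_error)
```

The string fails the callable check, which is the condition the "python_callable param must be callable" message reports.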
The following option needs to be passed to PythonOperator in newer Airflow 1.x releases (in Airflow 2 the context is passed automatically and this option is deprecated):
provide_context=True
Otherwise the ds parameter is not passed to your function. This was a recent change to Airflow that I ran into.
Complete Example:
def print_context(ds, **kwargs):
    pprint(kwargs)
    print(ds)
    return 'Whatever you return gets printed in the logs'
run_this = PythonOperator(
task_id='print_the_context',
provide_context=True,
python_callable=print_context,
dag=dag,
)
I'm not entirely sure why your code doesn't work. It should work, but a workaround is given below.
def print_context(**kwargs):
    ds = kwargs['ds']
Also, the python_callable should be passed like this:
python_callable=print_context,
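Both ways of reading ds behave the same when the context arrives as keyword arguments. The sketch below shows this in plain Python; fake_context is a hypothetical stand-in for the context Airflow would pass, and its values are made up.

```python
def print_context_named(ds, **kwargs):
    # ds is bound by name; everything else lands in kwargs.
    return ds


def print_context_lookup(**kwargs):
    # ds is read from the keyword-argument dictionary instead.
    return kwargs['ds']


# Hypothetical context with made-up values.
fake_context = {"ds": "2019-01-01", "run_id": "manual__demo"}

print(print_context_named(**fake_context))
print(print_context_lookup(**fake_context))
```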
I have the following DAG with 3 tasks:
start --> special_task --> end
The task in the middle can succeed or fail, but end must always be executed (imagine this is a task for cleanly closing resources). For that, I used the trigger rule ALL_DONE:
end.trigger_rule = trigger_rule.TriggerRule.ALL_DONE
Using that, end is properly executed if special_task fails. However, since end is the last task and succeeds, the DAG is always marked as SUCCESS.
How can I configure my DAG so that if one of the tasks failed, the whole DAG is marked as FAILED?
Example to reproduce
import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils import trigger_rule
dag = DAG(
dag_id='my_dag',
start_date=datetime.datetime.today(),
schedule_interval=None
)
start = BashOperator(
task_id='start',
bash_command='echo start',
dag=dag
)
special_task = BashOperator(
task_id='special_task',
bash_command='exit 1', # force failure
dag=dag
)
end = BashOperator(
task_id='end',
bash_command='echo end',
dag=dag
)
end.trigger_rule = trigger_rule.TriggerRule.ALL_DONE
start.set_downstream(special_task)
special_task.set_downstream(end)
This post seems to be related, but the answer does not suit my needs, since the downstream task end must be executed (hence the mandatory trigger_rule).
I thought it was an interesting question and spent some time figuring out how to achieve it without an extra dummy task. It became a bit of a superfluous task, but here's the end result:
This is the full DAG:
import airflow
from airflow import AirflowException
from airflow.models import DAG, TaskInstance, BaseOperator
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.utils.db import provide_session
from airflow.utils.state import State
from airflow.utils.trigger_rule import TriggerRule
default_args = {"owner": "airflow", "start_date": airflow.utils.dates.days_ago(3)}
dag = DAG(
dag_id="finally_task_set_end_state",
default_args=default_args,
schedule_interval="0 0 * * *",
description="Answer for question https://stackoverflow.com/questions/51728441",
)
start = BashOperator(task_id="start", bash_command="echo start", dag=dag)
failing_task = BashOperator(task_id="failing_task", bash_command="exit 1", dag=dag)
@provide_session
def _finally(task, execution_date, dag, session=None, **_):
    upstream_task_instances = (
        session.query(TaskInstance)
        .filter(
            TaskInstance.dag_id == dag.dag_id,
            TaskInstance.execution_date == execution_date,
            TaskInstance.task_id.in_(task.upstream_task_ids),
        )
        .all()
    )
    upstream_states = [ti.state for ti in upstream_task_instances]
    fail_this_task = State.FAILED in upstream_states
    print("Do logic here...")
    if fail_this_task:
        raise AirflowException("Failing task because one or more upstream tasks failed.")
finally_ = PythonOperator(
task_id="finally",
python_callable=_finally,
trigger_rule=TriggerRule.ALL_DONE,
provide_context=True,
dag=dag,
)
succesful_task = DummyOperator(task_id="succesful_task", dag=dag)
start >> [failing_task, succesful_task] >> finally_
Look at the _finally function, which is called by the PythonOperator. There are a few key points here:
Annotate with @provide_session and add argument session=None, so you can query the Airflow DB with session.
Query all upstream task instances for the current task:
upstream_task_instances = (
session.query(TaskInstance)
.filter(
TaskInstance.dag_id == dag.dag_id,
TaskInstance.execution_date == execution_date,
TaskInstance.task_id.in_(task.upstream_task_ids),
)
.all()
)
From the returned task instances, get the states and check if State.FAILED is in there:
upstream_states = [ti.state for ti in upstream_task_instances]
fail_this_task = State.FAILED in upstream_states
Perform your own logic:
print("Do logic here...")
And finally, fail the task if fail_this_task=True:
if fail_this_task:
    raise AirflowException("Failing task because one or more upstream tasks failed.")
The end result:
As @JustinasMarozas explained in a comment, a solution is to create a dummy task like:
dummy = DummyOperator(
task_id='test',
dag=dag
)
and bind it downstream of special_task:
special_task.set_downstream(dummy)
Thus, the DAG is marked as failed, and the dummy task is marked as upstream_failed.
I hope an out-of-the-box solution appears, but until then this workaround does the job.
To expand on Bas Harenslak's answer, a simpler _finally function which will check the state of all tasks (not only the upstream ones) can be:
def _finally(**kwargs):
    for task_instance in kwargs['dag_run'].get_task_instances():
        if task_instance.current_state() != State.SUCCESS and \
                task_instance.task_id != kwargs['task_instance'].task_id:
            raise Exception("Task {} failed. Failing this DAG run".format(task_instance.task_id))
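The decision logic of this simpler check can be isolated and tried without a running Airflow instance. In the sketch below, first_unsuccessful_task is a hypothetical helper, and the (task_id, state) pairs stand in for what dag_run.get_task_instances() would provide.

```python
def first_unsuccessful_task(task_states, current_task_id):
    """Return the id of the first task that is not 'success', or None.

    task_states: iterable of (task_id, state) pairs; a hypothetical stand-in
    for the task instances of one DAG run.
    """
    for task_id, state in task_states:
        # Skip the task doing the check: it is still running at this point.
        if task_id != current_task_id and state != "success":
            return task_id
    return None


run = [("start", "success"), ("failing_task", "failed"), ("finally", "running")]
print(first_unsuccessful_task(run, "finally"))
```

Note that this check treats any state other than success as a failure, so a skipped task would also fail the DAG run.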
I want to execute task 2 if task 1 succeeds. If task 1 fails, I want to run task 3, and possibly assign another flow if required.
Basically, I want to run conditional tasks in Airflow without SSH operators.
from airflow import DAG
from airflow.operators import PythonOperator,BranchPythonOperator
from airflow.operators import BashOperator
from datetime import datetime, timedelta
from airflow.models import Variable
def t2_error_task(context):
    instance = context['task_instance']
    if instance.task_id == "performExtract":
        print ("Please implement something over this")
        task_3 = PythonOperator(
            task_id='performJoin1',
            python_callable=performJoin1, # maybe main?
            dag = dag
        )
        dag.add_task(task_3)

with DAG(
    'manageWorkFlow',
    catchup=False,
    default_args={
        'owner': 'Mannu',
        'start_date': datetime(2018, 4, 13),
        'schedule_interval':None,
        'depends_on_past': False,
    },
) as dag:
    task_1 = PythonOperator(
        task_id='performExtract',
        python_callable=performExtract,
        on_failure_callback=t2_error_task,
        depends_on_past=True
    )
    task_2 = PythonOperator(
        task_id='printSchemas',
        depends_on_past=True,
        python_callable=printSchemaAll, # maybe main?
    )
    task_2.set_upstream(task_1)
Adding tasks dynamically based on execution-time statuses is not something Airflow supports. To get the desired behaviour, add task_3 to your DAG downstream of task_1 and change its trigger_rule to all_failed. The task will then be marked as skipped when task_1 succeeds, and it will be executed when task_1 fails.
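The effect of the trigger rule can be illustrated with a toy evaluation in plain Python. This is only a sketch of the idea, not Airflow's implementation; should_run is a hypothetical helper, and it covers just the three rules relevant here.

```python
def should_run(trigger_rule, upstream_states):
    """Toy evaluation of a few trigger rules over finished upstream states."""
    if trigger_rule == "all_success":
        return all(state == "success" for state in upstream_states)
    if trigger_rule == "all_failed":
        return all(state in ("failed", "upstream_failed") for state in upstream_states)
    if trigger_rule == "all_done":
        return True  # upstream tasks are assumed to have finished already
    raise ValueError(f"unsupported trigger rule: {trigger_rule}")


# task_2 keeps the default rule (all_success); task_3 uses all_failed.
for task_1_state in ("success", "failed"):
    print(
        task_1_state,
        "task_2 runs:", should_run("all_success", [task_1_state]),
        "task_3 runs:", should_run("all_failed", [task_1_state]),
    )
```

With task_1 as the only upstream task of both, exactly one of task_2 and task_3 runs for each outcome.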