Airflow Exception not being thrown when encountering None/Falsy values - airflow

I am trying to pass data from one PythonOperator, _etl_lasic, to another PythonOperator, _download_s3_data. This works fine, but I want to raise an exception when the value passed is None, which should mark the task as failed.
import airflow
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.exceptions import AirflowFailException
def _etl_lasic(**context):
    path_s3 = None
    context["task_instance"].xcom_push(
        key="path_s3",
        value=path_s3,
    )
def _download_s3_data(templates_dict, **context):
    path_s3 = templates_dict["path_s3"]
    if not path_s3:
        raise AirflowFailException("Path to S3 was not passed!")
    else:
        print(f"Path to S3: {path_s3}")
with DAG(
    dag_id="02_lasic_retraining_without_etl",
    start_date=airflow.utils.dates.days_ago(3),
    schedule_interval="@once",
) as dag:
    etl_lasic = PythonOperator(
        task_id="etl_lasic",
        python_callable=_etl_lasic,
    )
    download_s3_data = PythonOperator(
        task_id="download_s3_data",
        python_callable=_download_s3_data,
        templates_dict={
            "path_s3": "{{task_instance.xcom_pull(task_ids='etl_lasic',key='path_s3')}}"
        },
    )
    etl_lasic >> download_s3_data
Logs:
[2021-08-17 04:04:41,128] {logging_mixin.py:103} INFO - Path to S3: None
[2021-08-17 04:04:41,128] {python.py:118} INFO - Done. Returned value was: None
[2021-08-17 04:04:41,143] {taskinstance.py:1135} INFO - Marking task as SUCCESS. dag_id=02_lasic_retraining_without_etl, task_id=download_s3_data, execution_date=20210817T040439, start_date=20210817T040440, end_date=20210817T040441
[2021-08-17 04:04:41,189] {taskinstance.py:1195} INFO - 0 downstream tasks scheduled from follow-on schedule check
[2021-08-17 04:04:41,212] {local_task_job.py:118} INFO - Task exited with return code 0

Jinja-templated values are rendered as strings by default. In your case, even though you push an XCom value of None, when the value is pulled via "{{task_instance.xcom_pull(task_ids='etl_lasic',key='path_s3')}}" it is rendered as the string "None". A non-empty string is truthy, so the "if not path_s3" check is False and no exception is raised.
Either of these two options will solve this:
1. Instead of setting path_s3 to None in the "_etl_lasic" function, set it to an empty string.
2. If you are using Airflow 2.1+, there is a parameter, render_template_as_native_obj, that can be set at the DAG level and renders Jinja-templated values as native Python types (list, dict, etc.). Setting that parameter to True will do the trick without changing how path_s3 is set in the function. A conceptual example is documented here.
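To see why the original check never fires, here is a minimal sketch in plain Python. It assumes that string-based rendering effectively calls str() on the pulled value, which matches the "Path to S3: None" line in the log above. The helper name is_missing is hypothetical and not part of Airflow.

```python
def is_missing(rendered_value):
    """Treat None, an empty string, or the literal text "None" as absent.

    Hypothetical helper for illustration; not an Airflow API.
    """
    if rendered_value is None:
        return True
    text = str(rendered_value).strip()
    return text in ("", "None")


# Assumption: default (string) rendering turns a None XCom into "None".
rendered = str(None)

print(repr(rendered))        # the value is a non-empty string
print(not rendered)          # the condition used by `if not path_s3`
print(is_missing(rendered))  # the defensive check
```

A check like this is only a stopgap; pushing an empty string or enabling native rendering keeps the task code simpler.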

how to pass default values for run time input variable in airflow for scheduled execution

I came across an issue while running a DAG in Airflow. My code works in two scenarios but fails in the third.
My scenarios are:
Manual trigger with input - Running Fine
Manual trigger without input - Running Fine
Scheduled Run - Failing
Below is my code:
def decide_the_flow(**kwargs):
    cleanup=kwargs['dag_run'].conf.get('cleanup','N')
    print("IP is :",cleanup)
    return cleanup
I am getting below error,
cleanup=kwargs['dag_run'].conf.get('cleanup','N')
AttributeError: 'NoneType' object has no attribute 'get'
I tried to define default variables like,
default_dag_args = {
    'start_date':days_ago(0),
    'params': {
        "cleanup": "N"
    },
    'retries': 0
}
but it won't work.
I am using BranchPythonOperator to call this function.
Can anyone please guide me here? What am I missing?
As a workaround I am using the code below:
try:
    cleanup=kwargs['dag_run'].conf.get('cleanup','N')
except:
    cleanup="N"
You can read the parameter from the params entry of the context instead. Airflow fills this dict with the default values defined on the DAG and applies whatever is present in dag_run.conf on top, so the key is defined even when no conf was passed:
from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator
def decide_the_flow(**kwargs):
    cleanup = kwargs['params']["cleanup"]
    print(f"IP is : {cleanup}")
    return cleanup
with DAG(
    dag_id='airflow_params',
    start_date=datetime(2022, 8, 25),
    schedule_interval="* * * * *",
    params={
        "cleanup": "N",
    },
    catchup=False
) as dag:
    branch_task = BranchPythonOperator(
        task_id='test_param',
        python_callable=decide_the_flow
    )
    task_n = EmptyOperator(task_id="N")
    task_m = EmptyOperator(task_id="M")
    branch_task >> [task_n, task_m]
I just tested it in scheduled and manual (with and without conf) runs, it works fine.
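If you would rather keep reading dag_run.conf directly, the AttributeError from the question can also be avoided without a bare except. The following is a minimal sketch, assuming (as the traceback shows) that conf is None on scheduled runs. get_cleanup_flag is a hypothetical helper, and SimpleNamespace only stands in for a real DagRun object.

```python
from types import SimpleNamespace


def get_cleanup_flag(dag_run, default="N"):
    """Return conf['cleanup'] if present, otherwise the default.

    Handles a missing dag_run as well as conf being None.
    """
    conf = getattr(dag_run, "conf", None) or {}
    return conf.get("cleanup", default)


# Stand-ins for the three scenarios from the question.
manual_with_input = SimpleNamespace(conf={"cleanup": "Y"})
manual_without_input = SimpleNamespace(conf={})
scheduled_run = SimpleNamespace(conf=None)

for run in (manual_with_input, manual_without_input, scheduled_run):
    print(get_cleanup_flag(run))
```

Inside the task the call would be get_cleanup_flag(kwargs['dag_run']).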

Airflow DAG - Failed Task Doesn't Show Fail Status as It Should

I just started with Airflow DAGs and encountered a strange issue with the tool. I am using Airflow version 2.3.3 with SequentialExecutor.
The script I used:
import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
dag_args = {
    'owner': 'hao',
    'retries': 2,
    'retry_delay': datetime.timedelta(minutes=1)
}
with DAG(
    dag_id='dependency_experiment',
    default_args=dag_args,
    description='experiment the dag task denpendency expression',
    start_date=datetime.datetime.now(),
    schedule_interval='@daily',
    dagrun_timeout=datetime.timedelta(seconds=10),
) as dag:
    pyOp = PythonOperator(
        task_id='pyOp',
        python_callable=lambda x: haha * x,
        op_kwargs={'x': 10}
    )
    pyOp
The Log Snippet of This Task:
NameError: name 'haha' is not defined
[2022-07-27, 18:19:34 EDT] {taskinstance.py:1415} INFO - Marking task as UP_FOR_RETRY. dag_id=dependency_experiment, task_id=pyOp, execution_date=20220728T021932, start_date=20220728T021934, end_date=20220728T021934
[2022-07-27, 18:19:34 EDT] {standard_task_runner.py:92} ERROR - Failed to execute job 44 for task pyOp (name 'haha' is not defined; 19405)
[2022-07-27, 18:19:34 EDT] {local_task_job.py:156} INFO - Task exited with return code 1
[2022-07-27, 18:19:34 EDT] {local_task_job.py:273} INFO - 0 downstream tasks scheduled from follow-on schedule check
Problem:
I purposefully defined a PythonOperator, which would fail. When I put the script on DAG, the task raised an exception as expected; however, the status for this task is always skipped. I cannot figure out why the task didn't show a failed status as expected. Any suggestions will be much appreciated.
It's because you are defining 'retries' and 'retry_delay' in your dag_args dictionary.
From the Docs:
default_args (Optional[Dict]) – A dictionary of default parameters to be used as constructor keyword parameters when initialising operators. Note that operators have the same hook, and precede those defined here, meaning that if your dict contains ‘depends_on_past’: True here and ‘depends_on_past’: False in the operator’s call default_args, the actual value will be False.
When you set 'retries' to a value, Airflow expects the task to be retried at a later time, so it marks it as UP_FOR_RETRY rather than FAILED, and the UI shows it as skipped.
If you delete 'retries' and 'retry_delay' from dag_args, you'll see the task set to failed when you run the DAG.
When I ran your code in the logs I see:
INFO - Marking task as UP_FOR_RETRY. dag_id=dependency_experiment, task_id=pyOp, execution_date=20220729T060953, start_date=20220729T060953, end_date=20220729T060953
After I delete the 'retries' and 'retry_delay' the same log becomes:
INFO - Marking task as FAILED. dag_id=dependency_experiment, task_id=pyOp, execution_date=20220729T061031, start_date=20220729T061031, end_date=20220729T061031
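The interaction between default_args and the task state can be sketched in plain Python. This is a simplified illustration of the behaviour described above, not Airflow's actual implementation; effective_args and state_after_failure are hypothetical names, and try_number is assumed to start at 1.

```python
def effective_args(default_args, operator_kwargs):
    """Operator-level keyword arguments take precedence over default_args."""
    merged = dict(default_args)
    merged.update(operator_kwargs)
    return merged


def state_after_failure(try_number, retries):
    """Simplified rule: a failed attempt is retried while retries remain."""
    return "up_for_retry" if try_number <= retries else "failed"


dag_args = {"owner": "hao", "retries": 2}

# The task inherits retries=2, so its first failure is not final.
inherited = effective_args(dag_args, {"task_id": "pyOp"})
print(state_after_failure(1, inherited["retries"]))

# The task overrides retries=0, so its first failure is final.
overridden = effective_args(dag_args, {"task_id": "pyOp", "retries": 0})
print(state_after_failure(1, overridden["retries"]))
```

This suggests that passing retries=0 to a single operator is an alternative to removing 'retries' from dag_args when only one task should fail immediately.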

How to pull XCOM value from on_failure_callback

When a task fails, is it possible to pull an XCom value that was previously set in another task during the on_failure_callback execution?
To be more specific, an example:
dag: task1 >> task2
task1 runs successfully and sets key="test" value=123 in XCom
task2 fails
on_failure_callback is called
Is it possible to retrieve the value of key test in the on_failure_callback?
I tried it like this, but it did not find any value:
# DAG configuration
...
"on_failure_callback": deploy_failure,
...
# In task1
kwargs["ti"].xcom_push(key="test", value=123)
# on_failure_callback method
def deploy_failure(context):
    print("/!\\ Deploy failure callback triggered...")
    test_value = context.get("ti").xcom_pull(key="test")
    print(test_value)
test_value is None
I am sure the Xcom value is set because I can see it on the Airflow backend.
Any idea?
I guess there is some issue with provide_context in the failure callback. You can work around this by accessing the XCom class directly:
from airflow.models import XCom
def deploy_failure(context):
    print("/!\\ Deploy failure callback triggered...")
    test_value = XCom.get_one(execution_date=context.get('execution_date'), key='test')
    print("ALERT: {0}".format(test_value))
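Another variant that may be worth trying before going to the XCom model is to name the producing task explicitly, since xcom_pull accepts a task_ids argument. This is a sketch under that assumption, not a confirmed fix for the behaviour in the question. FakeTaskInstance is a hypothetical stand-in so the callback can be exercised outside Airflow.

```python
class FakeTaskInstance:
    """Hypothetical stand-in exposing only xcom_pull, for local testing."""

    def __init__(self, store):
        self._store = store  # maps (task_id, key) -> value

    def xcom_pull(self, task_ids=None, key="return_value"):
        return self._store.get((task_ids, key))


def deploy_failure(context):
    """Failure callback that pulls the value pushed by task1."""
    ti = context["ti"]
    test_value = ti.xcom_pull(task_ids="task1", key="test")
    print(f"Deploy failure callback triggered, test={test_value!r}")
    return test_value


# Exercise the callback with a context that mimics the question's setup.
context = {"ti": FakeTaskInstance({("task1", "test"): 123})}
result = deploy_failure(context)
```

In a real DAG only deploy_failure would be kept and Airflow would supply the context.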

Assign airflow task to several DAGs

I am trying to reuse an existing airflow task by assigning it to different dags.
def create_new_task_for_dag(task: BaseOperator,
                            dag: models.DAG) -> BaseOperator:
    """Create a deep copy of given task and associate it with given dag
    """
    new_task = copy.deepcopy(task)
    new_task.dag = dag
    return new_task
print_datetime_task = python_operator.PythonOperator(
    task_id='print_datetime', python_callable=_print_datetime)
# define a new dag ...
# add to the new dag
create_new_task_for_dag(print_datetime_task, new_dag)
Then it gives the error Task is missing the start_date parameter.
If I define the dag when creating the operator, print_datetime_task = PythonOperator(task_id='print_datetime', python_callable=_print_datetime, dag=new_dag), then it is OK.
I have searched around, and this seems to be the root cause: https://github.com/apache/airflow/pull/5598, but PR has been marked as stale.
I wonder if there is any other approach to reuse an existing Airflow task and assign it to a different DAG.
I am using apache-airflow[docker,kubernetes]==1.10.10
While I don't know the solution to your problem with current design (code-layout), it can be made to work by tweaking the design slightly (note that the following code-snippets have NOT been tested)
Instead of copying a task from a DAG,
def create_new_task_for_dag(task: BaseOperator,
                            dag: models.DAG) -> BaseOperator:
    """Create a deep copy of given task and associate it with given dag
    """
    new_task = copy.deepcopy(task)
    new_task.dag = dag
    return new_task
you can move the instantiation of the task (as well as its assignment to the DAG) to a separate utility function.
from datetime import datetime
from typing import Dict, Any
from airflow.models.dag import DAG
from airflow.operators.python_operator import PythonOperator
def add_new_print_datetime_task(my_dag: DAG,
                                kwargs: Dict[str, Any]) -> PythonOperator:
    """
    Creates and adds a new 'print_datetime' (PythonOperator) task in 'my_dag'
    and returns its reference
    :param my_dag: reference to DAG object in which to add the task
    :type my_dag: DAG
    :param kwargs: dictionary of args for PythonOperator / BaseOperator
                   'task_id' is mandatory
    :type kwargs: Dict[str, Any]
    :return: PythonOperator
    """
    def my_callable() -> None:
        print(datetime.now())
    return PythonOperator(dag=my_dag, python_callable=my_callable, **kwargs)
Thereafter you can call that function every time you want to instantiate that same task (and assign it to any DAG):
with DAG(dag_id="my_dag_id", start_date=datetime(year=2020, month=8, day=22, hour=16, minute=30)) as my_dag:
    print_datetime_task_kwargs: Dict[str, Any] = {
        "task_id": "my_task_id",
        "depends_on_past": True
    }
    print_datetime_task: PythonOperator = add_new_print_datetime_task(my_dag=my_dag, kwargs=print_datetime_task_kwargs)
    # ... other tasks and their wiring
References / good reads
Astronomer.io: Dynamically Generating DAGs in Airflow
Apache Airflow | With Statement and DAG

How to access Xcom value in a non airflow operator python function

I have a stored XCom value that I wanted to pass to another python function which is not called using PythonOperator.
def sql_file_template():
    <some code which uses xcom variable>
def call_stored_proc(**kwargs):
    #project = kwargs['row_id']
    print("INSIDE CALL STORE PROC ------------")
    query = """CALL `{0}.dataset_name.store_proc`(
    '{1}' # source table
    , ['{2}'] # row_ids
    , '{3}' # pivot_col_name
    , '{4}' # pivot_col_value
    , 100 # max_columns
    , 'MAX' # aggregation
    );"""
    query = query.format(kwargs['project'],kwargs['source_tbl'] ,kwargs['row_id'],kwargs['pivot_col'],kwargs['pivot_val'])
    job = client.query(query, location="US")
    for result in job.result():
        task_instance = kwargs['task_instance']
        task_instance.xcom_push(key='query_string', value=result)
        print(result)
        return result
bq_cmd = PythonOperator (
    task_id= 'task1',
    provide_context= True,
    python_callable= call_stored_proc,
    op_kwargs= {'project' : project,
                'source_tbl' : source_tbl,
                'row_id' : row_id,
                'pivot_col' : pivot_col,
                'pivot_val' : pivot_val
               },
    dag= dag
)
dummy_operator >> bq_cmd
sql_file_template()
The output of stored proc is a string which is captured using xcom.
Now I would like to pass this value to some python function sql_file_template without using PythonOperator.
As per Airflow documentation xcom can be accessed only between tasks.
Can anyone help on this?
If you have access to the Airflow installation you'd like to query (configuration, database access, and code), you can use the airflow.models.XCom.get_one class method:
from datetime import datetime
from airflow.models import XCom
execution_date = datetime(2020, 8, 28)
xcom_value = XCom.get_one(execution_date=execution_date,
                          task_id="the_task_id",
                          dag_id="the_dag_id")
So you want to access XCom outside Airflow (probably from a different project or module, without creating any Airflow DAGs or tasks)?
Airflow uses SQLAlchemy to map all its models (including XCom) to the corresponding tables in the backend meta-db.
Therefore this can be done in two ways:
Leverage Airflow's SQLAlchemy model
(without having to create a task or DAG). Here's an untested code snippet for reference:
from typing import List, Optional
from airflow.models import XCom
from airflow.settings import Session
from airflow.utils.db import provide_session
from pendulum import Pendulum
@provide_session
def read_xcom_values(dag_id: str,
                     task_id: str,
                     execution_date: Pendulum,
                     session: Optional[Session] = None) -> List[str]:
    """
    Function that reads and returns 'values' of XCOMs with given filters
    :param dag_id:
    :param task_id:
    :param execution_date: datetime object
    :param session: Airflow's SQLAlchemy Session (this param must not be passed, it will be automatically supplied by
                    '@provide_session' decorator)
    :return:
    """
    # read XCOMs
    xcoms: List[XCom] = session.query(XCom).filter(
        XCom.dag_id == dag_id, XCom.task_id == task_id,
        XCom.execution_date == execution_date).all()
    # retrieve 'value' fields from XCOMs
    xcom_values: List[str] = list(map(lambda xcom: xcom.value, xcoms))
    return xcom_values
Do note that since it imports airflow packages, it still requires a working Airflow installation on the Python path (as well as a connection to the backend db), but here we are not creating any tasks or DAGs (this snippet can be run in a standalone Python file).
For this snippet, I referred to views.py, which is my favorite place to peek into Airflow's SQLAlchemy magic.
Directly query Airflow's SQLAlchemy backend meta-db
Connect to meta db and run this query
SELECT value FROM xcom WHERE dag_id='' AND task_id='' AND ..
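When querying the metadata database directly, it is safer to bind the filter values as parameters than to paste them into the SQL string. The sketch below uses an in-memory SQLite table as a stand-in. The column set is a simplified assumption, the sample row is made up for illustration, and a real Airflow database stores value in a serialized form that would still need decoding.

```python
import sqlite3

# Stand-in for the metadata database; the schema is a simplified assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE xcom (dag_id TEXT, task_id TEXT, execution_date TEXT, value TEXT)"
)
conn.execute(
    "INSERT INTO xcom VALUES (?, ?, ?, ?)",
    ("the_dag_id", "task1", "2020-08-28T00:00:00", "SELECT 1"),
)


def read_xcom_values(connection, dag_id, task_id, execution_date):
    """Return all stored values matching the filters, using bound parameters."""
    rows = connection.execute(
        "SELECT value FROM xcom "
        "WHERE dag_id = ? AND task_id = ? AND execution_date = ?",
        (dag_id, task_id, execution_date),
    ).fetchall()
    return [row[0] for row in rows]


values = read_xcom_values(conn, "the_dag_id", "task1", "2020-08-28T00:00:00")
print(values)
```

Against a real deployment the connection would come from the driver for whichever database backs Airflow, and the placeholder syntax may differ from SQLite's question marks.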
