Airflow: Issues with Calling TaskGroup

I am having issues with calling TaskGroups: the error log shows my job id as avg_speed_20220502_22c11bdf instead of just avg_speed, and I can't figure out why.
Here's my code:
with DAG(
    'debug_bigquery_data_analytics',
    catchup=False,
    default_args=default_arguments,
) as dag:
    # Note to self: the bucket region and the Dataproc cluster should be in the same region
    create_cluster = DataprocCreateClusterOperator(
        task_id='create_cluster',
        ...
    )

    with TaskGroup(group_id='weekday_analytics') as weekday_analytics:
        avg_temperature = DummyOperator(task_id='avg_temperature')
        avg_tire_pressure = DummyOperator(task_id='avg_tire_pressure')
        avg_speed = DataprocSubmitPySparkJobOperator(
            task_id='avg_speed',
            project_id='...',
            main='gs://.../.../avg_speed.py',
            cluster_name='spark-cluster-{{ ds_nodash }}',
            region='...',
            dataproc_jars=['gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar'],
        )
        avg_temperature >> avg_tire_pressure >> avg_speed

    delete_cluster = DataprocDeleteClusterOperator(
        task_id='delete_cluster',
        project_id='...',
        cluster_name='spark-cluster-{{ ds_nodash }}',
        region='...',
        trigger_rule='all_done',
    )

    create_cluster >> weekday_analytics >> delete_cluster
Here's the error message I get:
google.api_core.exceptions.InvalidArgument: 400 Job id 'weekday_analytics.avg_speed_20220502_22c11bdf' must conform to '[a-zA-Z0-9]([a-zA-Z0-9\-\_]{0,98}[a-zA-Z0-9])?' pattern
[2022-05-02, 11:46:11 UTC] {taskinstance.py:1278} INFO - Marking task as FAILED. dag_id=debug_bigquery_data_analytics, task_id=weekday_analytics.avg_speed, execution_date=20220502T184410, start_date=20220502T184610, end_date=20220502T184611
[2022-05-02, 11:46:11 UTC] {standard_task_runner.py:93} ERROR - Failed to execute job 549 for task weekday_analytics.avg_speed (400 Job id 'weekday_analytics.avg_speed_20220502_22c11bdf' must conform to '[a-zA-Z0-9]([a-zA-Z0-9\-\_]{0,98}[a-zA-Z0-9])?' pattern; 18116)
[2022-05-02, 11:46:11 UTC] {local_task_job.py:154} INFO - Task exited with return code 1
[2022-05-02, 11:46:11 UTC] {local_task_job.py:264} INFO - 1 downstream tasks scheduled from follow-on schedule check

In Airflow the task identifier is task_id. However, when using TaskGroups you can have the same task_id in different groups, so tasks defined inside a task group get the identifier group_id.task_id.
For apache-airflow-providers-google>7.0.0:
The bug has been fixed, so this should work out of the box.
For apache-airflow-providers-google<=7.0.0:
You are having issues because DataprocJobBaseOperator has:
:param job_name: The job name used in the DataProc cluster. This name by default
is the task_id appended with the execution data, but can be templated. The
name will always be appended with a random number to avoid name clashes.
The problem is that with a TaskGroup the task_id becomes weekday_analytics.avg_speed, Airflow uses it to build the default job name, and Google does not accept the . character. To fix your issue you must override the default of the job_name parameter with a string of your choice; you can set it to the bare task_id if you wish.
I opened https://github.com/apache/airflow/issues/23439 to report this bug; in the meantime you can follow the suggestion above.
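For example, on the affected provider versions the operator from the question could pin the job name explicitly (a minimal sketch; the project, paths and region stay as the placeholders used in the question):

avg_speed = DataprocSubmitPySparkJobOperator(
    task_id='avg_speed',
    project_id='...',
    main='gs://.../.../avg_speed.py',
    cluster_name='spark-cluster-{{ ds_nodash }}',
    region='...',
    dataproc_jars=['gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar'],
    # Override the default job_name (task_id + execution date), which would
    # otherwise contain the '.' from the 'weekday_analytics.' group prefix.
    job_name='avg_speed',
)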

Related

Airflow DAG - Failed Task Doesn't Show Fail Status as It Should

I just started with Airflow DAGs and encountered a strange issue with the tool. I am using Airflow version 2.3.3 with SequentialExecutor.
The script I used:
import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

dag_args = {
    'owner': 'hao',
    'retries': 2,
    'retry_delay': datetime.timedelta(minutes=1)
}

with DAG(
    dag_id='dependency_experiment',
    default_args=dag_args,
    description='experiment with the DAG task dependency expression',
    start_date=datetime.datetime.now(),
    schedule_interval='@daily',
    dagrun_timeout=datetime.timedelta(seconds=10),
) as dag:
    pyOp = PythonOperator(
        task_id='pyOp',
        python_callable=lambda x: haha * x,  # 'haha' is intentionally undefined so the task fails
        op_kwargs={'x': 10}
    )

    pyOp
The log snippet of this task:
NameError: name 'haha' is not defined
[2022-07-27, 18:19:34 EDT] {taskinstance.py:1415} INFO - Marking task as UP_FOR_RETRY. dag_id=dependency_experiment, task_id=pyOp, execution_date=20220728T021932, start_date=20220728T021934, end_date=20220728T021934
[2022-07-27, 18:19:34 EDT] {standard_task_runner.py:92} ERROR - Failed to execute job 44 for task pyOp (name 'haha' is not defined; 19405)
[2022-07-27, 18:19:34 EDT] {local_task_job.py:156} INFO - Task exited with return code 1
[2022-07-27, 18:19:34 EDT] {local_task_job.py:273} INFO - 0 downstream tasks scheduled from follow-on schedule check
Problem:
I purposefully defined a PythonOperator that would fail. When I ran the script as a DAG, the task raised an exception as expected; however, the status for this task is always shown as skipped. I cannot figure out why the task doesn't show a failed status as expected. Any suggestions will be much appreciated.
It's because you are defining 'retries' and 'retry_delay' in your dag_args dictionary.
From the Docs:
default_args (Optional[Dict]) – A dictionary of default parameters to be used as constructor keyword parameters when initialising operators. Note that operators have the same hook, and precede those defined here, meaning that if your dict contains ‘depends_on_past’: True here and ‘depends_on_past’: False in the operator’s call default_args, the actual value will be False.
When you set 'retries' to a value, Airflow expects the task to be retried at a later time, so it shows it in the UI as skipped rather than failed (the logs show the task is marked UP_FOR_RETRY).
If you delete 'retries' and 'retry_delay' from dag_args, you'll see the task marked as failed when you trigger the DAG.
When I ran your code, the logs show:
INFO - Marking task as UP_FOR_RETRY. dag_id=dependency_experiment, task_id=pyOp, execution_date=20220729T060953, start_date=20220729T060953, end_date=20220729T060953
After deleting 'retries' and 'retry_delay', the same log line becomes:
INFO - Marking task as FAILED. dag_id=dependency_experiment, task_id=pyOp, execution_date=20220729T061031, start_date=20220729T061031, end_date=20220729T061031
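For reference, the trimmed dictionary from the question would then simply be (a minimal sketch):

dag_args = {
    'owner': 'hao',
    # 'retries' and 'retry_delay' removed: a failing task now goes straight
    # to FAILED instead of UP_FOR_RETRY.
}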

Airflow: Get status of a task from an external DAG

I want to get the status of a task from an external DAG. I have the same tasks running in 2 different DAGs based on some conditions. So, I want to check the status of this task in DAG2 from DAG1. If the task status is 'running' in DAG2, then I will skip this task in DAG1.
I tried using:
dag_runs = DagRun.find(dag_id=dag_id, execution_date=exec_dt)
for dag_run in dag_runs:
    dag_run.state
I couldn't figure out whether the task status can be obtained via DagRun.
If I use TaskDependencySensor, the DAG will have to wait until it finds the allowed_states of the task.
Is there a way to get the current status of a task in another DAG?
I used the code below to get the status of a task from another DAG:
from airflow.api.common.experimental.get_task_instance import get_task_instance

def get_dag_state(execution_date, **kwargs):
    ti = get_task_instance('dag_id', 'task_id', execution_date)
    task_status = ti.current_state()
    return task_status

dag_status = BranchPythonOperator(
    task_id='dag_status',
    python_callable=get_dag_state,
    dag=dag
)
More details can be found here
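Since the goal in the question is to skip the task in DAG1 when it is already running in DAG2, the callable can return the task_id of the branch to follow based on that state. A rough sketch along those lines (the DAG id, task id and branch task ids are hypothetical placeholders):

from airflow.api.common.experimental.get_task_instance import get_task_instance
from airflow.utils.state import State

def choose_branch(execution_date, **kwargs):
    # Look up the same task in the other DAG for this execution date.
    ti = get_task_instance('dag2_id', 'shared_task_id', execution_date)
    if ti.current_state() == State.RUNNING:
        # The task is already running in DAG2, so branch around it in DAG1.
        return 'skip_shared_task'
    return 'run_shared_task'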

Airflow Exception not being thrown when encountering None/Falsy values

I am trying to pass data from one PythonOperator, _etl_lasic, to another PythonOperator, _download_s3_data. That works fine, but I want to throw an exception when the value passed is None, which should mark the task as a failure.
import airflow
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.exceptions import AirflowFailException

def _etl_lasic(**context):
    path_s3 = None
    context["task_instance"].xcom_push(
        key="path_s3",
        value=path_s3,
    )

def _download_s3_data(templates_dict, **context):
    path_s3 = templates_dict["path_s3"]
    if not path_s3:
        raise AirflowFailException("Path to S3 was not passed!")
    else:
        print(f"Path to S3: {path_s3}")

with DAG(
    dag_id="02_lasic_retraining_without_etl",
    start_date=airflow.utils.dates.days_ago(3),
    schedule_interval="@once",
) as dag:
    etl_lasic = PythonOperator(
        task_id="etl_lasic",
        python_callable=_etl_lasic,
    )

    download_s3_data = PythonOperator(
        task_id="download_s3_data",
        python_callable=_download_s3_data,
        templates_dict={
            "path_s3": "{{task_instance.xcom_pull(task_ids='etl_lasic',key='path_s3')}}"
        },
    )

    etl_lasic >> download_s3_data
Logs:
[2021-08-17 04:04:41,128] {logging_mixin.py:103} INFO - Path to S3: None
[2021-08-17 04:04:41,128] {python.py:118} INFO - Done. Returned value was: None
[2021-08-17 04:04:41,143] {taskinstance.py:1135} INFO - Marking task as SUCCESS. dag_id=02_lasic_retraining_without_etl, task_id=download_s3_data, execution_date=20210817T040439, start_date=20210817T040440, end_date=20210817T040441
[2021-08-17 04:04:41,189] {taskinstance.py:1195} INFO - 0 downstream tasks scheduled from follow-on schedule check
[2021-08-17 04:04:41,212] {local_task_job.py:118} INFO - Task exited with return code 0
Jinja-templated values are rendered as strings by default. In your case, even though you push an XCom value of None, when the value is pulled via "{{task_instance.xcom_pull(task_ids='etl_lasic',key='path_s3')}}" it is actually rendered as the string "None", which doesn't trigger an exception with the current logic.
There are two options that will solve this:
1. Instead of setting path_s3 to None in the _etl_lasic function, set it to an empty string.
2. If you are using Airflow 2.1+, there is a parameter, render_template_as_native_obj, that can be set at the DAG level and renders Jinja-templated values as native Python types (list, dict, etc.). Setting that parameter to True will do the trick without changing how path_s3 is set in the function (see the sketch below). A conceptual example is documented here.
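A minimal sketch of the second option, assuming Airflow 2.1+ (only the DAG constructor changes; the operators stay exactly as in the question):

with DAG(
    dag_id="02_lasic_retraining_without_etl",
    start_date=airflow.utils.dates.days_ago(3),
    schedule_interval="@once",
    # Render Jinja-templated values as native Python objects, so the pushed None
    # is pulled back as None instead of the string "None".
    render_template_as_native_obj=True,
) as dag:
    ...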

airflow reschedule error: dependency 'Task Instance State' PASSED: False

I have a customized sensor that looks like the code below. The idea is that one DAG can have different tasks that start at different times, taking advantage of Airflow's built-in reschedule mechanism.
class MySensor(BaseSensorOperator):
    def __init__(self, *, start_time, tz, **kwargs):
        super().__init__(**kwargs)
        self._start_time = start_time
        self._tz = tz

    @provide_session
    def execute(self, context, session: Session = None):
        dt_start = datetime.combine(context['next_execution_date'].date(), self._start_time)
        dt_start = dt_start.replace(tzinfo=self._tz)
        if datetime.now().timestamp() < dt_start.timestamp():
            dt_reschedule = datetime.utcnow().replace(tzinfo=UTC)
            dt_reschedule += timedelta(seconds=dt_start.timestamp() - datetime.now().timestamp())
            raise AirflowRescheduleException(dt_reschedule)
        return super().execute(context)
In the DAG, I have something like the code below. However, I notice that when the mode is 'poke' (the default), the sensor does not work properly.
with DAG(schedule_interval='0 10 * * 1-5', ...) as dag:
    task1 = MySensor(start_time=time(14, 0), mode='poke')
    task2 = MySensor(start_time=time(16, 0), mode='reschedule')
    ...
From the log, I can see the following:
{taskinstance.py:1141} INFO - Rescheduling task, mark task as UP_FOR_RESCHEDULE
[5s later]
{local_task_job.py:102} INFO - Task exited with return code 0
[14s later]
{taskinstance.py:687} DEBUG - <TaskInstance: mydag.mytask execution_date [failed]> dependency 'Task Instance State' PASSED: False, Task in in the 'failed' state which is not a valid state for execution. The task must be cleared in order to be run.
{taskinstance.py:664} INFO - Dependencies not met for <TaskInstance ... [failed]> ...
Why is rescheduling not working with mode='poke'? And when did the scheduler(?) flip the state of the task instance from "up_for_reschedule" to "failed"? Is there a better way to start each task/sensor at a different time? The sensor is an improved version of FileSensor and checks a bunch of files or file patterns. My current workaround is to force every task to use mode='reschedule'.
Airflow version 1.10.12

How to setup Nagios Alerts for Apache Airflow Dags

Is it possible to set up Nagios alerts for Airflow DAGs?
If a DAG fails, I need to alert the respective groups.
You can add an on_failure_callback to any task, which will call an arbitrary failure-handling function. In that function you can then send an error notification to Nagios.
For example:
import logging

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(dag_id="failure_handling",
          schedule_interval='@daily')

def handle_failure(context):
    # first get useful fields to send to nagios/elsewhere
    dag_id = context['dag'].dag_id
    ds = context['ds']
    task_id = context['ti'].task_id
    # instead of printing these out - you can send these somewhere else
    logging.info("dag_id={}, ds={}, task_id={}".format(dag_id, ds, task_id))

def task_that_fails(**kwargs):
    raise Exception("failing test")

task_to_fail = PythonOperator(
    task_id='task_to_fail',
    python_callable=task_that_fails,
    provide_context=True,
    on_failure_callback=handle_failure,
    dag=dag)
If you run a test on this:
airflow test failure_handling task_to_fail 2018-08-10
You get the following in your log output:
INFO - dag_id=failure_handling, ds=2018-08-10, task_id=task_to_fail
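As a next step, the callback could forward those fields to Nagios instead of just logging them. A minimal sketch, assuming a passive-check HTTP endpoint (the URL, token, host and service names below are placeholders for however your Nagios installation ingests passive results, e.g. via NRDP):

import logging

import requests

NAGIOS_ENDPOINT = "https://nagios.example.com/submit-check"  # placeholder URL
NAGIOS_TOKEN = "changeme"  # placeholder credential

def handle_failure(context):
    dag_id = context['dag'].dag_id
    task_id = context['ti'].task_id
    ds = context['ds']
    payload = {
        "token": NAGIOS_TOKEN,
        "host": "airflow",                 # hypothetical host name in Nagios
        "service": "{}.{}".format(dag_id, task_id),  # hypothetical service name
        "state": 2,                        # 2 = CRITICAL in Nagios conventions
        "output": "Airflow task failed on {}".format(ds),
    }
    try:
        requests.post(NAGIOS_ENDPOINT, json=payload, timeout=10)
    except requests.RequestException:
        # Never let the alerting call break the callback itself.
        logging.exception("Failed to notify Nagios")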
