Airflow - skip DAG if sensor failed

I have a process where I'm waiting for a file every week, but the file name is timestamped with the measurement date. So I know I'm going to get something this week, and the name can be anything from 2020-05-25*.csv up to 2020-05-31*.csv.
The only way I have found to start my process with Airflow is to run a sensor at the start of each @daily run and use the execution date to check whether the file is there.
The thing is, since I don't know on which day the file will be uploaded, I end up with 6 failed sensors, so 6 failed DAG runs, and 1 successful one.
SFTP sensor example:
with DAG(
    "geometrie-sftp-to-safe",
    default_args=default_args,
    schedule_interval="@daily",
    catchup=True,
) as dag:
    starting_sensor = DummyOperator(
        task_id="starting_sensor"
    )

    sensor_sftp_A = SFTPSensor(
        task_id="sensor_sftp_A",
        path="/input/geometrie/prod/Track_Geometry-{{ ds_nodash }}_A.csv",
        sftp_conn_id="ssh_ftp_landing",
        poke_interval=60,
        soft_fail=True,
        mode="reschedule"
    )
Second, with the GCS sensor:
with DAG(
    "geometrie-preprocessing",
    default_args=default_args,
    schedule_interval="@daily",
    catchup=True
) as dag:
    # File A
    sensor_gcs_A = GoogleCloudStorageObjectSensor(
        task_id="gcs-sensor_A",
        bucket="lisea-mesea-sea-cloud-safe",
        object="geometrie/original/track_geometry_{{ ds_nodash }}_A.csv",
        google_cloud_conn_id="gcp_conn",
        poke_interval=50
    )
That's why I would like the DAG runs to be marked as skipped if, and only if, the sensor has failed. If it is anything else I would like a real failure.

Airflow has multiple sensors which sense a directory to check for the defined file. Setting schedule_interval to None will suit your use case, since you want the DAG to trigger only when the file is received (considering that the file can arrive at any time within the week).
The example below uses a GCS sensor to watch the bucket for that particular type of file and prints the filename. I am pretty sure the SFTP sensor works the same way.
dag = DAG(
    dag_id='sensing-bucket',
    schedule_interval=None,
    default_args=args)

def new_file_detection(**context):
    value = context['ti'].xcom_pull(task_ids='list_Files')
    print('value is : ' + str(value))

File_sensor = GoogleCloudStoragePrefixSensor(
    task_id='gcs_polling',
    bucket='lisea-mesea-sea-cloud-safe',
    prefix='geometrie/original/track_geometry_',
    dag=dag
)

GCS_File_list = GoogleCloudStorageListOperator(
    task_id='list_Files',
    bucket='lisea-mesea-sea-cloud-safe',
    prefix='geometrie/original/track_geometry_',
    delimiter='.csv',
    google_cloud_storage_conn_id='google_cloud_default',
    dag=dag
)

File_detection = PythonOperator(
    task_id='print_detected_filename',
    provide_context=True,
    python_callable=new_file_detection,
    dag=dag
)

File_sensor >> GCS_File_list >> File_detection
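If the goal is still to have the run end up skipped rather than failed whenever no file arrives, the sensor's soft_fail flag is worth noting: on timeout a soft-failing sensor is marked SKIPPED instead of FAILED, downstream tasks with the default all_success trigger rule are skipped along with it, and genuine errors (for example a broken connection) still fail the task. A minimal sketch, reusing the SFTP sensor from the question with an assumed 12-hour cut-off:

from datetime import timedelta

from airflow.contrib.sensors.sftp_sensor import SFTPSensor  # Airflow 1.10.x import path

sensor_sftp_A = SFTPSensor(
    task_id="sensor_sftp_A",
    path="/input/geometrie/prod/Track_Geometry-{{ ds_nodash }}_A.csv",
    sftp_conn_id="ssh_ftp_landing",
    poke_interval=60,
    timeout=60 * 60 * 12,  # assumed cut-off: stop poking after 12 hours
    soft_fail=True,        # on timeout the sensor is marked SKIPPED, not FAILED
    mode="reschedule",     # release the worker slot between pokes
)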

Related

How to run a specific dag first in airflow?

I'm using Apache Airflow (2.3.1) to load data into a database. I have more than 150 DAGs, and I need some of them to run first; how can I do this?
The DAGs all kick off at 3 am and then run in a random order as they come out of the queue.
I read about priority_weight and weight_rule, but these only apply to tasks, not to a DAG as a whole.
As I said, the DAG queue is built randomly, and I would like to control it and hard-code which DAG should be executed first.
You can use an ExternalTaskSensor to define cross-DAG dependencies.
In particular, it allows you to wait for an external task (i.e. one on a different DAG) or a whole DAG to complete before proceeding. You can configure the dag_id and task_id to wait for, and a time-delta for the execution_date (by default, it expects the external DAG run to have the same execution date as the current one).
Full details and possible configurations are available in the official documentation: Cross-DAG Dependencies.
Example usage
Task to be executed first
with DAG(
    dag_id='first_dag',
    start_date=datetime(2022, 1, 1),
    schedule_interval='0 0 * * *'
) as first_dag:
    first_task = DummyOperator(task_id='first_task')
Task to be executed later
with DAG(
    dag_id='second_dag',
    start_date=datetime(2022, 1, 1),
    schedule_interval='0 0 * * *'
) as second_dag:
    first_task_sensor = ExternalTaskSensor(
        task_id='first_task_sensor',
        external_dag_id='first_dag',
        external_task_id='first_task',
        timeout=600,
        allowed_states=['success'],
        failed_states=['failed', 'skipped'],
        mode='reschedule'
    )

    second_task = DummyOperator(task_id='second_task')

    first_task_sensor >> second_task
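One detail to watch: ExternalTaskSensor matches the external DAG run by execution date, so if the two DAGs are not on exactly the same schedule the sensor will never find the run it is waiting for. The execution_delta parameter shifts that lookup; a sketch assuming first_dag runs one hour earlier than second_dag:

from datetime import timedelta

first_task_sensor = ExternalTaskSensor(
    task_id='first_task_sensor',
    external_dag_id='first_dag',
    external_task_id='first_task',
    execution_delta=timedelta(hours=1),  # look for the external run one hour before our own execution date
    timeout=600,
    allowed_states=['success'],
    failed_states=['failed', 'skipped'],
    mode='reschedule'
)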

DAG failure during the initial run

I have two DAGs:
DAG_A, DAG_B.
DAG_A triggers DAG_B through TriggerDagRunOperator.
My tasks in DAG_B:
with DAG(
    dag_id='DAG_B',
    default_args=default_args,
    schedule_interval='@once',
    description='ETL pipeline for processing users'
) as dag:

    start = DummyOperator(
        task_id='start')

    delete_xcom_task = PostgresOperator(
        task_id='clean_up_xcom',
        postgres_conn_id='postgres_default',
        sql="delete from xcom where dag_id='DAG_A' and task_id='TASK_A' ")

    end = DummyOperator(
        task_id='end')
        # trigger_rule='none_failed')

    # num_table is set by DAG_A. Will have an empty list initially.
    iterable_string = Variable.get("num_table", default_var="[]")
    iterable_list = ast.literal_eval(iterable_string)

    for index, table in enumerate(iterable_list):
        table = table.strip()

        read_src1 = PythonOperator(
            task_id=f'Read_Source_data_{table}',
            python_callable=read_src,
            op_kwargs={'index': index}
        )

        upload_file_to_directory_bulk1 = PythonOperator(
            task_id=f'ADLS_Loading_{table}',
            python_callable=upload_file_to_directory_bulk,
            op_kwargs={'index': index}
        )

        write_Snowflake1 = PythonOperator(
            task_id=f'Snowflake_Staging_{table}',
            python_callable=write_Snowflake,
            op_kwargs={'index': index}
        )

        task_sf_storedproc1 = DummyOperator(
            task_id=f'Snowflake_Processing_{table}'
        )

        start >> read_src1 >> upload_file_to_directory_bulk1 >> write_Snowflake1 >> task_sf_storedproc1 >> delete_xcom_task >> end
After executing airflow db init and bringing the webserver and scheduler up, DAG_B fails with a failure in the delete_xcom_task task.
[2021-06-22 08:04:43,647] {taskinstance.py:871} INFO - Dependencies not met for <TaskInstance: Target_DIF.clean_up_xcom 2021-06-22T08:04:27.861718+00:00 [queued]>, dependency 'Trigger Rule' FAILED: Task's trigger rule 'all_success' requires all upstream tasks to have succeeded, but found 2 non-success(es). upstream_tasks_state={'total': 2, 'successes': 0, 'skipped': 0, 'failed': 0, 'upstream_failed': 0, 'done': 0}, upstream_task_ids={'Snowflake_Processing_products', 'Snowflake_Processing_inventories'}
[2021-06-22 08:04:43,651] {local_task_job.py:93} INFO - Task is not able to be run
But both DAGs succeed from the second run onwards.
Can anyone explain to me what is happening internally?
How can I avoid the failure during the first run?
Thanks.
I suspect that the problem is schedule_interval='@once' for DAG_B: when you add the DAG for the first time, the schedule_interval tells the scheduler to run the DAG once. So DAG_B is triggered once by the scheduler and not by DAG_A. Any preparation that needs to be done by DAG_A for DAG_B to run successfully has not been done yet, therefore DAG_B fails.
Later on, DAG_A runs as scheduled and triggers DAG_B as expected. Both succeed.
To avoid DAG_B being triggered by the scheduler, set schedule_interval=None.
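A minimal sketch of that change, keeping the rest of DAG_B exactly as in the question:

with DAG(
    dag_id='DAG_B',
    default_args=default_args,
    schedule_interval=None,  # never picked up by the scheduler; only runs when DAG_A triggers it
    description='ETL pipeline for processing users'
) as dag:
    # ... tasks unchanged ...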

How to define a timeout for Apache Airflow DAGs?

I'm using Airflow 1.10.2 but Airflow seems to ignore the timeout I've set for the DAG.
I'm setting a timeout period for the DAG using the dagrun_timeout parameter (e.g. 20 seconds) and I've got a task which takes 2 mins to run, but Airflow marks the DAG as successful!
args = {
    'owner': 'me',
    'start_date': airflow.utils.dates.days_ago(2),
    'provide_context': True,
}

dag = DAG(
    'test_timeout',
    schedule_interval=None,
    default_args=args,
    dagrun_timeout=timedelta(seconds=20),
)

def this_passes(**kwargs):
    return

def this_passes_with_delay(**kwargs):
    time.sleep(120)
    return

would_succeed = PythonOperator(
    task_id='would_succeed',
    dag=dag,
    python_callable=this_passes,
    email=to,
)

would_succeed_with_delay = PythonOperator(
    task_id='would_succeed_with_delay',
    dag=dag,
    python_callable=this_passes_with_delay,
    email=to,
)

would_succeed >> would_succeed_with_delay
No error messages are thrown. Am I using an incorrect parameter?
As stated in the source code:
:param dagrun_timeout: specify how long a DagRun should be up before
timing out / failing, so that new DagRuns can be created. The timeout
is only enforced for scheduled DagRuns, and only once the
# of active DagRuns == max_active_runs.
so this might be expected behavior as you set schedule_interval=None. Here, the idea is rather to make sure a scheduled DAG won't last forever and block subsequent run instances.
Now, you may be interested in the execution_timeout available in all operators.
For example, you could set a 60s timeout on your PythonOperator like this:
would_succeed_with_delay = PythonOperator(
    task_id='would_succeed_with_delay',
    dag=dag,
    execution_timeout=timedelta(seconds=60),
    python_callable=this_passes_with_delay,
    email=to,
)
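If the limit should apply to every task rather than a single operator, execution_timeout can also go into default_args, since any BaseOperator argument set there is passed on to all tasks in the DAG; when the limit is exceeded the task is failed (and retried, if retries are configured). A short sketch based on the DAG above:

from datetime import timedelta

args = {
    'owner': 'me',
    'start_date': airflow.utils.dates.days_ago(2),
    'provide_context': True,
    'execution_timeout': timedelta(seconds=60),  # applies to every task unless overridden on the task itself
}

dag = DAG(
    'test_timeout',
    schedule_interval=None,
    default_args=args,
)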

Is there a way to configure different 'retries' for tasks in the same DAG

I have a DAG with many sub-tasks in it. In the middle of the DAG there is a validation task, and based on the result/return code from that task I want to take two different paths. On success, one route (a sequence of tasks) should be followed, and on failure we would like to execute a different set of tasks. There are two problems with the current approach: first, the validation task executes many times (as many as the configured retries) if the exit code is 1; second, there is currently no way to take different branches of execution.
To solve problem 1, we can use the retry number that is available from the task instance via the macro {{ task_instance }}. We would appreciate it if someone could point us to a cleaner approach, and problem 2, taking different paths, remains unsolved.
You can have retries at the task level.
run_this = BashOperator(
    task_id='run_after_loop',
    bash_command='echo 1',
    retries=3,
    dag=dag,
)

run_this_last = DummyOperator(
    task_id='run_this_last',
    retries=1,
    dag=dag,
)
Regarding your 2nd problem, there is a concept of Branching.
The BranchPythonOperator is much like the PythonOperator except that it expects a python_callable that returns a task_id (or list of task_ids). The task_id returned is followed, and all of the other paths are skipped. The task_id returned by the Python function has to be referencing a task directly downstream from the BranchPythonOperator task.
Example DAG:
import random

import airflow
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import BranchPythonOperator

args = {
    'owner': 'airflow',
    'start_date': airflow.utils.dates.days_ago(2),
}

dag = DAG(
    dag_id='example_branch_operator',
    default_args=args,
    schedule_interval="@daily",
)

run_this_first = DummyOperator(
    task_id='run_this_first',
    dag=dag,
)

options = ['branch_a', 'branch_b', 'branch_c', 'branch_d']

branching = BranchPythonOperator(
    task_id='branching',
    python_callable=lambda: random.choice(options),
    dag=dag,
)

run_this_first >> branching

join = DummyOperator(
    task_id='join',
    trigger_rule='one_success',
    dag=dag,
)

for option in options:
    t = DummyOperator(
        task_id=option,
        dag=dag,
    )

    dummy_follow = DummyOperator(
        task_id='follow_' + option,
        dag=dag,
    )

    branching >> t >> dummy_follow >> join
Regarding your first problem, you can set task/operator-specific retry options quite easily. Reference: baseoperator.py#L77.
Problem two, you can branch within a DAG easily with BranchPythonOperator (Example Usage: example_branch_operator.py). You will want to nest your validation task/logic within the BranchPythonOperator (You can define and execute operators within operators).
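A rough sketch of what nesting the validation inside the branch could look like; run_validation(), success_path_start and failure_path_start are placeholders for your own check and downstream tasks:

from airflow.operators.python_operator import BranchPythonOperator

def choose_path(**context):
    # Hypothetical helper: replace with your real validation / return-code check.
    validation_ok = run_validation()
    return 'success_path_start' if validation_ok else 'failure_path_start'

validation_branch = BranchPythonOperator(
    task_id='validation_branch',
    python_callable=choose_path,
    provide_context=True,
    retries=0,  # an expected non-zero exit should branch, not trigger retries
    dag=dag,
)

# Both targets must be direct downstream tasks of the branching operator.
validation_branch >> [success_path_start, failure_path_start]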

How to run an external DAG as part of my DAG?

I'm new to Airflow and I'm trying to run an external DAG (developed and owned by another team) as part of my DAG flow.
I was looking at SubDagOperator, but it seems that it enforces the name of the subdag to be {parent_dag_id}.{this_task_id}, which I cannot satisfy because the child DAG is owned by a different team.
Here is my code sample:
parent_dag = DAG(
    dag_id='parent_dag', default_args=args,
    schedule_interval=None)

external_dag = SubDagOperator(
    subdag=another_teams_dag,
    task_id='external_dag',
    dag=parent_dag,
    trigger_rule=TriggerRule.ALL_DONE
)

and the other team's DAG is defined like this:

another_teams_dag = DAG(
    dag_id='another_teams_dag', default_args=args,
    schedule_interval=None)
but I'm getting this error:
The subdag's dag_id should have the form
'{parent_dag_id}.{this_task_id}'. Expected 'parent_dag.external_dag';
received 'another_teams_dag'.
Any ideas?
What am I missing?
Use TriggerDagRunOperator
More info: https://airflow.apache.org/code.html#airflow.operators.dagrun_operator.TriggerDagRunOperator
Example:
Dag that triggers: https://github.com/apache/incubator-airflow/blob/master/airflow/example_dags/example_trigger_controller_dag.py
Dag that is triggered: https://github.com/apache/incubator-airflow/blob/master/airflow/example_dags/example_trigger_target_dag.py
For your case, you can use something like:
trigger = TriggerDagRunOperator(
    task_id='external_dag',
    trigger_dag_id="another_teams_dag",
    dag=dag)
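Since the intent is to run the other team's DAG as part of the parent flow, note that newer Airflow versions (2.1+) let the triggering task also wait for the triggered run to finish; a sketch assuming such a version:

from airflow.operators.trigger_dagrun import TriggerDagRunOperator  # Airflow 2.x import path

trigger = TriggerDagRunOperator(
    task_id='external_dag',
    trigger_dag_id='another_teams_dag',
    wait_for_completion=True,  # block until the triggered DAG run reaches a terminal state
    poke_interval=60,          # how often to check the triggered run's state
    dag=dag,
)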
