Airflow: Is it possible to configure a task-level timeout in a DAG?
I want to prevent a task from running indefinitely. As I understand it, the sla parameter only comes into play once the task has completed and overshot the SLA.
For timeouts on Operators in Airflow, you can add the execution_timeout parameter. From the docs:
execution_timeout (datetime.timedelta) – max time allowed for the execution of this task instance, if it goes beyond it will raise and fail
It expects a datetime.timedelta, e.g. timedelta(hours=1) for a max of 1 hour for the task.
Note that execution_timeout does not work for sensors; sensors expect a timeout parameter instead.
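For example, here is a rough, untested sketch assuming Airflow 2.x; the DAG id, task ids, command and file path are placeholders:
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.filesystem import FileSensor

with DAG(dag_id='timeout_example', start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    # Regular operator: the task is failed if it runs for more than one hour.
    long_task = BashOperator(
        task_id='long_task',
        bash_command='sleep 30',
        execution_timeout=timedelta(hours=1))

    # Sensor: the sensor-specific timeout (in seconds) caps the total waiting time.
    wait_for_file = FileSensor(
        task_id='wait_for_file',
        filepath='/tmp/trigger_file',
        poke_interval=60,
        timeout=60 * 60)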
Related
I would like to trigger an Airflow DAG based on SQS messages. I am quite new to Airflow, but this is how I think it should be done:
Option 1
Use the Airflow SQS Sensor. From my understanding, this waits on SQS messages to proceed with the execution of an already triggered DAG. Does this mean a DAG would always need to be running and waiting for SQS messages to catch any eventual new messages and process them? Does this also mean I should schedule my DAG on a very short interval so that when an SQS message gets handled by one DAG run, another run is created to handle the next SQS messages?
Option 2
Add a lambda or something watching for SQS messages and using the Airflow API to trigger DAGs when needed.
Ultimately, I would like to minimise the number of interactions needed to trigger a DAG, so I would prefer an Airflow built-in way of watching SQS.
Thank you
Both options are valid; however, Option 2 is basically an alternative implementation of a sensor. I think the better solution is Option 1 with a small modification:
Use SQSSensor but with mode='reschedule'. That way, every once in a while the sensor "wakes up" and checks whether the criterion is met. Note that this is not like sleep(x): when the criterion isn't met, Airflow releases the worker for other tasks that need to run and returns the SQSSensor to the scheduling queue.
You can read more about the sensor modes in the docs.
from airflow.providers.amazon.aws.sensors.sqs import SQSSensor

SQSSensor(
    task_id='test_task',
    dag=dag,
    sqs_queue='your_queue',
    aws_conn_id='aws_default',
    mode='reschedule')
Note that the sensor will run indefinitely until the criterion is met. You can set a timeout on the sensor task (there are other possible sources of timeouts, like cluster policies and other defaults, but that is another topic).
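For example, capping the sensor above at six hours of waiting (the timeout is expressed in seconds):
SQSSensor(
    task_id='test_task',
    dag=dag,
    sqs_queue='your_queue',
    aws_conn_id='aws_default',
    mode='reschedule',
    timeout=6 * 60 * 60)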
I am facing a scenario in Apache Airflow where the requirement is to stop a DAG's execution on an SLA miss and proceed with the subsequent DAG interval's execution.
Is such functionality configurable in Airflow?
You can create a lightweight DAG that runs every X (the smallest practical interval in your Airflow cluster), checks for SLA misses, and pauses the monitored DAG if there are any.
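A rough, untested sketch of that idea, assuming Airflow 2.x; the monitored DAG id, the watchdog's dag_id and the schedule are placeholders, and pausing via DagModel is just one way to do it:
from datetime import datetime, timedelta
from airflow import DAG
from airflow.models import DagModel, SlaMiss
from airflow.operators.python import PythonOperator
from airflow.utils.session import provide_session

MONITORED_DAG_ID = 'my_monitored_dag'  # placeholder: the DAG to pause on an SLA miss

@provide_session
def pause_on_sla_miss(session=None):
    # Check whether any SLA miss has been recorded for the monitored DAG.
    misses = session.query(SlaMiss).filter(SlaMiss.dag_id == MONITORED_DAG_ID).count()
    if misses:
        # Pause the DAG so the scheduler stops creating new runs for it.
        dag_model = DagModel.get_dagmodel(MONITORED_DAG_ID, session=session)
        if dag_model:
            dag_model.set_is_paused(is_paused=True)

with DAG(
        dag_id='sla_watchdog',
        start_date=datetime(2021, 1, 1),
        schedule_interval=timedelta(minutes=5),  # the smallest interval that makes sense in your cluster
        catchup=False) as dag:
    PythonOperator(task_id='pause_on_sla_miss', python_callable=pause_on_sla_miss)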
My DAG is scheduled to run daily at 7 AM. Can I specify a time of day for the execution timeout parameter instead of a duration?
For example, I want to specify 12 PM so that the job fails if it is still running at 12 PM.
Such a parameter is not present in BaseOperator or DAG.
You'll have to build it. Here are some hints on how you could go about it (not certain this would work); a rough sketch follows the list below.
Write a custom TimeSensor (not to be confused with TimeDeltaSensor) by subclassing it, so that it kills the DAG run when it fires.
You'll have to override the execute() method.
For killing the run, you can look into the _mark_dagrun_state_as_failed() method.
With the specified datetime timeout, add that custom sensor task as one of the starting tasks (tasks that don't have an upstream task) of your DAG.
In case you have to time out specific task(s) instead of the entire DAG:
You can write another custom TimeSensor that marks a specific task as failed upon timing out.
You can use the _mark_task_instance_state() method for it.
You can wire up this custom TimeSensor with that task in parallel (so that both the task and its sensor launch together).
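A rough, untested sketch of the first idea, assuming Airflow 2.x; the class name is made up, and it uses the public dag_run.set_state() instead of the private _mark_dagrun_state_as_failed() helper:
from airflow.exceptions import AirflowFailException
from airflow.sensors.time_sensor import TimeSensor
from airflow.utils.session import provide_session
from airflow.utils.state import State

class DagRunDeadlineSensor(TimeSensor):
    """Hypothetical sensor: wait until target_time, then fail the whole DAG run if it is still running."""

    @provide_session
    def execute(self, context, session=None):
        # TimeSensor.execute() blocks (or reschedules) until target_time is reached.
        super().execute(context)
        dag_run = context['dag_run']
        if dag_run.state == State.RUNNING:
            # Mark the run as failed so its remaining tasks stop being scheduled.
            dag_run.set_state(State.FAILED)
            session.merge(dag_run)
        raise AirflowFailException(f'Deadline {self.target_time} reached for this run.')
You would instantiate it with, for example, target_time=time(hour=12) and no upstream tasks, so it starts together with the rest of the DAG.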
I have a DAG which takes a very long time to do a BigQuery operation, and I always get the error 'Broken DAG: [/home/airflow/gcs/dags/xyz.py] Timeout'.
I found some answers saying that we have to increase the timeout in airflow.cfg, but that idea is not suitable for my project. Is it possible to somehow increase the timeout for a particular DAG? Any help is appreciated. Thank you.
Yes, you can set the dagrun_timeout parameter on the DAG. From the docs:
Specify how long a DagRun should be up before timing out / failing, so that new DagRuns can be created. The timeout is only enforced for scheduled DagRuns, and only once the # of active DagRuns == max_active_runs.
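For example (dag id and duration are placeholders), failing any run that is still going after two hours so the next scheduled run can start:
from datetime import datetime, timedelta
from airflow import DAG

dag = DAG(
    dag_id='long_bigquery_dag',
    start_date=datetime(2021, 1, 1),
    schedule_interval='@daily',
    dagrun_timeout=timedelta(hours=2))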
There is also an execution_timeout parameter on each task that you can set:
execution_timeout: max time allowed for the execution of
this task instance, if it goes beyond it will raise and fail.
:type execution_timeout: datetime.timedelta
So if one of the tasks is running a query on BigQuery, you can use something like:
import datetime

from airflow.contrib.operators.bigquery_operator import BigQueryOperator

BigQueryOperator(
    task_id='bq_query',
    sql=sql,
    destination_dataset_table='{{ params.t_name }}',
    bigquery_conn_id='my_bq_connection',
    use_legacy_sql=False,
    write_disposition='WRITE_TRUNCATE',
    create_disposition='CREATE_IF_NEEDED',
    params={'t_name': table_name},  # referenced by the Jinja template above
    execution_timeout=datetime.timedelta(minutes=10),
    dag=dag)
Is there a way to specify that a task can only run once concurrently? So in the tree above, where DAG concurrency is 4, Airflow will start task 4 instead of a second instance of task 2?
This DAG is a little special because there is no order between the tasks. The tasks are independent but related in purpose, and are therefore kept in one DAG so as not to create an excessive number of single-task DAGs.
max_active_runs is 2 and dag_concurrency is 4. I would like it to start all 4 tasks and only start a task in the next run if the same task in the previous run is done.
I may have misunderstood your question, but I believe you want all the tasks in a single DAG run to finish before the tasks begin in the next DAG run, so a DAG will only execute once the previous execution is complete.
If that is the case, you can make use of the max_active_runs parameter of the dag to limit how many running concurrent instances of a DAG there are allowed to be.
More information here (refer to the last dotpoint): https://airflow.apache.org/faq.html#why-isn-t-my-task-getting-scheduled
max_active_runs defines how many running concurrent instances of a DAG there are allowed to be.
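For example (placeholder dag id), allowing only one active run of the DAG at a time:
from datetime import datetime
from airflow import DAG

dag = DAG(
    dag_id='independent_tasks',
    start_date=datetime(2021, 1, 1),
    schedule_interval='@daily',
    max_active_runs=1)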
The Airflow operator documentation describes the task_concurrency argument. Just set it to one.
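For example (operator, task id and command are placeholders, and dag is assumed to be an existing DAG object):
from airflow.operators.bash import BashOperator

task_2 = BashOperator(
    task_id='task_2',
    bash_command='run_job.sh ',
    task_concurrency=1,  # at most one instance of this task across all running DAG runs
    dag=dag)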
From the official docs for trigger rules:
depends_on_past (boolean) when set to True, keeps a task from getting triggered if the previous schedule for the task hasn’t succeeded.
So a task's future runs will wait for its previous run to finish successfully before executing.
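For example (placeholder dag id), applying depends_on_past to every task in the DAG via default_args:
from datetime import datetime
from airflow import DAG

dag = DAG(
    dag_id='independent_tasks',
    start_date=datetime(2021, 1, 1),
    schedule_interval='@daily',
    default_args={'depends_on_past': True})  # each task waits for its own previous run to succeed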
In airflow.cfg, under [core], you will find:
dag_concurrency = 16
# The number of task instances allowed to run concurrently by the scheduler
You're free to change this to whatever you need.