I am running 5 PythonOperator tasks in my airflow DAG and one of them is performing an ETL job which is taking a long time, due to which all my resources are blocked. Is there a way I can set a max execution time per task, after which the task either fails or is marked successful (so that the DAG doesnt fail) with a message?
In every operator we have an execution_timeout variable where you have to pass a datetime.timedelta object.
As per the base operator code comments:
:param execution_timeout: max time allowed for the execution of
this task instance, if it goes beyond it will raise and fail.
:type execution_timeout: datetime.timedelta
Also bear in mind that this will fail a single run of the DAG and will trigger re-runs and will only be declared to be a failed DAG when all re-runs have failed.
So depending on what number of auto retries you have assigned, you could have a potential maximum time of ( number of retries ) x ( timeout ) in case the code keeps taking too long.
Check out this previous answer.
In short, using airflow's built in pools or even specifying a start_date for a task (instead of an entire DAG) seem to be potential solutions.
From this documentation, you'd want to set the execution_timeout task parameter, which would look something like this
from datetime import timedelta
sensor = SFTPSensor(
task_id="sensor",
path="/root/test",
execution_timeout=timedelta(hours=2),
timeout=3600,
retries=2,
mode="reschedule",
)
Related
I have a DAG that looks like this:
dag1:
start >> clean >> end
Then I have a global Airflow variable "STATUS". Before running the clean step, I want to check if the "STATUS" variable is true or not. If it is true, then I want to proceed to the "clean" task. Or else, I want to stay in a waiting state until the global variable "STATUS" turns to true.
Something like this:
start >> wait_for_dag2 >> clean >> end
How can I achieve this?
Alternatively, if waiting is not possible, is there any way to trigger the dag1 whenever the global variable is set to true? Instead of giving a set schedule criteria
You can use a PythonSensor that call a python function that check the variable and return true/false.
There are 3 methods you can use:
use TriggerDagRunOperator as #azs suggested. Though, the problem with this approach is that is kind of contradicts with the "O"(Open to extend close to modify) in the "SOLID" concept.
put the variable inside a file and use data-aware escheduling which was introduced in Airflow 2.4. However, its a new functionality at the time of this answer and it may be changed in future. data_aware_scheduling
check the last status of the dag2 ( the previous dag). This is also has a flaw which may accur rarely but can not be excluded completely; and it is what if right after chacking the status the dag starts to run!?:
from airflow.models.dagrun import DagRun
from airflow.utils.state import DagRunState
dag_runs = DagRun.find(dag_id='the_dag_id_of_dag2')
last_run = dag_runs[-1]
if last_run.state == DagRunState.SUCCESS:
print('the dag run was successfull!')
else:
print('the dag state is -->: ', last_run.state)
After all it depends on you and your business constraint to choose among these methods.
When I trigger a DAG manually, prev_execution_date and execution_date are the same.
echo_exec_date = BashOperator(
task_id='bash_script',
bash_command='echo "prev_exec_date={{ prev_execution_date }} execution_date={{ execution_date }}"',
dag=dag)
results in:
prev_exec_date=2022-06-29T08:50:37.506898+00:00 execution_date=2022-06-29T08:50:37.506898+00:00
They are different if the DAG is triggered automatically by the scheduler.
I would like to have prev_execution_date regardless of triggering it manually or automatically.
When manually triggering DAG, the schedule will be ignored, and prev_execution_date == next_execution_date == execution_date
This is explained in the Airflow docs
This is because previous / next of manual run is not something that is well defined. Consider you have a daily schedule (say at 00:00) and you invoke a manual run on 13:00. What is the expected next schedule? should it be daily from 00:00 or daily from 13:00? a DagRun can have only 1 prev and only 1 next. In your senario it seems like you are interested in a case where there can be more than 1 or that the manual run "comes between" the two scheduled runs. This is not something that Airflow supports - It really over complicate things.
If you want to workaround it you can create custom macro that checks the run_type, searches the specific DagRun that you consider as previous and return it's execution_date. Be noted that it might create some side effects (overlapping data interval process etc..) you need to really verify that the logic you implement make sense for your specific use case.
I have a following scenario/DAG;
|----->Task1----| |---->Task3---|
start task-->| |-->Merge Task --->| | ----->End Task
|----->Task2----| |---->Task4---|
Currently the Task, Task2, Task3 and Task4 are ShortCircuitOperators, When one of the Task1 and Task2 are ShortCircuted all the downstream tasks are skipped.
But my requirement is to break the skipped state being propagated to Task3 and Task4 at Merge Task.
Cause I want the Task 3 and Task 4 to be run no matter what happens upstream.
Is there a way I can achieve this.? I want to have the dependencies in place as depicted/showed in the DAG.
Yes it can be achieved
Instead of using ShortCircuitOperator, use AirflowSkipException (inside a PythonOperator) to skip a task (that is conditionally executing tasks / branches)
You might be able to achieve the same thing using a BranchPythonOperator
but ShortCircuitOperator definitely doesn't behave as per most people's expectations. Citing this line closely resembling your problem from this link
... When one of the upstreams gets skipped by ShortCircuitOperator
this task gets skipped as well. I don't want final task to get skipped
as it has to report on DAG success.
To avoid it getting skipped I used trigger_rule='all_done', but it
still gets skipped.
If I use BranchPythonOperator instead of ShortCircuitOperator final
task doesn't get skipped. ...
Furthermore the docs do warn us about it (this is really the expected behaviour of ShortCircuitOperator)
It evaluates a condition and short-circuits the workflow if the condition is False. Any downstream tasks are marked with a state
of “skipped”.
And for tasks downstream of your (possibly) skipped tasks, use different trigger_rules
So instead of default all_success, use something like none_failed or all_done (depending on your requirements)
Requirement:
Get the date value in the format of YYYYMMDDHHMMSS
Code:
TS_HOURS_NODASH = "{{ execution_date.strftime('%Y%m%d%H%M%S') }}"
Output
20200721000000
Expected: Actual hour/minute/seconds
It depends on that you need:
execution_date - it is a time when your dag expected to run. In case your dag run on a #daily basis your time will be exactly 00:00:00
ti.start_date - it is a time when your task instance actually started.
I have achieved using pendulum
pendulum.now().format('%Y%m%d%H%M%S')
execution_date is calculated according to schedule interval, execution_date of all task instances related to the dag run is the same, and it is not the actual datetime that a task is run.
if you just want to get the actual start time of the task, why not get the system time at the beginning of your task, although it is slightly later than airflow task's start_time, it is much easier.
if you insist on the start_time of airflow, it needs to do some change on the operator, and it is another story.
usually, it is better to use execution_date as suffix of a file, it is stable and will not change after the task instance is generated, the actual start time of a task depends on the upstream tasks, retry also change start time, and it will also change if your clear some task instances and re-run them.
I have a pipeline I want to run everyday, but I would like the execution date to lag. That is, on day X I want the execution date to be X-3. Is something like that possible?
It looks like you are using execution_date as a variable in your pipeline logic. For example, to process the data that is 3 days older than the execution_date. So, instead of making execution_date to lag by 3 days you can subtract the lag from execution_date and use the result in you pipeline logic. Airflow provides a number of ways to do it:
Templates: {{ execution_date - macros.timedelta(days=3) }}. So, for example, the bash_command parameter of BashOperator can be bash_command='echo Processing date: {{ execution_date - macros.timedelta(days=3) }} '
The PythonOperator's python callable: Define the callable something like def func(execution_date, **kwargs): ... and set the PythonOperator's parameter provide_context=True. The execution_date parameter of func() will be set to the current execution date (datetime object) on call. So, inside func() you can do processing_date = execution_date - timedelta(days=3).
The Sensors' context parameter: The poke() and execute() methods of any sensor have the context paramter that is a dict with all macros including execution_date. So, in these methods you can do processing_date = context['execution_date'] - timedelta(days=3).
Forcing execution date to have a lag simply does not feel right. Because, according to the Airflow's logic, the execution date of the currently running DAG normally can have lag only if it is catching up (bakcfilling).
You can use a TimeSensor to delay the execution of tasks in a DAG. I don't think you can change the actual execution_date unless you can describe the behavior as a cron.
If you want this to only apply this delay for a subset of scheduled DAG runs, you could use a BranchPythonOperator to first check if execution_date is one of those days you want the lag. If it is, then take the branch with the sensor. Otherwise, move along without it.
Alternatively, especially if you plan to have this behavior in more than one DAG, you can write a modified version of the sensor. It might look something like this:
def poke(self, context):
if should_delay(context['execution_date']):
self.log.info('Checking if the time (%s) has come', self.target_time)
return timezone.utcnow().time() > self.target_time
else:
self.log.info('Not one of those days, just run')
return True
You can reference the code for the existing time sensor in https://github.com/apache/incubator-airflow/blob/1.10.1/airflow/sensors/time_sensor.py#L38-L40.