I was learning Apache Airflow and found that there is an operator called DummyOperator. I googled its use case, but couldn't find anything I could understand. Can anyone here please discuss its use cases?
Operator that does literally nothing. It can be used to group tasks in
a DAG.
https://airflow.apache.org/_api/airflow/operators/dummy_operator/index.html
As far as I know, there are at least two use cases:
Test/placeholder purposes: in a DAG, a dummy task can sit between upstream and downstream tasks as a placeholder; later, you can replace it with the real operator.
Workflow purposes: BranchPythonOperator works together with DummyOperator. If you want to skip some tasks, keep in mind that you can't have an empty path; if you need one, make a dummy task.
https://airflow.apache.org/concepts.html#workflows
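A minimal sketch of the placeholder use, assuming Airflow 2.x (task names are illustrative; in newer Airflow versions DummyOperator has been renamed EmptyOperator):

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(
    "dummy_placeholder",
    start_date=datetime(2023, 1, 1),
    schedule=None,  # schedule_interval in Airflow < 2.4
) as dag:
    start = DummyOperator(task_id="start")
    # Stand-in for a task you haven't written yet; swap in the real operator later.
    extract = DummyOperator(task_id="extract")
    end = DummyOperator(task_id="end")

    start >> extract >> end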
DummyOperator is used with BranchPythonOperator, where we decide the next task based on some condition.
For example:
                  |----> task C ----> task D ----|
task A -> task B->|                              |-> task F
                  |----> task E (Dummy) ---------|
So let's suppose we have some condition in task B which decides whether to follow [task C -> task D] or task E (Dummy) to reach task F.
Since we cannot leave the other branch empty, we put a dummy operator there, which does nothing and simply lets the flow bypass to task F.
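Here is a minimal sketch of that branch, assuming Airflow 2.x (the condition inside task B is a hypothetical stand-in):

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import BranchPythonOperator

def choose_branch():
    condition = True  # hypothetical condition evaluated in task B
    return "task_c" if condition else "task_e"

with DAG("branch_with_dummy", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    task_a = DummyOperator(task_id="task_a")
    task_b = BranchPythonOperator(task_id="task_b", python_callable=choose_branch)
    task_c = DummyOperator(task_id="task_c")  # stands in for the real work
    task_d = DummyOperator(task_id="task_d")
    task_e = DummyOperator(task_id="task_e")  # the "empty" branch
    # task_f must run whichever branch was taken, so relax its trigger rule.
    task_f = DummyOperator(task_id="task_f", trigger_rule="none_failed")

    task_a >> task_b
    task_b >> task_c >> task_d >> task_f
    task_b >> task_e >> task_f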
Another use case: I've implemented a framework that returns an Operator. In most cases this is a PostgresOperator, but under some user-specified configuration there's no SQL to run. The caller still expects an Operator, so I return a DummyOperator rather than a PostgresOperator with trivial SQL like "select 1;".
I have a DAG that looks like this:
dag1:
start >> clean >> end
Then I have a global Airflow variable "STATUS". Before running the clean step, I want to check whether the "STATUS" variable is true. If it is, I want to proceed to the "clean" task. Otherwise, I want to stay in a waiting state until the global variable "STATUS" turns true.
Something like this:
start >> wait_for_dag2 >> clean >> end
How can I achieve this?
Alternatively, if waiting is not possible, is there any way to trigger dag1 whenever the global variable is set to true, instead of giving it a fixed schedule?
You can use a PythonSensor that calls a Python function which checks the variable and returns true/false.
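A minimal sketch of how that sensor might look, assuming "STATUS" is an Airflow Variable (the intervals are illustrative):

from airflow.models import Variable
from airflow.sensors.python import PythonSensor

def status_is_true():
    # Poke callback: return True to proceed, False to keep waiting.
    return Variable.get("STATUS", default_var="false").lower() == "true"

wait_for_dag2 = PythonSensor(
    task_id="wait_for_dag2",
    python_callable=status_is_true,
    poke_interval=60,        # re-check every minute
    timeout=6 * 60 * 60,     # fail after six hours of waiting
    mode="reschedule",       # free the worker slot between pokes
)

start >> wait_for_dag2 >> clean >> end  # tasks from the question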
There are 3 methods you can use:
1. Use TriggerDagRunOperator, as @azs suggested. The problem with this approach is that it kind of contradicts the "O" (open for extension, closed for modification) in the SOLID principles.
2. Put the variable inside a file and use data-aware scheduling, which was introduced in Airflow 2.4. However, it is new functionality at the time of this answer and may change in the future (a sketch follows this list): data_aware_scheduling
3. Check the last status of dag2 (the previous DAG). This also has a flaw which may occur rarely but cannot be excluded completely: what if the DAG starts to run right after you check its status?
from airflow.models.dagrun import DagRun
from airflow.utils.state import DagRunState

# Find all runs of dag2 and inspect the most recent one.
dag_runs = DagRun.find(dag_id='the_dag_id_of_dag2')
last_run = dag_runs[-1]

if last_run.state == DagRunState.SUCCESS:
    print('the dag run was successful!')
else:
    print('the dag state is -->: ', last_run.state)
In the end, it depends on you and your business constraints to choose among these methods.
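For method 2, a minimal sketch of data-aware scheduling, assuming Airflow 2.4+ (the dataset URI and DAG ids are illustrative):

from datetime import datetime
from airflow import DAG, Dataset
from airflow.operators.python import PythonOperator

status_dataset = Dataset("file:///tmp/status_flag")  # hypothetical URI

# Producer (dag2): listing the dataset as an outlet marks it updated on success.
with DAG("dag2", start_date=datetime(2023, 1, 1), schedule=None) as producer:
    set_status = PythonOperator(
        task_id="set_status",
        python_callable=lambda: None,  # write the flag file here
        outlets=[status_dataset],
    )

# Consumer (dag1): runs whenever the dataset is updated, no fixed schedule.
with DAG("dag1", start_date=datetime(2023, 1, 1), schedule=[status_dataset]) as consumer:
    clean = PythonOperator(task_id="clean", python_callable=lambda: None)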
Let's say I have some Airflow operator, and one of the arguments to the operator needs to take the value from the xcom. I've managed to do it in the following way -
f"model_id={{{{ ti.xcom_pull(task_ids='train', key='{task_id}')}}}}"
where model_id is the argument name to the DockerOperator that Airflow runs, and task_id is the name of the key for that value in the XCom.
Now I want to do something more complex: save a dictionary under task_id instead of a single value, and be able to pull values out of it somehow.
Is there a similar way to do it to the one I mentioned above? something like -
f"model_id={{{{ ti.xcom_pull(task_ids='train', key='{task_id}')}}}}[value]"
By default, all the template_fields are rendered as strings.
However, Airflow offers the option to render fields as native Python objects.
You will need to set up your DAG as:
dag = DAG(
...
render_template_as_native_obj=True,
)
You can see an example of how to render a dictionary in the docs.
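A minimal sketch of pulling a dict through a templated field with native rendering, assuming Airflow 2.1+ (task and key names are illustrative):

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    "native_xcom_example",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    render_template_as_native_obj=True,
) as dag:

    def train(**context):
        # Push a dict instead of a single value.
        context["ti"].xcom_push(key="model_info", value={"model_id": 42, "score": 0.9})

    def consume(model_info):
        # With native rendering, model_info arrives as a real dict, not a string.
        print(model_info["model_id"])

    train_task = PythonOperator(task_id="train", python_callable=train)
    consume_task = PythonOperator(
        task_id="consume",
        python_callable=consume,
        op_kwargs={"model_info": "{{ ti.xcom_pull(task_ids='train', key='model_info') }}"},
    )

    train_task >> consume_task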
My answer for a similar issue was this.
f"model_id={{{{ ti.xcom_pull(task_ids='train', key='{task_id}')[value]}}}}"
I have the following scenario/DAG:
             |----->Task1----|                  |---->Task3---|
start task-->|               |-->Merge Task --->|             |----->End Task
             |----->Task2----|                  |---->Task4---|
Currently Task1, Task2, Task3 and Task4 are ShortCircuitOperators. When one of Task1 and Task2 is short-circuited, all the downstream tasks are skipped.
But my requirement is to stop the skipped state from being propagated to Task3 and Task4 at the Merge Task, because I want Task3 and Task4 to run no matter what happens upstream.
Is there a way I can achieve this? I want to keep the dependencies in place as depicted in the DAG.
Yes, it can be achieved:
Instead of using ShortCircuitOperator, raise AirflowSkipException (inside a PythonOperator) to skip a task that is conditionally executing tasks/branches.
You might be able to achieve the same thing using a BranchPythonOperator
but ShortCircuitOperator definitely doesn't behave as per most people's expectations. Citing this excerpt, which closely resembles your problem, from this link:
... When one of the upstreams gets skipped by ShortCircuitOperator
this task gets skipped as well. I don't want final task to get skipped
as it has to report on DAG success.
To avoid it getting skipped I used trigger_rule='all_done', but it
still gets skipped.
If I use BranchPythonOperator instead of ShortCircuitOperator final
task doesn't get skipped. ...
Furthermore, the docs do warn us about it (this is really the expected behaviour of ShortCircuitOperator):
It evaluates a condition and short-circuits the workflow if the condition is False. Any downstream tasks are marked with a state
of “skipped”.
And for tasks downstream of your (possibly) skipped tasks, use different trigger_rules:
instead of the default all_success, use something like none_failed or all_done (depending on your requirements).
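Putting both pieces together, here is a minimal sketch, assuming Airflow 2.x (the skip condition is a hypothetical stand-in):

from datetime import datetime
from airflow import DAG
from airflow.exceptions import AirflowSkipException
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator

def maybe_skip():
    condition = False  # hypothetical condition
    if not condition:
        # Skips only this task, unlike ShortCircuitOperator which skips all downstream.
        raise AirflowSkipException("Condition not met; skipping this task only.")

with DAG("skip_without_propagation", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    task1 = PythonOperator(task_id="task1", python_callable=maybe_skip)
    task2 = PythonOperator(task_id="task2", python_callable=maybe_skip)

    # "none_failed" lets the merge task run even when an upstream task was skipped.
    merge = DummyOperator(task_id="merge", trigger_rule="none_failed")

    task3 = DummyOperator(task_id="task3")
    task4 = DummyOperator(task_id="task4")
    end = DummyOperator(task_id="end")

    [task1, task2] >> merge >> [task3, task4] >> end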
I have a python script that is called from BashOperator.
The script can return status 0 or 1.
I want to trigger the email only when the status is 1.
Note these statuses are not to be confused with Failure/Success. This is simply an indication that something was changed with the data and requires attention from the developer.
This is my operator:
t = BashOperator(
    task_id='import',
    bash_command="python /home/ubuntu/airflow/scripts/import.py",
    dag=dag,
)
I looked over the docs, but everything email-related addressed the On Failure case, which is irrelevant for me.
If you don't want to override an operator or anything fancy, you might be able to use Xcoms and the BranchPythonOperator
If your condition is based on a 0 or a 1, you can just push that value to XCom (set xcom_push to True on the BashOperator).
Then you can use the BranchPythonOperator to check that value and execute the appropriate task. You can find an example of the BranchPythonOperator and pulling from XCom in the Airflow example_dags.
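A minimal sketch of that approach, assuming Airflow 2.x and that the script prints its status as the last line of stdout rather than exiting non-zero (the email address is a placeholder):

from airflow.operators.bash import BashOperator
from airflow.operators.dummy import DummyOperator
from airflow.operators.email import EmailOperator
from airflow.operators.python import BranchPythonOperator

# BashOperator pushes the last line of stdout to XCom when do_xcom_push=True
# (the parameter was called xcom_push in Airflow 1.x).
imp = BashOperator(
    task_id="import",
    bash_command="python /home/ubuntu/airflow/scripts/import.py",
    do_xcom_push=True,
    dag=dag,
)

def choose(**context):
    status = context["ti"].xcom_pull(task_ids="import")
    return "notify" if status == "1" else "done"

branch = BranchPythonOperator(task_id="branch", python_callable=choose, dag=dag)

notify = EmailOperator(
    task_id="notify",
    to="dev@example.com",  # placeholder address
    subject="Import script reported status 1",
    html_content="Something changed with the data and requires attention.",
    dag=dag,
)

done = DummyOperator(task_id="done", dag=dag)

imp >> branch >> [notify, done]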
What's the proper way to interrupt a long chain of compose or pipe functions?
Let's say the chain doesn't need to run after the second function because it found an invalid value, and the next 5 functions shouldn't run as long as the user-submitted value is invalid.
Do you return an undefined/empty parameter, so that the rest of the functions just check whether a value was returned and, if not, keep passing the empty param along?
I don't think there is a generic way of dealing with that.
Often when working with algebraic data types, things are defined so that you can continue to run functions even when the data you would prefer is not present. This is one of the extremely useful features of types such as Maybe and Either for instance.
But most versions of compose or related functions don't give you an early escape mechanism. Ramda's certainly doesn't.
While you can't technically exit a pipeline, you can use pipe() within a pipeline, so there's no reason why you couldn't do something like below to 'exit' (return from a pipeline or kick into another):
const { pipe, propOr, ifElse, isNil, always } = require('ramda');

pipe(
  // ... your pipeline
  propOr(undefined, 'foo'),  // <- where your value may be missing
  ifElse(
    isNil,
    always(undefined),       // short-circuit: keep returning undefined
    pipe(
      // ... the rest of your pipeline
    )
  )
)