I have a python script that is called from BashOperator.
The script can return status 0 or 1.
I want to trigger an email only when the status is 1.
Note these statuses are not to be confused with Failure/Success. This is simply an indication that something was changed with the data and requires attention from the developer.
This is my operator:
t = BashOperator(task_id='import',
bash_command="python /home/ubuntu/airflow/scripts/import.py",
dag=dag)
I looked over the docs, but everything email-related addressed the "on failure" case, which is irrelevant in my case.
If you don't want to override an operator or anything fancy, you might be able to use XComs and the BranchPythonOperator.
If your condition is based on a 0 or a 1, you can just push that value to XCom (set xcom_push to True).
Then, you can use the BranchPythonOperator to check that value and execute the appropriate task. You can find an example of the BranchPythonOperator and pulling from XCom in the Airflow example_dags.
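A minimal sketch of how that could look, assuming Airflow 1.x-style imports, that the script prints its status (0 or 1) as the last line of stdout (a non-zero exit code would mark the task as failed), and a hypothetical dev@example.com recipient:

from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import BranchPythonOperator
from airflow.operators.email_operator import EmailOperator
from airflow.operators.dummy_operator import DummyOperator

# The last line the script prints is pushed to XCom when xcom_push=True
# (the parameter is called do_xcom_push in newer Airflow versions).
t = BashOperator(task_id='import',
                 bash_command="python /home/ubuntu/airflow/scripts/import.py",
                 xcom_push=True,
                 dag=dag)

def choose_branch(**context):
    status = context['ti'].xcom_pull(task_ids='import')
    return 'send_email' if status == '1' else 'no_email'

branch = BranchPythonOperator(task_id='branch_on_status',
                              python_callable=choose_branch,
                              provide_context=True,  # only needed on Airflow 1.x
                              dag=dag)

send_email = EmailOperator(task_id='send_email',
                           to='dev@example.com',  # hypothetical recipient
                           subject='import.py reported status 1',
                           html_content='The data changed and needs attention.',
                           dag=dag)

no_email = DummyOperator(task_id='no_email', dag=dag)

t >> branch >> [send_email, no_email]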
I have a DAG that looks like this:
dag1:
start >> clean >> end
Then I have a global Airflow variable "STATUS". Before running the clean step, I want to check if the "STATUS" variable is true or not. If it is true, then I want to proceed to the "clean" task. Or else, I want to stay in a waiting state until the global variable "STATUS" turns to true.
Something like this:
start >> wait_for_dag2 >> clean >> end
How can I achieve this?
Alternatively, if waiting is not possible, is there any way to trigger dag1 whenever the global variable is set to true, instead of using a fixed schedule?
You can use a PythonSensor that calls a Python function which checks the variable and returns True/False.
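A minimal sketch of what that sensor could look like, assuming Airflow 2.x (where PythonSensor lives in airflow.sensors.python) and that the "STATUS" variable holds the string "true" or "false":

from airflow.models import Variable
from airflow.sensors.python import PythonSensor

def _status_is_true():
    # The sensor keeps poking until this returns True.
    return Variable.get("STATUS", default_var="false").lower() == "true"

wait_for_dag2 = PythonSensor(
    task_id='wait_for_dag2',
    python_callable=_status_is_true,
    poke_interval=60,      # check once a minute; adjust to your needs
    mode='reschedule',     # frees the worker slot between pokes
    dag=dag,
)

start >> wait_for_dag2 >> clean >> end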
There are 3 methods you can use:
Use TriggerDagRunOperator as #azs suggested. The problem with this approach, though, is that it somewhat contradicts the "O" (open for extension, closed for modification) in the SOLID principles.
Put the variable inside a file and use data-aware scheduling, which was introduced in Airflow 2.4 (see the sketch after this list). However, it is new functionality at the time of this answer and may change in the future. data_aware_scheduling
Check the last status of dag2 (the previous DAG). This also has a flaw which may occur rarely but cannot be excluded completely: what if the DAG starts to run right after checking its status!?
from airflow.models.dagrun import DagRun
from airflow.utils.state import DagRunState

dag_runs = DagRun.find(dag_id='the_dag_id_of_dag2')
last_run = dag_runs[-1]
if last_run.state == DagRunState.SUCCESS:
    print('the dag run was successful!')
else:
    print('the dag state is -->: ', last_run.state)
In the end, it depends on you and your business constraints which of these methods to choose.
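For method 2, here is a minimal sketch of what data-aware scheduling could look like, assuming Airflow 2.4+ and a hypothetical /tmp/status_flag file standing in for the "STATUS" variable:

from datetime import datetime
from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import PythonOperator

status_dataset = Dataset("file:///tmp/status_flag")  # hypothetical path

# In dag2, the task that sets the status declares the dataset as an outlet.
with DAG(dag_id='dag2', start_date=datetime(2023, 1, 1), schedule=None) as dag2:
    set_status = PythonOperator(
        task_id='set_status',
        python_callable=lambda: open('/tmp/status_flag', 'w').write('true'),
        outlets=[status_dataset],
    )

# dag1 is scheduled on the dataset instead of a cron expression,
# so it runs whenever dag2 updates the flag.
with DAG(dag_id='dag1', start_date=datetime(2023, 1, 1), schedule=[status_dataset]) as dag1:
    clean = EmptyOperator(task_id='clean')  # stand-in for the real clean task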
Let's say I have some Airflow operator, and one of the arguments to the operator needs to take the value from the xcom. I've managed to do it in the following way -
f"model_id={{{{ ti.xcom_pull(task_ids='train', key='{task_id}')}}}}"
Where model_id is the argument name to the DockerOperator that Airflow runs, and task_id is the name of the key under which that value is stored in the XCom.
Now I want to do something more complex and save under task_id a dictionary instead of one value, and be able to take it from it somehow.
Is there a way to do it similar to the one I mentioned above? Something like:
f"model_id={{{{ ti.xcom_pull(task_ids='train', key='{task_id}')}}}}[value]"
By default, all the template_fields are rendered as strings.
However Airflow offers the option to render fields as native Python objects.
You will need to set up your DAG as:
dag = DAG(
    ...
    render_template_as_native_obj=True,
)
You can see an example of how to render a dictionary in the docs.
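As a minimal sketch (hypothetical DAG and task names), pushing a dictionary from one task and receiving it as a real dict in the next could look like this:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def _train(ti):
    # Push a whole dictionary under one key instead of a single value.
    ti.xcom_push(key='model_info', value={'model_id': 'abc123', 'score': 0.93})

def _deploy(model_info):
    # With render_template_as_native_obj=True this arrives as a real dict,
    # not its string representation.
    print(model_info['model_id'])

with DAG(
    dag_id='native_render_example',
    start_date=datetime(2023, 1, 1),
    schedule=None,
    render_template_as_native_obj=True,
) as dag:
    train = PythonOperator(task_id='train', python_callable=_train)
    deploy = PythonOperator(
        task_id='deploy',
        python_callable=_deploy,
        op_kwargs={'model_info': "{{ ti.xcom_pull(task_ids='train', key='model_info') }}"},
    )
    train >> deploy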
My answer for a similar issue was this.
f"model_id={{{{ ti.xcom_pull(task_ids='train', key='{task_id}')[value]}}}}"
Would someone let me know if there is a way to override the default failure notification method?
I am planning to send failure notifications to SNS; however, this means I will have to change all the existing DAGs and add an on_failure_callback method to each.
I was thinking whether there is a way I can override the existing notification method so that I don't need to change all the DAGs, or configure a global hook for all the DAGs so that I don't need to add on_failure_callback to each of them.
You can use a Cluster Policy to mutate tasks right after the DAG is parsed.
For example, this function could apply a specific queue property when using a specific operator, or enforce a task timeout policy, making sure that no tasks run for more than 48 hours. Here’s an example of what this may look like inside your airflow_local_settings.py:
from datetime import timedelta

def policy(task):
    if task.__class__.__name__ == 'HivePartitionSensor':
        task.queue = "sensor_queue"
    if task.timeout > timedelta(hours=48):
        task.timeout = timedelta(hours=48)
For Airflow 2.0, this policy should look like:
def task_policy(task):
    if task.__class__.__name__ == 'HivePartitionSensor':
        task.queue = "sensor_queue"
    if task.timeout > timedelta(hours=48):
        task.timeout = timedelta(hours=48)
The policy function has been renamed to task_policy.
In a similar way, you can modify other attributes, e.g. on_execute_callback, on_failure_callback, on_success_callback, on_retry_callback.
The airflow_local_settings.py file must be in one of the directories that are on sys.path. The easiest way to take advantage of this is that Airflow adds the ~/airflow/config directory to sys.path at startup, so you need to create an ~/airflow/config/airflow_local_settings.py file.
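For the question above, a minimal sketch of an airflow_local_settings.py that attaches an SNS failure notification to every task without touching the DAGs could look like this (the callback body and the topic ARN are hypothetical, using boto3 to publish):

# ~/airflow/config/airflow_local_settings.py
import boto3

SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:airflow-failures"  # hypothetical ARN

def notify_sns_on_failure(context):
    # Airflow calls this with the task context when a task instance fails.
    ti = context['task_instance']
    message = f"Task {ti.task_id} in DAG {ti.dag_id} failed (execution date: {context['ds']})."
    boto3.client('sns').publish(TopicArn=SNS_TOPIC_ARN, Message=message)

def task_policy(task):
    # Attach the callback to every task unless the DAG author already set one.
    if task.on_failure_callback is None:
        task.on_failure_callback = notify_sns_on_failure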
I have a list of http endpoints, each performing a task on its own. We are trying to write an application which will orchestrate these tasks by invoking the endpoints in a certain order. In this solution we also have to process the output of one http endpoint and generate the input for the next http endpoint. Also, the same workflow can get invoked simultaneously depending on the trigger.
What I have done until now,
1. Have defined a new operator deriving from the HttpOperator and introduced capabilities to write the output of the http endpoint to a file.
2. Have written a python operator which can transfer the output depending on the necessary logic.
Since I can have multiple instances of the same workflow in execution, I could not hardcode the output file names. Is there a way to make the http operator I wrote write to unique file names, with the same file name available to the next task so that it can read and process the output?
Airflow does have a feature for operator cross-communication called XCom.
XComs can be “pushed” (sent) or “pulled” (received). When a task pushes an XCom, it makes it generally available to other tasks. Tasks can push XComs at any time by calling the xcom_push() method.
Tasks call xcom_pull() to retrieve XComs, optionally applying filters based on criteria like key, source task_ids, and source dag_id.
To push to XCOM use
ti.xcom_push(key=<variable name>, value=<variable value>)
To pull a XCOM object use
myxcom_val = ti.xcom_pull(key=<variable name>, task_ids='<task to pull from>')
With the BashOperator, you just set xcom_push=True and the last line written to stdout is set as the XCom value.
You can view the XCom value while your task is running by simply opening the task instance from the Airflow UI and clicking on the XCom tab.
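For the original question, one way to get a unique file name per DAG run is to derive it from run_id and hand it to the next task through XCom. A minimal sketch, assuming Airflow 2.x PythonOperators (which pass context variables such as ti and run_id to the callable) and a hypothetical /tmp output location:

from airflow.operators.python import PythonOperator

def call_endpoint_and_save(ti, run_id, **_):
    # run_id is unique per DAG run, so concurrent runs don't clash.
    output_path = f"/tmp/endpoint_output_{run_id}.json"
    # ... call the HTTP endpoint and write its response to output_path ...
    ti.xcom_push(key='output_path', value=output_path)

def transform_output(ti, **_):
    output_path = ti.xcom_pull(key='output_path', task_ids='call_endpoint')
    # ... read output_path, transform it, and prepare input for the next endpoint ...

call_endpoint = PythonOperator(task_id='call_endpoint',
                               python_callable=call_endpoint_and_save,
                               dag=dag)
transform = PythonOperator(task_id='transform',
                           python_callable=transform_output,
                           dag=dag)

call_endpoint >> transform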
I was learning apache airflow and found that there is an operator called DummyOperator. I googled about its use case, but couldn't find anything that I can understand. Can anyone here please discuss its use case?
Operator that does literally nothing. It can be used to group tasks in a DAG.
https://airflow.apache.org/_api/airflow/operators/dummy_operator/index.html
As far as I know, there are at least two cases:
Test purpose: in DAGs, the dummy operator just sits between upstream and downstream tasks; later, you can replace it with the real operator.
Workflow purpose: BranchPythonOperator works with DummyOperator. If you want to skip some tasks, keep in mind that you can't have an empty path; if so, make a dummy task.
https://airflow.apache.org/concepts.html#workflows
dummy_operator is used with BranchPythonOperator, where we decide the next task based on some condition.
For example:
                   -> task C -> task D
task A -> task B                        -> task F
                   -> task E (Dummy)
So let's suppose we have some condition in task B which decides whether to follow [task C -> task D] or task E (Dummy) to reach task F.
Since we cannot leave the else path empty, we have to put a dummy operator there, which does nothing but skip or bypass.
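A minimal sketch of that pattern, assuming Airflow 2.x imports and an existing dag object as in the examples above (the condition and callables are hypothetical):

from airflow.operators.dummy import DummyOperator
from airflow.operators.python import BranchPythonOperator, PythonOperator

def _decide(**_):
    # Hypothetical condition standing in for whatever task B checks;
    # return the task_id of the branch to follow.
    data_needs_processing = True
    return 'task_c' if data_needs_processing else 'task_e'

task_b = BranchPythonOperator(task_id='task_b', python_callable=_decide, dag=dag)
task_c = PythonOperator(task_id='task_c', python_callable=lambda: print('processing'), dag=dag)
task_d = PythonOperator(task_id='task_d', python_callable=lambda: print('post-processing'), dag=dag)
task_e = DummyOperator(task_id='task_e', dag=dag)  # the "do nothing" branch

# task_f must run whichever branch was taken, so it needs a trigger rule that
# tolerates skipped parents (the default all_success would skip it as well).
task_f = DummyOperator(task_id='task_f', trigger_rule='none_failed', dag=dag)

task_b >> [task_c, task_e]
task_c >> task_d >> task_f
task_e >> task_f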
Another use case: I've implemented a framework that returns an Operator. In most cases this is a PostgresOperator, but under some user-specified configurations there's no SQL to run; the caller still expects an Operator, so I return a DummyOperator rather than a PostgresOperator with trivial SQL like "select 1;".