Would someone let me know if there is a way to override the default failure notification method?
I am planning to send failure notifications to SNS, but this means I will have to change all the existing DAGs and add an on_failure_callback to each of them.
I was wondering if there is a way to override the existing notification method so that I don't need to change all the DAGs,
or to configure a global hook for all the DAGs, so that I don't need to add on_failure_callback to each one.
You can use a Cluster policy to mutate tasks right after the DAG is parsed.
For example, this function could apply a specific queue property when using a specific operator, or enforce a task timeout policy, making sure that no tasks run for more than 48 hours. Here’s an example of what this may look like inside your airflow_local_settings.py:
from datetime import timedelta

def policy(task):
    if task.__class__.__name__ == 'HivePartitionSensor':
        task.queue = "sensor_queue"
    if task.timeout > timedelta(hours=48):
        task.timeout = timedelta(hours=48)
For Airflow 2.0+, this policy should look like this:
def task_policy(task):
    if task.__class__.__name__ == 'HivePartitionSensor':
        task.queue = "sensor_queue"
    if task.timeout > timedelta(hours=48):
        task.timeout = timedelta(hours=48)
The policy function has been renamed to task_policy.
In a similar way, you can modify other attributes, e.g. on_execute_callback, on_failure_callback, on_success_callback, on_retry_callback.
The airflow_local_settings.py file must be in one of the directories that are on sys.path. The easiest way to take advantage of this is that Airflow adds the directory ~/airflow/config to sys.path at startup, so you just need to create an ~/airflow/config/airflow_local_settings.py file.
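To address the original question, here is a rough sketch of an airflow_local_settings.py that attaches an SNS failure callback to every task without touching the DAGs. The boto3 usage and the topic ARN are assumptions, adapt them to your setup:

import boto3


def notify_sns_on_failure(context):
    # Hypothetical callback: publish the failed task's identity to an SNS topic.
    ti = context["task_instance"]
    boto3.client("sns").publish(
        TopicArn="arn:aws:sns:us-east-1:123456789012:airflow-failures",  # placeholder ARN
        Message=f"Task {ti.task_id} in DAG {ti.dag_id} failed.",
    )


def task_policy(task):
    # Attach the SNS callback to every task that does not already define one,
    # so existing DAGs do not have to be modified.
    if not task.on_failure_callback:
        task.on_failure_callback = notify_sns_on_failure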
Related
I have a DAG that looks like this:
dag1:
start >> clean >> end
Then I have a global Airflow variable "STATUS". Before running the clean step, I want to check whether the "STATUS" variable is true or not. If it is true, I want to proceed to the "clean" task. Otherwise, I want to stay in a waiting state until the global variable "STATUS" turns true.
Something like this:
start >> wait_for_dag2 >> clean >> end
How can I achieve this?
Alternatively, if waiting is not possible, is there any way to trigger dag1 whenever the global variable is set to true, instead of giving it a set schedule?
You can use a PythonSensor that calls a Python function which checks the variable and returns True/False.
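A minimal sketch of that approach, reusing the start/clean/end tasks and the STATUS variable from the question (the poke interval and timeout values are placeholders):

from airflow.models import Variable
from airflow.sensors.python import PythonSensor


def status_is_true():
    # True once the global Airflow Variable "STATUS" is set to "true".
    return Variable.get("STATUS", default_var="false").lower() == "true"


wait_for_dag2 = PythonSensor(
    task_id="wait_for_dag2",
    python_callable=status_is_true,
    poke_interval=60,        # re-check every minute
    timeout=6 * 60 * 60,     # give up after 6 hours
    mode="reschedule",       # free the worker slot between checks
)

start >> wait_for_dag2 >> clean >> end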
There are 3 methods you can use:
1. Use the TriggerDagRunOperator, as #azs suggested (see the sketch after this list). The problem with this approach is that it somewhat contradicts the "O" (open for extension, closed for modification) in the SOLID principles.
2. Put the variable inside a file and use data-aware scheduling, which was introduced in Airflow 2.4. However, it is new functionality at the time of this answer and may change in the future: data_aware_scheduling
3. Check the last status of dag2 (the previous DAG). This also has a flaw that may occur rarely but cannot be excluded completely: what if the DAG starts to run right after you check its status?
from airflow.models.dagrun import DagRun
from airflow.utils.state import DagRunState

dag_runs = DagRun.find(dag_id='the_dag_id_of_dag2')
last_run = dag_runs[-1]
if last_run.state == DagRunState.SUCCESS:
    print('the dag run was successful!')
else:
    print('the dag state is -->: ', last_run.state)
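For the first method, a rough sketch of what the end of dag2 could look like (the task and DAG ids are placeholders):

from airflow.operators.trigger_dagrun import TriggerDagRunOperator

trigger_dag1 = TriggerDagRunOperator(
    task_id="trigger_dag1",
    trigger_dag_id="dag1",   # the DAG containing start >> clean >> end
)

# inside dag2: last_task_of_dag2 >> trigger_dag1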
In the end, it depends on you and your business constraints to choose among these methods.
I run Airflow on Kubernetes, so I don't want a solution involving CLI commands; ideally everything should be doable via the GUI.
I have a task and want to inject a variable into its command manually only. I can achieve this with Airflow Variables, but the user has to create and then reset the variable.
With variables it might look like:
flag = Variable.get(
    "NAME_OF_VARIABLE", False
)
append_args = "--injected-argument" if flag == "True" else ""
Or you could use jinja templating.
Is there a way to inject variables one-off to the task without the CLI?
There's no way to pass a value to one single task in Airflow, but you can trigger a DAG and provide a JSON object for that one single DAG run.
The JSON object is accessible when templating as {{ dag_run.conf }}.
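A minimal sketch of that approach, assuming a hypothetical conf key named injected_argument and a placeholder script path; trigger the DAG with a conf like {"injected_argument": "--injected-argument"}:

from airflow.operators.bash import BashOperator

run_script = BashOperator(
    task_id="run_script",
    # Renders the value passed in the trigger conf, or an empty string
    # when the run was not triggered with a conf.
    bash_command=(
        "python my_script.py "
        "{{ dag_run.conf.get('injected_argument', '') if dag_run.conf else '' }}"
    ),
)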
I have a list of HTTP endpoints, each performing a task on its own. We are trying to write an application which will orchestrate them by invoking these endpoints in a certain order. In this solution we also have to process the output of one HTTP endpoint and generate the input for the next HTTP endpoint. Also, the same workflow can be invoked simultaneously depending on the trigger.
What I have done so far:
1. Have defined a new operator deriving from the HttpOperator and introduced capabilities to write the output of the http endpoint to a file.
2. Have written a python operator which can transfer the output depending on the necessary logic.
Since I can have multiple instances of the same workflow in execution, I cannot hardcode the output file names. Is there a way to make the HTTP operator I wrote write to unique file names, and make the same file name available to the next task so that it can read and process the output?
Airflow does have a feature for operator cross-communication called XCom.
XComs can be “pushed” (sent) or “pulled” (received). When a task pushes an XCom, it makes it generally available to other tasks. Tasks can push XComs at any time by calling the xcom_push() method.
Tasks call xcom_pull() to retrieve XComs, optionally applying filters based on criteria like key, source task_ids, and source dag_id.
To push to XCOM use
ti.xcom_push(key=<variable name>, value=<variable value>)
To pull a XCOM object use
myxcom_val = ti.xcom_pull(key=<variable name>, task_ids='<task to pull from>')
With the BashOperator, you just set xcom_push=True (do_xcom_push in newer Airflow versions) and the last line written to stdout is set as the XCom value.
You can view the XCom value while your task is running by simply opening the task instance from the Airflow UI and clicking on the XCom tab.
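Coming back to the unique-file-name requirement, a rough sketch of passing a per-run file path between tasks via XCom (the task names and the /tmp path are placeholders; on Airflow 1.x you would also need provide_context=True):

from airflow.operators.python import PythonOperator


def call_endpoint_and_save(**context):
    # Build a file name that is unique per DAG run and per task.
    out_path = f"/tmp/endpoint_output_{context['run_id']}_{context['ti'].task_id}.json"
    # ... call the HTTP endpoint here and write its response to out_path ...
    context["ti"].xcom_push(key="output_path", value=out_path)


def transform_output(**context):
    in_path = context["ti"].xcom_pull(key="output_path", task_ids="call_endpoint")
    # ... read in_path, apply the transformation logic, write input for the next endpoint ...


call_endpoint = PythonOperator(task_id="call_endpoint", python_callable=call_endpoint_and_save)
transform = PythonOperator(task_id="transform", python_callable=transform_output)
call_endpoint >> transform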
I have a python script that is called from BashOperator.
The script can return status 0 or 1.
I want to trigger an email only when the status is 1.
Note these statuses are not to be confused with Failure/Success. This is simply an indication that something was changed with the data and requires attention from the developer.
This is my operator:
t = BashOperator(task_id='import',
                 bash_command="python /home/ubuntu/airflow/scripts/import.py",
                 dag=dag)
I looked over the docs, but everything email-related addressed the on-failure case, which is irrelevant here.
If you don't want to override an operator or do anything fancy, you might be able to use XComs and the BranchPythonOperator.
If your condition is based on a 0 or a 1, you can just push that value to XCom (set xcom_push to True).
Then, you can use the BranchPythonOperator to check that value and execute the appropriate task, as sketched below. You can find an example of the BranchPythonOperator and pulling from XCom in the Airflow example_dags.
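A minimal sketch of that idea, assuming import.py prints its 0/1 status as the last line of stdout (the task ids and email address are placeholders):

from airflow.operators.bash import BashOperator
from airflow.operators.email import EmailOperator
from airflow.operators.empty import EmptyOperator  # DummyOperator on older versions
from airflow.operators.python import BranchPythonOperator


def choose_branch(**context):
    # The BashOperator pushes the last line of stdout to XCom.
    status = context["ti"].xcom_pull(task_ids="import")
    return "notify_developer" if str(status).strip() == "1" else "no_email"


t = BashOperator(
    task_id="import",
    # Assumes import.py prints 0 or 1 as its last line of output.
    bash_command="python /home/ubuntu/airflow/scripts/import.py",
    do_xcom_push=True,  # xcom_push=True on older Airflow versions
)
branch = BranchPythonOperator(task_id="branch", python_callable=choose_branch)
notify_developer = EmailOperator(
    task_id="notify_developer",
    to="dev@example.com",  # placeholder address
    subject="import.py reported status 1",
    html_content="The data changed and needs attention.",
)
no_email = EmptyOperator(task_id="no_email")

t >> branch >> [notify_developer, no_email]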
Context: I've defined an Airflow DAG which performs an operation, compute_metrics, on some data for an entity based on a parameter called org. Underneath, something like myapi.compute_metrics(org) is called. This flow will mostly be run on an ad-hoc basis.
Problem: I'd like to be able to select the org to run the flow against when I manually trigger the DAG from the airflow UI.
The most straightforward solution I can think of is to generate n different DAGs, one for each org. The DAGs would have ids like: compute_metrics_1, compute_metrics_2, etc... and then when I need to trigger compute metrics for a single org, I can pick the DAG for that org. This doesn't scale as I add orgs and as I add more types of computation.
I've done some research and it seems that I can create a Flask blueprint for Airflow, which, to my understanding, extends the UI. In this extended UI I could add input components, like a text box, for picking an org and then pass that as a conf to a DagRun that is manually created by the blueprint. Is that correct? I'm imagining I could write something like:
session = settings.Session()
execution_date = datetime.now()
run_id = 'external_trigger_' + execution_date.isoformat()
trigger = DagRun(
    dag_id='general_compute_metrics_needs_org_id',
    run_id=run_id,
    state=State.RUNNING,
    execution_date=execution_date,
    external_trigger=True,
    conf=org_ui_component.text)  # pass the org id from a component in the blueprint
session.add(trigger)
session.commit()  # I don't know if this would actually be scheduled by the scheduler
Is my idea sound? Is there a better way to achieve what I want?
I've done some research and it seems that I can create a Flask blueprint for Airflow, which, to my understanding, extends the UI.
The blueprint extends the API. If you want some UI for it, you'll need to serve a template view. The most feature-complete way of achieving this is to develop your own Airflow Plugin.
If you want to manually create DagRuns, you can use this trigger as a reference. For simplicity, I'd trigger a DAG with the API.
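For example, a rough sketch of triggering a run with a conf through the Airflow 2 stable REST API (the URL, credentials, DAG id, and org value are placeholders, and the basic_auth API backend is assumed to be enabled):

import requests

resp = requests.post(
    "http://localhost:8080/api/v1/dags/compute_metrics/dagRuns",
    auth=("admin", "admin"),          # placeholder credentials
    json={"conf": {"org": "acme"}},   # the org to compute metrics for
)
resp.raise_for_status()
print(resp.json()["dag_run_id"])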
And specifically about your problem, I would have a single DAG compute_metrics that reads the org from an Airflow Variable. Variables are global and can be set dynamically. You can prefix the variable name with something like the DagRun id to make it unique and thus safe for concurrent DAG runs.
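A rough sketch of that idea; the variable naming convention is an assumption, and whatever triggers the run (your plugin, or the API call above) is expected to set the variable first:

import myapi  # the API module from the question (assumed importable)
from airflow.models import Variable
from airflow.operators.python import PythonOperator


def run_compute_metrics(**context):
    # The triggering code is expected to have set a Variable named
    # "compute_metrics_org_<run_id>" for this specific run.
    org = Variable.get(f"compute_metrics_org_{context['run_id']}")
    myapi.compute_metrics(org)


compute = PythonOperator(task_id="compute_metrics", python_callable=run_compute_metrics)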