When I create a new DAG, I have to go into the UI and click on the 'schedule' toggle to turn scheduling off. How can I do this without needing to use the UI? Is there an option in the DAG constructor itself?
In other words: how do I turn those buttons above to 'Off' in my DAG file?
There is no way to set a DAG as disabled within a DAG file. You can mimic the behavior by temporarily setting the DAG's schedule_interval to None. You can also set the Airflow configuration value dags_are_paused_at_creation to True if you want all new DAGs to be off by default. You'll then need to turn new DAGs on manually in the UI when they are ready to be scheduled.
You can set is_paused_upon_creation=True in the DAG constructor:
DAG(dag_id=dag_id,
    schedule_interval='@once',
    ...
    is_paused_upon_creation=True)
There's no way to set this within the DAG file, but if you're trying to enable or disable a large number of DAGs, you can run an UPDATE statement in your Airflow database: UPDATE dag SET is_paused = TRUE;
Related
I have a DAG that looks like this:
dag1:
start >> clean >> end
Then I have a global Airflow variable "STATUS". Before running the clean step, I want to check if the "STATUS" variable is true or not. If it is true, then I want to proceed to the "clean" task. Or else, I want to stay in a waiting state until the global variable "STATUS" turns to true.
Something like this:
start >> wait_for_dag2 >> clean >> end
How can I achieve this?
Alternatively, if waiting is not possible, is there any way to trigger dag1 whenever the global variable is set to true, instead of giving it a fixed schedule?
You can use a PythonSensor that calls a Python function which checks the variable and returns True or False.
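A minimal sketch of that sensor, assuming Airflow 2.x import paths and that STATUS is stored as a string such as "true". The names status_is_true, add_wait_task and the task_id wait_for_status are made up for illustration, and the get_variable parameter exists only so the check can be tried without an Airflow installation:

```python
def status_is_true(get_variable=None):
    """Return True when the Airflow Variable STATUS holds a truthy string."""
    if get_variable is None:
        # Imported lazily so the helper can be exercised without Airflow.
        from airflow.models import Variable
        get_variable = Variable.get
    # Assumption: a missing Variable should be treated as "not ready yet".
    value = get_variable("STATUS", default_var="false")
    return str(value).strip().lower() in {"true", "1", "yes"}


def add_wait_task(dag):
    """Attach a sensor to `dag` that re-checks STATUS every 60 seconds."""
    from airflow.sensors.python import PythonSensor  # Airflow 2.x path

    return PythonSensor(
        task_id="wait_for_status",      # hypothetical task name
        python_callable=status_is_true,
        poke_interval=60,               # seconds between checks
        mode="reschedule",              # free the worker slot between checks
        dag=dag,
    )
```

With this in place the chain from the question would read start >> add_wait_task(dag1) >> clean >> end. The sensor keeps the run waiting until the callable returns True or the sensor times out.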
There are 3 methods you can use:
1. Use TriggerDagRunOperator, as @azs suggested. The problem with this approach is that it somewhat contradicts the "O" (open for extension, closed for modification) in the SOLID principles.
2. Put the variable inside a file and use data-aware scheduling, which was introduced in Airflow 2.4. However, it is new functionality at the time of this answer and it may change in the future. See data_aware_scheduling.
3. Check the last status of dag2 (the previous DAG). This also has a flaw, which may occur rarely but cannot be excluded completely: what if the DAG starts to run right after you check the status?
from airflow.models.dagrun import DagRun
from airflow.utils.state import DagRunState

# Look up all runs of dag2 and inspect the last one in the returned list.
dag_runs = DagRun.find(dag_id='the_dag_id_of_dag2')
last_run = dag_runs[-1]
if last_run.state == DagRunState.SUCCESS:
    print('the dag run was successful!')
else:
    print('the dag state is -->: ', last_run.state)
In the end, which of these methods to choose depends on you and your business constraints.
How can I use the note shown in the DAG runs panel of the UI?
I want to fill it in programmatically, for example by changing the content depending on the params passed to the DAG run.
Is there a way to programmatically determine what triggered the current task run of the PythonOperator from inside of the operator?
I want to differentiate between the task runs triggered on schedule, those catching up, and those triggered by the backfill CLI command.
The template context contains two variables: dag_run and run_id that you can use to determine whether the run was scheduled, a backfill, or externally triggered.
from airflow import jobs

def python_target(**context):
    is_backfill = context["dag_run"].is_backfill
    is_external = context["dag_run"].external_trigger
    is_latest = context["execution_date"] == context["dag"].latest_execution_date
    # More code...
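The run_id mentioned above can also be inspected directly. The sketch below assumes the usual run_id prefixes (scheduled__, manual__, and backfill followed by one or two underscores depending on the Airflow version); runs created with a custom run_id will not match, so the function falls back to the external_trigger flag and then to "unknown". classify_run is a hypothetical helper, not part of Airflow:

```python
def classify_run(run_id, external_trigger=False):
    """Guess what created a DAG run from its run_id prefix.

    Assumption: run_ids follow Airflow's default naming scheme.
    """
    if run_id.startswith("backfill"):
        return "backfill"
    if external_trigger or run_id.startswith("manual__"):
        return "manual"
    if run_id.startswith("scheduled__"):
        return "scheduled"
    return "unknown"


def python_target(**context):
    """Example callable for a PythonOperator that reports the run type."""
    dag_run = context["dag_run"]
    # external_trigger may be absent on some versions, so read it defensively.
    external = getattr(dag_run, "external_trigger", False)
    kind = classify_run(dag_run.run_id, external)
    print("run type:", kind)
    return kind
```

Telling a catch-up run apart from the latest scheduled run still needs the execution date comparison shown earlier, since both carry the scheduled prefix.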
Context: I've defined an Airflow DAG which performs an operation, compute_metrics, on some data for an entity based on a parameter called org. Underneath, something like myapi.compute_metrics(org) is called. This flow will mostly be run on an ad-hoc basis.
Problem: I'd like to be able to select the org to run the flow against when I manually trigger the DAG from the airflow UI.
The most straightforward solution I can think of is to generate n different DAGs, one for each org. The DAGs would have ids like: compute_metrics_1, compute_metrics_2, etc... and then when I need to trigger compute metrics for a single org, I can pick the DAG for that org. This doesn't scale as I add orgs and as I add more types of computation.
I've done some research and it seems that I can create a flask blueprint for airflow, which to my understanding, extends the UI. In this extended UI I can add input components, like a text box, for picking an org and then pass that as a conf to a DagRun which is manually created by the blueprint. Is that correct? I'm imagining I could write something like:
from datetime import datetime

from airflow import settings
from airflow.models import DagRun
from airflow.utils.state import State

session = settings.Session()
execution_date = datetime.now()
run_id = 'external_trigger_' + execution_date.isoformat()
trigger = DagRun(
    dag_id='general_compute_metrics_needs_org_id',
    run_id=run_id,
    state=State.RUNNING,
    execution_date=execution_date,
    external_trigger=True,
    conf=org_ui_component.text)  # pass the org id from a component in the blueprint
session.add(trigger)
session.commit()  # I don't know if this would actually be scheduled by the scheduler
Is my idea sound? Is there a better way to achieve what I want?
I've done some research and it seems that I can create a flask blueprint for airflow, which to my understanding, extends the UI.
The blueprint extends the API. If you want some UI for it, you'll need to serve a template view. The most feature-complete way to achieve this is to develop your own Airflow plugin.
If you want to manually create DagRuns, you can use this trigger as a reference. For simplicity, I'd trigger the DAG with the API.
And specifically about your problem, I would have a single DAG compute_metrics that reads the org from an Airflow Variable. They are global and can be set dynamically. You can prefix the variable name with something like the DagRun id to make it unique and thus dag-concurrent safe.
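A rough sketch of the per-run Variable idea, assuming Airflow 2.x, where run_id is available in the task context. The key format and the helper names org_variable_key and read_org are invented for illustration, and get_variable is injectable only so the lookup can be tried outside Airflow:

```python
def org_variable_key(run_id):
    """Build a per-run Variable name so concurrent runs do not collide."""
    return f"{run_id}__org"


def read_org(get_variable=None, **context):
    """Return the org stored for the current DAG run (hypothetical helper)."""
    if get_variable is None:
        # Lazy import keeps the helper usable without Airflow installed.
        from airflow.models import Variable
        get_variable = Variable.get
    return get_variable(org_variable_key(context["run_id"]))
```

The task callable could then pass the result to myapi.compute_metrics. This only works if the run is triggered with a run_id known in advance, so that the matching Variable can be set before the task starts.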
I have a use case where I have a list of clients. The client can be added or removed from the list, and they can have different start dates, and different initial parameters.
I want to use airflow to backfill all data for each client based on their initial start date + rerun if something fails. I am thinking about creating a SubDag for each client. Will this address my problem?
How can I dynamically create SubDags based on the client_id?
You can definitely create DAG objects dynamically:
from airflow import DAG

def make_client_dag(parent_dag, client):
    # Subdag ids must be prefixed with the parent DAG id.
    return DAG(
        '%s.client_%s' % (parent_dag.dag_id, client.name),
        start_date=client.start_date
    )
You could then use that method in a SubDagOperator from your main dag:
for client in clients:
    SubDagOperator(
        task_id='client_%s' % client.name,
        dag=main_dag,
        subdag=make_client_dag(main_dag, client)
    )
This will create a subdag specific to each member of the collection clients, and each will run for the next invocation of the main dag. I'm not sure if you'll get the backfill behavior you want.