I have two independent DAGs, let's say DAG_A and DAG_B, each with multiple tasks.
The two DAGs are in different GCP projects, let's say project-1 and project-2 respectively.
What I want to do is create a third DAG, let's call it DAG_C.
DAG_C will be part of project-1 and will be used to orchestrate DAG_A and DAG_B.
DAG_C should start by triggering DAG_A, and on task_2's success it should trigger DAG_B.
This picture simplifies the problem:
(Image: Overview of the architecture)
The question is: would this be possible using the TriggerDagRunOperator? I can't see any option to change the GCP project ID on that operator.
Also, what would be the best approach, assuming that TriggerDagRunOperator will not work in that case?
There is no option to do that with TriggerDagRunOperator, as the operator only sees the scope of the Airflow instance it runs in.
Your only option is to use the Airflow REST API.
In DAG_C, the trigger_B task will need to be a PythonOperator that authenticates with the REST API of project-2 and then uses the "Trigger a new DagRun" endpoint to trigger DAG_B. Note that Airflow provides an official Python client for the API, so you can use it for this task.
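A minimal sketch of what that trigger_B callable could look like, assuming project-2's Airflow webserver exposes the stable REST API (POST /api/v1/dags/{dag_id}/dagRuns). The host name and the bearer token are placeholders; on Cloud Composer you would typically fetch a Google ID token for the environment instead:

```python
import json
from urllib import request

# Placeholder: the Airflow webserver URL of project-2's environment.
AIRFLOW_B_URL = "https://airflow.project-2.example.com"

def build_trigger_request(dag_id, conf=None):
    """Build the URL and JSON body for the stable REST API's
    'Trigger a new DAG run' endpoint: POST /api/v1/dags/{dag_id}/dagRuns."""
    url = "{}/api/v1/dags/{}/dagRuns".format(AIRFLOW_B_URL, dag_id)
    body = json.dumps({"conf": conf or {}}).encode("utf-8")
    return url, body

def trigger_dag_b(**context):
    """python_callable for the trigger_B PythonOperator in DAG_C."""
    url, body = build_trigger_request("DAG_B", conf={"triggered_by": "DAG_C"})
    req = request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            # Placeholder auth header -- swap in whatever scheme
            # your project-2 deployment actually uses.
            "Authorization": "Bearer <token>",
        },
        method="POST",
    )
    with request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())["dag_run_id"]
```

The same callable also works with the official Python client (apache-airflow-client) in place of urllib; the endpoint and payload shape are identical.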
I have two DAGs in my Airflow scheduler which were working in the past. After I had to rebuild the Docker containers running Airflow, their runs are now stuck in the queued state. The DAGs in my case are triggered via the REST API, so no actual scheduling is involved.
Since there are quite a few similar posts, I ran through the checklist of this answer from a similar question:
Do you have the airflow scheduler running?
Yes!
Do you have the airflow webserver running?
Yes!
Have you checked that all DAGs you want to run are set to On in the web ui?
Yes, both DAGs are shown in the WebUI and no errors are displayed.
Do all the DAGs you want to run have a start date which is in the past?
Yes, the constructor of both DAGs looks as follows:
from airflow import DAG
from airflow.utils.dates import days_ago

dag = DAG(
    dag_id='image_object_detection_dag',
    default_args=args,
    schedule_interval=None,
    start_date=days_ago(2),
    tags=['helloworld'],
)
Do all the DAGs you want to run have a proper schedule which is shown in the web ui?
No, I trigger my DAGs manually via the REST API.
If nothing else works, you can use the web ui to click on the dag, then on Graph View. Now select the first task and click on Task Instance. In the paragraph Task Instance Details you will see why a DAG is waiting or not running.
Here is the output of what this paragraph is showing me:
What is the best way to find the reason, why the tasks won't exit the queued state and run?
EDIT:
Out of curiosity I tried to trigger the DAG from within the WebUI, and now both runs executed (the one triggered from the WebUI failed, but that was expected, since no config was set).
I have a question about the TriggerDagRunOperator, specifically the wait_for_completion parameter.
Before moving to Airflow 2.2, we used this operator to trigger another DAG and an ExternalTaskSensor to wait for its completion.
In Airflow 2.2 there is a new parameter called wait_for_completion that, if set to True, makes the task complete only when the triggered DAG has completed.
This is great, but I was wondering whether the worker will be released between pokes or not. I know that the ExternalTaskSensor used to have a reschedule mode that you could use for poke intervals longer than 1 minute, which releases the worker slot between pokes, but I don't see it in the documentation anymore.
My question is whether the wait_for_completion parameter causes the operator to release the worker between pokes. From looking at the code I don't think that is the case, so I just want to verify.
If it isn't releasing the worker and the triggered DAG is bound to take more than 1 minute to finish, what would be the best approach here?
We are using MWAA Airflow 2.2, so I guess deferrable operators are not an option (if they are a solution in this case).
When using wait_for_completion=True in TriggerDagRunOperator, the worker slot is not released for as long as the operator is running. You can see that in the operator implementation: the operator calls time.sleep(self.poke_interval) in a loop.
As you pointed out, there are two ways to verify that the triggered DAG completed:

1. Using TriggerDagRunOperator followed by ExternalTaskSensor
2. Using TriggerDagRunOperator with wait_for_completion=True
However, other than the resources issue you mentioned, the two options are not really equivalent.
In option 1, if the triggered DAG fails then the ExternalTaskSensor will fail.
In option 2, consider:
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

my_op = TriggerDagRunOperator(
    task_id='task',
    trigger_dag_id="dag_b",
    ...,
    wait_for_completion=True,
    retries=2,
)
If dag_b fails, then TriggerDagRunOperator will retry, which will invoke another DagRun of dag_b.
Both options are valid; you need to decide which behavior is suitable for your use case.
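For completeness, a sketch of option 1 as it could look on Airflow 2.2; the DAG ids and poke interval are made up for illustration. ExternalTaskSensor still accepts mode="reschedule", which frees the worker slot between pokes, avoiding the occupied-slot issue of wait_for_completion=True:

```python
from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.sensors.external_task import ExternalTaskSensor
from airflow.utils.dates import days_ago

with DAG(dag_id="dag_a", schedule_interval=None, start_date=days_ago(1)) as dag:
    trigger = TriggerDagRunOperator(
        task_id="trigger_dag_b",
        trigger_dag_id="dag_b",
        # Reuse the triggering run's logical date so the sensor
        # below can match the run it just started.
        execution_date="{{ execution_date }}",
    )
    wait = ExternalTaskSensor(
        task_id="wait_for_dag_b",
        external_dag_id="dag_b",
        external_task_id=None,   # wait for the whole DAG run, not one task
        mode="reschedule",       # release the worker slot between pokes
        poke_interval=5 * 60,
    )
    trigger >> wait
```

Note the failure semantics described above still apply: if dag_b fails, it is the sensor task here that fails, rather than a retry re-triggering dag_b.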
I was looking through the different API endpoints that Airflow offers, but I could not find one that would suit my needs. Essentially I want to monitor the state of each task within the DAG without having to specify each task I am trying to monitor. Ideally, I would be able to ping the DAG and the response would tell me the state of each task and which tasks are running/retrying, etc.
You can use the Airflow REST API, which ships with Airflow: https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html
I read the API reference and couldn't find anything on it. Is that possible?
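The stable API does cover this: the "List task instances" endpoint (GET /api/v1/dags/{dag_id}/dagRuns/{dag_run_id}/taskInstances) returns every task in a given run together with its state, so one call answers the whole-DAG question. A small sketch, where the host and auth token are placeholders:

```python
import json
from urllib import request

AIRFLOW_URL = "https://your-airflow-host"  # placeholder

def task_instances_url(dag_id, dag_run_id):
    """URL for GET /api/v1/dags/{dag_id}/dagRuns/{dag_run_id}/taskInstances."""
    return "{}/api/v1/dags/{}/dagRuns/{}/taskInstances".format(
        AIRFLOW_URL, dag_id, dag_run_id)

def summarize_states(payload):
    """Map each task_id to its state from the endpoint's JSON response."""
    return {ti["task_id"]: ti["state"] for ti in payload["task_instances"]}

def fetch_task_states(dag_id, dag_run_id, token):
    """One GET returns the state of every task in the run."""
    req = request.Request(
        task_instances_url(dag_id, dag_run_id),
        headers={"Authorization": "Bearer {}".format(token)},  # placeholder auth
    )
    with request.urlopen(req, timeout=30) as resp:
        return summarize_states(json.loads(resp.read()))
```

fetch_task_states("my_dag", "manual__2022-01-01T00:00:00+00:00", token) would then return something like {"extract": "success", "transform": "running"}, with no need to enumerate tasks up front.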
Currently there is no feature that does this out of the box, but you can write some custom code in your DAG to get around it. For example, use a PythonOperator (or a MySQL operator, if your metadata DB is MySQL) to get the status of the last X runs of the DAG.
Then use a BranchPythonOperator to check whether the number of failed runs exceeds your threshold; if it does, use a BashOperator to run the airflow dags pause CLI command.
You can also make this a two-step task by folding the PythonOperator logic into the BranchPythonOperator. This is just one idea; you can use different logic.
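A sketch of the branching piece, keeping the decision in a pure function. The task ids and the threshold are made up for illustration, and the list of recent run states would come from the metadata-DB query (or the REST API) in the preceding step:

```python
def choose_branch(recent_run_states, max_failures=3,
                  pause_task_id="pause_dag", noop_task_id="do_nothing"):
    """Return the task_id to follow next: take the pause branch when
    the number of failed runs reaches the threshold, otherwise no-op.

    recent_run_states: e.g. ["success", "failed", "failed", "failed"]
    """
    failures = sum(1 for state in recent_run_states if state == "failed")
    return pause_task_id if failures >= max_failures else noop_task_id
```

In the DAG this function would be the python_callable of the BranchPythonOperator (fed the run states via XCom or computed inline), with the pause_dag branch being a BashOperator whose bash_command runs airflow dags pause <dag_id>.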
I have started using Airflow to schedule jobs in our company, and I am wondering about best practices.
Is it recommended to put all my tasks in one DAG? If not, what is the right middle ground between one DAG and multiple DAGs?
Our scheduled DAGs run collects, transforms, exports, and other computing programs, so we will continuously have new tasks to add.
Generally, one Python file contains a single DAG with multiple tasks, since the DAG is the logical grouping of the tasks.
If you have multiple DAGs with dependencies between them, you can use TriggerDagRunOperator at the end of DAG1; it triggers DAG2 (in a separate DAG file) once all tasks in DAG1 succeed.
An example of this is:
DAG1: https://github.com/apache/incubator-airflow/blob/master/airflow/example_dags/example_trigger_controller_dag.py
DAG2: https://github.com/apache/incubator-airflow/blob/master/airflow/example_dags/example_trigger_target_dag.py
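The linked examples are from an older Airflow release; a minimal sketch of the same controller/target pattern in Airflow 2 style, with hypothetical task names (dag2 would live in its own file):

```python
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.utils.dates import days_ago

with DAG(dag_id="dag1", schedule_interval="@daily",
         start_date=days_ago(1)) as dag1:
    collect = BashOperator(task_id="collect", bash_command="echo collect")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    # With the default all_success trigger rule, this task only runs
    # (and so dag2 is only triggered) if every upstream task succeeded.
    trigger_dag2 = TriggerDagRunOperator(task_id="trigger_dag2",
                                         trigger_dag_id="dag2")
    collect >> transform >> trigger_dag2
```

Keeping the trigger as the last task means DAG2 stays in its own file with its own schedule_interval=None, while the dependency between the two DAGs remains explicit in DAG1.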