When you trigger an Airflow DAG either through the UI (see screenshot) or the API (https://airflow.apache.org/docs/stable/rest-api-ref.html), you have the option of submitting a JSON configuration. However, the usefulness of this isn't clearly documented as far as I can tell. I have two basic questions:
Is this intended for free-form configuration settings at the application level, or is this only for Airflow configuration variables?
If this is for free-form configuration settings, how (in my code) can I access whatever configuration was passed when the DAG was triggered?
Here is the screenshot where you can provide configuration when triggering a DAG:
Yes, it is intended for application-level configuration.
Example -
{"appConfig":"Test"}
To read it in your DAG:
def read_app_configuration(**kwargs):
    print("Read App Config - Task : Start")
    dag_run = kwargs['dag_run']
    app_config = dag_run.conf.get('appConfig')
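For context, here is a minimal sketch of wiring such a callable into a DAG; the dag_id and task_id are placeholders, and the imports assume Airflow 2 (on 1.10 the operator lives in airflow.operators.python_operator and you would also set provide_context=True):

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago

def read_app_configuration(**kwargs):
    # conf submitted at trigger time is available on the DagRun object
    app_config = kwargs['dag_run'].conf.get('appConfig')
    print(app_config)  # prints "Test" when triggered with {"appConfig": "Test"}

with DAG(
    dag_id='read_config_example',   # placeholder dag_id
    schedule_interval=None,         # run only when triggered externally
    start_date=days_ago(1),
) as dag:
    PythonOperator(
        task_id='read_app_config',
        python_callable=read_app_configuration,
    )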
I have two DAGs in my Airflow scheduler, which were working in the past. After needing to rebuild the Docker containers running Airflow, they are now stuck in the queued state. The DAGs in my case are triggered via the REST API, so no actual scheduling is involved.
Since there are quite a few similar posts, I ran through the checklist of this answer from a similar question:
Do you have the airflow scheduler running?
Yes!
Do you have the airflow webserver running?
Yes!
Have you checked that all DAGs you want to run are set to On in the web ui?
Yes, both DAGs are shown in the web UI and no errors are displayed.
Do all the DAGs you want to run have a start date which is in the past?
Yes, the constructor of both DAGs looks as follows:
from airflow import DAG
from airflow.utils.dates import days_ago

# default_args (args) are defined elsewhere in the file
dag = DAG(
    dag_id='image_object_detection_dag',
    default_args=args,
    schedule_interval=None,
    start_date=days_ago(2),
    tags=['helloworld'],
)
Do all the DAGs you want to run have a proper schedule which is shown in the web ui?
No, I trigger my DAGs manually via the REST API.
If nothing else works, you can use the web UI to click on the DAG, then on Graph View. Now select the first task and click on Task Instance. In the section Task Instance Details you will see why a DAG is waiting or not running.
Here is the output of what this paragraph is showing me:
What is the best way to find the reason why the tasks won't exit the queued state and run?
EDIT:
Out of curiosity, I tried to trigger the DAG from within the web UI, and now both runs executed (the one triggered from the web UI failed, but that was expected, since there was no config set).
I have two independent DAGs, let's say DAG_A and DAG_B, each with multiple tasks.
The two DAGs are in different GCP projects, let's say project-1 and project-2 respectively.
What I want to do is create a third DAG, let's call it DAG_C.
DAG_C will be part of project-1, and will be used to orchestrate DAG_A and DAG_B.
DAG_C should start by triggering DAG_A, and on task_2 success it should trigger DAG_B.
Please take a look at this picture that simplifies the problem:
Overview of the architecture
The question is: would this be possible using the TriggerDagRunOperator, as I can't see any option to change the GCP project ID on that operator?
Also, what would be the best approach to take, assuming that TriggerDagRunOperator will not work in that case?
There is no option to do that with TriggerDagRunOperator, as the operator sees only the scope of the Airflow instance it is in.
Your only option is to use the Airflow REST API.
In DAG_C, the trigger_B task will need to be a PythonOperator that authenticates with the REST API of project-2 and then uses the Trigger new DagRun endpoint to trigger DAG_B. Note that Airflow provides an official Python client for the API, so you can use it for this task.
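A minimal sketch of what that trigger_B task could look like, using the requests library against the stable REST API; the project-2 host, the basic-auth credentials, and the dag_c object are placeholders/assumptions:

import requests
from airflow.operators.python import PythonOperator

PROJECT_2_AIRFLOW_API = 'https://<project-2-airflow-host>/api/v1'  # placeholder host

def trigger_dag_b(**context):
    # POST to the "Trigger new DagRun" endpoint of the project-2 Airflow instance
    response = requests.post(
        f'{PROJECT_2_AIRFLOW_API}/dags/DAG_B/dagRuns',
        json={'conf': {}},
        auth=('<username>', '<password>'),  # placeholder credentials, assuming basic auth
    )
    response.raise_for_status()

trigger_B = PythonOperator(
    task_id='trigger_B',
    python_callable=trigger_dag_b,
    dag=dag_c,  # the DAG_C object, assumed to be defined elsewhere in the file
)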
We recently encountered a scenario where someone mistakenly turned off a production DAG, and we want to get an alert from Datadog whenever a DAG is paused.
I have checked https://docs.datadoghq.com/integrations/airflow/?tab=host
But I have not found any metric for a DAG that indicates whether it is paused or not.
I can run a custom script in Datadog as well.
One method is to exec into the Postgres pod and get the list of paused DAGs:
select * from dag where is_paused=true;
Or is there any other way I can get the list of unpaused DAGs? Also, when a new DAG is added, what is the best way to handle it?
I want an alert whenever a previously unpaused DAG is paused.
If you are on Airflow 2, you can use the REST API to query the state of a DAG:
https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html#operation/get_dag
There is an "is_paused" field in the response.
And if you are not on Airflow 2, you should be. Airflow 1.10 is end-of-life and will not receive any fixes (including critical security fixes), so you should upgrade as soon as you can.
I have set up Airflow on my local machine. I am trying to access the below Airflow link:
http://localhost:8080/api/experimental/test/
I am getting the Airflow 404 page (lots of circles).
I have tried setting auth_backend to the default, but no luck.
What changes do I need to make in airflow.cfg to be able to make REST API calls to Airflow for triggering DAGs?
The experimental API is disabled by default in Airflow 2. It was used in 1.10, but it has been deprecated and is disabled by default in Airflow 2. Instead, you should use the fully-fledged REST API, which uses a completely different URL scheme:
https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html
In the Airflow UI you can even browse and try the API (just look at the Airflow menus).
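As an illustration, here is a minimal sketch of triggering a DAG through the stable REST API with the requests library; it assumes basic auth is enabled in airflow.cfg (for example auth_backend = airflow.api.auth.backend.basic_auth in the [api] section), and the dag_id and credentials are placeholders:

import requests

response = requests.post(
    'http://localhost:8080/api/v1/dags/example_dag/dagRuns',  # note /api/v1, not /api/experimental
    json={'conf': {}},
    auth=('<username>', '<password>'),  # placeholder credentials
)
print(response.status_code, response.json())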
I am getting started with Apache Airflow and trying to set up an event-driven DAG in Airflow. My event is a file landing in a Linux directory. This file can land multiple times throughout the day. I am using the FileSensor operator for file monitoring.
My requirement is that every time the file lands (with the same name) in the directory, the DAG should kick off.
I was reading the official scheduling documentation, and based on my understanding I see that with the option None I can have my DAG triggered externally based on an event, and it can be triggered multiple times throughout the day based on that external event.
Is my understanding correct? The official documentation doesn't have detailed information on it.
https://airflow.apache.org/scheduler.html?highlight=scheduling
That is correct. Setting schedule_interval to None means Airflow will never automatically schedule a run of the DAG.
You can trigger dag_runs externally in a few different ways (see the sketch after this list):
through the Airflow CLI
using the local client from within a Python script
through the Airflow REST API
manually via the trigger button in the Web UI
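As one example, here is a minimal sketch of triggering such a DAG from an external Python script via the local client; the dag_id and file path are placeholders, and the import path assumes Airflow 2:

from airflow.api.client.local_client import Client

client = Client(None, None)  # uses the local Airflow installation's configuration
client.trigger_dag(
    dag_id='my_event_driven_dag',                    # placeholder dag_id
    conf={'filepath': '/data/incoming/myfile.csv'},  # optional payload, readable via dag_run.conf
)

# Roughly equivalent CLI call (Airflow 2):
#   airflow dags trigger my_event_driven_dag --conf '{"filepath": "/data/incoming/myfile.csv"}'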