report airflow dag run statistics after completion - report

whilst the gantt chart is great, it's not exactly easy to summarize. I would like to report all its data over to something like influxdb (and present in grafana).
so the question is:
1) is there a general hook which is called after each and every completed dagrun?
2) how can grab the (eg gantt) data for each dagrun after it has completed (of even just how long the dag took to run)?
cheers!

If your airflow is connected to Postgres, then you can add the airflow Postgres DB connection in grafana (https://grafana.com/docs/features/datasources/postgres/)

Related

Correct Approach For Airflow DAG Project

I am trying to see if Airflow is the right tool for some functionality I need for my project. We are trying to use it as a scheduler for running a sequence of jobs
that start at a particular time (or possibly on demand).
The first "task" is to query the database for the list of job id's to sequence through.
For each job in the sequence send a REST request to start the job
Wait until job completes or fails (via REST call or DB query)
Go to next job in sequence.
I am looking for recommendations on how to break down the functionality discussed above into an airflow DAG. So far my approach would :
create a Hook for the database and another for the REST server.
create a custom operator that handles the start and monitoring of the "job" (steps 2 and 3)
use a sensor to poll handle waiting for job to complete
Thanks

Send an alert when a dag did not run google cloud

I have a DAG in Airflow where the run is not scheduled, but triggered by an event. I would like to send an alert when the DAG did not run in the last 24 hours. My problem is I am not really sure which tool is the best for the task.
I tried to solve it with the Logs Explorer, I was able to write a quite good query filtering by the textPayload, but it seems that tool is designed to send the alert when a specific log is there, not when it is missing. (Maybe I missed something?)
I also checked Monitoring where I could set up an Alert when logs are missing, however in this case I was not able to write any query where I can filter logs by textPayload.
Thank you in advance if you can help me!
You could set up a separate alert DAG that notifies you if other DAGs haven't run in a specified amount of time? To get the last runtime of a DAG, use something like this:
from airflow.models import DagRun
dag_runs = DagRun.find(dag_id=dag_id)
dag_runs.sort(key=lambda x: x.execution_date, reverse=True)
Then you can use dag_runs[0] and compare with the current server time. If the date difference is greater than 24h, raise an alert.
I was able to do it in the monitoring. I did not need the filtering query which I used in the Logs Explorer. I needed to create an Alerting Policy, filtered by workflow_name, task_name and location. In the configure trigger section I was able to choose "Metric absence" with a 1 day absence time, so I resolved my old query with this.
Of course, it could be solved with setting up a new DAG, but setting up an Alerting Policy seems more easier.

Sharing information between DAGs in airflow

I have one dag that tells another dag what tasks to create in a specific order.
Dag 1 -> a file that has a task order
This runs every 5 minutes or so to keep this file fresh.
Dag 2 -> runs the task
this runs daily.
How can I pass this data between the two DAGs using Airflow.
Solutions and problems
The problem with using Airflow Variables is that I cannot set them at runtime.
The problem with using Xcoms is that they can only be run during the task stage and once the tasks are created in Dag 2, they're set and cannot be changed correct?
The problem with pushing the file to s3 is that the airflow instance doesn't have permission to pull from s3 due to security reasons decided by a team that I have no control over.
So what can I do? What are some choices I have?
What is the file format of the output from the 1st DAG? I would recommend the following workflow
Dag 1 -> Update the tasks order and store it in a yaml or json file inside the airflow environment.
Dag 2 -> Read the file to create the required tasks and run them daily.
You need to understand that airflow is constantly reading your dag files to have the latest configuration, so no extra step would be required.
I have had a similar issue in the past and it largely depends on your setup.
If you are running Airflow on Kubernetes this might work.
You create a PV(Persistent Volume) and PVC
You start your application with a KubernetesOperator and mount the PVC to it.
You store the result on the PVC.
You mount the PVC to the other pod.

How to export lastest airflow dag and task status in tabular format for specific owners

Airflow provides rest API functionality to extracts dag/task status.
https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html#section/Trying-the-API
But wondering if there a way to get latest dag/task status of all dags w.r.t dag owner only without specifying it manually for each dag id.
This will help us for creating a workflow dashboard for business users.
You can make use of dag, dag_run and task_instsnce tables in airflow metadata database. It's fairly straightforward..

Airflow: Mark execution date range task istances as success

I have a DAG that has run tasks for over a decade of execution dates. Now I needed to add another year to the beginning. I googled a little bit and the recommendation was to do this under a new dag_id. Because the old DAG has run already for that named execution date range, I want to mark those in the new DAG as a success. How can I archive this in a convenient way?
Thanks in advance. Have a nice start to this week.
Airflow's backfill feature is designed to do exactly what you're trying to do.
That said, not everyone likes using the feature. For example, if your dag is ordinarily an hourly job, backfilling several years of data in hourly batches might be really inefficient.
So for various reasons, creating a temporary "backfill" dag is not a bad way to go.
And to be clear, with "backfill" dag I refer to a dag you are using for purpose of backfill, while not using airflow backfill feature.
For your "backfill" dag, use the DAG parameters start_date and end_date to control the range of execution_date a dag will create dag runs for.
Then after your "backfill" dag is done with all its runs, you can delete it. Airflow won't know the old task instances are now backfilled, but you may not care about that. If you do, you can update the dag_id manually in the metastore database. And otherwise, your "old" dag has correct metadata for more recent periods.

Resources