We are trying to analyze our DAG tasks over time, and want to be able to query the data in Airflow's (v2) metadata database.
A simple example would be a DAG with a > b > c.
By querying the TaskInstance metadata we can get all the individual task run details for a, b, and c. However, the TaskInstance table doesn't contain the information that c depends on b and b depends on a.
I've looked at the other objects in the metadata database such as DagRun, DagModel, etc., but the upstream/downstream information for each task isn't there either. Is there a way to get the task dependencies from the metadata so that we can join them to the TaskInstance data and track the lineage of task execution by querying the metadata database instead of from the UI?
Try the table named serialized_dag. It contains information about the DAG's task dependencies.
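If you'd rather not parse the JSON in the serialized_dag data column by hand, here is a minimal sketch that uses Airflow's own SerializedDagModel to turn the serialized DAG back into (upstream, downstream) pairs. It assumes Airflow 2.x with DAG serialization enabled (the default) and has to run somewhere Airflow can reach the metadata database; the dag_id is a placeholder.

from airflow.models.serialized_dag import SerializedDagModel

def get_task_dependencies(dag_id):
    """Return (upstream_task_id, downstream_task_id) pairs for one DAG."""
    serialized = SerializedDagModel.get(dag_id)  # reads the serialized_dag row
    if serialized is None:
        return []
    dag = serialized.dag  # deserialized DAG object
    return [
        (task.task_id, downstream_id)
        for task in dag.tasks
        for downstream_id in task.downstream_task_ids
    ]

# For a DAG a >> b >> c this yields [("a", "b"), ("b", "c")], which you can
# load into a helper table and join against task_instance.
print(get_task_dependencies("my_dag"))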
Related
I want to extract statistics on our Airflow processes from its database. One of these statistics is how many DAG runs finished smoothly, without any failures or reruns. Using the try_number column of the task_instance table doesn't help, since it also counts automatic retries. I want to count only the cases in which an engineer had to rerun or resume the DAG run.
Thank you.
If I understand correctly, you want to get all DAG runs that never had a failed task in them? You can do this by excluding every (dag_id, run_id) pair that has an entry in the task_fail table:
SELECT *
FROM dag_run
WHERE (dag_id, run_id) NOT IN (
    SELECT dag_id, run_id
    FROM task_fail
);
(Matching on both dag_id and run_id matters because run_ids such as scheduled__2023-01-01T00:00:00+00:00 are shared across DAGs.) This of course will not catch other interventions, such as an engineer marking a task as successful when it is stuck in a running or deferred state.
One note: as of Airflow 2.5.0 you can add notes to DAG runs and task instances in the UI when manually intervening. Those notes are stored in the dag_run_note and task_instance_note tables.
Let's assume my DAG converts a large data set from CSV format to Parquet. If the DAG fails partway through for some reason, is it possible to restore the progress when I re-run it?
It shouldn't start from scratch after I re-run the DAG.
An Airflow DAG is a collection of tasks, organized in a way that reflects their relationships and dependencies. So if you have a DAG with 3 tasks, A -> B -> C, and task C fails, you can re-run just C without re-running A and B. But if you re-run the whole DAG, that means you re-run task A along with all its downstream tasks (B and C).
If you want to restore progress within a task, you can do that based on your job logic, but this is not related to Airflow; it depends only on the technology you use and the logic you want to implement. For example, if your dataset consists of multiple files, you can keep a state store on cloud storage or in a database that records the processing state of each file; if a file is already processed, you skip it and move on to the next one.
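A minimal sketch of that state-store idea, assuming the dataset is a directory of CSV files on local disk and using pandas (with a Parquet engine installed) plus a simple JSON manifest as the checkpoint; all paths and file names here are hypothetical.

import json
import os

import pandas as pd

MANIFEST = "/data/state/converted_files.json"  # hypothetical checkpoint location

def load_manifest():
    if os.path.exists(MANIFEST):
        with open(MANIFEST) as f:
            return set(json.load(f))
    return set()

def convert_csv_to_parquet(csv_dir="/data/raw", parquet_dir="/data/parquet"):
    done = load_manifest()
    for name in sorted(os.listdir(csv_dir)):
        if not name.endswith(".csv") or name in done:
            continue  # skip files already converted in a previous (failed) run
        df = pd.read_csv(os.path.join(csv_dir, name))
        df.to_parquet(os.path.join(parquet_dir, name.replace(".csv", ".parquet")))
        done.add(name)
        # checkpoint after every file so a crash loses at most one file of work
        with open(MANIFEST, "w") as f:
            json.dump(sorted(done), f)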
I have multiple ingestion DAGs -> 'DAG IngD1', 'DAG IngD2', 'DAG IngD3', and so on, which ingest data for individual tables.
After the ingestion DAGs have completed successfully, I want to run a single transformation DAG -> 'DAG Tran'. That means 'DAG Tran' should be triggered only when all the ingestion DAGs 'DAG IngD1', 'DAG IngD2' and 'DAG IngD3' have successfully finished.
If I use the ExternalTaskSensor operator to achieve this, the external_dag_id parameter is a string and not a list. Does that mean I need three ExternalTaskSensor operators in 'DAG Tran', one for each ingestion DAG? Is my understanding correct, or is there an easier way?
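For what it's worth, a minimal sketch of the approach the question describes: one ExternalTaskSensor per ingestion DAG, generated in a loop so the sensors don't have to be written out by hand. DAG ids, the schedule and the placeholder transform task are hypothetical, it assumes Airflow 2.3+ for EmptyOperator, and by default the sensors expect all DAGs to share the same logical date.

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(dag_id="dag_tran", start_date=datetime(2023, 1, 1), schedule_interval="@daily") as dag:
    transform = EmptyOperator(task_id="transform")  # stand-in for the real transformation

    for ing_dag_id in ["dag_ingd1", "dag_ingd2", "dag_ingd3"]:
        wait = ExternalTaskSensor(
            task_id=f"wait_for_{ing_dag_id}",
            external_dag_id=ing_dag_id,
            external_task_id=None,  # wait for the whole DAG run, not a single task
        )
        wait >> transform

So yes, it is one sensor per upstream DAG, but generating them in a loop keeps that manageable.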
I am currently facing the same DAG dependency management problem.
My solution is to create a mediator DAG that uses TriggerDagRunOperator tasks to express the cross-DAG dependencies.
# create mediator_dag to show dag dependency
from datetime import datetime
from airflow.decorators import dag
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

@dag(start_date=datetime(2023, 1, 1), schedule_interval=None)
def mediator_dag():
    trigger_dag_a = TriggerDagRunOperator(task_id="trigger_dag_a", trigger_dag_id="a", wait_for_completion=True)
    trigger_dag_b = TriggerDagRunOperator(task_id="trigger_dag_b", trigger_dag_id="b")
    trigger_dag_c = TriggerDagRunOperator(task_id="trigger_dag_c", trigger_dag_id="c")
    # taskflow: b and c are triggered only after a has completed
    trigger_dag_a >> [trigger_dag_b, trigger_dag_c]

mediator_dag()
This article might help you: Cross-DAG dependencies in Apache Airflow.
We have created a task for a sensor operation, but the task name is built dynamically, i.e. f"{table_name}_s3_exists". We have a scenario where we have to check a table's location twice, but if the task already exists we don't have to create the sensor again. Is there a way to find out whether a task exists within the DAG while it is being built?
The CLI command
airflow tasks list [-h] [-S SUBDIR] [-t] [-v] dag_id
will give you the list of all tasks in the given DAG.
https://airflow.apache.org/docs/apache-airflow/stable/cli-and-env-variables-ref.html#list_repeat6
You can also use the REST API to get the same info:
https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html#operation/get_tasks
You could try the get_tasks endpoint in the Airflow REST API. The endpoint returns a lot of information for tasks in a given DAG.
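A minimal sketch of calling that endpoint from Python, assuming the stable REST API is enabled with basic auth and the webserver runs at localhost:8080; the URL, credentials and the dag/task ids are placeholders.

import requests

resp = requests.get(
    "http://localhost:8080/api/v1/dags/my_dag/tasks",  # GET .../dags/{dag_id}/tasks
    auth=("airflow", "airflow"),
)
resp.raise_for_status()
task_ids = [t["task_id"] for t in resp.json()["tasks"]]
# e.g. check whether a dynamically named sensor task already exists
print("my_table_s3_exists" in task_ids)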
On Airflow, we are currently using {{ prev_execution_date_success }} at the DAG level to execute queries.
I was wondering if it is possible to have it per task (i.e. retrieving the last successful execution date of a particular task rather than of the whole DAG).
Thanks for your help
From the current DAG run you can access the task instance and look up the previous run of the same task that ended in the success state.
from airflow.utils.state import State

def my_python_callable(**context):
    # the TaskInstance of the current run is available in the task context
    ti = context["ti"]
    # previous TaskInstance of this same task that ended in success
    prev_ti = ti.get_previous_ti(state=State.SUCCESS)
    if prev_ti is not None:
        print(prev_ti.execution_date)
Note that get_previous_ti returns a TaskInstance object, so you can access anything related to that task run (for example its execution_date).