We have a relatively complex dynamic DAG as part of our ETL. The DAG contains hundreds of transformations and is created programmatically from a set of YAML files, roughly along the lines of the sketch below. It changes over time: new tasks are added, the queries executed by tasks are changed, and even the relationships between tasks change.
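A simplified sketch of the generation (the YAML layout, paths, and operator are illustrative stand-ins for what we actually use):

```python
from datetime import datetime
from pathlib import Path

import yaml
from airflow import DAG
from airflow.operators.bash import BashOperator  # stand-in for our transformation operator

CONFIG_DIR = Path("/opt/etl/dag_config")  # illustrative location of the YAML files

with DAG(
    dag_id="etl_transformations",  # illustrative
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    specs = [yaml.safe_load(p.read_text()) for p in sorted(CONFIG_DIR.glob("*.yaml"))]

    # Each YAML file describes one transformation: its id, its query, and its upstream task ids.
    tasks = {
        spec["task_id"]: BashOperator(
            task_id=spec["task_id"],
            bash_command=f"run_query '{spec['query']}'",  # placeholder for the real work
        )
        for spec in specs
    }

    # Wire up the relationships declared in the YAML files.
    for spec in specs:
        for upstream in spec.get("depends_on", []):
            tasks[upstream] >> tasks[spec["task_id"]]
```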
I know that a new DAG should be created each time it changes in this way, and that DAG versioning is not supported by Airflow, but this is a real use case and I would like to hear any suggestions on how to handle it.
One of the most important requirements, and the reason we want to tackle this, is that we must be aware of DAG versions when we clear or backfill some moment in the past. This effectively means that when the DAG is executed for a past moment, it must run the version of the DAG from that moment, not the newest one.
Any suggestions are more than welcome.
It looks like SubDagOperator is being deprecated in Airflow in favor of TaskGroups. Since TaskGroups are only a UI grouping concept and all tasks still belong to the same DAG (see the sketch below), I think some functionality, such as being able to backfill an individual task group (which you can do with SubDAGs), will not be available.
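For reference, this is the kind of grouping I mean; a minimal sketch, assuming Airflow 2.x (the DAG, group, and task IDs are illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.task_group import TaskGroup

with DAG(
    dag_id="taskgroup_example",  # illustrative
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    start = BashOperator(task_id="start", bash_command="echo start")

    # Unlike a SubDAG, this group is only a UI grouping: its tasks
    # (section_a.extract, section_a.load) still belong to this DAG.
    with TaskGroup(group_id="section_a") as section_a:
        extract = BashOperator(task_id="extract", bash_command="echo extract")
        load = BashOperator(task_id="load", bash_command="echo load")
        extract >> load

    start >> section_a
```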
Does anyone know if this is possible, and if not, what could be a workaround to this limitation?
Thanks!
I have a DAG that runs in a multi-tenant scenario. The tenant ID gets set in dag_run.conf when the DAG is triggered. I want to ensure that there is at most one active run per tenant at a time, but potentially many active runs simultaneously across all tenants.
So far I have found the max_active_runs setting, but that would require me to set up one DAG per tenant, which I am trying to avoid.
Is there a way to achieve this in Airflow, or am I approaching the problem in the wrong way?
You are using dag_run.conf, which means that you are triggering your DAGs manually. Currently there is a bug (Airflow 2.0.1): max_active_runs isn't respected for manual runs (see the GitHub issue).
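One possible workaround, sketched below under the assumption of Airflow 2.x (the DAG ID, guard task, and "tenant_id" conf key are illustrative), is to start the DAG with a guard task that short-circuits the run if another running DagRun already carries the same tenant ID:

```python
from datetime import datetime

from airflow import DAG
from airflow.models import DagRun
from airflow.operators.python import ShortCircuitOperator
from airflow.utils.state import State


def _no_other_run_for_tenant(**context):
    """Return False (skip the rest of this run) if the tenant is already being processed."""
    my_run = context["dag_run"]
    tenant_id = (my_run.conf or {}).get("tenant_id")  # illustrative conf key
    running = DagRun.find(dag_id=my_run.dag_id, state=State.RUNNING)
    clashes = [
        r for r in running
        if r.run_id != my_run.run_id and (r.conf or {}).get("tenant_id") == tenant_id
    ]
    return not clashes


with DAG(
    dag_id="multi_tenant_etl",  # illustrative
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,     # triggered manually with conf
    catchup=False,
) as dag:
    guard = ShortCircuitOperator(
        task_id="one_run_per_tenant_guard",
        python_callable=_no_other_run_for_tenant,
    )
    # guard >> <the actual per-tenant work>
```

Note that this skips the clashing run rather than queueing it, so the caller would need to re-trigger it later.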
I have a DAG that uses ShortCircuitOperator to skip downstream tasks that are (usually) unnecessary. In a few exceptional circumstances, I would like to still run these downstream tasks, and would prefer not to have to modify my DAG to deal with every possible case. Is there a simple way, in the Airflow UI or via the console, to force downstream tasks to run after they've been skipped?
Answering my own question here: you can manually clear the state of the task instances via the UI, but you have to clear the state of the downstream tasks as well. I was running into issues because I wanted to skip part of the DAG and was trying to clear the state of tasks further downstream, which of course caused them to immediately be set to skipped again. If you really want to skip part of a DAG this way, you can do it; you just need to manually mark the dependencies immediately upstream of the tasks you want to run as succeeded.
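For anyone else hitting this, a minimal sketch of the kind of layout involved, assuming Airflow 2.x (DAG and task IDs are illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import ShortCircuitOperator

with DAG(
    dag_id="short_circuit_example",  # illustrative
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Returns False most of the time, so everything downstream gets skipped.
    check = ShortCircuitOperator(task_id="usually_skip", python_callable=lambda: False)
    heavy_work = BashOperator(task_id="heavy_work", bash_command="echo work")
    report = BashOperator(task_id="report", bash_command="echo report")

    check >> heavy_work >> report

# To force only "report" after a skip: mark "heavy_work" (its immediate upstream)
# as success for that run, then clear "report".
```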
One of my Airflow DAGs runs without any issues most of the time. However, every now and then (roughly every 3+ hours), it "freezes".
In this state, its tasks are not "queued" (see attached image), and the timeouts that exist on specific tasks do not fire either. The only way to get out of such a scenario is to manually mark that run as failed.
This failure is always followed by another immediate failure (see the blank cells in the image).
What should I look for in the logs and/or what are other ways of debugging this?
Found the issue: it was just some tasks running longer than the schedule interval, so two runs ended up executing in parallel.
I was hoping that in such cases Airflow would provide some kind of feedback in the logs or UI, but that isn't the case.
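In case it helps anyone else, here is a minimal sketch of the DAG-level settings I would now use to stop runs from overlapping, assuming Airflow 2.x (the DAG ID, schedule, and timeout value are illustrative):

```python
from datetime import datetime, timedelta

from airflow import DAG

with DAG(
    dag_id="long_running_etl",  # illustrative
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    max_active_runs=1,                     # never start a second run while one is active
    dagrun_timeout=timedelta(minutes=55),  # fail a stuck run instead of letting it "freeze"
) as dag:
    ...  # task definitions go here
```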
Resolved.
We have some applications running and we want to start using Airflow. From the documentation it seems that the only way to start a DAG is via the command line. Is this true?
For example, we have a Flask server running and we want to start a workflow controlled by Airflow. How can we achieve this? Is there an API to trigger, e.g., "Run DAG now with parameters x,y,h"?
There are a couple of ways to achieve this with Airflow. Which one, if any, is suitable for you depends on your situation. Two suggestions that come to mind:
Use triggered DAGs. Python jobs running in the background can trigger a DAG to be executed when an event happens. Have a look at example_trigger_controller_dag.py and example_trigger_target_dag.py in the repository: GitHub Airflow (there is a sketch after this list).
Use sensor tasks: there are some predefined sensors available which you can use to listen for specific events in a data source, for example. If the existing ones do not satisfy your needs, Airflow should be adaptable enough to let you implement your own sensor: Airflow Sensor
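A rough sketch of the first suggestion, a controller DAG that triggers a target DAG and passes parameters through conf, assuming Airflow 2.x import paths (the DAG and task IDs and the conf payload are illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="controller_dag",  # illustrative
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,   # run only when the triggering event occurs
    catchup=False,
) as dag:
    trigger = TriggerDagRunOperator(
        task_id="trigger_target",
        trigger_dag_id="target_dag",  # the workflow you want to start now
        conf={"x": 1, "y": 2},        # available as dag_run.conf in the target DAG
    )
```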
After reading your question, I understand your use case as:
you wish to run/trigger a DAG from an HTTP server.
--> You can just use the provided Airflow webserver (localhost:8080/), from which you can trigger/run the DAG manually.
Also, you can go HERE, which is still in experimental mode, and use the API as provided (a sketch follows).
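For example, a rough sketch of triggering a DAG from a Python/Flask process via the experimental REST endpoint, assuming Airflow 1.10's experimental API is enabled (the host, DAG ID, and conf payload are illustrative):

```python
import requests

# POST to the experimental endpoint to trigger "my_dag" with parameters.
resp = requests.post(
    "http://localhost:8080/api/experimental/dags/my_dag/dag_runs",
    json={"conf": {"x": 1, "y": 2}},  # available as dag_run.conf inside the DAG
)
resp.raise_for_status()
print(resp.json())
```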
Please elaborate more so that I can understand the question better.