I read the API reference and couldn't find anything on it. Is that possible?
Currently there is no feature that does this out of the box, but you can write some custom code in your DAG to work around it. For example, use a PythonOperator (or the MySQL operator, if your metadata database is MySQL) to get the status of the last X runs of the DAG.
Then use a BranchPythonOperator to check whether that count exceeds X; if it does, use a BashOperator to run the airflow pause command via the CLI.
You can also make this a two-step task by moving the PythonOperator's logic into the BranchPythonOperator. This is just one idea; you can use different logic, for example along the lines of the sketch below.
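For illustration, here is a minimal sketch of that idea, assuming Airflow 2.x (where DagRun.find() and the airflow dags pause CLI command are available); the DAG id, task ids and the "count failed runs" criterion are made-up placeholders:

from datetime import datetime

from airflow import DAG
from airflow.models import DagRun
from airflow.operators.bash import BashOperator
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator

MAX_FAILED_RUNS = 3  # the "X" from the answer above


def choose_branch(**context):
    # Look up failed runs of this DAG and decide whether to pause it.
    failed_runs = DagRun.find(dag_id=context["dag"].dag_id, state="failed")
    return "pause_dag" if len(failed_runs) >= MAX_FAILED_RUNS else "keep_going"


with DAG("my_dag_id", start_date=datetime(2022, 1, 1), schedule="@daily", catchup=False) as dag:
    check_history = BranchPythonOperator(task_id="check_history", python_callable=choose_branch)
    pause_dag = BashOperator(task_id="pause_dag", bash_command="airflow dags pause my_dag_id")
    keep_going = EmptyOperator(task_id="keep_going")
    check_history >> [pause_dag, keep_going]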
I have an Airflow DAG that is triggered externally via the CLI.
I have a requirement to change the order of execution of tasks based on a Boolean parameter that I would be getting from the CLI.
How do I achieve this?
I understand dag_run.conf can only be used in a template field of an operator.
Thanks in advance.
You cannot change task dependencies with a runtime parameter.
However, you can pass a runtime parameter (via dag_run.conf) and, depending on its value, have tasks executed or skipped. For that you need to place operators in your workflow that can handle this logic, for example ShortCircuitOperator or BranchPythonOperator.
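As a rough illustration of that suggestion (the DAG and task names are made up, and it assumes a recent Airflow 2.x), a BranchPythonOperator can read the Boolean from dag_run.conf and decide which path runs:

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator


def pick_branch(**context):
    # Read the Boolean passed on the CLI, e.g.:
    #   airflow dags trigger my_dag --conf '{"run_a_first": true}'
    conf = context["dag_run"].conf or {}
    return "path_a" if conf.get("run_a_first", False) else "path_b"


with DAG("my_dag", start_date=datetime(2022, 1, 1), schedule=None, catchup=False) as dag:
    choose = BranchPythonOperator(task_id="choose", python_callable=pick_branch)
    path_a = EmptyOperator(task_id="path_a")
    path_b = EmptyOperator(task_id="path_b")
    choose >> [path_a, path_b]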
I have two independent DAGs, let's say DAG_A and DAG_B, each with multiple tasks.
The two DAGs are in different GCP projects, let's say project-1 and project-2 respectively.
What I want to do is create a third DAG, let's call it DAG_C.
DAG_C will be part of project-1 and will be used to orchestrate DAG_A and DAG_B.
DAG_C should start by triggering DAG_A, and on task_2's success it should trigger DAG_B.
Please take a look at this picture that simplifies the problem:
Overview of the architecture
The question is: would this be possible using the TriggerDagRunOperator, as I can't see any option to change the GCP project id on that operator?
Also, what would be the best approach to take, assuming that TriggerDagRunOperator will not work in that case?
There is no option to do that with TriggerDagRunOperator, as the operator only sees the scope of the Airflow instance it runs in.
Your only option is to use the Airflow REST API.
In DAG_C, the trigger_B task will need to be a PythonOperator that authenticates with the REST API of project-2 and then uses the "Trigger new DagRun" endpoint to trigger DAG_B. Note that Airflow provides an official Python client for the API, so you can use it for this task.
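As a rough sketch (the URL and credentials are placeholders, and the auth mechanism is an assumption; Cloud Composer deployments typically require Google identity tokens rather than basic auth), the trigger_B task could call the stable REST API endpoint POST /api/v1/dags/{dag_id}/dagRuns of the project-2 webserver:

from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

PROJECT_2_AIRFLOW_URL = "https://<project-2-airflow-webserver>"  # placeholder


def trigger_dag_b(**context):
    # POST to the project-2 Airflow REST API to start a new run of DAG_B.
    response = requests.post(
        f"{PROJECT_2_AIRFLOW_URL}/api/v1/dags/DAG_B/dagRuns",
        json={"conf": {}},
        auth=("<user>", "<password>"),  # placeholder; adapt to your deployment's auth
    )
    response.raise_for_status()


with DAG("DAG_C", start_date=datetime(2022, 1, 1), schedule=None, catchup=False) as dag:
    trigger_B = PythonOperator(task_id="trigger_B", python_callable=trigger_dag_b)
    # ... trigger_A >> wait_for_task_2 >> trigger_B, per the picture above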
I am trying to write a test case where I:
instantiate a collection of (Python)Operators (patching some of their dependencies with unittest.mock.patch)
arrange those Operators in a DAG
run that DAG
make assertions about the calls to various mocked downstreams
I see from here that running a DAG is not as simple as calling dag.run - I should instantiate a local_client and call trigger_dag on that. However, the resulting code constructs its own DagBag and does not accept any parameter that would allow me to pass in my manually constructed DAG, so I cannot see how to run this DAG with local_client.
I see a couple of options here:
I could declare my testing DAG in a separate folder, as specified by DagModel.get_current(dag_id).fileloc, so that my DAG will get picked up by trigger_dag and so run - but this seems pretty indirect, and also I doubt that I'd be able to cleanly reference the injected mocks from my test code.
I could directly call api.common.experimental.trigger_dag._trigger_dag, which has a dag_bag argument. Both the experimental in the name and the underscore-prefixed method name suggest that this would be A Bad Idea.
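For what it's worth, one pragmatic workaround is to bypass trigger_dag entirely and execute the operators directly in the test, which keeps the injected mocks reachable from the test code. A minimal sketch, with a made-up callable that happens to use requests so that there is something real to patch:

from datetime import datetime
from unittest import mock

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator


def notify():
    # Stand-in for real business logic that calls an external service.
    requests.post("https://example.com/notify", json={"status": "done"})


def test_dag_calls_downstream():
    with mock.patch("requests.post") as mocked_post:
        with DAG("test_dag", start_date=datetime(2022, 1, 1), schedule=None) as dag:
            task = PythonOperator(task_id="notify", python_callable=notify)
        # Execute the operator directly instead of going through the scheduler.
        task.execute(context={})
        mocked_post.assert_called_once()

This does not exercise the scheduler or dependency resolution, so it is a unit-level compromise rather than a true DAG run.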
I have a DAG that I want to run multiple times, each run starting after the previous one completes successfully. For example, I want to run it 10 times and then stop. Is there a way to accomplish this? I looked into scheduling with cron, but that doesn't seem clean, and triggering the DAG via the UI multiple times doesn't work either (the runs execute in parallel).
I found a solution for my use case. It combines depends_on_past=True (mentioned by @Hitesh Gupta) with the following setting in your airflow.cfg file:
# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 1
This allowed us to have only one active DAG run at a time and to not start the next DAG run if there was a failure in the previous run. I tested this on Airflow version 1.10.1.
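For illustration, roughly the same behaviour can be expressed per DAG (names and dates are made up, and this uses Airflow 2.x syntax): max_active_runs=1 on the DAG is the per-DAG counterpart of max_active_runs_per_dag in airflow.cfg, and depends_on_past=True in default_args makes each task wait until its instance in the previous run succeeded.

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

default_args = {"depends_on_past": True}  # each task waits for its previous run's instance to succeed

with DAG(
    "serial_dag",
    default_args=default_args,
    start_date=datetime(2022, 1, 1),
    schedule="@daily",
    catchup=False,
    max_active_runs=1,  # per-DAG counterpart of max_active_runs_per_dag
) as dag:
    EmptyOperator(task_id="do_work")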
You can, in addition to supplying a start_date, provide your DAG with an end_date.
Quoting the docstring
:param start_date: The timestamp from which the scheduler will attempt
to backfill
:type start_date: datetime.datetime
:param end_date: A date beyond which your DAG won't run, leave to None for open ended scheduling
:type end_date: datetime.datetime
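For example (the DAG name and dates are illustrative; this uses the Airflow 2.4+ schedule argument, older versions use schedule_interval):

from datetime import datetime

from airflow import DAG

dag = DAG(
    "bounded_dag",
    start_date=datetime(2022, 1, 1),
    end_date=datetime(2022, 1, 10),  # the scheduler creates no runs after this date
    schedule="@daily",
)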
While unrelated, also have a look at the following scheduler settings in airflow.cfg, as mentioned in this article:
run_duration
num_runs
UPDATE-1
In his article Use apache airflow to run task exactly once, @Andreas P describes a clever technique which I believe can be adapted to your use case. While even that won't be a very tidy solution, it would at least allow you to specify beforehand the number of runs (an integer) for the DAG instead of an end_date.
Alternatively (assuming you implement the above approach), rather than baking this skip-DAG-after-max-runs functionality into each DAG, you can create a separate orchestrator DAG that disables a given DAG after its maximum number of runs has passed.
You have to set the depends_on_past property. It is set in the DAG's default arguments section and refers to the task's instance in the previous DAG run. This should fix your problem.
So here's a stupid idea...
I created (many) DAGs in Airflow, and it works. However, I would like to package it up somehow so that I could run a single DAG Run without having Airflow installed; i.e. have it self-contained so I don't need all the web servers, databases, etc.
I mostly instantiate new DAG Runs with trigger dag anyway, and I noticed that the overhead of running Airflow appears quite high (workers have high loads while doing essentially nothing, and it can sometimes take tens of seconds before dependent tasks are queued).
I'm not too bothered about all the logging, etc.
You can create a script which executes Airflow operators, although this loses all the metadata that Airflow provides. You still need to have Airflow installed as a Python package, but you don't need to run any webservers, etc. A simple example could look like this:
from dags.my_dag import operator1, operator2, operator3

def main():
    # execute pipeline
    # operator1 -> operator2 -> operator3
    operator1.execute(context={})
    operator2.execute(context={})
    operator3.execute(context={})


if __name__ == "__main__":
    main()
It sounds like your main concern is the resources wasted by the idling workers, more so than the overhead of Airflow itself.
I would suggest running Airflow with the LocalExecutor on a single box. This will give you the benefits of concurrent execution without the hassle of managing workers.
As for the database - there is no way to remove the database component without modifying the Airflow source itself. One alternative would be to leverage the SequentialExecutor with SQLite, but this removes the ability to run concurrent tasks and is not recommended for production.
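For reference, a single-box setup along those lines might look like this in airflow.cfg (the connection string is a placeholder, and in Airflow 2.3+ sql_alchemy_conn lives under a [database] section rather than [core]):

[core]
executor = LocalExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost/airflow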
First, I'd say you need to tweak your Airflow setup.
But if that's not an option, then another way is to write your main logic in code outside the DAG (this is also best practice). For me this makes the code easier to test locally as well.
It's pretty easy to write a shell script that ties a few processes together.
You won't get the benefit of operators or dependencies, but you can probably script your way around it. And if you can't, just use Airflow.
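As a sketch of that "logic outside the DAG" pattern (module and function names are made up): keep the real work in a plain Python module that the DAG merely wraps, so the same code can run without Airflow.

# my_pipeline/logic.py - plain Python, no Airflow imports
def extract_and_load():
    print("doing the actual work")


if __name__ == "__main__":
    # Run the pipeline directly, without any Airflow infrastructure.
    extract_and_load()

The DAG then just wraps extract_and_load() in a PythonOperator, while a shell script or cron entry can run python -m my_pipeline.logic on its own.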
You can overload the imported Airflow modules if they fail to import. So, for example, if you are using from airflow.decorators import dag, task, you can overload the @dag and @task decorators:
from datetime import datetime

try:
    from airflow.decorators import dag, task
except ImportError:
    # Fall back to no-op decorators when Airflow is not installed.
    mock_decorator = lambda f=None, **d: f if f else lambda x: x
    dag = mock_decorator
    task = mock_decorator


@dag(schedule=None, start_date=datetime(2022, 1, 1), catchup=False)
def mydag():
    @task
    def task_1():
        print("task 1")

    @task
    def task_2(input):
        print("task 2")

    task_2(task_1())


_dag = mydag()