We have some applications running and we want to start using Airflow. From the documentation it seems that the only way to start a DAG is from the command line. Is this true?
For example, we have a Flask server running and we want to start some workflow controlled by Airflow. How can we achieve this? Is there an API to trigger something like "Run DAG now with parameters x, y, h"?
There are a couple of ways to achieve this with Airflow. Which one is suitable (if any at all) depends on your situation. Two suggestions come to mind:
Use triggered DAGs: Python jobs running in the background may trigger a DAG to be executed when an event happens. Have a look at example_trigger_controller_dag.py and example_trigger_target_dag.py in the repository: GitHub Airflow (a minimal sketch follows after these two suggestions).
Use sensor tasks: there are some predefined sensors available which you can use to listen for specific events in a data source, for example. If the existing ones do not satisfy your needs, Airflow should be adaptable enough to let you implement your own sensor: Airflow Sensor
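For the first suggestion, here is a minimal sketch of a controller DAG that fires another DAG with parameters via TriggerDagRunOperator (assuming Airflow 2.x import paths; the controller and target DAG ids are made-up placeholders):

```python
# Minimal sketch of the "triggered DAG" approach (Airflow 2.x import paths assumed;
# "controller_dag" and "target_dag" are hypothetical DAG ids).
from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="controller_dag",            # hypothetical controller DAG
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,             # only runs when triggered itself
) as dag:
    # Fires a run of "target_dag" and passes parameters via conf,
    # mirroring "Run DAG now with parameters x, y, h" from the question.
    trigger = TriggerDagRunOperator(
        task_id="trigger_target",
        trigger_dag_id="target_dag",    # hypothetical target DAG id
        conf={"x": 1, "y": 2, "h": 3},
    )
```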
After reading your question, I understand your use case as:
you wish to run/trigger a DAG from an HTTP server
--> you can just use the provided Airflow webserver (localhost:8080/), from which you can trigger/run the DAG manually
You can also go HERE, which is still in experimental mode, and use the API as provided.
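For completeness, here is a rough sketch of triggering a DAG through the experimental REST API from Python (assuming the webserver runs at localhost:8080 with API authentication disabled; the DAG id and conf keys are placeholders):

```python
# Rough sketch: trigger a DAG run via the experimental REST API.
# Assumes webserver at localhost:8080 and no API auth; "target_dag" is a placeholder.
import requests

resp = requests.post(
    "http://localhost:8080/api/experimental/dags/target_dag/dag_runs",
    json={"conf": {"x": 1, "y": 2, "h": 3}},  # parameters passed to the run
)
print(resp.status_code, resp.json())
```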
Please elaborate more so I can understand the question better.
Related
I am trying to explore how to add a long-running-task monitoring feature to Airflow in the simplest way, e.g. via a plugin or something similar.
One possible approach is to consider the runtimes of the last 15 days (sorting the data and excluding the edge values); a rough sketch of that idea is below.
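As a rough, hypothetical sketch of that idea (assuming access to Airflow's metadata DB through its ORM; the function name and the trimming choice are mine, not an existing Airflow feature):

```python
# Sketch: derive a runtime baseline for a task from the last N days of successful
# runs, dropping the extreme values. Assumes Airflow's ORM; dag/task ids are placeholders.
from datetime import datetime, timedelta

from airflow.models import TaskInstance
from airflow.settings import Session


def runtime_baseline(dag_id, task_id, days=15, trim=2):
    session = Session()
    since = datetime.utcnow() - timedelta(days=days)
    durations = [
        ti.duration
        for ti in session.query(TaskInstance)
        .filter(
            TaskInstance.dag_id == dag_id,
            TaskInstance.task_id == task_id,
            TaskInstance.state == "success",
            TaskInstance.start_date >= since,
        )
        .all()
        if ti.duration is not None
    ]
    session.close()
    durations.sort()
    # Exclude the edge values before averaging.
    trimmed = durations[trim:-trim] if len(durations) > 2 * trim else durations
    return sum(trimmed) / len(trimmed) if trimmed else None
```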
Below are the related questions I found, which have not been answered adequately. I need your guidance on a stable and simple solution:
Monitoring long lasting tasks in Airflow
Airflow dags - reporting of the runtime for tracking purposes
Any way of monitoring Airflow DAG's execution time?
I am trying to understand how to manage large data with Airflow. The documentation is clear that we need to use external storage rather than XCom, but I cannot find any clean examples of staging data to and from a worker node.
My expectation would be that there should be an operator that can run a staging-in operation, run the main operation, and then stage the data out again.
Is there such an operator or pattern? The closest I've found is the S3 File Transform operator, but it runs an executable to do the transform, rather than a generic operator such as the DockerOperator, which is what we'd want to use.
Other "solutions" I've seen rely on everything running on a single host, and using known paths, which isn't a production ready solution.
Is there an operator that would support data staging, or is there a concrete example of handling large data with Airflow that doesn't rely on each operator being equipped with cloud copying capabilities?
Yes and no. Traditionally, Airflow is mostly an orchestrator, so it does not usually "do" the work itself; it tells other systems what to do. You very rarely need to bring actual data to an Airflow worker; the worker is mostly there to tell other systems where the data is coming from, what to do with it, and where to send it.
There are exceptions (some transfer operators actually download data from one service and upload it to another), so the data does pass through the Airflow node, but this is an exception rather than the rule (the more efficient and better pattern is to invoke an external service to do the transfer and have a sensor wait until it completes).
This is more the "historical" and somewhat "current" way Airflow operates; however, with Airflow 2 and beyond we are expanding this, and it is becoming more and more possible to implement a pattern similar to what you describe. This is where XCom plays a big role.
As of recently, you can develop custom XCom backends that allow for more than metadata sharing; they are also good for sharing the data itself. You can see the docs here: https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html#custom-backends. There is also a nice article from Astronomer about it, https://www.astronomer.io/guides/custom-xcom-backends, and a very nice talk from Airflow Summit 2021 (from last week) presenting it: https://airflowsummit.org/sessions/2021/customizing-xcom-to-enhance-data-sharing-between-tasks/. I highly recommend watching the talk!
Looking at your pattern: the XCom pull is staging-in, the operator's execute() is the operation, and the XCom push is staging-out. A minimal custom backend sketch is below.
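Here is a minimal sketch of such a custom backend that stages values in S3 instead of the metadata DB (assuming the Amazon provider is installed; the bucket name and key layout are made up, and the exact serialize_value/deserialize_value signatures vary slightly between Airflow versions):

```python
# Sketch of a custom XCom backend that stores payloads in S3 and keeps only a
# reference in the metadata DB. Bucket name and key layout are hypothetical.
import json
import uuid

from airflow.models.xcom import BaseXCom
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

BUCKET = "my-xcom-staging-bucket"   # hypothetical bucket
PREFIX = "s3://" + BUCKET + "/"


class S3XComBackend(BaseXCom):
    @staticmethod
    def serialize_value(value):
        # Staging-out: upload the real payload to S3, store only the reference.
        key = f"xcom/{uuid.uuid4()}.json"
        S3Hook().load_string(json.dumps(value), key=key, bucket_name=BUCKET)
        return BaseXCom.serialize_value(PREFIX + key)

    @staticmethod
    def deserialize_value(result):
        # Staging-in: resolve the stored reference back to the actual data.
        ref = BaseXCom.deserialize_value(result)
        if isinstance(ref, str) and ref.startswith(PREFIX):
            key = ref[len(PREFIX):]
            return json.loads(S3Hook().read_key(key=key, bucket_name=BUCKET))
        return ref
```

You would then point Airflow at the backend, e.g. via the AIRFLOW__CORE__XCOM_BACKEND environment variable (the [core] xcom_backend setting).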
This pattern will be reinforced, I think, by upcoming Airflow releases and some cool integrations that are coming, and there will likely be more data-sharing options in the future (but I think they will all be based on a, maybe slightly enhanced, XCom implementation).
I have a DAG that runs in a multi-tenant scenario. The tenant ID gets set in dag_run.conf when the DAG is triggered. I want to ensure that there is at most one active run per tenant at a time, but potentially many active runs simultaneously across all tenants.
So far I have found the max_active_runs setting, but that would require me to actually set up one DAG per tenant, which I am trying to avoid.
Is there a way to achieve this in Airflow, or am I approaching the problem in the wrong way?
You are using dag_run.conf, which means that you are triggering your DAGs manually. Currently there is a bug (Airflow 2.0.1): max_active_runs isn't respected for manual runs (see GitHub issue). A possible workaround is sketched below.
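Until that is fixed, one possible workaround (not an official feature, just a sketch) is a guard task at the start of the DAG that fails fast when another run for the same tenant is already active; it assumes the tenant id is passed as dag_run.conf["tenant_id"]:

```python
# Sketch of a per-tenant guard: fail the run if another run of the same DAG with
# the same tenant_id in its conf is already RUNNING. Key name "tenant_id" is assumed.
from airflow.models import DagRun
from airflow.utils.state import State


def ensure_single_run_per_tenant(dag_run, **_):
    tenant_id = (dag_run.conf or {}).get("tenant_id")
    active = DagRun.find(dag_id=dag_run.dag_id, state=State.RUNNING)
    clashes = [
        r
        for r in active
        if r.run_id != dag_run.run_id
        and (r.conf or {}).get("tenant_id") == tenant_id
    ]
    if clashes:
        raise ValueError(f"Tenant {tenant_id} already has an active run")
```

You would wire this in as the first task (e.g. a PythonOperator) so all downstream tasks only run when the guard succeeds.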
I have configured apache-airflow with a PostgreSQL database, and I have 1 DAG running in my Airflow. It is running successfully, but if the scheduler has any issue, how do I find out about it, and what is the way to check for that? Kindly give some ideas and a solution.
Airflow exposes a /health endpoint for this purpose.
Also check the REST API reference; it has many useful endpoints for common day-to-day tasks like triggering a DAG or returning the latest runs of DAGs.
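For example, a minimal sketch of polling the health endpoint (assuming the webserver is reachable at localhost:8080; adjust the host/port to your deployment):

```python
# Sketch: poll Airflow's /health endpoint (webserver host/port assumed).
import requests

health = requests.get("http://localhost:8080/health").json()
# The payload typically reports metadatabase and scheduler status, e.g.:
# {"metadatabase": {"status": "healthy"},
#  "scheduler": {"status": "healthy", "latest_scheduler_heartbeat": "..."}}
print(health["scheduler"]["status"])
```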
UPDATE-1
Apparently, just because the scheduler is running doesn't necessarily mean that it will actually trigger a DAG; see, for example, this
You can think of it this way: there could be internal bugs or interesting corrupt internal states of Airflow that cause it not to trigger DAGs.
Thus people have gone a step further and schedule a canary DAG (a dummy DAG which does nothing but run every few minutes). Then, by monitoring the metrics (think Prometheus) of the canary DAG, they can reliably affirm whether the Airflow scheduler is working as expected.
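A minimal canary DAG could look roughly like this (Airflow 2.x import paths assumed; the DAG id and interval are arbitrary):

```python
# Sketch of a canary DAG: does nothing, runs every few minutes, and exists only so
# its successful runs can be monitored externally as a scheduler heartbeat.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id="canary",
    start_date=datetime(2021, 1, 1),
    schedule_interval="*/5 * * * *",   # every 5 minutes
    catchup=False,
) as dag:
    DummyOperator(task_id="heartbeat")
```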
I'm tasked with 2 things that I'm not seeing support for.
Writing a script to enable a receive location with a schedule (no schedule currently set)
Writing a script to remove a schedule from a receive location.
I see how to enable/disable a receive location, but I don't see how to manipulate the schedule via script.
Check the article below. It may help you create the script.
http://msdn.microsoft.com/en-us/library/aa559496.aspx
Create two bindings files, one representing the receive port with a schedule and one without. Then use the BTSTask importbindings command to import whichever bindings file you want.
There is no support for scripting the scheduling of a receive location that I'm aware of.
You can achieve this in many other ways, though. The simplest is to script the delivery of the files to the receive location using a shell script, etc.
Before going down your own scripting route, I'd consider 2 options:
Check the schedule/service window capability of the receive location and see if it solves your problem; otherwise,
There is a community Scheduled Task adapter, which is suitable for such activities:
http://biztalkscheduledtask.codeplex.com/