We are currently looking at setting up multiple Airflow stacks on the same GKE cluster but in different namespaces (to save on costs) to run DBT jobs.
Each Airflow stack within its namespace would use RBAC authentication to authenticate end users so they can run or observe jobs.
I understand this isn't a typical use case but the other alternative would be to have a separate Cloud Composer for each service line which would be quite costly.
Any help is much appreciated :)
You can run multiple Airflow DAGs on the same GKE cluster by using the KubernetesPodOperator in the way shown in the documentation linked here. Affinity is used to constrain which nodes your pod is eligible to be scheduled on, based on node labels; the relevant code is inside the affinity={...} argument in the same link. It is also mentioned that you can use a custom namespace.
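A minimal sketch of what that could look like, assuming one DAG per service line. The image name, namespace, and node-pool label are placeholders, and the exact import path depends on your cncf.kubernetes provider version:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    "dbt_service_line_a",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_dbt = KubernetesPodOperator(
        task_id="run_dbt",
        name="run-dbt",
        namespace="service-line-a",       # custom namespace per Airflow stack
        image="my-registry/dbt:latest",   # assumed dbt image
        cmds=["dbt", "run"],
        # Only schedule on nodes from this service line's node pool.
        affinity={
            "nodeAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": {
                    "nodeSelectorTerms": [
                        {
                            "matchExpressions": [
                                {
                                    "key": "cloud.google.com/gke-nodepool",
                                    "operator": "In",
                                    "values": ["dbt-pool-a"],
                                }
                            ]
                        }
                    ]
                }
            }
        },
    )
```

The affinity dict is standard Kubernetes nodeAffinity syntax passed through to the pod spec, so anything you can express in a pod manifest works here too.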
I am trying to load data in Avro format from GCS to BigQuery using a single pipeline. There are, for instance, 10 tables that I am trying to load, which means 10 parallel jobs in a single pipeline.
Now if the 3rd job fails, all the subsequent jobs fail. How can I make the other jobs run independent of the failure/success of one?
You cannot isolate different steps within a single Dataflow pipeline without implementing custom logic (for example, custom DoFn/ParDo implementations). Some I/O connectors such as BigQuery offer a way to send failed requests to a dead-letter queue in some write modes, but this might not give you what you want. If you want full isolation, you should run separate jobs and combine them into a workflow using an orchestration framework such as Apache Airflow.
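To illustrate the last point: in an Airflow DAG, load tasks with no dependencies between them succeed or fail independently, so one failed table does not stop the other nine. A rough sketch using Google's stock Avro-to-BigQuery Dataflow template (bucket, project, and dataset names are made up):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowTemplatedJobStartOperator,
)

TABLES = ["table_01", "table_02", "table_03"]  # ... up to table_10

with DAG(
    "avro_gcs_to_bq",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    for table in TABLES:
        # No dependencies are declared between these tasks, so each
        # Dataflow job runs and fails (or succeeds) on its own.
        DataflowTemplatedJobStartOperator(
            task_id=f"load_{table}",
            template="gs://dataflow-templates/latest/GCS_Avro_to_BigQuery",
            parameters={
                "inputFilePattern": f"gs://my-bucket/{table}/*.avro",
                "outputTable": f"my-project:my_dataset.{table}",
            },
            location="us-central1",
        )
```

Each task launches a separate Dataflow job, so isolation comes from the job boundary rather than from anything inside a single pipeline.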
I am trying to understand how to manage large data with Airflow. The documentation is clear that we need to use external storage rather than XCom, but I cannot find any clean examples of staging data to and from a worker node.
My expectation would be that there should be an operator that can run a staging in operation, run the main operation, and staging out again.
Is there such an Operator or pattern? The closest I've found is the S3 File Transform operator, but it runs an executable to do the transform rather than a generic Operator, such as a DockerOperator, which is what we'd want to use.
Other "solutions" I've seen rely on everything running on a single host and using known paths, which isn't a production-ready solution.
Is there such an operator that would support data staging, or is there a concrete example of handling large data with Airflow that doesn't rely on each operator being equipped with cloud-copying capabilities?
Yes and no. Traditionally, Airflow is mostly an orchestrator - it does not usually "do" the stuff itself; it tells others what to do. You very rarely need to bring actual data to an Airflow worker; the worker is mostly there to tell others where the data is coming from, what to do with it, and where to send it.
There are exceptions (some transfer operators actually download data from one service and upload it to another) - so the data passes through Airflow node, but this is an exception rather than a rule (the more efficient and better pattern is to invoke an external service to do the transfer and have a sensor to wait until it completes).
This is the "historical" and, to some extent, "current" way Airflow operates. However, with Airflow 2 and beyond we are expanding this, and it becomes more and more possible to do a pattern similar to what you describe - and this is where XCom plays a big role.
You can - as of recently - develop custom XCom backends that allow for more than metadata sharing; they are also good for sharing the data itself. You can see the docs here https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html#custom-backends, but there is also a nice article from Astronomer about it https://www.astronomer.io/guides/custom-xcom-backends and a very nice talk from Airflow Summit 2021 (from last week) presenting it: https://airflowsummit.org/sessions/2021/customizing-xcom-to-enhance-data-sharing-between-tasks/ . I highly recommend watching the talk!
Looking at your pattern: XCom pull is the staging-in, the operator's execute() is the operation, and XCom push is the staging-out.
This pattern will, I think, be reinforced by upcoming Airflow releases and some cool integrations that are coming. There will likely be more cool data-sharing options in the future (but I think they will all be based on a - maybe slightly enhanced - XCom implementation).
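For illustration, a rough sketch of such a custom XCom backend that offloads list/dict values to GCS and keeps only a reference string in the metadata DB. The bucket name, prefix, and "what counts as large" rule are arbitrary choices for the sketch, and hook signatures may differ across provider versions:

```python
import json
import uuid

from airflow.models.xcom import BaseXCom
from airflow.providers.google.cloud.hooks.gcs import GCSHook


class GCSXComBackend(BaseXCom):
    """Stores dict/list XCom values as JSON objects in GCS; the metadata
    DB only ever sees a short reference string."""

    BUCKET = "my-xcom-bucket"   # assumed bucket name
    PREFIX = "gcs-xcom://"

    @staticmethod
    def serialize_value(value):
        # Crude rule for the sketch: treat any dict/list as "large".
        if isinstance(value, (dict, list)):
            key = f"xcom/{uuid.uuid4()}.json"
            GCSHook().upload(
                bucket_name=GCSXComBackend.BUCKET,
                object_name=key,
                data=json.dumps(value),
            )
            value = GCSXComBackend.PREFIX + key
        return BaseXCom.serialize_value(value)

    @staticmethod
    def deserialize_value(result):
        value = BaseXCom.deserialize_value(result)
        # Resolve the reference back into the real payload on pull.
        if isinstance(value, str) and value.startswith(GCSXComBackend.PREFIX):
            key = value[len(GCSXComBackend.PREFIX):]
            data = GCSHook().download(
                bucket_name=GCSXComBackend.BUCKET,
                object_name=key,
            )
            value = json.loads(data)
        return value
```

You would enable it with `xcom_backend = my_module.GCSXComBackend` in airflow.cfg (or the AIRFLOW__CORE__XCOM_BACKEND environment variable), after which every existing xcom_push/xcom_pull transparently goes through GCS.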
I have a DAG that runs in a multi-tenant scenario. The tenant ID gets set in dag_run.conf when the DAG is triggered. I want to ensure that there is at most one active run per tenant at a time, but potentially many active runs simultaneously across all tenants.
So far I have found the max_active_runs setting, but that would require me to actually set up one DAG per tenant, which I am trying to avoid.
Is there a way to achieve this in Airflow, or am I approaching the problem in the wrong way?
You are using dag_run.conf, which means that you are triggering your DAGs manually. Currently there is a bug (as of Airflow 2.0.1): max_active_runs isn't respected for manually triggered runs (see the GitHub issue).
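Until that is fixed, one possible workaround (purely illustrative, not an official pattern) is a guard task at the start of the DAG that skips the whole run when another running DagRun carries the same tenant_id in its conf; skipped status then propagates to the downstream tasks:

```python
from airflow.exceptions import AirflowSkipException
from airflow.models import DagRun
from airflow.operators.python import PythonOperator
from airflow.utils.state import State


def guard_per_tenant(dag_run, **_):
    """Skip this run if another running DagRun of the same DAG has
    the same tenant_id in its conf."""
    tenant = (dag_run.conf or {}).get("tenant_id")
    running = DagRun.find(dag_id=dag_run.dag_id, state=State.RUNNING)
    for other in running:
        if other.run_id != dag_run.run_id and \
                (other.conf or {}).get("tenant_id") == tenant:
            raise AirflowSkipException(
                f"Tenant {tenant} already has an active run"
            )


guard = PythonOperator(
    task_id="tenant_guard",
    python_callable=guard_per_tenant,
)
# guard >> first_real_task
```

Note this check is not atomic - two runs for the same tenant triggered at almost the same instant could both pass the guard - so treat it as a best-effort limiter rather than a hard lock.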
We are considering to use Airflow for a project that needs to do thousands of calls a day to external APIs in order to download external data, where each call might take many minutes.
One option we are considering is to create a task for each distinct API call; however, this will lead to thousands of tasks. Rendering all those tasks in the UI is going to be challenging. We are also worried about the scheduler, which may struggle with so many tasks.
The other option is to have just a few parallel long-running tasks and then implement our own scheduler within those tasks. We could add custom code to a PythonOperator which queries the database and decides which API to call next.
Perhaps Airflow is not well suited for such a use case and it would be easier and better to implement such a system outside of Airflow? Does anyone have experience with running thousands of tasks in Airflow and can shed some light on pros and cons on the above use case?
One task per call would kill Airflow, as the scheduler still needs to check on the status of each task at every heartbeat - even if the processing of the task (the worker) is separate, e.g. on K8s.
Not sure where you plan on running Airflow, but if it's on GCP and a single download takes no longer than 9 minutes (the Cloud Functions timeout limit), you could use the following:
task (PythonOperator) -> pubsub -> cloud function (to retrieve) -> pubsub -> function (to save result to backend).
The latter function may not be required, but we (re)use a generic and simple "BigQuery streamer".
Finally, in a downstream Airflow task (a PythonSensor) you query the number of results in the backend and compare it with the number of requests published.
We do this quite efficiently for 100K API calls to a third-party system we host on GCP, as we maximize parallelism. The nice thing about GCF is that you can tweak the architecture and concurrency instead of provisioning a VM or container to run the tasks.
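The sensor step above could look roughly like this; the project, dataset, table, and column names are invented, and reschedule mode is used so the worker slot is freed between pokes:

```python
from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook
from airflow.sensors.python import PythonSensor

EXPECTED = 100_000  # number of API calls published to Pub/Sub


def _all_results_arrived():
    """Return True once the result table has at least EXPECTED rows."""
    hook = BigQueryHook(use_legacy_sql=False)
    row = hook.get_first(
        "SELECT COUNT(*) FROM `my_project.my_dataset.api_results`"
    )
    return row is not None and row[0] >= EXPECTED


wait_for_results = PythonSensor(
    task_id="wait_for_results",
    python_callable=_all_results_arrived,
    poke_interval=60,
    mode="reschedule",   # don't hold a worker slot while waiting
    timeout=60 * 60,
)
```

In practice you would also want a dead-letter path for calls that never produce a result, otherwise the sensor times out whenever even one request is lost.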
We have some applications running and we want to start using Airflow. From the documentation it seems that the only way to start a DAG is from the command line. Is this true?
For example, we have a Flask server running and we want to start some workflow controlled by Airflow. How can we achieve this? Is there an API to trigger, e.g., "run DAG now with parameters x, y, h"?
There are a couple of ways to achieve this with Airflow. It depends on your situation which one - if any at all - is suitable for you. Two suggestions come to my mind:
Use triggered DAGs. Python jobs running in the background may trigger a DAG to be executed when an event happens. Have a look at example_trigger_controller_dag.py and example_trigger_target_dag.py in the repository: GitHub Airflow
Use sensor tasks: there are some predefined sensors available which you can use to listen for specific events in a data source, for example. If the existing ones do not satisfy your needs, Airflow should be adaptable enough to let you implement your own sensor: Airflow Sensor
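For the first suggestion, a stripped-down version of the controller/target pattern from those example files might look like this (DAG ids and conf values are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    "controller",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,   # run only when triggered
) as dag:
    trigger = TriggerDagRunOperator(
        task_id="trigger_target",
        trigger_dag_id="target_dag",          # the DAG to start
        conf={"x": 1, "y": 2, "h": "value"},  # parameters from the question
    )
```

Inside the target DAG, the values are then available to tasks as dag_run.conf (e.g. via templating or the task context).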
After reading your question, I understand your use case as:
That you wish to run/trigger a DAG from an HTTP server
--> you can just use the provided Airflow webserver (localhost:8080/), from which you can trigger/run the DAG manually
You can also go HERE, which is still in experimental mode, and use the API as provided
Please elaborate more so the question can be understood better.
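For completeness: on Airflow 2 the stable REST API supersedes the experimental one, and triggering a DAG with parameters from, e.g., a Flask handler is a single HTTP call. A sketch - the URL, credentials, and DAG id are placeholders, and basic auth must be enabled via the auth_backend setting:

```python
import requests

AIRFLOW_BASE = "http://localhost:8080/api/v1"  # placeholder host


def dag_runs_url(dag_id):
    """Build the stable-API endpoint for creating DAG runs."""
    return f"{AIRFLOW_BASE}/dags/{dag_id}/dagRuns"


def trigger_dag(dag_id, conf):
    """POST a new DagRun whose conf carries the parameters."""
    resp = requests.post(
        dag_runs_url(dag_id),
        json={"conf": conf},
        auth=("admin", "admin"),  # assumes basic-auth backend is enabled
    )
    resp.raise_for_status()
    return resp.json()


# From a Flask view you would call, for example:
# trigger_dag("my_workflow", {"x": 1, "y": 2, "h": "value"})
```

The response JSON includes the generated dag_run_id, which you can poll via GET on the same endpoint to follow the run's state.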