Airflow resource pool usage on DAG-level? - airflow

I'm looking at using Airflow to schedule test-case execution against shared hardware in a lab, and I have some best-practice questions on how to use the resource pool concept for a whole DAG instance instead of just at task level.
Basically, a test case (executed as an instance of a test-case DAG: deploy/execute/collect/un-deploy) needs certain physical resources and should therefore request them from the different resource pools (modelling the physical resources) in order to avoid conflicting concurrent usage with other triggered DAG instances.
My question is whether it's possible to define resource usage at DAG-instance level or only at task level. If the latter, would one parallel task claiming the resource during the whole DAG-instance execution be the best way to avoid passing the resource claim between all tasks in the DAG? Are there other alternatives?
Update after questions from Viraj and dlamblin:
Running 1.10.1
Running LocalExecutor
Have verified that I can run parallel DAGS with concurrent tasks
The resources I want custom pools for are not worker resources but peripheral hardware units such as relays, routers, etc. Tasks running in parallel on the LocalExecutor should block on them if they are occupied (0 slots left in the custom resource pool) by one or more other tasks.

The Kubernetes Executor allows for certain node type affinity to be configured on the task or dag level. The Celery Executor has a queue concept to select from a worker group with certain resources available to the worker. You're probably not using a Local Executor as your question doesn't quite make sense for that case.
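The blocking behaviour the question asks for (one claim held from deploy to un-deploy, other DAG instances waiting while no slot is free) can be modelled in plain Python. This is only a sketch of the semantics, not Airflow's pool API: `relay_pool`, `run_test_case` and the slot count of 1 are hypothetical, and `threading.BoundedSemaphore` stands in for a pool.

```python
import threading
import time

# Hypothetical pool: one slot for a single relay board in the lab.
relay_pool = threading.BoundedSemaphore(value=1)

active = 0      # runs currently holding a slot
peak = 0        # highest number of simultaneous holders observed
finished = []   # names of runs that completed
state_lock = threading.Lock()


def run_test_case(name):
    """Model one DAG instance: claim the slot, run all steps, release."""
    global active, peak
    with relay_pool:  # blocks while no slot is free
        with state_lock:
            active += 1
            peak = max(peak, active)
        for _step in ("deploy", "execute", "collect", "un-deploy"):
            time.sleep(0.01)  # stand-in for real work on the hardware
        with state_lock:
            active -= 1
            finished.append(name)


# Three concurrent "DAG instances" competing for the same relay.
threads = [
    threading.Thread(target=run_test_case, args=(f"run-{i}",))
    for i in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With a single slot, `peak` never exceeds 1 even though all three runs are started at once; each run keeps the claim across its four steps instead of re-acquiring it per step.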

Related

Airflow - How to configure that all DAG's tasks run in 1 worker

I have a DAG with 2 tasks:
download_file_from_ftp >> transform_file
My concern is that tasks can be performed on different workers. The file will be downloaded on the first worker and will be transformed on another worker. An error will occur because the file is missing on the second worker. Is it possible to configure the DAG so that all tasks are performed on one worker?
It's a bad practice. Even if you find a workaround, it will be very unreliable.
In general, if your executor allows this, you can configure tasks to execute on a specific worker type. For example, in CeleryExecutor you can assign tasks to a specific queue. Assuming there is only 1 worker consuming from that queue, your tasks will be executed on the same worker, BUT the fact that it's 1 worker doesn't mean it will be the same machine. It depends heavily on the infrastructure that you use. For example: when you restart your machines, do you get the exact same machine, or is a new one spawned?
I highly advise you - don't go down this road.
To solve your issue, either download the file to shared storage such as S3 or Google Cloud Storage, so that all workers can read it, or combine the download and transform steps into a single operator so that both actions are executed together.

Firebase-Queue Graceful Shutdown on GCE

This is a design question about the handling of tasks during the shutdown of a firebase-queue based app running on Google Compute Engine.
The use case I am working with is automatically scaling queue-workers depending on the load at any given time. Specific to our project is the fact that our tasks are long-running.
In an ideal world, the queue worker would have an opportunity to finish its current tasks before the virtual machine running the worker is terminated. We are working with Google Compute Engine / instance groups to handle the scaling of our queue worker app. Firebase-queue does provide a promise-based method to shut down a queue worker (i.e. queue.shutdown()). This will stop the worker from accepting new tasks and will allow running tasks to finish prior to resolving the promise.
The problem I am facing is how to allow the queue worker to shut down gracefully prior to instance termination (this problem would also occur during a rolling update). One way is to trigger the worker shutdown and have the worker trigger instance shutdown, but this does not seem like the best design because control is taken away from whatever service is triggering the scale-down in the first place.
GCE does provide a service which will run a shutdown script prior to instance termination; however, it will forcefully shut down an instance after about 90 seconds, which does not work for us.
I am interested in design ideas / patterns to follow here. Any help is much appreciated.

Sharing large intermediate state between Airflow tasks

We have an Airflow deployment with Celery executors.
Many of our DAGs require a local processing step of some file in a BashOperator or PythonOperator.
However, in our understanding the tasks of a given DAG may not always be scheduled on the same machine.
The options for state sharing between tasks I've gathered so far:
Use Local Executors - this may suffice for one team, depending on the load, but may not scale to the wider company
Use XCom - does this have a size limit? Probably unsuitable for large files
Write custom Operators for every combination of tasks that need local processing in between. This approach reduces modularity of tasks and requires replicating existing operators' code.
Use Celery queues to route DAGs to the same worker (docs) - This option seems attractive at first, but what would be an appropriate way to set it up in order to avoid routing everything to one executor, or crafting a million queues?
Use a shared network storage in all machines that run executors - Seems like an additional infrastructure burden, but is a possibility.
What is the recommended way to do sharing of large intermediate state, such as files, between tasks in Airflow?
To clarify something: no matter how you set up Airflow, there will only be one executor running.
The executor runs on the same machine as the scheduler.
Currently (current is airflow 1.9.0 at time of writing) there is no safe way to run multiple schedulers, so there will only ever be one executor running.
Local executor executes the task on the same machine as the scheduler.
Celery Executor just puts tasks in a queue to be worked on the celery workers.
However, the question you are asking does apply to Celery workers. If you use Celery Executor you will probably have multiple celery workers.
Using network shared storage solves multiple problems:
Each worker machine sees the same dags because they have the same dags folder
Results of operators can be stored on a shared file system
The scheduler and webserver can also share the dags folder and run on different machines
I would use network storage, and write the output file name to xcom. Then when you need to input the output from a previous task, you would read the file name from that task's Xcom and process that file.
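A minimal sketch of that pattern follows: the large result goes to shared storage and only its path travels through XCom. `SHARED_DIR`, the task id `produce`, the key `output_path` and the file contents are assumptions for illustration, and `FakeTaskInstance` is a stand-in so the sketch runs without Airflow; in a real DAG the callables would receive the task instance from the task context.

```python
import os
import tempfile

# Assumption: in a real deployment this is a mount visible to every worker.
SHARED_DIR = tempfile.mkdtemp()


def produce(ti):
    """First task: write the large result to shared storage, push only its path."""
    path = os.path.join(SHARED_DIR, "intermediate.csv")
    with open(path, "w") as f:
        f.write("id,value\n1,42\n")
    ti.xcom_push(key="output_path", value=path)


def consume(ti):
    """Second task: pull the path from XCom and read the file from shared storage."""
    path = ti.xcom_pull(task_ids="produce", key="output_path")
    with open(path) as f:
        return f.read().splitlines()


class FakeTaskInstance:
    """Hypothetical stand-in exposing only xcom_push/xcom_pull."""

    def __init__(self):
        self._store = {}

    def xcom_push(self, key, value):
        self._store[key] = value

    def xcom_pull(self, task_ids=None, key=None):
        return self._store.get(key)


ti = FakeTaskInstance()
produce(ti)
rows = consume(ti)
```

Only a short string is stored in the metadata database, so the XCom size concern from the question does not apply to the file itself.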
Change the datatype of the value column in the xcom table of the Airflow metastore (the column that holds the stored data).
On MySQL its default datatype is BLOB.
Change it to LONGBLOB. That lets you store up to 4 GB of intermediate data between tasks.

How do I setup an Airflow of 2 servers?

Trying to split out Airflow processes onto 2 servers. Server A, which has been already running in standalone mode with everything on it, has the DAGs and I'd like to set it as the worker in the new setup with an additional server.
Server B is the new server which would host the metadata database on MySQL.
Can I have Server A run LocalExecutor, or would I have to use CeleryExecutor? Would the airflow scheduler have to run on the server that has the DAGs? Or does it have to run on every server in a cluster? I'm confused as to what dependencies there are between the processes.
This article does an excellent job demonstrating how to cluster Airflow onto multiple servers.
Multi-Node (Cluster) Airflow Setup
A more formal setup for Apache Airflow is to distribute the daemons across multiple machines as a cluster.
Benefits
Higher Availability
If one of the worker nodes were to go down or be purposely taken offline, the cluster would still be operational and tasks would still be executed.
Distributed Processing
If you have a workflow with several memory-intensive tasks, then the tasks will be better distributed to allow for higher utilization of data across the cluster and provide faster execution of the tasks.
Scaling Workers
Horizontally
You can scale the cluster horizontally and distribute the processing by adding more executor nodes to the cluster and allowing those new nodes to take load off the existing nodes. Since workers don’t need to register with any central authority to start processing tasks, the machine can be turned on and off without any downtime to the cluster.
Vertically
You can scale the cluster vertically by increasing the number of celeryd daemons running on each node. This can be done by increasing the value in the ‘celeryd_concurrency’ config in the {AIRFLOW_HOME}/airflow.cfg file.
Example:
celeryd_concurrency = 30
You may need to increase the size of the instances in order to support a larger number of celeryd processes. This will depend on the memory and cpu intensity of the tasks you’re running on the cluster.
Scaling Master Nodes
You can also add more Master Nodes to your cluster to scale out the services that are running on the Master Nodes. This will mainly allow you to scale out the Web Server Daemon in case there are too many HTTP requests coming in for one machine to handle, or if you want to provide Higher Availability for that service.
One thing to note is that there can only be one Scheduler instance running at a time. If you have multiple Schedulers running, there is a possibility that multiple instances of a single task will be scheduled. This could cause some major problems with your Workflow and cause duplicate data to show up in the final table if you were running some sort of ETL process.
If you would like, the Scheduler daemon may also be set up to run on its own dedicated Master Node.
Apache Airflow Cluster Setup Steps
Pre-Requisites
The following nodes are available with the given host names:
master1 - Will have the role(s): Web Server, Scheduler
master2 - Will have the role(s): Web Server
worker1 - Will have the role(s): Worker
worker2 - Will have the role(s): Worker
A Queuing Service is Running. (RabbitMQ, AWS SQS, etc)
You can install RabbitMQ by following these instructions: Installing RabbitMQ
If you’re using RabbitMQ, it is recommended that it also be set up as a cluster for High Availability. Set up a Load Balancer to proxy requests to the RabbitMQ instances.
Additional Documentation
Documentation: https://airflow.incubator.apache.org/
Install Documentation: https://airflow.incubator.apache.org/installation.html
GitHub Repo: https://github.com/apache/incubator-airflow
All airflow processes need to have the same contents in their airflow_home folder. This includes configuration and dags. If you only want server B to run your MySQL database, you do not need to worry about any airflow specifics. Simply install the database on server B and change your airflow.cfg's sql_alchemy_conn parameter to point to your database on Server B and run airflow initdb from Server A.
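As a rough illustration of the sql_alchemy_conn change described above, the relevant airflow.cfg fragment might look like the text below. The host name `server-b`, the credentials and the database name are placeholders, not values from the question; the Python around it only parses the fragment to check that it points at the intended host.

```python
import configparser
from urllib.parse import urlsplit

# Hypothetical airflow.cfg fragment on Server A; host, user, password and
# database name are placeholders to replace with your own values.
cfg_text = """
[core]
executor = LocalExecutor
sql_alchemy_conn = mysql://airflow_user:airflow_pass@server-b:3306/airflow
"""

# Interpolation is disabled so a '%' in a password is not treated specially.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(cfg_text)

conn = parser["core"]["sql_alchemy_conn"]
parts = urlsplit(conn)
db_host = parts.hostname   # machine that hosts the metadata database
db_port = parts.port
db_name = parts.path.lstrip("/")
```

After a change like this, running airflow initdb from Server A (as the answer says) would create the tables in the database on Server B.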
If you also want to run airflow processes on server B, you would have to look into scaling using the CeleryExecutor.

Python: Prioritizing tasks and Running asynchronous tasks without a lock

Right now I'm using Gevent, and I wanted to ask two questions:
Is there a way to execute specific tasks that will never execute asynchronously (instead of using a Lock in each of these tasks)
Is there a way to prioritize spawned tasks in Gevent? For example, a group of tasks that is generated with low priority and is executed only when all of the other tasks are done, or two tasks that listen to different sockets, each handling its socket requests at a different priority.
If it's not possible in Gevent, is there any other library that it can be done?
Edit
Maybe Celery can help me here?
If you want to manage computing resources, Python async libraries can't help here because, AFAIK, none of them has a priority scheduler. All greenthreads are equal.
Task queues generally have a notion of priority, so Celery or Beanstalk is one way to do it.
If your problem does not require task (re)execution guarantees, persistence, or multi-machine work distribution, then I would just start a few worker processes, assign them CPU, IO, and disk priorities using the OS, and send work/results via UNIX socket DGRAM. It is a kind of ad-hoc, simpler version of a task queue. If you go this way, please share your work as an open source project; I believe there's demand for this kind of solution.
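To make the notion of priority in a task queue concrete, here is a small in-process sketch using only the standard library. It is not Gevent, Celery or Beanstalk code; `PriorityTasks`, `submit` and `run_all` are hypothetical names, and lower numbers are assumed to mean higher priority.

```python
import heapq
import itertools


class PriorityTasks:
    """Run submitted callables in priority order (lowest number first)."""

    def __init__(self):
        self._heap = []
        # The counter breaks ties so equal-priority tasks keep submission order
        # and the callables themselves are never compared.
        self._counter = itertools.count()

    def submit(self, priority, func, *args):
        heapq.heappush(self._heap, (priority, next(self._counter), func, args))

    def run_all(self):
        results = []
        while self._heap:
            _, _, func, args = heapq.heappop(self._heap)
            results.append(func(*args))
        return results


tasks = PriorityTasks()
tasks.submit(10, str.upper, "cleanup")    # low priority, submitted first
tasks.submit(0, str.upper, "socket-a")    # high priority
tasks.submit(0, str.upper, "socket-b")    # high priority, after socket-a
order = tasks.run_all()
```

Here the low-priority task runs last even though it was submitted first, which is the ordering the question asks about; a real task queue adds persistence and distribution on top of this idea.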
