Could anyone please let me know whether I can spin up many DAGs (say, around 10) concurrently, in parallel? The parameters for these DAGs will be the same, but with different values. Any ideas or suggestions would be greatly appreciated, as I am stuck on this. Thanks!
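To make it concrete, the kind of fan-out I have in mind looks roughly like the sketch below (assuming Airflow 2.x; the dag_id "process_batch" and the "source" parameter are made-up names for illustration):

# A sketch, assuming Airflow 2.x: trigger several runs of the same DAG,
# each with its own conf. "process_batch" and "source" are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

SOURCES = ["s3://bucket/a", "s3://bucket/b", "s3://bucket/c"]

with DAG(
    dag_id="fan_out",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    for i, source in enumerate(SOURCES):
        TriggerDagRunOperator(
            task_id=f"trigger_{i}",
            trigger_dag_id="process_batch",  # the same DAG every time
            conf={"source": source},         # same parameter, different value
        )

The triggered DAG would presumably also need max_active_runs set to at least 10 for the runs to actually overlap.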
I have some basic questions about setting up Airflow on EKS using Fargate. What I have understood so far is that the control plane will still be managed by AWS, while the worker plane will run on Fargate instances.
Question: What I am unclear about is, while setting up the webserver/scheduler etc. on Fargate, do I need to specify anywhere the amount of vCPU and memory?
More importantly, do any changes need to be made to how DAGs are written so that they can execute on individual pods? Also, do the tasks in the DAGs specify how much vCPU and memory each task will use?
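For concreteness, the per-task resource spec I have seen referenced looks roughly like the sketch below (assuming the KubernetesExecutor on Airflow 2.x; I don't know yet whether this is the right mechanism on Fargate, and the command is made up):

# A sketch, assuming Airflow 2.x with the KubernetesExecutor: a task
# requests its own vCPU/memory via pod_override; on Fargate, the pod's
# requests drive the instance size AWS provisions.
from kubernetes.client import models as k8s
from airflow.operators.bash import BashOperator

heavy = BashOperator(
    task_id="heavy",
    bash_command="echo 'heavy work here'",  # hypothetical command
    executor_config={
        "pod_override": k8s.V1Pod(
            spec=k8s.V1PodSpec(
                containers=[
                    k8s.V1Container(
                        name="base",  # the main task container
                        resources=k8s.V1ResourceRequirements(
                            requests={"cpu": "1", "memory": "2Gi"},
                        ),
                    )
                ]
            )
        )
    },
)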
Sorry, I'm just entering the Fargate/EKS/Airflow world.
Thanks
I have a DAG that persistently monitors a filesystem and executes when new files are present. I want one copy of this DAG to be running perpetually, and I'm wondering about the best way to accomplish this.
E.g., as soon as one DAG run finishes, another begins.
I could accomplish this by scheduling the DAG every few minutes and limiting concurrent runs to 1 (e.g. the sketch below), but I'm guessing there is a more systematic way to do it.
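A sketch of that workaround, assuming Airflow 2.x (the watched path is made up):

# A sketch, assuming Airflow 2.x: schedule frequently but cap the DAG at
# one active run, so a new run starts as soon as the previous finishes.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.filesystem import FileSensor

with DAG(
    dag_id="watch_filesystem",
    start_date=datetime(2021, 1, 1),
    schedule_interval=timedelta(minutes=5),  # re-schedule often...
    max_active_runs=1,                       # ...but never overlap runs
    catchup=False,
) as dag:
    wait = FileSensor(
        task_id="wait_for_file",
        filepath="/data/incoming",  # hypothetical watched path
        poke_interval=30,
    )
    process = BashOperator(
        task_id="process_file",
        bash_command="echo 'process the new files here'",
    )
    wait >> process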
I am maintaining an API server for my company, which runs a Python Flask app under uWSGI on top of nginx.
...
@app.route('/getquick', methods=["GET"])
def GET_GET_IP_DATA():
    # fast stored procedure: returns almost immediately
    sp_final = "CALL sp_quick()"
    cursor.execute(sp_final)
    return jsonify(cursor.fetchall())

@app.route('/get_massive_log', methods=["POST"])
def get_massive_log():
    # slow stored procedure: can hold the worker for ~5 seconds
    sp_final = "CALL sp_slow()"
    cursor.execute(sp_final)
    return jsonify(cursor.fetchall())
...
While the first request, /getquick, is processed very quickly, /get_massive_log can take up to five seconds due to a rather long and complex MySQL query. The server can handle a few of these queries, but starts producing broken-pipe errors when it is called too much.
The problem is that the other /getquick requests get blocked by these long I/O-bound requests.
My manager suggested that I use gevent to somehow free up the server to process other requests while waiting for the MySQL queries, but I am not sure whether I am looking in the right direction.
I am using pymysql to run the queries, which Google seems to suggest works with gevent on top of uWSGI, but I have not been able to produce better results with it.
I have been googling for days now, and while I am trying to understand threads, concurrency, and asynchronous requests, I don't know where to start digging for a solution. Is it even possible? Any suggestions, or even pointers to where to research, would be greatly appreciated.
EDIT: Perhaps my question wasn't clear, so I'll try to restate it:
What's the best way to free up workers to process other requests while waiting on long database queries under uWSGI?
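For reference, the gevent variant I have been experimenting with looks roughly like this (these are uWSGI's standard gevent loop options; the values are illustrative, and I may be holding it wrong):

[uwsgi]
module = app:app
http = :8080
processes = 2
; async cores (greenlets) per worker
gevent = 100
; monkey-patch the stdlib sockets before loading the app,
; so pymysql's blocking waits yield to other greenlets
gevent-monkey-patch = true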
You need to learn about uWSGI offloading:
Offloading is a way to optimize tiny tasks, delegating them to one or more threads. These threads run such tasks in a non-blocking/evented way, allowing for a huge amount of concurrency.
You can read about the offloading subsystem in the docs.
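Enabling it is a small change in the uWSGI config; a minimal sketch (ini style, values illustrative):

[uwsgi]
module = app:app
processes = 4
; start 2 dedicated threads per worker for the
; non-blocking/evented offload engine
offload-threads = 2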
What is the practical limit on the number of DAGs in an Airflow system?
We are seeing severe delays after a couple of hundred DAGs are created.
Is anyone running thousands of DAGs?
It depends on how you run Airflow and how many resources you give it. If you have a huge number of tasks, you can run Airflow in distributed mode with Celery, with the master and the workers given plenty of memory and vCores.
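For illustration, the knobs usually involved look like this in airflow.cfg (a sketch using the Airflow 2.0-era option names; the values are not recommendations):

[core]
executor = CeleryExecutor
# max task instances running at once across the whole installation
parallelism = 128
# max task instances running at once within a single DAG
dag_concurrency = 16
# max concurrent runs per DAG
max_active_runs_per_dag = 16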
Right now I'm using Gevent, and I wanted to ask two questions:
Is there a way to execute specific tasks that will never execute asynchronously (instead of using a Lock in each of these tasks)?
Is there a way to prioritize spawned tasks in Gevent? For instance, a group of tasks generated with low priority that will only be executed once all of the other tasks are done. For example, two tasks that listen on different sockets, where each task handles its socket's requests at a different priority.
If it's not possible in Gevent, is there any other library in which it can be done?
Edit
Maybe Celery can help me here?
If you want to manage computing resources, Python's async libraries can't help here because, AFAIK, none of them has a priority scheduler. All green threads are equal.
Task queues generally have a notion of priority, so Celery or Beanstalk is one way to do it.
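A sketch of what priority looks like in Celery, assuming a RabbitMQ broker (priorities need broker support; the task name and payloads are hypothetical):

# A sketch, assuming Celery with a RabbitMQ broker; RabbitMQ supports
# per-message priorities once the queue declares a max priority.
from celery import Celery

app = Celery("tasks", broker="amqp://localhost")
app.conf.task_queue_max_priority = 10   # declare x-max-priority on queues
app.conf.task_default_priority = 5      # middle-of-the-road default

@app.task
def handle(payload):
    ...  # whatever the task does

# urgent work jumps ahead of queued low-priority work
handle.apply_async(args=["urgent"], priority=9)
handle.apply_async(args=["background"], priority=0)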
If your problem does not require task (re)execution guarantees, persistence, or multi-machine work distribution, then I would just start a few worker processes, assign them CPU, I/O, and disk priorities using the OS, and send work/results via UNIX datagram (SOCK_DGRAM) sockets: a kind of ad-hoc, simpler version of a task queue. If you go this way, please share your work as an open-source project; I believe there's demand for this kind of solution.