How many Tornado instances and how many Nginx worker processes?

Suppose I am running a web application on Tornado behind Nginx as a load balancer. What are the best practices for the following?
1. If I am running the service on an AWS EC2 instance, how many Nginx worker processes should I run for a given number x of vCPUs? For example, on an EC2 instance with 2 vCPUs, how many worker processes should I run? I would like to know the general rule. Also, under what conditions should I run more workers than the general rule suggests?
2. With Nginx set up as the load balancer, it comes down to my Tornado application. How many Tornado instances should I run for x vCPUs on an EC2 instance? The docs say it is good to run 1 instance per processor, but is that the best configuration? If yes, in what scenario should I increase the number of instances per processor? If not, then what is the best rule?
NOTE: I am running the instances under Supervisord as my process manager.
3. If my application makes a lot of async calls to a MySQL database and a MongooseIM server, all running on the same host, should the number of Tornado instances per processor change? If yes, what is the rule? If not, what is the best practice?

If you are running nginx on a machine by itself, then you should give it as many worker processes as you have CPUs. If you're running it on the same machine as Tornado then you probably want to give it fewer (maybe just one). But it's better to be too high than too low here, so if you're unsure it's fine to use the number of CPUs. You'll want more nginx workers if you're using TLS (especially with stronger security settings) or serving a lot of static files, and fewer if it's just a proxy to Tornado.
One Tornado instance per CPU is the best starting point. You might decrease this number if your application does a lot with threads or if there are other things running on the same machine, and you might increase it if you do any synchronous database/network calls without threads.
As long as your database calls are asynchronous, they do not affect how many Tornado processes you should run.
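As a rough illustration, the starting points above can be written down as a small helper. This is only a sketch of the rules of thumb in this answer: the function name, its flags, and the choice of 2 instances per CPU for blocking code are assumptions made for the example, not something Tornado or nginx prescribes.

```python
import os


def suggest_process_counts(cpus, nginx_shares_host=True,
                           nginx_heavy=False, blocking_calls=False):
    """Return starting-point process counts for nginx and Tornado.

    Hypothetical helper: the flags and the 2x multiplier are assumptions
    chosen for illustration; tune the results with load testing.
    """
    if nginx_shares_host and not nginx_heavy:
        # nginx only proxies to Tornado on the same machine: one worker.
        nginx_workers = 1
    else:
        # Dedicated nginx host, or TLS / static files: one worker per CPU.
        nginx_workers = cpus

    # One Tornado instance per CPU is the baseline.
    tornado_instances = cpus
    if blocking_calls:
        # Synchronous DB/network calls leave CPUs idle, so run extra instances.
        tornado_instances = cpus * 2

    return {"nginx_workers": nginx_workers,
            "tornado_instances": tornado_instances}


if __name__ == "__main__":
    print(suggest_process_counts(os.cpu_count() or 1))
```

For the 2-vCPU instance in the question, with fully asynchronous database calls and nginx acting only as a proxy on the same host, this gives 1 nginx worker and 2 Tornado instances.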

Related

How Uvicorn workers works, and how many do I need for a slim machine?

The application I deploy is FastAPI with Uvicorn under K8s.
While working out how to Dockerize the application, I decided to run Uvicorn without Gunicorn and to add scaling up/down based on the request load the application receives.
I did a lot of load testing and found that with the default of 1 Uvicorn worker I get 3.5 RPS, while with 8 workers I can easily get 22 RPS (I didn't test higher since that is a great result for me).
What I expected regarding resources was that I would have to provide a CPU limit of 8 (I assume every worker runs as one process with one thread), but I saw an increase only in memory usage and barely any in CPU. Maybe that is because the app doesn't use much CPU, but could it use more than 1 CPU? So far it hasn't used more than one.
How do Uvicorn workers work? How should I calculate the number of workers I need for the app? I didn't find useful information.
Again, my goal is to keep a slim machine with 1 CPU, plus an autoscaling system.

Best practice of deploying a flask-api on google production kubernetes cluster

A flask-api (using gunicorn) is used as an inference api of a deep learning model.
This specific inference process is very cpu intensive (not using gpu yet).
What is the best practice of deploying it to a kubernetes cluster, based on these aspects:
Should I create multiple pods, each handling requests with a single gunicorn worker, or fewer pods with multiple gunicorn workers each? (node memory footprint)
Since Google lets you expose your deployment as a service using an external load balancer, do I need an nginx web server in my flask-gunicorn stack?
Is creating multiple identical pods on the same node more memory intensive than handling all these requests using multithreading on a single pod?
More smaller pods is generally better, provided you're staying under "thousands". It is easier for the cluster to place a pod that requires 1 CPU and 1 GB of RAM 16 times than it is to place a single pod that requires 16 CPU and 16 GB RAM once. You usually want multiple replicas for redundancy, to tolerate node failure, and for zero-downtime upgrades in any case.
If the Istio Ingress system works for you, you may not need a separate URL-routing layer (Nginx) inside your cluster. If you're okay with having direct access to your Gunicorn servers with no routing or filtering in front of them, pointing a LoadBalancer Service directly at them is a valid choice.
Running 16 copies of 1 application will generally need more memory than 1 copy with 16 threads; how much more depends on the application.
In particular, if you load your model into memory and the model itself is large, but your multi-threaded setup can share a single copy of it, 1 large pod could use significantly less memory than 16 small pods. If the model is COPYed directly into the Docker image and the application code mmap()s it then you'd probably get to share memory at the kernel layer.
If the model itself is small and most of the memory is used in the processing, it will still use "more" memory to have multiple pods, but it would just be the cost of your runtime system and HTTP service; it shouldn't substantially change the memory required per thread/task/pod if that isn't otherwise shared.
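The trade-off can be made concrete with some back-of-the-envelope arithmetic. The sketch below assumes that every pod loads its own copy of the model and runtime (that is, no kernel-level sharing through mmap()), and all of the megabyte figures are invented purely for illustration.

```python
def total_memory_mb(copies, threads_per_copy, runtime_mb, model_mb, per_task_mb):
    """Estimate total memory when every copy loads its own runtime and model.

    Hypothetical model: each copy pays for the runtime and the model once,
    and each thread/task pays for its own working memory.
    """
    per_copy = runtime_mb + model_mb + threads_per_copy * per_task_mb
    return copies * per_copy


# Illustrative numbers only (not measurements).
large_model = dict(runtime_mb=100, model_mb=2000, per_task_mb=200)
small_model = dict(runtime_mb=100, model_mb=50, per_task_mb=200)

for label, sizes in (("large model", large_model), ("small model", small_model)):
    many_pods = total_memory_mb(copies=16, threads_per_copy=1, **sizes)
    one_pod = total_memory_mb(copies=1, threads_per_copy=16, **sizes)
    print(f"{label}: 16 pods = {many_pods} MB, 1 pod x 16 threads = {one_pod} MB")
```

With these made-up figures, the large model costs 36800 MB across 16 pods versus 5300 MB in one pod, while the small model costs 5600 MB versus 3350 MB, so the gap shrinks to roughly the duplicated runtime overhead.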

How do I setup an Airflow of 2 servers?

Trying to split out Airflow processes onto 2 servers. Server A, which has been already running in standalone mode with everything on it, has the DAGs and I'd like to set it as the worker in the new setup with an additional server.
Server B is the new server which would host the metadata database on MySQL.
Can I have Server A run LocalExecutor, or would I have to use CeleryExecutor? Does the airflow scheduler have to run on the server that has the DAGs? Or does it have to run on every server in a cluster? I'm confused as to what dependencies there are between the processes.
This article does an excellent job demonstrating how to cluster Airflow onto multiple servers.
Multi-Node (Cluster) Airflow Setup
A more formal setup for Apache Airflow is to distribute the daemons across multiple machines as a cluster.
Benefits
Higher Availability
If one of the worker nodes were to go down or be purposely taken offline, the cluster would still be operational and tasks would still be executed.
Distributed Processing
If you have a workflow with several memory intensive tasks, then the tasks will be better distributed to allow for higher utilization of resources across the cluster and provide faster execution of the tasks.
Scaling Workers
Horizontally
You can scale the cluster horizontally and distribute the processing by adding more executor nodes to the cluster and allowing those new nodes to take load off the existing nodes. Since workers don’t need to register with any central authority to start processing tasks, the machine can be turned on and off without any downtime to the cluster.
Vertically
You can scale the cluster vertically by increasing the number of celeryd daemons running on each node. This can be done by increasing the value in the ‘celeryd_concurrency’ config in the {AIRFLOW_HOME}/airflow.cfg file.
Example:
celeryd_concurrency = 30
You may need to increase the size of the instances in order to support a larger number of celeryd processes. This will depend on the memory and cpu intensity of the tasks you’re running on the cluster.
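As a small sketch of that change, the snippet below edits the setting with Python's standard configparser instead of by hand. It works on an in-memory sample rather than a real airflow.cfg, and the [celery] section name and the starting value of 16 are assumptions about the file layout; the article only names the celeryd_concurrency key.

```python
import configparser

# Stand-in for the relevant part of {AIRFLOW_HOME}/airflow.cfg.
# The [celery] section name is an assumption about the file layout.
SAMPLE_CFG = """
[celery]
celeryd_concurrency = 16
"""


def set_celeryd_concurrency(cfg_text, value):
    """Parse config text and return a parser with the new concurrency set."""
    # interpolation=None keeps any '%' characters in the file literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    parser.set("celery", "celeryd_concurrency", str(value))
    return parser


parser = set_celeryd_concurrency(SAMPLE_CFG, 30)
print(parser.get("celery", "celeryd_concurrency"))
```

To apply this to a real file you would read it with parser.read(path) and write it back with parser.write(file_handle); note that configparser does not preserve comments when writing.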
Scaling Master Nodes
You can also add more Master Nodes to your cluster to scale out the services that are running on the Master Nodes. This will mainly allow you to scale out the Web Server Daemon in case there are too many HTTP requests coming in for one machine to handle, or if you want to provide Higher Availability for that service.
One thing to note is that there can only be one Scheduler instance running at a time. If you have multiple Schedulers running, there is a possibility that multiple instances of a single task will be scheduled. This could cause some major problems with your Workflow and cause duplicate data to show up in the final table if you were running some sort of ETL process.
If you would like, the Scheduler daemon may also be setup to run on its own dedicated Master Node.
Apache Airflow Cluster Setup Steps
Pre-Requisites
The following nodes are available with the given host names:
master1 - Will have the role(s): Web Server, Scheduler
master2 - Will have the role(s): Web Server
worker1 - Will have the role(s): Worker
worker2 - Will have the role(s): Worker
A Queuing Service is Running. (RabbitMQ, AWS SQS, etc)
You can install RabbitMQ by following these instructions: Installing RabbitMQ
If you’re using RabbitMQ, it is recommended that it is also setup to be a cluster for High Availability. Setup a Load Balancer to proxy requests to the RabbitMQ instances.
Additional Documentation
Documentation: https://airflow.incubator.apache.org/
Install Documentation: https://airflow.incubator.apache.org/installation.html
GitHub Repo: https://github.com/apache/incubator-airflow
All airflow processes need to have the same contents in their airflow_home folder. This includes configuration and dags. If you only want server B to run your MySQL database, you do not need to worry about any airflow specifics. Simply install the database on server B and change your airflow.cfg's sql_alchemy_conn parameter to point to your database on Server B and run airflow initdb from Server A.
If you also want to run airflow processes on server B, you would have to look into scaling using the CeleryExecutor.
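For the database-only case, the value placed in sql_alchemy_conn is a SQLAlchemy-style connection URL. The sketch below builds one; the host name, database name, credentials, and the plain mysql:// scheme are all placeholders assumed for illustration, so substitute whatever driver and account your setup actually uses.

```python
from urllib.parse import quote_plus


def mysql_conn_url(host, database, user, password, port=3306):
    """Build a connection URL for the metadata database on Server B.

    Hypothetical helper: user and password are percent-encoded so that
    characters such as '@' or '/' do not break URL parsing.
    """
    return (f"mysql://{quote_plus(user)}:{quote_plus(password)}"
            f"@{host}:{port}/{database}")


# Placeholder values only; none of these come from the question.
url = mysql_conn_url("server-b.example.internal", "airflow",
                     "airflow", "p@ss/word")
print(f"sql_alchemy_conn = {url}")
```

The printed line is what would go into airflow.cfg on Server A before running airflow initdb.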

Puma Stops Running for Rails App on EC2 Instance with Nginx (using Capistrano/Capistrano Puma)

My top-level question is, how can I get Puma to stop failing. But that is really made up of lots of smaller questions. I will number and bold each of them, to try to make this question answerable.
I am hosting a Rails application on an EC2 instance that is a t2.nano. This is admittedly, a very small box--but I don't expect my website to receive any traffic. I configured everything successfully with Nginx and Puma using Capistrano and Capistrano Puma. Everything was great, until one day I went to my website and saw the Nginx 504 message.
I opened the Nginx error log and saw that it could not connect to Puma:
connect() to unix:/home/deploy/myapp/shared/tmp/sockets/puma.sock failed (111: Connection refused) while connecting to upstream, client: xxx.xxx.xxx.xxx, server: localhost, request: "GET / HTTP/1.0", upstream: "http://unix:/home/deploy/myapp/shared/tmp/sockets/puma.sock:/500.html", host: "myapp.com"
Debugging this, I learned that Puma had stopped running. That is why Nginx could not connect to it. I think there are two problems here. The first is that Puma should not stop running: the server is tiny, but there is no traffic. The second is that when Puma does fail, it should restart gracefully. However, I am focusing only on the first issue for now, because if Puma is constantly restarting, it seems reasonable that the process sometimes gets killed in a harsh way.
To debug this, I opened htop. Sure enough, the machine was running without any memory to spare. This makes sense--I am running a database, rails app, webserver, and memcache on one tiny machine. It keeps running out of memory and killing Puma.
I looked into the Puma configuration I had set up with Capistrano. In config/deploy.rb I had these lines--
set :puma_threads, [0, 8]
set :puma_workers, 0
I read all about puma_workers and puma_threads. I also learned that Nginx has its own workers. Puma processes are very expensive. What makes Puma cool is that it is properly multi-threaded--so the independent processes are awesome. It sounds like each worker has its own set of threads--so if there are 4 workers with 8 threads, there will be 32 processes. But in my case, I want to use very little memory. 2 processes sound good to me. 1. Is my understanding of workers and threads correct?
I updated my config/deploy.rb file and deployed, with 0 puma_workers and min=0, max=2 threads.
It appears the configuration for Nginx lives here: /etc/nginx/nginx.conf. And the configuration for Puma lives here: /home/deploy/myapp/shared/puma.rb. I would have expected my updates in config/deploy.rb to make Capistrano edit the config files. No luck--my min, max threads were still set to 0,8. 2. Is it correct to try to update these values through config/deploy.rb when using Capistrano?
Also--I opened the nginx.conf and saw worker_processes 4;. 3. Was this set to four when I installed Nginx or did Capistrano set this default?
I opened htop and sure enough I had lots of Puma processes. Therefore, I edited my config files manually and restarted Puma and Nginx.
I changed the number of Nginx workers from 4 to 1. Looking in htop, this worked. I now only had 1 Nginx worker. However, the Nginx workers were never very expensive (compared to the Puma threads). So I don't think this matters much.
However, there were still more than 2 Puma threads--there were 6. On a lark, I changed the minimum number of threads from 0 to 1--thinking 0 isn't a possible number so maybe it's setting a default. This increased the number of Puma processes to 9. I also tried changing the number of puma_workers to 1, for the same reason, and the number of processes increased. 4. What does it mean to have 0 threads and/or workers?
I then tried to kill one of the puma processes manually (sudo kill xxxxx), and then all of the Puma processes died.
5. What do I have to do to have just 2 puma processes?
As you can see, my understanding of Puma is not great, and the lines between what Puma, Nginx, and Capistrano each touch are not clear. Any help is greatly appreciated. I haven't been able to find great resources regarding this issue.
This is what I've learned--
If Puma stops working, make sure you have enough memory to handle the number of workers and threads that you specified. Each Puma process is pretty expensive.
If you set 0 workers, Puma will not run in cluster mode. It is recommended to run MRI using cluster mode.
Threads are set per worker. If you have 2 workers and 0,8 threads, that means you will have two workers and each will have between 1 and 8 threads.
Puma uses processes in addition to the threads. Puma has a PID for the parent process. If you are using cluster mode, it has a PID to manage the workers, and it also has a PID for each worker. Then, there are a fixed number of PIDs to run other tasks (per worker). Without cluster mode, there are 5 fixed PIDs. With cluster mode, there are 7 fixed PIDs.
This is all to say: if you see more processes than you expect, this is why. Also, when you add a new worker you add a significant number of expensive processes. Make sure you have the space.
I have a small app, and things seem to be working nicely with 1 worker and min=1, max=4 threads. Having a max of 8 threads looks to be what kept killing Puma for me.
To answer my original questions--
Yes, the explanation above of workers and threads is correct.
capistrano-puma appears to only set puma config with the first deploy.
I think the nginx config is created when nginx is installed.
0 workers means you are running puma without cluster mode. It is impossible to have 0 threads. I believe 0,8 is the same as 1,8.
Puma needs to run processes in addition to the threads you request. It is impossible to have Puma running with only 2 or 3 PIDs. These processes run additional tasks.
A suspect for Puma hangs
The thing with Puma is that it's the only mainstream project that encourages the use of threading in MRI Ruby (well, anyway, Heroku encourages).
This is why we sometimes see statements from people working on Puma about how users think Puma has various kinds of issues while the problem is elsewhere; and it is elsewhere, yet it affects only Puma :P
"We" have discovered and fixed in the past some very freaky and nasty Ruby GC issues under heavy-duty use of threads in Ruby MRI, with some freaky corner cases (remember http://blog.skylight.io/hunting-for-leaks-in-ruby/), and who is to say this is the last of such freaky issues that people attribute to Puma?
Try disabling threading for a while, see how it goes, and let us know; maybe the rabbit lies there, again.
Docs explaining threads vs clustered mode vs workers
Thread pool docs: https://github.com/puma/puma#thread-pool
Clustered mode docs: https://github.com/puma/puma#clustered-mode
puma.rb options: https://github.com/puma/puma/blob/master/examples/config.rb
Under Thread pool the docs explain how to set up the number of worker threads. Remember, Puma is/was primarily a JRuby thing, and MRI support and forking were added only later as an afterthought; the ordering of configuration entries in the docs (how to set up threading before how to set up forking) is a consequence of this.
The docs state:
Puma utilizes a dynamic thread pool which you can modify. You can set the minimum and maximum number of threads that are available in the pool with the -t (or --threads) flag:
Puma 2 offers clustered mode, allowing you to use forked processes to handle multiple incoming requests concurrently, in addition to threads already provided.
Meaning, Puma will always thread, it's what it does, if you tell it to do 0/1 thread, it will do 1 thread so it can serve requests.
Additionally, if you set the number of workers (processes) to 1 or more, Puma will run in "Clustered mode" which means it will fork and each fork will thread,
i.e. -w 3 -t 4:4 will result in 3 processes running 4 threads each, allowing you to concurrently serve 12 requests.
The Puma docs don't specify which and how many processes Puma uses for its internals, but an educated guess is that at the very minimum it needs to run all of the workers + 1 master process to manage them, deliver data to them, start them, stop them, channel their logs, etc.
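The arithmetic above can be spelled out in a few lines. This is a sketch written in Python rather than Ruby, the function name is made up, and the "workers + 1 master" process count is the educated guess from this answer rather than a documented Puma guarantee.

```python
def puma_capacity(workers, max_threads):
    """Return (concurrent_request_slots, minimum_process_count).

    Hypothetical helper illustrating the workers x threads arithmetic.
    A thread count of 0 is treated as 1, since Puma needs a thread to serve.
    """
    threads = max(max_threads, 1)
    if workers == 0:
        # Single (non-cluster) mode: one process, all threads live in it.
        return threads, 1
    # Cluster mode: each forked worker has its own thread pool,
    # plus (by assumption) one master process managing the workers.
    return workers * threads, workers + 1


print(puma_capacity(3, 4))  # the -w 3 -t 4:4 example
print(puma_capacity(0, 8))  # the original deploy.rb settings
print(puma_capacity(1, 4))  # the settings that worked on the t2.nano
```

The three calls return (12, 4), (8, 1), and (4, 2) respectively; the extra entries seen in htop beyond these counts are likely individual threads, which htop lists alongside processes by default.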

Is there an advantage to running more tornado processes than cores?

A common configuration I've seen for nginx + tornado is to have nginx serve static files and then act as a reverse proxy to some upstream tornado app servers. I know this configuration is often used to serve an application through wsgi (such as Django) which blocks tornado. In that case the usual approach is to run as many tornado processes as will fit in memory and then have the nginx front round robin across processes.
If I were to use a CDN instead of nginx to serve static files and run tornado in a non-blocking fashion, is there any advantage to running more total processes (i.e. 1 nginx and 1 tornado per core) than there are cores on the machine?
If the Tornado instances have no blocking code, there is not much to gain from running more Tornado instances than the number of CPU cores. With blocking code (like using blocking libraries or db drivers inside IOLoop), it's advised to run more instances than cores to utilize CPU resources better (2-3 per core).
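The difference between blocking and non-blocking handlers can be seen in a tiny experiment. The sketch below uses the standard library's asyncio directly instead of Tornado so that it has no dependencies (Tornado 5 and later run on the asyncio event loop); the handler names and the 0.2 second delay are invented stand-ins for a database call.

```python
import asyncio
import time

DELAY = 0.2  # pretend each database call takes this long


async def blocking_handler():
    # Synchronous call: the whole event loop is stuck until it returns.
    time.sleep(DELAY)


async def async_handler():
    # Asynchronous call: the loop is free to serve other requests meanwhile.
    await asyncio.sleep(DELAY)


async def serve_two(handler):
    """Run two 'requests' concurrently in one process and time them."""
    start = time.perf_counter()
    await asyncio.gather(handler(), handler())
    return time.perf_counter() - start


blocking_elapsed = asyncio.run(serve_two(blocking_handler))
async_elapsed = asyncio.run(serve_two(async_handler))
print(f"blocking: {blocking_elapsed:.2f}s, async: {async_elapsed:.2f}s")
```

With blocking calls the two requests run back to back, so a single process spends most of its time waiting and extra processes per core help. With async calls the waits overlap, and one process per core is already able to keep the CPU busy.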

Resources