I can run the benchmark at 50% CPU utilization on an 8-core machine.
To achieve this, I have to configure the HTTP thread pool to 9,000 threads.
It's not reasonable to use thousands of threads on an 8-core machine.
Most of the threads are just waiting to read HTTP input.
There are around 6,000 client threads driving the benchmark.
But I don't think the server needs to keep one thread per connection.
Here is my JBoss configuration:
The application I'm deploying is FastAPI served by Uvicorn, running on Kubernetes (K8s).
While working out how to Dockerize the application, I decided to run Uvicorn without Gunicorn and to add a system that scales up/down based on the request load the application receives.
I did a lot of load testing and found that with the default of 1 Uvicorn worker I get 3.5 RPS, while with 8 workers I can easily get 22 RPS (I didn't test beyond that, since those are great results for me).
What I expected regarding resources was that I would need to provide a CPU limit of 8 (I assumed each worker runs in its own process and thread), but I only saw an increase in memory usage and barely any in CPU. Maybe that's because the app doesn't use much CPU, but is it even possible for it to use more than 1 CPU? So far it hasn't used more than one.
How do Uvicorn workers work? How should I calculate the number of workers the app needs? I haven't found useful information.
Again, my goal is to keep a slim machine of 1 CPU, with an autoscaling system.
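For context, a minimal sketch of the kind of launch I'm describing (the module path main:app and the port are placeholders for my setup):

    # run.py -- each Uvicorn worker is a separate OS process with its own event loop
    import uvicorn

    if __name__ == "__main__":
        # workers=8 forks 8 worker processes; total CPU usage can only exceed
        # 1 core if the request handlers themselves are CPU-bound
        uvicorn.run("main:app", host="0.0.0.0", port=8000, workers=8)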
I just implemented a simple TCP listener/server application in Erlang and tested it with 40,000 TCP connections; it created around 600 OS threads on a Windows machine. I then closed those 40K TCP connections, but the Erlang VM doesn't release the 600 OS/Windows threads. How do I optimize this? I'm looking forward to your reply.
The Windows implementation of erl_poll (the mechanism responsible for checking for new data on a socket) uses WaitForMultipleObjects. That API does not accept more than 64 objects per call, so erl_poll creates threads that each wait on up to 64 objects. These are the threads you see: 40,000 connections / 64 objects per thread ≈ 625 threads, which matches the ~600 you observed.
There is no mechanism that removes these threads once they have been used, so they will remain until the system terminates. They are, of course, reused if you ever open that many connections again.
The code that does this is in: https://github.com/erlang/otp/blob/master/erts/emulator/sys/win32/erl_poll.c
One of our production stacks, using NancyFX (self-hosted) on Mono 4.2.3 on Ubuntu 14.04 (on AWS EC2 behind an ELB), serves RESTful HTTP calls from external clients. Processing those requests often also involves making HTTP calls to other services, database lookups, and the like.
At a certain volume of incoming requests on each machine (one it seems to us the server should be able to handle gracefully), the servers start to stall, new incoming connections run into timeouts, and so on.
Using netstat, we saw that there are thousands of sockets in TIME_WAIT state. We presume those are either the cause of, or at least a symptom of, the problems we're having.
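For reference, a quick way to count those sockets (assuming GNU netstat on Linux) is:

    # count sockets currently sitting in TIME_WAIT
    netstat -ant | grep -c TIME_WAIT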
Does somebody with a similar setup have an idea how we could identify and fix the root cause of these problems?
We've already experimented with various methods of making HTTP requests in C# and with Mono startup parameters (--server, and setting the number of threads per CPU to sensible values), but we could not stop the servers from stalling and degrading catastrophically.
Thanks for any input!
I am running a large-scale ERP system on the following server configuration. The application is developed using AngularJS and ASP.NET 4.5.
Hardware: Dell PowerEdge R730 (quad-core 2.7 GHz, 32 GB RAM, 5 × 500 GB hard disks, RAID 5 configured).
Software: the host OS is VMware ESXi 6.0, running two VMs. One is Windows Server 2012 R2 with 16 GB of memory allocated; this contains the IIS 8 server with my application code. The other is also Windows Server 2012 R2, with SQL Server 2012 and 16 GB of memory allocated; this contains just my application database.
You see, I separated the application server and database server for load balancing purposes.
My application contains a registration module where the load is expected to be very high (around 10,000 visitors over 10 minutes).
To support this volume of requests, I have done the following on my IIS server:
-> increased the request queue length of each application pool to 5,000
-> enabled output caching for .aspx files
-> enabled static and dynamic compression in IIS
-> set the virtual memory limit and private memory limit of each application pool to 0
-> increased the maximum worker processes of each application pool to 6
I then used Gatling to run load tests on my application, injecting 500 users at once into the registration module.
However, I see that only 40-45% of my RAM is being used, and each worker process uses only about 130 MB at most.
Gatling reports that around 20% of my requests get a 403 error, and that more than 60% of all HTTP requests have a response time greater than 20 seconds.
A single user makes 380 HTTP requests over a span of around 3 minutes, transferring about 1.5 MB in total; I simulated 500 such users.
Is there anything missing in my server tuning? I have already tuned my application code to minimize memory leaks, increase timeouts, and so on.
There is a known issue with the newest generation of PowerEdge servers that use the Broadcom network chipset. Apparently, the "VM" feature of the NIC is broken, which results in horrible network latency on VMs.
Head to Dell and get the most recent firmware and Windows drivers for the Broadcom chipset.
Head to VMware Downloads and get the latest Broadcom driver.
As for the worker-process settings: for maximum performance, you should consider running the same number of worker processes as there are NUMA nodes, so that there is 1:1 affinity between worker processes and NUMA nodes. This can be done by setting the "Maximum Worker Processes" app-pool setting to 0. With this setting, IIS determines how many NUMA nodes are available on the hardware and starts the same number of worker processes.
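If you'd rather script that than click through IIS Manager, something like this should do it (the app pool name is a placeholder; appcmd.exe lives in %windir%\system32\inetsrv):

    REM set Maximum Worker Processes to 0 so IIS matches workers to NUMA nodes
    appcmd.exe set apppool "MyAppPool" /processModel.maxProcesses:0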
I guess the one caveat to the answer you received is that if your server isn't NUMA-aware (i.e., it uses symmetric multiprocessing), you won't see those IIS options under CPU; but the above poster seems to know a good bit more about the machine than I do. Sorry, I don't have enough street cred to add this as a comment.
As far as IIS goes, you may also want to make sure your app pool doesn't use the default recycle conditions; pick a time like midnight for recycling instead. If you have root-level settings applied, the default app-pool recycling at 29 hours may also trigger garbage collection against your child pool, causing delays even with concurrent GC. It sounds like you may benefit a bit from gcServer=true, though that's pretty tough to assess from here.
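If you do experiment with server GC, it's a one-line config switch; a sketch (for a standalone .NET app this goes in the app's .config file, while for IIS-hosted code the equivalent setting lives in Aspnet.config):

    <configuration>
      <runtime>
        <!-- use the server garbage collector instead of the workstation GC -->
        <gcServer enabled="true"/>
      </runtime>
    </configuration>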
Has your SQL Server been optimized for that type of workload? If your data isn't paramount, you could squeeze out faster execution times with delayed durability, and then assess queries that return too much data by looking at async I/O wait types. In general, there's not enough here to really assess SQL optimizations, but if the database isn't configured right (file size/growth options), you could be hitting a lot of timeouts due to autogrowth, VLF fragmentation, etc.
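For what it's worth, delayed durability is a per-database setting; a sketch of forcing it on (note it was introduced in SQL Server 2014, so it won't be available on the SQL Server 2012 instance described above; MyDb is a placeholder):

    -- trade a small window of possible data loss for faster commits
    ALTER DATABASE [MyDb] SET DELAYED_DURABILITY = FORCED;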
Suppose I am running a web application using Tornado, behind Nginx as a load balancer. Please tell me the best practices for a few things.
1. If I am running the service on an AWS EC2 instance, how many Nginx worker processes should I run for a given number x of vCPUs on any particular instance? Say I am running on an EC2 instance with 2 vCPUs: how many worker processes should I run? It would be good to know the general rule for this. Also, under what conditions should I increase the number of workers beyond the general rule?
2. Once Nginx is set up as the load balancer, it comes down to my Tornado application. How many Tornado instances should I run for x vCPUs on an EC2 instance? The docs say it's good to have 1 instance per processor, but is that always the best choice? If yes, in what scenario should I look at increasing the number of instances per processor? If not, what is the best rule?
NOTE: I am running the instances via Supervisord as my process-management program.
3. Now, if my application makes a lot of async calls to a MySQL database and a MongooseIM server, all running on the same host, should the number of Tornado instances per processor change? If yes, what is the rule? If not, what is the best practice?
If you are running nginx on a machine by itself, give it as many worker processes as you have CPUs. If it's running on the same machine as Tornado, you probably want to give it fewer (maybe just one). But it's better to be too high than too low here, so if you're unsure, using the number of CPUs is fine. You'll want more nginx workers if you're using TLS (especially with stronger security settings) or serving a lot of static files, and fewer if it's just a proxy to Tornado.
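In nginx.conf that comes down to a single directive; a minimal sketch (worker_processes auto sizes the pool to the CPU count and assumes nginx 1.2.5 or later):

    # one worker per CPU; set an explicit number (e.g. 1) if nginx shares the box with Tornado
    worker_processes auto;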
One Tornado instance per CPU is the best starting point. You might decrease this number if your application does a lot with threads or if there are other things running on the same machine, and you might increase it if you do any synchronous database/network calls without threads.
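Here's a minimal sketch of the one-process-per-CPU pattern from the Tornado docs, assuming a trivial handler (if you run separate instances under Supervisord instead, the effect is the same):

    import tornado.httpserver
    import tornado.ioloop
    import tornado.netutil
    import tornado.process
    import tornado.web

    class MainHandler(tornado.web.RequestHandler):
        def get(self):
            self.write("hello")

    app = tornado.web.Application([(r"/", MainHandler)])

    # bind first, then fork: all child processes share the listening socket
    sockets = tornado.netutil.bind_sockets(8888)
    tornado.process.fork_processes(0)  # 0 = one child process per CPU
    server = tornado.httpserver.HTTPServer(app)
    server.add_sockets(sockets)
    tornado.ioloop.IOLoop.current().start()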
As long as your database calls are asynchronous, they do not affect how many Tornado processes you should run.