When should I add (auto-scale) new Nginx server instances?

Should I take into consideration CPU utilization, network traffic, or HTTP response time checks? I've run some tests with Apache ab (from the same server, e.g. ab -k -n 500000 -c 100 http://192.XXX.XXX.XXX/) and monitored the load average. Even when the load was between 1.0 and 1.50 (one-core server), the "time per request" (mean) was pretty solid: 140 ms for a simple dynamic page with one set/get Redis operation. Anyway, I'm confused, as the general advice is to launch a new instance when you pass the 70% CPU utilization threshold.

70% CPU utilization is a good rule of thumb for CPU-bound applications like nginx. CPU time is kind of like body temperature: it hides a lot of different things, but it is a good general indicator of health. Load average is a separate measure of how many processes are waiting to be scheduled. The reason the rule is 70% (or 80%) utilization is that, past this point, CPU-bound applications tend to suffer contention-induced latency and non-linear performance degradation.
You can test this yourself by plotting throughput and latency (median and 90th percentile) against CPU utilization on your setup. Finding the inflection point for your particular system is important for capacity planning.
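As an illustration, here is a minimal C++11 sketch of such a sweep: it shells out to the same ab tool at increasing concurrency levels and prints ab's mean time-per-request line at each step, so you can plot latency against the CPU utilization you observe in parallel. The target address 192.0.2.1 is a placeholder for your own server.

#include <cstdio>
#include <cstdlib>
#include <string>

int main() {
    // Sweep concurrency levels; watch CPU utilization in another terminal
    // (e.g. top) to see where latency starts to climb non-linearly.
    const std::string target = "http://192.0.2.1/";  // placeholder address
    for (int c : {10, 25, 50, 100, 200, 400}) {
        std::printf("concurrency %3d: ", c);
        std::fflush(stdout);
        const std::string cmd = "ab -k -q -n 50000 -c " + std::to_string(c) +
                                " " + target +
                                " | grep 'Time per request' | head -n 1";
        std::system(cmd.c_str());  // prints ab's mean time-per-request line
    }
    return 0;
}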
A very good writeup of this phenomenon is given in Facebook's original paper on Dyno, their system for measuring throughput of PHP under load.
https://www.facebook.com/note.php?note_id=203367363919

Related

Jelastic - vertical scaling

I have an application on Tomcat (no horizontal scaling), and I send 20 requests. Each request executes complex calculations for 6-7 minutes.
I run the same workload on a standalone server with many cores (50+), and there each request executes in a thread on a separate logical processor; execution time is 3-4 minutes.
On the Jelastic platform, it scales up to 40-42 cloudlets (even though the max is 92). Is there a way to "influence" scaling, i.e. to tell Jelastic that it should use more cloudlets? (I suppose there are fewer than 20 processors behind 42 cloudlets and time is shared between threads.)
Also, if I send twice as many requests (40), it uses around 65 cloudlets (the max, in that case, is set to 160).
Can I find documented anywhere the rules by which Jelastic decides how to scale, and what I can do to make it use more cloudlets?
The overall compute power of your Jelastic node is determined by the cloudlet scaling limit. If you have a compute-intensive workload, you should set this value as high as possible, keeping your budget constraints in mind. (You can also configure load alerts to notify you about persistent excessive usage.)
Cloudlet consumption (in terms of billing) is calculated as the average CPU usage per hour, so if you have a spike of a few minutes, as in your example, you will want to allow it to use as much CPU as it can. For example, a burst that holds 40 cloudlets for six minutes and then sits idle averages out to only about 4 cloudlets over that hour.
The usage you see displayed is a function of how your workload spreads across cores, with potentially artificial limits on the actual compute power of each core due to the cloudlet limits you've set. To get the best performance, you should allow a very high cloudlet limit, because this gives you full access to each core instead of an (effectively) limited clock speed per core.
What you set as reserved cloudlets doesn't matter (it's purely a billing construct: no technical implications), and no cloudlets are added or removed from your server - again, this is purely a billing construct, not a technical one. This means there is no way to instruct Jelastic to use more cloudlets: that's happening purely due to how your code is executed (most likely the way it's spread across cores due to the overall cloudlet limits you've set vs. the compute power of the underlying hardware).
There is also a possibility that your performance will be different across different Jelastic providers depending on their hardware choices. For example some will use a high number of cores with lower clock speed, others may prioritise clock speed over the number of cores. Your application performance in that case will depend on how effectively it can be parallelised across cores (multi-threading); if your most intensive chunk of work doesn't parallelise, it will work best with a higher clock speed. Given that you get increased cloudlet consumption with more requests, it sounds like this is at least partially responsible in your case.
You might also consider that if you have better performance from dedicated hardware, you can get Jelastic as a private or hybrid cloud option. They may also offer different hardware options in different regions to cater for different types of workloads. Talk to your Jelastic provider and see what they can offer you.

An optimal number of worker threads in an async IO TCP server

We have migrated our thread-per-connection communication model to an async-IO-based TCP server using boost::asio. The reason for this change is that the old model didn't scale well enough. We permanently have about 2k persistent connections on average, with the number growing monthly.
My question is: what would be the ideal number of worker threads polling the io_service queue for completion handlers? Is it the number of virtual CPU cores?
Choosing a small number can lead to situations where the server does not consume events quickly enough and can't cope with the rate at which clients send messages.
Does it make sense to add worker threads dynamically in such situations?
Update:
Probably it is my implementation, but I find this part of the Boost.Asio documentation confusing:
Implementation strategies such as thread-per-connection (which a synchronous-only approach would require) can degrade system performance, due to increased context switching, synchronisation and data movement among CPUs. With asynchronous operations it is possible to avoid the cost of context switching by minimising the number of operating system threads — typically a limited resource — and only activating the logical threads of control that have events to process.
Say you have X threads pumping completion events on a machine that has X cores: 1) you don't have any guarantee that each thread gets a dedicated CPU, and 2) if my connections are persistent, I don't have any guarantee that the thread that does an async_read will be the same one that executes the completion handler.
void Connection::read() {
    // Queue an asynchronous read of an 18-byte message; the completion
    // handler may run on any thread that is calling io_service::run().
    boost::asio::async_read(socket, boost::asio::buffer(buffer, 18),
        boost::bind(&Connection::handleRead, shared_from_this(),
                    boost::asio::placeholders::error,
                    boost::asio::placeholders::bytes_transferred));
}

void Connection::handleRead(const boost::system::error_code &error,
                            std::size_t bytes_transferred) {
    if (error) return;  // connection closed or failed; stop reading

    // processing of the bytes
    // ...

    // keep processing on this socket
    read();
}
In an ideal situation with perfectly non-blocking I/O, a working set that fits entirely in L1 cache, and no other processes on the physical system, each thread will fully use the resources of a processor core. In such a situation, the ideal number of threads is one per logical core.
If any part of your I/O is blocking then it makes sense to add more threads than the number of cores so that no cores sit idle. If half the thread time is spent blocked then you should have almost 2 threads per core. If 75% of the thread time is spent blocked then you should have 3 or 4 per core, and so on. Context switching overhead counts as blocking for this purpose.
I've noticed that when Microsoft has to make a blind guess about this, they tend to go for two or four threads per core. Depending on your budget for making this determination, I would either go with 2 or 4, or start with one thread per core and work my way up, measuring throughput (requests serviced per second) and latency (min, max, and average response time) until I hit the sweet spot.
Dynamically adjusting this value only makes sense if you are dealing with radically different programs. For a predictable workload there is a sweet spot for your hardware that should not change much even with increasing amounts of work to be done. If you are making a general purpose web server then dynamic adjustment is probably necessary.
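To make that sizing rule concrete, here is a minimal sketch against the same io_service API the question uses: it computes a pool size as cores / (1 - blocked fraction), matching the reasoning above (50% blocked gives about 2 threads per core, 75% gives 4), and runs the io_service on that many threads. The 0.0 blocked fraction is an assumed value for the fully non-blocking case; measure your own.

#include <boost/asio.hpp>
#include <algorithm>
#include <cmath>
#include <thread>
#include <vector>

// threads ≈ cores / (1 - blockedFraction): 0.5 blocked -> ~2 per core,
// 0.75 blocked -> 4 per core, per the rule of thumb above.
unsigned poolSize(double blockedFraction) {
    unsigned cores = std::max(1u, std::thread::hardware_concurrency());
    return static_cast<unsigned>(std::ceil(cores / (1.0 - blockedFraction)));
}

int main() {
    boost::asio::io_service io;
    boost::asio::io_service::work work(io);  // keep run() from returning early

    std::vector<std::thread> pool;
    for (unsigned i = 0; i < poolSize(0.0); ++i)  // 0.0 = fully non-blocking
        pool.emplace_back([&io] { io.run(); });  // any thread may run any handler

    // ... set up acceptors/connections and post async work here ...

    io.stop();
    for (auto &t : pool) t.join();
    return 0;
}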

Why is Process CPU over 100%?

I've begun to track my ASP.NET application metrics, but for Servers -> Process CPU (avg) I'm getting values above 100% (194% or more). What does that mean?
Probably it means it's a multi-threaded process that's keeping 1.94 CPUs busy, on average.
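You can see this for yourself with a minimal C++ demo: two threads spinning flat out will show up in top (or most per-process monitors) as roughly 200% CPU, because the figure is a sum over cores rather than a fraction of the whole machine.

#include <thread>
#include <vector>

int main() {
    // Two busy-looping threads: top will report this process at ~200% CPU.
    std::vector<std::thread> spinners;
    for (int i = 0; i < 2; ++i)
        spinners.emplace_back([] {
            volatile unsigned long x = 0;
            for (;;) ++x;  // spin forever, never yielding the CPU
        });
    for (auto &t : spinners) t.join();  // never returns; stop with Ctrl+C
}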

Does high CPU usage indicate a software module is designed wrong

We have a process that is computationally intensive. When it runs, it typically uses 99% of the available CPU. It is configured to take advantage of all available processors, so I believe this is OK. However, one of our customers is complaining because alarms go off on the server on which this process runs, due to the high CPU utilization. I think there is nothing wrong with high CPU utilization per se. The CPU drops back to normal when the process stops running, and the process does run to completion (no infinite loops, etc.). I'm just wondering if I am on solid ground when I say that high CPU usage is not a problem per se.
Thank you,
Elliott
if I am on solid ground when I say that high CPU usage is not a problem per se
You are on solid ground.
We have a process that is computationally intensive
Then I'd expect high CPU usage.
The CPU drops back to normal when the process stops running
Sounds good so far.
Chances are that the systems your client is using are configured to notify when CPU usage goes over a certain limit, as sometimes this is indicative of a problem (and sustained high usage can cause overheating and associated problems).
If this is expected behavior, your client needs to adjust their monitoring - but you need to ensure that the behavior is as expected on their systems and that it is not likely to cause problems (ensure that high CPU usage is not sustained).
An alarm by itself is not evidence of poor design. The real concern may be that the process chokes other tasks on the system. Modern OSes usually take care of this by lowering the dynamic priority of a CPU-hungry process so that less demanding processes get higher priority. You may tell the customer to "nice" the process to start with, since you probably don't care whether it runs in 10 minutes or 12. Just my 2 cents :)
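If adjusting the launch command isn't convenient, the process can also lower its own priority at startup. A minimal POSIX sketch, equivalent in spirit to nice -n 10 (the value 10 is an arbitrary example):

#include <sys/resource.h>  // setpriority(), PRIO_PROCESS
#include <cstdio>

int main() {
    // Lower our own scheduling priority (higher nice value = lower priority).
    if (setpriority(PRIO_PROCESS, 0, 10) != 0) {  // who = 0 means this process
        std::perror("setpriority");
        return 1;
    }
    // ... run the CPU-intensive computation here ...
    return 0;
}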

What is the highest number of threads that is reasonable to simultaneously run in Jmeter?

I want to use the highest possible number of threads (so as to use fewer computers), but without making the client the bottleneck.
JMeter can simulate a very high load, provided you use it right.
Don't listen to urban legends that say JMeter cannot handle high load.
Now, as for an answer, it depends on:
your machine's power
your JVM: 32-bit or 64-bit
your JVM allocated memory (-Xmx)
your test plan (lots of Beanshell, post-processors, XPath, etc. means lots of CPU)
your OS configuration (tunables)
GUI / non-GUI mode
So there is no theoretical answer, but following best practices will ensure JMeter performs well.
Note that with JMeter you can distribute load through remote testing; read:
Remote Testing > 15.4 Using a different sample sender
And finally use cloud based testing if it's not enough.
Read this for tuning tips:
http://www.ubik-ingenierie.com/blog/jmeter_performance_tuning_tips/
Read this book for doing load testing and using JMeter correctly.
I have used JMeter a fair bit and found it is not great at generating really high load. On a 2 GHz Core 2 Duo with 2 GB of memory you can reasonably expect about 100 threads.
That being said, it is best to run it such that the CPU of the load-generating PC does not peak at 100%; a stable 80-90% is best, otherwise the results are affected.
I have also tried WAPT 5: it successfully ran 1000+ threads from the same PC. It is not free, but it is more usable than JMeter, though it doesn't have all of the features.
Outdated answer (since at least version 2.6); see https://stackoverflow.com/a/11922239/460802 for a more up-to-date one.
The JMeter Wiki reports cases where JMeter was used with as many as 1000 threads. I have used it with at most 100 threads, but the links in the Wiki suggest resource reductions I never tried.
One of the issues we had with running JMeter on Windows XP was the Windows XP TCP connection limit. The limit should be removed in order to use JMeter to the workstation's full potential.
More info here. AFAIK, this does not apply to other OSes.
I have used JMeter since 2004 and have launched lots of load tests.
On a Windows 7 64-bit PC with 4 GB RAM and a Core i5:
I think JMeter can support 300 to 400 concurrent threads for the HTTP Sampler with only one Aggregate Report listener, which writes results and timers between page calls to the log file.
For a big load test you could configure JMeter with slaves (load generators) like this
http://jmeter-plugins.org/wiki/HttpSimpleTableServer/
I have already done tests with 11 PC slaves to simulate 5000 threads.
I have not used JMeter, but the answer probably depends on your hardware. Your best bet might be to establish performance metrics, guess at the number of threads, and then run a binary search, as follows (source: Wikipedia).
Number guessing game...
This rather simple game begins something like "I'm thinking of an integer between forty and sixty inclusive, and to your guesses I'll respond 'High', 'Low', or 'Yes!' as might be the case." Supposing that N is the number of possible values (here, twenty-one, as "inclusive" was stated), then at most ⌈log2(N)⌉ questions are required to determine the number, since each question halves the search space. Note that one less question (iteration) is required than for the general algorithm, since the number is already constrained to be within a particular range.
Even if the number we're guessing can be arbitrarily large, in which case there is no upper bound N, we can still find the number in at most 2⌈log2(k)⌉ steps (where k is the (unknown) selected number) by first finding an upper bound by repeated doubling. For example, if the number were 11, we could use the following sequence of guesses to find it: 1, 2, 4, 8, 16, 12, 10, 11.
One could also extend the technique to include negative numbers; for example the following guesses could be used to find −13: 0, −1, −2, −4, −8, −16, −12, −14, −13
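Applied to thread counts, a sketch of that doubling-then-bisecting search might look like the following. clientHandles() is a hypothetical stand-in for "run the test plan at this thread count and check the client machine stays healthy", hard-coded here so the sketch runs as-is.

#include <cstdio>

// Hypothetical predicate: in reality, run your JMeter plan with this many
// threads and check that client CPU stays below ~80% with no errors. A
// fixed capacity of 300 stands in so the sketch is self-contained.
static bool clientHandles(int threads) { return threads <= 300; }

static int findMaxThreads() {
    // Find an upper bound by repeated doubling, as in the excerpt above.
    int lo = 1, hi = 2;
    while (clientHandles(hi)) { lo = hi; hi *= 2; }
    // Binary search between lo (known good) and hi (known too high).
    while (hi - lo > 1) {
        int mid = lo + (hi - lo) / 2;
        if (clientHandles(mid)) lo = mid; else hi = mid;
    }
    return lo;
}

int main() { std::printf("max workable threads: %d\n", findMaxThreads()); }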
It depends more on the kind of performance testing you do (load, spike, endurance, etc.) on a specific server (and a little on the hardware).
Keep these parameters in mind:
- On the client machine where you run JMeter, a certain amount of heap memory will be allocated; ensure a healthy allocation so that the script does not error out. The highest I have run in JMeter was 1500 threads in a local environment (client-server architecture); on a web architecture, the highest I ran was limited to 250 threads by the non-functional requirements.
So it ultimately depends on the kind of performance testing, the deployment style, and so on.
There is no standard number for this. The maximum number of threads that you can generate from one computer depends entirely on the computer's hardware and the OS. The OS by default occupies a certain amount of CPU and RAM.
To find out the maximum number of threads your computer can handle, prepare a sample test and run it with only a few threads. Then, with each test run, increase the number of threads gradually. While doing this, also monitor the CPU, RAM, disk I/O, and network I/O of your computer. The moment any of these reaches or exceeds 80% (it's for you to decide whether "near 80%" already counts), that is the maximum number of threads your computer can handle. To be on the safe side, I would stop at the number where resource utilization reaches 70%.
It'll depend on the hardware you run on as well as the underlying script. I've always felt that this fuzziness is the biggest problem with traditional load testing tools. If you've got a small budget ($200 or so gets you a LOT of testing), check out my company's load testing service, BrowserMob.
Besides our Real Browser Users (RBUs), which control thousands of actual browsers for the purpose of performance and load testing, we also have traditional virtual users (VUs). Scripts are written in JavaScript and can make various HTTP calls.
The reason I bring it up is that I always felt that the game of trying to figure out how many VUs you can fit on your load gen hardware is dangerous. It's so easy to get bad results without realizing it.
To solve that for BrowserMob, we took an extremely conservative approach on the number of VUs and RBUs per CPU core: no more than 1 browser or 50 threads per CPU core, and sometimes much less. In the world of cloud computing, CPU cycles are so cheap that it just doesn't make sense to try to overload machines.
