MemSQL: High CPU usage

My cluster has one MASTER AGGREGATOR and one LEAF. After running for two months, the CPU usage on the LEAF is very high, almost 100%. So, is this normal?
By the way, the table data size is 545 MB.

This is not normal for MemSQL operation. Note that the Ops console is showing you all CPU use on that host, not just what MemSQL is using. I recommend running 'top' or similar to determine what process(es) are consuming resources.
You can also run 'SHOW PROCESSLIST' on any node to see if there is a long-running MemSQL process.
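For example (a minimal check; MemSQL speaks the MySQL wire protocol, so any MySQL client works, and the host/port shown are common defaults you may need to adjust for your deployment):

    top    # default sort is by CPU, so the hungry process shows at the top
    mysql -h 127.0.0.1 -P 3306 -u root -e 'SHOW PROCESSLIST'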

Related

Max number of workers/slaves for parallel job snow

I'm running a foreach loop with the snow back-end on a Windows machine. I have 8 cores to work with. The R script is executed via a system call embedded in a python script, so there is an active python instance too.
Is there any benefit to not having #workers = #cores and instead #workers < #cores, so there is always an opening for system processes or the python instance?
It runs successfully with #workers = #cores, but do I take a performance hit by saturating the cores (max possible threads) with the R worker instances?
It will depend on:
1) Your processor (specifically hyperthreading)
2) How much info has to be copied to/from the different images
3) Whether you're implementing this over multiple boxes (LAN)
For 1), hyperthreading helps. I know my machine does it, so I typically have twice as many workers as cores, and my code completes in about 85% of the time compared to matching the number of workers to cores. It won't improve more than that.
For 2), if you're not forking (using sockets, for instance), you're working as if in a distributed-memory paradigm, which means creating one copy in memory for every worker. This can take a non-trivial amount of time. Also, multiple images on the same machine may take up a lot of space, depending on what you're working on. I often match the number of workers to the number of cores because doubling the workers would make me run out of memory.
This is compounded by 3), network speeds over multiple workstations. Locally between machines, our switch transfers at about 20 MB/s, which is 10x faster than my internet download speed at home, but a snail's pace compared to making copies in the same box.
You might consider increasing the R process's nice value so that python has priority when it needs to do something.
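For instance (a sketch; the script name is illustrative), the python wrapper's system call could launch the R side at a lower priority:

    nice -n 10 Rscript run_foreach_job.R    # higher nice value = lower priority, so python wins contention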

Limit CPU & Memory for *nix Process

Is it possible to limit CPU & Memory for the *nix Process?
The CPU limit may look like "use no more than 10% of one core".
The memory limit may look like "use no more than 100 MB"; the OS may enforce the limit or kill the process if it tries to exceed it, both ways are fine.
Any *nix that could do that would be fine.
It seems possible to implement this with virtual machines, but that is not acceptable because the overhead is too high.
If you happen to use Solaris, the ability to limit resource usage is a native feature.
Memory (RAM) usage can be capped per process using the rcap.max-rss setting, while CPU usage can be limited per project using the project.cpu-cap resource control.
Note that Solaris also offers OS-level virtualization (a.k.a. zones), which has no significant overhead, if any, compared to a bare-metal OS instance.
Resource capping is part of Solaris zones configuration.
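A sketch of what that configuration can look like (the project name and cap values are illustrative, and exact syntax varies across Solaris releases):

    projadd -K 'rcap.max-rss=100MB' -K 'project.cpu-cap=(privileged,10,deny)' user.batch
    rcapadm -E                        # enable the resource-capping daemon that enforces rcap.max-rss
    newtask -p user.batch ./myprog    # run the process inside the capped project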
Try CPULimit
cpulimit is a simple program which attempts to limit the cpu usage of a process (expressed in percentage, not in cpu time). This is useful to control batch jobs, when you don't want them to eat too much cpu. It does not act on the nice value or other scheduling priority stuff, but on the real cpu usage. Also, it is able to adapt itself to the overall system load, dynamically and quickly.
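For instance, holding an already-running process (the PID is illustrative) to 10% of one core:

    cpulimit -p 1234 -l 10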

Should CPU usage be restricted?

I have a computation-intensive application, which needs to run on a Windows server with no other applications on it. The application is designed for horizontal scalability so that it can run on multiple servers if the input load is more. Should I be worried about the CPU usage and ensure that every time it goes over a certain threshold, I should bring in a new server and get the application running on it to spread the load? Or is it ok if the app runs continuously at 100% CPU load?
Basically are there any disadvantages of letting an app run at 100%?
I understand overheating of the CPU could be an issue.
Also, context switching between the application's threads could reduce overall throughput.
Any other? Is there any guideline regarding a threshold to be set for CPU utilization?
Thanks,
Yash
Intel's i5 and i7 series support hyper-threading and turbo boost, which from my knowledge can push the CPU to an effective 105-110%.
Overheating will be a problem, but if you can set affinities for different cores at different times, then on a quad core two cores can run on turbo boost while the other two cool down. Assuming you can do that.
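For example, pinning a process to two of four cores looks like this on Linux with taskset (the question targets Windows, where the equivalent is Task Manager's affinity setting or start /affinity):

    taskset -c 0,1 ./compute_app    # restrict to cores 0 and 1, leaving the other two to cool down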
I hope I answered your question (in a way).

How to use GNU make --max-load on a multicore Linux machine?

From the documentation for GNU make: http://www.gnu.org/software/make/manual/make.html#Parallel
When the system is heavily loaded, you will probably want to run fewer
jobs than when it is lightly loaded. You can use the ‘-l’ option to
tell make to limit the number of jobs to run at once, based on the
load average. The ‘-l’ or ‘--max-load’ option is followed by a
floating-point number. For example,
-l 2.5
will not let make start more than one job if the load average is above 2.5.
The ‘-l’ option with no following number removes the load limit, if one was
given with a previous ‘-l’ option.
More precisely, when make goes to start up a job, and it already has
at least one job running, it checks the current load average; if it is
not lower than the limit given with ‘-l’, make waits until the load
average goes below that limit, or until all the other jobs finish.
From the Linux man page for uptime: http://www.unix.com/man-page/Linux/1/uptime/
System load averages is the average number of processes that are
either in a runnable or uninterruptable state. A process in a runnable
state is either using the CPU or waiting to use the CPU. A process
in uninterruptable state is waiting for some I/O access, eg waiting
for disk. The averages are taken over the three time intervals.
Load averages are not normalized for the number of CPUs in a system,
so a load average of 1 means a single CPU system is loaded all the
time while on a 4 CPU system it means it was idle 75% of the time.
I have a parallel makefile and I want to do the obvious thing: have make keep adding processes until I get full CPU usage without inducing thrashing.
Many (all?) machines today are multicore, so that means that the load average is not the number make should be checking, as that number needs to be adjusted for the number of cores.
Does this mean that the --max-load (aka -l) flag to GNU make is now useless? What are people doing who are running parallel makefiles on multicore machines?
My short answer: --max-load is useful if you're willing to invest the time it takes to make good use of it. With its current implementation there's no simple formula to pick good values, or a pre-fab tool for discovering them.
The build I maintain is fairly large. Before I started maintaining it the build was 6 hours. With -j64 on a ramdisk, now it finishes in 5 minutes (30 on an NFS mount with -j12). My goal here was to find reasonable caps for -j and -l that allows our developers to build quickly but doesn't make the server (build server or NFS server) unusable for everyone else.
To begin with:
If you choose a reasonable -jN value (on your machine) and find a reasonable upper bound for load average (on your machine), they work nicely together to keep things balanced (see the sketch after this list).
If you use a very large -jN value (or leave it unspecified, e.g. -j without a number) and limit the load average, gmake will:
- continue spawning processes (gmake 3.81 added a throttling mechanism, but that only helps mitigate the problem a little) until the max # of jobs is reached or until the load average goes above your threshold
- while the load average is over your threshold:
  - do nothing until all sub-processes are finished
  - spawn one job at a time
- do it all over again
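As a concrete starting point for that balance (a heuristic sketch assuming Linux's nproc; the values are not measured optima):

    CORES=$(nproc)              # number of online cores
    make -j"$CORES" -l"$CORES"  # cap both job count and load average near the core count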
On Linux at least (and probably other *nix variants), load average is an exponential moving average (UNIX Load Average Reweighed, Neil J. Gunther) that represents the avg number of processes waiting for CPU time (can be caused by too many processes, waiting for IO, page faults, etc). Since it's an exponential moving average, it's weighted such that newer samples have a stronger influence on the current value than older samples.
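Concretely, in the 1-minute case the kernel recomputes that average every 5 seconds, roughly as (a restatement of the standard formula from that literature, shown as a sketch):

    load(t) = load(t - 5s) * e^(-5/60) + n(t) * (1 - e^(-5/60))

where n(t) is the number of runnable/uninterruptible processes at the sample instant.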
If you can identify a good "sweet spot" for the right max load and number of parallel jobs (through a combination of educated guesses and empirical testing), assuming you have a long running build: your 1 min avg will hit an equilibrium point (won't fluctuate much). However, if your -jN number is too high for a given max load average, it'll fluctuate quite a bit.
Finding that sweet spot is essentially equivalent to finding optimal parameters to a differential equation. Since it will be subject to initial conditions, the focus is on finding parameters that get the system to stay at equilibrium as opposed to coming up with a "target" load average. By "at equilibrium" I mean: 1m load avg doesn't fluctuate much.
Assuming you're not bottlenecked by limitations in gmake: when you've found a -jN -lM combination that gives a minimum build time, that combination will be pushing your machine to its limits. If the machine needs to be used for other purposes, you may want to scale it back a bit when you're finished optimizing.
Without regard to load avg, the improvements I saw in build time with increasing -jN appeared to be [roughly] logarithmic. That is to say, I saw a larger difference between -j8 and -j12 than between -j12 and -j16.
Things peaked for me somewhere between -j48 and -j64 (on the Solaris machine it was about -j56) because the initial gmake process is single-threaded; at some point that thread cannot start new jobs faster than they finish.
My tests were performed on:
- A non-recursive build
  - recursive builds may see different results; they won't run into the bottleneck I did around -j64
  - I've done my best to minimize the amount of make-isms (variable expansions, macros, etc) in recipes, because recipe parsing occurs in the same thread that spawns parallel jobs. The more complicated recipes are, the more time make spends in the parser instead of spawning/reaping jobs. For example:
    - No $(shell ...) macros are used in recipes; those are run during the 1st parsing pass and cached
    - Most variables are assigned with := to avoid recursive expansion
- Solaris 10/sparc
  - 256 cores
  - no virtualization/logical domains
  - the build ran on a ramdisk
- x86_64 linux
  - 32-core (4x hyper-threaded)
  - no virtualization
  - the build ran on a fast local drive
Even for a build where the CPU is the bottleneck, -l is not ideal. I use -jN, where N is the number of cores that exist or that I want to spend on the build. Choosing a bigger number doesn't speed up the build in my situation. It doesn't slow it down either, as long as you don't go overboard (such as unlimited launching through -j).
Using -lN is broadly equivalent to -jN, and can work better if the machine has other independent work to do, but there are two quirks (apart from the one you mentioned, that the number of cores is not accounted for):
Initial spike: when the build starts, make launches a lot of jobs, many more than N. The system load number doesn't immediately increase when a process is forked. That's not a problem in my situation.
Starvation: when some build jobs take a long time and others are equally quick, at the moment the first M quick jobs end jointly, the system load is still ≥N. Soon the system load drops to N - M, but as long as those few slow jobs are dragging on, no new jobs are launched, and cores are left hungry. Make only thinks about launching a new job when an old job ends, and at the start. It doesn't notice the system load dropping in between.
Many (all?) machines today are multicore, so that means that the load
average is not the number make should be checking, as that number
needs to be adjusted for the number of cores.
Does this mean that the --max-load (aka -l) flag to GNU make is now
useless?
No. Imagine jobs with demanding disk i/o. If you started as many jobs as you had CPUs, you still wouldn't utilize the CPU very well.
Personally, I simply use -j because so far it worked well enough for me.
Does this mean that the --max-load (aka -l) flag to GNU make is now useless? What are people doing who are running parallel makefiles on multicore machines?
One example is running jobs in a test suite where each test has to compile and link a program. Linking sometimes loads the system too much, resulting in: fatal error: ld terminated with signal 9 [Killed]. In my case it was not memory overhead but CPU usage, so the usually suggested swap file didn't help.
With option -l 1, execution is still parallel but linking is almost sequential:
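A sketch of such an invocation (the job count and target name are illustrative):

    make -j8 -l1 check    # up to 8 job slots, but no new job starts while the load average is >= 1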
This is really about finding the right balance between RAM usage and CPU usage: the RAM has to feed the CPU with data and the CPU needs to do the work, so they need to work in sync, and that depends on your machine's exact specs.
For my system (CPU: i5-1035G4, 4 cores, 8 threads; RAM: 8 GB plus 10 GB swap with swappiness at 99%) the best settings were: -l 1.9 -j7.
With those settings my system compiled quickly at about 50% capacity, so I could still use it for everything else in the foreground.

What is the highest number of threads that is reasonable to simultaneously run in Jmeter?

I want to use the highest possible number of threads (to use fewer computers) but without making the client the bottleneck.
JMeter can simulate a very High Load provided you use it right.
Don't listen to Urban Legends that say JMeter cannot handle high load.
Now as for an answer, it depends on:
your machine's power
your JVM: 32-bit or 64-bit
your JVM allocated memory (-Xmx)
your test plan (lots of Beanshell, post-processors, XPath ... means lots of CPU)
your OS configuration (tunables)
GUI / non-GUI mode
So there is no theoretical answer, but following Best Practices will ensure JMeter performs well.
Note that with JMeter you can distribute load through remote testing; read:
Remote Testing > 15.4 Using a different sample sender
And finally use cloud based testing if it's not enough.
Read this for tuning tips:
http://www.ubik-ingenierie.com/blog/jmeter_performance_tuning_tips/
Read this book for doing load testing and using JMeter correctly.
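For example, two of the knobs listed above (allocated heap and non-GUI mode) can be set like this (the 4g heap is illustrative, not a recommendation):

    JVM_ARGS="-Xmx4g" jmeter -n -t test_plan.jmx -l results.jtl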
I have used JMeter a fair bit and found it is not great at generating really high load. On a 2 GHz Core 2 Duo with 2 GB memory you can reasonably expect about 100 threads.
That being said, it is best to run it such that the PC's CPU does not peak at 100%; a stable 80-90% is best, otherwise the results are affected.
I have also tried WAPT 5; it successfully ran 1000+ threads from the same PC. It is not free, but it is more usable than JMeter, though it doesn't have all of the features.
Outdated answer since at least version 2.6; see https://stackoverflow.com/a/11922239/460802 for a more up-to-date one.
The JMeter Wiki reports cases where JMeter was used with as many as 1000 threads. I have used it with at most 100 threads, but the links in the Wiki suggest resource reductions I never tried.
One of the issues we had with running JMeter on Windows XP was the Windows XP TCP connection limit. The limit should be removed in order to use JMeter to the workstation's full potential.
More info here. AFAIK, this does not apply to other OSes.
I have used JMeter since 2004 and I have launched a lot of load tests.
With a Windows 7 64-bit PC, 4 GB RAM, Intel Core i5,
I think JMeter can support 300 to 400 concurrent threads for the HTTP (Sampler) protocol with only one Aggregate Report Listener, which writes results and timers between page calls to the log file.
For a big load test you could configure JMeter with slaves (load generators) like this:
http://jmeter-plugins.org/wiki/HttpSimpleTableServer/
I have already done tests with 11 PC slaves to simulate 5000 threads.
I have not used JMeter, but the answer probably depends on your hardware. The best bet might be to establish performance metrics, guess at the number of threads, and then run a binary search, as described below.
The source is Wikipedia.
Number guessing game...
This rather simple game begins something like "I'm thinking of an integer between forty and sixty inclusive, and to your guesses I'll respond 'High', 'Low', or 'Yes!' as might be the case." Supposing that N is the number of possible values (here, twenty-one, as "inclusive" was stated), then at most ⌈log₂ N⌉ questions are required to determine the number, since each question halves the search space. Note that one less question (iteration) is required than for the general algorithm, since the number is already constrained to be within a particular range.
Even if the number we're guessing can be arbitrarily large, in which case there is no upper bound N, we can still find the number in at most 2⌈log₂ k⌉ steps (where k is the (unknown) selected number) by first finding an upper bound by repeated doubling. For example, if the number were 11, we could use the following sequence of guesses to find it: 1, 2, 4, 8, 16, 12, 10, 11
One could also extend the technique to include negative numbers; for example the following guesses could be used to find −13: 0, −1, −2, −4, −8, −16, −12, −14, −13
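Applied to finding a maximum thread count, that strategy might look like the following sketch, where run_test is a hypothetical helper (not part of JMeter) that returns success if the client machine stays healthy at a given thread count:

    # Repeated doubling to find an upper bound, then binary search within it.
    low=0; high=1
    while run_test "$high"; do low=$high; high=$((high * 2)); done
    while [ $((high - low)) -gt 1 ]; do
        mid=$(( (low + high) / 2 ))
        if run_test "$mid"; then low=$mid; else high=$mid; fi
    done
    echo "Highest healthy thread count: $low"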
It depends more on the kind of performance testing you do (load, spike, endurance, etc.) against a specific server, and a little on the hardware.
Keep these parameters in mind:
- On the client machine where you run JMeter, a certain amount of heap memory is allocated; make sure the allocation is healthy so the script does not error out. The highest I have run in JMeter was 1500 threads in a local environment (client-server architecture); on a web architecture, the highest I ran was limited to 250 threads by the non-functional requirements.
So it ultimately depends on the kind of performance testing, the deployment style, and so on.
There is no standard number for this. The maximum number of threads you can generate from one computer depends entirely on the computer's hardware and the OS. The OS itself occupies a certain amount of CPU and RAM by default.
To find out the maximum threads your computer can handle, prepare a sample test and run it with only a few threads. Then increase the number of threads gradually with each test cycle. While doing this, monitor the CPU, RAM, disk I/O and network I/O of your computer. The moment any of these reaches or exceeds 80% (again, for you to decide whether near 80% is okay or only beyond it), that is the maximum number of threads your computer can handle. To be on the safer side, I would stop at the number where resource utilization reaches 70%.
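A sketch of such a stepped ramp-up (this assumes your test plan reads its thread count from a JMeter property, e.g. with ${__P(threads,10)} in the Thread Group):

    for n in 50 100 200 400 800; do
        jmeter -n -t plan.jmx -Jthreads="$n" -l "results_${n}.jtl"
        # between runs, check CPU, RAM, disk I/O and network I/O (top, vmstat, etc.)
        # and stop once any of them approaches your 70-80% threshold
    done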
It'll depend on the hardware you run on as well as the underlying script. I've always felt that this fuzziness is the biggest problem with traditional load testing tools. If you've got a small budget ($200 or so gets you a LOT of testing), check out my company's load testing service, BrowserMob.
Besides our Real Browser Users (RBUs), which control thousands of actual browsers for the purpose of performance and load testing, we also have traditional virtual users (VUs). Scripts are written in JavaScript and can make various HTTP calls.
The reason I bring it up is that I always felt that the game of trying to figure out how many VUs you can fit on your load gen hardware is dangerous. It's so easy to get bad results without realizing it.
To solve that for BrowserMob, we took an extremely conservative approach on the number of VUs and RBUs per CPU core: no more than 1 browser or 50 threads per CPU core, and sometimes much less. In the world of cloud computing, CPU cycles are so cheap that it just doesn't make sense to try to overload machines.
