Optimal parallelism for a given project with GNU make - gnu-make

I'd like to know the optimal number of cores needed to build a project with GNU make.
I can use --max-load to tune for an existing system, but I want to know if doubling or tripling the core count and memory would improve build wall clock times.
If I could collect statistics on how many recipes make holds waiting for a free core to execute and how long they occupy the core, this could be turned into a standard job scheduling problem.

I don't think there's any way to answer your question, really. Maybe you can be more specific about what you'd like to know.
Obviously the more cores you have, assuming sufficient memory to support them, the more recipes make can invoke in parallel without crushing your system.
If you have 2 cores and you run make -j300 then make will dutifully invoke 300 jobs at once and your operating system will dutifully attempt to run all of them at the same time. Most likely, your system will be swapping and context switching so much that it will make very little progress and it would take less wall clock time to run make -j2 instead.
On the other hand, if you have 256 cores then make -j300 is probably quite reasonable... assuming you have enough memory to ensure that all those jobs don't wait swapping memory out.
And of course, at some point (though probably far beyond any reasonable number of cores you'd have, unless you have a lot of money to spend) you will run into disk I/O issues, with so many compiler processes running at the same time all trying to read source files from the disk.
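If you want to turn that into numbers for your own project, one hedged approach (the clean and all targets and the -j values below are placeholders for whatever your build actually uses) is to time a clean build at several -j settings with GNU time and look for the point where wall-clock time stops improving:
for jobs in 2 4 8 16 32; do
    make clean > /dev/null
    # GNU time: %e = elapsed wall-clock seconds, %P = CPU utilization
    /usr/bin/time -f "-j$jobs: %e s elapsed, %P CPU" make -j"$jobs" all > /dev/null
done
The knee of that curve tells you where the current machine saturates; it can't by itself predict a machine with two or three times the cores, which is where the per-recipe statistics you describe would come in.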

My go-to number is the number of CPUs + 1. This is based on a lot of informal benchmarks, and is usually very close to the optimal number: -j9 on a hyper-threaded four-core laptop, and -j49 on my usual production build server.
The + 1 means that make keeps all the CPUs occupied, even as jobs are being retired, and is usually a teensy-weensy bit faster than without the increment.
It also means that other users can use the same multiplier without melting the machine.
Be aware though, that although -j49 ensures there are only 49 processes actually running, the parent make will potentially have many more child processes than that. For instance, a single compile may mean the shell is called, which calls a shell script, which calls the compiler driver, which calls the correct compiler stage. On some toolchains my -j49 builds have a peak of 245 child processes. A bit annoying when my ulimit max user processes is only 512.
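On Linux with GNU coreutils, that rule of thumb can be applied without hard-coding the count (a small sketch; nproc reports online logical CPUs, so hyper-threads are already included):
make -j"$(( $(nproc) + 1 ))"
ulimit -u    # prints max user processes; worth checking against the limit issue mentioned above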

Related

Why does process creation using `clone` result in an out-of-memory failure?

I have a process that allocates about 20GB of RAM on a 32GB machine. After some events, I'm streaming the data from the parent process to stdin of the child process. It's mandatory to keep the 20GB of data in the parent process at the point when the child is spawned.
The app is written in Rust and I'm calling Command::new("path/to/command") to create the child process.
When I spawn the child process the operating system is trapping an out-of-memory error.
strace output:
[pid 747] 16:04:41.128377 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7ff4c7f87b10) = -1 ENOMEM (Cannot allocate memory)
Why does the trap occur? The child process should not consume more than 1GB and exec() is called immediately after clone().
The Problem
When a child process is created by the Rust call, several things happen at a C/C++ level. This is a simplification, but it will help explain the dilemma.
The streams are duplicated (with dup2 or a similar call)
The parent process is forked (with the fork or clone system call)
The forked process executes the child (with a call from the execvp family)
The parent and child are now concurrent processes. The Rust call you are currently using appears to be a clone call that behaves much like a pure fork, so you're 20G x 2 - 32G = 8G short, without considering the space needed by the operating system and anything else that might be running. The clone call returns a negative value and errno is set to ENOMEM.
If the architectural solutions of adding physical memory, compressing the data, or streaming it through a process that does not require the entirety of it to be in memory at any one time are not options, then the classic solution is reasonably simple.
Recommendation
Design the parent process to be lean. Then spawn two worker children, one that handles your 20GB need and the other that handles your 1 GB need [1]. These children can be connected to one another via pipe, file, shared memory, socket, semaphore, signalling, and/or other communication mechanism(s), just as a parent and child can be.
Many mature software packages from Apache httpd to embedded cell tower routing daemons use this design pattern. It is reliable, maintainable, extensible, and portable.
The 32G would then likely suffice for the 20G and 1G processing needs, along with OS and lean parent process.
Although this solution will surely solve your problem, if the code is to be reused or extended later, there may be value in looking into potential process design changes involving data frames or multidimensional slices to support streaming of data and memory requirement reductions.
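Purely as an illustration of that pattern, the lean parent can be as small as a shell wrapper that wires the two workers together; big_worker and small_worker here are hypothetical program names standing in for the 20GB and 1GB workloads:
#!/bin/sh
# Hypothetical lean parent: it never holds the 20GB itself, it only connects
# the two workers with a pipe; its exit status is that of the last stage.
big_worker --input /path/to/data | small_worker --mode stream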
Memory Overcommit Always
Setting overcommit_memory to 1 eliminates the clone error condition referenced in the question, because the Rust call ends up invoking the Linux clone call, which reads that setting. But there are several caveats with this solution that point back to the above recommendation as superior, primarily that a value of 1 is dangerous, especially for production environments.
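For completeness, the setting itself is just a sysctl; it is shown here only so the caveats below have a concrete referent, not as a recommendation:
cat /proc/sys/vm/overcommit_memory      # 0 = heuristic (default), 1 = always overcommit, 2 = strict
sudo sysctl vm.overcommit_memory=1      # takes effect immediately; not persistent across reboots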
Background
Kernel discussions about OpenBSD rfork and the clone call ensued in the late 1990s and early 2000s. The features stemming from those discussions permit forking that is less all-or-nothing than full process duplication, mirroring, from the other direction, the provision of greater independence between pthreads. Some of these discussions produced extensions to traditional process spawning that have entered POSIX standardization.
In the early 2000s, Linus Torvalds suggested a flag structure to determine which components of the execution model are shared and which are copied when execution forks, blurring the distinction between processes and threads. From this, the clone call emerged.
Over-committing memory is not discussed much, if at all, in those threads. The design goal was MORE control over the results of a fork, rather than delegating memory usage optimization to an operating system heuristic, which is what the default setting of overcommit_memory = 0 does.
Caveats
Memory overcommit goes beyond these extensions, adding the complexity of the trade-offs of its modes [2], design trend caveats [3], practical run time limitations [4], and performance impacts [5].
Portability and Longevity
Additionally, without standardization, code relying on memory overcommit may not be portable, and the question of longevity is pertinent, especially when a setting controls the behavior of a function. There is no guarantee of backward compatibility, or even a warning of deprecation, if the setting system changes.
Danger
The linuxdevcenter documentation [2] says, "1 always overcommits. Perhaps you now realize the danger of this mode.", and there are other indications of the danger of ALWAYS overcommitting [6], [7].
The implementers of overcommit on Linux, Windows, and VMware may guarantee reliability, but it is a statistical game that, combined with the many other complexities of process control, may lead to unstable behavior under certain conditions. Even the name overcommit tells us something about its true character as a practice.
A non-default overcommit_memory mode, for which several warnings are issued, may work for the immediate trial of the immediate case but later lead to intermittent reliability problems.
Predictability and Its Impact on System Reliability and Response Time Consistency
The idea of a process in a UNIX-like operating system, from its Bell Labs beginnings, is that a process makes a concrete request to its container, the operating system. The result is both predictable and binary: either the request is denied or it is granted. Once granted, the process is given complete control of and direct access to the resource until it relinquishes it.
The swap space aspect of virtual memory is a breach of this principle that shows up as gross deceleration of activity on workstations when RAM is heavily consumed. For instance, there are times during development when one presses a key and has to wait ten seconds to see the character on the display.
Conclusion
There are many ways to get the most out of physical memory, but doing so by hoping that the use of allocated memory will be sparse is likely to introduce negative impacts. Performance hits from swapping when overcommit is overused are the well-documented example. If you are keeping 20G of data in RAM, this is especially likely to be the case.
Allocating only what is needed, forking in intelligent ways, using threads, and freeing memory that is surely no longer needed lead to memory thrift without impacting reliability or creating spikes in swap disk usage, and they work without caveat up to the limits of system resources.
The position of the designer of the Command::new call may be based on this perspective. In this case, how soon after the fork the exec is called is not a determining factor in how much memory is requested during the spawn.
Notes and References
[1] Spawning worker children may require some code refactoring and appear to be too much trouble on a superficial level, but the refactoring may be surprisingly straightforward and significantly beneficial.
[2] http://www.linuxdevcenter.com/pub/a/linux/2006/11/30/linux-out-of-memory.html?page=2
[3] https://www.etalabs.net/overcommit.html
[4] http://www.gabesvirtualworld.com/memory-overcommit-in-production-yes-yes-yes/
[5] https://labs.vmware.com/vmtj/memory-overcommitment-in-the-esx-server
[6] https://github.com/kubernetes/kubernetes/issues/14452
[7] http://linuxtoolkit.blogspot.com/2011_08_01_archive.html

Using Linux's time utility to measure performance of MPI program

I'm benchmarking an MPI program with different compiler setups.
Right now I use Linux's time to do so:
$> $(which time) mpirun -v [executable]
The values I get look ok in terms of what I expected.
Are there any reasons why I should not be using time for this?
Measuring the needed CPU time is of main interest here.
I'm aware that benchmarking on a single machine is not necessarily going to be consistent with what's happening on a cluster, but this is out of scope.
You should not use time for the purpose of getting the CPU time of an MPI program.
Firstly, that is not going to work in a distributed setup. Now your question is not clear on whether you target a single node or a cluster, but that doesn't even matter. An MPI implementation may use whatever mechanism it likes for launching processes, even on a single node. So time may or may not include the CPU time of the actual application processes.
But there are more conceptual issues: What does CPU time for an MPI program mean? That would be the sum of the CPU time of all processes. That is a bad metric for benchmarking: it does not quantify improvement, and it does not correlate with overall runtime. For instance, a very imbalanced version of your code may use less CPU time but more wall time than a balanced one. Or enabling busy waiting instead of blocking may improve overall runtime, but also increase the CPU time used. To really understand what is happening, and which process uses what kind of resources, you should resort to a proper parallel performance analysis tool.
In HPC, you are not going to be budgeted by CPU time but rather by reserved CPUs * walltime. So if you must use a one-dimensional metric, then walltime is the way to go. Now you can use time mpirun ... to get that, although accuracy won't be great for short-running applications.
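If walltime measured with time is what you settle on, the external GNU time binary can at least label the numbers explicitly; mpirun's -np value and the executable below are placeholders:
/usr/bin/time -f "wall: %e s, user CPU: %U s, sys CPU: %S s" mpirun -np 4 ./my_mpi_app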

How to use GNU make --max-load on a multicore Linux machine?

From the documentation for GNU make: http://www.gnu.org/software/make/manual/make.html#Parallel
When the system is heavily loaded, you will probably want to run fewer
jobs than when it is lightly loaded. You can use the ‘-l’ option to
tell make to limit the number of jobs to run at once, based on the
load average. The ‘-l’ or ‘--max-load’ option is followed by a
floating-point number. For example,
-l 2.5
will not let make start more than one job if the load average is above 2.5.
The ‘-l’ option with no following number removes the load limit, if one was
given with a previous ‘-l’ option.
More precisely, when make goes to start up a job, and it already has
at least one job running, it checks the current load average; if it is
not lower than the limit given with ‘-l’, make waits until the load
average goes below that limit, or until all the other jobs finish.
From the Linux man page for uptime: http://www.unix.com/man-page/Linux/1/uptime/
System load averages is the average number of processes that are
either in a runnable or uninterruptable state. A process in a runnable
state is either using the CPU or waiting to use the CPU. A process
in uninterruptable state is waiting for some I/O access, eg waiting
for disk. The averages are taken over the three time intervals.
Load averages are not normalized for the number of CPUs in a system,
so a load average of 1 means a single CPU system is loaded all the
time while on a 4 CPU system it means it was idle 75% of the time.
I have a parallel makefile and I want to do the obvious thing: have make keep adding processes until I'm getting full CPU usage without inducing thrashing.
Many (all?) machines today are multicore, so that means that the load average is not the number make should be checking, as that number needs to be adjusted for the number of cores.
Does this mean that the --max-load (aka -l) flag to GNU make is now useless? What are people doing who are running parallel makefiles on multicore machines?
My short answer: --max-load is useful if you're willing to invest the time it takes to make good use of it. With its current implementation there's no simple formula to pick good values, or a pre-fab tool for discovering them.
The build I maintain is fairly large. Before I started maintaining it the build was 6 hours. With -j64 on a ramdisk, now it finishes in 5 minutes (30 on an NFS mount with -j12). My goal here was to find reasonable caps for -j and -l that allows our developers to build quickly but doesn't make the server (build server or NFS server) unusable for everyone else.
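For reference, a common way to set up such a build ramdisk on Linux is a tmpfs mount; the mount point and size here are illustrative, and a tmpfs that fills up will fail allocations or push the system into swap:
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=16G tmpfs /mnt/ramdisk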
To begin with:
If you choose a reasonable -jN value (on your machine) and find a reasonable upper bound for load average (on your machine), they work nicely together to keep things balanced.
If you use a very large -jN value (or unspecified; eg, -j without a number) and limit the load average, gmake will:
continue spawning processes (gmake 3.81 added a throttling mechanism, but that only helps mitigate the problem a little) until the max # of jobs is reached or until the load average goes above your threshold
while the load average is over your threshold:
do nothing until all sub-processes are finished
spawn one job at a time
do it all over again
On Linux at least (and probably other *nix variants), load average is an exponential moving average (UNIX Load Average Reweighed, Neil J. Gunther) that represents the avg number of processes waiting for CPU time (can be caused by too many processes, waiting for IO, page faults, etc). Since it's an exponential moving average, it's weighted such that newer samples have a stronger influence on the current value than older samples.
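As a rough sketch of that recurrence for the 1-minute figure (the kernel samples the run queue roughly every 5 seconds; see the Gunther paper for the full derivation):
load_new = load_old * exp(-5/60) + n_active * (1 - exp(-5/60))
where n_active is the number of runnable (plus uninterruptible) tasks at the sample point.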
If you can identify a good "sweet spot" for the right max load and number of parallel jobs (through a combination of educated guesses and empirical testing), assuming you have a long running build: your 1 min avg will hit an equilibrium point (won't fluctuate much). However, if your -jN number is too high for a given max load average, it'll fluctuate quite a bit.
Finding that sweet spot is essentially equivalent to finding optimal parameters to a differential equation. Since it will be subject to initial conditions, the focus is on finding parameters that get the system to stay at equilibrium as opposed to coming up with a "target" load average. By "at equilibrium" I mean: 1m load avg doesn't fluctuate much.
Assuming you're not bottlenecked by limitations in gmake: When you've found a -jN -lM combination that gives a minimum build time: that combination will be pushing your machine to its limits. If the machine needs to be used for other purposes ...
... you may want to scale it back a bit when you're finished optimizing.
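For illustration only (the numbers are not recommendations, they just show the shape of the invocation once a sweet spot has been found), the two caps are combined on one command line:
make -j64 -l 48    # at most 64 jobs, and no new jobs started while the load average is 48 or higher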
Without regard to load avg, the improvements I saw in build time with increasing -jN appeared to be [roughly] logarithmic. That is to say, I saw a larger difference between -j8 and -j12 than between -j12 and -j16.
Things peaked for me somewhere between -j48 and -j64 (on the Solaris machine it was about -j56) because the initial gmake process is single-threaded; at some point that thread cannot start new jobs faster than they finish.
My tests were performed on:
A non-recursive build
recursive builds may see different results; they won't run into the bottleneck I did around -j64
I've done my best to minimize the amount of make-isms (variable expansions, macros, etc) in recipes because recipe parsing occurs in the same thread that spawns parallel jobs. The more complicated recipes are, the more time it spends in the parser instead of spawning/reaping jobs. For example:
No $(shell ...) macros are used in recipes; those are run during the 1st parsing pass and cached
Most variables are assigned with := to avoid recursive expansion
Solaris 10/sparc
256 cores
no virtualization/logical domains
the build ran on a ramdisk
x86_64 linux
32-core (4x hyper threaded)
no virtualization
the build ran on a fast local drive
Even for a build where the CPU is the bottleneck, -l is not ideal. I use -jN, where N is the number of cores that exist or that I want to spend on the build. Choosing a bigger number doesn't speed up the build in my situation. It doesn't slow it down either, as long as you don't go overboard (such as unlimited launching through -j).
Using -lN is broadly equivalent to -jN, and can work better if the machine has other independent work to do, but there are two quirks (apart from the one you mentioned, that the number of cores is not accounted for):
Initial spike: when the build starts, make launches a lot of jobs, many more than N. The system load number doesn't immediately increase when a process is forked. That's not a problem in my situation.
Starvation: when some build jobs take a long time and the others are comparatively quick, then at the moment the first M quick jobs finish together, the system load is still ≥N. Soon the system load drops to N - M, but as long as those few slow jobs are dragging on, no new jobs are launched, and cores are left hungry. Make only thinks about launching a new job when an old job ends, and at the start. It doesn't notice the system load dropping in between.
Many (all?) machines today are multicore, so that means that the load
average is not the number make should be checking, as that number
needs to be adjusted for the number of cores.
Does this mean that the --max-load (aka -l) flag to GNU make is now
useless?
No. Imagine jobs with demanding disk i/o. If you started as many jobs as you had CPUs, you still wouldn't utilize the CPU very well.
Personally, I simply use -j because so far it worked well enough for me.
Does this mean that the --max-load (aka -l) flag to GNU make is now useless? What are people doing who are running parallel makefiles on multicore machines?
One example is running the jobs in a test suite where each test has to compile and link a program. Linking sometimes loads the system too much; the result is fatal error: ld terminated with signal 9 [Killed]. In my case it was not memory overhead but CPU usage, so the usually suggested swap file didn't help.
With the option -l 1, execution is still parallel but linking is almost sequential.
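Spelled out as a command line (the check target is a placeholder for whatever drives your test suite):
make -j"$(nproc)" -l 1 check    # jobs still run in parallel, but new ones only start while the load average is below 1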
This is really about finding the right balance between RAM usage and CPU usage. The RAM has to feed the CPU with data and the CPU needs to do the work; they need to work in sync, and that depends on your exact settings relative to your machine's specs.
For my system (CPU: i5-1035G4, 4-core, 8-thread, RAM: 8GB & 10GB swap with swappiness at 99%) the best settings were: -l 1.9 -j7.
With that setting my system was compiling quickly while using about 50% of its capacity, so I could still use it to do everything else in the foreground.

Single-CPU programs running on Hyper-Threading-enabled quadcore CPU

I'm a researcher in statistical pattern recognition, and I often run simulations that run for many days. I'm running Ubuntu 12.04 with Linux 3.2.0-24-generic, which, as I understand, supports multicore and hyper-threading. With my Intel Core i7 Sandy Bridge Quadcore with HTT, I often run 4 simulations (programs that take a long time) at the same time. Before I ask my question, here are the things that I already (think I) know.
My OS (Ubuntu 12.04) detects 8 CPUs due to hyper-threading.
The scheduler in my OS is clever enough never to schedule two programs to run on two logical (virtual) cores belonging to the same physical core, because the OS supports SMT (Simultaneous Multi-Threading).
I have read the Wikipedia page on Hyper-Threading.
I have read the HowStuffWorks page on Sandy Bridge.
OK, my question is as follows. When I run 4 simulations (programs) on my computer at the same time, they each run on a separate physical core. However, due to hyper-threading, each physical core is split into two logical cores. Therefore, is it true that each of the physical cores is only using half of its full capacity to run each of my simulations?
Thank you very much in advance. If any part of my question is not clear, please let me know.
This answer is probably late, but I see that nobody offered an accurate description of what's going on under the hood.
To answer your question, no, one thread will not use half a core.
One thread can work inside the core at a time, but that one thread can saturate the whole core's processing power.
Assume thread 1 and thread 2 belong to core #0. Thread 1 can saturate the whole core's processing power, while thread 2 waits for the other thread to end its execution. It's a serialized execution, not parallel.
At a glance, it looks like that extra thread is useless. I mean the core can process 1 thread at once right?
Correct, but there are situations in which the cores are actually idling because of 2 important factors:
cache miss
branch misprediction
Cache miss
When it receives a task, the CPU searches inside its own cache for the memory addresses it needs to work with. In many scenarios the memory data is so scattered that it is physically impossible to keep all the required address ranges inside the cache (since the cache does have a limited capacity).
When the CPU doesn't find what it needs inside the cache, it has to access the RAM. The RAM itself is fast, but it pales compared to the CPU's on-die cache. The RAM's latency is the main issue here.
While the RAM is being accessed, the core is stalled. It's not doing anything. This is not noticeable because all these components work at a ridiculous speed anyway and you wouldn't notice it through some CPU load software, but it stacks additively. One cache miss after another and another hampers the overall performance quite noticeably.
This is where the second thread comes into play. While the core is stalled waiting for data, the second thread moves in to keep the core busy. Thus, you mostly negate the performance impact of core stalls.
I say mostly because the second thread can also stall the core if another cache miss happens, but the likelihood of 2 threads missing the cache in a row instead of 1 thread is much lower.
Branch misprediction
Branching is when you have a code path with more than one possible outcome; the most basic branching code would be an if statement. Branch prediction is the CPU guessing which path will be taken before the outcome is known.
Modern CPUs have branch prediction algorithms embedded into their microcode which try to predict the execution path of a piece of code. These predictors are actually quite sophisticated and although I don't have solid data on prediction rate, I do recall reading some articles a while back stating that Intel's Sandy Bridge architecture has an average successful branch prediction rate of over 90%.
When the CPU hits a piece of branching code, it practically chooses one path (path which the predictor thinks is the right one) and executes it. Meanwhile, another part of the core evaluates the branching expression to see if the branch predictor was indeed right or not. This is called speculative execution.
This works similarly to 2 different threads: one evaluates the expression, and the other executes one of the possible paths in advance.
From here we have 2 possible scenarios:
The predictor was correct. Execution continues normally from the speculative branch which was already being executed while the code path was being decided upon.
The predictor was wrong. The entire pipeline that was processing the wrong branch has to be flushed, and execution starts over from the correct branch.
OR, the readily available thread can come in and simply execute while the mess caused by the misprediction is resolved. This is the second use of hyperthreading.
Branch prediction on average speeds up execution considerably since it has a very high rate of success. But performance does incur quite a penalty when the prediction is wrong.
Branch misprediction is not a major factor in performance degradation since, like I said, the correct prediction rate is quite high.
But cache misses are a problem and will continue to be a problem in certain scenarios.
From my experience hyperthreading does help out quite a bit with 3D rendering (which I do as a hobby). I've noticed improvements of 20-30% depending on the size of the scenes and materials/textures required. Huge scenes use huge amounts of RAM making cache misses far more likely. Hyperthreading helps a lot in overcoming these misses.
Since you are running on a Linux kernel you are in luck, because the scheduler is smart enough to make sure your tasks are divided among your physical cores.
Linux became hyperthreading-aware in kernel 2.4.17 (ref: http://kerneltrap.org/node/391)
Note that the reference is from the old O(1) scheduler. Linux now uses the CFS scheduling algorithm which was introduced in kernel 2.6.23 and should be even better.
But as already suggested, you can experiment by disabling hyper-threading in the BIOS and seeing whether your particular workload runs faster or slower with hyperthreading enabled or disabled. If you start 8 tasks instead of 4, you will probably find that the total execution time for 8 tasks with hyperthreading is less than for two separate runs of 4 tasks, but again the best thing to do is to experiment. Good luck!
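A minimal way to run that comparison from a bash prompt, where ./sim and its inputs stand in for your simulation program:
# one batch of 4 simulations
time ( for i in 1 2 3 4; do ./sim "input$i" & done; wait )
# one batch of 8; compare its elapsed time against two back-to-back 4-task runs
time ( for i in 1 2 3 4 5 6 7 8; do ./sim "input$i" & done; wait )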
If you really want just 4 dedicated cores, you should be able to disable hyperthreading on your BIOS setup page. Also, and this part I'm less clear on, I believe that the processor is smart enough to do more work on a single thread if its second logical core is idle.
No, it's not exactly true. A hyperthreaded core is not two cores. Some things can run in parallel, but not as much as on two separate cores.

Multitasking on Linux with multiple CPUs

I feel my question is quite basic, but I couldn't find any related SO question.
I need to run a program a few thousand times (different input each time), and currently it is done by a shell script. The machine runs Ubuntu and has 8 CPUs (as revealed by cat /proc/cpuinfo). Using top I see that only 1 CPU is utilized. In order to speed things up, I want to utilize all 8 CPUs. I know I can start the program in the background and then call it again (and indeed top reveals that 2 CPUs are utilized in that case), so I can change my shell script to call the program in groups of 8. My question is, is that a recommended way to utilize all CPUs, or is there another, somewhat 'cleaner' way?
You can use cpu affinity to be explicit about the processor for the processes.
http://www.cyberciti.biz/tips/setting-processor-affinity-certain-task-or-process.html
However, if each process runs on a cpu (as it should, the kernel will make sure that things are running as efficiently as possible), then just fire n processes off (8 in your case, or make your shell script figure out what n is so your script is a bit more robust, or make it a command line option) and let the kernel do it for you. Each time a process ends, fire off another process until you are done.
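One lightweight way to get exactly that refill behaviour is GNU xargs with -P; inputs.txt (one argument per line) and ./prog are placeholders here:
# keep $(nproc) instances of ./prog running, starting a new one as each finishes
xargs -a inputs.txt -n 1 -P "$(nproc)" ./prog
# if you do want explicit affinity, an individual run can be pinned, e.g. taskset -c 3 ./prog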
The question is overly vague.
That you want to use all the CPUs implies you want the end result as quickly as possible, but a major concern for the performance of multiple instances is contention for resources (reducing performance) and caching (improving performance).
Splitting the job among multiple processes will usually yield results faster, and there are many, many ways of sharding the workload. But without knowing a lot more about what it is doing, it is difficult to recommend a particular approach.
Given that you have 8 CPUs, and assuming that the only constrained resource is the CPU, then you don't want to have more than 8 threads running concurrently on the job. So the problem then becomes how you schedule work to ensure that you are using the 8 cores optimally. Splitting the work into 8 scripts and running them concurrently you will initially see all 8 scripts running concurrently - but its very likely, depending on the nature of the work, that the scripts will finish at different times.
So if you really want to use the hardware optimally, that means running 8 processes as daemons, preferably with each process having a CPU affinity set, fed by a message queue. But is it really worthwhile coding all this if you're not going to be running it regularly? It may also be faster to run just 7 and keep a CPU free for handling the queue and other demands placed on the box.
