Using make to execute independent tasks in parallel

I have a bunch of commands I would like to execute in parallel. The commands are nearly identical. They can be expected to take about the same time, and can run completely independently. They may look like:
command -n 1 > log.1
command -n 2 > log.2
command -n 3 > log.3
...
command -n 4096 > log.4096
I could launch all of them in parallel in a shell script, but the system would try to load more than strictly necessary to keep the CPU(s) busy (each task takes 100% of one core until it has finished). This would cause the disk to thrash and make the whole thing slower than a less greedy approach to execution.
The best approach is probably to keep about n tasks executing, where n is the number of available cores.
I am keen not to reinvent the wheel. This problem has already been solved in the Unix make program (when used with the -j n option). I was wondering if perhaps it was possible to write generic Makefile rules for the above, so as to avoid the linear-size Makefile that would look like:
all: log.1 log.2 ...
log.1:
	command -n 1 > log.1
log.2:
	command -n 2 > log.2
...
If the best solution is not to use make but another program/utility, I am open to that as long as the dependencies are reasonable (make was very good in this regard).

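Here is more portable shell code that does not depend on brace expansion:
LOGS := $(shell seq 1 1024)
Note the use of := to define a simply expanded variable: the seq command runs once, at the assignment, rather than every time LOGS is referenced. The variable holds the bare numbers; $(addprefix log.,$(LOGS)) turns them into target names.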

See pattern rules
Another way, if this is the single reason why you need make, is to use -n and -P options of xargs.
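A minimal sketch of the xargs route. The echo is a hypothetical stand-in for `command -n N`, the job count of 4 and the scratch directory are assumptions, and -P is an option of GNU and BSD xargs:

```shell
# Scratch directory for the log files (stands in for the working directory).
outdir=$(mktemp -d)
export outdir

# Feed the job numbers 1..8 to xargs: -n 1 passes one number per invocation,
# -P 4 keeps at most four invocations running at once.
seq 1 8 | xargs -n 1 -P 4 sh -c 'echo "task $1" > "$outdir/log.$1"' sh

ls "$outdir"
```

Each job writes its own log file, so the outputs of concurrent jobs never get mixed.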

First the easy part. As Roman Cheplyaka points out, pattern rules are very useful:
LOGS = log.1 log.2 ... log.4096
all: $(LOGS)
log.%:
	command -n $* > log.$*
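The tricky part is creating that list, LOGS. Make isn't very good at handling numbers. The best way is probably to call on the shell. (The loop below uses bash's arithmetic for syntax, while make runs $(shell ...) through /bin/sh by default, so you may have to set SHELL or adjust the script for your shell. Shell scripting isn't my strongest subject.)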
NUM_LOGS = 4096
LOGS = $(shell for ((i=1 ; i<=$(NUM_LOGS) ; ++i)) ; do echo log.$$i ; done)
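If bash is not available, the same list can be built with a plain POSIX while loop. This is only a sketch: it uses a small count so the output is easy to check, and num_logs and logs are illustrative names. Inside a makefile $(shell ...) call, each $ would have to be doubled.

```shell
num_logs=5          # small stand-in for 4096
i=1
logs=""
while [ "$i" -le "$num_logs" ]; do
    logs="$logs log.$i"     # append the next target name
    i=$((i + 1))
done
logs=${logs# }      # drop the leading space
echo "$logs"
```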

xargs -P is the "standard" way to do this.
Note that, depending on disk I/O, you may want to limit the number of jobs to the number of disk spindles rather than cores.
If you do want to limit to cores, note the nproc command in recent coreutils.
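A sketch of picking the job count from the core count. The fallback chain (getconf, then 1) is my own assumption for systems without nproc, and the echo stands in for real work:

```shell
# Ask nproc for the number of usable cores; fall back to getconf, then to 1.
njobs=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Run six trivial jobs with that many workers.
seq 1 6 | xargs -n 1 -P "$njobs" sh -c 'echo "job $1 done"' sh
```

The same number can be handed to make, as in make -j "$(nproc)".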

With GNU Parallel you would write:
parallel command -n {} ">" log.{} ::: {1..4096}
10 second installation:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
Learn more: http://www.gnu.org/software/parallel/parallel_tutorial.html https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Related

When to use spawn.with_shell and when spawn is only needed?

I'm confused about when I should use awful.spawn and when to use awful.spawn.with_shell. To me these look and work the same.
The only difference I see is that with awful.spawn you can set client rules and a callback.
I would appreciate any examples or rules on when to use each one.

awful.spawn.with_shell really does not do more than spawning the given command with a shell: https://github.com/awesomeWM/awesome/blob/c539e0e4350a42f813952fc28dd8490f42d934b3/lib/awful/spawn.lua#L370-L371
function spawn.with_shell(cmd)
    if cmd and cmd ~= "" then
        cmd = { util.shell, "-c", cmd }
        return capi.awesome.spawn(cmd, false)
    end
end
So, why would one want that? Some things are done by shells. For example, output redirections (echo hi > some_file), command sequences (echo 1; echo 2) or pipes (echo hello | grep ell) are all done by a shell. None of these work when starting a process directly.
Why would one not want a shell? For example, argument escaping is way more complicated when a shell is involved. When you e.g. want to print a pipe symbol (no idea why one would need that), then awful.spawn({"echo", "|"}) just works, while with a shell you need to escape the pipe symbol the appropriate number of times. I guess that awful.spawn.with_shell("echo \\\|") would work, but I am not sure, and that is the point.
Also, a shell that does nothing else is an extra process, so it is a tiny bit slower than spawning without a shell, but this difference is really unimportant.
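The escaping point is not specific to awesome. Here is a sketch of the same effect in plain sh (this is not the awful.spawn API, only an illustration): passing the argument directly needs one layer of quoting, while handing a command string to a second shell needs another layer inside the string.

```shell
# Argument passed straight to the program: the pipe symbol is ordinary data.
direct=$(env echo '|')

# Same command as a string handed to a shell: it is parsed a second time,
# so the pipe symbol needs its own escaping inside the string.
via_shell=$(sh -c 'echo \|')

printf '%s\n' "$direct" "$via_shell"
```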

How to parallelize the nested for loops in bash calling the R script

Is it possible to parallelize the following code?
for word in $(cat FileNames.txt)
do
    for i in {1..22}
    do
        Rscript assoc_test.R...........
    done >> log.txt
done
I have been trying to parallelize it but have not been lucky so far. I have tried putting () around the Rscript assoc_test.R........... followed by & but it is not giving the results, and the log file turns out to be empty. Any suggestions/help would be appreciated. TIA.

You can change your script to output the commands to run, and feed the results into GNU parallel:
for word in $(cat FileNames.txt)
do
    for i in {1..22}
    do
        echo Rscript assoc_test.R........... \> log.$word.$i
    done
done | parallel -j 4
Some details:
parallel -j 4 will keep 4 jobs running at a time - replace 4 by the number of CPUs you want to use.
Notice I redirect the output to log.$word.$i and escape the redirection operator > by using \>. I need to test and make sure it works, but the point is that since you're going parallel, you don't want to jumble all your outputs together.
Make sure you escape anything else the echo might interpret. The output should be valid command lines that parallel can run.
As an alternative to parallel, you can also use xargs -i (deprecated in GNU xargs in favor of -I). See this question for more information.
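A sketch of the xargs variant, using -n and -P rather than -i. It assumes the names in FileNames.txt contain no whitespace. The two sample names, the scratch directory, the loop bound of 3, and the echo (a placeholder for the Rscript call) are all made up for illustration:

```shell
outdir=$(mktemp -d)   # scratch directory for the per-job logs
export outdir
printf '%s\n' alpha beta > "$outdir/FileNames.txt"   # stand-in input file

# Emit one "word index" pair per line, then let xargs run up to 4 jobs at once.
while read -r word; do
    for i in 1 2 3; do
        printf '%s %s\n' "$word" "$i"
    done
done < "$outdir/FileNames.txt" |
    xargs -n 2 -P 4 sh -c 'echo "placeholder for $1 / $2" > "$outdir/log.$1.$2"' sh

ls "$outdir"
```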

GNU Parallel is made for replacing loops, so the double loop can be replaced by:
parallel Rscript assoc_test.R... \> log.{1}.{2} :::: FileNames.txt ::: {1..22} > log.txt

How do I get the target directory in a script?

I am trying to get the target directory for modules in a multimodule project. The challenge I have is that SBT's logging makes it hard to consume in a script.
Here is what I have at the moment:
function sbt-target {
    sbt -Dsbt.log.noformat=true "project $1" 'show target' |
        tail -n1 |
        cut -c8-
}
I think this is very hackish as it knows about the [INFO] prefix (the cut -c8-) of each output line from SBT and about the fact that SBT's last line is the output I need (the tail -n1).
More problematic is that each invocation of sbt-target takes almost 11 seconds so invoking it for each module for a large number of modules in this project dominates the time.
How do I get the target directory in a script?

I can't speak to SBT. In terms of bash best-practices, you might consider something more akin to the following:
sbt_target() {
    # declare locals as such
    local line target
    # iterate through all lines; later lines overwrite the variable set by prior ones
    while read -r line; do
        target=${line#"[INFO] "} # strip undesired prefix if present
    done < <(sbt -Dsbt.log.noformat=true "project $1" 'show target')
    # emit result to stdout
    printf '%s\n' "$target"
}
Unlike the version relying on tail and cut, this does everything in-process within bash, and is thus more efficient (presuming that sbt's show target emits a relatively small amount of output).
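The prefix stripping can be tried without sbt. This sketch uses canned text and POSIX parameter expansion only. The three sample lines are made up for illustration and are not real sbt output, and last_line and target are illustrative names:

```shell
# Hypothetical sample of what the sbt call might print.
sample_output='[INFO] Loading project definition
[INFO] Set current project to demo
[INFO] /home/user/demo/target'

nl='
'
last_line=${sample_output##*"$nl"}   # keep only the text after the final newline
target=${last_line#"[INFO] "}        # strip the log prefix if present
printf '%s\n' "$target"
```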

Complex command execution in Makefile

I have a question about running a complex command from a makefile.
I am currently using the shell function in the makefile to run the command. However, the command fails because it is a pipeline of several commands and collects a huge amount of data. The makefile content is something like this:
variable=$(shell ls -lart | grep name | cut -d/ -f2- )
The make run fails with an execvp failure, since the file listing is huge and I need to parse all of it.
Please suggest ways to overcome this issue. Basically, I would like to run a complex command and assign its output to a makefile variable that I want to use later on.

(This may take a few iterations.)
This looks like a limitation of the architecture, not a Make limitation. There are several ways to address it, but you must show us how you use variable, otherwise even if you succeed in constructing it, you might not be able to use it as you intend. Please show us the exact operations you intend to perform on variable.
For now I suggest you do a couple of experiments and tell us the results. First, try the assignment with a short list of files (e.g. three) to verify that the assignment does what you intend. Second, in the directory with many files, try:
variable=$(shell ls -lart | grep name)
to see whether the problem is in grep or cut.

Rather than storing the list of files in a variable, you can use shell functionality to get the same result. It's a bit odd that you're flattening a recursive ls to only get the leaves, and then running mkdir -p, which is really only useful if the parent directory doesn't exist. But if you know which depths you want to cover (for example, the current directory and all subdirectories one level down), you can do something like this:
directories:
	for path in ./*name* ./*/*name*; do \
		mkdir "/some/path/$$(basename "$$path")" || exit 1; \
	done
(In a recipe every $ meant for the shell is doubled; a single $(basename ...) would be expanded by make's own basename function.)
or even, as a plain shell command:
find . -name '*name*' -exec sh -c 'mkdir "/some/path/$(basename "$1")"' sh {} \;
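The loop can be tried safely in scratch directories. This is a sketch in plain sh (single $, since it runs outside make); the file names and the two mktemp directories are made up for illustration:

```shell
src=$(mktemp -d)    # stands in for the directory being scanned
dest=$(mktemp -d)   # stands in for /some/path
mkdir -p "$src/sub"
touch "$src/a_name_1" "$src/sub/b_name_2" "$src/other"

(
    cd "$src" || exit 1
    for path in ./*name* ./*/*name*; do
        [ -e "$path" ] || continue                  # skip a glob that matched nothing
        mkdir "$dest/$(basename "$path")" || exit 1
    done
)

ls "$dest"
```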

How can I tell if a makefile is being run from an interactive shell?

I have a makefile which runs commands that can take a while. I'd like those commands to be chatty if the build is initiated from an interactive shell but quieter if not (specifically, by cron). Something along the lines of (pseudocode):
foo_opts = -a -b -c
if (make was invoked from an interactive shell):
    foo_opts += --verbose
all: bar baz
	foo $(foo_opts)
This is GNU make. If the specifics of what I'm doing matter, I can edit the question.

It isn't strictly determining whether it is invoked from an interactive shell or not, but for a cron job in which the output is redirected to a file, the answer to this question would be the same as for How to detect if my shell script is running through a pipe?:
if [ -t 0 ]
then
    # input is from a terminal
    :
fi
Edit: To use this to set a variable in a Makefile (in GNU make, that is):
INTERACTIVE:=$(shell [ -t 0 ] && echo 1)
ifdef INTERACTIVE
# is a terminal
else
# cron job
endif
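The test is easy to try outside make. In this sketch verbose_flag is an illustrative helper name, and the /dev/null redirection imitates a job whose standard input is not attached to a terminal, as is the case under cron:

```shell
# Print --verbose only when standard input is a terminal.
verbose_flag() {
    if [ -t 0 ]; then
        echo "--verbose"
    fi
}

# With stdin redirected away from a terminal, the helper prints nothing.
flag_without_tty=$(verbose_flag < /dev/null)
echo "foo -a -b -c $flag_without_tty"
```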

http://www.faqs.org/faqs/unix-faq/faq/part5/section-5.html
5.5) How can I tell if I am running an interactive shell?
In the C shell category, look for the variable $prompt.
In the Bourne shell category, you can look for the variable $PS1,
however, it is better to check the variable $-. If $- contains
an 'i', the shell is interactive. Test like so:
case $- in
*i*)    # do things for interactive shell
        ;;
*)      # do things for non-interactive shell
        ;;
esac
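One caveat when applying this to make: a recipe is run through a shell started with -c, and such a shell is non-interactive no matter how make itself was launched. A small sketch of that behaviour, with mode as an illustrative name:

```shell
# A command string run via sh -c sees no "i" in $-.
mode=$(sh -c 'case $- in *i*) echo interactive ;; *) echo non-interactive ;; esac')
echo "$mode"
```

So this check tells you about the shell doing the test, not about the shell that started make.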

I do not think you can easily find out. I suggest adopting an alternative strategy, probably by quelling the verbose output from the cron job. I would look to do that using a makefile like this:
VERBOSE = --verbose
foo_opts = -a -b -c ${VERBOSE}
all: bar baz
	foo $(foo_opts)
Then, in the cron job, specify:
make VERBOSE=
This command-line specification of VERBOSE overrides the one in the makefile (ordinary assignments in the makefile cannot change it). That way, the specialized task (cron job) that you set up once and use many times will be done without the verbose output; the general task of building will be done verbosely (unless you elect to override the verboseness on the command line).
One minor advantage of this technique is that it will work with any variant of make; it does not depend on any GNU Make facility.

I’m not really sure what "am interactive" means. Do you mean having a valid /dev/tty? If so, you could check that. Most of us check isatty on stdin, though, because it answers the question we actually care about: is there someone there to type something?

Just a note: you can also see the related discussion that I had about detecting redirection of STDOUT from inside a Makefile.
I believe it will be helpful to readers of this question - executive summary:
-include piped.mk
all: piped.mk
ifeq ($(PIPED),1)
	@echo Output of make is piped because PIPED is ${PIPED}
else
	@echo Output of make is NOT piped because PIPED is ${PIPED}
endif
	@rm -f piped.mk
piped.mk:
	@[ -t 1 ] && PIPED=0 || PIPED=1 ; echo "PIPED=$${PIPED}" > piped.mk
$ make
Output of make is NOT piped because PIPED is 0
$ make | more
Output of make is piped because PIPED is 1
In my answer there I explain why the [ -t 1 ] test has to be done in an action and not in a variable assignment (as in the recommended answer here), as well as the various pitfalls regarding re-evaluation of a generated Makefile (i.e. the piped.mk above).
The term interactive in this question seems to imply redirection of STDIN... in which case replacing [ -t 1 ] with [ -t 0 ] in my code above should work as-is.
Hope this helps.
