How to use parallel execution in a shell script? - unix

I have a C shell script that does something like this:
#!/bin/csh
gcc example.c -o ex
gcc combine.c -o combine
ex file1 r1 <-- 1
ex file2 r2 <-- 2
ex file3 r3 <-- 3
#... many more like the above
combine r1 r2 r3 final
\rm r1 r2 r3
Is there some way I can make lines 1, 2 and 3 run in parallel instead of one after another?

Convert this into a Makefile with proper dependencies. Then you can use make -j to have Make run everything possible in parallel.
Note that all the indents in a Makefile must be TABs. TAB shows Make where the commands to run are.
Also note that this Makefile is now using GNU Make extensions (the wildcard and subst functions).
It might look like this:
export PATH := .:${PATH}

FILES=$(wildcard file*)
RFILES=$(subst file,r,${FILES})

final: combine ${RFILES}
	combine ${RFILES} final
	rm ${RFILES}

ex: example.c
	gcc $< -o $@

combine: combine.c
	gcc $< -o $@

r%: file% ex
	ex $< $@
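With the Makefile above in place, a run might look like this (the -j value is just an example; plain -j lets Make start as many jobs as it can):
make -j4     # build ex, combine and all the r files, at most 4 jobs at a time
make -j      # no limit: run every job that is ready in parallel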

In bash I would do:
ex file1 r1 &
ex file2 r2 &
ex file3 r3 &
wait
... continue with script...
and spawn them out to run in parallel. You can check out this SO thread for another example.
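If there are many more jobs than cores, a variation of the same idea caps how many run at once. This is only a sketch: it assumes bash 4.3+ for wait -n, and MAXJOBS is just an illustrative name.
MAXJOBS=4
for f in file*; do
    ex "$f" "r${f#file}" &
    # once MAXJOBS jobs are running, wait for any one of them to finish
    while (( $(jobs -rp | wc -l) >= MAXJOBS )); do
        wait -n
    done
done
wait   # wait for the remaining jobs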

#!/bin/bash
gcc example.c -o ex
gcc combine.c -o combine
# Call 'ex' 3 times in "parallel"
for i in {1..3}; do
    ex file${i} r${i} &
done
# Wait for all background processes to finish
wait
# Combine & remove
combine r1 r2 r3 final
rm r1 r2 r3
I slightly altered the code to use brace expansion {1..3} rather than hard-coding the numbers, since I just realized you said there are many more files than just three. Brace expansion makes scaling up trivial: replace the '3' inside the braces with whatever number you need.
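One caveat: the bound inside the braces has to be a literal, because brace expansion happens before variables are expanded, so {1..$N} will not work. If the count lives in a variable, a sketch with seq (N is just an illustrative name):
N=250                       # however many file/r pairs there are
for i in $(seq 1 "$N"); do
    ex file${i} r${i} &
done
wait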

You can use
cmd &
and wait afterwards:
#!/bin/csh
echo start
sleep 1 &
sleep 1 &
sleep 1 &
wait
echo ok
test:
$ time ./csh.sh
start
[1] 11535
[2] 11536
[3] 11537
[3] Done sleep 1
[2] - Done sleep 1
[1] + Done sleep 1
ok
real 0m1.008s
user 0m0.004s
sys 0m0.008s

GNU Parallel would make it pretty, like this:
seq 1 3 | parallel ex file{} r{}
Depending on how 'ex' and 'combine' work you can even do:
seq 1 3 | parallel ex file{} | combine
Learn more about GNU Parallel by watching http://www.youtube.com/watch?v=LlXDtd_pRaY
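If you prefer parallel's own argument syntax over seq, something like this should also work (a sketch; -j4 is an arbitrary cap on simultaneous jobs, and the default is one job per CPU core):
parallel -j4 ex file{} r{} ::: 1 2 3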

You could use nohup ex:
nohup ex file1 r1 &
nohup ex file2 r2 &
nohup ex file3 r3 &

xargs can do it:
seq 1 3 | xargs -n 1 -P 0 -I % ex file% r%
-n 1 starts a separate command for each input item, and -P 0 runs as many of those commands in parallel as possible.
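To cap the job count and drive xargs from the actual file names instead of a counter, something like this should work (a sketch; the file prefix is from the question):
printf '%s\n' file* | xargs -P 4 -I % sh -c 'ex "$1" "r${1#file}"' _ %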

Related

Loop and process over blocks of lines between two patterns in awk?

This is actually a continued version of this question:
I have a file
1
2
PAT1
3 - first block
4
PAT2
5
6
PAT1
7 - second block
PAT2
8
9
PAT1
10 - third block
and I use awk '/PAT1/{flag=1; next} /PAT2/{flag=0} flag'
to extract the blocks of lines.
Extracting them works OK, but I'm trying to iterate over these blocks in a block-by-block fashion and do some processing with each block (e.g. save to file, process with other scripts, etc.).
How can I construct such a loop?
The problem is not very clear, but you may do something like this:
awk '/PAT1/ {
    flag = 1
    ++n
    s = ""
    next
}
/PAT2/ {
    flag = 0
    printf "Processing record # %d =>\n%s", n, s
}
flag {
    s = s $0 ORS
}' file
Processing record # 1 =>
3 - first block
4
Processing record # 2 =>
7 - second block
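If the per-block processing is simply saving each block to its own file, the same flag approach can redirect inside awk. A sketch (the blockN.txt names are made up):
awk '/PAT1/ {flag=1; ++n; next}
     /PAT2/ {flag=0; close("block" n ".txt")}
     flag   {print > ("block" n ".txt")}' file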
This might work for you (GNU sed):
sed -ne '/PAT1/!b;:a;N;/PAT2/!ba;e echo process:' -e 's/.*/echo "&"|wc/pe;p' file
Gather up the lines between PAT1 and PAT2 and process the collection.
In the example above, the literal process: is printed.
The command to print the result of the wc command for the collection is built and printed.
The result of the evaluation of the above command is printed.
N.B. The position of the p flag in the substitution command is critical: if p comes before the e flag, the pattern space is printed before the evaluation; if p comes after the e flag, the pattern space that is printed is the post-evaluation one.

Get specific line from unix command output

Let's say I run a command in the shell, cmd doSomething, and it shows separate lines as output, for example:
> cmd doSomething
outputLine1
outputLine2
outputLine3
Is there a way to assign the 2nd line (outputLine2) to a variable (e.g. testdir)?
Ideally I would like to be able to use $testdir.
You can combine head and tail, as follows:
doSomething | head -n 2 | tail -n 1
The head -n 2 shows the first two output lines, the tail -n 1 the last of those two.
For putting this into a variable:
variable=$(doSomething | head -n 2 | tail -n 1)
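sed or awk can do the same in a single step (a sketch; doSomething stands in for the real command, as in the question):
testdir=$(doSomething | sed -n '2p')   # print only line 2
testdir=$(doSomething | awk 'NR==2')   # same idea in awk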

Selecting specific rows of a tab-delimited file using bash (linux)

I have a directory with a lot of tab-delimited txt files, each with several rows and columns, e.g.
File1
Id Sample Time ... Variant[Column16] ...
1 s1 t0 c.B481A:p.G861S
2 s2 t2 c.C221C:p.D461W
3 s5 t1 c.G31T:p.G61R
File2
Id Sample Time ... Variant[Column16] ...
1 s1 t0 c.B481A:p.G861S
2 s2 t2 c.C21C:p.D61W
3 s5 t1 c.G1T:p.G1R
and what I am looking for is to create a new file with:
all the different variants (each listed once)
the number of times each variant is repeated
and the file location(s)
i.e.:
NewFile
Variant Nº of repeated Location
c.B481A:p.G861S 2 File1,File2
c.C221C:p.D461W 1 File1
c.G31T:p.G61R 1 File1
c.C21C:p.D61W 1 File2
c.G1T:p.G1R 1 File2
I think a basic bash script using awk, sort and uniq will work, but I do not know where to start. Or, if RStudio or Python (3) is easier, I could try that.
Thanks!!
Pure bash. Requires version 4.0+
# two associative arrays
declare -A files
declare -A count
# use a glob pattern that matches your files
for f in File{1,2}; do
  {
    read -r header
    while read -ra fields; do
      variant=${fields[3]}   # use index "15" for the 16th column
      (( count[$variant] += 1 ))
      files[$variant]+=",$f"
    done
  } < "$f"
done

for variant in "${!count[@]}"; do
  printf "%s\t%d\t%s\n" "$variant" "${count[$variant]}" "${files[$variant]#,}"
done
outputs
c.B481A:p.G861S 2 File1,File2
c.G1T:p.G1R 1 File2
c.C221C:p.D461W 1 File1
c.G31T:p.G61R 1 File1
c.C21C:p.D61W 1 File2
The order of the output lines is indeterminate: associative arrays have no particular ordering.
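If a predictable order matters, pipe the output through sort, e.g. (variants.sh is just a stand-in name for the script above):
bash variants.sh | sort   # lexical order by variant name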
Pure bash would be hard I think but everyone has some awk lying around :D
awk 'FNR==1 {next}
{
    ++n[$16]
    if ($16 in a) {
        a[$16] = a[$16] "," ARGV[ARGIND]
    } else {
        a[$16] = ARGV[ARGIND]
    }
}
END {
    printf("%-24s %6s %s\n", "Variant", "Nº", "Location")
    for (v in n) printf("%-24s %6d %s\n", v, n[v], a[v])
}' *
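ARGV[ARGIND] is a gawk extension; a variant using plain FILENAME (a sketch, otherwise the same idea) stays closer to POSIX awk:
awk 'FNR==1 {next}
     {
         ++n[$16]
         a[$16] = ($16 in a) ? a[$16] "," FILENAME : FILENAME
     }
     END {
         printf("%-24s %6s %s\n", "Variant", "Nº", "Location")
         for (v in n) printf("%-24s %6d %s\n", v, n[v], a[v])
     }' File1 File2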

Read only part of file/ cut to specific symbol

I have 100 files which all have a similar structure
line1
line2
stuff
RR
important stuff
The problem is that I want to cut when RR appears (which it does in each file). However, this is not always on the same line (it can be line 20, it can be line 35), but it is always there. Hence, is there any way in bash or R (when reading in the file) to do that (i.e. just cut off the header)? I would prefer R.
You can read all rows and remove the unnecessary ones:
dat <- readLines(textConnection(
"line1
line2
stuff
RR
important stuff"))
# dat <- readLines("file.name")
dat[seq(which.max(dat == "RR") + 1, length(dat))]
# [1] "important stuff"
If you have awk available through bash you could do:
awk '(/RR/){p=1; next} (p){print}' < file.txt
$ cat file.txt
line1
line2
stuff
RR
important stuff
$ awk '(/RR/){p=1; next} (p){print}' < file.txt
important stuff
This sets a flag p when the 'RR' string is found; next then skips the remaining rules for that line, so (p){print} is never evaluated for the RR line itself and RR is not printed. All subsequent lines are printed because p is set.
Here are a few ways:
Using basic tools:
$ tail -n+$((1 + $(grep -n '^RR$' file.txt | cut -d: -f1))) file.txt
important stuff
$
Using pure bash:
$ { while read ln; do [ "$ln" == RR ] && break; done; cat; } < file.txt
important stuff
$
And another way, assuming you can guarantee no more than 9999 lines in a file:
$ grep -A9999 '^RR$' file.txt | tail -n+2
important stuff
$
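A sed range does the same cut in one step (a sketch; it deletes everything up to and including the first RR line, and assumes RR is never the very first line, since the 1,/^RR$/ address would then look for a later RR to close the range):
$ sed '1,/^RR$/d' file.txt
important stuff
$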

Move top 1000 lines from text file to a new file using Unix shell commands

I wish to copy the top 1000 lines of a text file containing more than 50 million entries to another new file, and also delete those lines from the original file.
Is there some way to do the same with a single shell command in Unix?
head -1000 input > output && sed -i '1,+999d' input
For example:
$ cat input
1
2
3
4
5
6
$ head -3 input > output && sed -i '1,+2d' input
$ cat input
4
5
6
$ cat output
1
2
3
head -1000 file.txt > first1000lines.txt
tail --lines=+1001 file.txt > restoffile.txt
Out of curiosity, I found a box with a GNU version of sed (v4.1.5) and tested the (uncached) performance of two approaches suggested so far, using an 11M line text file:
$ wc -l input
11771722 input
$ time head -1000 input > output; time tail -n +1000 input > input.tmp; time cp input.tmp input; time rm input.tmp
real 0m1.165s
user 0m0.030s
sys 0m1.130s
real 0m1.256s
user 0m0.062s
sys 0m1.162s
real 0m4.433s
user 0m0.033s
sys 0m1.282s
real 0m6.897s
user 0m0.000s
sys 0m0.159s
$ time head -1000 input > output && time sed -i '1,+999d' input
real 0m0.121s
user 0m0.000s
sys 0m0.121s
real 0m26.944s
user 0m0.227s
sys 0m26.624s
This is the Linux I was working with:
$ uname -a
Linux hostname 2.6.18-128.1.1.el5 #1 SMP Mon Jan 26 13:58:24 EST 2009 x86_64 x86_64 x86_64 GNU/Linux
For this test, at least, it looks like sed is slower than the tail approach (27 sec vs ~14 sec).
This is a one-liner but uses four atomic commands:
head -1000 file.txt > newfile.txt; tail -n +1001 file.txt > file.txt.tmp; cp file.txt.tmp file.txt; rm file.txt.tmp
Perl approach:
perl -ne 'if($i<1000) { print; } else { print STDERR;}; $i++;' in 1> out 2> in.new && mv in.new in
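The same single-pass split can be written in awk (a sketch that keeps the in/out/in.new names from the perl line above):
awk 'NR<=1000 {print > "out"; next} {print > "in.new"}' in && mv in.new in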
Using a pipe:
cat en-tl.100.en | head -10
