Parallel execution of a Unix command?

I wrote a shell program that divides the file into 4 parts automatically using csplit, then four shell programs that run the same command in the background using nohup, and a while loop that watches for the completion of these four processes and finally runs cat output1.txt ... output4.txt > finaloutput.txt.
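Roughly, the manual approach looked like this (a sketch only; the file names and the per-part command are illustrative, since the question just says "same command"):
# split a 10M-line data.txt1 into 4 parts: part00 .. part03
csplit -s -f part data.txt1 2500001 5000001 7500001
# run the same command on each part in the background
for i in 0 1 2 3; do
    nohup wc -l "part0$i" > "output$((i+1)).txt" 2>/dev/null &
done
wait    # instead of a polling while-loop
cat output1.txt output2.txt output3.txt output4.txt > finaloutput.txt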
But then I came to know about the parallel command, and I tried it with a big file, but it does not seem to work as expected. The file is the output of the command below:
for i in $(seq 1 1000000);do cat /etc/passwd >> data.txt1;done
time wc -l data.txt1
10000000 data.txt1
real 0m0.507s
user 0m0.080s
sys 0m0.424s
With parallel:
time cat data.txt1 | parallel --pipe wc -l | awk '{s+=$1} END {print s}'
10000000
real 0m41.984s
user 0m1.122s
sys 0m36.251s
And when I tried this on a 2 GB file (~10 million records) it took more than 20 minutes.
Does this command only work on multi-core systems? (I am currently using a single-core system.)
nproc --all
1

--pipe is inefficient (though not at the scale you are measuring - something is very wrong on your system). It can deliver in the order of 1 GB/s (total).
--pipepart is, on the contrary, highly efficient. It can deliver in the order of 1 GB/s per core, provided your disk is fast enough. This should be the most efficient way of processing data.txt1. It will split data.txt1 into one block per CPU core and feed each block to a wc -l running on that core:
parallel --block -1 --pipepart -a data.txt1 wc -l
You need version 20161222 or later for --block -1 to work.
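You can check which version is installed with:
parallel --version | head -n 1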
These are timings from my old dual core laptop. seq 200000000 generates 1.8 GB of data.
$ time seq 200000000 | LANG=C wc -c
1888888898
real 0m7.072s
user 0m3.612s
sys 0m2.444s
$ time seq 200000000 | parallel --pipe LANG=C wc -c | awk '{s+=$1} END {print s}'
1888888898
real 1m28.101s
user 0m25.892s
sys 0m40.672s
The time here is mostly due to GNU Parallel spawning a new wc -c for each 1 MB block. Increasing the block size makes it faster:
$ time seq 200000000 | parallel --block 10m --pipe LANG=C wc -c | awk '{s+=$1} END {print s}'
1888888898
real 0m26.269s
user 0m8.988s
sys 0m11.920s
$ time seq 200000000 | parallel --block 30m --pipe LANG=C wc -c | awk '{s+=$1} END {print s}'
1888888898
real 0m21.628s
user 0m7.636s
sys 0m9.516s
As mentioned --pipepart is much faster if you have data in a file:
$ seq 200000000 > data.txt1
$ time parallel --block -1 --pipepart -a data.txt1 LANG=C wc -c | awk '{s+=$1} END {print s}'
1888888898
real 0m2.242s
user 0m0.424s
sys 0m2.880s
So on my old laptop I can process 1.8 GB in 2.2 seconds.
If you have only one core and your work is CPU bound, then parallelizing will not help you. Parallelizing on a single-core machine can still make sense if most of the time is spent waiting (e.g. waiting for the network).
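For example (an illustrative sketch, not part of the original answer): eight jobs that each just wait one second finish in about one second through GNU Parallel even on a single core, because the time is spent waiting rather than computing.
# sequentially: about 8 seconds
time for i in $(seq 8); do sleep 1; done
# through GNU Parallel on one core: about 1 second
time parallel -j8 'sleep 1 && echo job {} done' ::: $(seq 8)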
However, the timings from your computer tell me something is very wrong there. I recommend testing your program on another computer.

In short, yes: you will need more physical cores in the machine to benefit from parallel. Just to understand your task, the following is what you intend to do:
file1 is a 10,000,000 line file
split into 4 files >
file1.1 > processing > output1
file1.2 > processing > output2
file1.3 > processing > output3
file1.4 > processing > output4
>> cat output* > output
________________________________
And you want to parallelize the middle part and run it on 4 cores simultaneously. Am I correct? I think you can use GNU parallel in a much better way: write the code for one of the files and use that command with (pseudocode warning):
parallel --jobs 4 "processing code on the file segments with sequence variable {}" ::: 1 2 3 4
Where --jobs (-j) sets the number of jobs to run at the same time.
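A concrete instance of that pseudocode (illustrative only; it assumes the segments are named file1.1 ... file1.4 and uses grep -c as a stand-in for your processing code):
parallel --jobs 4 "grep -c 'root' file1.{} > output{}.txt" ::: 1 2 3 4
cat output1.txt output2.txt output3.txt output4.txt > finaloutput.txt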
UPDATE
Why are you running the parallel command over the sequential steps inside each of file1.1, 1.2, 1.3 and 1.4? Let that stay regular sequential processing, as you have already coded it, and parallelize across the segments instead:
parallel 'for i in $(seq 1 250000);do cat file1.{} >> output{}.txt;done' ::: 1 2 3 4
The above will run your 4 csplit segments in parallel on 4 cores, effectively executing:
for i in $(seq 1 250000);do cat file1.1 >> output1.txt;done
for i in $(seq 1 250000);do cat file1.2 >> output2.txt;done
for i in $(seq 1 250000);do cat file1.3 >> output3.txt;done
for i in $(seq 1 250000);do cat file1.4 >> output4.txt;done
I am pretty sure that --pipepart, as suggested above by Ole, is the better way to do it, given that you have high-speed data access from the HDD.

Related

How to filter out certain files from the output set of lsof on macOS?

I am using lsof on macOS to get a list of open files. The execution takes around a minute to finish. I could use grep, but that wouldn't improve the execution time of lsof itself.
Does lsof support a regex/filter option to ignore certain paths? I can only find filter options for network connections.
% time lsof +D /Users/jack/
[...]
... 60.128s total
Any input is highly appreciated.
The following options should offer some speedup.
Most of the time is spent expanding the directories.
If tree is not available you can install it with Homebrew:
brew install tree
Replace regex with your regular expression.
Regex:
tree -d -i -f /Users/jack | grep regex | while read -r dirs; do lsof +d "$dirs"; done
Pattern matching:
tree -d -i -f -P pattern /Users/jack | while read -r dirs; do lsof +d "$dirs"; done
Caching
If the directories do not change you can cache the directories to a file:
tree -d -i -f /Users/jack | grep regex > dirs.txt
and then use the following to list the open files:
cat dirs.txt | while read -r dirs; do lsof +d "$dirs"; done
A recursive version that limits the depth to 1 at each level (tree -d -i -f -L 1 $nextdir | grep regex ...) is possible and may be faster for sparse trees with a high pruning rate, but the overhead makes it impractical for large depths.
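A minimal sketch of that recursive idea in bash (PRUNE_REGEX and the starting directory are illustrative; it assumes the same tree and lsof usage as above):
#!/bin/bash
# Walk the tree one level at a time, pruning subdirectories whose path
# matches PRUNE_REGEX before descending into them.
PRUNE_REGEX='Library|\.Trash'   # example pattern, adjust to taste
walk() {
    local dir=$1
    lsof +d "$dir"              # open files directly inside this directory
    # immediate subdirectories only: -d dirs, -i no indent, -f full paths, -L 1 depth
    tree -d -i -f -L 1 --noreport "$dir" | tail -n +2 |
        grep -Ev "$PRUNE_REGEX" |
        while read -r sub; do
            walk "$sub"
        done
}
walk /Users/jack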

Buffered and Cache memory in Solaris

How do I get the buffer memory, cache memory and block in/out statistics on Solaris? For example, on Linux I can get them using vmstat. vmstat on Linux gives:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
Whereas vmstat on Solaris doesn't show buff and cache under -----memory----, and there is no -----io---- section at all. How can I get these fields on Solaris?
Kernel memory:
kstat -p > /var/tmp/kstat-p
More detailed kernel memory statistics:
kstat -p -c kmem_cache
kstat -p -m vmem
kstat -p -c vmem
Alternative:
echo "::kmastat" | mdb -k > /var/tmp/kmastat
Do not use iostat that way; instead, start by showing busy disks with real-time sampling (this is what you want first):
iostat -xmz 2 4 # -> 2 seconds sampling time, 4 sampling intervals
show historical average data:
iostat -xm
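If sar is enabled on the system (an assumption about your setup), its buffer report and vmstat's paging report come closest to the Linux buff/cache and io columns:
sar -b 2 4      # buffer cache activity, including read/write cache hit rates (%rcache, %wcache)
vmstat -p 2 4   # detailed paging activity, sampled every 2 seconds, 4 samples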

Wget Hanging, Script Stops

Evening,
I am running a lot of wget commands using xargs
cat urls.txt | xargs -n 1 -P 10 wget -q -t 2 --timeout 10 --dns-timeout 10 --connect-timeout 10 --read-timeout 20
However, once the file has been parsed, some of the wget instances 'hang'. I can still see them in the system monitor, and it can take about 2 minutes for them all to complete.
Is there any way I can specify that an instance should be killed after 10 seconds? I can re-download the URLs that failed later.
In the system monitor, the hung wget instances show up as sk_wait_data; xargs is there as 'do_wait', but wget seems to be the issue, because once I kill them my script continues.
I believe this should do it:
wget -v -t 2 --timeout 10
According to the docs:
--timeout: Set the network timeout to seconds seconds. This is equivalent to specifying
--dns-timeout, --connect-timeout, and --read-timeout, all at the same time.
Check the verbose output too and see more of what it's doing.
Also, you can try:
timeout 10 wget -v -t 2
Or you can do what timeout does internally:
( cmdpid=$BASHPID; (sleep 10; kill $cmdpid) & exec wget -v -t 2 )
(As seen in: BASH FAQ entry #68: "How do I run a command, and have it abort (timeout) after N seconds?")
GNU Parallel can download in parallel, and retry the process after a timeout:
cat urls.txt | parallel -j10 --timeout 10 --retries 3 wget -q -t 2
If the time it takes to fetch a URL changes (e.g. due to a faster internet connection), you can let GNU Parallel figure out the timeout:
cat urls.txt | parallel -j10 --timeout 1000% --retries 3 wget -q -t 2
This will make GNU Parallel record the median time for a successful job and set the timeout dynamically to 10 times that.
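If you want to come back to the failed downloads later, a job log makes that easy (a sketch; it assumes a GNU Parallel version recent enough to have --joblog and --retry-failed):
# record every job with its runtime and exit status
cat urls.txt | parallel -j10 --timeout 1000% --joblog wget.log wget -q -t 2
# later: re-run only the jobs marked as failed or timed out in the log
parallel --retry-failed --joblog wget.log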

getting the name or number of the core doing the work

Is there a way to use the Unix command line to get the name (or number) of the core that is generating or processing the work?
The purpose is to check that a parallel system is actually using all of its cores, so I want it to return the core name as well as the IP address...
I know the IP address comes from
ifconfig
I just need the equivalent for cores.
This is for both OS X and Linux systems
Use sysctl on Mac OS X, and /proc/cpuinfo on Linux.
On Linux (RHEL/CentOS), to get the number of logical cores, which differs from the number of physical cores when hyper-threading is enabled, count up the siblings for each physical CPU.
OSTYPE=`uname`
case $OSTYPE in
Darwin)
NCORES=`sysctl -n hw.physicalcpu`
NLCORES=`sysctl -n hw.logicalcpu`
;;
Linux)
NCORES=`grep processor /proc/cpuinfo | wc -l`
NLCORES=`
grep 'physical id\|siblings' /proc/cpuinfo |
sed 'N;s/\n/ /' |
uniq |
awk -F: '{count += $3} END {print count}'
`
;;
*)
echo "Unsupported OS" >&2
exit 1
;;
esac
echo "Number of cores: $NCORES"
echo "Number of logical cores (HT): $NLCORES"
Tested only on OS X Mountain Lion and CentOS 5, so some fixes may be needed for other systems. The script is not fully tested; use it with care.
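If the goal is to see which core is actually doing the work, rather than just how many there are, Linux exposes that directly; OS X has no equivalent ps column, but Activity Monitor shows per-core load. An illustrative snippet ($PID is whatever process you want to inspect; mpstat comes from the sysstat package):
ps -o pid,psr,comm -p "$PID"   # psr = the processor the process last ran on (Linux)
mpstat -P ALL 2                # per-core utilization sampled every 2 seconds (Linux)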

Kill and restart multiple processes that fit a certain pattern

I am trying to write a shell script that will kill all running processes that match a certain pattern and then restart them. I can display the processes with:
ps -ef|grep ws_sched_600.sh|grep -v grep|sort -k 10
which gives a list of the relevant processes:
user 2220258 1 0 16:53:12 - 0:01 /bin/ksh /../../../../../ws_sched_600.sh EDW02_env
user 5562418 1 0 16:54:55 - 0:01 /bin/ksh /../../../../../ws_sched_600.sh EDW03_env
user 2916598 1 0 16:55:00 - 0:01 /bin/ksh /../../../../../ws_sched_600.sh EDW04_env
But I am not too sure how to pass the process IDs to kill.
The sort doesn't seem necessary. You can use awk to print the second column and xargs to convert the output into command-line arguments to kill:
ps -ef | grep ws_sched_600.sh | awk '{print $2}' | xargs kill
Alternatively, you could use pkill or killall, which kill based on process name:
pkill -f ws_sched_600.sh
pkill ws_sched_600.sh
If you are concerned about running your command on multiple platforms where pkill might not be available:
ps -ef | awk '/ws_sched_600/ && !/awk/ {cmd="kill -9 " $2; system(cmd)}'
I think this is what you are looking for
for proc in $(ps -ef | grep ws_sched_600.sh | grep -v grep | awk '{print $2}')
do
kill -9 "$proc"
done
edit:
Of course... use xargs, it's better.
