GNU parallel was very slow in the case below - unix

Unix gurus,
Here is my requirement: find the list of files and use iconv to convert their encoding to UTF-8.
Eg: FILE_LIST="abcfilename1 abcfilename2 abcfilename3 abcfilename4"
Note: the filenames can be anything.
Code in loop:
for f in $FILE_LIST; do
  iconv -f ISO-8859-1 -t UTF-8 "${f}_${DATE}.txt" > "${f}_${DATE}.utf8"
  mv "${f}_${DATE}.utf8" "${f}_${DATE}.txt"
done
This code waits for each file conversion to finish, so it takes a long time to complete, and only a single thread/CPU is utilized at a time.
Created code with multiple background jobs:
for f in $FILE_LIST; do
  iconv -f ISO-8859-1 -t UTF-8 "${f}_${DATE}.txt" > "${f}_${DATE}.utf8" &
done
wait
for f in $FILE_LIST; do
  mv "${f}_${DATE}.utf8" "${f}_${DATE}.txt" &
done
wait
This creates multiple background jobs, each running as a separate process, and each process uses a single thread/CPU. But if a file is huge (more than 2GB), a single thread is not fast enough.
I came across GNU parallel, which utilizes multiple threads/CPUs, but I am not sure how to list or find the files in the above scenario and run iconv on them. My main objective is to utilize maximum resources in the least time.
I tried iconv with a 2GB file and used GNU parallel to see whether multiple CPUs would improve performance; multiple cores were indeed used while running.
GNU parallel:
time find . -name 'filename.txt' | parallel -X iconv -f ISO-8859-1 -t UTF-8 {} \> {}.converted
real 0m14.58s
user 0m23.27s
sys 0m5.38s
Sequential:
time iconv -f ISO-8859-1 -t utf-8 filename.txt > filename.txt.utf8
real 0m6.49s
user 0m5.43s
sys 0m1.07s
I found that the sequential run is much faster than the parallel one. Am I missing anything in the parallel command?
Kindly suggest how this scenario can be accomplished.
Thanks
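
One way to approach this with GNU parallel (a sketch, not an answer from the thread; it reuses the question's FILE_LIST and DATE naming and assumes the filenames contain no whitespace):
# One iconv job per file; GNU parallel defaults to one job per CPU core.
printf '%s\n' $FILE_LIST |
  parallel "iconv -f ISO-8859-1 -t UTF-8 {}_${DATE}.txt > {}_${DATE}.utf8 && mv {}_${DATE}.utf8 {}_${DATE}.txt"
Note that this, like the background-job version, only parallelizes across files: a single 2GB file is still converted by one iconv process. In the one-file benchmark above there is nothing to run in parallel, so the measured difference is mostly parallel's startup overhead, which is why the sequential run wins. The -X option also packs many filenames into a single command line, which combines badly with a per-file {} redirection.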

Related

parallel download of 7000 files

Could you please advise on an effective method to download a large number of files from EBI: https://github.com/eQTL-Catalogue/eQTL-Catalogue-resources/tree/master/tabix
We can use wget sequentially on each file. I have seen some information about using a Python script (How to parallelize file downloads?),
although there might be some complementary ways using a bash script or R?
If you don't require R here, then the xargs command-line utility allows parallel execution. (I'm using the Linux version from the findutils set of utilities. I believe this is also supported by the version of wget in git-bash. I don't know whether the macOS binary is installed by default or whether it includes this option; YMMV.)
For proof, I'll create a mywget script that prints the start time (and args) and then passes all arguments to wget.
(mywget)
#!/bin/bash
echo "$(date) :: $*"
wget "$@"
I also have a text file urllist with one URL per line (it's crafted so that I don't have to encode anything or worry about spaces, etc). (Because I'm benchmarking this against a personal remote server and don't want the slashdot effect, I'll obfuscate the URLs here ...)
(urllist)
https://somedomain.com/quux0
https://somedomain.com/quux1
https://somedomain.com/quux2
First, no parallelization, simply consecutive (default). (The -a urllist is to read items from the file urllist instead of stdin. The -q is to be quiet, not required but certainly very helpful when doing things in parallel, since the typical verbose option has progress bars that will overlap each other.)
$ time xargs -a urllist ./mywget -q
Tue Feb 1 17:27:01 EST 2022 :: -q https://somedomain.com/quux0
Tue Feb 1 17:27:10 EST 2022 :: -q https://somedomain.com/quux1
Tue Feb 1 17:27:12 EST 2022 :: -q https://somedomain.com/quux2
real 0m13.375s
user 0m0.210s
sys 0m0.958s
Second, adding -P 3 so that I run up to 3 simultaneous processes. The -n1 is required so that each call to ./mywget gets only one URL. You can adjust this if you want a single call to download multiple files consecutively.
$ time xargs -n1 -P3 -a urllist ./mywget -q
Tue Feb 1 17:27:46 EST 2022 :: -q https://somedomain.com/quux0
Tue Feb 1 17:27:46 EST 2022 :: -q https://somedomain.com/quux1
Tue Feb 1 17:27:46 EST 2022 :: -q https://somedomain.com/quux2
real 0m13.088s
user 0m0.272s
sys 0m1.664s
In this case, as BenBolker suggested in a comment, parallel download saved me nothing; it still took 13 seconds. However, you can see that in the first block the downloads started sequentially, 9 seconds and then 2 seconds apart. (We can infer that the first file is much larger, taking 9 seconds, and the second file took about 2 seconds.) In the second block, all three started at the same time.
(Side note: this doesn't require a shell script at all; you can use R's system or the processx::run functions to call xargs -n1 -P3 wget -q with a text file of URLs that you create in R. So you can still do this comfortably from the warmth of your R console.)
I had a similar task and my approach was the following:
I used Python, Redis and supervisord.
I pushed all the paths/URLs of the files I needed to a Redis list (I just created a small Python script to read my CSV and push it to a Redis queue/list).
Then I created another Python script to read (pull) one item from the Redis list and download it.
Using supervisord, I launched 10 parallel Python processes that pulled file paths from Redis and downloaded the files.
It might be too complicated for you, but this solution is very scalable and can use multiple servers, etc.
Thank you all. I have investigated a few other ways to do it:
#!/bin/bash
############################
while read -r file; do
  wget "${file}" &
done < files.txt
###########################
while read -r file; do
  wget "${file}" -b
done < files.txt
##########################
cat files.txt | xargs -n 1 -P 10 wget -q
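Since the broader topic here is GNU parallel, the same throttled download can also be written with it (a sketch, assuming GNU parallel is installed; equivalent to the xargs -n 1 -P 10 form above):
# Up to 10 simultaneous downloads, one URL per job, read from files.txt:
parallel -j 10 wget -q {} < files.txt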

How to do large file parallel encryption using GnuPG and GNU parallel?

I'm trying to write a parallel compress/encrypt backup script for archiving,
using GNU parallel, xz and GnuPG. The core part of the script is:
tar --create --format=posix --preserve-permissions --same-owner --directory $BASE/$name --to-stdout . \
| parallel --pipe --recend '' --keep-order --block-size 128M "xz -9 --check=sha256 | gpg --encrypt --recipient $RECIPIENT" \
| pv > $TARGET/$FILENAME
Without GnuPG encryption it works great (uncompressing and untarring works),
but after adding parallel encryption, it fails to decrypt with the error below:
[don't know]: invalid packet (ctb=0a)
gpg: WARNING: encrypted message has been manipulated!
gpg: decrypt_message failed: Unexpected error
: Truncated tar archive
tar: Error exit delayed from previous errors.
Because the decrypted size equals GNU parallel's block size (around 125M), I assume this is related to GnuPG's support for partial-block encryption. How can I solve this problem?
FYI
Another parallel gpg encryption issue, about random number generation:
https://unix.stackexchange.com/questions/105059/parallel-pausing-and-resuming
Pack
tar --create --format=posix --preserve-permissions --same-owner --directory $BASE/$name --to-stdout . |
parallel --pipe --recend '' --keep-order --block-size 128M "xz -9 --check=sha256 | gpg --encrypt --recipient $RECIPIENT;echo bLoCk EnD" |
pv > $TARGET/$FILENAME
Unpack
cat $TARGET/$FILENAME |
parallel --pipe --recend 'bLoCk EnD\n' -N1 --keep-order --rrs 'gpg --decrypt | xz -d' |
tar tv
-N1 is needed to make sure we pass a single record at a time. GnuPG does not support decrypting multiple merged records.
GnuPG does not support concatenating multiple encryption streams and decrypting them at once. You will have to store multiple files and decrypt them individually. If I'm not mistaken, your command even mixes up the outputs of all the parallel instances of GnuPG, so the result is more or less random garbage.
Anyway: GnuPG also takes care of compression; have a look at the --compress-algo option. If you prefer to use xz, apply --compress-algo none so GnuPG does not try to compress the already-compressed message again. Encryption has massive CPU-instruction support these days; xz -9 might in fact be more time-intensive than the encryption (although I did not benchmark this).
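For example (a sketch reusing the question's variables, not a drop-in replacement for the parallel pipeline; the whole stream is produced by a single xz and gpg here):
# Compress once with xz, then tell gpg not to compress again:
tar --create --format=posix --preserve-permissions --same-owner \
    --directory "$BASE/$name" --to-stdout . |
  xz -9 --check=sha256 |
  gpg --encrypt --recipient "$RECIPIENT" --compress-algo none > "$TARGET/$FILENAME"
The result is one ordinary OpenPGP file that decrypts with gpg --decrypt and unpacks with xz -d | tar.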
That's mainly a gpg issue: gpg does not support multithreading and probably
never will (you can search the web for the reasons).
It even got worse with gpg v2: you cannot even run multiple gpg v2 instances in
parallel, because they all lock the gpg-agent, which now does all the
work. Maybe we should look for an alternative when doing mass encryption.
https://answers.launchpad.net/duplicity/+question/296122
EDIT: No. It is possible to run multiple gpg v2 instances at the same time, without any problem with the gpg-agent.

Force line-buffering of stdout in a pipeline

Usually, stdout is line-buffered. In other words, as long as your printf argument ends with a newline, you can expect the line to be printed instantly. This does not appear to hold when using a pipe to redirect to tee.
I have a C++ program, a, that outputs strings, always \n-terminated, to stdout.
When it is run by itself (./a), everything prints correctly and at the right time, as expected. However, if I pipe it to tee (./a | tee output.txt), it doesn't print anything until it quits, which defeats the purpose of using tee.
I know that I could fix it by adding a fflush(stdout) after each printing operation in the C++ program. But is there a cleaner, easier way? Is there a command I can run, for example, that would force stdout to be line-buffered, even when using a pipe?
You can try stdbuf:
$ stdbuf --output=L ./a | tee output.txt
(big) part of the man page:
-i, --input=MODE adjust standard input stream buffering
-o, --output=MODE adjust standard output stream buffering
-e, --error=MODE adjust standard error stream buffering
If MODE is 'L' the corresponding stream will be line buffered.
This option is invalid with standard input.
If MODE is '0' the corresponding stream will be unbuffered.
Otherwise MODE is a number which may be followed by one of the following:
KB 1000, K 1024, MB 1000*1000, M 1024*1024, and so on for G, T, P, E, Z, Y.
In this case the corresponding stream will be fully buffered with the buffer
size set to MODE bytes.
Keep this in mind, though:
NOTE: If COMMAND adjusts the buffering of its standard streams ('tee' does
for example) then that will override corresponding settings changed by 'stdbuf'.
Also some filters (like 'dd' and 'cat' etc.) don't use streams for I/O,
and are thus unaffected by 'stdbuf' settings.
You are not running stdbuf on tee, you're running it on a, so this shouldn't affect you, unless you set the buffering of a's streams in a's source.
Also, stdbuf is not POSIX, but part of GNU coreutils.
Try unbuffer (man page) which is part of the expect package. You may already have it on your system.
In your case you would use it like this:
unbuffer ./a | tee output.txt
The -p option is for pipeline mode where unbuffer reads from stdin and passes it to the command in the rest of the arguments.
You can use setlinebuf from stdio.h.
setlinebuf(stdout);
This should change the buffering to "line buffered".
If you need more flexibility you can use setvbuf.
You may also try to execute your command in a pseudo-terminal using the script command (which should enforce line-buffered output to the pipe)!
script -q /dev/null ./a | tee output.txt # Mac OS X, FreeBSD
script -c "./a" /dev/null | tee output.txt # Linux
Be aware the script command does not propagate back the exit status of the wrapped command.
The unbuffer command from the expect package at the @Paused until further notice answer did not work for me the way it was presented.
Instead of using:
./a | unbuffer -p tee output.txt
I had to use:
unbuffer -p ./a | tee output.txt
(-p is for pipeline mode where unbuffer reads from stdin and passes it to the command in the rest of the arguments)
The expect package can be installed on:
MSYS2 with pacman -S expect
Mac OS with brew install expect
Update
I recently had buffering problems with Python inside a shell script (when trying to append a timestamp to its output). The fix was to pass the -u flag to Python, so that run.sh invokes python -u script.py:
unbuffer -p /bin/bash run.sh 2>&1 | tee /dev/tty | ts '[%Y-%m-%d %H:%M:%S]' >> somefile.txt
This command puts a timestamp on the output and sends it to a file and to stdout at the same time.
The ts program (timestamp) can be installed with the moreutils package.
Update 2
Recently I also had problems with grep buffering its output; passing the --line-buffered argument to grep stopped it from buffering.
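A quick way to see the effect (a sketch; the date loop just stands in for any slow line producer, Ctrl-C to stop):
# Without --line-buffered, grep block-buffers when its stdout is a pipe and
# nothing reaches tee until the buffer fills; with it, a line appears every second:
while true; do date; sleep 1; done | grep --line-buffered . | tee dates.txt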
If you use the C++ stream classes instead, every std::endl is an implicit flush. Using C-style printing, I think the method you suggested (fflush()) is the only way.
The best answer IMO is grep's --line-buffered option, as stated here:
https://unix.stackexchange.com/a/53445/40003

A doubt about piping in UNIX

In The Unix Programming Environment by K & P, it is written that
" The programs in a pipeline actually run at the same time, not one after another.
This means that programs in a pipeline can be interactive;"
How can programs run at the same time?
For example: $ who | grep mary | wc -l
How can grep mary be executed before who has run, and how can wc -l be executed before it
knows the results of the previous programs?
All three programs start at once; grep and wc wait for input via stdin.
who outputs a line of data, which grep then receives.
If the line matches, grep writes it to stdout, which wc then reads and counts.
In the meantime, who may have been writing out more data for grep, and so on.
Each program needs the results of the previous one, but it doesn't need all of the results before it can start working, which is why pipelining is feasible.
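You can watch this happen with a small sketch, where the reader stamps each line as it arrives:
# "two" is stamped two seconds after "one", showing the reader runs
# concurrently with the producer instead of waiting for it to exit:
(echo one; sleep 2; echo two) | while read -r line; do
  echo "$(date +%T) got: $line"
done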

Most powerful examples of Unix commands or scripts every programmer should know

There are many things that all programmers should know, but I am particularly interested in the Unix/Linux commands that we should all know. For accomplishing tasks that we may come up against at some point such as refactoring, reporting, network updates etc.
The reason I am curious is that, having previously worked as a software tester at a software company while studying for my degree, I noticed that all of the developers (who were developing Windows software) had two computers.
To their left was their Windows XP development machine, and to the right was a Linux box, I think Ubuntu. They told me they used it because it provided powerful Unix operations that Windows couldn't offer in their development process.
This makes me curious to know: as a software engineer, what do you believe are some of the most powerful scripts/commands/uses that you can perform on a Unix/Linux operating system, that every programmer should know for solving real-world tasks that may not necessarily relate to writing code?
We all know what sed, awk and grep do. I am interested in some actual Unix/Linux scripting pieces that have solved a difficult problem for you, so that other programmers may benefit. Please provide your story and source.
I am sure there are numerous examples like this that people keep in their 'Scripts' folder.
Update: People seem to be misinterpreting the question. I am not asking for the names of individual Unix commands, but rather for UNIX code snippets that have solved a problem for you.
Best answers from the Community
Traverse a directory tree and print out paths to any files that match a regular expression:
find . -exec grep -l -e 'myregex' {} \; >> outfile.txt
Invoke the default editor (nano/vim)
(works on most Unix systems, including Mac OS X)
The default editor is whatever your EDITOR environment variable is set to, e.g. export EDITOR=/usr/bin/pico, which under Mac OS X is set in ~/.profile.
Ctrl+x Ctrl+e
List all running network connections (including which app they belong to)
lsof -i -nP
Clear the shell's command history (another of my favourites)
history -c
I find commandlinefu.com to be an excellent resource for various shell scripting recipes.
Examples
Common
# Run the last command as root
sudo !!
# Rapidly invoke an editor to write a long, complex, or tricky command
ctrl-x ctrl-e
# Execute a command at a given time
echo "ls -l" | at midnight
Esoteric
# output your microphone to a remote computer's speaker
dd if=/dev/dsp | ssh -c arcfour -C username@host dd of=/dev/dsp
How to exit VI
:wq
Saves the file and ends the misery.
Alternative of ":wq" is ":x" to save and close the vi editor.
grep
awk
sed
perl
find
A lot of Unix power comes from its ability to manipulate text files and filter data. Of course, you can get all of these commands for Windows; they are just not native to the OS like they are in Unix.
There is also the ability to chain commands together with pipes etc., which can create extremely powerful single lines of commands from simple functions.
Your shell is the most powerful tool you have available:
being able to write simple loops etc.
understanding file globbing (e.g. *.java etc.)
being able to put together commands via pipes, subshells, redirection, etc.
Having that level of shell knowledge allows you to do enormous amounts on the command line, without having to record info via temporary text files or copy/paste, and to leverage the huge number of utility programs that permit slicing and dicing of data.
Unix Power Tools will show you so much of this. Every time I open my copy I find something new.
I use this so much I am actually ashamed of myself. Remove spaces from all filenames and replace them with an underscore:
[removespaces.sh]
#!/bin/bash
find . -type f -name "* *" | while IFS= read -r file
do
  mv "$file" "${file// /_}"
done
My personal favorite is the lsof command.
"lsof" can be used to list opened file descriptors, sockets, and pipes.
I find it extremely useful when trying to figure out which processes have used which ports/files on my machine.
Example: List all internet connections without hostname resolution and without port to port name conversion.
lsof -i -nP
http://www.manpagez.com/man/8/lsof/
If you make a typo in a long command, you can rerun the command with a substitution (in bash):
mkdir ~/aewsomeDirectory
You can see that "awesome" is misspelled. To rerun the command with the typo corrected, type:
^aew^awe
It then prints the substituted command (mkdir ~/awesomeDirectory) and runs it. (Don't forget to undo any damage done by the incorrect command!)
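For reference, bash history expansion offers a global variant of the same trick (^old^new only replaces the first occurrence):
^aew^awe          # rerun the previous command with the first "aew" replaced
!!:gs/aew/awe/    # rerun it with every occurrence of "aew" replaced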
The tr command is the most under-appreciated command in Unix:
#Convert all input to upper case
ls | tr a-z A-Z
#take the output and put into a single line
ls | tr "\n" " "
#get rid of all numbers
ls -lt | tr -d 0-9
When solving problems on faulty Linux boxes, by far the most common key sequence I end up typing is Alt+SysRq R E I S U B.
The power of these tools (grep, find, awk, sed) comes from their versatility, so giving a particular case seems quite useless.
man is the most powerful command, because then you can understand what you type instead of just blindly copy-pasting from Stack Overflow.
Examples are welcome, but there are already topics for this.
My most used:
grep something_to_find * -R
which can be replaced by ack, and
find | xargs
find with its results piped into xargs can be very powerful.
Some of you might disagree with me, but nevertheless, here's something to talk about: if one learns gawk (or another awk variant) thoroughly, one can skip learning and using grep/sed/wc/cut/paste and a few other *nix tools. All you need is one good tool to do the job of many combined.
Some way to search (multiple) badly formatted log files, in which the search string may be found on an "orphaned" next line. For example, to display both the 1st, and a concatenated 3rd and 4th line when searching for id = 110375:
[2008-11-08 07:07:01] [INFO] ...; id = 110375; ...
[2008-11-08 07:07:02] [INFO] ...; id = 238998; ...
[2008-11-08 07:07:03] [ERROR] ... caught exception
...; id = 110375; ...
[2008-11-08 07:07:05] [INFO] ...; id = 800612; ...
I guess there must be better solutions (yes, add them...!) than the following concatenation of the two lines using sed prior to actually running grep:
#!/bin/bash
if [ $# -ne 1 ]
then
  echo "Usage: `basename $0` id"
  echo "Searches all myproject's logs for the given id"
  exit 1
fi
# When finding "caught exception", append the next line into the pattern
# space using "N", and then replace the newline with a colon and a space
# to ensure a single line starting with a timestamp, which allows sorting
# the output of multiple files:
ls -rt /var/www/rails/myproject/shared/log/production.* \
| xargs cat | sed '/caught exception$/N;s/\n/: /g' \
| grep "id = $1" | sort
...to yield:
[2008-11-08 07:07:01] [INFO] ...; id = 110375; ...
[2008-11-08 07:07:03] [ERROR] ... caught exception: ...; id = 110375; ...
Actually, a more generic solution would append all (possibly multiple) lines that do not start with some [timestamp] to the previous line. Anyone? Not necessarily using sed, of course.
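Here is one hedged awk sketch of that generic approach, assuming continuation lines are exactly those that do not start with "[":
# Glue every line that does not start with "[" onto the previous line,
# separated by ": ", then print each reassembled record:
awk '/^\[/ { if (buf != "") print buf; buf = $0; next }
     { buf = buf ": " $0 }
     END { if (buf != "") print buf }' /var/www/rails/myproject/shared/log/production.*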
for card in `seq 1 8`; do
  for ts in `seq 1 31`; do
    echo $card $ts >> /etc/tuni.cfg
  done
done
This was better than writing the silly 248 lines of config by hand.
Needed to drop some leftover tables that were all prefixed with 'tmp':
for table in `echo show tables | mysql quotiadb | grep ^tmp`; do
  echo "drop table $table;"
done
Review the output, then rerun the loop and pipe it to mysql.
Finding PIDs without the grep itself showing up
export CUPSPID=`ps -ef | grep cups | grep -v grep | awk '{print $2;}'`
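Two common alternatives avoid the grep -v grep dance entirely (assuming procps-style tools are available):
pgrep cups                            # print the PIDs of processes matching "cups"
ps -ef | awk '/[c]ups/ { print $2 }'  # "[c]ups" cannot match its own ps entry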
Repeat your previous command in bash using !!. I often run chown otheruser: -R /home/otheruser and forget to use sudo. When that happens, the following is a little easier than pressing arrow-up and Home:
sudo !!
I'm also not a fan of automatically resolved hostnames and port names, so I keep iptables aliased to iptables -nL --line-numbers. I'm not even sure why the line numbers are hidden by default.
Finally, if you want to check whether a process is listening on a port as it should, bound to the right address, you can run
netstat -nlp
Then you can grep for the process name or port number (-n gives you numeric output).
I also love having the aid of colors in the terminal. I add this to my bashrc to remind me whether I'm root without even having to read anything. It actually helps me a lot; I never forget sudo anymore.
red='\033[1;31m'
green='\033[1;32m'
none='\033[0m'
if [ $(id -u) -eq 0 ];
then
  PS1="[\[$red\]\u\[$none\]@\H \w]$ "
else
  PS1="[\[$green\]\u\[$none\]@\H \w]$ "
fi
Those are all very simple commands, but I use them a lot. Most of them have even earned an alias on my machines.
Grep (try Windows Grep)
sed (try Sed for Windows)
In fact, there's a great set of ports of really useful *nix commands available at http://gnuwin32.sourceforge.net/. If you have a *nix background and now use windows, you should probably check them out.
You would be better off keeping a cheatsheet with you... there is no single command that can be termed the most useful. If a particular command does your job, it is useful and powerful.
Edit: You want powerful shell scripts? Shell scripts are programs. Get the basics right, build on individual commands, and you'll get what is called a powerful script. A script that serves your need is powerful; otherwise it's useless. It would have been better had you mentioned a problem and asked how to solve it.
Sort of an aside, but you can get PowerShell on Windows. It's really powerful and can do a lot of the *nix-type stuff. One cool difference is that you work with .NET objects instead of text, which can be useful if you're using the pipeline for filtering etc.
Alternatively, if you don't need the .NET integration, install Cygwin on the Windows box (and add its directory to the Windows PATH).
The fact that you can use -name and -iname multiple times in a find command was an eye-opener to me.
[findplaysong.sh]
#!/bin/bash
cd ~
echo Matched...
find /home/musicuser/Music/ -type f -iname "*$1*" -iname "*$2*" -exec echo {} \;
echo Sleeping 5 seconds
sleep 5
find /home/musicuser/Music/ -type f -iname "*$1*" -iname "*$2*" -exec mplayer {} \;
exit
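Usage, assuming the script is saved as findplaysong.sh and made executable (the search terms are hypothetical):
chmod +x findplaysong.sh
./findplaysong.sh mozart requiem   # echoes the matches, waits 5 seconds, then plays them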
When things work on one server but are broken on another, the following lets you compare all the related libraries:
export MYLIST=`ldd amule | awk '{ print $3; }'`; for a in $MYLIST; do cksum $a; done
Compare the resulting list between the two machines and you can isolate the differences quickly.
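A sketch of the comparison itself, with libs.hostA.txt and libs.hostB.txt as placeholder filenames:
# On each machine, capture the checksums of the resolved libraries:
ldd amule | awk '{ print $3 }' | xargs cksum > libs.$(hostname).txt
# Copy both files to one machine, then:
diff libs.hostA.txt libs.hostB.txt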
To run several processes in parallel without overloading the machine too much (on a multiprocessor architecture):
NP=$(grep -c ^processor /proc/cpuinfo)
# your loop starts here
if [ $(jobs | wc -l) -gt $NP ]; then
  wait
fi
launch_your_task_in_background &
# your loop ends here
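The same throttling can also be delegated to xargs (or GNU parallel); a sketch with a hypothetical launch_your_task command and the $NP computed above:
# Run one task per argument, with at most $NP jobs at a time:
printf '%s\n' task1 task2 task3 | xargs -n 1 -P "$NP" launch_your_task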
Start all web services:
find -iname '*webservice*' | xargs -I {} service {} restart
Search for a local class in the java subdirectories:
find -iname '*.java' | xargs grep 'class Pool'
Find all items from a file recursively in subdirectories of the current path:
cat searches.txt | xargs -d, -I {} grep -r {}
P.S. searches.txt: first,second,third, ... ,million
:(){ :|:& };:
A fork bomb, no root access required.
Try it at your own risk.
You can do anything with this...
gcc
