Performance considerations when using pipe | within awk - unix

awk -F'/' '{ print $1 |" sort " }' infile > outfile
versus
awk -F'/' '{ print $1 }' infile | sort > outfile
Are these MVCEs exactly equivalent, or are there portability or performance issues I don't know about if I use a pipe (or a redirect) from within awk?
Both commands produce the correct output.
Update: Did some research myself - see my answer below.

tl;dr Using a pipe within awk can be twice as slow.
I went and had a quick read through of io.c in the gawk source.
Piping within awk is POSIX as long as you don't use co-processes (the gawk-specific |& operator).
If your OS doesn't support pipes (this came up in the comments), gawk simulates them by writing to temporary files, as you'd expect. That is slow, but at least you get pipes where you otherwise wouldn't.
On an OS with real pipes, gawk forks a child and writes to it through a pipe, so you wouldn't expect a huge performance drop from piping within awk.
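For the curious, the gawk-only co-process form (the part that is not POSIX) looks like this; a minimal sketch, with sort standing in for any filter:
gawk 'BEGIN {
    print "banana" |& "sort"
    print "apple"  |& "sort"
    close("sort", "to")                  # close the write end so sort sees EOF
    while (("sort" |& getline line) > 0)
        print "from sort:", line
}'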
Interestingly, though, gawk has some optimisations for simple cases like
awk '{print $1}'
so I ran a test case.
for i in $(seq 1 10000000); do echo $((10000000 - i)) " " $i; done > infile
Ten million records seemed like enough to smooth out variance from other jobs on the system.
Then
time awk '{ print $1 }' infile | sort -n > /dev/null
real 0m10.350s
user 0m7.770s
sys 0m3.000s
or thereabouts on average.
but
time awk '{ print $1 | " sort -n " }' infile > /dev/null
real 0m25.870s
user 0m13.880s
sys 0m13.030s
As you can see, this is quite a dramatic difference.
So the conclusion: although piping within awk can be much slower, there are plenty of use cases where the convenience far outweighs the performance hit. It is really only in simple cases like the MVCE above that you should keep the pipe outside.
There is a discussion here about the difference between redirecting into awk versus calling awk with a filename. Although not directly related, it might be of interest if you have bothered to read this far.

If you use | inside awk, the print statements all write into a single pipe: awk opens the command inside the quotes once, and every print sends its output to that command's standard input.
Consider:
$ echo 1 4 2 3 | awk '{for (i=1; i<=NF; i++) print $i}'
1
4
2
3
Now try:
$ echo 1 4 2 3 | awk '{for (i=1; i<=NF; i++) print $i | "sort" }'
1
2
3
4
The four lines 1\n4\n2\n3 are all written to the same pipe, so sort receives them as one input stream. This can be combined into a more complex invocation, such as:
awk '{ print $1 > "names.unsorted"
       command = "sort -r > names.sorted"
       print $1 | command }' names
More at GNU awk manual on redirection.
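One detail that bites people: the pipe inside awk stays open until awk exits or you close() it, so close() is needed if you want the command's output flushed early or want to reuse the same command string. A sketch along the lines of the MVCE above:
awk '{ print $1 | "sort" }
END {
    close("sort")                # flush the pipe and wait for sort to finish
    print "--- end of sorted output ---"
}' infile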

Related

AWK or bash script to get the rows of a file where the specific column is equal to the given variable [duplicate]

I found some ways to pass external shell variables to an awk script, but I'm confused about ' and ".
First, I tried with a shell script:
$ v=123test
$ echo $v
123test
$ echo "$v"
123test
Then tried awk:
$ awk 'BEGIN{print "'$v'"}'
123test
$ awk 'BEGIN{print '"$v"'}'
123
Why the difference?
Lastly I tried this:
$ awk 'BEGIN{print " '$v' "}'
 123test 
$ awk 'BEGIN{print ' "$v" '}'
awk: cmd. line:1: BEGIN{print
awk: cmd. line:1: ^ unexpected newline or end of string
I'm confused about this.
Getting shell variables into awk
may be done in several ways. Some are better than others. This should cover most of them.
Using -v (The best way, most portable)
Use the -v option: (P.S. use a space after -v or it will be less portable. E.g., awk -v var= not awk -vvar=)
variable="line one\nline two"
awk -v var="$variable" 'BEGIN {print var}'
line one
line two
This should be compatible with most awks, and the variable is available in the BEGIN block as well.
If you have multiple variables:
awk -v a="$var1" -v b="$var2" 'BEGIN {print a,b}'
Warning. As Ed Morton writes, escape sequences will be interpreted, so \t becomes a real tab and not the literal characters \t, if that is what you are searching for. This can be avoided by using ENVIRON[] or ARGV[] instead.
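A quick illustration of the difference (var is just a throwaway name here):
awk -v var='a\tb' 'BEGIN { print var }'          # prints a<TAB>b: \t was expanded
var='a\tb' awk 'BEGIN { print ENVIRON["var"] }'  # prints a\tb: kept literal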
P.S. If you have a vertical bar or other regexp metacharacters as the separator, like |?( etc., they must be double-escaped. For example, three vertical bars ||| become -F'\\|\\|\\|'. You can also use -F"[|][|][|]".
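For example, both of these split on a literal |||:
printf 'a|||b|||c\n' | awk -F'\\|\\|\\|' '{ print $2 }'
b
printf 'a|||b|||c\n' | awk -F'[|][|][|]' '{ print $2 }'
b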
Example of getting data from a program/function into awk (here date is used):
awk -v time="$(date +"%F %H:%M" -d '-1 minute')" 'BEGIN {print time}'
Example of testing the contents of a shell variable as a regexp:
awk -v var="$variable" '$0 ~ var{print "found it"}'
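For instance (the pattern here is made up for the demo):
variable='^line'
printf 'line one\nanother line\n' | awk -v var="$variable" '$0 ~ var {print "found it: " $0}'
found it: line one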
Variable after code block
Here we set the variable after the awk code. This works fine as long as you do not need the variable in the BEGIN block:
variable="line one\nline two"
echo "input data" | awk '{print var}' var="${variable}"
or
awk '{print var}' var="${variable}" file
Adding multiple variables:
awk '{print a,b,$0}' a="$var1" b="$var2" file
In this way we can also set a different field separator (FS) for each file:
awk 'some code' FS=',' file1.txt FS=';' file2.ext
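For example (the file names and contents here are only illustrative):
printf 'a,b\n' > f1.txt
printf 'c;d\n' > f2.txt
awk '{print $2}' FS=',' f1.txt FS=';' f2.txt
b
d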
Setting the variable after the code block will not work for the BEGIN block, because the assignment is only processed when awk reaches that argument on the command line, after BEGIN has already run:
echo "input data" | awk 'BEGIN {print var}' var="${variable}"
This prints an empty line.
Here-string
A variable can also be fed to awk using a here-string, in shells that support them (including Bash):
variable="test"
awk '{print $0}' <<< "$variable"
test
This is the same as:
printf '%s' "$variable" | awk '{print $0}'
P.S. this treats the variable as a file input.
ENVIRON input
As TrueY writes, you can use ENVIRON to read environment variables.
Export a variable before running awk and you can print it out like this:
export X=MyVar
awk 'BEGIN{print ENVIRON["X"],ENVIRON["SHELL"]}'
MyVar /bin/bash
ARGV input
As Steven Penny writes, you can use ARGV to get the data into awk:
v="my data"
awk 'BEGIN {print ARGV[1]}' "$v"
my data
To get the data into the code body itself, not just the BEGIN block:
v="my data"
echo "test" | awk 'BEGIN{var=ARGV[1];ARGV[1]=""} {print var, $0}' "$v"
my data test
Variable within the code: USE WITH CAUTION
You can use a variable within the awk code, but it's messy and hard to read, and as Charles Duffy points out, this version may also be a victim of code injection. If someone adds bad stuff to the variable, it will be executed as part of the awk code.
This works by expanding the variable inside the code, so that it becomes part of the awk program text.
If you want to make an awk that changes dynamically with use of variables, you can do it this way, but DO NOT use it for normal variables.
variable="line one\nline two"
awk 'BEGIN {print "'"$variable"'"}'
line one
line two
Here is an example of code injection:
variable='line one\nline two" ; for (i=1;i<=1000;++i) print i"'
awk 'BEGIN {print "'"$variable"'"}'
line one
line two
1
2
3
.
.
1000
You can add lots of commands to awk this way, and even make it crash with invalid commands.
One valid use of this approach, though, is when you want to pass a symbol to awk to be applied to some input, e.g. a simple calculator:
$ calc() { awk -v x="$1" -v z="$3" 'BEGIN{ print x '"$2"' z }'; }
$ calc 2.7 '+' 3.4
6.1
$ calc 2.7 '*' 3.4
9.18
There is no way to do that using an awk variable populated with the value of a shell variable; you NEED the shell variable to expand to become part of the text of the awk script before awk interprets it. (See the comment below by Ed M.)
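To see why, try it with -v: the operator arrives as data, not code, so awk just concatenates the strings:
awk -v op='+' 'BEGIN { print 1 op 2 }'
1+2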
Extra info:
Use of double quote
It's always good to double-quote the variable, as in "$variable"; if not, multiple lines will be collapsed into one long single line.
Example:
var="Line one
This is line two"
echo $var
Line one This is line two
echo "$var"
Line one
This is line two
Other errors you can get without double quotes:
variable="line one\nline two"
awk -v var=$variable 'BEGIN {print var}'
awk: cmd. line:1: one\nline
awk: cmd. line:1: ^ backslash not last character on line
awk: cmd. line:1: one\nline
awk: cmd. line:1: ^ syntax error
And with single quotes, the variable is not expanded:
awk -v var='$variable' 'BEGIN {print var}'
$variable
More info about AWK and variables
Read this faq.
It seems that the good old ENVIRON awk built-in hash is not mentioned at all. An example of its usage:
$ X=Solaris awk 'BEGIN{print ENVIRON["X"], ENVIRON["TERM"]}'
Solaris rxvt
You could pass in the command-line option -v with a variable name (v) and a value (=) of the environment variable ("${v}"):
% awk -vv="${v}" 'BEGIN { print v }'
123test
Or to make it clearer (with far fewer vs):
% environment_variable=123test
% awk -vawk_variable="${environment_variable}" 'BEGIN { print awk_variable }'
123test
You can utilize ARGV:
v=123test
awk 'BEGIN {print ARGV[1]}' "$v"
Note that if you are going to continue into the body, you will need to adjust ARGC:
awk 'BEGIN {ARGC--} {print ARGV[2], $0}' file "$v"
I just adapted @Jotne's answer for use in a for loop:
for i in `seq 11 20`; do host myserver-$i | awk -v i="$i" '{print "myserver-"i" " $4}'; done
I had to insert a date at the beginning of each line of a log file, and I did it like this:
DATE=$(date +"%Y-%m-%d")
awk '{ print "'"$DATE"'", $0; }' /path_to_log_file/log_file.log
The output can be redirected to another file to save it.
Pro Tip
It can come in handy to create a function that handles this, so you don't have to type everything every time. Using the selected solution we get:
awk_switch_columns() {
    awk -v a="$1" -v b="$2" '{ t = $a; $a = $b; $b = t; print }'
}
And use it as...
echo 'a b c d' | awk_switch_columns 2 4
Output:
a d c b

How to repeat a character in Bourne Shell?

I want to repeat # 10 times, something like:
"##########"
How can I do it in Bourne shell (/bin/sh)? I have tried using print, but I guess it only works for the bash shell.
Please don't give bash syntax.
The shell itself has no obvious facility for repeating a string. For just ten repetitions, it's hard to beat the obvious
echo '##########'
For repeating a single character a specified number of times, this should work even on a busy BusyBox.
dd if=/dev/zero bs=10 count=1 | tr '\0' '#'
Not very elegant but fairly low overhead. (You may need to redirect the standard error from dd to get rid of pesky progress messages.)
If you have a file which is guaranteed to be long enough (such as, for example, the script you are currently running), you could translate its first 10 characters with tr:
head -c 10 "$0" | tr '\000-\377' '#'
If you have a really traditional userspace (such that head doesn't support the -c option) a 1980s-compatible variant might be
yes '#' | head -n 10 | tr -d '\n'
(Your tr might not support exactly the backslash sequences I have used here. Consult its man page or your local academic programmer from the late 1970s.)
... or, heck
strings /bin/sh | sed 's/.*/#/;10q' | tr -d '\n' # don't do this at home :-)
In pure legacy Bourne shell with no external utilities, you can't really do much better than
for f in 0 1 2 3 4 5 6 7 8 9; do
    printf '#'
done
In the general case, if you can come up with a generator expression which produces (at least) the required number of repetitions of something, you can loop over that. Here's a simple replacement for seq and jot:
echo | awk '{ for (i=0; i<10; ++i) print i }'
but then you might as well do the output from Awk:
echo | awk '{ for (i=0; i<10; ++i) printf "#" }'
Well, the pure Bourne shell (POSIX) solution, without pipes or forks, would probably be
i=0; while test $i -lt 10; do printf '#'; i=$((i+1)); done; printf '\n'
This easily generalizes to other repeated strings, e.g. a shell function
rept () {
    i=0
    while test $i -lt "$1"; do
        printf '%s' "$2"
        i=$((i+1))
    done
    printf '\n'
}
rept 10 '#'
If the HP Bourne shell is not quite POSIX and does not support arithmetic expansion with $(( )), you can use i=`expr $i + 1` instead.
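Spelled out with expr, the same function would look like this (a sketch for pre-POSIX shells):
rept () {
    i=0
    while test $i -lt "$1"; do
        printf '%s' "$2"
        i=`expr $i + 1`
    done
    printf '\n'
}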
You can use this trick with printf:
$ printf "%0.s#" {1..10}
##########
If the number can be a variable, then you need to use seq:
$ var=30
$ printf "%0.s#" $(seq $var)
##############################
This prints ##########:
for a in `seq 10`; do echo -n "#"; done

Unix script with variable parameters

So I am writing a script that takes several files, compares them, and then outputs the differing records.
The script works fine for 3 parameters (3 files), but I am having trouble making the number of parameters vary.
Consider the script is named Test.
If I write: Test 1.txt 2.txt,
the script will know that I have 2 inputs, which are 2 files, and will compare them and give me an output.
Furthermore, if I write Test 1.txt 2.txt 3.txt,
the script will know that I have 3 inputs, which are 3 files, compare them, and give me an output.
The script now has the following commands:
awk 'NR>2' ${1} | awk '{print $NF "\r"}' > N1
awk 'NR>2' ${2} | awk '{print $NF "\r"}' > N2
awk 'NR>2' ${3} | awk '{print $NF "\r"}' > N3
This works fine for 3 files, but the problem is that sometimes I have 2 files and sometimes I have 4.
I know I can fix that using loops, but I am new to this language and not very familiar with the syntax.
Thank you for your help :)
Use this:
x=1
for i in "$@"
do
    awk 'NR>2' "$i" | awk '{print $NF "\r"}' > "N$x"
    x=$((x+1))
done
$@ : the list of input parameters
Your awk commands can be combined: awk 'NR>2 {print $NF "\r"}' "$1" > N1.
Better yet, a single awk command to process all files:
awk '
FNR == 1 {output = "N" ++count}
FNR > 2 {print $NF "\r" > output}
' "$#"
"One awk to rule them all"

Unix - Need to cut a file which has multiple blanks as delimiter - awk or cut?

I need to get the records from a text file in Unix. The delimiter is multiple blanks. For example:
2U2133   1239
1290fsdsf   3234
From this, I need to extract
1239
3234
The delimiter for all records will always be 3 blanks.
I need to do this in a Unix script (.scr) and write the output to another file, or use it as input to a do-while loop. I tried the below:
while read readline
do
read_int=`echo "$readline"`
cnt_exc=`grep "$read_int" ${Directory path}/file1.txt| wc -l`
if [ $cnt_exc -gt 0 ]
then
int_1=0
else
int_2=0
fi
done < awk -F' ' '{ print $2 }' ${Directoty path}/test_file.txt
test_file.txt is the input file and file1.txt is a lookup file. But the above is not working; it gives me syntax errors near awk -F.
I tried writing the output to a file. The following worked on the command line:
more test_file.txt | awk -F' ' '{ print $2 }' > output.txt
This works and writes the records to output.txt from the command line. But the same command does not work inside the Unix script (it is a .scr file).
Please let me know where I am going wrong and how I can resolve this.
Thanks,
Visakh
The job of replacing multiple delimiters with just one is left to tr:
cat <file_name> | tr -s ' ' | cut -d ' ' -f 2
tr translates or deletes characters, and is perfectly suited to prepare your data for cut to work properly.
The manual states:
-s, --squeeze-repeats
replace each sequence of a repeated character that is
listed in the last specified SET, with a single occurrence
of that character
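A quick check with the sample data from the question (the runs of blanks are collapsed by tr -s regardless of their length):
printf '2U2133   1239\n1290fsdsf   3234\n' | tr -s ' ' | cut -d ' ' -f 2
1239
3234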
It depends on the version or implementation of cut on your machine. Some versions support an option, usually -i, that means 'ignore blank fields' or, equivalently, allow multiple separators between fields. If that's supported, use:
cut -i -d' ' -f 2 data.file
If not (and it is not universal, and maybe not even widespread, since neither GNU nor Mac OS X cut have the option), then using awk is better and more portable.
You need to pipe the output of awk into your loop, though:
awk -F' ' '{print $2}' ${Directory_path}/test_file.txt |
while read readline
do
read_int=`echo "$readline"`
cnt_exc=`grep "$read_int" ${Directory_path}/file1.txt| wc -l`
if [ $cnt_exc -gt 0 ]
then int_1=0
else int_2=0
fi
done
The only residual issue is whether the while loop runs in a sub-shell and therefore does not modify your main shell script's variables, just its own copies of them.
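A quick demonstration of that pitfall (count is a throwaway variable):
count=0
printf 'a\nb\n' | while read line; do count=$((count+1)); done
echo "$count"    # prints 0 in shells that run the loop body in a sub-shell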
With bash, you can use process substitution:
while read readline
do
read_int=`echo "$readline"`
cnt_exc=`grep "$read_int" ${Directory_path}/file1.txt| wc -l`
if [ $cnt_exc -gt 0 ]
then int_1=0
else int_2=0
fi
done < <(awk -F' ' '{print $2}' ${Directory_path}/test_file.txt)
This leaves the while loop in the current shell, but arranges for the output of the command to appear as if from a file.
The blank in ${Directory path} is not normally legal (unless it is another Bash feature I've missed out on); you also had a typo (Directoty) in one place.
Other ways of doing the same thing aside, the error in your program is this: You cannot redirect from (<) the output of another program. Turn your script around and use a pipe like this:
awk -F' ' '{ print $2 }' ${Directory path}/test_file.txt | while read readline
etc.
Besides, the use of "readline" as a variable name may or may not get you into problems.
In this particular case, you can use the following line
sed 's/  */\t/g' <file_name> | cut -f 2
to get your second column.
In bash you can start from something like this:
for n in `cat "${Directory_path}/test_file.txt" | cut -d " " -f 4`
do
    grep -c "$n" "${Directory_path}"/file*.txt
done
This should have been a comment, but since I cannot comment yet, I am adding this here.
This is from an excellent answer here: https://stackoverflow.com/a/4483833/3138875
tr -s ' ' <text.txt | cut -d ' ' -f4
tr -s '<character>' squeezes multiple repeated instances of <character> into one.
It's not working in the script because of the typo "Directoty" (for "Directory") in the last line of your script.
Cut isn't flexible enough. I usually use Perl for that:
cat file.txt | perl -F'   ' -ane 'print $F[1]."\n"'
Instead of the triple space after -F you can put any Perl regular expression. You access fields as $F[n], where n is the field number (counting starts at zero). This way there is no need for sed or tr.

Forcing the order of output fields from cut command

I want to do something like this:
cat abcd.txt | cut -f 2,1
and I want the order to be 2 and then 1 in the output. On the machine I am testing on (FreeBSD 6), this is not happening (it's printing in 1,2 order). Can you tell me how to do this?
I know I can always write a shell script to do this reversing, but I am looking for something using the 'cut' command options.
I think I am using version 5.2.1 of coreutils containing cut.
This can't be done using cut. According to the man page:
Selected input is written in the same order that it is read, and is written exactly once.
Patching cut has been proposed many times, but even complete patches have been rejected.
Instead, you can do it using awk, like this:
awk '{print($2,"\t",$1)}' abcd.txt
Replace the \t with whatever you're using as field separator.
Lars' answer was great, but I found an even better one. The issue with his is that it treats consecutive tabs (\t\t) as a single separator, so empty columns are lost. To fix this, use the following:
awk -v OFS="  " -F"\t" '{print $2, $1}' abcd.txt
Where:
-F"\t" is what to cut on exactly (tabs).
-v OFS="  " is what to separate with (two spaces)
Example:
printf 'A\tB\t\tD\n' | awk -v OFS="  " -F"\t" '{print $2, $4, $1, $3}'
This outputs:
B  D  A
(plus trailing separators, because $3 is empty)
