awk if statement with simple math - unix

I'm just trying to do some basic calculations on a CSV file.
Data:
31590,Foo,70
28327,Bar,291
25155,Baz,583
24179,Food,694
28670,Spaz,67
22190,bawk,4431
29584,alfred,142
27698,brian,379
24372,peter,22
25064,weinberger,8
Here's my simple awk script:
#!/usr/local/bin/gawk -f
BEGIN { FPAT="([^,]*)|(\"[^\"]+\")"; OFS=","; OFMT="%.2f"; }
NR > 1
END { if ($3>1336) $4=$3*0.03; if ($3<1336) $4=$3*0.05;}1
Wrong output:
31590,Foo,70
28327,Bar,291
28327,Bar,291
25155,Baz,583
25155,Baz,583
24179,Food,694
24179,Food,694
28670,Spaz,67
28670,Spaz,67
22190,bawk,4431
22190,bawk,4431
29584,alfred,142
29584,alfred,142
27698,brian,379
27698,brian,379
24372,peter,22
24372,peter,22
25064,weinberger,8
25064,weinberger,8
Expected output:
31590,Foo,70,3.5
28327,Bar,291,14.55
25155,Baz,583,29.15
24179,Food,694,34.7
28670,Spaz,67,3.35
22190,bawk,4431,132.93
29584,alfred,142,7.1
27698,brian,379,18.95
24372,peter,22,1.1
25064,weinberger,8,0.4
The math is simple:
if field $3 > 1336, then $4 = $3 * 0.03
if field $3 < 1336, then $4 = $3 * 0.05

There's no need to force awk to rebuild every record (which is what assigning to $4 does); just print the current record followed by the result of your calculation:
awk 'BEGIN{FS=OFS=","; OFMT="%.2f"} {print $0, $3*($3>1336?0.03:0.05)}' file
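A side note on number formatting: printing a computed value directly, as above, converts it with OFMT, so OFMT="%.2f" yields 3.50, 34.70, and so on. The answers below that assign to $4 instead convert the number with CONVFMT (default "%.6g") when the record is rebuilt, which is why their output shows 3.5. A quick comparison:
$ echo '31590,Foo,70' | awk 'BEGIN{FS=OFS=","; OFMT="%.2f"} {print $0, $3*0.05}'
31590,Foo,70,3.50
$ echo '31590,Foo,70' | awk 'BEGIN{FS=OFS=","; OFMT="%.2f"} {$4=$3*0.05; print}'
31590,Foo,70,3.5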

You shouldn't have anything in the END block; the per-record logic belongs in the main body:
BEGIN {
    FS = OFS = ","
    OFMT = "%.2f"
}
{
    if ($3 > 1336)
        $4 = $3 * 0.03
    else
        $4 = $3 * 0.05
    print
}
This results in
31590,Foo,70,3.5
28327,Bar,291,14.55
25155,Baz,583,29.15
24179,Food,694,34.7
28670,Spaz,67,3.35
22190,bawk,4431,132.93
29584,alfred,142,7.1
27698,brian,379,18.95
24372,peter,22,1.1
25064,weinberger,8,0.4

$ awk -F, -v OFS=, '{if ($3>1336) $4=$3*0.03; else $4=$3*0.05;} 1' data
31590,Foo,70,3.5
28327,Bar,291,14.55
25155,Baz,583,29.15
24179,Food,694,34.7
28670,Spaz,67,3.35
22190,bawk,4431,132.93
29584,alfred,142,7.1
27698,brian,379,18.95
24372,peter,22,1.1
25064,weinberger,8,0.4
Discussion
The END block is not executed at the end of each line but once, after the whole file has been read. Consequently, it is not helpful here.
The original code has two free-standing patterns, NR>1 and 1. The default action for each is to print the current record. That is why, in the "wrong output," the first line appeared once (it matched only 1) while every later line appeared twice (matching both NR>1 and 1).
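A minimal illustration of the doubling, with two bare patterns:
$ printf 'a\nb\n' | awk 'NR>1; 1'
a
b
b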

With awk:
awk -F, -v OFS=, '$3>1336?$4=$3*.03:$4=$3*.05' file
The construct condition ? expr1 : expr2 is awk's ternary operator, a much shorter way of writing the if/else.
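One caveat with this form: the whole expression doubles as the pattern, and its value is whatever gets assigned to $4, so a record whose computed value is 0 (e.g. $3 is 0) would not be printed at all:
$ echo 'x,y,0' | awk -F, -v OFS=, '$3>1336?$4=$3*.03:$4=$3*.05'
(no output)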

Related

If rows are otherwise identical, keep the one with the higher value in one field

I have a file that looks like this:
cat f1.csv:
col1,col2,col3
AK136742,BC051226,996
AK161599,Gm15417,4490
AK161599,Gm15417,6915
AK161599,Zbtb7b,1339
AK161599,Zbtb7b,1475
AK161599,Zbtb7b,1514
What I want to do is to keep one of the otherwise-duplicated rows if it has a greater number in col3. So if col1 and col2 are the same, then keep the row that has the greater number in col3.
So the desired output should be:
col1,col2,col3
AK136742,BC051226,996
AK161599,Gm15417,6915
AK161599,Zbtb7b,1514
I used the command below but it does not solve the problem:
cat f1.csv | sort -rnk3 | awk '!x[$3]++'
Any help is appreciated - thanks!
With your shown samples, please try the following.
awk '
BEGIN{
  FS=OFS=","
}
{ ind = $1 FS $2 }
FNR==1{
  print
  next
}
{
  arr[ind]=(arr[ind]>$NF?arr[ind]:$NF)
}
END{
  for(i in arr){
    print i,arr[i]
  }
}
' Input_file
Explanation: Adding a detailed explanation for the above.
awk '                                    ##Starting awk program from here.
BEGIN{                                   ##Starting BEGIN section of this program from here.
  FS=OFS=","                             ##Setting FS and OFS as comma here.
}
{ ind = $1 FS $2 }                       ##Setting ind to the 1st and 2nd field values here.
FNR==1{                                  ##Checking if it is the first line.
  print                                  ##Then print it.
  next                                   ##next will skip all further statements from here.
}
{
  arr[ind]=(arr[ind]>$NF?arr[ind]:$NF)   ##Creating arr with index ind, keeping only the higher last-field value seen so far.
}
END{                                     ##Starting END block of this program from here.
  for(i in arr){                         ##Starting a for loop here.
    print i,arr[i]                       ##Printing index and array arr value here.
  }
}
' Input_file                             ##Mentioning Input_file name here.
$ head -n 1 f1.csv; { tail -n +2 f1.csv | sort -t, -k1,2 -k3rn | awk -F, '!seen[$1,$2]++'; }
col1,col2,col3
AK136742,BC051226,996
AK161599,Gm15417,6915
AK161599,Zbtb7b,1514
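(To unpack that pipeline: head -n 1 prints the header, then tail -n +2 feeds only the data rows to sort, which groups rows by the first two fields and, via -k3rn, puts the largest col3 first within each group; awk '!seen[$1,$2]++' then keeps only the first, i.e. maximum, row per pair.)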
or to avoid naming the input file twice (e.g. so it'll work if the input is a pipe); note the sort keys shift by one field because of the prepended header-flag column:
$ awk '{print (NR>1) "," $0}' f1.csv | sort -t, -k1,1n -k2,3 -k4rn | cut -d',' -f2- | awk -F, '!seen[$1,$2]++'
col1,col2,col3
AK136742,BC051226,996
AK161599,Gm15417,6915
AK161599,Zbtb7b,1514
The answers provided seem a little complicated to me. Here's an answer all in awk:
#! /usr/bin/awk -f
NR == 1 {
    heading = $0
    next
}
{
    key = $1 "," $2
    if( values[key] < $3 ) {
        values[key] = $3
    }
}
END {
    print heading
    for( k in values ) {
        print k "," values[k] | "sort -t, -k1,2"
    }
}
$ ./max.awk -F, max.dat
col1,col2,col3
AK136742,BC051226,996
AK161599,Gm15417,6915
AK161599,Zbtb7b,1514
Using sort, you need:
sort -t, -k3,3nr file.csv | sort -t, -su -k1,2
The first sort orders the input numerically by the 3rd column, descending. The second sort is stable (-s; not all sort implementations support that) and uniques the output on the first two columns, thus keeping the maximum for each combination.
I ignored the header line.
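For the sample above, the two-step sort gives (the header ends up last because its third field compares as numeric 0):
$ sort -t, -k3,3nr f1.csv | sort -t, -su -k1,2
AK136742,BC051226,996
AK161599,Gm15417,6915
AK161599,Zbtb7b,1514
col1,col2,col3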

How to calculate max and min of multiple columns (row-wise) using awk

This might be simple - I have a file as below:
df.csv
col1,col2,col3,col4,col5
A,2,5,7,9
B,6,10,2,3
C,3,4,6,8
I want to compute max(col2,col4) - min(col3,col5) and write the result in a new column, but I get an error when using max and min in awk. The desired output should look like:
col1,col2,col3,col4,col5,New_col
A,2,5,7,9,2
B,6,10,2,3,3
C,3,4,6,8,2
I used the code below but it does not work - how can I solve this?
awk -F, '{print $1,$2,$3,$4,$5,$(max($7,$9)-min($8,$10))}'
Thank you.
$ cat tst.awk
BEGIN { FS=OFS="," }
{ print $0, (NR>1 ? max($2,$4) - min($3,$5) : "New_col") }
function max(a,b) {return (a>b ? a : b)}
function min(a,b) {return (a<b ? a : b)}
$ awk -f tst.awk file
col1,col2,col3,col4,col5,New_col
A,2,5,7,9,2
B,6,10,2,3,3
C,3,4,6,8,2
If your actual "which is larger" calculation is more involved than just using >, e.g. if you were comparing dates in some non-alphabetic format, or people's names where you have to compare the surname before the forename and handle titles, etc., then you'd write the functions as:
function max(a,b) {
# some algorithm to compare the 2 strings
}
function min(a,b) {return (max(a,b) == a ? b : a)}
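For example, a hypothetical sketch (the DD/MM/YYYY format here is just an assumption for illustration): dates in that format can be compared by rearranging them into YYYYMMDD strings, assuming zero-padded days and months:
function max(a,b,    A,B) {
    split(a, A, "/")  # A[1]=DD, A[2]=MM, A[3]=YYYY
    split(b, B, "/")
    # plain string comparison works because YYYYMMDD is fixed-width
    return (A[3] A[2] A[1] > B[3] B[2] B[1] ? a : b)
}
function min(a,b) {return (max(a,b) == a ? b : a)}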
You may use this awk:
awk 'BEGIN{FS=OFS=","} NR==1 {print $0, "New_col"; next} {print $0, ($2 > $4 ? $2 : $4) - ($3 < $5 ? $3 : $5)}' df.csv
col1,col2,col3,col4,col5,New_col
A,2,5,7,9,2
B,6,10,2,3,3
C,3,4,6,8,2
A more readable version:
awk '
BEGIN { FS = OFS = "," }
NR == 1 {
    print $0, "New_col"
    next
}
{
    print $0, ($2 > $4 ? $2 : $4) - ($3 < $5 ? $3 : $5)
}' df.csv
"get an error using max and min in awk and write the result in a new column."
No such functions are available in awk, but for two values you might harness the ternary operator, so in place of
max($7,$9)
try
($7>$9)?$7:$9
and in place of
min($8,$10)
try
($8<$10)?$8:$10
The above exploits ?:, which might be explained as check ? value_if_true : value_if_false. As a simple example, let file.txt content be
100,100
100,300
300,100
300,300
then
awk 'BEGIN{FS=","}{print ($1>$2)?$1:$2}' file.txt
output
100
300
300
300
Also, are you sure about the 1st $ in $(max($7,$9)-min($8,$10))? By writing that, you instructed awk to get the value of the n-th column, where n is the result of the computation inside (...).
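A quick demonstration of that $(...) indirection:
$ echo 'a,b,c' | awk -F, '{n=2; print $n, $(1+2)}'
b c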

Is there a way to extract all the duplicate records based on a particular column?

I'm trying to extract all (and only) the duplicate values from a pipe-delimited file.
My data file has 800 thousand rows with multiple columns, and I'm particularly interested in column 3. So I need to get the duplicate values of column 3 and extract all the duplicate rows from that file.
I am, however, able to achieve this as shown below:
cat Report.txt | awk -F'|' '{print $3}' | sort | uniq -d >dup.txt
and I feed the above into a loop as shown below:
while read dup
do
grep "$dup" Report.txt >>only_dup.txt
done <dup.txt
I've also tried the awk method
while read dup
do
awk -v a=$dup '$3 == a { print $0 }' Report.txt>>only_dup.txt
done <dup.txt
But, as I have a large number of records in the file, it's taking ages to complete, so I'm looking for an easy and quick alternative.
For example, I have data like this:
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
7|learning|Windows|Business|Requirements
8|learning|Mac|Business|Requirements
And my expected output which doesn't include unique records:
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
4|learning|Unix|Business|Team
6|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements
This may be what you want:
$ awk -F'|' 'NR==FNR{cnt[$3]++; next} cnt[$3]>1' file file
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
or if the file's too large for all the keys ($3 values) to fit in memory (which shouldn't be a problem with just the unique $3 values from 800,000 lines):
$ cat tst.awk
BEGIN { FS="|" }
{ currKey = $3 }
currKey == prevKey {
    if ( !prevPrinted++ ) {
        print prevRec
    }
    print
    next
}
{
    prevKey = currKey
    prevRec = $0
    prevPrinted = 0
}
$ sort -t'|' -k3,3 file | awk -f tst.awk
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
4|learning|Unix|Business|Team
6|learning|Unix|Business|Team
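For anyone unfamiliar with the two-file idiom in the first one-liner: NR counts records across all inputs while FNR resets for each file, so NR==FNR is true only while the first copy of the file is being read. The first pass counts each $3, the second pass prints lines whose $3 occurred more than once. A minimal sketch of the same pattern:
$ printf 'a\nb\na\n' > f
$ awk 'NR==FNR{cnt[$1]++; next} cnt[$1]>1' f f
a
a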
EDIT2: As per Ed sir's suggestion, fine-tuned my answer with more meaningful (IMO) array names.
awk '
match($0,/[^\|]*\|/){
  val=substr($0,RSTART+RLENGTH)
  if(!unique_check_count[val]++){
    numbered_indexed_array[++count]=val
  }
  actual_valued_array[val]=(actual_valued_array[val]?actual_valued_array[val] ORS:"")$0
  line_count_array[val]++
}
END{
  for(i=1;i<=count;i++){
    if(line_count_array[numbered_indexed_array[i]]>1){
      print actual_valued_array[numbered_indexed_array[i]]
    }
  }
}
' Input_file
Edit by Ed Morton: FWIW here's how I'd have named the variables in the above code:
awk '
match($0,/[^\|]*\|/) {
    key = substr($0,RSTART+RLENGTH)
    if ( !numRecs[key]++ ) {
        keys[++numKeys] = key
    }
    key2recs[key] = (key in key2recs ? key2recs[key] ORS : "") $0
}
END {
    for ( keyNr=1; keyNr<=numKeys; keyNr++ ) {
        key = keys[keyNr]
        if ( numRecs[key]>1 ) {
            print key2recs[key]
        }
    }
}
' Input_file
EDIT: Since the OP changed the Input_file to |-delimited, changing the code a bit as follows to deal with the new Input_file (thanks to Ed Morton sir for pointing it out).
awk '
match($0,/[^\|]*\|/){
  val=substr($0,RSTART+RLENGTH)
  if(!a[val]++){
    b[++count]=val
  }
  c[val]=(c[val]?c[val] ORS:"")$0
  d[val]++
}
END{
  for(i=1;i<=count;i++){
    if(d[b[i]]>1){
      print c[b[i]]
    }
  }
}
' Input_file
Could you please try the following; it will give output in the same sequence in which lines occur in Input_file.
awk '
match($0,/[^ ]* /){
  val=substr($0,RSTART+RLENGTH)
  if(!a[val]++){
    b[++count]=val
  }
  c[val]=(c[val]?c[val] ORS:"")$0
  d[val]++
}
END{
  for(i=1;i<=count;i++){
    if(d[b[i]]>1){
      print c[b[i]]
    }
  }
}
' Input_file
Output will be as follows.
2 learning Unix Business Team
4 learning Unix Business Team
6 learning Unix Business Team
3 learning Linux Business Requirements
5 learning Linux Business Requirements
Explanation of the above code:
awk '                              ##Starting awk program here.
match($0,/[^ ]* /){                ##Using the match function of awk, matching everything up to the first space.
  val=substr($0,RSTART+RLENGTH)    ##Creating variable val, the substring from RSTART+RLENGTH to the end of the line.
  if(!a[val]++){                   ##Checking if array a with index val is NULL; if so, go further and increase its count too.
    b[++count]=val                 ##Creating array b whose index is the incremented value of variable count and whose value is val.
  }                                ##Closing BLOCK for the if condition on array a here.
  c[val]=(c[val]?c[val] ORS:"")$0  ##Creating array c with index val, concatenating $0 onto its own value each time we get here.
  d[val]++                         ##Creating array d with index val, whose value increases by 1 each time we get here.
}                                  ##Closing BLOCK for match here.
END{                               ##Starting the END BLOCK section of this awk program here.
  for(i=1;i<=count;i++){           ##Starting a for loop from i=1 to the value of count here.
    if(d[b[i]]>1){                 ##Checking if the value of array d with index b[i] is greater than 1; if so, go inside the block.
      print c[b[i]]                ##Printing the value of array c whose index is b[i].
    }
  }
}
' Input_file                       ##Mentioning Input_file name here.
Another in awk:
$ awk -F\| '{                 # set delimiter
    n=$1                      # store number
    sub(/^[^|]*/,"",$0)       # remove number from string
    if($0 in a) {             # if $0 in a
        if(a[$0]==1)          # if $0 seen for the second time
            print b[$0] $0    # print first instance
        print n $0            # also print current
    }
    a[$0]++                   # increase match count for $0
    b[$0]=n                   # number stored to b and only needed once
}' file
Output for the sample data:
2|learning|Unix|Business|Team
4|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
Also, would this work:
$ sort -k 2 file | uniq -D -f 1
or -k2,5 or something similar? Nope, as the delimiter changed from space to pipe.
Two steps of improvement.
First step:
After
awk -F'|' '{print $3}' Report.txt | sort | uniq -d >dup.txt
# or
cut -d "|" -f3 < Report.txt | sort | uniq -d >dup.txt
you can use
grep -f <(sed 's/.*/^.*|.*|&|.*|/' dup.txt) Report.txt
# or without process substitution
sed 's/.*/^.*|.*|&|.*|/' dup.txt > dup.sed
grep -f dup.sed Report.txt
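For instance, if dup.txt held the duplicated $3 values Unix and Linux, dup.sed would contain one pattern per value:
^.*|.*|Unix|.*|
^.*|.*|Linux|.*|
Each pattern matches the value as a complete |-separated field, at least two fields in from the start and with another field following, which pins down the column-3 duplicates in this data (in grep's default BRE syntax the | characters are literal).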
Second step:
Use awk as given in other, better, answers.

While read line, awk $line and write to variable

I am trying to split a file into different smaller files depending on the value of the fifth field. A very nice way to do this was already suggested and also here.
However, I am trying to incorporate this into a .sh script for qsub, without much success.
The problem is in the section that specifies the file to which each line is written, i.e.,
f = "Alignments_" $5 ".sam"
print > f
Here I need to pass a variable declared earlier in the script, which specifies the directory where the file should be written. I need to do this with a variable which is built for each task when I send out the array job for multiple files.
So say $output_path = ./Sample1
I need to write something like
f = $output_path "/Alignments_" $5 ".sam" print > f
But it does not seem to like having a $variable that is not a $field belonging to awk. I don't even think it likes having two "strings" before and after the $5.
The error I get back is that it takes the first line of the file to be split (little.sam) and tries to name f after it, followed by /Alignments_" $5 ".sam" (those last three put together correctly). It says, naturally, that the name is too long.
How can I write this so it works?
Thanks!
awk -F '[:\t]' '   # read the list of numbers in Tile_Number_List
FNR == NR {
    num[$1]
    next
}
# process each line of the .BAM file
# any lines with an "unknown" $5 will be ignored
$5 in num {
    f = "Alignments_" $5 ".sam"
    print > f
} ' Tile_Number_List.txt little.sam
Update, after adding -v to awk and declaring the variable opath:
input=$1
outputBase=${input%.bam}
mkdir -v $outputBase\_TEST
newdir=$outputBase\_TEST
samtools view -h $input | awk 'NR >= 18' | awk -F '[\t:]' -v opath="$newdir" '
FNR == NR {
    num[$1]
    next
}
$5 in num {
    f = newdir"/Alignments_"$5".sam";
    print > f
} ' Tile_Number_List.txt -
mkdir: created directory `little_TEST'
awk: cmd. line:10: (FILENAME=- FNR=1) fatal: can't redirect to `/Alignments_1101.sam' (Permission denied)
awk variables are like C variables - just reference them by name to get their value, no need to stick a "$" in front of them like you do with shell variables:
awk -F '[:\t]' '   # read the list of numbers in Tile_Number_List
FNR == NR {
    num[$1]
    next
}
# process each line of the .BAM file
# any lines with an "unknown" $5 will be ignored
$5 in num {
    output_path = "./Sample1/"
    f = output_path "Alignments_" $5 ".sam"
    print > f
} ' Tile_Number_List.txt little.sam
To pass the value of a shell variable such as $output_path to awk, you need to use the -v option.
$ output_path=./Sample1/
$ awk -F '[:\t]' -v opath="$output_path" '
# read the list of numbers in Tile_Number_List
FNR == NR {
    num[$1]
    next
}
# process each line of the .BAM file
# any lines with an "unknown" $5 will be ignored
$5 in num {
    f = opath"Alignments_"$5".sam"
    print > f
} ' Tile_Number_List.txt little.sam
Also, you still have the error from your previous question left in your script.
EDIT:
The awk variable created with -v is opath but you use newdir; what you want is:
input=$1
outputBase=${input%.bam}
mkdir -v $outputBase\_TEST
newdir=$outputBase\_TEST
samtools view -h "$input" | awk -F '[\t:]' -v opath="$newdir" '
FNR == NR && NR >= 18 {
    num[$1]
    next
}
$5 in num {
    f = opath"/Alignments_"$5".sam" # <-- opath is the awk variable, not newdir
    print > f
}' Tile_Number_List.txt -
You should also move NR >= 18 into the second awk script.
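One general awk caveat that may be worth knowing here (not something the answer above raises): every print > f leaves its file open, so if $5 takes very many distinct values some awks will eventually fail with "too many open files". The usual workaround is to close each file after writing, switching to >> because a plain > would truncate the file on every reopen; a sketch:
$5 in num {
    f = opath "/Alignments_" $5 ".sam"
    print >> f   # append, since the file is reopened each time
    close(f)     # release the file descriptor; slower, but bounded
}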

How do I get the maximum value for a text in a UNIX file as shown below?

I have a unix file with the following contents.
$ cat myfile.txt
abc:1
abc:2
hello:3
hello:6
wonderful:1
hai:2
hai:4
hai:8
How do I get the max value for each text in the file above?
'abc' value 2
'hello' value 6
'hai' value 8
'wonderful' value 1
Based on the current example in your question, minus the first line of expected output:
awk -F':' '{arr[$1]=$2 ; next} END {for (i in arr) {print i, arr[i]} } ' inputfile
Your example input and expected output are very confusing.... The reason I posted this is to get feedback from the OP.
The above keeps the last value seen for each key, so it relies on the values arriving in increasing order, as in the sample. This variant also works with unsorted data (New):
sort -t: -k2n inputfile | awk -F':' '{arr[$1]=$2 ; next} END {for (i in arr) {print i, arr[i]} } '
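An alternative pure-awk sketch that tracks the maximum explicitly, so it works whether or not the input is sorted (the for (i in arr) output order is unspecified):
$ awk -F':' '!($1 in arr) || $2 > arr[$1] {arr[$1]=$2} END {for (i in arr) print i, arr[i]}' myfile.txt
abc 2
hello 6
hai 8
wonderful 1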
