LINUX: How to AWK to find maximums in every unique line

I have a file with these records:
myFile.txt:
firstname lastname note
Ben Quick 12
Fred Free 10
Mark Marked 18
Mark Marked 20
Ben Quick 8
and I want to print every unique person with their maximum note.
The result should look like this:
firstname lastname maxNote
Ben Quick 12
Fred Free 10
Mark Marked 20
So I have this awk script:
#!/bin/awk -f
BEGIN {
print "firstname lastname maxNote"
}
NR > 1 {
list[$1," ",$2]++
}
END {
for(l in list)
print l
}
It prints only the unique firstname + lastname pairs.
My question is: how can I sort them by note and print the whole line for each person?
Your help is appreciated!

With GNU awk:
awk 'NR==1{print; next}
{
note=$NF # save note from last field
NF-- # remove last field from $0
name=$0
if(a[name]<note){a[name]=note}
}
END{
for(i in a){print i,a[i]}
}' file
Output:
firstname lastname note
Fred Free 10
Ben Quick 12
Mark Marked 20
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR

With your shown samples, please try the following. It was written and tested with GNU awk and should work with any awk.
awk '
FNR==1{
print
next
}
match($0,/[^0-9]*/){
val=substr($0,RSTART,RLENGTH)
sub(/[[:space:]]+$/,"",val)
}
{
arr[val]=(arr[val]>$NF?arr[val]:$NF)
}
END{
for(key in arr){
print key,arr[key]
}
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==1{ ##Checking condition if this is first line.
print ##printing current line here.
next ##next will skip all further statements from here.
}
match($0,/[^0-9]*/){ ##Using match function to match everything up to the first digit in the line.
val=substr($0,RSTART,RLENGTH) ##Creating val, the substring of the current line given by RSTART and RLENGTH.
sub(/[[:space:]]+$/,"",val) ##Removing trailing spaces from val.
}
{
arr[val]=(arr[val]>$NF?arr[val]:$NF) ##Creating array with index of val and keeping only the maximum value by comparing with the last field of the current line.
}
END{ ##Starting END block of this program from here.
for(key in arr){ ##Traversing the elements of arr now.
print key,arr[key] ##Printing key and its value here.
}
}
' Input_file ##Mentioning Input_file name here.

$ awk 'NR==1 {print; next}
{k=$1 FS $2}
max[k]<$3 {max[k]=$3}
END {for(k in max) print k, max[k]}' file
firstname lastname note
Fred Free 10
Ben Quick 12
Mark Marked 20
Another approach:
$ sed 1q file; sed 1d file | sort -k1,2 -k3nr | awk '!a[$1,$2]++'
firstname lastname note
Ben Quick 12
Fred Free 10
Mark Marked 20

Using any awk, sort, and cut in any shell on every Unix box:
$ awk '{print (NR>1), $0}' file | sort -k1,1n -k4,4rn | cut -d' ' -f2- | awk '!seen[$1,$2]++'
firstname lastname note
Mark Marked 20
Ben Quick 12
Fred Free 10
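The awk answers above print the people in whatever order `for (k in array)` happens to return, and comparing against an unset array element treats it as 0, so a person whose notes are all zero or negative would never be stored. A minimal sketch that addresses both points, assuming the same three-column layout as the question (the temporary file and the array names `max` and `order` are only illustrative):

```shell
# Sample data from the question, written to a temporary file.
file=$(mktemp)
cat > "$file" <<'EOF'
firstname lastname note
Ben Quick 12
Fred Free 10
Mark Marked 18
Mark Marked 20
Ben Quick 8
EOF

result=$(awk '
NR == 1 { print; next }                 # pass the header through unchanged
{
    key = $1 FS $2                      # firstname + lastname identifies a person
    if (!(key in max)) {                # first time this person is seen
        order[++n] = key                # remember first-seen order
        max[key] = $3
    } else if ($3 + 0 > max[key] + 0) { # force a numeric comparison
        max[key] = $3
    }
}
END {
    for (i = 1; i <= n; i++)            # print in first-seen order
        print order[i], max[order[i]]
}' "$file")

printf '%s\n' "$result"
rm -f "$file"
```

With the sample data this prints the header followed by Ben Quick 12, Fred Free 10 and Mark Marked 20, which is the order shown in the question.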

Related

if rows are otherwise-identical keep the one with higher value in one field

I have a file that looks like this:
cat f1.csv:
col1,col2,col3
AK136742,BC051226,996
AK161599,Gm15417,4490
AK161599,Gm15417,6915
AK161599,Zbtb7b,1339
AK161599,Zbtb7b,1475
AK161599,Zbtb7b,1514
What I want to do is keep only one of the otherwise-duplicated rows: the one with the greatest number in col3. So if col1 and col2 are the same, keep the row that has the greatest number in col3.
So the desired output should be:
col1,col2,col3
AK136742,BC051226,996
AK161599,Gm15417,6915
AK161599,Zbtb7b,1514
I used the command below but it does not solve the problem:
cat f1.csv | sort -rnk3 | awk '!x[$3]++'
Any help is appreciated - thanks!

With your shown samples, please try the following.
awk '
BEGIN{
FS=OFS=","
}
{ ind = $1 FS $2 }
FNR==1{
print
next
}
{
arr[ind]=(arr[ind]>$NF?arr[ind]:$NF)
}
END{
for(i in arr){
print i,arr[i]
}
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this program from here.
FS=OFS="," ##Setting FS, OFS as comma here.
}
{ ind = $1 FS $2 } ##Setting ind as 1st and 2nd field value here.
FNR==1{ ##Checking if its first line.
print ##Then print it.
next ##next will skip all further statements from here.
}
{
arr[ind]=(arr[ind]>$NF?arr[ind]:$NF) ##Creating arr with index of ind and keeping only higher value after each line comparison of last field.
}
END{ ##Starting END block of this program from here.
for(i in arr){ ##Starting a for loop here.
print i,arr[i] ##Printing index and array arr value here.
}
}
' Input_file ##Mentioning Input_file name here.

$ head -n 1 f1.csv; { tail -n +2 f1.csv | sort -t, -k1,2 -k3rn | awk -F, '!seen[$1,$2]++'; }
col1,col2,col3
AK136742,BC051226,996
AK161599,Gm15417,6915
AK161599,Zbtb7b,1514
or to avoid naming the input file twice (e.g. so it'll work if the input is a pipe):
$ awk '{print (NR>1) "," $0}' f1.csv | sort -t, -k1,1n -k2,3 -k4,4rn | cut -d',' -f2- | awk -F, '!seen[$1,$2]++'
col1,col2,col3
AK136742,BC051226,996
AK161599,Gm15417,6915
AK161599,Zbtb7b,1514

The answers provided seem a little complicated to me. Here's an answer all in awk:
#! /usr/bin/awk -f
NR == 1 {
heading = $0
next
}
{
key = $1 "," $2
if( values[key] < $3 ) {
values[key] = $3
}
}
END {
print heading
for( k in values ) {
print k "," values[k] | "sort -t, -k1,2"
}
}
$ ./max.awk -F, max.dat
col1,col2,col3
AK136742,BC051226,996
AK161599,Gm15417,6915
AK161599,Zbtb7b,1514

Using sort, you need
sort -t, -k3,3nr file.csv | sort -t, -su -k1,2
The first sort sorts the input numerically by the 3rd column in descending order. The second sort is stable (-s, which not all sort implementations support) and uniques the output by the first two columns, thus leaving the maximum for each combination.
I ignored the header line.
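If the header line has to survive, one option is to split it off before sorting. The following is only a sketch: it assumes a `sort` that supports `-s` (GNU and BSD sort do) and that keeps the first line of each run of equal keys under `-u`; the temporary file stands in for file.csv.

```shell
# Sample data from the question, written to a temporary file.
file=$(mktemp)
cat > "$file" <<'EOF'
col1,col2,col3
AK136742,BC051226,996
AK161599,Gm15417,4490
AK161599,Gm15417,6915
AK161599,Zbtb7b,1339
AK161599,Zbtb7b,1475
AK161599,Zbtb7b,1514
EOF

result=$(
    head -n 1 "$file"               # the header goes straight through
    tail -n +2 "$file" |
        sort -t, -k3,3nr |          # largest col3 first
        sort -t, -s -u -k1,2        # stable: keep the first row per col1,col2
)

printf '%s\n' "$result"
rm -f "$file"
```

Because the file is opened twice, this form needs a real file rather than a pipe.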

Is there way to extract all the duplicate records based on a particular column?

I'm trying to extract all (only) the duplicate values from a pipe delimited file.
My data file has 800 thousand rows with multiple columns, and I'm particularly interested in column 3. So I need to get the duplicate values of column 3 and extract all the duplicate rows from that file.
However, I'm able to achieve this as shown below:
cat Report.txt | awk -F'|' '{print $3}' | sort | uniq -d >dup.txt
and I feed the above into a loop as shown below:
while read dup
do
grep "$dup" Report.txt >>only_dup.txt
done <dup.txt
I've also tried the awk method
while read dup
do
awk -v a=$dup '$3 == a { print $0 }' Report.txt>>only_dup.txt
done <dup.txt
But, as I have large number of records in the file, it's taking ages to complete. So I'm looking for an easy and quick alternative.
For example, I have data like this:
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
7|learning|Windows|Business|Requirements
8|learning|Mac|Business|Requirements
And my expected output which doesn't include unique records:
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
4|learning|Unix|Business|Team
6|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements

This may be what you want:
$ awk -F'|' 'NR==FNR{cnt[$3]++; next} cnt[$3]>1' file file
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
or if the file's too large for all the keys ($3 values) to fit in memory (which shouldn't be a problem with just the unique $3 values from 800,000 lines):
$ cat tst.awk
BEGIN { FS="|" }
{ currKey = $3 }
currKey == prevKey {
if ( !prevPrinted++ ) {
print prevRec
}
print
next
}
{
prevKey = currKey
prevRec = $0
prevPrinted = 0
}
$ sort -t'|' -k3,3 file | awk -f tst.awk
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
4|learning|Unix|Business|Team
6|learning|Unix|Business|Team
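When the input arrives on a pipe and cannot be read twice, a single pass that buffers the records is another possibility. This is a rough sketch, assuming the whole input fits in memory (unlike the sort-based variant above); the array names `line`, `key` and `count` are illustrative.

```shell
# Sample data from the question, written to a temporary file.
file=$(mktemp)
cat > "$file" <<'EOF'
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
7|learning|Windows|Business|Requirements
8|learning|Mac|Business|Requirements
EOF

result=$(awk -F'|' '
{
    line[NR] = $0                  # remember every record in input order
    key[NR]  = $3                  # and its column-3 value
    count[$3]++                    # how many times each value occurs
}
END {
    for (i = 1; i <= NR; i++)
        if (count[key[i]] > 1)     # value occurred more than once
            print line[i]
}' "$file")

printf '%s\n' "$result"
rm -f "$file"
```

For the sample data this prints records 1 to 6 in their original order and drops the Windows and Mac rows.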

EDIT2: As per Ed sir's suggestion, I fine-tuned my solution with more meaningful (IMO) array names.
awk '
match($0,/[^\|]*\|/){
val=substr($0,RSTART+RLENGTH)
if(!unique_check_count[val]++){
numbered_indexed_array[++count]=val
}
actual_valued_array[val]=(actual_valued_array[val]?actual_valued_array[val] ORS:"")$0
line_count_array[val]++
}
END{
for(i=1;i<=count;i++){
if(line_count_array[numbered_indexed_array[i]]>1){
print actual_valued_array[numbered_indexed_array[i]]
}
}
}
' Input_file
Edit by Ed Morton: FWIW here's how I'd have named the variables in the above code:
awk '
match($0,/[^\|]*\|/) {
key = substr($0,RSTART+RLENGTH)
if ( !numRecs[key]++ ) {
keys[++numKeys] = key
}
key2recs[key] = (key in key2recs ? key2recs[key] ORS : "") $0
}
END {
for ( keyNr=1; keyNr<=numKeys; keyNr++ ) {
key = keys[keyNr]
if ( numRecs[key]>1 ) {
print key2recs[key]
}
}
}
' Input_file
EDIT: Since the OP changed Input_file to be |-delimited, I changed the code a bit as follows to deal with the new Input_file (thanks to Ed Morton sir for pointing it out).
awk '
match($0,/[^\|]*\|/){
val=substr($0,RSTART+RLENGTH)
if(!a[val]++){
b[++count]=val
}
c[val]=(c[val]?c[val] ORS:"")$0
d[val]++
}
END{
for(i=1;i<=count;i++){
if(d[b[i]]>1){
print c[b[i]]
}
}
}
' Input_file
Could you please try the following? It prints the output in the same order in which the lines occur in Input_file.
awk '
match($0,/[^ ]* /){
val=substr($0,RSTART+RLENGTH)
if(!a[val]++){
b[++count]=val
}
c[val]=(c[val]?c[val] ORS:"")$0
d[val]++
}
END{
for(i=1;i<=count;i++){
if(d[b[i]]>1){
print c[b[i]]
}
}
}
' Input_file
Output will be as follows.
2 learning Unix Business Team
4 learning Unix Business Team
6 learning Unix Business Team
3 learning Linux Business Requirements
5 learning Linux Business Requirements
Explanation for above code:
awk ' ##Starting awk program here.
match($0,/[^ ]* /){ ##Using match function of awk which matches regex till first space is coming.
val=substr($0,RSTART+RLENGTH) ##Creating variable val whose value is sub-string is from starting point of RSTART+RLENGTH value to till end of line.
if(!a[val]++){ ##If val has not been seen before (a[val] is empty), enter the block; a[val] is incremented either way.
b[++count]=val ##Creating array b whose index is increment value of variable count and value is val variable.
} ##Closing BLOCK for if condition of array a here.
c[val]=(c[val]?c[val] ORS:"")$0 ##Appending the current line to c[val], separated by ORS from any lines already stored there.
d[val]++ ##Counting in d[val] how many lines share this val.
} ##Closing BLOCK for match here.
END{ ##Starting END BLOCK section for this awk program here.
for(i=1;i<=count;i++){ ##Starting for loop from i=1 to till value of count here.
if(d[b[i]]>1){ ##Checking if value of array d with index b[i] is greater than 1 then go inside block.
print c[b[i]] ##Printing value of array c whose index is b[i].
}
}
}
' Input_file ##Mentioning Input_file name here.

Another in awk:
$ awk -F\| '{ # set delimiter
n=$1 # store number
sub(/^[^|]*/,"",$0) # remove number from string
if($0 in a) { # if $0 in a
if(a[$0]==1) # if $0 seen the second time
print b[$0] $0 # print first instance
print n $0 # also print current
}
a[$0]++ # increase match count for $0
b[$0]=n # number stored to b and only needed once
}' file
Output for the sample data:
2|learning|Unix|Business|Team
4|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
Also, would this work:
$ sort -k 2 file | uniq -D -f 1
or -k2,5 or something similar? No: it no longer works now that the delimiter has changed from space to pipe.

Two steps of improvement.
First step:
After
awk -F'|' '{print $3}' Report.txt | sort | uniq -d >dup.txt
# or
cut -d "|" -f3 < Report.txt | sort | uniq -d >dup.txt
you can use
grep -f <(sed 's/.*/^.*|.*|&|.*|/' dup.txt) Report.txt
# or without process substitution
sed 's/.*/^.*|.*|&|.*|/' dup.txt > dup.sed
grep -f dup.sed Report.txt
Second step:
Use awk as given in other, better, answers.

How to find the value of column on the basis of matching column from two files?

file 1 : emp.txt
7839|KING|PRESIDENT||17-Nov-81|5000||10
7698|BLAKE|MANAGER|7839|01-May-81|2850||30
7782|CLARK|MANAGER|7839|09-Jun-81|2450||10
7566|JONES|MANAGER|7839|02-Apr-81|2975||20
7788|SCOTT|ANALYST|7566|19-Apr-87|3000||20
7902|FORD|ANALYST|7566|03-Dec-81|3000||20
7369|SMITH|CLERK|7902|17-Dec-80|800||20
7499|ALLEN|SALESMAN|7698|20-Feb-81|1600|300|30
7521|WARD|SALESMAN|7698|22-Feb-81|1250|500|30
7654|MARTIN|SALESMAN|7698|28-Sep-81|1250|1400|30
file 2 : dept.txt
10|ACCOUNTING|NEW YORK
20|RESEARCH|DALLAS
30|SALES|CHICAGO
40|OPERATIONS|BOSTON
I want to print below output :
7839|KING|PRESIDENT||17-Nov-81|5000||10|NEW YORK
7698|BLAKE|MANAGER|7839|01-May-81|2850||30|CHICAGO
7782|CLARK|MANAGER|7839|09-Jun-81|2450||10|NEW YORK
7566|JONES|MANAGER|7839|02-Apr-81|2975||20|DALLAS
7788|SCOTT|ANALYST|7566|19-Apr-87|3000||20|DALLAS
7902|FORD|ANALYST|7566|03-Dec-81|3000||20|DALLAS
7369|SMITH|CLERK|7902|17-Dec-80|800||20|DALLAS
7499|ALLEN|SALESMAN|7698|20-Feb-81|1600|300|30|CHICAGO
7521|WARD|SALESMAN|7698|22-Feb-81|1250|500|30|CHICAGO
7654|MARTIN|SALESMAN|7698|28-Sep-81|1250|1400|30|CHICAGO
I tried below awk statement, but it is not printing anything -
awk -F'|' 'NR==FNR {val[$1]=$3; next} $8 in val {print $1,$2,$3,$4,$5,$6,$7,$8,val[$1]}' OFS="|" dept.txt emp.txt
Any Suggestion ??

Use $NF, which is the value of the last field:
➜ awk '
BEGIN { FS = OFS = "|" }
NR==FNR { location[$1] = $NF; next }
{ print (location[$NF] ? $0 OFS location[$NF] : $0) }
' dept.txt emp.txt
7839|KING|PRESIDENT||17-Nov-81|5000||10|NEW YORK
7698|BLAKE|MANAGER|7839|01-May-81|2850||30|CHICAGO
7782|CLARK|MANAGER|7839|09-Jun-81|2450||10|NEW YORK
7566|JONES|MANAGER|7839|02-Apr-81|2975||20|DALLAS
7788|SCOTT|ANALYST|7566|19-Apr-87|3000||20|DALLAS
7902|FORD|ANALYST|7566|03-Dec-81|3000||20|DALLAS
7369|SMITH|CLERK|7902|17-Dec-80|800||20|DALLAS
7499|ALLEN|SALESMAN|7698|20-Feb-81|1600|300|30|CHICAGO
7521|WARD|SALESMAN|7698|22-Feb-81|1250|500|30|CHICAGO
7654|MARTIN|SALESMAN|7698|28-Sep-81|1250|1400|30|CHICAGO
This assumes you still want the entire line regardless of whether the dept city index exists. If not, then please update your question to reflect common use cases and expected output.

The problem is that there are two spaces in front of the matching column. Since you are using '|' as your field separator, each row of the second file is divided as follows (using the first row as an example):
10|ACCOUNTING|NEW YORK
$1=" 10"
$2="ACCOUNTING"
$3="NEW YORK"
So you are mapping ACCOUNTING to " 10" rather than "10". That's why you don't get any match in the second file (assuming you wanted to use val[$8] rather than val[$1] in the print command).
Do the following. This will fix your problem.
awk -F'|' 'NR==FNR {sub(/^ +/,"",$1); val[$1]=$3; next} $8 in val {print $1,$2,$3,$4,$5,$6,$7,$8,val[$8]}' OFS="|" dept.txt emp.txt
Output:
7839|KING|PRESIDENT||17-Nov-81|5000||10|NEW YORK
7698|BLAKE|MANAGER|7839|01-May-81|2850||30|CHICAGO
7782|CLARK|MANAGER|7839|09-Jun-81|2450||10|NEW YORK
7566|JONES|MANAGER|7839|02-Apr-81|2975||20|DALLAS
7788|SCOTT|ANALYST|7566|19-Apr-87|3000||20|DALLAS
7902|FORD|ANALYST|7566|03-Dec-81|3000||20|DALLAS
7369|SMITH|CLERK|7902|17-Dec-80|800||20|DALLAS
7499|ALLEN|SALESMAN|7698|20-Feb-81|1600|300|30|CHICAGO
7521|WARD|SALESMAN|7698|22-Feb-81|1250|500|30|CHICAGO
7654|MARTIN|SALESMAN|7698|28-Sep-81|1250|1400|30|CHICAGO

In your code, you should index the array by the column that holds the id you used as the key when storing each value. In your case, column 8 of emp.txt holds the department id shared with dept.txt:
awk -F\| 'NR==FNR {val[$1]=$3; next} {print $1, $2, $3, $4, $5, $6, $7, $8, val[$8]};' OFS="|" dept.txt emp.txt
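The answers above differ on whether stray spaces around the department id are the cause. A sketch that tolerates them on either side and uses `in` so that unknown ids do not create empty array entries is shown below. The `trim` helper, the `city` array and the employee row with department 99 are hypothetical additions for illustration; a row with no matching department is printed with an empty last field.

```shell
dept=$(mktemp)
emp=$(mktemp)
# First dept row deliberately has leading spaces before the id.
cat > "$dept" <<'EOF'
  10|ACCOUNTING|NEW YORK
20|RESEARCH|DALLAS
EOF
# The last employee row (dept 99) is made up to show the no-match case.
cat > "$emp" <<'EOF'
7839|KING|PRESIDENT||17-Nov-81|5000||10
7566|JONES|MANAGER|7839|02-Apr-81|2975||20
7000|NOBODY|CLERK||01-Jan-82|100||99
EOF

result=$(awk '
BEGIN { FS = OFS = "|" }
# Strip leading and trailing blanks from a copy of s.
function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
NR == FNR { city[trim($1)] = $3; next }      # first file: dept id -> city
{
    id = trim($NF)                           # dept id is the last field of emp
    print $0, (id in city ? city[id] : "")   # empty city if the id is unknown
}' "$dept" "$emp")

printf '%s\n' "$result"
rm -f "$dept" "$emp"
```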

How do I get the maximum value for a text in a UNIX file as shown below?

I have a unix file with the following contents.
$cat myfile.txt
abc:1
abc:2
hello:3
hello:6
wonderful:1
hai:2
hai:4
hai:8
How do I get the max value given for each text in the file above.
'abc' value 2
'hello' value 6
'hai' value 8
'wonderful' value 1

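Based on the current example in your question, minus the first line of expected output (this keeps the last value seen for each key, which is the maximum only because the sample lists each key's values in ascending order):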
awk -F':' '{arr[$1]=$2 ; next} END {for (i in arr) {print i, arr[i]} } ' inputfile
Your example input and expected output are very confusing. I posted this to get feedback from the OP.
This version sorts numerically by value first, so it also works when the data is unsorted (new):
sort -t: -k2n inputfile | awk -F':' '{arr[$1]=$2 ; next} END {for (i in arr) {print i, arr[i]} } '
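A variant that compares values explicitly, and so does not depend on the input order at all, might look like the sketch below. The sample is the question's data shuffled on purpose; the `max` array name and the final `sort` (used only to make the output order predictable) are illustrative choices.

```shell
# The question's data, deliberately out of order.
file=$(mktemp)
cat > "$file" <<'EOF'
hello:6
abc:1
hai:8
abc:2
hello:3
wonderful:1
hai:2
hai:4
EOF

result=$(awk -F: '
# Store the value if the key is new or the value beats the stored maximum.
!($1 in max) || $2 + 0 > max[$1] { max[$1] = $2 + 0 }
END { for (k in max) print k, max[k] }' "$file" | LC_ALL=C sort)

printf '%s\n' "$result"
rm -f "$file"
```

Testing `$1 in max` first means the comparison also behaves for keys whose values are all zero or negative.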

How to print the last but one record of a file using awk?

How to print the last but one record of a file using awk?

Something like:
awk '{ prev_line=this_line; this_line=$0 } END { print prev_line }' < file
Essentially, keep a record of the line before the current one, until you hit the end of the file, then print the previous line.
edit to respond to comment:
To just extract the second field in the penultimate line:
awk '{ prev_f2=this_f2; this_f2=$2 } END { print prev_f2 }' < file

You can do it with awk but you may find that:
tail -2 inputfile | head -1
will be a quicker solution - it grabs the last two lines of the complete set then the first of those two.
The following transcript shows how this works:
pax$ echo '1
> 2
> 3
> 4
> 5' | tail -2 | head -1
4
If you must use awk, you can use:
pax$ echo '1
2
3
4
5' | awk '{last = this; this = $0} END {print last}'
4
It works by keeping the last and current line in variables last and this and just printing out last when the file is finished.
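The same idea generalizes from "last but one" to "n-th from the end" by keeping the last n lines in a small ring buffer. This is a sketch only; the variable `n` and the array `buf` are names chosen for illustration, and nothing is printed when the input has fewer than n lines.

```shell
result=$(printf '%s\n' 1 2 3 4 5 |
    awk -v n=2 '
    { buf[NR % n] = $0 }                          # ring buffer of the last n lines
    END { if (NR >= n) print buf[(NR + 1) % n] }  # slot of line NR-n+1
    ')

printf '%s\n' "$result"
```

With n=2 this prints 4 for the five-line input used in the transcript above; n=1 gives the last line.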
