Unix - print distinct field lengths

I would like to print the total number of times each length occurs in a field.
The column type is varChar and the strings in that field are either 9, 10, or 15 characters long. I want to know how many exist of each length.
My code:
awk -F'|'
'NR>1 $61!="" &&
if /length($61)=15/ then {a++}
elif /length($61)=10/ then {b++}
else /length($61)=9/ then {c++}
fi {print a ", " b ", " c}'
ERROR:
awk -F'|' 'NR>1 $61!="" && if /length($61)=15/ then {a++} elif /length($61)=10/ then {b++} else /length($61)=9/ then {c++} fi {print a ", " b ", " c}'
Syntax Error The source line is 1.
The error context is
NR>1 >>> $61!= <<<
awk: 0602-500 Quitting The source line is 1.
INPUT
A pipe-delimited .sqf file with 1.2 million rows; column 61 is varChar(15).

Based on your pseudo-code, I guess you want:
awk -F'|' -v OFS=', ' 'NR>1 {count[length($61)]++}
END {print count[15],count[10],count[9]}' file
You'll also have counts for any other lengths in the array, which doubles as a data quality check.
If you want 0 instead of an empty string for missing counts, change count[n] to count[n]+0, as suggested in the comments.
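If you would rather not hard-code the lengths, a small variation on the same idea (a sketch, reusing the NR>1 and $61!="" guards from your own code) prints every length that actually occurs:
awk -F'|' 'NR>1 && $61!="" {count[length($61)]++}
END {for (len in count) print len ": " count[len]}' file
Note that for (len in count) iterates in an unspecified order; pipe the output through sort -n if you want the lengths sorted.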

Related

Filter a file using a column value greater than a number (awk not working)

I am trying to filter a file using the values in the 8th column, keeping rows where that value is >= 10. I am using awk, but for some reason it doesn't work. Am I doing something wrong? What am I missing?
head df_TPM.csv
LQNS02136402.1_14821_3p,12680.71611,11346.42368,11686.28693,9067.797819,7429.467928,5551.660333,3246.956281
LQNS02000137.1_325_3p,8342.540984,5905.726173,4503.363041,3616.191278,3142.965662,3678.829299,6288.621969
LQNS02278148.1_40791_3p,4921.502758,2461.882836,429.824973,261.273116,132.0239748,68.6191655,70.8815385
LQNS02278089.1_34112_3p,4246.71324,4584.529009,8687.922574,7570.83746,5801.384953,2870.020801,734.3131465
LQNS02278075.1_32377_5p,4143.547577,4093.91803,10804.12323,10062.99269,7925.240969,4712.484455,1080.915573
LQNS02138569.1_14892_3p,2668.27957,2160.173542,837.2584183,233.2310273,84.62362925,64.6037895,23.456714
LQNS02278075.1_32324_5p,2331.608924,491.8868983,1527.312199,881.8683105,747.1474225,347.397634,74.07259175
LQNS02278075.1_32382_3p,2140.686095,2439.122353,10837.38169,12569.95295,9385.530878,6022.323737,1705.900969
LQNS02000138.1_777_5p,1819.275149,1762.009649,8565.396754,33280.90019,32176.07604,15849.37306,11872.99383
LQNS02278186.1_47223_3p,1687.843418,728.4288968,1328.048172,1306.424238,2102.27342,14.78892225,9.92647375
#Extract column 1 and 8 and print if $8>=10
cat df_TPM.csv |awk -F"," '{print $1, $8}' | grep -E "^LQN" | awk -F " " '$2>= 10'
LQNS02276925.1_23356_5p 5.352369
LQNS02277221.1_25158_5p 2.82778125
LQNS02277812.1_29775_3p 11.1090745
LQNS02278074.1_32154_3p 6.124789
LQNS02278139.1_39525_5p 22.6656355
#As you can see lots of numbers shouldn't be there (ex: 2.82778125 < 10)
Based on OP's comment: in case you don't want lines with LQN text at the start of the line, and want to check whether the 8th column is greater than or equal to 10, try the following (to keep the lines that do have LQN, remove the ! from the code below).
awk -F"," '$8+0 >= 10 && !/^LQN/{print $1, $8}' df_TPM.csv
Or, to get the total number of matching lines, the counting can be done in a single awk itself:
awk -F"," '$8+0 >= 10 && !/^LQN/{count++} END{print count}' df_TPM.csv
Explanation: a detailed explanation of the above.
awk -F"," ' ##Starting awk program from here.
$8+0 >= 10 && !/^LQN/{ ##Checking condition if 8th field is greater than 10 and NOT LQN.
count++ ##Increasing count with 1 here.
}
END{ ##Starting END block of this awk program from here.
print count ##Printing count value here.
}
' df_TPM.csv ##Mentioning Input_file name here.
To handle control-M (carriage return) characters in the awk code itself (assuming you don't want them in your Input_file), try:
awk -F"," '{gsub(/\r/,"")} $8 >= 10 && !/^LQN/{count++} END{print count}' df_TPM.csv
You need to tell awk to coerce $8 into a number by computing $8+0; otherwise, a trailing carriage return from Windows line endings can turn the test into a string comparison, which gives wrong results. Ensuring you have GNU awk installed is also recommended to avoid portability issues, and you can run dos2unix on the files beforehand to normalize the line endings.
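To check whether carriage returns are indeed present (a quick diagnostic, assuming a standard Unix toolset), cat -v renders them visibly as ^M:
cat -v df_TPM.csv | head -n 1    # a trailing ^M means CRLF (Windows) line endings
dos2unix df_TPM.csv              # normalizes the line endings in place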
The whole command can be written as
awk -F"," '/^LQN/ && $8+0 >= 10 {print $1, $8}' df_TPM.csv
NOTE: To only count these lines, use
awk -F, '/^LQN/ && $8+0 >= 10 {cnt++} END{print cnt}' df_TPM.csv
To find the lines that do not start with LQN, just add the negation operator ! before /^LQN/:
awk -F, '!/^LQN/ && $8+0 >= 10 {cnt++} END{print cnt}' df_TPM.csv
Details
-F"," (= -F,) - set the field separator to a comma
/^LQN/ && $8+0 >= 10 - if the current line starts with LQN and the eighth field is greater than or equal to 10
!/^LQN/ && $8+0 >= 10 - if the current line does not start with LQN and the eighth field is greater than or equal to 10
{print $1, $8} - print Fields 1 and 8
{cnt++} - increment the cnt variable
END{print cnt} - print the cnt variable once awk finishes processing all lines.

Is there a way to extract all the duplicate records based on a particular column?

I'm trying to extract all (and only) the duplicate values from a pipe-delimited file.
My data file has 800 thousand rows with multiple columns, and I'm particularly interested in column 3. So I need to get the duplicate values of column 3 and extract all the duplicate rows from that file.
I am, however, able to achieve this as shown below:
cat Report.txt | awk -F'|' '{print $3}' | sort | uniq -d >dup.txt
and I use the above in a loop as shown below:
while read dup
do
grep "$dup" Report.txt >>only_dup.txt
done <dup.txt
I've also tried the awk method
while read dup
do
awk -v a=$dup '$3 == a { print $0 }' Report.txt>>only_dup.txt
done <dup.txt
But, as I have a large number of records in the file, it's taking ages to complete. So I'm looking for a simple and quick alternative.
For example, I have data like this:
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
7|learning|Windows|Business|Requirements
8|learning|Mac|Business|Requirements
And my expected output, which doesn't include unique records:
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
4|learning|Unix|Business|Team
6|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements
This may be what you want:
$ awk -F'|' 'NR==FNR{cnt[$3]++; next} cnt[$3]>1' file file
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
4|learning|Unix|Business|Team
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
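The idiom here is a two-pass read: the same file is named twice on the command line, and NR==FNR is only true while the first copy is being read, so the first pass does nothing but count the $3 values; the second pass then prints every line whose $3 value occurred more than once.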
or if the file's too large for all the keys ($3 values) to fit in memory (which shouldn't be a problem with just the unique $3 values from 800,000 lines):
$ cat tst.awk
BEGIN { FS="|" }
{ currKey = $3 }
currKey == prevKey {
if ( !prevPrinted++ ) {
print prevRec
}
print
next
}
{
prevKey = currKey
prevRec = $0
prevPrinted = 0
}
$ sort -t'|' -k3,3 file | awk -f tst.awk
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements
1|learning|Unix|Business|Requirements
2|learning|Unix|Business|Team
4|learning|Unix|Business|Team
6|learning|Unix|Business|Team
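The reason this scales: once sort has grouped equal keys together, a duplicate can be detected by comparing each record's key with the previous one, so awk only ever holds one record in memory instead of a count for every distinct key.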
EDIT2: As per Ed sir's suggestion, fine-tuned my suggestion with more meaningful (IMO) array names.
awk '
match($0,/[^\|]*\|/){
val=substr($0,RSTART+RLENGTH)
if(!unique_check_count[val]++){
numbered_indexed_array[++count]=val
}
actual_valued_array[val]=(actual_valued_array[val]?actual_valued_array[val] ORS:"")$0
line_count_array[val]++
}
END{
for(i=1;i<=count;i++){
if(line_count_array[numbered_indexed_array[i]]>1){
print actual_valued_array[numbered_indexed_array[i]]
}
}
}
' Input_file
Edit by Ed Morton: FWIW here's how I'd have named the variables in the above code:
awk '
match($0,/[^\|]*\|/) {
key = substr($0,RSTART+RLENGTH)
if ( !numRecs[key]++ ) {
keys[++numKeys] = key
}
key2recs[key] = (key in key2recs ? key2recs[key] ORS : "") $0
}
END {
for ( keyNr=1; keyNr<=numKeys; keyNr++ ) {
key = keys[keyNr]
if ( numRecs[key]>1 ) {
print key2recs[key]
}
}
}
' Input_file
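A side benefit of the keys/numKeys bookkeeping: each key is recorded the first time it is seen, so the END loop prints the duplicate groups in the order their keys first appear in the input, rather than in awk's unspecified for (key in array) order.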
EDIT: Since OP changed the Input_file to pipe-delimited, changing the code a bit as follows to deal with the new Input_file (thanks to Ed Morton sir for pointing it out).
awk '
match($0,/[^\|]*\|/){
val=substr($0,RSTART+RLENGTH)
if(!a[val]++){
b[++count]=val
}
c[val]=(c[val]?c[val] ORS:"")$0
d[val]++
}
END{
for(i=1;i<=count;i++){
if(d[b[i]]>1){
print c[b[i]]
}
}
}
' Input_file
Could you please try the following; it will give output in the same sequence in which lines occur in the Input_file.
awk '
match($0,/[^ ]* /){
val=substr($0,RSTART+RLENGTH)
if(!a[val]++){
b[++count]=val
}
c[val]=(c[val]?c[val] ORS:"")$0
d[val]++
}
END{
for(i=1;i<=count;i++){
if(d[b[i]]>1){
print c[b[i]]
}
}
}
' Input_file
Output will be as follows.
2 learning Unix Business Team
4 learning Unix Business Team
6 learning Unix Business Team
3 learning Linux Business Requirements
5 learning Linux Business Requirements
Explanation of the above code:
awk ' ##Starting awk program here.
match($0,/[^ ]* /){ ##Using awk's match function to match everything up to and including the first space.
val=substr($0,RSTART+RLENGTH) ##Creating variable val, the substring from position RSTART+RLENGTH through the end of the line.
if(!a[val]++){ ##If array a has no entry yet for index val, enter this block (and increment the count for val).
b[++count]=val ##Creating array b whose index is the incremented variable count and whose value is val.
} ##Closing block for the if condition on array a here.
c[val]=(c[val]?c[val] ORS:"")$0 ##Creating array c, indexed by val, concatenating each matching $0 onto its value.
d[val]++ ##Creating array d, indexed by val, whose value increases by 1 each time this index is seen.
} ##Closing block for match here.
END{ ##Starting END block of this awk program here.
for(i=1;i<=count;i++){ ##Starting a for loop from i=1 to the value of count.
if(d[b[i]]>1){ ##If the value of array d at index b[i] is greater than 1, enter this block.
print c[b[i]] ##Printing the value of array c at index b[i].
}
}
}
' Input_file ##Mentioning Input_file name here.
Another in awk:
$ awk -F\| '{ # set delimiter
n=$1 # store number
sub(/^[^|]*/,"",$0) # remove number from string
if($0 in a) { # if $0 in a
if(a[$0]==1) # if $0 seen the second time
print b[$0] $0 # print first instance
print n $0 # also print current
}
a[$0]++ # increase match count for $0
b[$0]=n # number stored to b and only needed once
}' file
Output for the sample data:
2|learning|Unix|Business|Team
4|learning|Unix|Business|Team
3|learning|Linux|Business|Requirements
5|learning|Linux|Business|Requirements
6|learning|Unix|Business|Team
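Note that this answer, like the match()-based ones above, keys on everything after the first field rather than on column 3 alone; that is why line 1, whose remainder is unique, does not appear in this output.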
Also, would this work:
$ sort -k 2 file | uniq -D -f 1
(or -k2,5 or similar)? Nope: the delimiter changed from space to pipe, and uniq only skips whitespace-separated fields.
Two steps of improvement.
First step:
After
awk -F'|' '{print $3}' Report.txt | sort | uniq -d >dup.txt
# or
cut -d "|" -f3 < Report.txt | sort | uniq -d >dup.txt
you can use
grep -f <(sed 's/.*/^.*|.*|&|.*|/' dup.txt) Report.txt
# or without process substitution
sed 's/.*/^.*|.*|&|.*|/' dup.txt > dup.sed
grep -f dup.sed Report.txt
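As a worked example of what the sed generates: if dup.txt contains the single value Unix, the pattern written to dup.sed is ^.*|.*|Unix|.*| (the | is a literal character in grep's basic regular expressions). One caveat: since .* can itself match pipes, the pattern really matches Unix in column 3 or any later column, which is acceptable as long as the duplicated values are unlikely to recur in other columns.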
Second step:
Use awk as given in other, better, answers.

awk if statement with simple math

I'm just trying to do some basic calculations on a CSV file.
Data:
31590,Foo,70
28327,Bar,291
25155,Baz,583
24179,Food,694
28670,Spaz,67
22190,bawk,4431
29584,alfred,142
27698,brian,379
24372,peter,22
25064,weinberger,8
Here's my simple awk script:
#!/usr/local/bin/gawk -f
BEGIN { FPAT="([^,]*)|(\"[^\"]+\")"; OFS=","; OFMT="%.2f"; }
NR > 1
END { if ($3>1336) $4=$3*0.03; if ($3<1336) $4=$3*0.05;}1
Wrong output:
31590,Foo,70
28327,Bar,291
28327,Bar,291
25155,Baz,583
25155,Baz,583
24179,Food,694
24179,Food,694
28670,Spaz,67
28670,Spaz,67
22190,bawk,4431
22190,bawk,4431
29584,alfred,142
29584,alfred,142
27698,brian,379
27698,brian,379
24372,peter,22
24372,peter,22
25064,weinberger,8
25064,weinberger,8
Expected output:
31590,Foo,70,3.5
28327,Bar,291,14.55
25155,Baz,583,29.15
24179,Food,694,34.7
28670,Spaz,67,3.35
22190,bawk,4431,132.93
29584,alfred,142,7.1
27698,brian,379,18.95
24372,peter,22,1.1
25064,weinberger,8,0.4
The simple math is:
if field $3 > 1336, then $4 = $3 * .03, with the result in field $4
if field $3 < 1336, then $4 = $3 * .05, with the result in field $4
There's no need to force awk to recompile every record (by assigning to $4), just print the current record followed by the result of your calculation:
awk 'BEGIN{FS=OFS=","; OFMT="%.2f"} {print $0, $3*($3>1336?0.03:0.05)}' file
You shouldn't have anything in the END block:
BEGIN {
FS = OFS = ","
OFMT="%.2f"
}
{
if ($3 > 1336)
$4 = $3 * 0.03
else
$4 = $3 * 0.05
print
}
This results in
31590,Foo,70,3.5
28327,Bar,291,14.55
25155,Baz,583,29.15
24179,Food,694,34.7
28670,Spaz,67,3.35
22190,bawk,4431,132.93
29584,alfred,142,7.1
27698,brian,379,18.95
24372,peter,22,1.1
25064,weinberger,8,0.4
$ awk -F, -v OFS=, '{if ($3>1336) $4=$3*0.03; else $4=$3*0.05;} 1' data
31590,Foo,70,3.5
28327,Bar,291,14.55
25155,Baz,583,29.15
24179,Food,694,34.7
28670,Spaz,67,3.35
22190,bawk,4431,132.93
29584,alfred,142,7.1
27698,brian,379,18.95
24372,peter,22,1.1
25064,weinberger,8,0.4
Discussion
The END block is not executed at the end of each line but at the end of the whole file. Consequently, it is not helpful here.
The original code has two free-standing conditions, NR>1 and 1. The default action for each is to print the line; that is why, in the "wrong output," every line after the first appears twice.
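A minimal demonstration of that doubling (a contrived sketch, not taken from the question):
printf 'a\nb\nc\n' | awk 'NR>1
1'
This prints a once and b and c twice each: the first line matches only the bare 1 pattern, while every later line matches both patterns and is printed once per match.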
With awk:
awk -F, -v OFS=, '$3>1336?$4=$3*.03:$4=$3*.05' file
The conditional-expression ? action1 : action2 form is the much shorter ternary operator in awk.
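One subtlety worth noting: the ternary expression here also serves as the pattern, so a record is only printed when the computed value is non-zero, and a row with $3 equal to 0 would be silently dropped. A variant that avoids this (same logic, with the assignment moved into an action block and 1 as an always-true pattern) would be:
awk -F, -v OFS=, '{$4 = $3 * ($3>1336 ? .03 : .05)} 1' file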

Removing all occurrences of duplicates in a file on Unix

I would like to remove both occurrences of duplicates from a file based on a number of columns. Here is a toy example:
I would like to delete all records that are not unique across the first 4 columns. So applying the awk script to:
BLUE,CAR,RED,HOUSE,40
BLUE,CAR,BLACK,HOUSE,20
BLUE,CAR,GREEN,HOUSE,10
BLUE,TRUCK,RED,HOUSE,40
BLUE,TRUCK,GREEN,HOUSE,40
BLUE,TRUCK,RED,HOUSE,40
Should result in
BLUE,CAR,RED,HOUSE,40
BLUE,CAR,BLACK,HOUSE,20
BLUE,CAR,GREEN,HOUSE,10
BLUE,TRUCK,GREEN,HOUSE,40
I have tried:
awk -F"," -v OFS="," '{cnt[$1,$2,$3,$4]++} END {for (rec in cnt) if (cnt[rec] == 1) print rec}' ss.txt
Which successfully removes both dupes, but does not apply the correct delimiter or print the whole record, resulting in:
BLUECARREDHOUSE
BLUETRUCKGREENHOUSE
BLUECARBLACKHOUSE
BLUECARGREENHOUSE
I prefer an awk solution but any portable solution is welcomed.
Given that you want the whole record for the records that are unique in the first 4 columns, this would do the job:
awk -F',' '{cnt[$1,$2,$3,$4]++;line[$1,$2,$3,$4] = $0}
END {for (rec in cnt) if (cnt[rec] == 1) print line[rec]}' \
ss.txt
Save the lines as well as the counts, and you get back what you entered. This gets painful if you have gigabyte files; there are ways to save only the unique lines if you want. The following saves only the first version of each line and deletes an entry as soon as it is known to be non-unique. (Untested, but I believe it should work. Modified per comment from Ed Morton.)
awk -F',' '{ if (cnt[$1,$2,$3,$4]++ == 0)
line[$1,$2,$3,$4] = $0
else
delete line[$1,$2,$3,$4]
}
END {for (rec in line) print line[rec]}' \
ss.txt
If you only wanted the 4 key columns, then this just saves the 4 columns in the comma-separated format that you'll print:
awk -F',' '{cnt[$1,$2,$3,$4]++;line[$1,$2,$3,$4] = $1 "," $2 "," $3 "," $4}
END {for (rec in cnt) if (cnt[rec] == 1) print line[rec]}' \
ss.txt
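Since any portable solution is welcome, here is a sketch of a two-pass variant in the spirit of the NR==FNR idiom (the file is read twice, so only the counts are held in memory):
awk -F',' 'NR==FNR {cnt[$1,$2,$3,$4]++; next}
cnt[$1,$2,$3,$4] == 1' ss.txt ss.txt
Unlike the for (rec in cnt) loops above, whose iteration order is unspecified, this also preserves the original record order.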

How do I get the maximum value for a text in a UNIX file as shown below?

I have a unix file with the following contents.
$ cat myfile.txt
abc:1
abc:2
hello:3
hello:6
wonderful:1
hai:2
hai:4
hai:8
How do I get the max value for each text in the file above?
'abc' value 2
'hello' value 6
'hai' value 8
'wonderful' value 1
Based on the current example in your question, minus the first line of expected output:
awk -F':' '{arr[$1]=$2 ; next} END {for (i in arr) {print i, arr[i]} } ' inputfile
Your example input and expected output are very confusing... The reason I posted this is to get feedback from the OP.
This handles unsorted data (and also works if the data is already sorted): sorting on the numeric value first means the last value stored for each key is the maximum:
sort -t: -k2n inputfile | awk -F':' '{arr[$1]=$2 ; next} END {for (i in arr) {print i, arr[i]} } '
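Alternatively, a sketch that tracks the maximum directly, so no sort is needed (the +0 forces a numeric comparison):
awk -F':' '!($1 in max) || $2+0 > max[$1]+0 {max[$1] = $2}
END {for (k in max) print k, max[k]}' inputfile
As with the other versions, for (k in max) emits the keys in an unspecified order.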
