I have a file whose first column contains IDs and all other columns contain numerical values for which I want to compute z-scores. I know there are lots of posts about calculating z-scores with Python and R, but I am not familiar with Python and I do not want to use R.
I already have a way to calculate the mean and standard deviation of all my columns (I have 30 columns), but I need to calculate the z-scores for each column and I am not sure how to do it, or whether it is possible using awk.
My data is tab delimited, for example:
ID W A
BR_400 1005.98 19.35
FG_50 434.89 2.987
DS_195_At 39.86 0.567
ES_23_Md 41.45 19.55
My command to calculate mean and std for all columns:
cat input.txt | awk '{for(i=1;i<=NF;i++) {sum[i] += $i; sumsq[i] += ($i)^2}} END {for (i=1;i<=NF;i++) {printf "%f %f \n", sum[i]/NR, sqrt((sumsq[i]-sum[i]^2/NR)/NR)}}' > mean_std.txt
The z-scores formula:
z = (x - mean) / std
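For illustration, working one value by hand from the four data rows above: for column W the mean is 380.545 and the sample standard deviation (dividing by n-1) is roughly 456.50, so the first row gives z = (1005.98 - 380.545) / 456.50 ≈ 1.3701, which matches the expected output below. (Note that the command above divides by NR, i.e. the population formula; the expected values appear to assume n-1.)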
Any suggestions?
The expected output has only z-scores for each column:
ID W zscore A zscore
BR_400 1.370068724 0.852212191
FG_50 0.119047359 -0.743935933
DS_195_At -0.746299556 -0.979997685
ES_23_Md -0.742816526 0.871721427
You may use this awk:
awk 'BEGIN {
    FS = OFS = "\t"
}
NR == 1 {                 # header: print it once (first pass) and skip it
    print
    next
}
NR == FNR {               # first pass: accumulate sums per column
    ++n
    for (i = 2; i <= NF; i++) {
        sum[i] += $i
        sumsq[i] += ($i)^2
    }
    next
}
FNR == 1 {                # second pass starts: compute mean and std values here
    for (i = 2; i <= NF; i++) {
        mean[i] = sum[i]/n
        std[i] = sqrt( (sumsq[i] - sum[i]^2/n) / (n-1) )
    }
    next
}
{                         # second pass: print the z-scores
    printf "%s", $1 OFS
    for (i = 2; i <= NF; i++)
        printf "%f%s", ($i - mean[i]) / std[i], (i < NF ? OFS : ORS)
}' file file | column -t
ID W A
BR_400 1.370069 0.852212
FG_50 0.119047 -0.743936
DS_195_At -0.746300 -0.979998
ES_23_Md -0.742817 0.871721
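If you would rather read the file only once, a variation that buffers the rows in memory could look roughly like this (a sketch only; same formula, using the sample standard deviation):
awk 'BEGIN { FS = OFS = "\t" }
NR == 1 { print; next }                       # pass the header through
{
    id[++n] = $1                              # remember the ID for this row
    for (i = 2; i <= NF; i++) {
        val[n, i] = $i                        # buffer the value
        sum[i]   += $i
        sumsq[i] += ($i)^2
    }
    nf = NF
}
END {
    for (i = 2; i <= nf; i++) {
        mean[i] = sum[i] / n
        std[i]  = sqrt( (sumsq[i] - sum[i]^2/n) / (n-1) )
    }
    for (r = 1; r <= n; r++) {
        printf "%s", id[r]
        for (i = 2; i <= nf; i++)
            printf "%s%f", OFS, (val[r, i] - mean[i]) / std[i]
        print ""
    }
}' input.txt | column -t
The two-pass version above avoids storing the data rows, which can matter for very large files.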
I am using awk to compute Mean Fractional Bias from a data file. How can I make the data points variables that I can call in my equation?
Input.....
col1 col2
row #1 Yavg: 14.87954
row #2 Xavg: 20.83804
row #3 Ystd: 7.886613
row #4 Xstd: 8.628519
I am looking to feed into this equation....
MFB = .5 * (Yavg-Xavg)/[(Yavg+Xavg)/2]
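Plugging in the sample values above by hand (just to show the expected result): MFB = 0.5 * (14.87954 - 20.83804) / ((14.87954 + 20.83804) / 2) ≈ -0.1668.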
output....
col1 col2
row #1 Yavg: 14.87954
row #2 Xavg: 20.83804
row #3 Ystd: 7.886613
row #4 Xstd: 8.628519
row #5 MFB: (computed value)
Currently I am trying to use the following code to do this, but it is not working:
var= 'linear_reg-County119-O3-2004-Winter2013-2018XYstats.out.out'
val1=$(awk -F, OFS=":" "NR==2{print $2; exit}" <$var)
val2=$(awk -F, OFS=":" "NR==1{print $2; exit}" <$var)
#MFB = .5*((val2-val1)/((val2+val1)/2))
awk '{ print "MFB :" .5*((val2-val1)/((val2+val1)/2))}' >> linear_regCounty119-O3-2004-Winter2013-2018XYstats-wMFB.out
Try running awk -f mfb.awk input.txt, where mfb.awk is:
BEGIN { FS = OFS = ": " } # set the separators
{ v[$1] = $2; print } # store each value in array "v", keyed by its label, and echo the line
END {
MFB = 0.5 * (v["Yavg"] - v["Xavg"]) / ((v["Yavg"] + v["Xavg"]) / 2)
print "MFB", MFB
}
input.txt:
Yavg: 14.87954
Xavg: 20.83804
Ystd: 7.886613
Xstd: 8.628519
Output:
Yavg: 14.87954
Xavg: 20.83804
Ystd: 7.886613
Xstd: 8.628519
MFB: -0.166823
Alternatively, mfb.awk can be the following, resembling your original code:
BEGIN { FS = OFS = ": " }
{ print }
NR == 1 { Yavg = $2 } NR == 2 { Xavg = $2 }
END {
MFB = 0.5 * (Yavg - Xavg) / ((Yavg + Xavg) / 2)
print "MFB", MFB
}
Note that you don't usually need to toss variables back and forth between the shell and awk (at least when you are dealing with a single input file).
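For completeness, if you ever do need to hand shell values to awk, the -v option is the usual route (a minimal sketch with made-up variable names, using the values from input.txt above):
yavg=14.87954
xavg=20.83804
awk -v y="$yavg" -v x="$xavg" 'BEGIN { print "MFB: " 0.5 * (y - x) / ((y + x) / 2) }'
# prints: MFB: -0.166823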
This might be simple - I have a file as below:
df.csv
col1,col2,col3,col4,col5
A,2,5,7,9
B,6,10,2,3
C,3,4,6,8
I want to compute max(col2,col4) - min(col3,col5) and write the result in a new column, but I get an error using max and min in awk. The desired output should look like:
col1,col2,col3,col4,col5,New_col
A,2,5,7,9,2
B,6,10,2,3,3
C,3,4,6,8,2
I used the code below but it does not work. How can I solve this?
awk -F, '{print $1,$2,$3,$4,$5,$(max($7,$9)-min($8,$10))}'
Thank you.
$ cat tst.awk
BEGIN { FS=OFS="," }
{ print $0, (NR>1 ? max($2,$4) - min($3,$5) : "New_col") }
function max(a,b) {return (a>b ? a : b)}
function min(a,b) {return (a<b ? a : b)}
$ awk -f tst.awk file
col1,col2,col3,col4,col5,New_col
A,2,5,7,9,2
B,6,10,2,3,3
C,3,4,6,8,2
If your actual "which is larger" calculation is more involved than just using >, e.g. if you were comparing dates in some non-alphabetic format, or people's names where you have to compare the surname before the forename and handle titles, etc., then you'd write the functions as:
function max(a,b) {
# some algorithm to compare the 2 strings
}
function min(a,b) {return (max(a,b) == a ? b : a)}
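For example, a purely illustrative max() that treats zero-padded DD/MM/YYYY dates as "larger means later" (the date format is just an assumption here) might look like:
# illustrative only: assumes zero-padded DD/MM/YYYY dates, e.g. 05/03/2021
function max(a, b,    A, B) {
    split(a, A, "/"); split(b, B, "/")
    # rearrange to YYYYMMDD so a plain string comparison picks the later date
    return ((A[3] A[2] A[1]) > (B[3] B[2] B[1]) ? a : b)
}
function min(a, b) { return (max(a, b) == a ? b : a) }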
You may use this awk:
awk 'BEGIN{FS=OFS=","} NR==1 {print $0, "New_col"; next} {print $0, ($2 > $4 ? $2 : $4) - ($3 < $5 ? $3 : $5)}' df.csv
col1,col2,col3,col4,col5,New_col
A,2,5,7,9,2
B,6,10,2,3,3
C,3,4,6,8,2
A more readable version:
awk '
BEGIN { FS = OFS = "," }
NR == 1 {
print $0, "New_col"
next
}
{
print $0, ($2 > $4 ? $2 : $4) - ($3 < $5 ? $3 : $5)
}' df.csv
"get an error using max and min in awk and write the result in a new column"
No such functions are available in awk, but for two values you can use the ternary operator. So in place of
max($7,$9)
try
($7>$9)?$7:$9
and in place of
min($8,$10)
try
($8<$10)?$8:$10
The above uses the ternary operator ?:, which can be read as condition ? value_if_true : value_if_false. As a simple example, let the content of file.txt be
100,100
100,300
300,100
300,300
then
awk 'BEGIN{FS=","}{print ($1>$2)?$1:$2}' file.txt
output
100
300
300
300
Also, are you sure about the first $ in $(max($7,$9)-min($8,$10))? Written that way, you are instructing awk to fetch the value of the n-th column, where n is the result of the computation inside (...).
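Putting the two ternaries together for your df.csv (note that with five columns the fields are $2..$5, not $7..$10), a sketch would be:
awk 'BEGIN { FS = OFS = "," }
     NR == 1 { print $0, "New_col"; next }
     { print $0, (($2 > $4) ? $2 : $4) - (($3 < $5) ? $3 : $5) }' df.csv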
I have a file detail.txt, which contains:
cat >detail.txt
Student ID,Student Name, Percentage
101,A,75
102,B,77
103,C,34
104,D,42
105,E,75
106,F,42
107,G,77
1. I want to print concatenated output based on Percentage (group by Percentage), with the student names on a single line separated by commas (,).
Expected Output:
75-A,E
77-B,G
42-D,F
34-C
For the above, I worked out how to achieve this for a single value such as 75, 77, or 42, but I could not figure out how to write code that groups by the third field (Percentage).
I tried the code below:
awk -F"," '{OFS=",";if($3=="75") print $2}' detail.txt
2. I want to get output based on the grading system given below:
marks < 45 = THIRD
marks >= 45 and marks < 60 = SECOND
marks >= 60 and marks <= 75 = FIRST
marks > 75 = DIST
Expected Output:
DIST:B,G
FIRST:A,E
THIRD:C,D,F
Please help me get the expected output. Thank you.
awk solution:
awk -F, 'NR>1{
if ($3<45) k="THIRD"; else if ($3>=45 && $3<60) k="SECOND";
else if ($3>=60 && $3<=75) k="FIRST"; else k="DIST";
a[k] = a[k]? a[k]","$2 : $2;
}END{ for(i in a) print i":"a[i] }' detail.txt
k - the variable that gets assigned the "grading system" name by the if (...) ... else if (...) ... chain
a[k] - array a is indexed by the determined "grading system" name k
a[k] = a[k]? a[k]","$2 : $2 - all "student names" (the 2nd field $2) are accumulated/grouped into the matching "grading system" entry, comma-separated
The output:
DIST:B,G
THIRD:C,D,F
FIRST:A,E
With GNU awk for true multi-dimensional arrays:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR > 1 {
    stud = $2
    pct  = $3
    if      ( pct <  45 ) { band = "THIRD" }
    else if ( pct <  60 ) { band = "SECOND" }
    else if ( pct <= 75 ) { band = "FIRST" }
    else                  { band = "DIST" }
    pcts[pct][stud]
    bands[band][stud]
}
END {
    for (pct in pcts) {
        out = ""
        for (stud in pcts[pct]) {
            out = (out == "" ? pct "-" : out OFS) stud
        }
        print out
    }
    print "----"
    for (band in bands) {
        out = ""
        for (stud in bands[band]) {
            out = (out == "" ? band ":" : out OFS) stud
        }
        print out
    }
}
$ gawk -f tst.awk file
34-C
42-D,F
75-A,E
77-B,G
----
DIST:B,G
THIRD:C,D,F
FIRST:A,E
For your first question, the following awk one-liner should do:
awk -F, 'NR>1 {a[$3]=a[$3] (a[$3] ? "," : "") $2} END {for(i in a) printf "%s-%s\n", i, a[i]}' detail.txt
The second question can work almost the same way, storing your mark divisions in an array, then stepping through that array to determine the subscript for a new array:
BEGIN { FS=","; m[0]="THIRD"; m[45]="SECOND"; m[60]="FIRST"; m[75]="DIST" } { for (i=0;i<=100;i++) if ((i in m) && $3 > i) mdiv=m[i]; marks[mdiv]=marks[mdiv] (marks[mdiv] ? "," : "") $2 } END { for(i in marks) printf "%s:%s\n", i, marks[i] }
But this is unreadable. When you need this level of complexity, you're past the point of a one-liner. :)
So, combining the two and breaking them out for easier reading (and commenting), we get the following:
BEGIN {
    FS = ","
    m[0]  = "THIRD"
    m[45] = "SECOND"
    m[60] = "FIRST"
    m[75] = "DIST"
}
NR > 1 {                                      # Skip the header line
    a[$3] = a[$3] (a[$3] ? "," : "") $2       # Build an array with percentage as the index
    for (i = 0; i <= 100; i++)                # Walk through the possible marks
        if ((i in m) && $3 > i) mdiv = m[i]   # selecting the correct divider on the way
    marks[mdiv] = marks[mdiv] (marks[mdiv] ? "," : "") $2
                                              # then build another array with divider
                                              # as the index
}
END {                                         # Once we've processed all the input,
    for (i in a)                              # step through the array,
        printf "%s-%s\n", i, a[i]             # printing the results.
    print "----"
    for (i in marks)                          # step through the array,
        printf "%s:%s\n", i, marks[i]         # printing the results.
}
You may be wondering why we use for (i=0;i<=100;i++) instead of simply using for (i in m). This is because awk does not guarantee the order of array elements, and when stepping through the m array, it's important that we see the keys in increasing order.
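As an aside, if you happen to have GNU awk, you could instead request ordered traversal with PROCINFO["sorted_in"] and keep the simpler for (i in m) loop. A sketch of just the grading half (GNU awk only; the +0 forces a numeric comparison, since array indices are strings):
BEGIN {
    FS = ","
    PROCINFO["sorted_in"] = "@ind_num_asc"    # GNU awk: traverse arrays in ascending numeric index order
    m[0] = "THIRD"; m[45] = "SECOND"; m[60] = "FIRST"; m[75] = "DIST"
}
NR > 1 {
    for (i in m)                              # keys now arrive as 0, 45, 60, 75
        if ($3 + 0 > i + 0) mdiv = m[i]
    marks[mdiv] = marks[mdiv] (marks[mdiv] ? "," : "") $2
}
END { for (i in marks) printf "%s:%s\n", i, marks[i] }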
I have a file with 50,000 rows and 200 columns. Most of the values are zero, so I want to delete all rows that have more than 100 columns with zero values.
Please suggest an AWK / Unix command.
Thanks
This MIGHT be what you need:
awk 'gsub(/(^|[[:space:]])0([[:space:]]|$)/,"&")<100' file
but without seeing some sample input/output we're guessing.
The following is an awk script that counts the number of zero-valued fields in each record (row) of input and only outputs the record if that count does not exceed 100.
#!/usr/bin/awk -f
{
zcount = 0;
for (i = 1; i <= NF; ++i) {
if ($i == 0)
++zcount;
if (zcount > 100)
next;
}
print;
}
To run it, first make it executable:
$ chmod +x script.awk
Then, assuming your data is in the file data.in:
$ ./script.awk data.in
Alternatively, without making it executable:
$ awk -f script.awk data.in
The following variation of the script allows you to specify the maximum number of zeroes that you want to allow:
#!/usr/bin/awk -f
BEGIN { if (zmax == 0) zmax = 100 }
{
zcount = 0;
for (i = 1; i <= NF; ++i) {
if ($i == 0)
++zcount;
if (zcount > zmax)
next;
}
print;
}
You run this with
$ ./script.awk -v zmax=90 data.in
The second script will default to 100 if you leave -v zmax=N out.
Another awk:
$ awk 'BEGIN{FS=" +0 +"} NF<100' file
How would you find the maximum and minimum a,b,c values for the rows that start with MATH from the following file?
TITLE a b c
MATH 12.3 -0.42 5.5
ENGLISH 70.45 3.21 6.63
MATH 3.32 2.43 9.42
MATH 3.91 -1.56 7.22
ENGLISH 89.21 4.66 5.32
It cannot be just one command line. It has to be a script file using BEGIN and END blocks.
I get the wrong minimum value and I end up getting a string for max when I run my program. Please help!
Here is my code for the column a:
BEGIN { x=1 }
{
if ($1 == "MATH") {
min=max=$2;
for ( i=0; i<=NF; i++) {
min = (min < $i ? min : $i)
max = (max > $i ? max : $i)
}
}
}
END { print "max a value is ", max, " min a value is ", min }
Thanks!
This code demonstrates the concept of what you want:
awk '$1!="MATH"{next}1;!i++{min=$2;max=$2;}{for(j=2;j<=NF;++j){min=(min<$j)?min:$j;max=(max>$j)?max:$j}}END{printf "Max value is %.2f. Min value is %.2f.\n", max, min}' file
Output:
MATH 12.3 -0.42 5.5
MATH 3.32 2.43 9.42
MATH 3.91 -1.56 7.22
Max value is 12.30. Min value is -1.56.
Remove the 1 to suppress echoing the matched lines:
awk '$1!="MATH"{next};...
Script version:
#!/usr/bin/awk -f
$1 != "MATH" {
# Move to next record if not about "MATH".
next
}
!i++ {
# Only does this on first match.
min = $2; max = $2
}
{
for (j = 2; j <= NF; ++j) {
min = (min < $j) ? min : $j
max = (max > $j) ? max : $j
}
}
END {
printf "Max value is %.2f. Min value is %.2f.\n", max, min
}
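To run the script version, save it to a file (the name mathstats.awk below is only an example) and either invoke awk explicitly:
awk -f mathstats.awk file
or make it executable with chmod +x mathstats.awk and run ./mathstats.awk file.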
Look at your for loop:
it starts from i=0, so the condition should be
i<NF
instead of
i<=NF
Try the following line instead of that one; I hope you get what you are looking for:
for(i=0;i<NF;i++){
The rest all looks fine to me. Thanks!
The i variable in the for loop should start at 2 (the 2nd field), not 0 (which represents the whole line), and it should end at NF.
BEGIN { x=1;min=2147483647;max=-2147483648}
{
if ($1 == "MATH") {
for ( i=2; i<=NF; i++) {
min = (min < $i ? min : $i)
max = (max > $i ? max : $i)
}
}
}
END { print "max a value is ", max, " min a value is ", min }
Run with the following command (testawk.script is the awk script above, test.data is the input data file):
cat test.data | awk -f testawk.script
output:
max a value is 12.3 min a value is -1.56
I don't have a terminal handy right now, but something along these lines will get the smallest value of each line:
cat YOURFILE | grep "^MATH" | \
while read CMD; do
A=`echo $CMD | awk '{ print $2 }'`
B=`echo $CMD | awk '{ print $3 }'`
C=`echo $CMD | awk '{ print $4 }'`
#IF Statement for comparing the three of them
#echo the smallest
done
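Since comparing floating-point numbers in plain shell is awkward, the comparison could also be done entirely inside awk; a rough sketch for the per-row minimum (the maximum works the same way with > instead of <):
# print the smallest numeric column of each MATH row
awk '$1 == "MATH" {
    m = $2
    for (i = 3; i <= NF; i++)
        if ($i < m) m = $i
    print $1, m
}' YOURFILE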