Aggregate as new column in R

Input:
Time,id1,id2
22:30,1,0
22:32,2,1
22:33,1,0
22:34,2,1
Output Desired
Time,Time2,id1,id2
22:30,22:33,1,0
22:32,22:34,2,1
Output by my code
Time,id1,id2
22:30,22:33,1,0
22:32,22:34,2,1
What change should I make to my code aggregate(Time~.,df,FUN=toString)?
My id1 and id2 together form the key, and the times are the in and out times for each key. I need the time in and time out as separate column values; currently they are both in the Time column.
I tried it using awk also.

If you do not want to use any packages, this will work:
df <- aggregate(Time~.,df,FUN=toString)
df
#output
id1 id2 Time
1 0 22:30, 22:33
2 1 22:32, 22:34
df$Time2 <- sapply(strsplit(as.character(df$Time), ","), "[", 2)  # out time
df$Time  <- sapply(strsplit(as.character(df$Time), ","), "[", 1)  # in time
df
#output
id1 id2 Time Time2
1 0 22:30 22:33
2 1 22:32 22:34

With awk
$ cat time.awk
BEGIN {
    FS = OFS = ","
}
# first time a key (id1,id2) is seen: record its "in" time
function in_time() {
    n++
    store[id1, id2] = n
    itime[n] = time; iid1[n] = id1; iid2[n] = id2
}
# key seen before: record its "out" time
function out_time(   i) {
    i = store[id1, id2]
    otime[i] = time
}
NR > 1 {
    time = $1; id1 = $2; id2 = $3
    if ((id1, id2) in store) out_time()
    else in_time()
}
END {
    print "Time,Time2,id1,id2"
    for (i = 1; i <= n; i++)
        print itime[i], otime[i], iid1[i], iid2[i]
}
Usage:
awk -f time.awk file.dat
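For the sample input above this should print:
Time,Time2,id1,id2
22:30,22:33,1,0
22:32,22:34,2,1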

Related

how to use awk to pull fields and make them variables to do data calculation

I am using awk to compute the Mean Fractional Bias from a data file. How can I make the data points variables to plug into my equation?
Input.....
col1 col2
row #1 Yavg: 14.87954
row #2 Xavg: 20.83804
row #3 Ystd: 7.886613
row #4 Xstd: 8.628519
I am looking to feed into this equation....
MFB = .5 * (Yavg-Xavg)/[(Yavg+Xavg)/2]
output....
col1 col2
row #1 Yavg: 14.87954
row #2 Xavg: 20.83804
row #3 Ystd: 7.886613
row #4 Xstd: 8.628519
row #5 MFB: (computed value)
Currently I am trying to use the following code to do this, but it is not working:
var= 'linear_reg-County119-O3-2004-Winter2013-2018XYstats.out.out'
val1=$(awk -F, OFS=":" "NR==2{print $2; exit}" <$var)
val2=$(awk -F, OFS=":" "NR==1{print $2; exit}" <$var)
#MFB = .5*((val2-val1)/((val2+val1)/2))
awk '{ print "MFB :" .5*((val2-val1)/((val2+val1)/2))}' >> linear_regCounty119-O3-2004-Winter2013-2018XYstats-wMFB.out
Try running: awk -f mfb.awk input.txt where
mfb.awk:
BEGIN { FS = OFS = ": " } # set the separators
{ v[$1] = $2; print }            # store each value under its label in array "v" and echo the line
END {
MFB = 0.5 * (v["Yavg"] - v["Xavg"]) / ((v["Yavg"] + v["Xavg"]) / 2)
print "MFB", MFB
}
input.txt:
Yavg: 14.87954
Xavg: 20.83804
Ystd: 7.886613
Xstd: 8.628519
Output:
Yavg: 14.87954
Xavg: 20.83804
Ystd: 7.886613
Xstd: 8.628519
MFB: -0.166823
Alternatively, mfb.awk can be the following, resembling your original code:
BEGIN { FS = OFS = ": " }
{ print }
NR == 1 { Yavg = $2 } NR == 2 { Xavg = $2 }
END {
MFB = 0.5 * (Yavg - Xavg) / ((Yavg + Xavg) / 2)
print "MFB", MFB
}
Note that you don't usually need to pass variables back and forth between the shell and awk (at least when dealing with a single input file).
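If you want the result appended to a separate output file, as in your original attempt, redirecting the whole run is enough (a sketch, reusing your file names):
awk -f mfb.awk "$var" >> linear_regCounty119-O3-2004-Winter2013-2018XYstats-wMFB.out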

Compute z-score for all columns using awk

I have a file whose first column contains IDs and whose remaining columns are numerical values for which I want to compute z-scores. I know there are lots of posts about calculating z-scores with Python and R, but I am not familiar with Python and I do not want to use R.
I already have a way to calculate the mean and standard deviation of all my columns (I have 30 columns), but I need to calculate the z-scores for each column and I am not sure how to do it, or whether it is possible using awk.
My data is tab delimited, for example:
ID W A
BR_400 1005.98 19.35
FG_50 434.89 2.987
DS_195_At 39.86 0.567
ES_23_Md 41.45 19.55
My command to calculate mean and std for all columns:
cat input.txt | awk '{for(i=1;i<=NF;i++) {sum[i] += $i; sumsq[i] += ($i)^2}} END {for (i=1;i<=NF;i++) {printf "%f %f \n", sum[i]/NR, sqrt((sumsq[i]-sum[i]^2/NR)/NR)}}' > mean_std.txt
The z-scores formula:
z = (x - mean) / std
Any suggestions?
The expected output has only z-scores for each column:
ID W zscore A zscore
BR_400 1.370068724 0.852212191
FG_50 0.119047359 -0.743935933
DS_195_At -0.746299556 -0.979997685
ES_23_Md -0.742816526 0.871721427
You may use this awk:
awk 'BEGIN {
FS=OFS="\t"
}
NR == 1 {
print
next
}
NR == FNR {
++n
for(i=2;i<=NF;i++) {
sum[i] += $i
sumsq[i] += ($i)^2
}
next
}
FNR == 1 { # compute mean and std values here
for (i=2;i<=NF;i++) {
mean[i] = sum[i]/n
std[i] = sqrt( (sumsq[i] - sum[i]^2/n) / (n-1) )
}
next
}
{
printf "%s", $1 OFS
for (i=2;i<=NF;i++)
printf "%f%s", ($i - mean[i]) / std[i], (i < NF ? OFS : ORS)
}' file file | column -t
ID         W          A
BR_400     1.370069   0.852212
FG_50      0.119047   -0.743936
DS_195_At  -0.746300  -0.979998
ES_23_Md   -0.742817  0.871721
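If you'd rather not read the file twice (for example when the data comes from a pipe), here is a single-pass sketch under the same assumptions (tab-delimited, first row is the header, first column is the ID), buffering the data rows in memory:
awk 'BEGIN { FS = OFS = "\t" }
NR == 1 { print; next }                      # pass the header through
{
    row[++n] = $0; nf = NF                   # buffer the data rows, remember the column count
    for (i = 2; i <= NF; i++) { sum[i] += $i; sumsq[i] += ($i)^2 }
}
END {
    for (i = 2; i <= nf; i++) {
        mean[i] = sum[i] / n
        std[i]  = sqrt((sumsq[i] - sum[i]^2 / n) / (n - 1))
    }
    for (r = 1; r <= n; r++) {
        split(row[r], f, FS)
        printf "%s", f[1] OFS
        for (i = 2; i <= nf; i++)
            printf "%f%s", (f[i] - mean[i]) / std[i], (i < nf ? OFS : ORS)
    }
}' file | column -t
The trade-off is memory: the whole file is held in the row array.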

How to get the data from logfile if the input and logdata are in different format?

My log file data is
[10/04/16 02:07:20 BST] Data 1
[11/04/16 02:07:20 BST] Data 1
[10/05/16 04:11:09 BST] Data 2
[12/05/16 04:11:09 BST] Data 2
[11/06/16 06:22:35 BST] Data 3
My input format is
./filename Apr 11 16 00:00:00 Jul 10 16 00:00:00
I am converting the input format to logfile format with the following function,
convert_date () {
local months=( Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec )
local i
for (( i=0; i<11; i++ )); do
[[ $1 = ${months[$i]} ]] && break
done
printf "\[%2d\/%02d\/%02d $4 BST\]\n" $2 $(( i+1 )) $3
for (( i=0; i<11; i++ )); do
[[ $5 = ${months[$i]} ]] && break
done
}
I am also storing the result in variables and using them:
Start=$( convert_date $1 $2 $3 $4 )
End=$( convert_date $5 $6 $7 $8 )
But the code gives me results only if the start time and end time are present in the log file. How can I get the data between the two times even if the start and end times themselves do not appear in the log file? What awk script can I use?
Your Bash function (assumed) seems to output the date in the following format:
$ bash test.sh "Apr 11 16 00:00:00"
\[11\/04\/16 00:00:00 BST\]
Working with that, test.awk:
BEGIN {
    FS = "[[/: ]+"                       # field separator: every delimiter used in the log's datetime
    split(start, arr, "[\\\\[/ :]")      # split the start variable into its parts
    start = arr[4]" "arr[3]" "arr[2]" "arr[5]" "arr[6]" "arr[7]   # reorder to "YY MM DD HH MM SS"
}
start <= $4" "$3" "$2" "$5" "$6" "$7     # print records whose reordered date is >= start
$ awk -v start="\[11\/04\/16 00:00:00 BST\]" -f test.awk test.in
[11/04/16 02:07:20 BST] Data 1
[10/05/16 04:11:09 BST] Data 2
[12/05/16 04:11:09 BST] Data 2
[11/06/16 06:22:35 BST] Data 3
It complains a bit, though:
awk: warning: escape sequence '\[' treated as plain '['
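If you also need an upper bound, a sketch along the same lines with a second end variable converted exactly like start (the names end and test2.awk are placeholders; the End value would be built by the same convert_date call as Start):
BEGIN {
    FS = "[[/: ]+"
    split(start, a, "[\\\\[/ :]"); start = a[4]" "a[3]" "a[2]" "a[5]" "a[6]" "a[7]
    split(end,   b, "[\\\\[/ :]"); end   = b[4]" "b[3]" "b[2]" "b[5]" "b[6]" "b[7]
}
{ key = $4" "$3" "$2" "$5" "$6" "$7 }    # reorder the log date the same way
key >= start && key <= end               # print records inside the time window
$ awk -v start="\[11\/04\/16 00:00:00 BST\]" -v end="\[10\/07\/16 00:00:00 BST\]" -f test2.awk test.in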
The input format is in the "Input.Format" file and the log data is in "Log.File". This shell script 'cat's both files and pipes them to 'awk'. The awk script converts the month name to a number and compares each log date to the start and end dates to turn a print switch on or off.
#!/bin/sh
cat Input.Format Log.File | awk 'BEGIN {
    Month = " JanFebMarAprMayJunJulAugSepOctNovDec"
} {
    if (NR == 1) {                                   # first line comes from Input.Format
        startm = (index(Month, $2) + 1) / 3          # start month name -> month number
        if (length(startm) == 1) { startm = "0" startm }
        startm = $4 startm $3                        # start key: YYMMDD
        endm = (index(Month, $6) + 1) / 3            # end month name -> month number
        if (length(endm) == 1) { endm = "0" endm }
        endm = $8 endm $7                            # end key: YYMMDD
        # print startm " " endm
    }
    else {
        logdate = substr($1,8,2) substr($1,5,2) substr($1,2,2)   # log date as YYMMDD
        # print logdate
        if (logdate >= startm) { prtsw = 1 }
        if (logdate > endm)    { prtsw = 0 }
        if (prtsw == 1) { print $0 }
    }
}'

Awk/Sed to replace word after specific string

I have a file like below
<DATABASE name="ABC" url="jdbc:sybase:Tds:eqprod3:5060/ABC01" driver="com.sybase.jdbc2.jdbc.SybDriver" user="user" pwd="password" minConnections="10" maxConnections="10" maxConnectionLife="1440000" startDate="01/01/2014" endDate="01/30/2014" type="dev"/>
<DATABASE name="XYZ" url="jdbc:sybase:Tds:eqprod2:5050/XYZ01" driver="com.sybase.jdbc2.jdbc.SybDriver" user="user" pwd="password" minConnections="10" maxConnections="10" maxConnectionLife="1440000" startDate="02/01/2014" endDate="02/02/2014" type="dev"/>
Now I want to search for the word ABC01 in the url part, find the startDate attribute that follows it, and change its value, let's say to 02/01/2014.
Could you please help me get the required output?
With sed :
sed '/ABC01/ s/startDate="[^"]*"/startDate="02\/01\/2014"/g' your.file
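To change the file in place instead of writing to stdout, sed's -i option can be added (a sketch; the .bak suffix keeps a backup and works with both GNU and BSD sed):
sed -i.bak '/ABC01/ s/startDate="[^"]*"/startDate="02\/01\/2014"/g' your.file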
This script works for both single- and multi-line DATABASE elements.
awk '
# search for ABC01 in url attribute
/url="[^"]*ABC01[^"]*"/{
# set the flag
f = 1;
# current line number
i1 = NR;
}
# save line if the flag is set
(f){lines[NR] = $0}
# output line otherwise
(!f){print $0}
# if the flag is set, search for startDate attribute
(f && $0 ~ /startDate="/){
# replace value of startDate attribute with 02/01/2014
s = gensub(/(startDate=")[^"]+/, "\\102/01/2014", 1, $0)
# current line number
i2 = NR;
# output non-modified lines (no output if i2 == i1)
for (i = i1; i < i2; i++){print lines[i]};
# output modified line
print s;
# unset the flag
f = 0;
}'
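Note that gensub() is specific to GNU awk. If gawk is not available, a hedged one-liner using sub() handles the single-line elements shown in your sample:
awk '/url="[^"]*ABC01[^"]*"/ { sub(/startDate="[^"]*"/, "startDate=\"02/01/2014\"") } 1' your.file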

How to print a line with a pattern which is nearest to another line with a specific pattern?

I want to find a pattern which is nearest to a specific pattern. For example, I want to print the "bbb=" line which is under "## yyyy:" (the bbb= line closest to yyyy), i.e. line 8. The line numbers and the order might change, so it is better not to rely on line numbers.
root# vi a
"a" 15 lines
1 ## xxxx:
2 aaa=3
3 bbb=4
4 ccc=2
5 ddd=1
6 ## yyyy:
7 aaa=1
8 bbb=0
9 ccc=3
10 ddd=3
11 ## zzzz:
12 aaa=1
13 bbb=1
14 ccc=1
15 ddd=1
Do you have an idea using awk or grep for this purpose?
Something like this?
awk '/^## yyyy:/ { i = 1 }; i && /^bbb=/ { print; exit }'
Or can a line above also match? In that case, perhaps:
awk '/^bbb=/ && !i { p=NR; s=$0 }; /^bbb=/ && i { print (NR-i < i-p) ? $0 : s; exit }; /^## yyyy:/ { i=NR }'
Taking into account that there might not be a previous or next entry:
/^bbb=/ && !i { p1 = NR; s1 = $0 }
/^bbb=/ && i { p2 = NR; s2 = $0; exit }
/^## yyyy:/ { i = NR }
END {
if (p1 == 0)
print s2
else if (p2 == 0)
print s1
else
print (i - p1 < p2 - i ? s1 : s2)
}
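Usage (nearest.awk is just a placeholder name for the script above):
awk -f nearest.awk a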
Quick and dirty using grep:
grep -A 100 '## yyyy' filename | grep -m 1 'bbb='
