I suspect I'm Doing It Wrong, but I'd like to pass a character vector as an argument to a function in ddply. There's a lot of Q&A on removing quotes, etc. but none of it seems to work for me (eg. Remove quotes from a character vector in R and http://r.789695.n4.nabble.com/Pass-character-vector-to-function-argument-td3045226.html).
# reproducible data
df1<-data.frame(a=sample(1:50,10),b=sample(1:50,10),c=sample(1:50,10),d=(c("a","b","c","a","a","b","b","a","c","d")))
df2<-data.frame(a=sample(1:50,9),b=sample(1:50,9),c=sample(1:50,9),d=(c("e","f","g","e","e","f","f","e","g")))
df3<-data.frame(a=sample(1:50,8),b=sample(1:50,8),c=sample(1:50,8),d=(c("h","i","j","h","h","i","i","h")))
#make a list
list.1<-list(df1=df1,df2=df2,df3=df3)
# desired output
lapply(list.1, function(x) ddply(x, .(d), function(x) data.frame(am=mean(x$a), bm=mean(x$b), cm=mean(x$c))))
$df1
d am bm cm
1 a 31.00000 29.25000 18.50000
2 b 31.66667 24.33333 34.66667
3 c 18.50000 5.50000 24.50000
4 d 36.00000 39.00000 43.00000
$df2
d am bm cm
1 e 18.25000 32.50000 18
2 f 27.66667 41.33333 24
3 g 25.00000 7.50000 42
$df3
d am bm cm
1 h 36.00000 25.00000 20.50000
2 i 25.33333 37.33333 24.33333
3 j 32.00000 32.00000 46.00000
But my actual use-case has many new columns and different types of calculations that I want to calculate in the ddply function. So I want to do something like:
# here's a simple version of a function that I want to send to ddply
func <- "am=mean(x$a), bm=mean(x$b), cm=mean(x$c)"
# here's how I imagine it might work
lapply(list.1, function(x) ddply(x, .(d), function(x) data.frame(func)) )
# not the desired outcome...
$df1
d func
1 a am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
2 b am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
3 c am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
4 d am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
$df2
d func
1 e am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
2 f am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
3 g am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
$df3
d func
1 h am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
2 i am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
3 j am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
I've tried noquote, deparse, eval(as.symbol()), do.call(data.frame, ...) and some of the methods here: https://github.com/hadley/devtools/wiki/Evaluation on func to no avail. The solution might be obvious at this point (ie. melt everything!), but in case it's not, here's a longer example that's closer to my use case:
# sample data
s <- 23 # number of samples
r <- 10 # number of runs per sample
el <- 17 # number of elements
mydata <- data.frame(ID = unlist(lapply(LETTERS[1:s], function(x) rep(x, r))),
run = rep(1:r, s))
# insert fake element data
mydata[letters[1:el]] <- lapply(1:el, function(i) rnorm(s*r, runif(1)*i^2))
# generate all combinations of 5 runs from ten runs
su <- 5 # number of runs to sample from ten runs
idx <- combn(unique(mydata$run), su)
# RSE function
RSE <- function(x) {100*( (sd(x)/sqrt(length(x)))/mean(x) )}
# make a list of dfs for all samples for each combination of five runs
# to prepare to calculate RSEs
combys1 <- lapply(1:ncol(idx), function(i) mydata[mydata$run %in% idx[,i],] )
# make a list of dfs with RSE for each ID, for each combination of runs
combys2 <- lapply(1:length(combys1), function(i) ddply(combys1[[i]], "ID", summarise, RSEa=RSE(a), RSEb=RSE(b), RSEc=RSE(c), meana=mean(a), meanb=mean(b), meanc=mean(c)))
I want to replace RSEa=RSE(a), RSEb=RSE(b), RSEc=RSE(c), meana=mean(a), meanb=mean(b), meanc=mean(c) in the last line above with the object doRSE from here, to avoid lots of typing:
# prepare to calculate new colums with RSE and means
RSEs <- sapply(3:ncol(mydata), function(j) paste0("RSE",names(mydata[j])))
RSExs <- sapply(3:ncol(mydata), function(j) paste0("RSE(",names(mydata[j]),")"))
doRSE <- paste0(sapply(1:length(RSEs), function(x) paste0(RSEs[x],"=",RSExs[x])), collapse=",", sep="")
I'm open to solutions involving base, data.table and dirty tricks. Seems like these are close to what I want, but I can't quite translate them to my problem:
Pass character argument and evaluate,
Force evaluation of multiple variables using vector of character,
Using a vector of characters that correspond to an expression as an argument to a function
UPDATE Here's the catch: I want to be able to modify the func in the simple example (or doRSE in my use-case) to create a bunch of new columns that result from various calculations on the existing columns to explore the data. I want a workflow that allows the resulting dataframes to have new columns that were not in the original dataframes. Sorry that wasn't more clear in the original question. I can't see how to adapt #Marius' answer to do this, but #mnel's is helpful (see update below)
Working through #mnel's excellent dirty tricks, with some minor fixes I can get the desired result on my use-case:
# #mnel's solution, adapted (no period before eval)
combys2 <- lapply(combys1, function(x) do.call(ddply,c(.data = quote(x),
.variables = quote(.(ID)), .fun = quote(summarize),
eval(parse(text = sprintf('.(%s)', doRSE ))))))
head(combys2)
[[1]]
ID RSEa RSEb RSEc RSEd RSEe RSEf RSEg RSEh RSEi
1 A 168.30658 21.68632 5.657228 5.048057 4.162017 2.9581874 1.849009 0.6925148 0.4393491
2 B 26.55071 26.20427 4.782578 4.385409 2.342764 2.1813874 2.719625 1.1576681 0.6427935
3 C 73.83165 14.47216 8.154435 6.273202 3.046978 1.2179457 2.811405 1.1401837 0.8167067
4 D 31.96170 57.89260 9.438220 7.388410 3.755772 0.8601780 3.724875 0.8358204 0.9939387
5 E 63.22537 60.35532 5.839690 11.691304 3.828430 0.9217787 4.204300 0.8217187 0.7876634
6 F 56.37635 65.37907 4.149568 5.496308 2.227544 2.1548455 2.847291 1.1956212 0.2506518
7 G 69.32232 23.63214 4.255847 7.979225 4.917660 1.6185960 3.156521 0.3265555 0.8133279
8 H 29.82015 40.74184 7.372100 7.464792 2.749862 0.6054420 4.061368 0.9973909 1.3807720
9 I 50.58114 19.53732 2.989920 9.767678 4.000249 1.7451322 1.175397 0.9952093 0.9095086
10 J 92.96462 39.77475 6.140688 10.295668 3.407726 2.4663758 3.030444 0.5743419 0.9296482
11 K 90.72381 42.25092 2.483069 6.781054 3.142082 1.8080633 2.891740 1.1996176 0.8525290
12 L -385.24547 40.81267 4.506087 8.148382 2.976488 0.8304432 2.234134 0.2108664 0.4979777
13 M 22.77743 33.98332 2.913926 8.764639 2.307293 0.8366635 3.229944 1.0003125 0.3878567
14 N 66.75163 34.16087 6.611326 13.865377 1.285522 1.3863958 4.165575 0.7379386 0.4515194
15 O 37.37188 100.57479 5.738877 5.724862 2.839638 1.1366610 3.186332 0.7383855 0.3954544
16 P 17.08913 26.62210 6.060130 4.110893 2.688908 2.6970727 1.609043 1.3860834 0.8780010
17 Q 13.96392 74.92279 5.469304 8.467638 2.974131 1.2135436 3.284564 0.6232778 1.0759226
18 R 42.59899 30.75952 4.842832 8.764158 1.874020 1.5791048 3.427342 1.4479638 0.2964455
19 S 26.03307 15.56352 6.968717 7.783876 4.439733 2.0764179 4.683080 0.7459654 1.1268772
20 T 71.57945 33.81362 7.147049 11.201551 2.128315 2.2051611 2.419805 0.2688807 1.1559635
21 U 73.93002 11.77155 7.738910 7.207041 1.478491 1.4409844 4.042419 0.5883490 0.5585716
22 V 67.93166 39.54994 5.701551 8.636122 2.472963 1.6514199 2.627965 1.0359048 0.8747136
23 W 11.23057 12.51272 7.003448 7.424559 4.102693 0.6614847 2.246305 1.3422405 0.2665246
RSEj RSEk RSEl RSEm RSEn RSEo RSEp RSEq
1 0.6366733 0.3713819 2.1993487 0.3865293 0.5436581 0.9187585 0.4344699 0.8915868
2 0.3445095 0.2932025 1.8563179 0.5397595 1.0433388 0.3533622 0.1942316 0.1941072
3 0.2720344 0.5507595 2.0305726 0.4377259 0.8589854 0.5690906 0.1397337 0.4043247
4 0.6606667 0.6769112 3.4737352 0.5674656 1.2519256 0.8718298 0.1162969 0.8287504
5 0.4620774 0.5598069 1.9236112 0.7990046 0.9832732 0.6847352 0.4070675 0.9005185
6 0.7981610 0.4005493 0.9721068 0.2770989 1.7054674 0.3110139 0.4521183 0.8740444
7 0.3969116 0.4717575 4.1341106 0.7510628 0.9998299 0.5342292 0.4319642 1.1861705
8 0.2963956 0.2652221 0.4775827 0.2617120 0.8261874 0.5266087 0.1900943 0.2350553
9 0.2609359 0.5431035 2.6478440 0.1606919 0.7407281 0.6802262 0.1802069 0.7438792
10 0.4239787 0.8753544 3.4218030 0.5467869 0.7404017 0.5581173 0.3682014 0.6361436
11 0.4188502 0.8629862 4.4181479 0.1623873 0.8018811 0.5873609 0.3592134 0.5357984
12 0.5790265 0.5009210 3.7534287 0.1933726 0.5809601 0.5777868 0.3400925 0.4783890
13 0.3562582 0.2552756 2.1393219 0.1849345 0.5796194 0.6129469 0.3363311 0.4382125
14 0.7921502 0.6147990 2.9054634 0.5852325 1.4954072 0.9983203 0.2937837 0.7654504
15 0.5840424 0.2757707 1.5695675 0.3305385 0.8712636 0.5816490 0.1985457 0.7213289
16 0.3301280 0.3008273 2.9014987 0.4540833 0.5966479 0.9042004 0.1631630 0.7262141
17 0.5882511 0.2820978 3.0652666 0.4518936 1.3168151 0.4749311 0.2244693 0.6583083
18 0.4048816 0.3708787 3.2207478 0.2603412 1.3168318 0.3318745 0.3120436 0.6210711
19 0.4425123 0.3602076 3.7609863 0.5399527 0.8302572 0.3246904 0.1952143 0.2915325
20 0.5877835 0.6339015 1.6908570 0.3223056 0.5239339 0.6607198 0.2808094 0.3697380
21 0.4454056 0.7733354 4.3433420 0.4391075 0.5503594 0.5893406 0.2262403 0.2361512
22 0.9583940 0.6365843 3.0033951 0.6507968 0.8610046 0.6363198 0.2866719 0.5736855
23 0.4969730 0.3895182 2.0021608 0.3354475 1.4398250 0.7386870 0.2458906 0.3414804
...
...
You can do some ugly computing on the language using quote and plyr::.
Reading https://github.com/hadley/devtools/wiki/Computing-on-the-language will probably help understand whether you really want to do this.
Anyway, an approach could be to use
use .() to create your vector of arguments eg and use how summarize works
.(am=mean(a), bm=mean(b), cm=mean(c))
and if you really wanted to use a character string
foo<- "am=mean(a), bm=mean(b), cm=mean(c)"
eval(parse(text = sprintf('.(%s)', foo )))
Use quote liberally to create your list to be passed to to do.call
for example
lapply(list.1, function(x) do.call(ddply,c(.data = quote(x),
.variables = quote(.(d)), .fun = quote(summarize),
.(am=mean(a), bm=mean(b), cm=mean(c)))))
Oh boy is that ugly.
Or, you could use data.tables
library(data.table)
listDT <- lapply(list.1, data.table)
lapply(listDT, function(x) x[,lapply(.SD, mean), by = 'd'])
or
mystuff <- sprintf('list(%s)', foo)
lapply(listDT, function(x) x[, eval(parse(text = mystuff)), by = 'd'])
However, if you had all the same columns in all your data.tables, it would be more efficient to create one large data.table (with an identifer for each element of the list) and work on that.
Here's a ddply function that calculates the mean for all the columns that aren't d in your dataframes:
lapply(list.1,
function(x) {
ddply(
x,
.(d),
function(df_part) {
result_df <- data.frame(d=df_part$d[1])
non_d_cols <- colnames(df_part)[! colnames(df_part) == "d"]
for (col in non_d_cols) {
col_mean <- mean(df_part[[col]])
col_name <- paste0(col, "_mean")
result_df[[col_name]] <- col_mean
}
return(result_df)
})
})
That seems to me like the simplest way to do it, and it should generalize well to other calculations you might want to do on those columns. Maybe you could pass in a character vector argument of the columns you want to calculate the mean for, and use that in place of non_d_cols.
Related
Sorry if there is somewhere the answer already - I didn't found.
I do have
vector <- c("WHEA","RICE","MAIZ","BARL","PMIL","SMIL","SORG","OCER","POTA","SWPO","YAMS","CASS","ORTS","BEAN",
"CHIC","COWP","PIGE","LENT","OPUL","SOYB","GROU","CNUT","BANA","PLNT","TROF","TEMF","VEGE","OILP",
"SUNF","RAPE","SESA","OOIL","SUGC","SUGB","COTT","OFIB","ACOF","RCOF","COCO","TEAS","TOBA","REST")
vector_list <- list( Cereal <- vector[1:8], Roots <- vector[9:13], Pulses <- vector[14:19], oilcr <- vector[20:27], millet <- vector[5:6], cofffee <- vector[32:33], fruit <- vector[39:40], banpl <- vector[37:38])
names(vector_list) <- c("Cereal", "Roots", "Pulses", "oilcr", "millet", "coffee", "fruit", "banpl")
Test <- data.frame(A=vector, B = 1:42)
I want to create an new column of to Test$group. For this it should look if the df$A appears in one of the vecotrs_list vectors, if yes it shall return the name of the name of the corresponding vector of vectors list.
I've tried (for example) such an approach, but failed:
Test$group <- sapply(groups, function (x){
if (x %in% Test$A) return(names(x))})
You can try something like the following:
tmp <- unlist(vector_list)
rename_vec <- names(tmp)
names(rename_vec) <- tmp
Test$group <- rename_vec[Test$A]
A B group
1 WHEA 1 Cereal1
2 RICE 2 Cereal2
3 MAIZ 3 Cereal3
4 BARL 4 Cereal4
5 PMIL 5 Cereal5
6 SMIL 6 Cereal6
7 SORG 7 Cereal7
8 OCER 8 Cereal8
9 POTA 9 Roots1
10 SWPO 10 Roots2
11 YAMS 11 Roots3
12 CASS 12 Roots4
13 ORTS 13 Roots5
The unlist creates a vector with the names of your group:
unlist(vector_list)
Cereal1 Cereal2 Cereal3 Cereal4 Cereal5 Cereal6 Cereal7 Cereal8 Roots1 Roots2 Roots3 Roots4 Roots5
"WHEA" "RICE" "MAIZ" "BARL" "PMIL" "SMIL" "SORG" "OCER" "POTA" "SWPO" "YAMS" "CASS" "ORTS"
Pulses1 Pulses2 Pulses3 Pulses4 Pulses5 Pulses6 oilcr1 oilcr2 oilcr3 oilcr4 oilcr5 oilcr6 oilcr7
"BEAN" "CHIC" "COWP" "PIGE" "LENT" "OPUL" "SOYB" "GROU" "CNUT" "BANA" "PLNT" "TROF" "TEMF"
oilcr8 millet1 millet2 coffee1 coffee2 fruit1 fruit2 banpl1 banpl2
"VEGE" "PMIL" "SMIL" "OOIL" "SUGC" "COCO" "TEAS" "ACOF" "RCOF"
You then want to inverse the name and the component of the vector, so that you can use Test$A to select the proper components of your vector which will give you the group:
rename_vec <- names(tmp)
names(rename_vec) <- tmp
rename_vec
WHEA RICE MAIZ BARL PMIL SMIL SORG OCER POTA SWPO YAMS
"Cereal1" "Cereal2" "Cereal3" "Cereal4" "Cereal5" "Cereal6" "Cereal7" "Cereal8" "Roots1" "Roots2" "Roots3"
CASS ORTS BEAN CHIC COWP PIGE LENT OPUL SOYB GROU CNUT
"Roots4" "Roots5" "Pulses1" "Pulses2" "Pulses3" "Pulses4" "Pulses5" "Pulses6" "oilcr1" "oilcr2" "oilcr3"
BANA PLNT TROF TEMF VEGE PMIL SMIL OOIL SUGC COCO TEAS
"oilcr4" "oilcr5" "oilcr6" "oilcr7" "oilcr8" "millet1" "millet2" "coffee1" "coffee2" "fruit1" "fruit2"
ACOF RCOF
"banpl1" "banpl2"
I have a temperature profiler (tp) data for date, depth and temperature. The depth for each date is not exactly the same so I need to unify it to the same depth and set the temperature for that depth by linear approximation. I was able to do this with a loop using ‘approx’ function (see first part of the enclosed code). But I know that I should do it better without a loop (considering I will have about 600,000 rows). I tried to do it with ‘by’ function but was not successful transforming the result (list) into a data frame or matrix (see second part of the code).
Keep in mind that length of the rounded depth is not always the same as in the example.
Rounded depth is in Depth2 column, interpulated temperature is put in Temp2
What is the ‘right’ way to solve this?
# create df manually
tp <- data.frame(Date=double(31), Depth=double(31), Temperature=double(31))
tp$Date[1:11] <- '2009-12-17' ; tp$Date[12:22] <- '2009-12-18'; tp$Date[23:31] <- '2009-12-19'
tp$Depth <- c(24.92,25.50,25.88,26.33,26.92,27.41,27.93,28.37,28.82,29.38,29.92,25.07,25.56,26.06,26.54,27.04,27.53,28.03,28.52,29.02,29.50,30.01,25.05,25.55,26.04,26.53,27.02,27.52,28.01,28.53,29.01)
tp$Temperature <- c(19.08,19.06,19.06,18.87,18.67,17.27,16.53,16.43,16.30,16.26,16.22,17.62,17.43,17.11,16.72,16.38,16.28,16.20,16.15,16.13,16.11,16.08,17.54,17.43,17.32,17.14,16.89,16.53,16.28,16.20,16.13)
# create rounded depth column
tp$Depth2 <- round(tp$Depth)
# loop on date to calculate linear approximation for rounded depth
dtgrp <- tp[!duplicated(tp[,1]),1]
for (i in dtgrp) {
x1 <- tp[tp$Date == i, "Depth"]
y1 <- tp[tp$Date == i, "Temperature"]
x2 <- tp[tp$Date == i, "Depth2"]
tpa <- approx(x=x1,y=y1,xout=x2, rule=2)
tp[tp$Date == i, "Temp2"] <- tpa$y
}
# reduce result to rounded depth
tp1 <- tp[!duplicated(tp[,-c(2:3)]),-c(2:3)]
# not part of the question, but the end need is for a matrix, so this complete it:
library(reshape2)
tpbydt <- acast(tp1, Date~Depth2, value.var="Temp2")
# second part: I tried to use the by function (instead of loop) but got lost when tring to convert it to data frame or matrix
rdpth <- function(x1,y1,x2) {
tpa <- approx(x=x1,y=y1,xout=x2, rule=2)
return(tpa)
}
tp2 <- by(tp, tp$Date,function(tp) rdpth(tp$Depth,tp$Temperature,tp$Depth2), simplify = TRUE)
Very close with by call but remember it returns a list of objects. Therefore, consider building a list of data frames to be row binded at very end:
df_list <- by(tp, tp$Date, function(sub) {
tpa <- approx(x=sub$Depth, y=sub$Temperature, xout=sub$Depth2, rule=2)
df <- unique(data.frame(Date = sub$Date,
Depth2 = sub$Depth2,
Temp2 = tpa$y,
stringsAsFactors = FALSE))
return(df)
})
tp2 <- do.call(rbind, unname(df_list))
tp2
# Date Depth2 Temp2
# 1 2009-12-17 25 19.07724
# 2 2009-12-17 26 19.00933
# 5 2009-12-17 27 18.44143
# 7 2009-12-17 28 16.51409
# 9 2009-12-17 29 16.28714
# 11 2009-12-17 30 16.22000
# 12 2009-12-18 25 17.62000
# 21 2009-12-18 26 17.14840
# 4 2009-12-18 27 16.40720
# 6 2009-12-18 28 16.20480
# 8 2009-12-18 29 16.13080
# 10 2009-12-18 30 16.08059
# 13 2009-12-19 25 17.54000
# 22 2009-12-19 26 17.32898
# 41 2009-12-19 27 16.90020
# 61 2009-12-19 28 16.28510
# 81 2009-12-19 29 16.13146
And if you reset row.names, this is exactly identical to your tp1 output:
identical(data.frame(tp1, row.names = NULL),
data.frame(tp2, row.names = NULL))
# [1] TRUE
Suppose i have the following Db:
db<-data.frame(para=c(round(rnorm(20,10,10),0)),sal1=c(rnorm(20,100,7)),sal2=c(rnorm(20,100,7)),sal3=c(rnorm(10,100,7)),sal4=rep(c("a","b"),5))
para sal1 sal2 sal3 sal4
1 13 104.73988 96.53538 107.03285 a
2 1 94.54826 88.79930 101.17248 b
3 26 102.36344 94.83702 91.11708 a
4 13 99.32913 95.90670 90.49470 b
Basically what i want:
paste(db$sal1,db$sal2,db$sal3, sep="-")
para sal1 sal2 sal3 sal4 newcol
1 8 105.11 101.38 100.01 a 105.11-101.38-100.01
2 2 109.55 88.98 104.12 b 109.55-88.98-104.12
3 25 100.12 103.84 102.43 a 100.12-103.84-102.43
4 15 105.22 90.95 100.67 b 105.22-90.95-100.67
5 21 97.57 97.78 103.89 a 97.57-97.78-103.89
6 -1 101.88 100.22 88.21 b 101.88-100.22-88.21
7 12 104.20 95.26 93.72 a 104.2-95.26-93.72
8 16 106.25 100.70 94.95 b 106.25-100.7-94.95
9 24 101.36 97.91 99.67 a 101.36-97.91-99.67
Actually the way i want:
colnam<-c("sal1","sal2","sal3")
colnameful<-paste0("db$",colnam)
paste using variable colnameful
Something like below i tried:
paste(colnameful, sep="-") # Useless
Please suggest, Thanks in advance.
Maybe something like:
colnam<-c("sal1","sal2","sal3")
db[[paste(colnam,collapse = "-")]] <- with(db,paste(sal1,sal2,sal3,sep = "-"))
...though you may find that you want to round the sal1, etc values before pasting them together.
If you only have the columns in a character vector you could try this instead of the with() piece:
do.call(paste,c(db[,colnam],sep = "-"))
EDIT:
Thanks to #joran and #Gregor (from the comments) who helped me understand that it is not good practice to use eval(parse(text = '')) . Also the function which I wrote was unnecessarily over complicated. So just found out a simple solution with only one loop (apply)
apply(db[colnam], 1, paste, collapse = "-")
Original Answer :
Based on the output showed , it seems you want to paste sal1 , sal2 and sal3 together but using colnam
colnam<-c("sal1","sal2","sal3")
You can use eval(parse( combination in sapply to get the values and then apply row-wise computation to paste the columns together
apply(sapply(paste0("db$",colnam), function(x) eval(parse(text = x))), 1,
function(x) paste(x, collapse = "-"))
Supose I have this variables:
data <- data.frame(x=rnorm(10), y=rnorm(10))
form <- 'z = x*y'
How can I compute z (using data's variables) and add as a new variable to data?
I tried with parse() and eval() (base on an old question), but without success :/
Given what #Nico said is correct you might do:
d1 <- within(data, eval(parse(text=form)) )
d1
x y z
1 0.5939462 1.58683345 0.94249368
2 0.3329504 0.55848643 0.18594826
3 1.0630998 -1.27659221 -1.35714497
4 -0.3041839 -0.57326541 0.17437812
5 0.3700188 -1.22461261 -0.45312970
6 0.2670988 -0.47340064 -0.12644474
7 -0.5425200 -0.62036668 0.33656135
8 1.2078678 0.04211587 0.05087041
9 1.1604026 -0.91092165 -1.05703586
10 0.7002136 0.15802877 0.11065390
transform() is the easy way if using this interactively:
data <- data.frame(x=rnorm(10), y=rnorm(10))
data <- transform(data, z = x * y)
R> head(data)
x y z
1 -1.0206 0.29982 -0.30599
2 -1.6985 1.51784 -2.57805
3 0.8940 1.19893 1.07187
4 -0.3672 -0.04008 0.01472
5 0.5266 -0.29205 -0.15381
6 0.2545 -0.26889 -0.06842
You can't do this using form though, but within(), which is similar to transform(), does allow this, e.g.
R> within(data, eval(parse(text = form)))
x y z
1 -0.8833 -0.05256 0.046428
2 1.6673 1.61101 2.686115
3 1.1261 0.16025 0.180453
4 0.9726 -1.32975 -1.293266
5 -1.6220 -0.51079 0.828473
6 -1.1981 2.62663 -3.147073
7 -0.3596 -0.01506 0.005416
8 -0.9700 0.21865 -0.212079
9 1.0626 1.30377 1.385399
10 -0.8020 -1.04639 0.839212
though it involves some amount of jiggery-pokery with the language which to my mind is not elegant. Effectively, you are doing something like this:
R> eval(eval(parse(text = form), data), data, parent.frame())
[1] 0.046428 2.686115 0.180453 -1.293266 0.828473 -3.147073 0.005416
[8] -0.212079 1.385399 0.839212
(and assigning the result to the named component in data.)
Does form have to come like this, as a character string representing some expression to be evaluated?
I have a set of hospital admission data that I need to process, I am stuck when trying to loop the data and pick up the stuff I need, here is the example:
Date Ward
1 A
2 A
3 A
4 A B
5 A
6 A
7 A C
8 C
9 C
10 C
And I need them to be transformed into:
Ward Adm_Date Dis_Date
A 1 4
B 4 4
A 4 7
C 7 10
To put it in sentence, this is a admission record patient X who:
go to ward A from day 1 to day 4
go to ward B (maybe it's an ICU ward) for less than a day in day 4, and move back to ward A on that day
stay in ward A from day 4 to day 7
move to ward C from ward A from day 7 and stay in ward C till day 10
I am thinking of using ddply by filtering the ward but it is not OK since B will be "omitted" and the period of time for A is not broken down into 2 pieces.
Any suggestions? Thanks!
dat <- data.frame(Date=1:10,Ward=c(rep("A",3),"A B",rep("A",2),"A C",rep("C",3)))
dat$Ward <- as.character(dat$Ward)
# Change data to a "long" format
Date2 <- rep(dat$Date,nchar(gsub(" ","",dat$Ward)))
Ward2 <- unlist(strsplit(dat$Ward," "))
dat2 <- data.frame(Date=Date2,Ward=Ward2)
dat2$Ward <- as.character(dat2$Ward) # pesky factors!
# Create output
Ward3 <- unlist(strsplit(gsub("(\\w)\\1+","\\1",paste(dat2$Ward,collapse="")),""))
#helper function to find lengths of repeated characters, probably a better way of doing this
repCharLength <- function(str)
{
out <- numeric(0)
tmp <- 1
for (i in 2:length(str))
{
if (str[i]!=str[i-1])
{out<-c(out,tmp)
tmp<-1}
else
tmp <- tmp+1
}
return(c(out,tmp))
}
stays <- repCharLength(dat2$Ward)
Adm_Date <- c(1,dat2$Date[cumsum(stays)[1:(length(stays)-1)]])
Dis_Date <- dat2$Date[cumsum(stays)]
dat3 <- data.frame(Ward=Ward3,Adm_Date=Adm_Date,Dis_Date=Dis_Date)
> dat3
Ward Adm_Date Dis_Date
1 A 1 4
2 B 4 4
3 A 4 7
4 C 7 10
A bit more involved than I first thought, and there is probably a better way to get the stay lengths than using the helper function I wrote, but this seems to do the job.
Edit
In light of Spacedman's comment, there is a library function to calculate Ward3 and stays:
Ward3 <- rle(dat2$Ward)$values
stays <- rle(dat2$Ward)$lengths
It's not a complex answer but you can transform your data
X <- data.frame(
Date=1:10,
Ward=c("A","A","A","A B","A","A","A C","C","C","C"),
stringsAsFactors=FALSE
)
w <- strsplit(X$Ward," +")
n <- sapply(w, length)
X_mod <- data.frame(
Date = rep(X$Date, n),
Ward = unlist(w, FALSE, FALSE)
)
With X_mod you could write vectorized (=fast) solution. For start with(X_mod, c(0,cumsum(Ward[-1]!=Ward[-length(Ward)]))) gives you id of visit.