I can solve my problem, but only in a far-from-optimal way, so the solution I have does not scale to a large df. Let me explain.
I have a big data frame and I need to create new columns by subtracting one column from another. Let me show you using a simple df.
A<-rnorm(10)
B<-rnorm(10)
C<-rnorm(10)
D<-rnorm(10)
E<-rnorm(10)
F<-rnorm(10)
library(dplyr)
df1<-tibble(A,B,C,D,E,F) # data_frame() is deprecated in favour of tibble()
# A tibble: 10 x 6
A B C D E F
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -2.8750025 0.4685855 2.4435767 1.6999761 -1.3848386 -0.58992249
2 0.2551404 1.8555876 0.8365116 -1.6151186 -1.7754623 0.04423463
3 0.7740396 -1.0756147 0.6830024 -2.3879337 -1.3165875 -1.36646493
4 0.2059932 0.9322016 1.2483196 -0.1787840 0.3546773 -0.12874831
5 -0.4561725 -0.1464692 -0.7112905 0.2791592 0.5835127 0.16493237
6 1.2401795 -1.1422917 -0.6189480 -1.4975416 0.5653565 -1.32575021
7 -1.6173618 0.2283430 0.6154920 0.6082847 0.0273447 0.16771783
8 0.3340799 -0.5096500 -0.5270123 -0.2814217 -2.3732234 0.27972188
9 -0.4841361 0.1651265 0.0296500 0.4324903 -0.3895971 -2.90426195
10 -2.7106357 0.5496335 0.3081533 -0.3083264 -0.1341055 -0.17927807
I need (i) to subtract pairs of columns that are the same distance apart: D-A, E-B, F-C, while (ii) giving each new column a name based on the names of the original variables.
I did it this way and it works:
df2<-df1 %>%
transmute (!!paste0("diff","D","A") := D-A,
!!paste0("diff","E","B") := E-B,
!!paste0("diff","F","C") := F-C)
# A tibble: 10 x 3
diffDA diffEB diffFC
<dbl> <dbl> <dbl>
1 4.5749785 -1.8534241 -3.0334991
2 -1.8702591 -3.6310500 -0.7922769
3 -3.1619734 -0.2409728 -2.0494674
4 -0.3847772 -0.5775242 -1.3770679
5 0.7353317 0.7299819 0.8762229
6 -2.7377211 1.7076482 -0.7068022
7 2.2256465 -0.2009983 -0.4477741
8 -0.6155016 -1.8635734 0.8067342
9 0.9166264 -0.5547236 -2.9339120
10 2.4023093 -0.6837390 -0.4874314
However, I have many columns and I would like to find a way to make the code simpler. I tried many things (like mutate_all, mutate_at or add_columns) but nothing works...
OK, here's a method that will work for the full width of your data set.
df1 <- tibble(A = rnorm(10),
B = rnorm(10),
C = rnorm(10),
D = rnorm(10),
E = rnorm(10),
F = rnorm(10),
G = rnorm(10),
H = rnorm(10),
I = rnorm(10))
ct <- seq_len(ncol(df1) - 3) # stop three columns short so that df1[[i+3]] stays in range
diff_tbl <- tibble(testcol = rnorm(10))
for (i in ct) {
new_tbl <- tibble(col = df1[[i+3]] - df1[[i]])
names(new_tbl)[1] <- paste('diff',colnames(df1[i+3]),colnames(df1[i]),sep='')
diff_tbl <- bind_cols(diff_tbl,new_tbl)
}
diff_tbl <- diff_tbl %>%
select(-testcol)
df1 <- bind_cols(df1,diff_tbl)
Basically, what you are doing is creating a second dummy tibble to compute the differences, iterating over the possible differences (i.e. gaps of three columns) then assembling them into a single tibble, then binding those columns to the original tibble. As you can see, I extended df1 by three extra columns and the whole thing worked like a charm.
It's probable that there's a more elegant way to do this, but this method definitely works. There's one slightly awkward thing in that I had to create the diff_tbl with a dummy column and then remove it before the final bind_cols() call, but it's not a major thing, I think.
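One possible way around the dummy column, sketched here in base R on the assumption that the paired columns are always a fixed gap of 3 apart: compute the differences as a named list first, then bind once. The names `gap`, `idx`, `diffs` and `out` are illustrative, and `set.seed()` is only there to make the toy data reproducible.

```r
set.seed(42)

# Toy data: nine numeric columns named A..I (assumed layout)
df1 <- as.data.frame(replicate(9, rnorm(10)))
names(df1) <- LETTERS[1:9]

gap <- 3                          # assumed distance between paired columns
idx <- seq_len(ncol(df1) - gap)   # positions of the columns being subtracted

# One difference vector per pair: column (i + gap) minus column i
diffs <- lapply(idx, function(i) df1[[i + gap]] - df1[[i]])
names(diffs) <- paste0("diff", names(df1)[idx + gap], names(df1)[idx])

# Bind all new columns in a single step, no placeholder column needed
out <- cbind(df1, as.data.frame(diffs))
names(out)
```

Building the list first means the result is assembled once instead of growing inside the loop, which tends to matter more as the number of columns increases.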
You could divide the data frame in two parts and do
inds <- ncol(df1)/2
df1[paste0("diff", names(df1[(inds + 1):ncol(df1)]), names(df1[1:inds]))] <-
df1[(inds + 1):ncol(df1)] - df1[1:inds]
Note that column names with dashes in them are non-syntactic (they must be quoted with backticks when referenced) and are not recommended.
result = df1[4:6] - df1[1:3]
names(result) = paste(names(df1)[4:6], names(df1)[1:3], sep = "-")
result
# D-A E-B F-C
# 1 0.12459065 0.05855622 0.6134559
# 2 -2.65583389 0.26425762 0.8344115
# 3 -1.48761765 -3.13999402 1.3008065
# 4 -4.37469763 1.37551178 1.3405191
# 5 1.01657135 -0.90690359 1.5848562
# 6 -0.34050959 -0.57687686 -0.3794937
# 7 0.85233808 0.57911293 -0.8896393
# 8 0.01931559 0.91385740 3.2685647
# 9 -0.62012982 -2.34166712 -0.4001903
# 10 -2.21764146 0.05927664 0.3965072
I have this accelerometer dataset and, let's say that I have some n number of observations for each subject (30 subjects total) for body-acceleration x time.
I want to make a plot so that it plots these body acceleration x time points for each subject in a different color on the y axis and the x axis is just an index. I tried this:
ggplot(data = filtered_data_walk, aes(x = seq_along(filtered_data_walk$'body-acceleration-mean-y-time'), y = filtered_data_walk$'body-acceleration-mean-y-time')) +
geom_line(aes(color = filtered_data_walk$subject))
But the problem is that it doesn't superimpose the 30 lines; instead, they run alongside each other. In other words, I end up with n1 + n2 + n3 + ... + n30 index points on the x axis, instead of max{n1, n2, ..., n30}. This is my first time posting, so I hope this makes sense (I know my formatting is bad).
One solution I thought of was to create a new variable which gives a value of 1 to n for all the observations of each subject. So, for example, if I had 6 observations for subject1, 4 observations for subject2, and 9 observations for subject3, this new variable would be sequenced like:
1 2 3 4 5 6 1 2 3 4 1 2 3 4 5 6 7 8 9
Is there an easy way to do this? Please help, ty.
Assuming your data is formatted as a data.frame or matrix, for a toy dataset like
x <- data.frame(replicate(5, rnorm(10)))
x
# X1 X2 X3 X4 X5
# 1 -1.36452272 -1.46446475 2.0444381 0.001585876 -1.1085990
# 2 -1.41303046 -0.14690269 1.6179084 -0.310162018 -1.5528733
# 3 -0.15319554 -0.18779791 -0.3005058 0.351619212 1.6282955
# 4 -0.38712167 -0.14867239 -1.0776359 0.106694311 -0.7065382
# 5 -0.50711166 -0.95992916 1.3522922 1.437085757 -0.7921355
# 6 -0.82377208 0.50423328 -0.5366513 -1.315263679 1.0604499
# 7 -0.01462037 -1.15213287 0.9910678 0.372623508 1.9002438
# 8 1.49721113 -0.84914197 0.2422053 0.337141898 1.2405208
# 9 1.95914245 -1.43041783 0.2190829 -1.797396822 0.4970690
# 10 -1.75726827 -0.04123615 -0.1660454 -1.071688768 -0.3331887
...you might be able to get there with something like
plot(x[,1], type='l', xlim=c(1, nrow(x)), ylim=c(min(x), max(x)))
for(i in 2:ncol(x)) lines(x[,i], col=i)
You could play with formatting some more, of course, do things with lty= and lwd= and maybe a color ramp of your own choosing, etc.
If your data is in the format below...
x <- data.frame(id=c("A","A","A","B","B","B","B","C","C"), acc=rnorm(9))
x
# id acc
# 1 A 0.1796964
# 2 A 0.8770237
# 3 A -2.4413527
# 4 B 0.9379746
# 5 B -0.3416141
# 6 B -0.2921062
# 7 B 0.1440221
# 8 C -0.3248310
# 9 C -0.1058267
...you could get there with
maxn <- max(with(x, tapply(acc, id, length)))
ids <- sort(unique(x$id))
plot(x$acc[x$id==ids[1]], type='l', xlim=c(1,maxn), ylim=c(min(x$acc),max(x$acc)))
for(i in 2:length(ids)) lines(x$acc[x$id==ids[i]], col=i)
Hope this helps, and that I interpreted your problem right--
That's pretty quick to do if you are OK with using dplyr. group_by to enforce a separate counter for each subject, mutate to add the actual counter, and your ggplot should work. Example with iris dataset:
library(dplyr)
library(ggplot2)
group_by(iris, Species) %>%
mutate(index = seq_along(Petal.Length)) %>%
ggplot() + geom_line(aes(x=index, y=Petal.Length, color=Species))
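If adding a dependency is not wanted, the same per-subject counter can probably be built in base R with `ave()`. This is a minimal sketch on made-up data: the `subject` vector below is hypothetical and simply mirrors the 6/4/9 example from the question.

```r
# Hypothetical subject labels: 6, 4 and 9 observations respectively
subject <- rep(c("subject1", "subject2", "subject3"), times = c(6, 4, 9))

# ave() applies FUN within each group and returns the results in the
# original row order, so this restarts the counter for every subject
index <- ave(seq_along(subject), subject, FUN = seq_along)
index
```

The resulting `index` can then be stored as a column and mapped to the x aesthetic, with the subject mapped to colour.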
For the graph below, is it possible to create the same graph with fewer lines of code? Since each of Figs. A-D has different label settings, I have to write the settings for each figure separately, which makes the code longer.
The graph below is produced from the data using the pdf device.
Any help with these issues is highly appreciated (newbie to R!). Since all the code is too long to post here, I have posted only the part relevant to the problem, for Fig. C:
#FigC
label1=c(0,100,200,300)
plot(data$TimeVariable2C,data$Variable2C,axes=FALSE,ylab="",xlab="",xlim=c(0,24),
ylim=c(0,2.4),xaxs="i",yaxs="i",pch=19)
lines(data$TimeVariable3C,data$Variable3C)
axis(2,tick=T,at=seq(0.0,2.4,by=0.6),label= seq(0.0,2.4,by=0.6))
axis(1,tick=T,at=seq(0,24,by=6),label=seq(0,24,by=6))
mtext("(C)",side=1,outer=F,line=-10,adj=0.8)
minor.tick(nx=5,ny=5)
par(new=TRUE)
plot(data$TimeVariable1C,data$Variable1C,axes=FALSE,xlab="",ylab="",type="l",
ylim=c(800,0),xaxs="i",yaxs="i")
axis(3,xlim=c(0,24),tick=TRUE,at= seq(0,24,by=6),label=seq(0,24,by=6),col.axis="violetred4",col="violetred4")
axis(4,tick=TRUE,at= label1,label=label1,col.axis="violetred4",col="violetred4")
polygon(data$TimeVariable1C,data$Variable1C,col='violetred4',border=NA)
You ask many questions in the same post. I will try to answer just one: how to simplify your code, or rather how to call it once for each letter. I think it is better to put your data in the long format. For example, this will create a list of 4 elements:
ll <- lapply(LETTERS[1:4],function(let){
dat.let <- dat[,grepl(let,colnames(dat))]
dd <- reshape(dat.let,direction ='long',
v.names=c('TimeVariable','Variable'),
varying=1:6)
dd$time <- factor(dd$time)
dd$Type <- let
dd
}
)
ll is a list of 4 data.frames, each of which looks like:
head(ll[[1]])
time TimeVariable Variable id Type
1.1 1 0 0 1 A
2.1 1 0 5 2 A
3.1 1 8 110 3 A
4.1 1 16 0 4 A
5.1 1 NA NA 5 A
6.1 1 NA NA 6 A
Then you can use it like this for example :
library(Hmisc)
layout(matrix(1:4, 2, 2, byrow = TRUE))
lapply(ll,function(data){
label1=c(0,100,200,300)
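Type <- unique(data$Type) # use the function argument, not the global dat
dat <- subset(data,time==2)
x.mm <- max(dat$TimeVariable,na.rm=TRUE) # upper x limit comes from the time values
plot(dat$TimeVariable,dat$Variable,axes=FALSE,ylab="",xlab="",xlim=c(0,x.mm),
ylim=c(0,2.4),xaxs="i",yaxs="i",pch=19)
dat <- subset(data,time==3) # the line uses the third series, as in the question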
lines(dat$TimeVariable,dat$Variable)
axis(2,tick=T,at=seq(0.0,2.4,by=0.6),label= seq(0.0,2.4,by=0.6))
axis(1,tick=T,at=seq(0,x.mm,by=6),label=seq(0,x.mm,by=6))
mtext(Type,side=1,outer=F,line=-10,adj=0.8)
minor.tick(nx=5,ny=5)
par(new=TRUE)
dat <- subset(data,time==1)
plot(dat$TimeVariable,dat$Variable,axes=FALSE,xlab="",ylab="",type="l",
ylim=c(800,0),xaxs="i",yaxs="i")
axis(3,xlim=c(0,24),tick=TRUE,at= seq(0,24,by=6),label=seq(0,24,by=6),col.axis="violetred4",col="violetred4")
axis(4,tick=TRUE,at= label1,label=label1,col.axis="violetred4",col="violetred4")
polygon(dat$TimeVariable,dat$Variable,col='violetred4',border=NA)
})
Another advantage of the long data format is that you can use ggplot2 and facet_wrap, for example.
## transform your data to a data.frame
dat.l <- do.call(rbind,ll)
library(ggplot2)
ggplot(subset(dat.l,time !=1)) +
geom_line(aes(x=TimeVariable,y=Variable,group=time,color=time))+
geom_polygon(data=subset(dat.l,time ==1),
aes(x=TimeVariable,y=60-Variable/10,fill=Type))+
geom_line(data=subset(dat.l,time ==1),
aes(x=TimeVariable,y=Variable,fill=Type))+
facet_wrap(~Type,scales='free')
I suspect I'm Doing It Wrong, but I'd like to pass a character vector as an argument to a function in ddply. There's a lot of Q&A on removing quotes, etc. but none of it seems to work for me (eg. Remove quotes from a character vector in R and http://r.789695.n4.nabble.com/Pass-character-vector-to-function-argument-td3045226.html).
# reproducible data
df1<-data.frame(a=sample(1:50,10),b=sample(1:50,10),c=sample(1:50,10),d=(c("a","b","c","a","a","b","b","a","c","d")))
df2<-data.frame(a=sample(1:50,9),b=sample(1:50,9),c=sample(1:50,9),d=(c("e","f","g","e","e","f","f","e","g")))
df3<-data.frame(a=sample(1:50,8),b=sample(1:50,8),c=sample(1:50,8),d=(c("h","i","j","h","h","i","i","h")))
#make a list
list.1<-list(df1=df1,df2=df2,df3=df3)
# desired output
lapply(list.1, function(x) ddply(x, .(d), function(x) data.frame(am=mean(x$a), bm=mean(x$b), cm=mean(x$c))))
$df1
d am bm cm
1 a 31.00000 29.25000 18.50000
2 b 31.66667 24.33333 34.66667
3 c 18.50000 5.50000 24.50000
4 d 36.00000 39.00000 43.00000
$df2
d am bm cm
1 e 18.25000 32.50000 18
2 f 27.66667 41.33333 24
3 g 25.00000 7.50000 42
$df3
d am bm cm
1 h 36.00000 25.00000 20.50000
2 i 25.33333 37.33333 24.33333
3 j 32.00000 32.00000 46.00000
But my actual use-case has many new columns and different types of calculations that I want to calculate in the ddply function. So I want to do something like:
# here's a simple version of a function that I want to send to ddply
func <- "am=mean(x$a), bm=mean(x$b), cm=mean(x$c)"
# here's how I imagine it might work
lapply(list.1, function(x) ddply(x, .(d), function(x) data.frame(func)) )
# not the desired outcome...
$df1
d func
1 a am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
2 b am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
3 c am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
4 d am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
$df2
d func
1 e am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
2 f am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
3 g am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
$df3
d func
1 h am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
2 i am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
3 j am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
I've tried noquote, deparse, eval(as.symbol()), do.call(data.frame, ...) and some of the methods here: https://github.com/hadley/devtools/wiki/Evaluation on func to no avail. The solution might be obvious at this point (ie. melt everything!), but in case it's not, here's a longer example that's closer to my use case:
# sample data
s <- 23 # number of samples
r <- 10 # number of runs per sample
el <- 17 # number of elements
mydata <- data.frame(ID = unlist(lapply(LETTERS[1:s], function(x) rep(x, r))),
run = rep(1:r, s))
# insert fake element data
mydata[letters[1:el]] <- lapply(1:el, function(i) rnorm(s*r, runif(1)*i^2))
# generate all combinations of 5 runs from ten runs
su <- 5 # number of runs to sample from ten runs
idx <- combn(unique(mydata$run), su)
# RSE function
RSE <- function(x) {100*( (sd(x)/sqrt(length(x)))/mean(x) )}
# make a list of dfs for all samples for each combination of five runs
# to prepare to calculate RSEs
combys1 <- lapply(1:ncol(idx), function(i) mydata[mydata$run %in% idx[,i],] )
# make a list of dfs with RSE for each ID, for each combination of runs
combys2 <- lapply(1:length(combys1), function(i) ddply(combys1[[i]], "ID", summarise, RSEa=RSE(a), RSEb=RSE(b), RSEc=RSE(c), meana=mean(a), meanb=mean(b), meanc=mean(c)))
I want to replace RSEa=RSE(a), RSEb=RSE(b), RSEc=RSE(c), meana=mean(a), meanb=mean(b), meanc=mean(c) in the last line above with the object doRSE from here, to avoid lots of typing:
# prepare to calculate new columns with RSE and means
RSEs <- sapply(3:ncol(mydata), function(j) paste0("RSE",names(mydata[j])))
RSExs <- sapply(3:ncol(mydata), function(j) paste0("RSE(",names(mydata[j]),")"))
doRSE <- paste0(sapply(1:length(RSEs), function(x) paste0(RSEs[x],"=",RSExs[x])), collapse=",", sep="")
I'm open to solutions involving base, data.table and dirty tricks. Seems like these are close to what I want, but I can't quite translate them to my problem:
Pass character argument and evaluate,
Force evaluation of multiple variables using vector of character,
Using a vector of characters that correspond to an expression as an argument to a function
UPDATE Here's the catch: I want to be able to modify the func in the simple example (or doRSE in my use-case) to create a bunch of new columns that result from various calculations on the existing columns to explore the data. I want a workflow that allows the resulting dataframes to have new columns that were not in the original dataframes. Sorry that wasn't more clear in the original question. I can't see how to adapt @Marius' answer to do this, but @mnel's is helpful (see update below)
Working through @mnel's excellent dirty tricks, with some minor fixes I can get the desired result on my use-case:
# @mnel's solution, adapted (no period before eval)
combys2 <- lapply(combys1, function(x) do.call(ddply,c(.data = quote(x),
.variables = quote(.(ID)), .fun = quote(summarize),
eval(parse(text = sprintf('.(%s)', doRSE ))))))
head(combys2)
[[1]]
ID RSEa RSEb RSEc RSEd RSEe RSEf RSEg RSEh RSEi
1 A 168.30658 21.68632 5.657228 5.048057 4.162017 2.9581874 1.849009 0.6925148 0.4393491
2 B 26.55071 26.20427 4.782578 4.385409 2.342764 2.1813874 2.719625 1.1576681 0.6427935
3 C 73.83165 14.47216 8.154435 6.273202 3.046978 1.2179457 2.811405 1.1401837 0.8167067
4 D 31.96170 57.89260 9.438220 7.388410 3.755772 0.8601780 3.724875 0.8358204 0.9939387
5 E 63.22537 60.35532 5.839690 11.691304 3.828430 0.9217787 4.204300 0.8217187 0.7876634
6 F 56.37635 65.37907 4.149568 5.496308 2.227544 2.1548455 2.847291 1.1956212 0.2506518
7 G 69.32232 23.63214 4.255847 7.979225 4.917660 1.6185960 3.156521 0.3265555 0.8133279
8 H 29.82015 40.74184 7.372100 7.464792 2.749862 0.6054420 4.061368 0.9973909 1.3807720
9 I 50.58114 19.53732 2.989920 9.767678 4.000249 1.7451322 1.175397 0.9952093 0.9095086
10 J 92.96462 39.77475 6.140688 10.295668 3.407726 2.4663758 3.030444 0.5743419 0.9296482
11 K 90.72381 42.25092 2.483069 6.781054 3.142082 1.8080633 2.891740 1.1996176 0.8525290
12 L -385.24547 40.81267 4.506087 8.148382 2.976488 0.8304432 2.234134 0.2108664 0.4979777
13 M 22.77743 33.98332 2.913926 8.764639 2.307293 0.8366635 3.229944 1.0003125 0.3878567
14 N 66.75163 34.16087 6.611326 13.865377 1.285522 1.3863958 4.165575 0.7379386 0.4515194
15 O 37.37188 100.57479 5.738877 5.724862 2.839638 1.1366610 3.186332 0.7383855 0.3954544
16 P 17.08913 26.62210 6.060130 4.110893 2.688908 2.6970727 1.609043 1.3860834 0.8780010
17 Q 13.96392 74.92279 5.469304 8.467638 2.974131 1.2135436 3.284564 0.6232778 1.0759226
18 R 42.59899 30.75952 4.842832 8.764158 1.874020 1.5791048 3.427342 1.4479638 0.2964455
19 S 26.03307 15.56352 6.968717 7.783876 4.439733 2.0764179 4.683080 0.7459654 1.1268772
20 T 71.57945 33.81362 7.147049 11.201551 2.128315 2.2051611 2.419805 0.2688807 1.1559635
21 U 73.93002 11.77155 7.738910 7.207041 1.478491 1.4409844 4.042419 0.5883490 0.5585716
22 V 67.93166 39.54994 5.701551 8.636122 2.472963 1.6514199 2.627965 1.0359048 0.8747136
23 W 11.23057 12.51272 7.003448 7.424559 4.102693 0.6614847 2.246305 1.3422405 0.2665246
RSEj RSEk RSEl RSEm RSEn RSEo RSEp RSEq
1 0.6366733 0.3713819 2.1993487 0.3865293 0.5436581 0.9187585 0.4344699 0.8915868
2 0.3445095 0.2932025 1.8563179 0.5397595 1.0433388 0.3533622 0.1942316 0.1941072
3 0.2720344 0.5507595 2.0305726 0.4377259 0.8589854 0.5690906 0.1397337 0.4043247
4 0.6606667 0.6769112 3.4737352 0.5674656 1.2519256 0.8718298 0.1162969 0.8287504
5 0.4620774 0.5598069 1.9236112 0.7990046 0.9832732 0.6847352 0.4070675 0.9005185
6 0.7981610 0.4005493 0.9721068 0.2770989 1.7054674 0.3110139 0.4521183 0.8740444
7 0.3969116 0.4717575 4.1341106 0.7510628 0.9998299 0.5342292 0.4319642 1.1861705
8 0.2963956 0.2652221 0.4775827 0.2617120 0.8261874 0.5266087 0.1900943 0.2350553
9 0.2609359 0.5431035 2.6478440 0.1606919 0.7407281 0.6802262 0.1802069 0.7438792
10 0.4239787 0.8753544 3.4218030 0.5467869 0.7404017 0.5581173 0.3682014 0.6361436
11 0.4188502 0.8629862 4.4181479 0.1623873 0.8018811 0.5873609 0.3592134 0.5357984
12 0.5790265 0.5009210 3.7534287 0.1933726 0.5809601 0.5777868 0.3400925 0.4783890
13 0.3562582 0.2552756 2.1393219 0.1849345 0.5796194 0.6129469 0.3363311 0.4382125
14 0.7921502 0.6147990 2.9054634 0.5852325 1.4954072 0.9983203 0.2937837 0.7654504
15 0.5840424 0.2757707 1.5695675 0.3305385 0.8712636 0.5816490 0.1985457 0.7213289
16 0.3301280 0.3008273 2.9014987 0.4540833 0.5966479 0.9042004 0.1631630 0.7262141
17 0.5882511 0.2820978 3.0652666 0.4518936 1.3168151 0.4749311 0.2244693 0.6583083
18 0.4048816 0.3708787 3.2207478 0.2603412 1.3168318 0.3318745 0.3120436 0.6210711
19 0.4425123 0.3602076 3.7609863 0.5399527 0.8302572 0.3246904 0.1952143 0.2915325
20 0.5877835 0.6339015 1.6908570 0.3223056 0.5239339 0.6607198 0.2808094 0.3697380
21 0.4454056 0.7733354 4.3433420 0.4391075 0.5503594 0.5893406 0.2262403 0.2361512
22 0.9583940 0.6365843 3.0033951 0.6507968 0.8610046 0.6363198 0.2866719 0.5736855
23 0.4969730 0.3895182 2.0021608 0.3354475 1.4398250 0.7386870 0.2458906 0.3414804
...
...
You can do some ugly computing on the language using quote and plyr::.()
Reading https://github.com/hadley/devtools/wiki/Computing-on-the-language will probably help understand whether you really want to do this.
Anyway, one approach is to use .() to create your list of arguments, relying on how summarize works, e.g.
.(am=mean(a), bm=mean(b), cm=mean(c))
and if you really wanted to use a character string
foo<- "am=mean(a), bm=mean(b), cm=mean(c)"
eval(parse(text = sprintf('.(%s)', foo )))
Use quote liberally to create your list to be passed to do.call,
for example
lapply(list.1, function(x) do.call(ddply,c(.data = quote(x),
.variables = quote(.(d)), .fun = quote(summarize),
.(am=mean(a), bm=mean(b), cm=mean(c)))))
Oh boy is that ugly.
Or, you could use data.tables
library(data.table)
listDT <- lapply(list.1, data.table)
lapply(listDT, function(x) x[,lapply(.SD, mean), by = 'd'])
or
mystuff <- sprintf('list(%s)', foo)
lapply(listDT, function(x) x[, eval(parse(text = mystuff)), by = 'd'])
However, if you had all the same columns in all your data.tables, it would be more efficient to create one large data.table (with an identifier for each element of the list) and work on that.
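The string-to-expression step used above can be tried in isolation. The following is a base R sketch only, not the plyr or data.table code itself: the data frame `toy`, the helper `summarise_group` and the string `foo_demo` are hypothetical names chosen for illustration.

```r
# Hypothetical toy data with a grouping column d
toy <- data.frame(a = c(1, 3, 5, 7),
                  b = c(2, 4, 6, 8),
                  d = c("x", "x", "y", "y"),
                  stringsAsFactors = FALSE)

# The summary expressions, held as a single character string
foo_demo <- "am = mean(a), bm = mean(b)"

# Wrap the string in list(...) and parse it into an unevaluated call
call_obj <- parse(text = sprintf("list(%s)", foo_demo))[[1]]

# Evaluate the call with one group's columns visible as variables
summarise_group <- function(part) as.data.frame(eval(call_obj, envir = part))

pieces <- lapply(split(toy, toy$d), summarise_group)
res <- data.frame(d = names(pieces), do.call(rbind, pieces), row.names = NULL)
res
```

Because the string is parsed and evaluated as code, this pattern should only be used with strings you construct yourself.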
Here's a ddply function that calculates the mean for all the columns that aren't d in your dataframes:
lapply(list.1,
function(x) {
ddply(
x,
.(d),
function(df_part) {
result_df <- data.frame(d=df_part$d[1])
non_d_cols <- colnames(df_part)[! colnames(df_part) == "d"]
for (col in non_d_cols) {
col_mean <- mean(df_part[[col]])
col_name <- paste0(col, "_mean")
result_df[[col_name]] <- col_mean
}
return(result_df)
})
})
That seems to me like the simplest way to do it, and it should generalize well to other calculations you might want to do on those columns. Maybe you could pass in a character vector argument of the columns you want to calculate the mean for, and use that in place of non_d_cols.
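The idea of passing a character vector of column names can be sketched without plyr, using base R's `aggregate()`. This swaps in a different tool from the ddply code above; `group_means`, `cols` and `by` are hypothetical names, and the toy data frame is made up.

```r
# Hypothetical helper: mean of the chosen columns within each group
group_means <- function(df, cols, by = "d") {
  # df[cols] selects the value columns, df[by] is the (named) grouping list
  out <- aggregate(df[cols], by = df[by], FUN = mean)
  # Rename the value columns so the result shows what was computed
  names(out)[names(out) %in% cols] <- paste0(cols, "_mean")
  out
}

toy <- data.frame(a = c(1, 3, 5),
                  b = c(2, 4, 6),
                  c = c(10, 20, 30),
                  d = c("x", "x", "y"),
                  stringsAsFactors = FALSE)

# Only the columns named in the character vector are summarised
res <- group_means(toy, cols = c("a", "b"))
res
```

Applied to the list from the question, this would be called as `lapply(list.1, group_means, cols = c("a", "b", "c"))`.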
I have a set of hospital admission data that I need to process. I am stuck trying to loop over the data and pick out what I need. Here is an example:
Date Ward
1 A
2 A
3 A
4 A B
5 A
6 A
7 A C
8 C
9 C
10 C
And I need them to be transformed into:
Ward Adm_Date Dis_Date
A 1 4
B 4 4
A 4 7
C 7 10
In words, this is the admission record of patient X, who:
goes to ward A from day 1 to day 4
goes to ward B (maybe it's an ICU ward) for less than a day on day 4, and moves back to ward A that same day
stays in ward A from day 4 to day 7
moves from ward A to ward C on day 7 and stays in ward C until day 10
I am thinking of using ddply and filtering by ward, but that does not work, since B would be "omitted" and the period of time for A would not be broken into 2 pieces.
Any suggestions? Thanks!
dat <- data.frame(Date=1:10,Ward=c(rep("A",3),"A B",rep("A",2),"A C",rep("C",3)))
dat$Ward <- as.character(dat$Ward)
# Change data to a "long" format
Date2 <- rep(dat$Date,nchar(gsub(" ","",dat$Ward)))
Ward2 <- unlist(strsplit(dat$Ward," "))
dat2 <- data.frame(Date=Date2,Ward=Ward2)
dat2$Ward <- as.character(dat2$Ward) # pesky factors!
# Create output
Ward3 <- unlist(strsplit(gsub("(\\w)\\1+","\\1",paste(dat2$Ward,collapse="")),""))
#helper function to find lengths of repeated characters, probably a better way of doing this
repCharLength <- function(str)
{
out <- numeric(0)
tmp <- 1
for (i in 2:length(str))
{
if (str[i]!=str[i-1])
{out<-c(out,tmp)
tmp<-1}
else
tmp <- tmp+1
}
return(c(out,tmp))
}
stays <- repCharLength(dat2$Ward)
Adm_Date <- c(1,dat2$Date[cumsum(stays)[1:(length(stays)-1)]])
Dis_Date <- dat2$Date[cumsum(stays)]
dat3 <- data.frame(Ward=Ward3,Adm_Date=Adm_Date,Dis_Date=Dis_Date)
> dat3
Ward Adm_Date Dis_Date
1 A 1 4
2 B 4 4
3 A 4 7
4 C 7 10
A bit more involved than I first thought, and there is probably a better way to get the stay lengths than using the helper function I wrote, but this seems to do the job.
Edit
In light of Spacedman's comment, there is a library function to calculate Ward3 and stays:
Ward3 <- rle(dat2$Ward)$values
stays <- rle(dat2$Ward)$lengths
It's not a complete answer, but you can transform your data
X <- data.frame(
Date=1:10,
Ward=c("A","A","A","A B","A","A","A C","C","C","C"),
stringsAsFactors=FALSE
)
w <- strsplit(X$Ward," +")
n <- sapply(w, length)
X_mod <- data.frame(
Date = rep(X$Date, n),
Ward = unlist(w, FALSE, FALSE)
)
With X_mod you could write a vectorized (= fast) solution. For a start, with(X_mod, c(0,cumsum(Ward[-1]!=Ward[-length(Ward)]))) gives you the id of each visit.
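One way the visit id could be carried through to the requested table is sketched below. It assumes, as in the expected output, that each admission date equals the previous visit's discharge date; the names `visit`, `dis`, `adm` and `res` are illustrative.

```r
X <- data.frame(
  Date = 1:10,
  Ward = c("A", "A", "A", "A B", "A", "A", "A C", "C", "C", "C"),
  stringsAsFactors = FALSE
)

# Long format: one row per (date, ward) pair
w <- strsplit(X$Ward, " +")
X_mod <- data.frame(Date = rep(X$Date, lengths(w)),
                    Ward = unlist(w),
                    stringsAsFactors = FALSE)

# Visit id: increases by one each time the ward changes
visit <- with(X_mod, c(0, cumsum(Ward[-1] != Ward[-length(Ward)])))

# Discharge date: the last date seen within each visit
dis <- as.vector(tapply(X_mod$Date, visit, max))
# Admission date: first date overall, then the previous visit's discharge
adm <- c(X_mod$Date[1], head(dis, -1))

res <- data.frame(Ward = X_mod$Ward[!duplicated(visit)],
                  Adm_Date = adm,
                  Dis_Date = dis,
                  stringsAsFactors = FALSE)
res
```

Unlike the single-character approach, splitting on whitespace should also cope with ward names longer than one letter.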