Multi-way interaction: easy way to get numerical coefficient estimates? - r

Let's say there's a 4-way interaction, with a 2x2x2 factorial design plus a continuous variable.
Factors have the default contrast coding (contr.treatment). Here's an example:
set.seed(1)
cat1 <- as.factor(sample(letters[1:2], 1000, replace = TRUE))
cat2 <- as.factor(sample(letters[3:4], 1000, replace = TRUE))
cat3 <- as.factor(sample(letters[5:6], 1000, replace = TRUE))
cont1 <- rnorm(1000)
resp <- rnorm(1000)
df <- data.frame(cat1, cat2, cat3, cont1, resp)
mod <- lm(resp ~ cont1 * cat1 * cat2 * cat3, data = df)
Looking at the output of coef(mod), we get something like:
(Intercept) cont1 cat1b
0.019822407 0.011990238 0.207604677
cat2d cat3f cont1:cat1b
-0.010132897 0.105397591 -0.001153867
cont1:cat2d cat1b:cat2d cont1:cat3f
0.023358901 -0.194991402 0.060960695
cat1b:cat3f cat2d:cat3f cont1:cat1b:cat2d
-0.240624582 -0.117278931 -0.069880751
cont1:cat1b:cat3f cont1:cat2d:cat3f cat1b:cat2d:cat3f
-0.120446848 -0.141688864 0.136945262
cont1:cat1b:cat2d:cat3f
0.201792298
And to get the estimated intercept for cat1b (for example), we would add our implicit (Intercept) term and cat1b, i.e. coef(mod)[1] + coef(mod)[3]. To get the change in slope for the same category, we would use coef(mod)[2] + coef(mod)[6], a la this r-bloggers post. It gets pretty tedious to write all of them out, and methods(class="lm") doesn't look like it has any functions that do this right out of the gate.
Is there some obvious way to get numerical estimates for the intercept and slope for each combination of factors?

You're looking for the lsmeans package. Check it out:
lstrends(mod, specs = c('cat1', 'cat2', 'cat3'), var = 'cont1')
cat1 cat2 cat3 cont1.trend SE df lower.CL upper.CL
a c e 0.01199024 0.08441129 984 -0.15365660 0.1776371
b c e 0.01083637 0.08374605 984 -0.15350502 0.1751778
a d e 0.03534914 0.09077290 984 -0.14278157 0.2134799
b d e -0.03568548 0.09644117 984 -0.22493948 0.1535685
a c f 0.07295093 0.08405090 984 -0.09198868 0.2378905
b c f -0.04864978 0.09458902 984 -0.23426916 0.1369696
a d f -0.04537903 0.09363128 984 -0.22911897 0.1383609
b d f -0.03506820 0.08905581 984 -0.20982934 0.1396929

Related

How to perform nonlinear least squares with shared parameters in R?

I would like to perform nonlinear least squares regression in R where I simultaneously minimize the squared residuals of three models (see below). Now, the three models share some of the parameters, in my example, parameters b and d.
Is there a way of doing this with either nls(), or, either packages minpack.lm or nlsr?
So, ideally, I would like to generate the objective function (the sum of least squares of all models together) and regress all parameters at once: a1, a2, a3, b, c1, c2, c3 and d.
(I am trying to avoid running three independent regressions and then perform some averaging on b and d.)
my_model <- function(x, a, b, c, d) {
a * b ^ (x - c) + d
}
# x values
x <- seq(0, 10, 0.2)
# Shared parameters
b <- 2
d <- 10
a1 <- 1
c1 <- 1
y1 <- my_model(x,
a = a1,
b = b,
c = c1,
d = d) + rnorm(length(x))
a2 <- 2
c2 <- 5
y2 <- my_model(x,
a = a2,
b = b,
c = c2,
d = d) + rnorm(length(x))
a3 <- -2
c3 <- 3
y3 <- my_model(x,
a = a3,
b = b,
c = c3,
d = d) + rnorm(length(x))
plot(
y1 ~ x,
xlim = range(x),
ylim = d + c(-50, 50),
type = 'b',
col = 'red',
ylab = 'y'
)
lines(y2 ~ x, type = 'b', col = 'green')
lines(y3 ~ x, type = 'b', col = 'blue')
Below we run nls (using a slightly modified model) and nlxb (from nlsr) but nlxb stops before convergence. Desite these problems both of these nevertheless do give results which visually fit the data well. These problems suggest that there are problems with the model itself so in the Other section, guided by the nlxb output, we show how to fix the model giving a submodel of the original model which fits the data easily with both nls and nlxb and also gives a good fit. At the end in the Notes section we provide the data in reproducible form.
nls
Assuming the setup shown reproducibly in the Note at the end, reformulate the problem for the nls plinear algorithm by defining a right hand side matrix whose columns multiply each of the linear parameters, a1, a2, a3 and d, respectively. plinear does not require starting values for those simplifying the setup. It will report them as .lin1, .lin2, .lin3 and .lin4 respectively.
To get starting values we used a simpler model with no grouping and a grid search over b from 1 to 10 and c also from 1 to 10 using nls2 in the package of the same name. We also found that nls still produced errors but by using abs in the formula, as shown, it ran to completion.
The problems with the model suggest that there is a fundamental problem with it and in the Other section we discuss how to fix it up.
xx <- c(x, x, x)
yy <- c(y1, y2, y3)
# startingi values using nls2
library(nls2)
fo0 <- yy ~ cbind(b ^ abs(xx - c), 1)
st0 <- data.frame(b = c(1, 10), c = c(1, 10))
fm0 <- nls2(fo0, start = st0, alg = "plinear-brute")
# run nls using starting values from above
g <- rep(1:3, each = length(x))
fo <- yy ~ cbind((g==1) * b ^ abs(xx - c[g]),
(g==2) * b ^ abs(xx - c[g]),
(g==3) * b ^ abs(xx - c[g]),
1)
st <- with(as.list(coef(fm0)), list(b = b, c = c(c, c, c)))
fm <- nls(fo, start = st, alg = "plinear")
plot(yy ~ xx, col = g)
for(i in unique(g)) lines(predict(fm) ~ xx, col = i, subset = g == i)
fm
giving:
Nonlinear regression model
model: yy ~ cbind((g == 1) * b^abs(xx - c[g]), (g == 2) * b^abs(xx - c[g]), (g == 3) * b^abs(xx - c[g]), 1)
data: parent.frame()
b c1 c2 c3 .lin1 .lin2 .lin3 .lin4
1.997 0.424 1.622 1.074 0.680 0.196 -0.532 9.922
residual sum-of-squares: 133
Number of iterations to convergence: 5
Achieved convergence tolerance: 5.47e-06
(continued after plot)
nlsr
With nlsr it would be done like this. No grid search for starting values was needed and adding abs was not needed either. The b and d values seem similar to the nls solution but the other coefficients differ. Visually both solutions seem to fit the data.
On the other hand from the JSingval column we see that the jacobian is rank deficient which caused it to stop and not produce SE values and the convergence is in doubt (although it may be sufficient given that visually the plot, not shown, seems like a good fit). We discuss how to fix this up in the Other section.
g1 <- g == 1; g2 <- g == 2; g3 <- g == 3
fo2 <- yy ~ g1 * (a1 * b ^ (xx - c1) + d) +
g2 * (a2 * b ^ (xx - c2) + d) +
g3 * (a3 * b ^ (xx - c3) + d)
st2 <- list(a1 = 1, a2 = 1, a3 = 1, b = 1, c1 = 1, c2 = 1, c3 = 1, d = 1)
fm2 <- nlxb(fo2, start = st2)
fm2
giving:
vn: [1] "yy" "g1" "a1" "b" "xx" "c1" "d" "g2" "a2" "c2" "g3" "a3" "c3"
no weights
nlsr object: x
residual sumsquares = 133.45 on 153 observations
after 16 Jacobian and 22 function evaluations
name coeff SE tstat pval gradient JSingval
a1 3.19575 NA NA NA 9.68e-10 4097
a2 0.64157 NA NA NA 8.914e-11 662.5
a3 -1.03096 NA NA NA -1.002e-09 234.9
b 1.99713 NA NA NA -2.28e-08 72.57
c1 2.66146 NA NA NA -2.14e-09 10.25
c2 3.33564 NA NA NA -3.955e-11 1.585e-13
c3 2.0297 NA NA NA -7.144e-10 1.292e-13
d 9.92363 NA NA NA -2.603e-12 3.271e-14
We can calculate SE's using nls2 as a second stage but this still does not address the problem with the whole lthing that the singular values suggest.
summary(nls2(fo2, start = coef(fm2), algorithm = "brute-force"))
giving:
Formula: yy ~ g1 * (a1 * b^(xx - c1) + d) + g2 * (a2 * b^(xx - c2) + d) +
g3 * (a3 * b^(xx - c3) + d)
Parameters:
Estimate Std. Error t value Pr(>|t|)
a1 3.20e+00 5.38e+05 0.0 1
a2 6.42e-01 3.55e+05 0.0 1
a3 -1.03e+00 3.16e+05 0.0 1
b 2.00e+00 2.49e-03 803.4 <2e-16 ***
c1 2.66e+00 9.42e-02 28.2 <2e-16 ***
c2 3.34e+00 2.43e+05 0.0 1
c3 2.03e+00 8.00e+05 0.0 1
d 9.92e+00 4.42e+05 0.0 1
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.959 on 145 degrees of freedom
Number of iterations to convergence: 8
Achieved convergence tolerance: NA
Other
When nls has trouble fitting a model it often suggests that there is something wrong with the model itself. Playing around with it a bit, guided by the JSingval column in nlsr output above which suggests that c parameters or d might be the problem, we find that if we fix all c parameter values to 0 then the model is easy to fit given sufficiently good starting values and it still gives a low residual sum of squares.
library(nls2)
fo3 <- yy ~ cbind((g==1) * b ^ xx, (g==2) * b ^ xx, (g==3) * b ^ xx, 1)
st3 <- coef(fm0)["b"]
fm3 <- nls(fo3, start = st3, alg = "plinear")
giving:
Nonlinear regression model
model: yy ~ cbind((g == 1) * b^xx, (g == 2) * b^xx, (g == 3) * b^xx, 1)
data: parent.frame()
b .lin1 .lin2 .lin3 .lin4
1.9971 0.5071 0.0639 -0.2532 9.9236
residual sum-of-squares: 133
Number of iterations to convergence: 4
Achieved convergence tolerance: 1.67e-09
which the following anova indicates is comparable to fm from above despite having 3 fewer parameters:
anova(fm3, fm)
giving:
Analysis of Variance Table
Model 1: yy ~ cbind((g == 1) * b^xx, (g == 2) * b^xx, (g == 3) * b^xx, 1)
Model 2: yy ~ cbind((g == 1) * b^abs(xx - c[g]), (g == 2) * b^abs(xx - c[g]), (g == 3) * b^abs(xx - c[g]), 1)
Res.Df Res.Sum Sq Df Sum Sq F value Pr(>F)
1 148 134
2 145 133 3 0.385 0.14 0.94
We can redo fm3 using nlxb like this:
fo4 <- yy ~ g1 * (a1 * b ^ xx + d) +
g2 * (a2 * b ^ xx + d) +
g3 * (a3 * b ^ xx + d)
st4 <- list(a1 = 1, a2 = 1, a3 = 1, b = 1, d = 1)
fm4 <- nlxb(fo4, start = st4)
fm4
giving:
nlsr object: x
residual sumsquares = 133.45 on 153 observations
after 24 Jacobian and 33 function evaluations
name coeff SE tstat pval gradient JSingval
a1 0.507053 0.005515 91.94 1.83e-132 8.274e-08 5880
a2 0.0638554 0.0008735 73.11 4.774e-118 1.26e-08 2053
a3 -0.253225 0.002737 -92.54 7.154e-133 -4.181e-08 2053
b 1.99713 0.002294 870.6 2.073e-276 -2.55e-07 147.5
d 9.92363 0.09256 107.2 3.367e-142 -1.219e-11 10.26
Note
The assumed input below is the same as in the question except we additionally
set the seed to make it reproducible.
set.seed(123)
my_model <- function(x, a, b, c, d) a * b ^ (x - c) + d
x <- seq(0, 10, 0.2)
b <- 2; d <- 10 # shared
a1 <- 1; c1 <- 1
y1 <- my_model(x, a = a1, b = b, c = c1, d = d) + rnorm(length(x))
a2 <- 2; c2 <- 5
y2 <- my_model(x, a = a2, b = b, c = c2, d = d) + rnorm(length(x))
a3 <- -2; c3 <- 3
y3 <- my_model(x, a = a3, b = b, c = c3, d = d) + rnorm(length(x))
I'm not sure this is really the best way, but you could minimize the sum of the squared residuals using optim().
#start values
params <- c(a1=1, a2=1, a3=1, b=1, c1=1, c2=1, c3=1,d=1)
# minimize total sum of squares of residuals
fun <- function(p) {
sum(
(y1-my_model(x, p["a1"], p["b"], p["c1"], p["d"]))^2 +
(y2-my_model(x, p["a2"], p["b"], p["c2"], p["d"]))^2 +
(y3-my_model(x, p["a3"], p["b"], p["c3"], p["d"]))^2
)
}
out <- optim(params, fun, method="BFGS")
out$par
# a1 a2 a3 b c1 c2 c3
# 0.8807542 1.0241804 -2.8805848 1.9974615 0.7998103 4.0030597 3.5184600
# d
# 9.8764917
And we can add the plots on top of the image
curve(my_model(x, out$par["a1"], out$par["b"], out$par["c1"], out$par["d"]), col="red", add=T)
curve(my_model(x, out$par["a2"], out$par["b"], out$par["c2"], out$par["d"]), col="green", add=T)
curve(my_model(x, out$par["a3"], out$par["b"], out$par["c3"], out$par["d"]), col="blue", add=T)

How to get the results from nls

My data are as follows:
df<-read.table (text=" cup size
12 30.76923077
13 38.46153846
14 53.84615385
15 76.92307692
16 92.30769231
17 100", header=TRUE)
I have used the following codes to get the results
y=as.vector(df$cup)
x=as.vector(df$size)
fit <- nls(1/y ~ 1 + exp(-b*(x-c)), start = list(b = 0.01, c = 12), alg = "plinear")
I want to extract values b, c, .lin separately
like this:
b= 0.01552
c= 5.98258
lin= 0.04887
I have used fit[[2]], but it does not work for me.
This should work:
df<-read.table (text=" cup size
12 30.76923077
13 38.46153846
14 53.84615385
15 76.92307692
16 92.30769231
17 100", header=TRUE)
y=as.vector(df$cup)
x=as.vector(df$size)
fit <- nls(1/y ~ 1 + exp(-b*(x-c)), start = list(b = 0.01, c = 12), alg = "plinear")
myvector <- coef(fit)
b <- myvector[1]
c <- myvector[2]
lin <- myvector[3]
As suggested by #user20650, coef is the function used to extract the values from the model.
If you refer to the source code you'll see it's hidden in the m element in the function getAllPars()
fit$m$getAllPars()
b c .lin
0.01552340 5.98262729 0.04887368

How to compare boxplots in R?

I have three different datasets with 3 groups. A, B and C.
each group has 3 columns:
A:
a , b and c
B:
d , e and f
C:
g, h and i
So,
Here I tried to demonstrate my data as an example:
A:
a <- seq(from=-1, to=4, by=0.5); b <- seq(from=-1, to=8, by=0.5); c <- seq(from=-2, to=3, by=0.5)
B:
d <- seq(from=-4, to=1, by=0.5); e <- seq(from=-3, to=2, by=0.5); f <- seq(from=-1, to=5, by=0.5)
C:
g <- seq(from=-1, to=1, by=0.5); h <- seq(from=-2, to=3.5, by=0.5); i <- seq(from=-4, to=0.5, by=0.5)
and I've generated boxplots for them.
boxplot(a,b,c, col=c(4:6))
boxplot(d,e,f, col=c(4:6))
boxplot(g,h,i, col=c(4:6))
A:
B:
C:
my question is, how can I compare group 1 ('darkblue' a,d and g) in all three boxplots to each other to see how they are correlated to each other?
and do the same for all group 2 ('lightblue' b, e and h), and group 3 ('pink' c, f and i).
I've tried to use Tukey's test with this (for first 5 cases of each group):
TukeyHSD(aov(lm(a[1:5] ~ b[1:5]+c[1:5])),ordered=T,conf.level=0.95)
but I got an error:
Error in TukeyHSD.aov(aov(lm(a[1:5] ~ b[1:5] + c[1:5])), ordered = T, :
no factors in the fitted model
In addition: Warning message:
In replications(paste("~", xx), data = mf) : non-factors ignored: b[1:5]
I know how to make factors by using cut with:
cut(a, seq(from=-5, to=4, by=0.2))
my problem is that I do not want to give a range to my data; instead, I want to compare the whole data in group 1 in the first boxplot to the other two boxplots.
Thanks in advance,

Using split function in R

I am trying to simulate three small datasets, which contains x1,x2,x3,x4, trt and IND.
However, when I try to split simulated data by IND using "split" in R I get Warning messages and outputs are correct. Could someone please give me a hint what I did wrong in my R code?
# Step 2: simulate data
Alpha = 0.05
S = 3 # number of replicates
x = 8 # number of covariates
G = 3 # number of treatment groups
N = 50 # number of subjects per dataset
tot = S*N # total subjects for a simulation run
# True parameters
alpha = c(0.5, 0.8) # intercepts
b1 = c(0.1,0.2,0.3,0.4) # for pi_1 of trt A
b2 = c(0.15,0.25,0.35,0.45) # for pi_2 of trt B
b = c(1.1,1.2,1.3,1.4);
##############################################################################
# Scenario 1: all covariates are independent standard normally distributed #
##############################################################################
set.seed(12)
x1 = rnorm(n=tot, mean=0, sd=1);x2 = rnorm(n=tot, mean=0, sd=1);
x3 = rnorm(n=tot, mean=0, sd=1);x4 = rnorm(n=tot, mean=0, sd=1);
###############################################################################
p1 = exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4)/
(1+exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4) +
exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4))
p2 = exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4)/
(1+exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4) +
exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4))
p3 = 1/(1+exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4) +
exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4))
# To assign subjects to one of treatment groups based on response probabilities
tmp = function(x){sample(c("A","B","C"), 1, prob=x, replace=TRUE)}
trt = apply(cbind(p1,p2,p3),1,tmp)
IND=rep(1:S,each=N) #create an indicator for split simulated data
sim=data.frame(x1,x2,x3,x4,trt, IND)
Aset = subset(sim, trt=="A")
Bset = subset(sim, trt=="B")
Cset = subset(sim, trt=="C")
Anew = split(Aset, f = IND)
Bnew = split(Bset, f = IND)
Cnew = split(Cset, f = IND)
The warning message:
> Anew = split(Aset, f = IND)
Warning message:
In split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) :
data length is not a multiple of split variable
and the output becomes
$`2`
x1 x2 x3 x4 trt IND
141 1.0894068 0.09765185 -0.46702047 0.4049424 A 3
145 -1.2953113 -1.94291045 0.09926239 -0.5338715 A 3
148 0.0274979 0.72971804 0.47194731 -0.1963896 A 3
$`3`
[1] x1 x2 x3 x4 trt IND
<0 rows> (or 0-length row.names)
I have checked my R code several times however, I can't figure out what I did wrong. Many thanks in advance
IND is the global variable for the full data, sim. You want to use the specific one for the subset, eg
Anew <- split(Aset, f = Aset$IND)
It's a warning, not an error, which means split executed successfully, but may not have done what you wanted to do.
From the "details" section of the help file:
f is recycled as necessary and if the length of x is not a multiple of
the length of f a warning is printed. Any missing values in f are
dropped together with the corresponding values of x.
Try checking the length of your IND against the size of your dataframe, maybe.
Not sure what your goal is once you have your data split, but this sounds like a good candidate for the plyr package.
> library(plyr)
> ddply(sim, .(trt,IND), summarise, x1mean=mean(x1), x2sum=sum(x2), x3min=min(x3), x4max=max(x4))
trt IND x1mean x2sum x3min x4max
1 A 1 -0.49356448 -1.5650528 -1.016615 2.0027822
2 A 2 0.05908053 5.1680463 -1.514854 0.8184445
3 A 3 0.22898716 1.8584443 -1.934188 1.6326763
4 B 1 0.01531230 1.1005720 -2.002830 2.6674931
5 B 2 0.17875088 0.2526760 -1.546043 1.2021935
6 B 3 0.13398967 -4.8739380 -1.565945 1.7887837
7 C 1 -0.16993037 -0.5445507 -1.954848 0.6222546
8 C 2 -0.04581149 -6.3230167 -1.491114 0.8714535
9 C 3 -0.41610973 0.9085831 -1.797661 2.1174894
>
Where you can substitute summarise and its following arguments for any function that returns a data.frame or something that can be coerced to one. If lists are the target, ldply is your friend.

quick/elegant way to construct mean/variance summary table

I can achieve this task, but I feel like there must be a "best" (slickest, most compact, clearest-code, fastest?) way of doing it and have not figured it out so far ...
For a specified set of categorical factors I want to construct a table of means and variances by group.
generate data:
set.seed(1001)
d <- expand.grid(f1=LETTERS[1:3],f2=letters[1:3],
f3=factor(as.character(as.roman(1:3))),rep=1:4)
d$y <- runif(nrow(d))
d$z <- rnorm(nrow(d))
desired output:
f1 f2 f3 y.mean y.var
1 A a I 0.6502307 0.09537958
2 A a II 0.4876630 0.11079670
3 A a III 0.3102926 0.20280568
4 A b I 0.3914084 0.05869310
5 A b II 0.5257355 0.21863126
6 A b III 0.3356860 0.07943314
... etc. ...
using aggregate/merge:
library(reshape)
m1 <- aggregate(y~f1*f2*f3,data=d,FUN=mean)
m2 <- aggregate(y~f1*f2*f3,data=d,FUN=var)
mvtab <- merge(rename(m1,c(y="y.mean")),
rename(m2,c(y="y.var")))
using ddply/summarise (possibly best but haven't been able to make it work):
mvtab2 <- ddply(subset(d,select=-c(z,rep)),
.(f1,f2,f3),
summarise,numcolwise(mean),numcolwise(var))
results in
Error in output[[var]][rng] <- df[[var]] :
incompatible types (from closure to logical) in subassignment type fix
using melt/cast (maybe best?)
mvtab3 <- cast(melt(subset(d,select=-c(z,rep)),
id.vars=1:3),
...~.,fun.aggregate=c(mean,var))
## now have to drop "variable"
mvtab3 <- subset(mvtab3,select=-variable)
## also should rename response variables
Won't (?) work in reshape2. Explaining ...~. to someone could be tricky!
Here is a solution using data.table
library(data.table)
d2 = data.table(d)
ans = d2[,list(avg_y = mean(y), var_y = var(y)), 'f1, f2, f3']
I'm a bit puzzled. Does this not work:
mvtab2 <- ddply(d,.(f1,f2,f3),
summarise,y.mean = mean(y),y.var = var(y))
This give me something like this:
f1 f2 f3 y.mean y.var
1 A a I 0.6502307 0.095379578
2 A a II 0.4876630 0.110796695
3 A a III 0.3102926 0.202805677
4 A b I 0.3914084 0.058693103
5 A b II 0.5257355 0.218631264
Which is in the right form, but it looks like the values are different that what you specified.
Edit
Here's how to make your version with numcolwise work:
mvtab2 <- ddply(subset(d,select=-c(z,rep)),.(f1,f2,f3),summarise,
y.mean = numcolwise(mean)(piece),
y.var = numcolwise(var)(piece))
You forgot to pass the actual data to numcolwise. And then there's the little ddply trick that each piece is called piece internally. (Which Hadley points out in the comments shouldn't be relied upon as it may change in future versions of plyr.)
(I voted for Joshua's.) Here's an Hmisc::summary.formula solution. The advantage of this for me is that it is well integrated with the Hmisc::latex output "channel".
summary(y ~ interaction(f3,f2,f1), data=d, method="response",
fun=function(y) c(mean.y=mean(y) ,var.y=var(y) ))
#-----output----------
y N=108
+-----------------------+-------+---+---------+-----------+
| | |N |mean.y |var.y |
+-----------------------+-------+---+---------+-----------+
|interaction(f3, f2, f1)|I.a.A | 4|0.6502307|0.095379578|
| |II.a.A | 4|0.4876630|0.110796695|
snipped output to show the latex -> PDF -> png output:
#joran is spot-on with the ddply answer. Here's how I would do it with aggregate. Note that I avoid the formula interface (it is slower).
aggregate(d$y, d[,c("f1","f2","f3")], FUN=function(x) c(mean=mean(x),var=var(x)))
I'm slightly addicted to speed comparisons even though they're largely irrelevant for me in this situation ...
joran_ddply <- function(d) ddply(d,.(f1,f2,f3),
summarise,y.mean = mean(y),y.var = var(y))
joshulrich_aggregate <- function(d) {
aggregate(d$y, d[,c("f1","f2","f3")],
FUN=function(x) c(mean=mean(x),var=var(x)))
}
formula_aggregate <- function(d) {
aggregate(y~f1*f2*f3,data=d,
FUN=function(x) c(mean=mean(x),var=var(x)))
}
library(data.table)
d2 <- data.table(d)
ramnath_datatable <- function(d) {
d[,list(avg_y = mean(y), var_y = var(y)), 'f1, f2, f3']
}
library(Hmisc)
dwin_hmisc <- function(d) {summary(y ~ interaction(f3,f2,f1),
data=d, method="response",
fun=function(y) c(mean.y=mean(y) ,var.y=var(y) ))
}
library(rbenchmark)
benchmark(joran_ddply(d),
joshulrich_aggregate(d),
ramnath_datatable(d2),
formula_aggregate(d),
dwin_hmisc(d))
aggregate is fastest (even faster than data.table, which is a surprise to me, although things might be different with a bigger table to aggregate), even using the formula interface ...)
test replications elapsed relative user.self sys.self
5 dwin_hmisc(d) 100 1.235 2.125645 1.168 0.044
4 formula_aggregate(d) 100 0.703 1.209983 0.656 0.036
1 joran_ddply(d) 100 3.345 5.757315 3.152 0.144
2 joshulrich_aggregate(d) 100 0.581 1.000000 0.596 0.000
3 ramnath_datatable(d2) 100 0.750 1.290878 0.708 0.000
(Now I just need Dirk to step up and post an Rcpp solution that is 1000 times faster than anything else ...)
I find the doBy package has some very convenient functions for things like this. For example, the function ?summaryBy is quite handy. Consider:
> summaryBy(y~f1+f2+f3, data=d, FUN=c(mean, var))
f1 f2 f3 y.mean y.var
1 A a I 0.6502307 0.095379578
2 A a II 0.4876630 0.110796695
3 A a III 0.3102926 0.202805677
4 A b I 0.3914084 0.058693103
5 A b II 0.5257355 0.218631264
6 A b III 0.3356860 0.079433136
7 A c I 0.3367841 0.079487973
8 A c II 0.6273320 0.041373836
9 A c III 0.4532720 0.022779672
10 B a I 0.6688221 0.044184575
11 B a II 0.5514724 0.020359289
12 B a III 0.6389354 0.104056229
13 B b I 0.5052346 0.138379070
14 B b II 0.3933283 0.050261804
15 B b III 0.5953874 0.161943989
16 B c I 0.3490460 0.079286849
17 B c II 0.5534569 0.207381592
18 B c III 0.4652424 0.187463143
19 C a I 0.3340988 0.004994589
20 C a II 0.3970315 0.126967554
21 C a III 0.3580250 0.066769484
22 C b I 0.7676858 0.124945402
23 C b II 0.3613772 0.182689385
24 C b III 0.4175562 0.095933470
25 C c I 0.3592491 0.039832864
26 C c II 0.7882591 0.084271963
27 C c III 0.3936949 0.085758343
So the function call is simple, easy to use, and I would say, elegant.
Now, if your primary concern is speed, it seems that it would be reasonable--at least with smaller sized tasks (note that I couldn't get the ramnath_datatable function to work for whatever reason):
test replications elapsed relative user.self
4 dwin_hmisc(d) 100 0.50 2.778 0.50
3 formula_aggregate(d) 100 0.23 1.278 0.24
5 gung_summaryBy(d) 100 0.34 1.889 0.35
1 joran_ddply(d) 100 1.34 7.444 1.32
2 joshulrich_aggregate(d) 100 0.18 1.000 0.19
I've came accross with this question and found the benchmarks are done with small tables, so it's hard to tell which method is better with 100 rows.
I've also modified the data a bit also to make it "unsorted", this would be a more common case, for example as the data is in a DB.
I've added a few more data.table trials to see if setting a key is faster beforehand. It seems here, setting the key beforehand doesn't improve much the performance, so ramnath solution seems to be the fastest.
set.seed(1001)
d <- data.frame(f1 = sample(LETTERS[1:3], 30e5, replace = T), f2 = sample(letters[1:3], 30e5, replace = T),
f3 = sample(factor(as.character(as.roman(1:3))), 30e5, replace = T), rep = sample(1:4, replace = T))
d$y <- runif(nrow(d))
d$z <- rnorm(nrow(d))
str(d)
require(Hmisc)
require(plyr)
require(data.table)
d2 = data.table(d)
d3 = data.table(d)
# Set key of d3 to compare how fast it is if the DT is already keyded
setkey(d3,f1,f2,f3)
joran_ddply <- function(d) ddply(d,.(f1,f2,f3),
summarise,y.mean = mean(y),y.var = var(y))
formula_aggregate <- function(d) {
aggregate(y~f1*f2*f3,data=d,
FUN=function(x) c(mean=mean(x),var=var(x)))
}
ramnath_datatable <- function(d) {
d[,list(avg_y = mean(y), var_y = var(y)), 'f1,f2,f3']
}
key_agg_datatable <- function(d) {
setkey(d2,f1,f2,f3)
d[,list(avg_y = mean(y), var_y = var(y)), 'f1,f2,f3']
}
one_key_datatable <- function(d) {
setkey(d2,f1)
d[,list(avg_y = mean(y), var_y = var(y)), 'f1,f2,f3']
}
including_3key_datatable <- function(d) {
d[,list(avg_y = mean(y), var_y = var(y)), 'f1,f2,f3']
}
dwin_hmisc <- function(d) {summary(y ~ interaction(f3,f2,f1),
data=d, method="response",
fun=function(y) c(mean.y=mean(y) ,var.y=var(y) ))
}
require(rbenchmark)
benchmark(joran_ddply(d),
joshulrich_aggregate(d),
ramnath_datatable(d2),
including_3key_datatable(d3),
one_key_datatable(d2),
key_agg_datatable(d2),
formula_aggregate(d),
dwin_hmisc(d)
)
# test replications elapsed relative user.self sys.self
# dwin_hmisc(d) 100 1757.28 252.121 1590.89 165.65
# formula_aggregate(d) 100 433.56 62.204 390.83 42.50
# including_3key_datatable(d3) 100 7.00 1.004 6.02 0.98
# joran_ddply(d) 100 173.39 24.877 119.35 53.95
# joshulrich_aggregate(d) 100 328.51 47.132 307.14 21.22
# key_agg_datatable(d2) 100 24.62 3.532 19.13 5.50
# one_key_datatable(d2) 100 29.66 4.255 22.28 7.34
# ramnath_datatable(d2) 100 6.97 1.000 5.96 1.01
And here is a solution using Hadley Wickham's new dplyr library.
library(dplyr)
d %>% group_by(f1, f2, f3) %>%
summarise(y.mean = mean(y), z.mean = mean(z))

Resources