I want to visualize some assumptions of regression theory. The starting point is this population and the linear regression fitted to it:
set.seed(1234)
runifdisc <- function(n, min = 0, max = 1) sample(min:max, n, replace = T)
x1 <- runifdisc(100000, 1, 10)
e <- runifdisc(100000, 0, 20)
y <- 4 + 2 * x1 + e
dat_pop <- data.frame(x1, e, y)
m_pop <- lm(y ~ x1, data = dat_pop)
Since x1 takes 10 distinct values, m_pop produces 10 different predictions for y given x1.
> table(round(predict(m_pop)))
16 18 20 22 24 26 28 30 32 34
10096 10081 9864 10078 9927 9914 9915 10124 10018 9983
If I am not mistaken, the predictions of y given x1 should have the same variance for each specific value of x1 across a large number of samples, an assumption that also applies to the residuals. However, in my code, the variance differs across the predicted values of y given x1:
n <- 1000
vec_y1 <- rep(0, n)
vec_y2 <- rep(0, n)
vec_y3 <- rep(0, n)
vec_y4 <- rep(0, n)
vec_y5 <- rep(0, n)
vec_y6 <- rep(0, n)
vec_y7 <- rep(0, n)
vec_y8 <- rep(0, n)
vec_y9 <- rep(0, n)
vec_y10 <- rep(0, n)
#Draw 1000 samples from dat_pop; in each sample, save the prediction of y given x1
#in vectors vec_y1-vec_y10.
for (i in 1:n){
s <- dat_pop[sample(nrow(dat_pop), 1000), ]
m <- lm(y ~ x1, data = s)
#Prediction for y given x1 == 1
vec_y1[i] <- m$coefficients[1] + 1 * m$coefficients[2]
#Prediction for y given x1 == 2
vec_y2[i] <- m$coefficients[1] + 2 * m$coefficients[2]
#Prediction for y given x1 == 3
vec_y3[i] <- m$coefficients[1] + 3 * m$coefficients[2]
#Prediction for y given x1 == ...
vec_y4[i] <- m$coefficients[1] + 4 * m$coefficients[2]
vec_y5[i] <- m$coefficients[1] + 5 * m$coefficients[2]
vec_y6[i] <- m$coefficients[1] + 6 * m$coefficients[2]
vec_y7[i] <- m$coefficients[1] + 7 * m$coefficients[2]
vec_y8[i] <- m$coefficients[1] + 8 * m$coefficients[2]
vec_y9[i] <- m$coefficients[1] + 9 * m$coefficients[2]
vec_y10[i] <- m$coefficients[1] + 10 * m$coefficients[2]
}
#Variance of different predictions for y given x1 in the samples above.
#This variance should be equal for all vectors vec_y1-vec_y10.
var(vec_y1)
var(vec_y3)
var(vec_y5)
var(vec_y8)
var(vec_y10)
The variance is larger for the lower and upper values of x1.
> var(vec_y1)
[1] 0.1234933
> var(vec_y3)
[1] 0.06295427
> var(vec_y5)
[1] 0.03637214
> var(vec_y8)
[1] 0.06016804
> var(vec_y10)
[1] 0.118478
My question is, on the one hand, about my understanding of the assumption from regression theory; perhaps there is a misunderstanding on my side. On the other hand, it is about how to write code that produces the same variance for all predictions of y given x1.
I think this will help to solve it. It took me a while to see it...
The variance is smallest near x1 = 5 because that is close to the mean of x1 (x1 takes the values 1 to 10, so its mean is 5.5). If you change x1 to, e.g., 1:20, then values around 10 will have the minimal variance.
You fit the model n times in the for loop. Each time the estimated slope changes a bit, and that matters most at the ends (x1 = 1 and x1 = 10). That is the reason for the pattern: the fitted regression line always passes through the point (mean(x1), mean(y)), so predictions near the center move the least when the slope changes, while predictions at the extremes are swung around the most.
Below are two test samples (two iterations from the loop).
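To see the same pattern without the full simulation, here is a minimal sketch (my addition; it assumes dat_pop from the question is in the workspace). The standard error of the fitted mean at x0 is sigma * sqrt(1/n + (x0 - mean(x1))^2 / sum((x1 - mean(x1))^2)), so it is smallest near mean(x1) (about 5.5 here) and grows toward x1 = 1 and x1 = 10, matching the pattern in var(vec_y1) ... var(vec_y10).
# Fit the model to one sample of 1000 rows and ask predict() for the
# standard error of the fitted mean at each value of x1.
s_demo <- dat_pop[sample(nrow(dat_pop), 1000), ]
m_demo <- lm(y ~ x1, data = s_demo)
pr <- predict(m_demo, newdata = data.frame(x1 = 1:10), se.fit = TRUE)
round(cbind(x1 = 1:10, fit = pr$fit, se = pr$se.fit), 3)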
Hi, I'm currently writing code to simulate data using the inverse transform method. I'm using a parallel exponential model where I let lambda = b0 + b1*x. My simulation is based on survival analysis.
#generate data
gen <- function(n,lambda,b0,b1){
set.seed(1)
u <- runif(n,0,1)
c1 <- rexp(n,lambda)
x <- rnorm(n,0,1)
t1 = -log(1 - sqrt(u) ) / (b0 + b1*x) #inverse method
c <- 1*(t1 < c1)
t = pmin(t1, c1)
data1 <- data.frame(x, t, t1, c1, c)
return(data1)
}
data2 <- gen(20,0.01,2,4)
data2
x = data2$x
t = data2$t
xsum = sum(x)
tsum = sum(t)
The problem is that when I run the second piece of code below, it does not return the MLEs for b0 and b1.
#Likelihood
library(maxLik)
LLF <- function(para){
set.seed(1)
b0 = para[1]
b1 = para[2]
n = 1
z1 = (n*log(2)) + (n*log(b0+b1*xsum)) - ((b0+b1*xsum)*tsum) + (n*log(1-exp((-(b0 + b1*xsum)*tsum))))
return(z1)
}
mle <- maxLik(LLF, start = c(2,4))
The problem is that you assigned n = 1 in LLF. Since we usually maximize the likelihood over the entire data set, n should equal the number of observations. If you update this, your MLE will converge. For example,
n<-nrow(data2)
#Likelihood
library(maxLik)
LLF <- function(para){
set.seed(1)
b0 = para[1]
b1 = para[2]
#n = 1
z1 = (n*log(2)) + (n*log(b0+b1*xsum)) - ((b0+b1*xsum)*tsum) + (n*log(1-exp((-(b0 + b1*xsum)*tsum))))
return(z1)
}
mle <- maxLik(LLF, start = c(2,4))
summary(mle)
Maximum Likelihood estimation
Newton-Raphson maximisation, 3 iterations
Return code 1: gradient close to zero
Log-Likelihood: -22.7055
2 free parameters
Estimates:
Estimate Std. error t value Pr(> t)
[1,] 1.986 NA NA NA
[2,] 3.986 NA NA NA
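As a quick sanity check (my own addition, not part of the original answer), the maximised log-likelihood should be at least as high as the log-likelihood evaluated at the true generating values b0 = 2 and b1 = 4:
LLF(coef(mle))  # about -22.71, matching the summary output above
LLF(c(2, 4))    # should not exceed the value above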
I have been trying to use biglm to run linear regressions on a large dataset (approx 60,000,000 lines). I want to use AIC for model selection. However, I discovered when playing with biglm on smaller datasets that the AIC values returned by biglm differ from those returned by lm. This even applies to the example in the biglm help.
data(trees)
ff<-log(Volume)~log(Girth)+log(Height)
chunk1<-trees[1:10,]
chunk2<-trees[11:20,]
chunk3<-trees[21:31,]
library(biglm)
a <- biglm(ff,chunk1)
a <- update(a,chunk2)
a <- update(a,chunk3)
AIC(a)#48.18546
a_lm <- lm(ff, trees)
AIC(a_lm)#-62.71125
Can someone please explain what is happening here? Are the AICs generated with biglm safe to use for comparing biglm models on the same dataset?
tl;dr it looks to me like there is a pretty obvious bug in the AIC method for biglm-class objects (more specifically, in the update method), in the current (0.9-1) version, but the author of the biglm package is a smart, experienced guy, and biglm is widely used, so perhaps I'm missing something. Googling for "biglm AIC df.resid", it seems this has been discussed way back in 2009? Update: the package author/maintainer reports via e-mail that this is indeed a bug.
Something funny seems to be going on here. The differences in AIC between models should be the same across modeling frameworks, whatever the constants that have been used and however parameters are counted (because these constants and parameter-counting should be consistent within modeling frameworks ...)
Original example:
data(trees)
ff <- log(Volume)~log(Girth)+log(Height)
chunk1<-trees[1:10,]
chunk2<-trees[11:20,]
chunk3<-trees[21:31,]
library(biglm)
a <- biglm(ff,chunk1)
a <- update(a,chunk2)
a <- update(a,chunk3)
a_lm <- lm(ff, trees)
Now fit a reduced model:
ff2 <- log(Volume)~log(Girth)
a2 <- biglm(ff2, chunk1)
a2 <- update(a2, chunk2)
a2 <- update(a2 ,chunk3)
a2_lm <- lm(ff2,trees)
Now compare AIC values:
AIC(a)-AIC(a2)
## [1] 1.80222
AIC(a_lm)-AIC(a2_lm)
## [1] -20.50022
Check that we haven't screwed something up:
all.equal(coef(a),coef(a_lm)) ## TRUE
all.equal(coef(a2),coef(a2_lm)) ## TRUE
Look under the hood:
biglm:::AIC.biglm
## function (object, ..., k = 2)
## deviance(object) + k * (object$n - object$df.resid)
In principle this is the right formula (number of observations minus residual df should be the number of parameters fitted), but digging in, it looks like the $df.resid component of the objects hasn't been updated properly:
a$n ## 31, correct
a$df.resid ## 7, only valid before updating!
Looking at biglm:::update.biglm, I would add
object$df.resid <- object$df.resid + NROW(mm)
right before or after the line that reads
object$n <- object$n + NROW(mm)
...
This seems like a fairly obvious bug to me, so perhaps I'm missing something obvious, or perhaps it has been fixed.
A simple workaround would be to define your own AIC function as
AIC.biglm <- function (object, ..., k = 2) {
deviance(object) + k * length(coef(object))
}
AIC(a)-AIC(a2) ## matches results from lm()
(although note that AIC(a_lm) is still not equal to AIC(a), because stats:::AIC.default() uses 2*log-likelihood rather than deviance (these two measures differ in their additive coefficients) ...)
I have played around with this a bit. I am not certain, but I think the formula for AIC used by the package biglm is:
2 * (n.parameters + obs.added - 1) + deviance(a)
where obs.added is the number of observations in chunk2 plus the number of observations in chunk3:
obs.added <- dim(chunk2)[1] + dim(chunk3)[1]
and n.parameters is the number of estimated coefficients returned by summary(a), plus 1 for the error term, and deviance(a) is the deviance of your model a.
####################################################
data(trees)
ff <- log(Volume)~log(Girth)+log(Height)
n.parm <- 4
chunk1<-trees[1:10,]
chunk2<-trees[11:20,]
chunk3<-trees[21:31,]
obs.added <- dim(chunk2)[1] + dim(chunk3)[1]
library(biglm)
a <- biglm(ff,chunk1)
a <- update(a,chunk2)
a <- update(a,chunk3)
AIC(a)
summary(a)
deviance(a)
2 * (n.parm + obs.added - 1) + deviance(a)
round(AIC(a), 5) == round(2 * (n.parm + obs.added - 1) + deviance(a), 5)
# [1] TRUE
####################################################
Since I am not 100% certain my answer is correct, you can play around with the code below and see whether you can find a scenario where the proposed formula for AIC does not work. If I find any such scenarios I will attempt to modify the code below and the formula above as necessary.
#########################################################
# Generate some data
n <- 118 # number of observations
B0 <- 2 # intercept
B1 <- -1.5 # slope 1
B2 <- 0.4 # slope 2
B3 <- 2.0 # slope 3
B4 <- -0.8 # slope 4
sigma2 <- 5 # residual variance
x1 <- round(runif(n, -5 , 5), digits = 3) # covariate 1
x2 <- round(runif(n, 10 , 20), digits = 3) # covariate 2
x3 <- round(runif(n, 2 , 8), digits = 3) # covariate 3
x4 <- round(runif(n, 12 , 15), digits = 3) # covariate 4
eps <- rnorm(n, mean = 0, sd = sqrt(sigma2)) # error
y <- B0 + B1 * x1 + B2 * x2 + B3 * x3 + B4 * x4 + eps # dependent variable
my.data <- data.frame(y, x1, x2, x3, x4)
# analyze data with linear regression
model.1 <- lm(y ~ x1 + x2 + x3 + x4, data = my.data)
summary(model.1)
AIC(model.1)
n.parms <- length(model.1$coefficients) + 1
my.AIC <- 2 * n.parms - 2 * as.numeric(logLik(model.1))
my.AIC
#########################################################
ff0 <- y ~ 1
ff1 <- y ~ x1
ff2 <- y ~ x1 + x2
ff3 <- y ~ x1 + x2 + x3
ff4 <- y ~ x1 + x2 + x3 + x4
n.parm0 <- 2
n.parm1 <- 3
n.parm2 <- 4
n.parm3 <- 5
n.parm4 <- 6
n.chunks <- 5
chunk1<-my.data[ 1:round(((nrow(my.data)/n.chunks)*1)+0),]
chunk2<-my.data[round(((nrow(my.data)/n.chunks)*1)+1):round(((nrow(my.data)/n.chunks)*2)+0),]
chunk3<-my.data[round(((nrow(my.data)/n.chunks)*2)+1):round(((nrow(my.data)/n.chunks)*3)+0),]
chunk4<-my.data[round(((nrow(my.data)/n.chunks)*3)+1):round(((nrow(my.data)/n.chunks)*4)+0),]
chunk5<-my.data[round(((nrow(my.data)/n.chunks)*4)+1):nrow(my.data),]
obs.added <- dim(chunk2)[1] + dim(chunk3)[1] + dim(chunk4)[1] + dim(chunk5)[1]
# check division of data
foo <- list()
foo[[1]] <- chunk1
foo[[2]] <- chunk2
foo[[3]] <- chunk3
foo[[4]] <- chunk4
foo[[5]] <- chunk5
all.data.foo <- do.call(rbind, foo)
all.equal(my.data, all.data.foo)
####################################################
library(biglm)
####################################################
a0 <- biglm(ff0, chunk1)
a0 <- update(a0, chunk2)
a0 <- update(a0, chunk3)
a0 <- update(a0, chunk4)
a0 <- update(a0, chunk5)
AIC(a0)
summary(a0)
deviance(a0)
print(a0)
2 * (n.parm0 + obs.added - 1) + deviance(a0)
round(AIC(a0), 5) == round(2 * (n.parm0 + obs.added - 1) + deviance(a0), 5)
####################################################
a1 <- biglm(ff1, chunk1)
a1 <- update(a1, chunk2)
a1 <- update(a1, chunk3)
a1 <- update(a1, chunk4)
a1 <- update(a1, chunk5)
AIC(a1)
summary(a1)
deviance(a1)
print(a1)
2 * (n.parm1 + obs.added - 1) + deviance(a1)
round(AIC(a1), 5) == round(2 * (n.parm1 + obs.added - 1) + deviance(a1), 5)
####################################################
a2 <- biglm(ff2, chunk1)
a2 <- update(a2, chunk2)
a2 <- update(a2, chunk3)
a2 <- update(a2, chunk4)
a2 <- update(a2, chunk5)
AIC(a2)
summary(a2)
deviance(a2)
print(a2)
2 * (n.parm2 + obs.added - 1) + deviance(a2)
round(AIC(a2), 5) == round(2 * (n.parm2 + obs.added - 1) + deviance(a2), 5)
####################################################
a3 <- biglm(ff3, chunk1)
a3 <- update(a3, chunk2)
a3 <- update(a3, chunk3)
a3 <- update(a3, chunk4)
a3 <- update(a3, chunk5)
AIC(a3)
summary(a3)
deviance(a3)
print(a3)
2 * (n.parm3 + obs.added - 1) + deviance(a3)
round(AIC(a3), 5) == round(2 * (n.parm3 + obs.added - 1) + deviance(a3), 5)
####################################################
a4 <- biglm(ff4, chunk1)
a4 <- update(a4, chunk2)
a4 <- update(a4, chunk3)
a4 <- update(a4, chunk4)
a4 <- update(a4, chunk5)
AIC(a4)
summary(a4)
deviance(a4)
print(a4)
2 * (n.parm4 + obs.added - 1) + deviance(a4)
round(AIC(a4), 5) == round(2 * (n.parm4 + obs.added - 1) + deviance(a4), 5)
####################################################
EDIT
I suggested biglm uses the following equation for AIC:
2 * (n.parameters + obs.added - 1) + deviance(a)
Ben Bolker pointed out that the equation biglm uses for AIC is:
deviance(object) + k * (object$n - object$df.resid)
Ben also determined that biglm was not updating the residual df after the first chunk.
Given that new information, I now see that the two equations are equivalent.
First, restrict the two equations to the following, which is the only place they differ:
(n.parameters + obs.added - 1) # mine
(object$n - object$df.resid) # Ben's
Re-arrange mine to account for me adding 1 to the number of parameters and then subtracting one:
((n.parameters-1) + obs.added) = ((4-1) + obs.added) = (3 + 21) = 24
Now morph my equation into Ben's:
My 3 is the same as:
(number of observations in chunk1 - object$df.resid) = (10 - 7) = 3
giving:
((number of obs in chunk1 - object$df.resid) + obs.added) = ((10-7) + 21)
or:
(3 + 21) = 24
Re-arrange:
((number of obs in chunk1 + obs.added) - object$df.resid) = ((10 + 21) - 7)
or:
(31 - 7) = 24
And:
((number of observations in chunk1 + obs.added) - object$df.resid)
is the same as:
(total number of observations - object$df.resid)
Which is the same as:
(object$n - object$df.resid) = (31 - 7) = 24
It appears the equation I proposed really is the equation biglm uses for AIC, just expressed in a different form.
Of course, I was only able to realize this because Ben provided both the critical code and the critical explanation of the error.
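A quick numeric check of that equivalence (assuming a, n.parm, and obs.added from the trees example above are still defined): both penalty terms equal 2 * 24 = 48, so the two AIC expressions return the same value.
2 * (n.parm + obs.added - 1) + deviance(a)   # my formula
deviance(a) + 2 * (a$n - a$df.resid)         # biglm:::AIC.biglm with k = 2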
I have a time series that I need to fit with an AR (autoregressive) model.
The AR model has the form:
x(t) = a0 + a1*x(t-1) + a2*x(t-2) + ... + aq*x(t-q) + noise.
I have two constraints:
Find the best AR fit when lag.max = 50.
Sum of all coefficients a0 + a1 + ... + aq = 1
I wrote the code below:
require(FitAR)
data(lynx) # my real data comes from the stock market.
z <- -log(lynx)
#find best model
step <- SelectModel(z, ARModel = "AR" ,lag.max = 50, Criterion = "AIC",Best=10)
summary(step) # display results
# fit the model and get coefficients
arfit <- ar(z, order.max = ceiling(mean(step[, 1])), aic = FALSE)
#check if sum of coefficients are 1
sum(arfit$ar)
[1] 0.5784978
My question is: how do I add the constraint that the sum of all coefficients equals 1?
I looked at this question, but I do not see how to apply it here.
UPDATE
I think I managed to solve my question as follows.
library(quadprog)
coeff <- arfit$ar
# Build the target vector y and the lagged design matrix X from the series z,
# padding with zeros so all columns have the same length.
y <- 0
for (i in 1:length(coeff)) {
  y <- y + coeff[i] * c(z[(i + 1):length(z)], rep(0, i))
  if (i == 1) {
    X <- c(z[2:length(z)], 0)
  } else {
    X <- cbind(X, c(z[(i + 1):length(z)], rep(0, i)))
  }
}
# Constrained least squares as a quadratic program: minimise
# (1/2) b' (X'X) b - (X'y)' b subject to sum(b) = 1 (meq = 1 makes the
# single constraint an equality).
Dmat <- t(X) %*% X
s <- solve.QP(Dmat, t(y) %*% X, matrix(1, nrow = length(coeff), ncol = 1), 1, meq = 1)
s$solution
# The coefficients should sum up to 1
sum(s$solution)
I am trying to simulate a three-variable dataset so that I can run linear regression models on it. 'X1' and 'X2' would be continuous independent variables (mean=0, sd=1), and 'Y' would be the continuous dependent variable.
The regression model should produce coefficients like this:
Y = 5 + 3(X1) - 2(X2)
I would like to simulate this dataset such that the resulting regression model has an R-squared value of 0.2. How can I determine the value of 'sd.value' so that the regression model has this R-squared?
n <- 200
set.seed(101)
sd.value <- 1
X1 <- rnorm(n, 0, 1)
X2 <- rnorm(n, 0, 1)
Y <- rnorm(n, (5 + 3*X1 - 2*X2), sd.value)
simdata <- data.frame(X1, X2, Y)
summary(lm(Y ~ X1 + X2, data=simdata))
Take a look at this code; it should be close enough to what you want:
simulate <- function(n.obs=10^4, beta=c(5, 3, -2), R.sq=0.8) {
stopifnot(length(beta) == 3)
df <- data.frame(x1=rnorm(n.obs), x2=rnorm(n.obs)) # x1 and x2 are independent
var.epsilon <- (beta[2]^2 + beta[3]^2) * (1 - R.sq) / R.sq
stopifnot(var.epsilon > 0)
df$epsilon <- rnorm(n.obs, sd=sqrt(var.epsilon))
df$y <- with(df, beta[1] + beta[2]*x1 + beta[3]*x2 + epsilon)
return(df)
}
get.R.sq <- function(desired) {
model <- lm(y ~ x1 + x2, data=simulate(R.sq=desired))
return(summary(model)$r.squared)
}
df <- data.frame(desired.R.sq=seq(from=0.05, to=0.95, by=0.05))
df$actual.R.sq <- sapply(df$desired.R.sq, FUN=get.R.sq)
plot(df)
abline(a=0, b=1, col="red", lty=2)
Basically your question comes down to figuring out the expression for var.epsilon. Since we have y = b1 + b2*x1 + b3*x2 + epsilon, and the Xs and epsilon are all independent, we have var[y] = b2^2 * var[x1] + b3^2 * var[x2] + var[eps], where var[x1] = var[x2] = 1 by assumption. You can then solve for var[eps] as a function of R-squared: R^2 = (b2^2 + b3^2) / (b2^2 + b3^2 + var[eps]), so var[eps] = (b2^2 + b3^2) * (1 - R^2) / R^2, which is exactly the expression used for var.epsilon above.
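As a small worked example connecting this back to the question (my own addition): for beta = c(5, 3, -2) and a target R-squared of 0.2, the formula gives var.epsilon = (3^2 + (-2)^2) * (1 - 0.2) / 0.2 = 52, so the sd.value the question asks for is sqrt(52), about 7.21.
target.R.sq <- 0.2
var.epsilon <- (3^2 + (-2)^2) * (1 - target.R.sq) / target.R.sq  # = 52
sqrt(var.epsilon)  # ~ 7.21, the sd.value to plug into rnorm() in the question's code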
So the formula for R^2 is 1 - var(residual)/var(total).
In this case, the variance of Y is going to be 3^2+2^2+sd.value^2, since we are adding three independent random variables. And, asymptotically, the residual variance is going to be simply sd.value^2.
So you can compute rsquared explicitly with this function:
rsq <- function(x) { 1 - x^2 / (9 + 4 + x^2) }
With a little algebra (solve R = 1 - x^2 / (13 + x^2) for x), you can compute the inverse of this function:
rsqi <- function(x) { sqrt(13) * sqrt((1 - x) / x) }
So setting sd.value<-rsqi(rsquared) should give you what you want.
We can test this as follows:
# Uses n, X1, and X2 from the question's code above.
simrsq <- function(x) {
  Y <- rnorm(n, (5 + 3 * X1 - 2 * X2), rsqi(x))
  simdata <- data.frame(X1, X2, Y)
  summary(lm(Y ~ X1 + X2, data = simdata))$r.squared
}
> meanrsq<-rep(0,9)
> for(i in 1:50)
+ meanrsq<-meanrsq+Vectorize(simrsq)((1:9)/10)
> meanrsq/50
[1] 0.1031827 0.2075984 0.3063701 0.3977051 0.5052408 0.6024988 0.6947790
[8] 0.7999349 0.8977187
So it looks to be correct.
This is how I would do it (blind iterative algorithm, assuming no knowledge, for when you are purely interested in "how to simulate this"):
simulate.sd <- function(nsim=10, n=200, seed=101, tol=0.01) {
set.seed(seed)
sd.value <- 1
rsquare <- 1:nsim
results <- 1:nsim
for (i in 1:nsim) {
# track iterations: if we never hit the target r-squared, abort after 3000 steps.
iter <- 0
while (rsquare[i] > (0.20 + tol) | rsquare[i] < (0.2 - tol)) {
sd.value <- sd.value + 0.01
rsquare[i] <- simulate.sd.iter(sd.value, n)
iter <- iter + 1
if (iter > 3000) { break }
}
results[i] <- sd.value # store the current sd.value that is OK!
sd.value <- 1
}
cbind(results, rsquare)
}
simulate.sd.iter <- function(sd.value, n=200) { # helper function
# Takes the sd.value, creates data, and returns the r-squared
X1 <- rnorm(n, 0, 1)
X2 <- rnorm(n, 0, 1)
Y <- rnorm(n, (5 + 3*X1 - 2*X2), sd.value)
simdata <- data.frame(X1, X2, Y)
return(summary(lm(Y ~ X1 + X2, data=simdata))$r.squared)
}
simulate.sd()
A few things to note:
I let X1 and X2 vary across iterations, since this affects the sought sd.value.
The tolerance is how exact you want this estimate to be. Are you fine with an r-squared of ~0.19 or ~0.21? Then a tolerance of 0.01 is fine.
Note that too tight a tolerance might prevent the algorithm from finding a result.
The value of 1 is quite a bad starting value, making this iterative algorithm quite slow.
The resulting vector for 10 results is:
[1] 5.64 5.35 5.46 5.42 5.79 5.39 5.64 5.62 4.70 5.55,
which takes roughly 13 seconds on my machine.
My next step would be to start from 4.5, add 0.001 to the iteration instead of 0.01, and perhaps lower the tolerance. Good luck!
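For what it is worth, here is a hedged sketch of that refinement (the function name and the extra arguments are my own; it reuses simulate.sd.iter() from above): the starting value, step size, and target r-squared become arguments instead of hard-coded values.
simulate.sd2 <- function(nsim = 10, n = 200, seed = 101, tol = 0.01,
                         start = 4.5, step = 0.001, target = 0.2) {
  set.seed(seed)
  results <- numeric(nsim)
  rsquare <- numeric(nsim)
  for (i in 1:nsim) {
    sd.value <- start
    r2 <- Inf      # force at least one pass through the while loop
    iter <- 0
    while (abs(r2 - target) > tol && iter < 3000) {
      sd.value <- sd.value + step
      r2 <- simulate.sd.iter(sd.value, n)
      iter <- iter + 1
    }
    results[i] <- sd.value
    rsquare[i] <- r2
  }
  cbind(results, rsquare)
}
simulate.sd2()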
Alright, some summary statistics for nsim=100, taking 150 seconds, with a step size of 0.001 and the tolerance still at 0.01:
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.513 4.913 5.036 5.018 5.157 5.393
Why are you interested in this though?
Here is another piece of code that generates a multiple linear regression with normally distributed errors:
Oops, sorry, this code just produces a multiple regression (it does not target a specific R-squared).
sim.regression <- function(n.obs = 10, coefficients = runif(10, -5, 5), s.deviation = .1) {
  n.var <- length(coefficients)
  M <- matrix(0, ncol = n.var, nrow = n.obs)
  beta <- as.matrix(coefficients)
  for (i in 1:n.var) {
    M[, i] <- rnorm(n.obs, 0, 1)
  }
  y <- M %*% beta + rnorm(n.obs, 0, s.deviation)
  return(list(x = M, y = y, coeff = coefficients))
}
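A hypothetical usage example (my addition): simulate 100 observations with slopes 3 and -2, then recover them with lm(). Note that the simulated model has no intercept term, so the fitted intercept should be close to zero.
sim <- sim.regression(n.obs = 100, coefficients = c(3, -2), s.deviation = 1)
coef(lm(sim$y ~ sim$x))  # intercept near 0, slopes near 3 and -2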
Let's say that you have a normally distributed variable y with a 3-group categorical predictor x that has the orthogonal contrasts c1 and c2. I am trying to create a program in R that, given x, c1, and c2, creates y such that c1 and c2 have effect sizes r1 and r2 specified by the user.
For example, let's say that x, c1, c2, r1, and r2 were created like the following:
x <- factor(rep(c(1, 2, 3), 100))
contrasts(x) <- matrix(c(0, -.5, .5, -2/3, 1/3, 1/3),
nrow = 3, ncol = 2, dimnames = list(c("1", "2", "3"), c("c1", "c2")))
contrasts(x)
c1 c2
1 0.0 -0.6666667
2 -0.5 0.3333333
3 0.5 0.3333333
r1 <- .09
r2 <- 0
I would like the program to create y such that the variance in y accounted for by c1 equals r1 (.09) and the variance in y accounted for by c2 equals r2 (0).
Does anybody know how I might go about this? I know that I should be using the rnorm function, but I'm stuck on which population means / sds rnorm should use when it does its sampling.
Courtesy of some generous advice from my colleagues, I now have a function that creates simulated data given a specified number of groups, a set of contrasts, a set of regression coefficients, a specified N per cell, and a specified within-group variance:
sim.factor <- function(levels, contr, beta, perCell, errorVar){
# Build design matrix X
X <- cbind(rep(1,levels*perCell), kronecker(contr, rep(1,perCell)))
# Generate y
y <- X %*% beta + rnorm(levels*perCell, sd=sqrt(errorVar))
# Build and return data frame
dat <- cbind.data.frame(y, X[,-1])
names(dat)[-1] <- colnames(contr)
return(dat)
}
I also wrote a function that, given a set of regression coefficients, the N per cell, the number of groups, a set of orthogonal contrasts, and the desired delta-R^2 for a contrast of interest, returns the required within-group variance:
ws.var <- function(levels, contr, beta, perCell, dc){
# Build design matrix X
X <- cbind(rep(1,levels), contr)
# Generate the expected means
means <- X %*% beta
# Find the sum of squares due to each contrast
var <- (t(means) %*% contr)^2 / apply(contr^2 / perCell, 2, sum)
# Calculate the within-conditions sum of squares
wvar <- var[1] / dc - sum(var)
# Convert the sum of squares to variance
errorVar <- wvar / (levels * (perCell - 1))  # within-group df = levels * (perCell - 1)
return(errorVar)
}
After doing some testing as follows, the functions seem to generate the desired delta R^2 for contrast c1.
contr <- contr.helmert(3)
colnames(contr) <- c("c1","c2")
beta <- c(0, 1, 0)
perCell <- 50
levels = 3
dc <- .08
N <- 1000
# Calculate the error variance
errorVar <- ws.var(levels, contr, beta, perCell, dc)
# To store delta R^2 values
d1 <- vector("numeric", length = N)
# Use the functions
for(i in 1:N)
{
d <- sim.factor(levels=3,
contr=contr,
beta=beta,
perCell=perCell,
errorVar=errorVar)
d1[i] <- lm.sumSquares(lm(y ~ c1 + c2, data = d))[1, 2] # From the lmSupport package
}
m <- round(mean(d1), digits = 3)
bmp("Testing simulation functions.bmp")
hist(d1, xlab = "Percentage of variance due to c1", main = "")
text(.18, 180, labels = paste("Mean =", m))
dev.off()