Here's what I tried, making use of the mvtnorm package.
Sample Dataset
library(mvtnorm)
set.seed(2357)
df <- data.frame(
x = rnorm(1000, mean=80, sd=20),
y = rnorm(1000, mean=0, sd=5),
z = rnorm(1000, mean=0, sd=5)
)
head(df)
      x      y       z
1 70.38  1.307  0.2005
2 59.76  5.781 -3.5095
3 54.14 -1.313 -1.9022
4 79.91  7.754 -6.2076
5 87.07  1.389  1.1065
6 75.89  1.684  6.2979
Fit a multivariate normal distribution and check that P(x <= 80) is about 0.5
# Get the dimension means and correlation matrix
means <- c(x=mean(df$x), y=mean(df$y), z=mean(df$z))
corr <- cor(df)
# Check P(x <= 80)
sum(df$x <= 80)/nrow(df) # 0.498
pmvnorm(lower=-Inf, upper=c(80, Inf, Inf), mean=means, corr=corr) # 0.8232
Why is the fitted result 0.82? Where did I go wrong?
First, you don't need to simulate anything to study the pmvnorm function:
pmvnorm(lower=rep(-Inf, 3), upper=c(80, Inf, Inf), mean=c(80,0,0), corr=diag(rep(1,3)))
The result is 0.5, as you expected.
Your means vector is approximately (79, 0, 0), so let's try it:
pmvnorm(lower=rep(-Inf, 3), upper=c(80, Inf, Inf), mean=c(79,0,0), corr=diag(rep(1,3)))
The result now is 0.8413447. There's nothing the matter. By specifying only the correlation matrix, you told the software to assume that all variances were unity. In your simulation, the variances were 400, 25, and 25: very different from what you specified in the arguments!
The correct calculation uses the covariance matrix of the data, not its correlation matrix:
pmvnorm(lower=rep(-Inf, 3), upper=c(80, Inf, Inf), mean=means, sigma=cov(df))
The result is 0.5178412, quite in keeping with the data.
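Equivalently, if you want to keep working with the correlation matrix, you can rebuild the covariance matrix from the correlations and the standard deviations yourself. This is just a sketch; it should reproduce the same probability as the cov(df) call above, since the covariance matrix equals D %*% R %*% D with D = diag(sds):
sds <- sapply(df, sd)                      # roughly (20, 5, 5)
sigma <- diag(sds) %*% corr %*% diag(sds)  # rebuild the covariance matrix from corr and sds
pmvnorm(lower=rep(-Inf, 3), upper=c(80, Inf, Inf), mean=means, sigma=sigma)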
Related
I would like to create a setup with a dependent variable y and two explanatory variables x and z that have specified correlations r_yx, r_zx, and r_yz, which I can vary myself.
r_yx <- 0.9
r_zx <- -0.5
r_yz <- 0.9
n <- 10000
x <- rnorm(n, mean = 0, sd = 1)
y <- r_yx*x + rnorm(n, mean = 0, sd = 1)
Question 1: How can I now create the data so that the correlation between x and y is r_yx = .9 and the correlation between x and z is r_zx = -.5?
Question 2: Also, checking cor(y, x) shows that it is not that close to 0.9. How can I set it correctly?
It's difficult to generate random samples that have an exact correlation, but you can get pretty close. For positive correlations with x, just add some noise to x; the closer you want the correlation to be to 1, the smaller you make the sd of the noise.
We can get fairly close to 0.9 and -0.5 like this:
n <- 10000
set.seed(1)
x <- rnorm(n, mean = 0, sd = 1)
y <- x + rnorm(n, 0, sd = 0.5)
z <- -x + rnorm(n, 0, 1.75)
cor(x, y)
#> [1] 0.8986359
cor(x, z)
#> [1] -0.4983418
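If you want to aim at a target correlation directly, you can solve for the noise sd: with sd(x) = 1 and independent noise e, cor(x, x + e) = 1 / sqrt(1 + sd(e)^2), so sd(e) = sqrt(1 / r^2 - 1). A small sketch (the helper name target_noise_sd is mine, not from the question):
target_noise_sd <- function(r) sqrt(1 / r^2 - 1)  # noise sd giving correlation r with x, in expectation
y2 <- x + rnorm(n, 0, target_noise_sd(0.9))       # aim for cor(x, y2) of about 0.9
z2 <- -x + rnorm(n, 0, target_noise_sd(0.5))      # aim for cor(x, z2) of about -0.5
cor(x, y2)
cor(x, z2)
Here target_noise_sd(0.9) is about 0.48 and target_noise_sd(0.5) is about 1.73, close to the 0.5 and 1.75 used above.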
However, you cannot independently change the correlation of y to z if you have already fixed their correlation to x. In the extreme case, if y's correlation with x was 1, then y would be identical to x, and would therefore have to have the same correlation to z as x did.
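In fact, with independent noise terms as above, the induced correlation between y and z is (in the population) the product of their correlations with x, which you can check on the simulated data:
cor(y, z)              # roughly -0.44
cor(x, y) * cor(x, z)  # roughly the same value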
Created on 2022-06-12 by the reprex package (v2.0.1)
I would like to simulate data from the following model in R
Y ~ N(b0 + b1*X, sigma)
and fit the following model in R
lm(Y ~ 1 + X, data)
Roughly, here is what the R code would be:
nsims = 1000
X = 1:50
b0 = rnorm(nsims, 55.63, 31.40)
b1 = rnorm(nsims, 1.04, .39)
sigma = rnorm(nsims, 11.34, 4.11)
The catch is that I'd like b0, b1, and sigma to be correlated. I'd like them to have this correlation matrix:
R <- matrix(c(1, .16, .54,
.16, 1, .13,
.54, .13, 1),
nrow = 3)
colnames(R) <- c("b0", "b1", "sigma")
Now, given that I want this correlation structure, my rnorm code above is wrong. If my data didn't need this correlation matrix, I would probably do the following:
sim_data <- data.frame()
for(i in 1:nsims){
Y = b0[i] + b1[i]*X + rnorm(length(X), 0, sigma[i])
data_tmp <- data.frame(Y = Y, X = X, ID = i)
sim_data <- rbind(sim_data, data_tmp)
}
But this ignores my correlation structure because of the way I generated the parameters. Can anyone offer some suggestions or pointers on how to incorporate the correlation?
Simulate a 3-dimensional normal distribution and take the variables from it. You can use the MASS package for the multivariate simulation and the MBESS package for the transformation from a correlation matrix to a covariance matrix, which is what the mvrnorm function needs.
library(MASS)
library(MBESS)
R <- matrix(c(1, .16, .54,
.16, 1, .13,
.54, .13, 1),
nrow = 3)
SD <- c(31.40, .39, 4.11)
## convert correlation matrix to covariance matrix
Cov <- cor2cov(R, SD)
### you can also do it algebraically without MBESS package
### Cov <- SD %*% t(SD) * R
### where %*% is matrix multiplication and * is normal multiplication
### t() is transpose function
# simulate multivariate normal distribution
mvnorm <- mvrnorm(
1000,
mu = c(55.63, 1.04, 11.34),
Sigma = Cov,
empirical = TRUE # make the sample mean and covariance match mu and Sigma exactly
)
# check whether correlation matrix is right
cor(mvnorm)
[,1] [,2] [,3]
[1,] 1.00 0.16 0.54
[2,] 0.16 1.00 0.13
[3,] 0.54 0.13 1.00
# extract variables
b0 <- mvnorm[, 1]
b1 <- mvnorm[, 2]
sigma <- mvnorm[, 3]
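To tie this back to your original loop, you can then index the simulated parameter vectors exactly as before. A sketch (not part of the answer above); note that because sigma is drawn from a normal distribution, an occasional draw could be negative, so this sketch crudely guards against that before passing it to rnorm():
X <- 1:50
sim_data <- data.frame()
for (i in 1:nrow(mvnorm)) {
  s <- abs(sigma[i])  # crude guard against a rare negative draw; adjust as you see fit
  Y <- b0[i] + b1[i] * X + rnorm(length(X), 0, s)
  sim_data <- rbind(sim_data, data.frame(Y = Y, X = X, ID = i))
}
head(sim_data)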
I want to generate two random variables (x1, x2), essentially uncorrelated with each other, that show specified Pearson correlations with an existing variable y, e.g.:
cor(x1, y) = 0.4;
cor(x2, y) = 0.3;
cor(x1, x2) = 0.03.
So, I have continuous, normally distributed values for y (obtained using a spatial interpolation technique), and now I want to generate simulated continuous values (e.g., normally distributed) for the two explanatory variables x1 and x2 using the correlation coefficients given above.
I tried the mvrnorm (MASS) and copula R packages, but I did not find a way to do what I want.
If someone can help me get there, I would appreciate it a lot. Kind regards.
The mvrnorm function in the MASS package should be able to do this (the copula package as well, I am just less familiar with it).
What did you try and how did the results differ from what you expected?
Here is a quick mvrnorm example:
> ?MASS::mvrnorm
> library(MASS)
>
> r <- cbind( c(1, 0.4, 0.3),
+ c(0.4, 1, 0.03),
+ c(0.3, 0.03, 1))
>
> xy <- mvrnorm(n=100, mu=c(0,0,0), Sigma=r, empirical=TRUE )
> colnames(xy) <- c('y','x1','x2')
>
> cor(xy)
y x1 x2
y 1.0 0.40 0.30
x1 0.4 1.00 0.03
x2 0.3 0.03 1.00
>
Edit
Here is one way with an existing y variable:
y <- rnorm(100) # existing y
# generate x1 and x2, make sure y is first column
xy <- cbind( y, x1=rnorm(100), x2=rnorm(100))
# center and scale
mns <- apply(xy, 2, mean)
sds <- apply(xy, 2, sd)
xy2 <- sweep(xy, 2, mns, FUN="-")
xy2 <- sweep(xy2, 2, sds, FUN="/")
# find existing correlations
v.obs <- cor(xy2)
# remove correlation
xy3 <- xy2 %*% solve(chol(v.obs))
# check
zapsmall(cor(xy3))
# new correlation
r <- cbind( c(1, 0.4, 0.3),
c(0.4, 1, 0.03),
c(0.3, 0.03, 1))
xy4 <- xy3 %*% chol(r)
# undo center and scale
xy4 <- sweep(xy4, 2, sds, FUN="*")
xy4 <- sweep(xy4, 2, mns, FUN="+")
# check the new correlations and confirm the first column is still the original y
cor(xy4)
all.equal(y, xy4[,1])
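If you want to reuse these steps, they can be packaged into a small helper. This is a sketch of my own (the name add_correlated is not from the answer): given an existing y and a target correlation matrix r with y in the first row/column, it returns a matrix whose first column is the original y and whose sample correlation matrix matches r up to floating point:
add_correlated <- function(y, r) {
  n <- length(y)
  k <- nrow(r)
  xy <- cbind(y, matrix(rnorm(n * (k - 1)), n, k - 1))  # random starting columns
  mns <- apply(xy, 2, mean)
  sds <- apply(xy, 2, sd)
  xy2 <- sweep(sweep(xy, 2, mns, "-"), 2, sds, "/")     # center and scale
  xy3 <- xy2 %*% solve(chol(cor(xy2)))                  # remove the incidental correlations
  xy4 <- xy3 %*% chol(r)                                # impose the target correlations
  sweep(sweep(xy4, 2, sds, "*"), 2, mns, "+")           # undo the centering and scaling
}
out <- add_correlated(y, r)
zapsmall(cor(out))
all.equal(y, out[, 1])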
The mvrnorm function uses an eigendecomposition rather than chol. You could also follow that code, using your own y instead of random values for that part of the matrix.
Here is the problem: Five observations on Y are to be taken when X = 4, 8, 12, 16, 20, respectively. The true regression function is E(Y) = 20 + 4X, and the ei are independent N(0, 25).
Generate five normal random numbers with mean 0 and variance 25. Consider these random numbers as the error terms for the five Y observations at X = 4, 8, 12, 16, 20 and calculate Y1, Y2, Y3, Y4, and Y5. Obtain the least squares estimates b0 and b1 when fitting a straight line to the five cases. Also calculate Yh when Xh = 10 and obtain a 95 percent confidence interval for E(Yh) when Xh = 10. I did part (1), but I need help repeating it 200 times.
Repeat part (1) 200 times, generating new random numbers each time.
Make a frequency distribution of the 200 estimates b1. Calculate the mean and standard deviation of the 200 estimates b1. Are the results consistent with theoretical expectations?
What proportion of the 200 confidence intervals for E(Yh) when Xh = 10 include E(Yh)? Is this result consistent with theoretical expectations?
Here's my code so far; I am stumped on how to repeat part (1) 200 times:
X <- matrix(c(4, 8, 12, 16, 20), nrow = 5, ncol = 1)
e <- matrix(c(rnorm(5,0,sqrt(5))), nrow = 5, ncol = 1)
Y <- 20 + 4 * X + e
mydata <- data.frame(cbind(Y=Y, X=X, e=e))
names(mydata) <- c("Y","X","e")
reg<-lm(Y ~ X, data = mydata)
predict(reg, newdata = data.frame(X=10), interval="confidence")
There is a mistake in your code. You want independent N(0, 25) errors, but you passed sqrt(5) as the standard deviation to rnorm(). It should be 5, since the variance is 25.
We first wrap your code up in a function. This function takes no input; it runs the experiment once and returns the regression coefficients b0 and b1 and the prediction fit, lwr, upr in a named vector.
sim <- function () {
x <- c(4, 8, 12, 16, 20)
y <- 20 + 4 * x + rnorm(5,0,5)
fit <- lm(y ~ x)
pred <- predict(fit, data.frame(x = 10), interval = "confidence")
pred <- setNames(c(pred), dimnames(pred)[[2]])
## return simulation result
c(coef(fit), pred)
}
For example, let's try
set.seed(2016)
sim()
#(Intercept) x fit lwr upr
# 24.222348 3.442742 58.649773 47.522309 69.777236
Now we use replicate to repeat the experiment 200 times.
set.seed(0)
z <- t(replicate(200, sim()))
head(z)
# (Intercept) x fit lwr upr
#[1,] 24.100535 3.987755 63.97808 57.61262 70.34354
#[2,] 6.417639 5.101501 57.43265 52.44263 62.42267
#[3,] 20.652355 3.797991 58.63227 52.74861 64.51593
#[4,] 20.349829 3.816426 58.51409 52.59115 64.43702
#[5,] 19.891873 4.095140 60.84327 57.49911 64.18742
#[6,] 24.586749 3.589483 60.48158 53.64574 67.31743
There are 200 rows, one for each of the 200 simulations.
The second column contains the 200 estimates of b1; we compute their mean and standard deviation:
mean(z[,2])
# [1] 3.976249
sd(z[,2])
# [1] 0.4263377
We know that the true value is 4, and the theoretical standard deviation of b1 is 5 / sqrt(sum((X - mean(X))^2)) = 5 / sqrt(160), about 0.395, so the simulated estimates are consistent with theoretical expectations.
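As a quick arithmetic check of that claim (a small sketch using the design points from the question):
x <- c(4, 8, 12, 16, 20)
5 / sqrt(sum((x - mean(x))^2))  # theoretical sd of the b1 estimate: 5 / sqrt(160), about 0.395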
Finally, let's check the 95% confidence intervals for the prediction at X = 10. The true value is 20 + 4 * 10 = 60, so the proportion of confidence intervals that cover this true value is
mean(z[, "lwr"] < 60 & z[, "upr"] > 60)
## 0.95
which is exactly 0.95.
Let's say that you have a normally distributed variable y with a 3-group categorical predictor x that has the orthogonal contrasts c1 and c2. I am trying to create a program in R that, given x, c1, and c2, creates y such that c1 and c2 have effect sizes r1 and r2 specified by the user.
For example, let's say that x, c1, c2, r1, and r2 were created like the following:
x <- factor(rep(c(1, 2, 3), 100))
contrasts(x) <- matrix(c(0, -.5, .5, -2/3, 1/3, 1/3),
nrow = 3, ncol = 2, dimnames = list(c("1", "2", "3"), c("c1", "c2")))
contrasts(x)
c1 c2
1 0.0 -0.6666667
2 -0.5 0.3333333
3 0.5 0.3333333
r1 <- .09
r2 <- 0
I would like the program to create y such that the variance in y accounted for by c1 equals r1 (.09) and the variance in y accounted for by c2 equals r2 (0).
Does anybody know how I might go about this? I know that I should be using the rnorm function, but I'm stuck on which population means / sds rnorm should use when it does its sampling.
Courtesy of some generous advice from my colleagues, I now have one function that creates simulated data given a specified number of groups, a set of contrasts, a set of regression coefficients, a specified N per cell, and a specified within-group variance:
sim.factor <- function(levels, contr, beta, perCell, errorVar){
# Build design matrix X
X <- cbind(rep(1,levels*perCell), kronecker(contr, rep(1,perCell)))
# Generate y
y <- X %*% beta + rnorm(levels*perCell, sd=sqrt(errorVar))
# Build and return data frame
dat <- cbind.data.frame(y, X[,-1])
names(dat)[-1] <- colnames(contr)
return(dat)
}
I also wrote a function that, given a set of regression coefficients, an N per cell, a number of groups, a set of orthogonal contrasts, and a desired delta-R^2 for a contrast of interest, returns the required within-group variance:
ws.var <- function(levels, contr, beta, perCell, dc){
# Build design matrix X
X <- cbind(rep(1,levels), contr)
# Generate the expected means
means <- X %*% beta
# Find the sum of squares due to each contrast
var <- (t(means) %*% contr)^2 / apply(contr^2 / perCell, 2, sum)
# Calculate the within-conditions sum of squares
wvar <- var[1] / dc - sum(var)
# Convert the sum of squares to a variance using the within-group df, levels * (perCell - 1)
errorVar <- wvar / (levels * (perCell - 1))
return(errorVar)
}
After doing some testing as follows, the functions seem to generate the desired delta R^2 for contrast c1.
contr <- contr.helmert(3)
colnames(contr) <- c("c1","c2")
beta <- c(0, 1, 0)
perCell <- 50
levels = 3
dc <- .08
N <- 1000
# Calculate the error variance
errorVar <- ws.var(levels, contr, beta, perCell, dc)
# To store delta R^2 values
d1 <- vector("numeric", length = N)
# Use the functions
for(i in 1:N)
{
d <- sim.factor(levels=3,
contr=contr,
beta=beta,
perCell=perCell,
errorVar=errorVar)
d1[i] <- lm.sumSquares(lm(y ~ c1 + c2, data = d))[1, 2] # From the lmSupport package
}
m <- round(mean(d1), digits = 3)
bmp("Testing simulation functions.bmp")
hist(d1, xlab = "Percentage of variance due to c1", main = "")
text(.18, 180, labels = paste("Mean =", m))
dev.off()
Patrick