Is there a way to compute the correlation of U = 2X1 − X2 and V = X1 + 2X2 using R (not manually), given the variance of X1, the variance of X2, and the covariance between X1 and X2?
The covariance matrix of two random variables is the 2x2 symmetric matrix whose diagonal elements are the variances of the two components and whose off-diagonal elements are their covariance. That is, if the variances of X1 and X2 are v1 and v2 and the covariance is v12, then the covariance matrix of X is matrix(c(v1, v12, v12, v2), 2). We can readily form a covariance matrix via cov(d), where d is a two-column matrix of data. To be concrete, let us form the covariance matrix of the built-in two-column data frame BOD. Then we can use the formula below to get the covariance matrix of the transformation, and use cov2cor to get the correlation matrix. The upper (and, by symmetry, the lower) off-diagonal element of the correlation matrix is the desired correlation. No packages are used.
# inputs: covariance matrix V and transformation matrix M
V <- cov(BOD)
M <- matrix(c(2, 1, -1, 2), 2)
cov2cor(M %*% V %*% t(M))[1, 2]
## [1] -0.3023
To double check, transform BOD using M and then calculate the correlation of the result. We see that it is the same.
cor(as.matrix(BOD) %*% t(M))[1, 2]
## [1] -0.3023
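If, as in the question, you have the variances and covariance as numbers rather than raw data, you can construct V directly instead of using cov(); a minimal sketch with made-up values for v1, v2 and v12:
## hypothetical inputs: var(X1), var(X2) and cov(X1, X2)
v1 <- 4; v2 <- 9; v12 <- 2
V <- matrix(c(v1, v12, v12, v2), 2)  # covariance matrix of (X1, X2)
M <- matrix(c(2, 1, -1, 2), 2)       # rows encode U = 2*X1 - X2 and V = X1 + 2*X2
cov2cor(M %*% V %*% t(M))[1, 2]      # correlation of U and V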
I have proposed my own model and am now trying to implement it in R, but I am stuck on how to compute the observed information matrix from my formula. I have used glm() to fit a logistic model with a penalty term to a binary data set x1, x2, x3, y (all binary 0/1); fit1 is the glm() model and dev.new is the penalised deviance.
X.tilde <- as.matrix(x) # n*p matrix of the data set
W <- Diagonal(length(y), weights) # n*n diagonal matrix of the weights
qq <- exp(fit1$fitted.values)/(1 + exp(fit1$fitted.values)) # n*1 vector (pi=probability of the logistic model )
cc <- t(1 - qq) # n*1 vector
gg <- (dev.new) * t(dev.new) # p*p matrix
ff <- (X.tilde) %*% t(X.tilde) # n*n matrix
pp <- exp(fit1$coefficients)/(1 + exp(fit1$coefficients)) # p*1 matrix
ss <- t(1/(1 + exp(fit1$coefficients))) # p*1 vector
aa <- t(X.tilde) %*% qq %*% cc %*% W %*% (X.tilde) # p*p matrix
firstP <- (aa + (pp * ss)) # p*p matrix
info.mat <- firstP+gg # p*p matrix
info.mat <- as.matrix(info.mat)
This code returns the following error:
Error in e1 + Matrix(e2) :
Matrices must have same number of rows for arithmetic
In my theory the dimensions are fine, but when I implement it the result is not correct. Any help?
I am doing a simulation study for a mixed effect model (three levels; observations nested within subjects within schools):
f <- lmer(measurement ~ time + race + gender + s_ses +
fidelity + (1 + time|school/subject), mydata_long, REML=0)
The model allows the intercept and the time slope to vary across subjects and schools. I am wondering how I can fix the variances to specific values. I do know how to do that when there is only a random intercept:
VarCorr(f)['subject:school']<-0.13
VarCorr(f)['school']<-0.20
However, when there is a random slope, this code doesn't work, since there are several components in the variance structure (see the attached picture).
How can I fix the variances of subject:school (Intercept), subject:school time, school (Intercept), and school time to specific values in this case? Any suggestions?
A simulation example follows. The hardest part is getting the random-effects parameters correctly specified. The key things you need to know are: (1) internally, the random-effects variance matrix is scaled by the residual variance; (2) for vector-valued random effects (as in this random-slopes model), the variance-covariance matrix is specified in terms of its Cholesky factor: if we want covariance matrix V, there is a lower-triangular matrix C such that C %*% t(C) == V. We compute C via chol() (transposing, since chol() returns the upper-triangular factor), then read off the elements of the lower triangle (including the diagonal) in column-major order (see helper functions below).
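As a minimal illustration of point (2), using an arbitrary 2x2 covariance matrix:
V <- matrix(c(1, 0.5, 0.5, 2), 2)   # an arbitrary covariance matrix
C <- t(chol(V))                     # chol() returns the upper factor, so transpose
all.equal(C %*% t(C), V)            # TRUE: C reconstructs V
C[lower.tri(C, diag = TRUE)]        # lower-triangle elements in column-major order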
Set up experimental design (simplified from yours, but with the same random effects components):
mydata_long <- expand.grid(time=1:40,
school=factor(letters[1:25]),
subject=factor(LETTERS[1:25]))
Helper functions to convert from a vector of standard deviations, one or more correlation parameters (in lower-triangular/column-major order), and a residual standard deviation to a vector of "theta" parameters as used internally by lme4 (see the description above), and back the other way (conv_chol):
conv_sc <- function(sdvec,cor,sigma) {
## construct symmetric matrix with cor in lower/upper triangles
cormat <- matrix(1,nrow=length(sdvec),ncol=length(sdvec))
cormat[lower.tri(cormat)] <- cor
cormat[upper.tri(cormat)] <- t(cormat)[upper.tri(cormat)]
## convert to covariance matrix and scale by 1/sigma^2
V <- outer(sdvec, sdvec)*cormat/sigma^2
## extract lower triangle in column-major order
return(t(chol(V))[lower.tri(V,diag=TRUE)])
}
conv_chol <- function(ch, s) {
m <- matrix(NA,2,2)
m[lower.tri(m,diag=TRUE)] <- ch
m[upper.tri(m)] <- 0
V <- m %*% t(m) * s^2
list(sd=sqrt(diag(V)), cor=cov2cor(V)[1,2])
}
If you want to start from covariance matrices rather than standard deviations and correlations you can modify the code to skip some steps (starting and ending with V).
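For instance, a hypothetical helper (conv_V is my own name, not part of lme4 or the code above) that goes straight from a covariance matrix to the theta vector, under the same scaling convention:
conv_V <- function(V, sigma) {
  Vs <- V / sigma^2                         # scale by the residual variance
  t(chol(Vs))[lower.tri(Vs, diag = TRUE)]   # lower triangle, column-major order
}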
Pick some values and convert (and back-convert, to check):
tt1 <- conv_sc(c(0.7, 1.2), 0.3, 0.5)
tt2 <- conv_sc(c(1.4, 0.2), -0.2, 0.5)
tt <- c(tt1, tt2)
conv_chol(tt1, s=0.5)  ## should recover sd = c(0.7, 1.2), cor = 0.3
conv_chol(tt2, s=0.5)  ## should recover sd = c(1.4, 0.2), cor = -0.2
Set up formula and simulate:
library(lme4)  ## needed for the formula method of simulate() and for lmer()
form <- m ~ time + (1 + time|school/subject)
set.seed(101)
mydata_long$m <- simulate(form[-2], ## [-2] drops the response
family=gaussian,
newdata=mydata_long,
newparams=list(theta=tt,
beta=c(1,1),
sigma=0.5))[[1]]
f <- lmer(form, data=mydata_long, REML=FALSE)
VarCorr(f)
The fitted results are close to what we requested above ...
Groups Name Std.Dev. Corr
subject:school (Intercept) 0.66427
time 1.16488 0.231
school (Intercept) 1.78312
time 0.22459 -0.156
Residual 0.49772
Now do the same thing 200 times, to explore the distribution of estimates:
simfun <- function() {
mydata_long$m <- simulate(form[-2],
family=gaussian,
newdata=mydata_long,
newparams=list(theta=tt,
beta=c(1,1),
sigma=0.5))[[1]]
f <- lmer(form, data=mydata_long, REML=FALSE)
return(as.data.frame(VarCorr(f))[,"sdcor"])
}
set.seed(101)
res <- plyr::raply(200,suppressMessages(simfun()),.progress="text")
Here plyr::raply() is used for convenience; you can do this however you like (a for loop, lapply(), replicate(), purrr::map(), ...).
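For example, a base-R equivalent using replicate() (a sketch, assuming simfun() as defined above):
res <- t(replicate(200, suppressMessages(simfun())))  # transpose so each row is one replicate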
par(las=1)
boxplot(res)
## add true values to the plot
points(1:7, c(0.7, 1.2, 0.3, 1.4, 0.2, -0.2, 0.5), col = 2, cex = 3, lwd = 3)
I have referred to some of the previous answers, such as:
How to generate random numbers from a normal distribution with specific mean and variance?
Is there any way to generate uncorrelated random variables using Python?
Generate matrix with iid normal random variables using R
It's still not clear how to generate uncorrelated random normal vectors with different means. The catch is that the number of samples in each vector should be small (as low as 20; we want two 20x1 vectors). Probably this is a bad constraint.
I used the replicate function with rnorm, as mentioned in one of the posts above:
c2=replicate(10000, cor(rnorm(100), rnorm(100)))
For large numbers of replicates, like 10,000 and above, the correlations average out to almost 0. But
c2=replicate(20, cor(rnorm(100), rnorm(100)))
gives correlations that are noticeably positive or negative.
You can use the mvrnorm function from MASS to do that.
Defining the variables properties:
# Means
m1 <- 5
m2 <- 10
# Variances
s1 <- 5
s2 <- 1
# Covariance (zero, so the variables are uncorrelated)
X1 <- 0
Creating the variables:
set.seed(123)
dat <- MASS::mvrnorm(20, mu = c(m1, m2),
Sigma = matrix(c(s1, X1,
X1, s2),
ncol = 2, byrow = TRUE),
empirical = TRUE)
Testing the correlations:
cor(dat)
[,1] [,2]
[1,] 1.000000000000000000 0.000000000000000197
[2,] 0.000000000000000197 1.000000000000000000
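The correlation is exactly zero because empirical = TRUE forces the sample moments to match the requested values. As a quick extra check (not in the original answer), the means and variances also match exactly:
colMeans(dat)       ## 5 10
apply(dat, 2, var)  ## 5  1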
The result is different when using a variance matrix and a correlation matrix. Why is this happening?
I will write down the results directly for convenience.
Variance matrix (named co):
0.1234 0.125
0.1250 0.245
Correlation matrix (named coo, made with the cov2cor function):
1.0000 0.7189
0.7189 1.0000
Result:
pmvnorm(mean=c(1,1), sigma=co, lower=rep(-Inf,2), upper=c(0.7,4))
## 0.1965493
pmvnorm(mean=c(1,1), corr=coo, lower=rep(-Inf,2), upper=c(0.7,4))
## 0.3820885
I made a covariance matrix and then obtained a correlation matrix from it with cov2cor. When each was used, the results differed. Here is the code:
install.packages("mvtnorm")
library(mvtnorm)
co <- matrix(c(0.1234,0.125,0.125,0.245),2,2)
coo <- cov2cor(co)
pmvnorm(mean=c(1,1), sigma=co, lower=rep(-Inf,2), upper=c(0.7,4))
pmvnorm(mean=c(1,1), corr=coo, lower=rep(-Inf,2), upper=c(0.7,4))
Please let me know why.
As per ?pmvnorm (emphasis mine)
sigma: the covariance matrix of dimension n. Either ‘corr’ or
‘sigma’ can be specified. If ‘sigma’ is given, the problem is
standardized. If neither ‘corr’ nor ‘sigma’ is given, the
identity matrix is used for ‘sigma’.
So, to make both calculations consistent, you need to give standardised upper limits when giving a correlation matrix.
# Using covariance matrix sigma
cov <- matrix(c(0.1234,0.125,0.125,0.245), 2, 2);
x1 <- pmvnorm(mean = c(1, 1), sigma = cov, lower = -Inf, upper = c(0.7, 4));
x1;
#[1] 0.1965493
#attr(,"error")
#[1] 1e-15
#attr(,"msg")
#[1] "Normal Completion"
# Using correlation matrix corr
# Note: Need to scale the upper limits
cor <- cov2cor(cov);
x2 <- pmvnorm(mean = c(0, 0), corr = cor, lower = -Inf, upper = (c(0.7, 4) - c(1, 1)) / sqrt(diag(cov)));
x2;
#[1] 0.1965493
#attr(,"error")
#[1] 1e-15
#attr(,"msg")
#[1] "Normal Completion"
PS. It's a bit hidden, but ?pmvnorm includes a simpler example of the complementarity of the two approaches.
# Correlation and Covariance
a <- pmvnorm(lower=-Inf, upper=c(2,2), sigma = diag(2)*2)
b <- pmvnorm(lower=-Inf, upper=c(2,2)/sqrt(2), corr=diag(2))
stopifnot(all.equal(round(a,5) , round(b, 5)))
I need to create "2D data set with 200 samples created from a multivariate Gaussian distribution with a non-diagonal covariance matrix", but I'm neither a statistician nor a mathematician, and I didn't exactly get this.
Here is what I understood: a diagonal matrix is a matrix that has all zeros in the entries outside the main diagonal. Therefore, I assume non-diagonal means a matrix that doesn't have all zeros in the entries outside the main diagonal, so any random matrix would do, right? So I started by creating a random matrix; since no size is specified, I just did 100x100:
m <- matrix(rnorm(100*100), 100, 100)
I don't know how to achieve the rest. I know the sample() function, which creates a sample, but how can I create a "2D data set with 200 samples created from a multivariate Gaussian distribution"?
As long as you have a mean vector and a covariance matrix, simulating multivariate normal data is very simple, via MASS::mvrnorm. Have a look at ?mvrnorm for how to use this function.
If you do not have any special requirement on the covariance matrix, i.e., a random covariance matrix will do, you first need to create a proper one.
A covariance matrix must be positive-definite. We can create a positive-definite matrix by taking the crossproduct of a full-rank matrix: if an n * p (n >= p) matrix X has full column rank, then A = t(X) %*% X is positive-definite (hence a proper covariance matrix).
Let's first generate a random X matrix:
p <- 100 ## we want p-dimensional multivariate normal
set.seed(0); X <- matrix(runif(p * p), p, p) ## this random matrix has full rank
Then get a covariance matrix:
COV <- crossprod(X) ## t(X) %*% X but about 2 times faster
We also need mean vector. Let's assume they are 0-mean:
mu <- rep(0, p)
Now we call MASS::mvrnorm for random sampling:
library(MASS) ## no need to install
x <- mvrnorm(1000, mu, COV) ## mvrnorm(sample.size, mean, covariance)
Now x contains 1000 samples from the 100-dimensional (p-dimensional) multivariate normal distribution with mean mu and covariance COV.
> str(x)
num [1:1000, 1:100] 1.66 -2.82 6.62 6.46 -3.35 ...
- attr(*, "dimnames")=List of 2
x is a matrix, each row of which is a random sample. So in total we have 1000 rows.
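As a quick sanity check (my addition, under the settings above), the sample covariance of x should approximate COV:
range(cov(x) - COV)  # deviations are small relative to the scale of COV's entries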
For a multivariate normal distribution, each marginal distribution is still normal. Hence, we can plot histograms of the marginals. The following sketches the 1st, 10th, 20th and 30th marginals:
par(mfrow = c(2,2))
hist(x[, 1], main = "1st marginal")
hist(x[, 10], main = "10th marginal")
hist(x[, 20], main = "20th marginal")
hist(x[, 30], main = "30th marginal")