How to create 2D data set from Gaussian distribution in R?

I need to create "2D data set with 200 samples created from a multivariate Gaussian distribution with a non-diagonal covariance matrix", but I'm neither a statistician nor a mathematician, and I didn't exactly get this.
Here is what I understood. A diagonal matrix has all zeros in the entries outside the main diagonal, so I assume non-diagonal means a matrix with at least some non-zero entries outside the main diagonal, meaning any random matrix would do, right? So I started by creating a random matrix; since no size is specified, I just did 100x100:
m <- matrix(rnorm(100*100), 100, 100)
I don't know how to achieve the rest. I know the sample() function which creates a sample, but how can I create "2D data set with 200 samples created from a multivariate Gaussian distribution"?

As long as you have a mean vector and a covariance matrix, simulating from a multivariate normal is straightforward via MASS::mvrnorm. Have a look at ?mvrnorm for how to use this function.
If you have no special requirement on the covariance matrix (i.e., a random covariance matrix will do), you first need to create a proper one.
A covariance matrix must be positive-definite. We can create a positive-definite matrix by taking the crossproduct of a full-rank matrix: if an n * p (n >= p) matrix X has full column rank, then A = t(X) %*% X is positive-definite (hence a proper covariance matrix).
Let's first generate a random X matrix:
p <- 100 ## we want p-dimensional multivariate normal
set.seed(0); X <- matrix(runif(p * p), p, p) ## this random matrix has full rank (almost surely)
Then get a covariance matrix:
COV <- crossprod(X) ## t(X) %*% X but about 2 times faster
We also need a mean vector. Let's assume the mean is 0:
mu <- rep(0, p)
Now we call MASS::mvrnorm for random sampling:
library(MASS) ## no need to install
x <- mvrnorm(1000, mu, COV) ## mvrnorm(sample.size, mean, covariance)
Now x contains 1000 samples from the p-dimensional (here p = 100) multivariate normal distribution, with mean mu and covariance COV.
> str(x)
num [1:1000, 1:100] 1.66 -2.82 6.62 6.46 -3.35 ...
- attr(*, "dimnames")=List of 2
x is a matrix, each row of which is a random sample. So in total we have 1000 rows.
For a multivariate normal, the marginal distributions are still normal, so we can plot histograms of the marginals. The following sketches the 1st, 10th, 20th and 30th marginals:
par(mfrow = c(2,2))
hist(x[, 1], main = "1st marginal")
hist(x[, 10], main = "10th marginal")
hist(x[, 20], main = "20th marginal")
hist(x[, 30], main = "30th marginal")
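For the original question (a 2D data set with 200 samples), here is a minimal sketch along the same lines; the names mu2, COV2 and x2 are just illustrative, and the covariance matrix is hand-picked to be non-diagonal and positive-definite:
library(MASS)
mu2 <- c(0, 0)                       ## 2-dimensional mean vector
COV2 <- matrix(c(2, 1, 1, 2), 2, 2)  ## non-diagonal, positive-definite
x2 <- mvrnorm(200, mu2, COV2)        ## 200 samples, one per row
plot(x2, xlab = "dimension 1", ylab = "dimension 2")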

Related

How to manually calculate coefficients for Gamma GLM

The input I'm giving to the GLM function is:
glm(family=fam,data=regFrame1,start=starter1,formula=as.formula(paste(yvar,"~.+0")),na.action=na.exclude,y=T)
Where the family is Gamma and the link function is identity.
I'm trying to manually reproduce the coefficients from my model where one of them is for example:
              Estimate Std. Error t value Pr(>|t|)
coefficient A 480.6062   195.2952   2.461 0.013902 *
I know the equation I need for coefficient A is:
β_A = (XᵀX)⁻¹ Xᵀ Y
Where y is my dependent variable and x is my independent variable.
In R I write this to produce βA:
# x transposed multiplied by x when both are matrices
xtx <- t(x) %*% x
# x transposed multiplied by y when both are matrices
xty <- t(x) %*% y
# we need to inverse xtx
xtxinv <- solve(xtx, tol=0)
# finally we multiply the inverse of xtx by xty to get betaHat
betaHat <- xtxinv %*% xty
This gives betaHat = 148.
When I do this calculation manually, I get the coefficient that is produced when running a GLM with the default Gaussian family, i.e. without specifying a family, which looks like this:
glm(data=regFrame1,formula=as.formula(paste(yvar,"~.+0")),na.action=na.exclude,y=T)
So the question is: how do I tailor my manual calculation to the Gamma family with identity link, instead of the Gaussian identity default used by glm.fit in R?
The only two differences between my two runs of the glm function are:
providing the family (Gamma identity)
giving the model starting values (100 for each column in the dataframe)
I tried to recreate the glm.fit calculation manually to get out the coefficient (beta). When I didn't provide a family or starting values I got the correct answer, but when I gave Gamma as the family and identity as the link, with starting values, I got a much different coefficient.
For linear regression, which is fit by least squares, β̂ is indeed (XᵀX)⁻¹ Xᵀ Y. However, a generalized linear model is fit by iteratively reweighted least squares (IRLS), an iterative algorithm, so there is no direct formula for β̂. We can, however, compute the analogue of the hat matrix H from linear regression. In linear regression, the hat matrix is H = X (XᵀX)⁻¹ Xᵀ. In a generalized linear model, the analogue is H = W X (XᵀWX)⁻¹ Xᵀ, where W = diag(μ′(Xβ)). In both cases, H y gives the fitted values ŷ. Here is code to demonstrate.
#' Test that the two parameterizations of Gamma are the same
curve(dgamma(x, 3, scale=3), xlim=c(0, 10))
grid <- seq(0, 10, length=1000)
d <- 1/grid/gamma(3)*(grid/(1/3)/9)^3*exp(-grid/3) ## Gamma(shape=3, scale=3) density, written out by hand
plot(grid, d, type='l')
#' Generate random variates according to GLM with
#' Y_i ~ Gamma(mean=mu,
#' squared coefficient of variation (variance over squared mean) = phi)
#' Y_i ~ Gamma(shape=alpha, scale=beta)
#' mu = alpha*beta
#' phi= 1/alpha
#' Let Beta = (3, 4)
set.seed(123)
X <- data.frame(x1=runif(1000, 0, 10))
mu = (3+4*X$x1)^(-1)
y <- NULL
for (i in 1:1000) {
  alpha <- 1/3
  beta <- mu[i] * 3
  y[i] <- rgamma(1, alpha, scale=beta)
}
#' Fit the model and compute the hat matrix, then the fitted values manually
mod <- glm(y ~ ., family=Gamma(), data=X)
x <- as.matrix(cbind(1, X))
## W = diag(mu'(X beta)), evaluated at the true beta = (3, 4)
W <- diag(c(-(x %*% c(3, 4))^(-2)))
H <- W %*% x %*% solve(t(x) %*% W %*% x) %*% t(x)
#Manual fitted values
head(H%*%y)
#Fitted values from model
head(mod$fitted.values)
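Since there is no closed form, getting from the least-squares estimate to the Gamma/identity coefficients requires iterating. Below is a minimal IRLS sketch for the Gamma family with identity link; irls_gamma_identity, x, y and beta0 are hypothetical names (with x a design matrix including any intercept column), not part of the original code. For the identity link the working response is simply y, and the weights come from the Gamma variance function V(mu) = mu^2:
irls_gamma_identity <- function(x, y, beta0, maxit = 100, tol = 1e-8) {
  ## bare-bones loop: can fail if mu goes non-positive;
  ## glm.fit adds safeguards such as step-halving
  beta <- beta0
  for (it in seq_len(maxit)) {
    mu <- drop(x %*% beta)  ## identity link: mu = X beta
    w <- 1 / mu^2           ## IRLS weights: (dmu/deta)^2 / V(mu), with dmu/deta = 1 here
    beta_new <- drop(solve(t(x) %*% (w * x), t(x) %*% (w * y)))
    if (max(abs(beta_new - beta)) < tol) break
    beta <- beta_new
  }
  beta
}
With starting values like the question's (e.g. beta0 = rep(100, ncol(x))), this should track the coefficients from glm(..., family = Gamma(link = "identity")) up to convergence tolerance.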

Extract the confidence intervals of lmer random effects; plotted with dotplot(ranef())

I'm trying to extract the confidence intervals and the intercept values that are plotted with dotplot(ranef()). How can I do this?
library(lme4)
fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
lattice::dotplot(ranef(fm1, condVar=TRUE))
I tried exploring the list object fm1 but could not find the CIs.
rr <- ranef(fm1) ## condVar = TRUE has been the default for a while
Use as.data.frame: this gives the conditional mode and SD, from which you can calculate the intervals (technically, these are not "confidence intervals", because the values of the BLUPs/conditional modes are not parameters ...)
dd <- as.data.frame(rr)
transform(dd, lwr = condval - 1.96*condsd, upr = condval + 1.96*condsd)
Or with broom.mixed::tidy:
broom.mixed::tidy(fm1, effects = "ran_vals", conf.int = TRUE)
broom.mixed::tidy() uses as.data.frame.ranef.mer() (the method called by as.data.frame) internally: this function takes the rather complicated data structure described in ?lme4::ranef and extracts the conditional modes and standard deviations in a more user-friendly format:
If ‘condVar’ is ‘TRUE’ the ‘"postVar"’
attribute is an array of dimension j by j by k (or a list of such
arrays). The kth face of this array is a positive definite
symmetric j by j matrix. If there is only one grouping factor in
the model the variance-covariance matrix for the entire random
effects vector, conditional on the estimates of the model
parameters and on the data, will be block diagonal; this j by j
matrix is the kth diagonal block. With multiple grouping factors
the faces of the ‘"postVar"’ attributes are still the diagonal
blocks of this conditional variance-covariance matrix but the
matrix itself is no longer block diagonal.
In this particular case, here's what you need to do to replicate the condsd column of as.data.frame():
## get the 'postVar' attribute of the first (and only) RE term
aa <- attr(rr$Subject, "postVar")
## for each slice of the array, extract the diagonal;
## transpose and drop dimensions;
## take the square root
sqrt(c(t(apply(aa, 3, diag))))
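As a quick consistency check (a sketch, reusing dd and aa from above), the manually extracted SDs should match the condsd column produced by as.data.frame():
all.equal(sqrt(c(t(apply(aa, 3, diag)))), dd$condsd)  ## expected TRUE if the orderings line up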

Fix variances to specific values in lme4/lmer

I am doing a simulation study for a mixed effect model (three levels; observations nested within subjects within schools):
f <- lmer(measurement ~ time + race + gender + s_ses +
            fidelity + (1 + time|school/subject), mydata_long, REML=0)
The model allows the intercept and time slope to vary across subjects and schools. I am wondering how I can fix the variances to specific values. I do know how to do that when there is only a random intercept:
VarCorr(f)['subject:school']<-0.13
VarCorr(f)['school']<-0.20
However, when there is a random slope, this code doesn't work, since there are several components in the variance structure (see the attached picture).
How can I fix the variances of subject:school (Intercept), subject:school time, school (Intercept), and school time to specific values in this case? Any suggestions?
A simulation example. The hardest part is getting the random-effects parameters correctly specified; the key things you need to know are (1) internally, the random-effects variance matrix is scaled by the residual variance; (2) for vector-valued random effects (as in this random-slopes model), the variance-covariance matrix is specified in terms of its Cholesky factor: if we want covariance matrix V, there is a lower-triangular matrix C such that C %*% t(C) == V. We compute C using chol(), then read off the elements of the lower triangle (including the diagonal) in column-major order (see the helper functions below).
Set up experimental design (simplified from yours, but with the same random effects components):
mydata_long <- expand.grid(time=1:40,
                           school=factor(letters[1:25]),
                           subject=factor(LETTERS[1:25]))
Helper functions to convert from a vector of standard deviations, one or more correlation parameters (in lower-triangular/column-major order), and a residual standard deviation to a vector of "theta" parameters as used internally by lme4 (see the description above), and back the other way (conv_chol):
conv_sc <- function(sdvec, cor, sigma) {
  ## construct symmetric matrix with cor in lower/upper triangles
  cormat <- matrix(1, nrow=length(sdvec), ncol=length(sdvec))
  cormat[lower.tri(cormat)] <- cor
  cormat[upper.tri(cormat)] <- t(cormat)[upper.tri(cormat)]
  ## convert to covariance matrix and scale by 1/sigma^2
  V <- outer(sdvec, sdvec)*cormat/sigma^2
  ## extract lower triangle in column-major order
  return(t(chol(V))[lower.tri(V, diag=TRUE)])
}
conv_chol <- function(ch, s) {
  m <- matrix(NA, 2, 2)
  m[lower.tri(m, diag=TRUE)] <- ch
  m[upper.tri(m)] <- 0
  V <- m %*% t(m) * s^2
  list(sd=sqrt(diag(V)), cor=cov2cor(V)[1,2])
}
If you want to start from covariance matrices rather than standard deviations and correlations you can modify the code to skip some steps (starting and ending with V).
Pick some values and convert (and back-convert, to check):
tt1 <- conv_sc(c(0.7, 1.2), 0.3, 0.5)
tt2 <- conv_sc(c(1.4, 0.2), -0.2, 0.5)
tt <- c(tt1, tt2)
conv_chol(tt1, s=0.5)  ## should recover sd = c(0.7, 1.2), cor = 0.3
conv_chol(tt2, s=0.5)  ## should recover sd = c(1.4, 0.2), cor = -0.2
Set up formula and simulate:
form <- m ~ time + (1 + time|school/subject)
set.seed(101)
mydata_long$m <- simulate(form[-2],  ## [-2] drops the response
                          family=gaussian,
                          newdata=mydata_long,
                          newparams=list(theta=tt,
                                         beta=c(1,1),
                                         sigma=0.5))[[1]]
f <- lmer(form, data=mydata_long, REML=FALSE)
VarCorr(f)
The fitted results are close to what we requested above ...
 Groups         Name        Std.Dev. Corr
 subject:school (Intercept) 0.66427
                time        1.16488   0.231
 school         (Intercept) 1.78312
                time        0.22459  -0.156
 Residual                   0.49772
Now do the same thing 200 times, to explore the distribution of estimates:
simfun <- function() {
  mydata_long$m <- simulate(form[-2],
                            family=gaussian,
                            newdata=mydata_long,
                            newparams=list(theta=tt,
                                           beta=c(1,1),
                                           sigma=0.5))[[1]]
  f <- lmer(form, data=mydata_long, REML=FALSE)
  return(as.data.frame(VarCorr(f))[,"sdcor"])
}
set.seed(101)
res <- plyr::raply(200,suppressMessages(simfun()),.progress="text")
Here plyr::raply() is used for convenience; you can do this however you like (a for loop, lapply(), replicate(), purrr::map(), ...).
par(las=1)
boxplot(res)
## add true values to the plot, in the order reported by as.data.frame(VarCorr(f))
points(1:7, c(0.7, 1.2, 0.3, 1.4, 0.2, -0.2, 0.5), col=2, cex=3, lwd=3)

How to simulate random Y numbers from a linear model with specific X and residuals?

I want to find a way to randomly generate 100 Y values from a linear model where
Y_i = 2 − 8 X_i + e_i
I want the residuals (e_i) to come from a normal distribution with a specified mean and variance, and X to be the vector of values 1:100.
I know how to generate random variates using rnorm() but I'm not sure how to approach this more advanced matter. Any ideas for how I can specify the parameters I need would be welcome.
This should work:
X <- 1:100
Y <- 2 - 8 * X + rnorm(100, mean = 0, sd = 2)
str(Y)
#num [1:100] -3.51 -12.03 -21.05 -31.38 -36.46 ...
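Note that rnorm() is parameterized by the standard deviation, not the variance. If it is the residual variance that is specified, take its square root (a small sketch; sigma2 = 4 is just an example value):
sigma2 <- 4  ## desired residual variance (example value)
Y <- 2 - 8 * X + rnorm(100, mean = 0, sd = sqrt(sigma2))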

Solving Linear equation in R

Is there a way to compute the correlation of U = 2X1 − X2 and V = X1 + 2X2 using R (not manually), given the variance of X1, the variance of X2 and the covariance between X1 and X2?
The covariance matrix of two random variables is the 2x2 symmetric matrix whose diagonal entries are the variances of the two components and whose off-diagonal elements are their covariance. That is, if the variances of X1 and X2 are v1 and v2 and their covariance is v12, then the covariance matrix of X is matrix(c(v1, v12, v12, v2), 2). We can readily form a covariance matrix via cov(d), where d is a two-column matrix of data. To be concrete, let us form the covariance matrix of the built-in two-column data frame BOD. Then we can use the formula below to get the covariance matrix of the transformation, and use cov2cor to get the correlation matrix. The upper (and, by symmetry, the lower) off-diagonal element of the correlation matrix is the desired correlation. No packages are used.
# inputs: covariance matrix V and transformation matrix M
V <- cov(BOD)
M <- matrix(c(2, 1, -1, 2), 2)
cov2cor(M %*% V %*% t(M))[1, 2]
## [1] -0.3023
To double-check, transform BOD using M and then calculate the correlation of that; we see that the result is the same.
cor(as.matrix(BOD) %*% t(M))[1, 2]
## [1] -0.3023
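If you were given the variances and covariance as numbers rather than data, the same formula applies directly; here is a sketch with made-up inputs (v1, v2, v12 stand in for whatever values you have):
v1 <- 4; v2 <- 9; v12 <- 2           ## Var(X1), Var(X2), Cov(X1, X2) -- example values
V <- matrix(c(v1, v12, v12, v2), 2)  ## covariance matrix of (X1, X2)
M <- matrix(c(2, 1, -1, 2), 2)       ## rows give U = 2*X1 - X2 and V = X1 + 2*X2
cov2cor(M %*% V %*% t(M))[1, 2]      ## the desired correlation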
