Estimating bias in linear regression and linear mixed model in R simulation - r

I want to run simulations to estimate the bias in a linear model and in a linear mixed model. The bias is E(beta) - beta, where beta is the association between my X and Y.
I generated my X variable from a normal distribution and Y from a multivariate normal distribution.
I understand how I can calculate E(beta) from simulations, which is the sum of the beta estimates from all simulations divided by the total number of simulations, but I am not sure how I can estimate the true beta.
meanY <- meanY + X*betaV
This is how I generated the meanY (betaV is the effect size) that is then used to generate multivariate Y outcome as shown below.
Y[jj,] <- rnorm(nRep, mean=meanY[jj], sd=sqrt(varY))
From my limited understanding, the true beta is not obtained from the data; it is the fixed beta value that I set in the simulation.
Based on how I generated my data, how can I estimate the true beta?

There are a couple of methods for simulating bias. I'll take an easy example using a linear model. A linear mixed model could likely use a similar approach; however, I am not certain it would work well for a generalized linear mixed model.
A simple method for estimating bias, when working with a simple linear model, is to choose the model from which to estimate the bias. Let's say, for example, Y = 3 + 4 * X + e. I have chosen beta <- c(3,4), so I only need to simulate my data. For a linear model, the model assumptions are:
Observations are independent
Observations are normally distributed
The mean can be described by the linear predictor
Using these 3 assumptions, simulating a fixed design is simple.
set.seed(1)
xseq <- seq(-10,10)
xlen <- length(xseq)
nrep <- 100
#Simulate X given a flat prior (uniformly distributed. A normal distribution would likely work fine as well)
X <- sample(xseq, size = xlen * nrep, replace = TRUE)
beta <- c(3, 4)
esd = 1
emu <- 0
e <- rnorm(xlen * nrep, emu, esd)
Y <- cbind(1, X) %*% beta + e
fit <- lm(Y ~ X)
bias <- coef(fit) - beta
> bias
(Intercept) X
0.0121017239 0.0001369908
This indicates a small bias. To test whether this bias is significant, we could perform a Wald test or t-test (or replicate the process 1000 times and check the distribution of outcomes).
#Simulate linear model many times
model_frame <- cbind(1,X)
emany <- matrix(rnorm(xlen * nrep * 1000, emu, esd),ncol = 1000)
#add simulated noise. Sweep adds X %*% beta across all columns of emany
Ymany <- sweep(emany, 1, model_frame %*% beta, "+")
#fit many models simultaneously (lm accepts a matrix response and fits one model per column)
manyFits <- lm(Ymany ~ X)
#Plot density of fitted parameters
par(mfrow=c(1,2))
plot(density(coef(manyFits)[1,]), main = "Density of intercept")
plot(density(coef(manyFits)[2,]), main = "Density of beta")
#Calculate bias; here I use sweep to subtract beta across all rows of my coefficients
biasOfMany <- rowMeans(sweep(coef(manyFits), 1, beta, "-"))
>biasOfMany
(Intercept) X
5.896473e-06 -1.710337e-04
Here we see that the estimated bias is much smaller, and that it has changed sign for the coefficient of X, which gives reason to believe the bias is not significant.
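One way to make "insignificant" more concrete is to compare the Monte Carlo bias with its Monte Carlo standard error. The sketch below is self-contained and re-creates a smaller version of the simulation above. The names n_obs, n_sim, est, mc_bias, mc_se and z are my own and the sample sizes are assumed values, so treat it as an illustration rather than the one right way to do it.

```r
set.seed(1)
beta  <- c(3, 4)    # true coefficients, fixed by the simulation design
n_obs <- 200        # observations per simulated data set (assumed value)
n_sim <- 1000       # number of simulated data sets (assumed value)
X     <- sample(seq(-10, 10), size = n_obs, replace = TRUE)

# One column of Gaussian noise per simulated data set
E     <- matrix(rnorm(n_obs * n_sim, mean = 0, sd = 1), ncol = n_sim)
# The mean vector is recycled down each column of E
Ymany <- drop(cbind(1, X) %*% beta) + E

# lm() fits one regression per column of a matrix response
est <- coef(lm(Ymany ~ X))                 # 2 x n_sim matrix of estimates

mc_bias <- rowMeans(est) - beta            # estimate of E(beta_hat) - beta
mc_se   <- apply(est, 1, sd) / sqrt(n_sim) # Monte Carlo standard error of the mean
z       <- mc_bias / mc_se                 # roughly N(0, 1) if the estimator is unbiased
print(cbind(bias = mc_bias, se = mc_se, z = z))
```

If the absolute value of z is small (say below 2), the simulated bias is within Monte Carlo noise of zero.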
Changing the design would allow one to look into bias of interactions, outliers and other stuff using the same method.
For linear mixed models, one could use the same method; however, you would have to design the random effects, which requires some more work, and the implementation of lmer, as far as I know, does not fit a model across all columns of Y.
However, b (the random effects) could be simulated, and so could any noise parameters. Do note that, as b is a single vector containing a single outcome of a simulation (often from a multivariate normal distribution), one would have to re-run the model for each simulation of b. This increases the number of times one has to re-run the model fitting procedure in order to get a good estimate of the bias.
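A minimal sketch of that refit-per-simulation loop for a random-intercept model is shown here. Note that I have swapped lmer for lme from the nlme package, only because nlme ships with standard R installations; lme4::lmer could be used in the same way. The design (nSubj, nRep, sd_b, sd_e, n_sim) and the helper one_fit are assumptions for illustration. The key point for the question is that the true beta is simply betaV, the value used to generate the data.

```r
library(nlme)   # recommended package bundled with R; used here instead of lme4

set.seed(1)
betaV <- 0.5    # true slope: chosen by the simulation, not estimated
nSubj <- 40     # number of clusters (assumed)
nRep  <- 5      # repeated measures per cluster (assumed)
sd_b  <- 1      # SD of the random intercepts (assumed)
sd_e  <- 1      # residual SD (assumed)
n_sim <- 100    # kept small so the sketch runs quickly

# Simulate one data set and return the estimated slope of X
one_fit <- function() {
  X <- rnorm(nSubj)                 # one covariate value per cluster
  b <- rnorm(nSubj, 0, sd_b)        # fresh random intercepts on every run
  d <- data.frame(id = factor(rep(seq_len(nSubj), each = nRep)),
                  X  = rep(X, each = nRep))
  d$Y <- d$X * betaV + rep(b, each = nRep) + rnorm(nSubj * nRep, 0, sd_e)
  fit <- lme(Y ~ X, random = ~ 1 | id, data = d)
  fixef(fit)[["X"]]
}

est  <- replicate(n_sim, one_fit())
bias <- mean(est) - betaV           # E(beta_hat) - beta
mcse <- sd(est) / sqrt(n_sim)       # Monte Carlo standard error of the bias
print(c(bias = bias, mcse = mcse))
```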

Related

Simulating likelihood ratio test (LRT) pvalue using Monte Carlo method [migrated]

This question was migrated from Stack Overflow because it can be answered on Cross Validated.
Migrated 26 days ago.
I'm trying to figure out my assignment to simulate lrt test p-value output using the Monte Carlo method. As far as I understand, the lrt test is supposed to test for "better", more accurate model.
I know how to perform such a test:
nested <- glm(finalgrade~absences,data=grades)
complex <- glm(finalgrade~absences+age,data=grades)
lrtest(nested, complex)
From there I can return my p-value and perform some calculations like type I and type II errors or power of a test and see how it changes depending of number of simulations.
My question is how am I supposed to simulate the random data. It doesn't have to be grades or school related stuff this was just a showcase of my understanding.
I was thinking about making data frame with 3 to 4 columns with 1 column being a dependent value (0,1) and the rest being random numbers generated from the normal distribution or some different distribution.
But I don't know if this approach will create understandable results, or if this even makes sense.
I looked at this function, but it didn't really help me to understand anything.
I came up with something like this:
library(lmtest)
n <- 1000
depentend = sample(c(0,1), replace=TRUE, size=n)
pvalue <- c()
for(i in 1:1000) {
independend_x = rnorm(n, mean = 2,sd = 0.2)
independend_y = rnorm(n, mean = 7,sd = 0.5)
nested <- lm(depentend~independend_x)
complex <- lm(depentend~independend_x + independend_y)
lrtest(nested, complex)
pvalue <- c(pvalue, as.numeric(lrtest(nested, complex)[5][2,1]))
}
but I don't know if this is the right direction.
I would be really thankful if someone could help me to understand how to simulate data for the Monte Carlo sampling method.
Monte Carlo simulations are performed to compute a distribution of something that is difficult to compute or for which one is too lazy to perform the exact computation.
The likelihood ratio test computes a p-value based on the distribution of the likelihood ratio $\Lambda$, and that distribution is what you want to simulate instead of computing or estimating it with formulas. The trick is to use simulation instead of computation.
Your problem does not seem to be so much how to perform the simulations, but more like what is the distribution that you are interested in and want to simulate and what are the boundary conditions that you need to fix. Which computation or estimation is it that you want to replace/estimate with simulation?
For your likelihood ratio test you probably want to test the hypothesis $H_0: \theta_{age} = 0$ against the alternative hypothesis $H_a: \theta_{age} \neq 0$. In this case you compute the ratio of the likelihood $\mathcal{L}$ where one of the hypotheses is a composite hypothesis and you select the highest likelihood among them.
$$\Lambda = \frac{\mathcal{L}(\theta_{age} = 0 \mid \text{some data})}{\sup_{\theta_{age} \neq 0}\mathcal{L}(\theta_{age} \mid \text{some data})} = \frac{\mathcal{L}(\theta_{age} = 0 \mid \text{some data})}{\mathcal{L}(\hat\theta_{age} \mid \text{some data})}$$ where the supremum is found by using the likelihood for the maximum likelihood estimator $\hat\theta_{age}$.
To compute these likelihood functions you need assumptions about the distributions. In your case you do this with glm (where you need to decide on a distribution and link function) or, more simply, lm (which assumes a Gaussian conditional distribution for the data).
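To make the statistic concrete, here is a small sketch that computes $-2\log\Lambda$ by hand from two nested lm fits and compares it with the large-sample chi-squared reference. The data are simulated purely for illustration; x2 plays the role of age and its true coefficient is zero, and all variable names are my own.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)                  # stands in for "age"; its true effect is zero
y  <- 1 + 0.5 * x1 + rnorm(n)   # Gaussian conditional distribution, as lm assumes

m0 <- lm(y ~ x1)                # nested model (theta_age = 0)
m1 <- lm(y ~ x1 + x2)           # larger model (theta_age free)

# -2 log(Lambda): twice the gain in maximised log-likelihood
stat <- as.numeric(-2 * (logLik(m0) - logLik(m1)))
# Difference in the number of estimated parameters
df   <- attr(logLik(m1), "df") - attr(logLik(m0), "df")
# Large-sample (chi-squared) approximation to the p-value
p_chisq <- pchisq(stat, df = df, lower.tail = FALSE)
print(c(statistic = stat, df = df, p.value = p_chisq))
```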
The simulations are then computed for a given null hypothesis. For instance, given some data, you assume that $\theta_{age} = 0$ and you want to compute the distribution of the outcomes of $\Lambda$. You need some more data and parameters:
The independent variables. These you probably want to fix at some values that relate to your practical problem. You want to know the distribution given some independent variables. Potentially you may wish to study what happens when there is an error in these independent variables, in that case you may also simulate these variables.
The variance/dispersion/noise-level of the conditional distribution. This you may vary to see how this influences the statistic. Or you have some value of interest, for instance if you have data for which you estimated the noise.
The other coefficients. These you may likewise vary or keep fixed depending on the situation: whether you want to model a particular situation or a broader range of situations.
Example
The code below computes a simulation for a given regressor matrix (the independent variables) and given other coefficients. For large sample size the distribution will approach a chi-squared distribution. The simulation shows that using that limit as an estimate for the distribution underestimates the p-value by a lot.
(I ran the code with only 5000 simulations because I am using an online R editor and compiler; on your own computer you can get more precise results.)
n_sim = 5*10^3
### simulate likelihood ratio test
### given coefficient and independent variables
### we assume a logistic model with binomial distribution
sim = function(theta1, X) {
### compute model
Z = X %*% theta1
p = 1/(1+exp(-Z))
### simulate dependent variable
Y = rbinom(length(p), 1, p)
### compute (log)likelihood ratio
mod1 = glm(Y ~ 1 + X[,2] + X[,3], family = binomial)
mod0 = glm(Y ~ 1 + X[,2], family = binomial)
logratio = -2*(logLik(mod0)-logLik(mod1))
return(as.numeric(logratio))
}
set.seed(1)
n = 10
### coefficients with the last one zero
theta1 = c(1,1,0)
### some regressor matrix, independent variables
X = cbind(rep(1,n), matrix(rnorm(n*2),n)) ### first column is intercept
### simulate
Lsim = replicate(n_sim,sim(theta1,X))
### ordering for empirical distribution
Lsim = Lsim[order(Lsim)]
perc = c(1:length(Lsim))/length(Lsim)
plot(Lsim,1-perc, main = "empirical distribution", ylab = "P(likelihood > L)", xlab = "L", type = "l")
lines(qchisq(perc,1),1-perc, lty = 2)
legend(8,1, c("n=10","n=40", "chi-squared estimate"), lty = c(1,1,2), col = c(1,2,1))
#### repeat with larger n
set.seed(1)
n = 40
theta1 = c(1,1,0)
X = cbind(rep(1,n), matrix(rnorm(n*2),n))
Lsim2 = replicate(n_sim,sim(theta1,X))
Lsim2 = Lsim2[order(Lsim2)]
lines(Lsim2, 1-perc, col = 2)
Note that there are many variants and this is just an example of what simulation does. Here we simulate data based on a given distribution. (And it replaces a computation that we could not perform. We had an estimate with a chi-squared distribution, but that is not accurate for small $n$.)
Other times this distribution is not known and one uses the data and some resampling method to simulate/estimate the distribution of the statistic.
For your situation you need to figure out what exact computation (for which information/conditions are given) it is that you want to replace by using simulations.
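As one possible concrete instance, the sketch below turns a simulated null distribution into a Monte Carlo p-value for an observed statistic. It is an assumption on my part to simulate from the fitted null model (a parametric bootstrap) with the regressors held fixed; the "observed" data are themselves simulated here, and lr_stat, obs, sims and p_mc are hypothetical names.

```r
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(0.5 + x1))   # stand-in for the observed data

# -2 log(Lambda) for H0: coefficient of x2 equals zero, logistic model
lr_stat <- function(y, x1, x2) {
  m0 <- glm(y ~ x1, family = binomial)
  m1 <- glm(y ~ x1 + x2, family = binomial)
  as.numeric(-2 * (logLik(m0) - logLik(m1)))
}

obs <- lr_stat(y, x1, x2)

# Simulate responses under H0 from the fitted null model, keeping x1 and x2 fixed
p0    <- fitted(glm(y ~ x1, family = binomial))
n_sim <- 500
sims  <- replicate(n_sim, lr_stat(rbinom(n, 1, p0), x1, x2))

# Share of simulated statistics at least as large as the observed one
p_mc    <- (1 + sum(sims >= obs)) / (n_sim + 1)
p_chisq <- pchisq(obs, df = 1, lower.tail = FALSE)   # asymptotic reference
print(c(monte_carlo = p_mc, chi_squared = p_chisq))
```

For small n the two p-values can differ noticeably, which is the point made with the plot in the example above.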

95% CI for the ICC in linear mixed effects model (multilevel model, hierarchical model)

I fitted a linear mixed effects model to predict the math score as the outcome, with x = participant factor (nominal or ordinal) as the fixed effect and schl as the random effect. Then I compared it with the simple linear regression model using compare_performance, and while the output gives the ICC, I was not sure how to calculate the 95% CI for it. (For the coefficients I used confint and it did the job.)
lm1<- lm(math~ gender, data= df)
lme1<- lmer(math~gender+(1|schl), data=df)
compare_performance(lm1,lme1)
the ICC was 0.15
From this gist from Peter Dahlgren, taken in turn from a CrossValidated answer by @Ashe, here is the crux:
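# y is a fitted lmer model; 'id' must match the name of the grouping factor
# in that model (it would be 'schl' for the model in the question)
calc.icc <- function(y) {
sumy <- summary(y)
(sumy$varcor$id[1]) / (sumy$varcor$id[1] + sumy$sigma^2)
}
# mymod is the fitted lmer model (lme1 in the question)
boot.icc <- bootMer(mymod, calc.icc, nsim=1000)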
#Draw from the bootstrap distribution the usual 95% upper and lower confidence limits
quantile(boot.icc$t, c(0.025, 0.975))
You can (and should) check that this calc.icc() function gives the same results as your compare_performance() function. Since this uses parametric bootstrapping, you can substitute any ICC function you like as long as it takes a fitted model as input and returns the ICC as a single numeric value. (Also, because it uses PB, it will be slow; there are potentially faster approximate methods, but PB is reliable and easy to program.)

Fitting GLM (family = inverse.gaussian) on simulated AR(1)-data.

I am encountering quite an annoying and, to me, incomprehensible problem, and I hope some of you can help me. I am trying to estimate the autoregression (the influence of previous measurements of variable X on the current measurement of X) for 4 groups whose distributions are positively skewed to various degrees. The theory is that more positively skewed distributions have less variance, and since the relationship between 2 variables depends on the amount of shared variance, positively skewed distributions have a smaller autoregression than more normally distributed variables.
I use simulations to investigate this, and generate data as follows: I simulate data for n people with tp time points. I use a fixed autoregressive parameter, phi (at .3 so we have a stationary process). To generate positively skewed distributions I use a chi-square distributed error. Individuals differ in the degrees of freedom that is used for the chi2 distributed errors. In other words, degrees of freedom is a level 2 variable (and is in itself chi2(1)-distributed). Individuals with a very low df get a very skewed distribution whereas individuals with a higher df get a more normal distribution.
for(i in 1:n) { # Loop over persons.
chi[i, 1] <- rchisq(1, df[i]) # Set initial value.
for(t in 2:(tp + burn)) { # Loop over time points.
chi[i, t] <- phi[i] * chi[i, t - 1] + # Autoregressive effect.
rchisq(1, df[i]) # Chi-square distributed error.
} # End loop over time points.
} # End loop over persons.
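For anyone who wants to run the generating loop, here is a self-contained sketch. The question does not show how n, tp, burn, phi, df and chi are set up, so the values and the initialisation below are assumptions that follow the verbal description (phi fixed at .3, person-specific df drawn from a chi2(1) distribution).

```r
set.seed(1)
n    <- 20                  # persons (assumed)
tp   <- 50                  # retained time points (assumed)
burn <- 100                 # burn-in time points, discarded afterwards (assumed)
phi  <- rep(0.3, n)         # fixed autoregressive parameter for every person
df   <- rchisq(n, df = 1)   # person-specific degrees of freedom, chi2(1)

chi <- matrix(NA_real_, nrow = n, ncol = tp + burn)
for (i in seq_len(n)) {                   # loop over persons
  chi[i, 1] <- rchisq(1, df[i])           # initial value
  for (t in 2:(tp + burn)) {              # loop over time points
    chi[i, t] <- phi[i] * chi[i, t - 1] + # autoregressive effect
      rchisq(1, df[i])                    # chi-square distributed error
  }
}
chi <- chi[, (burn + 1):(burn + tp)]      # drop the burn-in columns
```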
Now that I have the outcome variable generated, I put it in long format, I create a lagged predictor, and I person mean center the predictor (or group mean center, or cluster mean center, all the same). I call this lagged and centered predictor chi.pred. I make the subgroups based on the degrees of freedom of individuals. The 25% with the lowest df go in subgroup 1, the 26%-50% in subgroup 2, etc.
The problem is this: fitting a multilevel (i.e. mixed or random effects model) autoregressive(1) model with family = inverse.gaussian and link = 'identity', using glmer() from the lme4 package gives me quite a lot of warnings. E.g. "degenerate Hessian", "large eigen value/ratio", "failed to converge with max|grad", etc.. I just don't get why.
The models I fit are:
# Random intercept, but fixed slope with subgroups as level 2 predictor of slope.
lmer(chi ~ chi.pred + chi.pred:factor(sub.df.noise) + (1|id), data = sim.data, control = lmerControl(optimizer = 'bobyqa'))
# Random intercept and slope.
lmer(chi ~ chi.pred + (1 + chi.pred|id), data = sim.data, control = lmerControl(optimizer = 'bobyqa'))
The reason I use inverse gaussian is because it is said to work better on skewed data.
Does anybody have any clue why I can't fit the models? I have tried increasing sample size and time points, different optimizers, I have double-double-double checked if lagging and centering the data is correct, increased the number of iterations, added some noise to the subgroups (since otherwise they are 1 on 1 related to degree of freedom) etc.

How does glmnet standardize variables when weights are present?

glmnet allows the user to input a vector of observation weights through the weights argument. glmnet also standardizes (by default) the predictor variables to have zero mean and unit variance. My question is: when weights is provided, does glmnet standardize the predictors using the weighted mean (and standard deviation) of each column or the unweighted mean (and standard deviation)?
There's a description of glmnet's standardization at Link
In the post you can see the Fortran-Code-Snippet of glmnet's source that computes the standardization. ("Proof" paragraph, second bullet).
I'm not familiar with Fortran, but to me it looks very much like it is in fact using the weighted mean and sd.
Edit: From the glmnet vignette:
"weights is for the observation weights. Default is 1 for each
observation. (Note: glmnet rescales the weights to sum to N, the
sample size.)"
With w in the Fortran code being the rescaled weights, this seems to be consistent with weighted mean standardization.
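To illustrate what weighted standardization would mean in practice, here is a base R sketch. It assumes the weighted mean and a weighted standard deviation that divides by the total weight (not n - 1), which is how I read the snippet; it does not call glmnet and is not a statement about its internals. The names w_raw, w_mean, w_sd and x_std are my own.

```r
set.seed(1)
n     <- 50
x     <- matrix(rnorm(n * 3), n)
w_raw <- rpois(n, 3) + 1            # raw observation weights
w     <- w_raw / sum(w_raw)         # normalised here to sum to 1

w_mean <- colSums(w * x)                    # weighted column means
xc     <- sweep(x, 2, w_mean, "-")          # centre each column
w_sd   <- sqrt(colSums(w * xc^2))           # weighted SD, divisor = total weight
x_std  <- sweep(xc, 2, w_sd, "/")           # weighted-standardised predictors

# Sanity check: integer weights behave like row replication, so the
# weighted mean equals the plain mean of the expanded data set
idx <- rep(seq_len(n), times = w_raw)
print(all.equal(w_mean, colMeans(x[idx, ])))
```

Under this definition the standardised columns have weighted mean 0 and weighted variance 1.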
For what it's worth, consistent with the accepted answer, the weights in glmnet are sampling weights, and not inverse variance weights. For example if you have many more observations than unique observations, you can compress your dataset and get the same coefficient estimates:
n <- 50
m <- 5
y_norm <- rnorm(n)
y_bool <- rbinom(n,1,.5)
x <- matrix(rnorm(n*m),n)
w <- rpois(n,3) + 1 # weights
w_indx <- rep(1:n,times=w) # weights index
m1 = glmnet(x, y_norm, weights = w)
m2 = glmnet(x[w_indx,] ,y_norm[w_indx])
all.equal(coef(m1,s=.1),
coef(m2,s=.1))
>>> TRUE
M1 = glmnet(x,y_bool,weights = w,family = "binomial")
M2 = glmnet(x[w_indx,],y_bool[w_indx],family = "binomial")
all.equal(coef(M1,s=.1),
coef(M2,s=.1))
>>> TRUE
Of course, a bit more care needs to be used when using weights with cv.glmnet since the weights of aggregated records should be spread across folds using a multinomial distribution...
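A rough sketch of that multinomial spreading, in base R only: each aggregated record i stands for w[i] underlying observations, and those observations are assigned to K folds with equal probability. K, fold_w and the equal fold probabilities are my assumptions, and this does not show how to pass the result to cv.glmnet.

```r
set.seed(1)
n <- 10
K <- 5                        # number of CV folds (assumed)
w <- rpois(n, 3) + 1          # aggregated record counts (weights)

# Column i holds how many of record i's w[i] observations land in each fold
fold_w <- sapply(w, function(wi) rmultinom(1, size = wi, prob = rep(1 / K, K)))

print(dim(fold_w))            # K rows (folds) by n columns (records)
print(colSums(fold_w))        # adds back up to w
```

One possible use: when fold k is held out, record i would get training weight w[i] - fold_w[k, i] and validation weight fold_w[k, i].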

JAGS Random Effects Model Prediction

I'm trying to fit a Bayesian regression using an index as the response (D47), temperature as the predictor (Temp), and a random effect for a discrete variable (Material). I've found really good information regarding non-hierarchical regressions, with some posts even including a prediction strategy for these models. Despite this, I've run into a problem when predicting D47 values in my model, mostly because of the random intercept.
Is there any way to deal with a random intercept during the prediction of a JAGS regression?
Thanks for your answer,
model1<-"model {
# Priors
mu_int~dnorm(0, 0.0001) # Mean hyperparameter for random intercepts
sigma_int~dunif(0, 100) # SD hyperparameter for random intercepts
tau_int <- 1/(sigma_int*sigma_int)
for (i in 1:n) {
alpha[i]~dnorm(mu_int, tau_int) # Random intercepts
}
beta~dnorm(0, 0.01) # Common slope
sigma_res~dunif(0, 100) # Residual standard deviation
tau_res <- 1/(sigma_res*sigma_res)
# Likelihood
for (i in 1:n) {
mu[i] <- alpha[Mat[i]]+beta*Temp[i] # Expectation
D47[i]~dnorm(mu[i], tau_res) # The actual (random) responses
}
}"
Sure, you can make predictions with the random intercepts; all you need to do is specify the prediction as an extra node in the model.
Try adding something like this to the model.
for(i in 1:(n)){
D47_pred[i] ~ dnorm(mu[i], tau_res) # '~' draws a prediction from the fitted distribution
}
And then track D47_pred as a parameter.
edit:
Also, you need to change how you specify the prior for the random intercept. This will take a couple steps (updated code here from comments).
You will need to add a new constant to your data list, which represents the number of unique groups in vector Mat. I have labeled it M in this case (e.g. 4 groups in Mat, M = 4)
for (j in 1:(M)){
alpha[j] ~ dnorm(mu_int, tau_int) # Random intercepts
}
This specification creates the correct number of random intercepts for your model: one per group in Mat.
