Confusion about how R computes standard errors - r

I want to compute the standard error for a Fama-MacBeth test, and computing a standard error requires a standard deviation. For this test, the variance term is \frac{1}{n^2}\sum (x_i-\bar x)^2. As far as I know, the denominator in the usual computation is n. So my question is: when programs such as R and EViews run a linear regression, do they also compute the coefficients' standard errors from a variance with denominator n^2?
Thanks, everyone.

The formula for the standard errors of the coefficients in a linear regression can be found in introductory textbooks or, e.g., in "How are the standard errors of coefficients calculated in a regression?".
The variance of the coefficients is given by
\widehat{Var}(\hat\beta) = \hat\sigma^2 (X^T X)^{-1}
where X is the model matrix and \hat\sigma^2 is the sum of squared residuals divided by the degrees of freedom, n-k-1, with n the number of observations and k the number of covariates, assuming that the model has an intercept. The standard errors are the square roots of the diagonal elements of this matrix.
We can verify this empirically. Using the built-in mtcars dataset
fit <- lm(mpg ~ wt, mtcars)
we can see that
vcv <- (sum(fit$residuals^2)/(nrow(mtcars) - 2)) *
solve(t(model.matrix(fit)) %*% model.matrix(fit))
all.equal(summary(fit)$coefficients[, "Std. Error"],
sqrt(diag(vcv)))
# [1] TRUE
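As a side note on the 1/n^2 in the question: a factor of that order is what one would expect for the variance of a mean, as opposed to the variance of the data. The sketch below assumes that the Fama-MacBeth standard error in question is the standard error of an average of per-period estimates (that reading is an assumption, not something stated in the question), and uses mtcars$mpg purely as stand-in data.

```r
x <- mtcars$mpg
n <- length(x)

# Usual standard error of the mean: sample sd divided by sqrt(n)
se_mean <- sd(x) / sqrt(n)

# The same quantity squared, written as a sum: the denominator is n * (n - 1),
# which is close to n^2 for large n
var_mean <- sum((x - mean(x))^2) / (n * (n - 1))

# An intercept-only regression estimates the mean, so lm() reports the same standard error
fit0 <- lm(x ~ 1)
se_lm <- unname(summary(fit0)$coefficients[1, "Std. Error"])

all.equal(se_mean^2, var_mean)
all.equal(se_mean, se_lm)
```

Whether the denominator is n^2 or n(n-1) depends only on the degrees-of-freedom convention used for the variance.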

Related

95% CI for the ICC in linear mixed effects model (multilevel model, hierarchical model)

I fitted a linear mixed effects model to predict the math score as the outcome, with a participant factor (nominal or ordinal) as the fixed effect and schl as the random effect. I then compared it with the simple linear regression model using compare_performance. The output gives the ICC, but I am not sure how to calculate the 95% CI for it. (For the coefficients I used confint and it did the job.)
lm1<- lm(math~ gender, data= df)
lme1<- lmer(math~gender+(1|schl), data=df)
compare_performance(lm1,lme1)
the ICC was 0.15
From this gist from Peter Dahlgren, taken in turn from a CrossValidated answer by #Ashe, here is the crux:
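library(lme4)
# ICC = between-group variance / (between-group variance + residual variance)
calc.icc <- function(y) {
  sumy <- summary(y)
  # "id" is the grouping factor in this example; use your own name here (e.g. schl)
  (sumy$varcor$id[1]) / (sumy$varcor$id[1] + sumy$sigma^2)
}
# mymod is the fitted lmer model (lme1 in the question)
boot.icc <- bootMer(mymod, calc.icc, nsim=1000)
# Draw from the bootstrap distribution the usual 95% upper and lower confidence limits
quantile(boot.icc$t, c(0.025, 0.975))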
You can (and should) check that this calc.icc() function gives the same results as your compare_performance() function. Since this uses parametric bootstrapping, you can substitute any ICC function you like as long as it takes a fitted model as input and returns the ICC as a single numeric value. (Also, because it uses PB, it will be slow; there are potentially faster approximate methods, but PB is reliable and easy to program.)
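For a version that can be run as is, here is a minimal sketch of the same parametric bootstrap on the sleepstudy data that ships with lme4. The helper name icc_fun and the choice of 200 simulations are illustrative assumptions; it reads the variance components with as.data.frame(VarCorr()) so the grouping factor is named explicitly.

```r
library(lme4)

# Random-intercept model on a dataset bundled with lme4
fit <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)

# ICC = between-subject variance / total variance (hypothetical helper)
icc_fun <- function(mod) {
  vc <- as.data.frame(VarCorr(mod))
  vc$vcov[vc$grp == "Subject"] / sum(vc$vcov)
}

icc_hat <- icc_fun(fit)

# Parametric bootstrap: simulate from the fitted model, refit, recompute the ICC
set.seed(123)
boot_icc <- bootMer(fit, icc_fun, nsim = 200)

# Percentile 95% confidence interval
ci <- quantile(boot_icc$t, c(0.025, 0.975), na.rm = TRUE)
icc_hat
ci
```

With a real analysis a larger nsim (such as the 1000 used above) gives a more stable interval at the cost of run time.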

How to calculate R-squared in nls package (non-linear model) in R?

I fitted a non-linear regression using nls().
power<- nls(formula= agw~a*area^b, data=calibration_6, start=list(a=1, b=1))
summary(power)
I have heard that R-squared is not valid for non-linear models, and that the residual standard error, which R reports, is usually shown instead.
However, I would still like to know the R-squared. Is it possible to obtain R-squared from an nls fit?
Many thanks!!!
I found a solution. This method might not be statistically correct (as R^2 is not valid for non-linear models), but I just want to see the overall goodness of fit of my non-linear model.
Step 1> Transform the data to the log scale (common logarithm)
With the non-linear model I can't check R^2:
nls(formula= agw~a*area^b, data=calibration, start=list(a=1, b=1))
Therefore, I transform my data to the log scale:
x1<- log10(calibration$area)
y1<- log10(calibration$agw)
cal<- data.frame (x1,y1)
Step 2> Fit a linear regression
logdata<- lm (formula= y1~ x1, data=cal)
summary(logdata)
Call:
lm(formula = y1 ~ x1)
This model gives y = -0.122 + 1.42x.
But I want to force the intercept to zero, therefore,
Step 3> Force the intercept to zero
logdata2<- lm (formula= y1~ 0 + x1)
summary(logdata2)
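Now the equation is y = 1.322x, which means log(y) = 1.322 log(x),
so it's y = x^1.322.
In the power-curve model the intercept is forced to zero, and the R^2 is 0.9994. Note that for a model without an intercept, summary.lm() computes R^2 relative to zero rather than to the mean of y, so this value is not comparable with the R^2 of a model that has an intercept.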
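An alternative that stays on the original scale is a descriptive "pseudo R^2", 1 - RSS/TSS, computed directly from the nls fit. The sketch below uses simulated stand-in data, because the calibration data are not shown; the data-generating values (a = 2, b = 1.3, noise sd = 2) are assumptions for illustration only. As noted above, this number lacks the usual interpretation for non-linear models and is only a rough summary of fit.

```r
set.seed(1)

# Simulated stand-in for the calibration data (assumed values, not the real data)
calibration <- data.frame(area = seq(5, 50, length.out = 40))
calibration$agw <- 2 * calibration$area^1.3 + rnorm(40, sd = 2)

# Same power model as in the question
power <- nls(agw ~ a * area^b, data = calibration, start = list(a = 1, b = 1))

# Pseudo R^2: one minus residual sum of squares over total sum of squares
rss <- sum(residuals(power)^2)
tss <- sum((calibration$agw - mean(calibration$agw))^2)
pseudo_r2 <- 1 - rss / tss
pseudo_r2
```

Because nothing is log-transformed, the value describes how well the fitted curve tracks agw itself rather than log10(agw).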

F-test with HAC estimate

I am fitting a multiple OLS regression in R, and I know the residuals are autocorrelated. I know I can use the Newey-West correction when performing the t-test to check whether one of the coefficients is zero. I can do that using:
library(sandwich)  # NeweyWest()
library(lmtest)    # coeftest()
model <- lm(y ~ x1 + x2)
coeftest(model, vcov=NeweyWest(model))
where y is the response and x1 and x2 are the predictors. This seems a good approach since my sample size is large.
But what if I want to run an F-test of whether the coefficient of x1 is 1 and the coefficient of x2 is 0 simultaneously? I cannot find a way to do that in R while accounting for the autocorrelation of the residuals. For instance, with the function linearHypothesis it seems that Newey-West cannot be used as the vcov argument. Any suggestions? An alternative would be to bootstrap a confidence ellipse and check whether it contains my point (1,0), but I was hoping to use an F-test if possible. Thank you!
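One possible approach, offered as a sketch rather than a definitive answer: as far as I know, car::linearHypothesis() accepts a covariance matrix (or a function returning one) through its `vcov.` argument (note the trailing dot), so the Newey-West matrix can be passed there. The simulated data with AR(1) errors below is an assumption for illustration. The same Wald statistic is also computed by hand so the result does not depend on car.

```r
library(sandwich)  # NeweyWest()
library(car)       # linearHypothesis()

set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
e  <- as.numeric(arima.sim(list(ar = 0.6), n = n))  # autocorrelated errors
y  <- 1 * x1 + 0 * x2 + e
model <- lm(y ~ x1 + x2)

# HAC (Newey-West) covariance matrix of the coefficients
V <- NeweyWest(model)

# Joint test of H0: coefficient of x1 = 1 and coefficient of x2 = 0
lh <- linearHypothesis(model, c("x1 = 1", "x2 = 0"), vcov. = V, test = "F")
print(lh)

# The Wald F statistic by hand: (Rb - r)' (R V R')^{-1} (Rb - r) / q
R <- rbind(c(0, 1, 0),
           c(0, 0, 1))
r <- c(1, 0)
d <- R %*% coef(model) - r
F_manual <- as.numeric(t(d) %*% solve(R %*% V %*% t(R)) %*% d) / nrow(R)
p_manual <- pf(F_manual, nrow(R), df.residual(model), lower.tail = FALSE)
c(F = F_manual, p = p_manual)
```

lmtest::waldtest() also takes a vcov argument and may serve the same purpose when the hypothesis can be expressed as a comparison of nested models.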

Calculating logLik by hand from a logistic regression

I ran a mixed-model logistic regression, adjusting my model with a genetic relationship matrix, using the R package GMMAT (function: glmmkin()).
My output from the model includes (taken from the user manual):
theta: the dispersion parameter estimate [1] and the variance component parameter estimate [2]
coefficients: fixed effects parameter estimates (including the intercept).
linear.predictors: the linear predictors.
fitted.values: fitted mean values on the original scale.
Y: a vector of length equal to the sample size for the final working vector.
P: the projection matrix with dimensions equal to the sample size.
residuals: residuals on the original scale. NOT rescaled by the dispersion parameter.
cov: covariance matrix for the fixed effects (including the intercept).
converged: a logical indicator for convergence.
I am trying to obtain the log-likelihood in order to compute variance explained. My first instinct was to pull apart the logLik.glm function in order to compute this "by hand" and I got stuck at trying to compute AIC. I used the answer from here.
I did a sanity check with a logistic regression run with stats::glm() where the model1$aic is 4013.232 but using the Stack Overflow answer I found, I obtained 30613.03.
My question is -- does anyone know how to compute log likelihood from a logistic regression by hand using the output that I have listed above in R?
No statistical insight here, just the solution I see from looking at glm.fit. This only works if you did not specify weights while fitting the models (or if you did, you would need to include those weights in the model object)
get_logLik <- function(s_model, family = binomial(logit)) {
  n <- length(s_model$y)
  wt <- rep(1, n) # or s_model$prior.weights if field exists
  deviance <- sum(family$dev.resids(s_model$y, s_model$fitted.values, wt))
  mod_rank <- sum(!is.na(s_model$coefficients)) # or s_model$rank if field exists
  # family$aic() returns -2 * log-likelihood; adding 2 * rank gives the AIC
  aic <- family$aic(s_model$y, rep(1, n), s_model$fitted.values, wt, deviance) + 2 * mod_rank
  # logLik.glm() recovers the log-likelihood as rank - aic / 2
  log_lik <- mod_rank - aic/2
  return(log_lik)
}
For example...
model <- glm(vs ~ mpg, mtcars, family = binomial(logit))
logLik(model)
# 'log Lik.' -12.76667 (df=2)
sparse_model <- model[c("theta", "coefficients", "linear.predictors", "fitted.values", "y", "P", "residuals", "cov", "converged")]
get_logLik(sparse_model)
#[1] -12.76667
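For an ungrouped binary response with unit weights, the same number can also be obtained directly from the Bernoulli log-likelihood, which makes the "by hand" computation explicit. This sketch only reproduces logLik() for a plain glm; whether the quantity is appropriate for a mixed model fitted with glmmkin() is a separate question that it does not address.

```r
model <- glm(vs ~ mpg, mtcars, family = binomial(logit))

y <- model$y              # observed 0/1 response
p <- model$fitted.values  # fitted probabilities

# Bernoulli log-likelihood: sum of y*log(p) + (1 - y)*log(1 - p)
ll_by_hand <- sum(y * log(p) + (1 - y) * log(1 - p))

all.equal(ll_by_hand, as.numeric(logLik(model)))
```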

How does lmer (from the R package lme4) compute log likelihood?

I'm trying to understand the function lmer. I've found plenty of information about how to use the command, but not much about what it's actually doing (save for some cryptic comments here: http://www.bioconductor.org/help/course-materials/2008/PHSIntro/lme4Intro-handout-6.pdf). I'm playing with the following simple example:
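library(data.table)
library(lme4)
options(digits=15)
n <- 1000  # number of observations
m <- 100   # number of groups
data <- data.table(id=sample(1:m, n, replace=TRUE), key="id")
b <- rnorm(m)                        # one random intercept per group
data$y <- b[data$id] + rnorm(n)*0.1  # group effect plus residual noise
fitted <- lmer(y ~ (1|id), data=data, verbose=TRUE)
fitted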
I understand that lmer is fitting a model of the form Y_{ij} = beta + B_i + epsilon_{ij}, where epsilon_{ij} and B_i are independent normals with variances sigma^2 and tau^2 respectively. If theta = tau/sigma is fixed, I computed the estimate for beta with the correct mean and minimum variance to be
c = sum_{i,j} alpha_i y_{ij}
where
alpha_i = lambda/(1 + theta^2 n_i)
lambda = 1/[\sum_i n_i/(1+theta^2 n_i)]
n_i = number of observations from group i
I also computed the following unbiased estimate for sigma^2:
s^2 = \sum_{i,j} alpha_i (y_{ij} - c)^2 / (1 + theta^2 - lambda)
These estimates seem to agree with what lmer produces. However, I can't figure out how log likelihood is defined in this context. I calculated the probability density to be
pd(Y_{ij}=y_{ij}) = \prod_{i,j}[f_sigma(y_{ij}-ybar_i)]
* \prod_i[f_{sqrt(sigma^2/n_i+tau^2)}(ybar_i-beta) sigma sqrt(2 pi/n_i)]
where
ybar_i = \sum_j y_{ij}/n_i (the mean of observations in group i)
f_sigma(x) = 1/(sqrt{2 pi} sigma) exp(-x^2/(2 sigma^2)) (normal density with sd sigma)
But log of the above is not what lmer produces. How is log likelihood computed in this case (and for bonus marks, why)?
Edit: Changed notation for consistency, struck out incorrect formula for standard deviation estimate.
The links in the comments contained the answer. Below I've put what the formulae simplify to in this simple example, since the results are somewhat intuitive.
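lmer fits a model of the form Y_{ij} = \beta + B_i + \epsilon_{ij}, where \epsilon_{ij} and B_i are independent normals with variances \sigma^2 and \tau^2 = \theta^2 \sigma^2 respectively. The joint probability distribution of Y_{ij} and B_i is therefore
\prod_{i,j} f_\sigma(y_{ij} - \beta - b_i) \prod_i f_\tau(b_i)
where
f_\sigma(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp(-x^2/(2\sigma^2)).
The likelihood is obtained by integrating this with respect to b_i (which isn't observed) to give
L(\beta, \sigma, \theta) = \prod_{i,j} f_\sigma(y_{ij} - \bar y_i) \prod_i [ f_{\sigma\sqrt{\theta^2 + 1/n_i}}(\bar y_i - \beta) \sqrt{2\pi\sigma^2/n_i} ]
where n_i is the number of observations from group i, and \bar y_i is the mean of observations from group i. This is somewhat intuitive since the first term captures spread within each group, which should have variance \sigma^2, and the second captures the spread between groups. Note that \sigma^2(\theta^2 + 1/n_i) is the variance of \bar y_i.
However, by default (REML=T) lmer maximises not the likelihood but the "REML criterion", obtained by additionally integrating this with respect to \beta to give
L_R(\sigma, \theta) = \prod_{i,j} f_\sigma(y_{ij} - \bar y_i) \prod_i [ f_{\sigma\sqrt{\theta^2 + 1/n_i}}(\bar y_i - \hat\beta) \sqrt{2\pi\sigma^2/n_i} ] \sqrt{2\pi\sigma^2 / \sum_i \frac{n_i}{1 + \theta^2 n_i}}
where \hat\beta is given below.
Maximising likelihood (REML=F)
If \theta is fixed, we can explicitly find the \beta and \sigma^2 which maximise likelihood. With N = \sum_i n_i, they turn out to be
\hat\beta = \sum_i \frac{n_i}{1 + \theta^2 n_i} \bar y_i / \sum_i \frac{n_i}{1 + \theta^2 n_i}
\hat\sigma^2 = \frac{1}{N} [ \sum_{i,j} (y_{ij} - \bar y_i)^2 + \sum_i \frac{n_i}{1 + \theta^2 n_i} (\bar y_i - \hat\beta)^2 ]
Note \hat\sigma^2 has two terms for variation within and between groups, and \hat\beta is somewhere between the mean of y_{ij} and the mean of \bar y_i depending on the value of \theta.
Substituting these into likelihood, we can express the log likelihood l in terms of \theta only:
-2 l(\theta) = \sum_i \log(1 + \theta^2 n_i) + N (1 + \log(2\pi\hat\sigma^2))
lmer iterates to find the value of \theta which minimises this. In the output, -2l and l are shown in the fields "deviance" and "logLik" (if REML=F) respectively.
Maximising restricted likelihood (REML=T)
Since the REML criterion doesn't depend on \beta, we use the same estimate for \beta as above. We estimate \sigma^2 to maximise the REML criterion:
\hat\sigma_R^2 = \frac{1}{N-1} [ \sum_{i,j} (y_{ij} - \bar y_i)^2 + \sum_i \frac{n_i}{1 + \theta^2 n_i} (\bar y_i - \hat\beta)^2 ]
The restricted log likelihood l_R is given by
-2 l_R(\theta) = \sum_i \log(1 + \theta^2 n_i) + \log(\sum_i \frac{n_i}{1 + \theta^2 n_i}) + (N-1)(1 + \log(2\pi\hat\sigma_R^2))
In the output of lmer, -2 l_R and l_R are shown in the fields "REMLdev" and "logLik" (if REML=T) respectively.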
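The closed-form profiled criterion for this one-way model can be checked numerically against lmer. The sketch below is written under the assumption that theta denotes the ratio tau/sigma, as in the question; the function name profiled_dev, the simulated data, and the search interval for theta are illustrative choices. It minimises the criterion over theta with optimize() and compares the minimum with -2 * logLik() from lmer, for both ML and REML.

```r
library(lme4)

set.seed(42)
m <- 50; n <- 500
d <- data.frame(id = sample(1:m, n, replace = TRUE))
d$y <- rnorm(m)[d$id] + 0.5 * rnorm(n)  # tau = 1, sigma = 0.5

# -2 * (restricted) log likelihood as a function of theta = tau / sigma,
# with beta and sigma^2 profiled out in closed form
profiled_dev <- function(theta, y, id, REML = FALSE) {
  n_i  <- as.numeric(table(id))             # group sizes
  ybar <- as.numeric(tapply(y, id, mean))   # group means (same group order)
  w    <- n_i / (1 + theta^2 * n_i)         # weights n_i / (1 + theta^2 n_i)
  beta_hat <- sum(w * ybar) / sum(w)        # weighted mean of group means
  S  <- sum((y - ave(y, id))^2) + sum(w * (ybar - beta_hat)^2)
  df <- if (REML) length(y) - 1 else length(y)
  dev <- sum(log(1 + theta^2 * n_i)) + df * (1 + log(2 * pi * S / df))
  if (REML) dev <- dev + log(sum(w))        # extra term from integrating out beta
  dev
}

opt_ml   <- optimize(profiled_dev, c(0, 20), y = d$y, id = d$id)
opt_reml <- optimize(profiled_dev, c(0, 20), y = d$y, id = d$id, REML = TRUE)

fit_ml   <- lmer(y ~ (1 | id), data = d, REML = FALSE)
fit_reml <- lmer(y ~ (1 | id), data = d, REML = TRUE)

# Closed form versus lmer
c(by_hand = opt_ml$objective,   lmer = as.numeric(-2 * logLik(fit_ml)))
c(by_hand = opt_reml$objective, lmer = as.numeric(-2 * logLik(fit_reml)))
```

The two pairs should agree up to optimiser tolerance; the minimising theta can likewise be compared with getME(fit_ml, "theta").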
