Calculating logLik by hand from a logistic regression - r

I ran a mixed-model logistic regression, adjusting my model with a genetic relationship matrix, using the R package GMMAT (function: glmmkin()).
My output from the model includes (taken from the user manual):
theta: the dispersion parameter estimate [1] and the variance component parameter estimate [2]
coefficients: fixed effects parameter estimates (including the intercept).
linear.predictors: the linear predictors.
fitted.values: fitted mean values on the original scale.
Y: a vector of length equal to the sample size for the final working vector.
P: the projection matrix with dimensions equal to the sample size.
residuals: residuals on the original scale. NOT rescaled by the dispersion parameter.
cov: covariance matrix for the fixed effects (including the intercept).
converged: a logical indicator for convergence.
I am trying to obtain the log-likelihood in order to compute variance explained. My first instinct was to pull apart the logLik.glm function to compute this "by hand", but I got stuck trying to compute the AIC. I used the answer from here.
I did a sanity check with a logistic regression run with stats::glm(): model1$aic is 4013.232, but using the Stack Overflow answer I found, I obtained 30613.03.
My question is: does anyone know how to compute the log-likelihood of a logistic regression by hand in R, using the output listed above?

No statistical insight here, just the solution I see from looking at glm.fit. This only works if you did not specify weights while fitting the model (or, if you did, you would need to include those weights in the model object).
get_logLik <- function(s_model, family = binomial(logit)) {
  n <- length(s_model$y)
  wt <- rep(1, n) # or s_model$prior.weights if the field exists
  deviance <- sum(family$dev.resids(s_model$y, s_model$fitted.values, wt))
  mod_rank <- sum(!is.na(s_model$coefficients)) # or s_model$rank if the field exists
  aic <- family$aic(s_model$y, rep(1, n), s_model$fitted.values, wt, deviance) + 2 * mod_rank
  log_lik <- mod_rank - aic/2
  return(log_lik)
}
For example...
model <- glm(vs ~ mpg, mtcars, family = binomial(logit))
logLik(model)
# 'log Lik.' -12.76667 (df=2)
sparse_model <- model[c("theta", "coefficients", "linear.predictors", "fitted.values", "y", "P", "residuals", "cov", "converged")]
get_logLik(sparse_model)
#[1] -12.76667
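As a quick cross-check (not part of glm.fit, just the Bernoulli log-likelihood written out directly), for an unweighted 0/1 response you can also sum the log-likelihood from the fitted probabilities:
p <- sparse_model$fitted.values
y <- sparse_model$y
sum(y * log(p) + (1 - y) * log(1 - p))
#[1] -12.76667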

Related

How to perform logistic regression on a non-binary variable?

I was searching for this answer and I'm really surprised that I haven't found it. I just want to perform a three-level logistic regression in R.
Let's define some artificial data:
set.seed(42)
y <- sample(0:2, 100, replace = T)
x <- rnorm(100)
My variable y contains three values: 0, 1 and 2. So I thought that the simplest way would be just to use:
glm(y ~ x, family = binomial("logit"))
However, I got an error saying that y should be in the interval [0,1]. Do you know how I can perform this regression?
Please note: I know that it's not straightforward to perform multi-level logistic regression; there are several techniques for doing so, e.g. one-vs-all. But when I searched for them, I wasn't able to find any.
Logistic regression as implemented by glm only works for 2 levels of the response, not 3.
The message is a little vague because you can specify the y-variable in logistic regression either as 0s and 1s, or as a proportion (between 0 and 1) with a weights argument specifying the number of subjects the proportion is based on.
With 3 or more ordered levels in the response you need to use a generalization, one common generalization is proportional odds logistic regression (also goes by other names). The polr function in the MASS package and the lrm function in the rms package (and probably other functions in other packages) fit these types of models, but glm does not.
set.seed(42)
y <- sample(0:2, 100, replace = TRUE)
x <- rnorm(100)
multinomial regression
If you don't want to treat your responses as ordered (i.e., nominal or categorical values):
library(nnet) ## 'recommended' package, i.e. installed by default
multinom(y~x)
Results
# weights: 9 (4 variable)
initial value 109.861229
final value 104.977336
converged
Call:
multinom(formula = y ~ x)
Coefficients:
   (Intercept)           x
1 -0.001529465  0.29386524
2 -0.649236723 -0.01933747
Residual Deviance: 209.9547
AIC: 217.9547
Or, if your responses are ordered:
ordinal regression
MASS::polr() does proportional-odds logistic regression. (You may also be interested in the ordinal package, which has more features; it can also do multinomial models.)
library(MASS) ## also 'recommended'
polr(ordered(y)~x)
Results
Call:
polr(formula = ordered(y) ~ x)
Coefficients:
         x
0.06411137

Intercepts:
       0|1        1|2
-0.4102819  1.3218487
Residual Deviance: 212.165
AIC: 218.165
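Not part of the original answer, but if it helps to see how the two parameterizations behave, you can compare their predicted class probabilities for a few values of x (a quick sketch using the same simulated data and the packages loaded above):
m_mult <- multinom(y ~ x)
m_ord <- polr(ordered(y) ~ x)
newd <- data.frame(x = c(-1, 0, 1))
predict(m_mult, newdata = newd, type = "probs")
predict(m_ord, newdata = newd, type = "probs")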
If you read the error message, it offers a hint that you might get success with:
y <- sample(seq(0,1,length=3), 100, replace = T)
And in fact, you do. Now your challenge might be to interpret that in the context of the actual situation in reality (which you have not described). You do get a warning, but R warnings are not errors.
You might also look up the topic of polychotomous logistic regression, which is implemented in several variants that might be useful in particular situations. Frank Harrell's book Regression Modeling Strategies has material on such techniques. You may also post further questions on CrossValidated.com if you need help choosing which route to go.

Confusion about how R computes standard errors

I recently wanted to compute the standard error for a Fama-MacBeth test; to compute a standard error, we need a standard deviation. The variance this test uses is \frac{1}{n^2}\sum (x_i-\bar x)^2, whereas in my mind the denominator for an ordinary variance is n. So my question is: when a program such as R or EViews runs a linear regression, does it also compute the coefficients' standard errors from a variance with a \frac{1}{n^2} denominator?
Thanks, everyone.
The formula for calculating the standard errors of coefficients in a linear regression can be found in introductory textbooks or in, for example, "How are the standard errors of coefficients calculated in a regression?".
The variance of the coefficients is given by \operatorname{Var}(\hat\beta) = \hat\sigma^2 (X'X)^{-1}, where \hat\sigma^2 is the sum of squared residuals divided by the degrees of freedom n - k - 1; here n is the number of observations, k is the number of covariates, and the model is assumed to have an intercept.
We can verify this empirically. Using the built-in mtcars dataset
fit <- lm(mpg ~ wt, mtcars)
we can see that
vcv <- (sum(fit$residuals^2) / (nrow(mtcars) - 2)) *
  solve(t(model.matrix(fit)) %*% model.matrix(fit))
all.equal(summary(fit)$coefficients[, "Std. Error"],
          sqrt(diag(vcv)))
# [1] TRUE
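As an aside on the \frac{1}{n^2} in the question (this concerns the Fama-MacBeth context, not something lm() computes): that factor appears because the Fama-MacBeth standard error is the standard error of the mean of n per-period coefficient estimates, e.g.
set.seed(1)
gamma_t <- rnorm(60)  # hypothetical per-period slope estimates
n <- length(gamma_t)
sqrt(sum((gamma_t - mean(gamma_t))^2) / n^2)  # standard error of the mean, matching the 1/n^2 formula
By contrast, lm() reports the coefficient standard errors from \hat\sigma^2 (X'X)^{-1}, as verified above.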

Unscale predictor coefficients from an lmer model fit with an unscaled response

I have fitted an lmer model, and now I am trying to interpret the coefficients in terms of real values instead of the scaled ones.
My top model is:
lmer(logcptplus1~scale.t6+scale.logdepth+(1|location) + (1|Fyear),data=cpt, REML=TRUE)
So both predictor variables are scaled, one of them being scaled log values. My response variable is not scaled, just logged.
To scale my predictor variables, I used the scale(data$column, center=TRUE, scale=TRUE) function in R.
The output for my model is:
Fixed effects:
                Estimate Std. Error t value
(int)            3.31363    0.15163  21.853
scale.t6        -0.34400    0.10540  -3.264
scale.logdepth  -0.58199    0.06486  -8.973
So how can I obtain real estimates for my response variable from these coefficients, which are based on my scaled predictor variables?
NOTE: I understand how to unscale my predictor variables, just not how to unscale/transform the coefficients
Thanks
The scale function does a z-transform of the data, which means it takes the original values, subtracts the mean, and then divides by the standard deviation.
to_scale <- 1:10
using_scale <- scale(to_scale, center = TRUE, scale = TRUE)
by_hand <- (to_scale - mean(to_scale))/sd(to_scale)
identical(as.numeric(using_scale), by_hand)
[1] TRUE
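For reference (a quick look, not part of the original answer), scale() stores the centering and scaling values as attributes on its result, which is what the helper function below pulls back out:
attr(using_scale, "scaled:center")  # 5.5, the mean of 1:10
attr(using_scale, "scaled:scale")   # 3.02765, the sd of 1:10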
Therefore, to reverse the model coefficients all you need to do is multiply the coefficient by the standard deviation of the covariate and add the mean. The scale function holds onto the mean and sd for you. So, assuming your scale.t6 covariate values are the using_scale vector, we can write a function to do the work for us.
get_real <- function(coef, scaled_covariate){
  # collect mean and standard deviation from the scaled covariate
  mean_sd <- unlist(attributes(scaled_covariate)[-1])
  # reverse the z-transformation
  answer <- (coef * mean_sd[2]) + mean_sd[1]
  # this value will have a name, remove it
  names(answer) <- NULL
  # return the unscaled coefficient
  return(answer)
}
get_real(-0.3440, using_scale)
[1] 4.458488
In other words, it is the same thing as unscaling your predictor variables because it is a monotonic transformation.

glm.nb with random effect as matrix

I am analyzing gene expression data in R. I would like to test for differences in expression when accounting for the phylogenetic effect.
I can run GLM with a negative binomial distribution and normalization factor as an offset:
library(MASS)
glm.nb(expression ~ Group + offset(log(normFactor)), data=data)
However, I don't know how to include phylogenetic effect in this model. I can obtain a variance-covariance or correlation matrix from my phylogeny:
library(ape)
tree <- read.tree("tree.nwk")
varCovMatrix <- vcv(tree, model = "Brownian", cor = FALSE)
I found that lmekin allows you to specify the variance-covariance structure of the random effects:
library(coxme)
lmekin(expression ~ Group + (1 | animal) + offset(log(normFactor)), data = data, varlist = varCovMatrix)
But I cannot specify a negative binomial distribution, and it isn't clear whether it handles the offset.
The same problem applies to MCMCglmm.
Please help me combine all of the following in one GLMM:
the variance-covariance matrix
normalization factor as an offset
negative binomial distribution

Residuals and plots in ordered multinomial regression

I need to plot a binned residual plot with fitted versus residual values from an ordered multinomial logit regression.
How can I extract residuals when using polr? Is there any other function for ordered multinomial logit from which residuals can be extracted?
This is the code I used
options(contrasts = c("contr.treatment", "contr.poly"))
mod1 <- polr(as.ordered(y) ~ x1 + x2 + x3, data, method='logistic')
fit <- mod1$fitted.values
res <- residuals(mod1)
binnedplot(fit, res)
The problem is that object 'res' is 'null'.
Thanks
For a start, can you tell us how residuals would be defined in principle for a model with categorical responses? fitted.values is a matrix of probabilities. You could define residuals in terms of correct prediction (defining the most likely outcome as the prediction, as in the default predict method for polr objects), or you could cross-tabulate true versus predicted categories. Alternatively you could reduce the ordinal data back to an integer scale and compute a mean outcome as the prediction ... but I can't see that there's any unique way to define the residuals in the first place.
In polr(), there is no function that returns residuals. You would have to calculate them manually from whichever definition you choose.
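If you just need something quick, one minimal hand-rolled option (a sketch only; as the comment above notes, there is no single agreed definition) is to score the ordered levels as integers and take observed minus fitted expected value on that scale, using mod1 and data from the question:
probs <- mod1$fitted.values                 # n x K matrix of fitted class probabilities
scores <- seq_len(ncol(probs))              # 1, 2, ..., K
expected <- as.numeric(probs %*% scores)    # fitted mean on the integer scale
observed <- as.numeric(as.ordered(data$y))  # observed level coded 1..K
res_hand <- observed - expected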
There are actually plenty of ways to get residuals from an ordinal probit/logit. Although polr does not provide any residuals, vglm provides several. See ?residualsvglm from the VGAM package (see also below).
NOTE: However, for a control function/2SRI approach, Wooldridge (2014) suggests using the generalised residuals as described in Vella (1993). These are, as far as I know, currently not available in R (although I am working on that), but they are available in Stata (using predict gr, score).
Residuals in VGLM
Surrogate residuals for polr
You can use the sure package (link) to calculate surrogate residuals with resids(). The package is based on this paper in the Journal of the American Statistical Association.
library(sure)  # for the surrogate residual function and sample data sets
library(MASS)  # for the polr function
# assemble one data frame from the package's sample data sets
df1 <- df1
df1$x1 <- df1$x
df1$x <- NULL
df1$y <- df2$y
df1$x2 <- df2$x
df1$x3 <- df3$x
options(contrasts = c("contr.treatment", "contr.poly"))
mod1 <- polr(as.ordered(y) ~ x1 + x2 + x3, data = df1, method = 'probit')
fit <- mod1$fitted.values
res <- resids(mod1)
EDIT: One big issue is the following (from ?resids):
"Note: Surrogate residuals require sampling from a continuous distribution; consequently, the result will be different with every call to resids. The internal functions used for sampling from truncated distributions when method = "latent" are based on modified versions of rtrunc and qtrunc."
Even when running resids(mod1, nsim = 1000, method = "latent"), the results do not converge to a stable value.
