R: log-likelihood of Saturated Model in GLM

Let LL denote the log-likelihood. Then
Residual Deviance = 2 * (LL(Saturated Model) - LL(Proposed Model))
However, when I use the glm function, it seems that
Residual Deviance = -2 * LL(Proposed Model)
For example,
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
mydata$rank <- factor(mydata$rank)
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
summary(mylogit)
###
Residual deviance: 458.52 on 394 degrees of freedom
AIC: 470.52
#Residual deviance
-2*logLik(mylogit)
##'log Lik.' 458.5175 (df=6)
#AIC
-2*logLik(mylogit)+2*(5+1)
##470.5175
Where is LL(Saturated Model), and how can I get its value in R?
Thank you.

I have got the answer: the two quantities coincide only when the log-likelihood of the saturated model is 0, which for discrete models implies that the probability of the observed data under the saturated model is 1. Binary data is pretty much the only case where this is true, because the individual fitted probabilities become either zero or one. See here and here for details.
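Rearranging the identity above gives LL(Saturated Model) = LL(Proposed Model) + Residual Deviance / 2, which you can compute from any glm fit. As a minimal sketch, here is a small Poisson example (made-up counts), where the saturated log-likelihood is not zero and the deviance therefore differs from -2*logLik:
counts <- c(18, 17, 15, 20, 10, 20, 25, 13, 12)
treatment <- gl(3, 3)
fit <- glm(counts ~ treatment, family = poisson)
deviance(fit)                         # residual deviance
-2 * as.numeric(logLik(fit))          # differs from the deviance here
# Recover LL(Saturated Model) from the identity above:
as.numeric(logLik(fit)) + deviance(fit) / 2
# Or compute it directly: the saturated model fits each observation exactly
sum(dpois(counts, lambda = counts, log = TRUE))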

Related

Estimating regression paths in lavaan, df and test statistics

I am trying to compare structural equation models using lavaan in R.
I have 4 latent variables, three of which are being estimated by 8 observed variables and one of which is being estimated by 2 observed variables.
When I run the measurement model, the test user model has 293 degrees of freedom with 58 model parameters. When I run the structural model, with three additional regression paths estimated, I receive the same model statistics with the same number of degrees of freedom (293) and the same number of model parameters (58).
Because the models are apparently identical, when I try to compare them with anova there is no degrees-of-freedom difference and no chi-square difference; it is the same model output.
So, I receive the following error
Warning message: In lavTestLRT(object = object, ..., model.names = NAMES) : lavaan WARNING: some models have the same degrees of freedom
semPaths shows the estimated regression coefficients, and the parameter-estimates output shows the regression coefficients for the structural model, but the fit indices (AIC, BIC, etc.), chi-square, and degrees of freedom are identical.
I thought I had simply put the wrong model in the summary function, but no, that was not it.
I am trying not to be a dolt, but I cannot figure out why lavaan is giving me exactly the same df and chi-square when I am estimating three additional paths/parameters.
Any insight is welcome. Again, I apologize if I am missing the obvious.
Here is the code:
# Pre-Post Measurement Model 1 - (TM)
MeasTM1 <- '
posttransp8 =~ post_Understand_Successful_Work +
post_Purpose_Assignment +
post_Assignment_Objectives_Course +
post_Instructor_Identified_Goal +
post_Steps_Required +
post_Assignment_Instructions +
post_Detailed_Directions +
post_Knew_How_Evaluated
preskills8 =~ pre_Express_Ideas_Write +
pre_Express_Ideas_Speak +
pre_Collaborate_Academic +
pre_Analyz + pre_Synthesize +
pre_Apply_New_Contexts +
pre_Consider_Ethics +
pre_Capable_Self_Learn
postskills8 =~ post_Express_Ideas_Write +
post_Express_Ideas_Speak +
post_Collaborate_Academic +
post_Analyz + post_Synthesize +
post_Apply_New_Contexts +
post_Consider_Ethics +
post_Capable_Self_Learn
postbelong2 =~ post_Belong_School_Commty + post_Helped_Belong_School_Commty
'
fitMeasTM1 <- sem(MeasTM1, data=TILTSEM)
summary(fitMeasTM1, standardized=TRUE, fit.measures=TRUE)
semPaths(fitMeasTM1, whatLabels = "std", layout = "tree")
# Pre-Post Factor Model 1 - (TM)
#Testing regression on Pre-Post Skills
FactTM1 <- '
#latent factors
posttransp8 =~ post_Understand_Successful_Work +
post_Purpose_Assignment +
post_Assignment_Objectives_Course +
post_Instructor_Identified_Goal +
post_Steps_Required +
post_Assignment_Instructions +
post_Detailed_Directions +
post_Knew_How_Evaluated
preskills8 =~ pre_Express_Ideas_Write +
pre_Express_Ideas_Speak +
pre_Collaborate_Academic +
pre_Analyz + pre_Synthesize +
pre_Apply_New_Contexts +
pre_Consider_Ethics +
pre_Capable_Self_Learn
postskills8 =~ post_Express_Ideas_Write +
post_Express_Ideas_Speak +
post_Collaborate_Academic +
post_Analyz + post_Synthesize +
post_Apply_New_Contexts +
post_Consider_Ethics +
post_Capable_Self_Learn
postbelong2 =~ post_Belong_School_Commty + post_Helped_Belong_School_Commty
#regressions
postskills8 ~ preskills8 + postbelong2 + posttransp8
'
fitFactTM1 <- sem(FactTM1, data=TILTSEM)
summary(fitFactTM1, standardized=TRUE, fit.measures=TRUE)
semPaths(fitFactTM1, whatLabels = "std", layout = "tree")
anova(fitMeasTM1,fitFactTM1)
Here is the model output for the two models (to show that they are identical):
=========================Pre-Post Measurement Model 1 - (TM)=============================
Estimator ML
Optimization method NLMINB
Number of model parameters 58
Used Total
Number of observations 521 591
Model Test User Model:
Test statistic 1139.937
Degrees of freedom 293
P-value (Chi-square) 0.000
Model Test Baseline Model:
Test statistic 4720.060
Degrees of freedom 325
P-value 0.000
User Model versus Baseline Model:
Comparative Fit Index (CFI) 0.807
Tucker-Lewis Index (TLI) 0.786
Loglikelihood and Information Criteria:
Loglikelihood user model (H0) -13335.136
Loglikelihood unrestricted model (H1) -12765.167
Akaike (AIC) 26786.271
Bayesian (BIC) 27033.105
Sample-size adjusted Bayesian (BIC) 26849.000
Root Mean Square Error of Approximation:
RMSEA 0.074
90 Percent confidence interval - lower 0.070
90 Percent confidence interval - upper 0.079
P-value RMSEA <= 0.05 0.000
Standardized Root Mean Square Residual:
SRMR 0.068
=========================Pre-Post Factor Model 1 - (TM)======================
Estimator ML
Optimization method NLMINB
Number of model parameters 58
Used Total
Number of observations 521 591
Model Test User Model:
Test statistic 1139.937
Degrees of freedom 293
P-value (Chi-square) 0.000
Model Test Baseline Model:
Test statistic 4720.060
Degrees of freedom 325
P-value 0.000
User Model versus Baseline Model:
Comparative Fit Index (CFI) 0.807
Tucker-Lewis Index (TLI) 0.786
Loglikelihood and Information Criteria:
Loglikelihood user model (H0) -13335.136
Loglikelihood unrestricted model (H1) -12765.167
Akaike (AIC) 26786.271
Bayesian (BIC) 27033.105
Sample-size adjusted Bayesian (BIC) 26849.000
Root Mean Square Error of Approximation:
RMSEA 0.074
90 Percent confidence interval - lower 0.070
90 Percent confidence interval - upper 0.079
P-value RMSEA <= 0.05 0.000
Standardized Root Mean Square Residual:
SRMR 0.068
You aren't using up more degrees of freedom.
One thing that makes lavaan::sem() dangerous to use over lavaan::lavaan() is that its defaults are hard to remember and/or notice. If you look at ?lavaan::sem, you will see those defaults:
The sem function is a wrapper for the more general lavaan function, but setting the following default options: int.ov.free = TRUE, int.lv.free = FALSE, auto.fix.first = TRUE (unless std.lv = TRUE), auto.fix.single = TRUE, auto.var = TRUE, auto.cov.lv.x = TRUE, auto.efa = TRUE, auto.th = TRUE, auto.delta = TRUE, and auto.cov.y = TRUE
You can find out what this means via ?lavOptions:
auto.cov.lv.x: If TRUE, the covariances of exogenous latent variables are included in the model and set free.
By default, all exogenous latent variables (i.e., all of your latent factors) are correlated in your model already.
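So the three added regressions do not consume extra parameters: they replace three of the free latent covariances, the structural part remains saturated, and the two models are statistically equivalent, which is why every fit statistic matches. A sketch of how to see this from the fitted objects in the question:
pt1 <- parTable(fitMeasTM1)
pt2 <- parTable(fitFactTM1)
sum(pt1$free > 0); sum(pt2$free > 0)  # same count of free parameters (58)
# Measurement model: free covariances among the latent factors
subset(pt1, op == "~~" & lhs != rhs & free > 0, c("lhs", "op", "rhs"))
# Structural model: three of those covariances are now regressions
subset(pt2, op == "~" & free > 0, c("lhs", "op", "rhs"))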
I'm also not sure how you are identifying the 2-item factor, so I'm surprised this does not throw a warning, unless you're ignoring it.

Logistic stepwise regression with a fixed number of predictors

For a course I'm attending, I have to perform a logistic stepwise regression to reduce the number of predictors of a feature to a fixed number and estimate the accuracy of the resulting model.
I've been trying with regsubsets() from the leaps package, but I can't get its accuracy.
Now I'm trying with caret, because I can set its metric to "Accuracy", but I can't fix the number of predictors when I use method = "glmStepAIC" in the train() function, because it has no tuning parameters.
step.model <- train(Outcome ~ .,
                    data = myDataset,
                    method = "glmStepAIC",
                    metric = "Accuracy",
                    trControl = trainControl(method = "cv", number = 10),
                    trace = FALSE)
I've found this question (stepwise regression using caret in R), but the answer and the links don't seem to work for me.
If not with caret, what would be the best way to achieve the reduced model with fixed number of predictors?
You can specify the number of variables to keep in stepwise selection using the glmulti package. In this example, columns a through g are related to the outcome, but columns A through E are not. In glmulti, set confsetsize to the number of models to return, and set minsize equal to maxsize to fix the number of variables to keep.
library(MASS)
library(dplyr)
set.seed(100)
dat = data.frame(a = rnorm(10000))
for (i in 2:12) {
  dat[, i] = rnorm(10000)
}
names(dat) = c("a", letters[2:7], LETTERS[1:5])
Yy = rep(0, 10000)
for (i in 1:7) {
  Yy = Yy + i * dat[, i]
}
Yy = 1 / (1 + exp(-Yy))
outcome = c()
for (i in 1:10000) {
  outcome[i] = sample(c(1, 0), 1, prob = c(Yy[i], 1 - Yy[i]))
}
dat = mutate(dat, outcome = factor(outcome))
library(glmulti)
mod = glmulti(outcome ~ .,
              data = dat,
              level = 1,
              method = "g",
              crit = "aic",
              confsetsize = 5,
              plotty = F, report = T,
              fitfunction = "glm",
              family = "binomial",
              minsize = 7,
              maxsize = 7,
              conseq = 3)
Output:
mod@objects[[1]]
Call: fitfunc(formula = as.formula(x), family = "binomial", data = data)
Coefficients:
(Intercept) a b c d e f g
-0.01386 1.11590 1.99116 3.00459 4.00436 4.86382 5.94198 6.89312
Degrees of Freedom: 9999 Total (i.e. Null); 9992 Residual
Null Deviance: 13860
Residual Deviance: 2183 AIC: 2199
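The question also asked for the accuracy of the resulting model; a minimal in-sample check on the best selected model might look like this (proper cross-validation, e.g. refitting the chosen formula under caret, would give a less optimistic estimate):
best <- mod@objects[[1]]
pred <- ifelse(predict(best, type = "response") > 0.5, "1", "0")
mean(pred == as.character(dat$outcome))  # proportion classified correctly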

R: Problems of wrapping polynomial regression in a function

When I was doing polynomial regression, I tried to put fits with different polynomial degrees into a list, so I wrapped the glm call in a function:
library(MASS)
myglm <- function(dop) {
# dop: degree of polynomial
glm(nox ~ poly(dis, degree = dop), data = Boston)
}
However, I guess there might be some problem related to lazy evaluation: the formula stored in the fit records the parameter dop rather than a specific number.
r$> myglm(2)
Call: glm(formula = nox ~ poly(dis, degree = dop), data = Boston)
Coefficients:
(Intercept) poly(dis, degree = dop)1 poly(dis, degree = dop)2
0.5547 -2.0031 0.8563
Degrees of Freedom: 505 Total (i.e. Null); 503 Residual
Null Deviance: 6.781
Residual Deviance: 2.035 AIC: -1347
When I do cross-validation using this model, an error occurs:
>>> cv.glm(Boston, myglm(2))
Error in poly(dis, degree = dop) : object 'dop' not found
So how can I solve this problem ?
Quosures, quasiquotation, and tidy evaluation are useful here:
library(MASS)
library(boot)
library(rlang)
myglm <- function(dop) {
eval_tidy(quo(glm(nox ~ poly(dis, degree = !! dop), data = Boston)))
}
cv.glm(Boston, myglm(2))
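If you'd rather not depend on rlang, an equivalent base-R sketch splices the value of dop into the call with bquote() before evaluating it, so the stored formula contains the literal degree:
myglm2 <- function(dop) {
  # .(dop) is replaced by the value of dop before the call is evaluated
  eval(bquote(glm(nox ~ poly(dis, degree = .(dop)), data = Boston)))
}
cv.glm(Boston, myglm2(2))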

Weighted logistic regression in R

Given sample data of proportions of successes plus sample sizes and independent variable(s), I am attempting logistic regression in R.
The following code does what I want and seems to give sensible results, but it does not look like a sensible approach; in effect, it doubles the size of the data set:
datf <- data.frame(prop = c(0.125, 0, 0.667, 1, 0.9),
                   cases = c(8, 1, 3, 3, 10),
                   x = c(11, 12, 15, 16, 18))
datf2 <- rbind(datf, datf)
datf2$success <- rep(c(1, 0), each = nrow(datf))
datf2$cases <- round(datf2$cases * ifelse(datf2$success, datf2$prop, 1 - datf2$prop))
fit2 <- glm(success ~ x, weight = cases, data = datf2, family = "binomial")
datf$proppredicted <- 1 / (1 + exp(-predict(fit2, datf)))
plot(datf$x, datf$proppredicted, type = "l", col = "red", ylim = c(0, 1))
points(datf$x, datf$prop, cex = sqrt(datf$cases))
producing a chart (image not shown) of predicted proportion against x, which looks reasonably sensible.
But I am not happy about the use of datf2 as a way of separating the successes and failures by duplicating the data. Is something like this necessary?
As a lesser question, is there a cleaner way of calculating the predicted proportions?
No need to construct artificial data like that; glm can fit your model from the dataset as given.
> glm(prop ~ x, family=binomial, data=datf, weights=cases)
Call: glm(formula = prop ~ x, family = binomial, data = datf, weights = cases)
Coefficients:
(Intercept) x
-9.3533 0.6714
Degrees of Freedom: 4 Total (i.e. Null); 3 Residual
Null Deviance: 17.3
Residual Deviance: 2.043 AIC: 11.43
You will get a warning about "non-integer #successes", but that is because glm is being silly. Compare to the model on your constructed dataset:
> fit2
Call: glm(formula = success ~ x, family = "binomial", data = datf2,
weights = cases)
Coefficients:
(Intercept) x
-9.3532 0.6713
Degrees of Freedom: 7 Total (i.e. Null); 6 Residual
Null Deviance: 33.65
Residual Deviance: 18.39 AIC: 22.39
The regression coefficients (and therefore predicted values) are basically equal. However, your residual deviance and AIC are suspect because you've created artificial data points.
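As for the lesser question: predict() applies the inverse link for you when you ask for type = "response", so the manual 1 / (1 + exp(-x)) step isn't needed. For example:
fit <- glm(prop ~ x, family = binomial, data = datf, weights = cases)
datf$proppredicted <- predict(fit, newdata = datf, type = "response")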

Logistic Unit Fixed Effect Model in R

I'm trying to estimate a logistic unit fixed effects model for panel data using R. My dependent variable is binary and measured daily over two years for 13 locations.
The goal of this model is to predict the value of y for a particular day and location based on x.
zero <- seq(from=0, to=1, by=1)
ids = dplyr::data_frame(location=seq(from=1, to=13, by=1))
dates = dplyr::data_frame(date = seq(as.Date("2015-01-01"), as.Date("2016-12-31"), by="days"))
data = merge(dates, ids)
data$y <- sample(zero, size=9503, replace=TRUE)
data$x <- sample(zero, size=9503, replace=TRUE)
While surveying the available packages to do so, I've read a number of ways to (apparently) do this, but I'm not confident I've understood the differences between packages and approaches.
From what I have read so far, glm(), survival::clogit() and pglm::pglm() can be used to do this, but I'm wondering if there are substantial differences between the packages and what those might be.
Here are the calls I've used:
fixed <- glm(y ~ x + factor(location), data=data)
fixed <- clogit(y ~ x + strata(location), data=data)
One of the reasons for this uncertainty is the error I get when using pglm (also see this question), which says that pglm can't use the "within" model:
fixed <- pglm(y ~ x, data=data, index=c("location", "date"), model="within", family=binomial("logit"))
What distinguishes the "within" model of pglm from the approaches in glm() and clogit() and which of the three would be the correct one to take here when trying to predict y for a given date and unit?
I don't see that you have defined a proper hypothesis to test within the context of what you are calling "panel data", but as far as getting glm to give estimates for logistic coefficients within strata, it can be accomplished by adding family="binomial" and stratifying by your unit variable (called location in your sample code):
> fixed <- glm(y ~ x + strata(unit), data=data, family="binomial")
> fixed
Call: glm(formula = y ~ x + strata(unit), family = "binomial", data = data)
Coefficients:
(Intercept) x strata(unit)unit=2 strata(unit)unit=3
0.10287 -0.05910 -0.08302 -0.03020
strata(unit)unit=4 strata(unit)unit=5 strata(unit)unit=6 strata(unit)unit=7
-0.06876 -0.05042 -0.10200 -0.09871
strata(unit)unit=8 strata(unit)unit=9 strata(unit)unit=10 strata(unit)unit=11
-0.09702 0.02742 -0.13246 -0.04816
strata(unit)unit=12 strata(unit)unit=13
-0.11449 -0.16986
Degrees of Freedom: 9502 Total (i.e. Null); 9489 Residual
Null Deviance: 13170
Residual Deviance: 13170 AIC: 13190
That will not take into account any date-ordering, which is what I would have expected to be the interest. But as I said above, there doesn't yet appear to be a hypothesis that is premised on any sequential ordering.
The following would create a fixed-effects model that includes a spline relationship between date and the probability of a y-event. I chose to center the date rather than leaving it as a very large integer:
library(splines)
fixed <- glm(y ~ x + ns(scale(date),3) + factor(unit), data=data, family="binomial")
fixed
#----------------------
Call: glm(formula = y ~ x + ns(scale(date), 3) + factor(unit), family = "binomial",
data = data)
Coefficients:
(Intercept) x ns(scale(date), 3)1 ns(scale(date), 3)2
0.13389 -0.05904 0.04431 -0.10727
ns(scale(date), 3)3 factor(unit)2 factor(unit)3 factor(unit)4
-0.03224 -0.08302 -0.03020 -0.06877
factor(unit)5 factor(unit)6 factor(unit)7 factor(unit)8
-0.05042 -0.10201 -0.09872 -0.09702
factor(unit)9 factor(unit)10 factor(unit)11 factor(unit)12
0.02742 -0.13246 -0.04816 -0.11450
factor(unit)13
-0.16987
Degrees of Freedom: 9502 Total (i.e. Null); 9486 Residual
Null Deviance: 13170
Residual Deviance: 13160 AIC: 13200
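On the question of what distinguishes the approaches: glm with factor(location) estimates an explicit intercept for every location (unconditional fixed effects), while pglm's "within" model and survival::clogit condition those intercepts out of the likelihood, which matters mainly when panels are short relative to the number of units. A sketch of the conditional fit, as in the question's second call (the exact conditional likelihood can be very slow with strata this large, so an approximation is requested):
library(survival)
fixed_cl <- clogit(y ~ x + strata(location), data = data, method = "approximate")
summary(fixed_cl)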
