I have a logit model with 4 independent variables:
logit <- glm(y ~ x1 + x2 + x3 + x4, family = binomial(), data = df)
All variables in the model are dichotomous (0,1).
I want to get the predicted probabilities for when x1=1 vs x1=0, while holding all other variables in the model constant, using the following code:
mean(predict(logit, transform(df, x1 = 1), type = "response"))
mean(predict(logit, transform(df, x1 = 0), type = "response"))
Is this the correct way to do this? I'm asking because I tried a few different logit models and compared the difference between the two values produced by the code above with the number produced by the margins command:
summary(margins(logit, variables = "x1"))
For most models I tried, the difference between the two values equals the number produced by the margins command above, but for one model specification the numbers were slightly different.
Finally, when I use weights in my logit model, do I have to calculate the predicted probabilities any differently than when I don't include weights?
Your approach seems reasonable. The only thing I’ll note is that there are many packages out there which will do this automatically for you, and can estimate standard errors. One example is the marginaleffects package. (Disclaimer: I am the author.)
library(marginaleffects)
mod <- glm(am ~ vs + hp, data = mtcars, family = binomial)
# average predicted outcome for observed units with vs=1 or vs=0
predictions(
  mod,
  by = "vs")
#> type vs predicted std.error statistic p.value conf.low conf.high
#> 1 response 0 0.3333333 0.1082292 3.079883 0.0020708185 0.1212080 0.5454587
#> 2 response 1 0.5000000 0.1328955 3.762356 0.0001683204 0.2395297 0.7604703
# are the average predicted outcomes equal?
predictions(
  mod,
  by = "vs",
  hypothesis = "pairwise")
#> type term predicted std.error statistic p.value conf.low conf.high
#> 1 response 0 - 1 -0.1666667 0.1713903 -0.9724392 0.3308321 -0.5025855 0.1692522
FYI, there is a wts argument to handle weights in the predictions() function.
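For example (a minimal sketch; mtcars$wt is used purely as a stand-in weight column), the averaging step can be weighted like this:
predictions(
  mod,
  by = "vs",
  wts = mtcars$wt)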
Edit: here are two equivalent ways to replicate the counterfactual approach from your question, setting vs to 0 and then to 1 for every row of the data before averaging:
predictions(
  mod,
  newdata = datagridcf(vs = 0:1),
  by = "vs")
predictions(
  mod,
  variables = list(vs = 0:1),
  by = "vs")
I have a fitted lm model
log_log_model = lm(log(price) ~ log(carat), data = diamonds)
I want to predict price using this model, but I'm not sure whether I should enter log(carat) or the raw carat value as the predictor in the predict() function.
Choice 1
exp(predict(log_log_model, data.frame(carat = log(3)),
interval = 'predict', level = 0.99))
Choice 2
exp(predict(log_log_model, data.frame(carat = 3),
interval = 'predict', level = 0.99))
Which one is correct?
Choice 2 is correct.
To give you some extra confidence, let's inspect what the design matrix looks like when we make a prediction.
## for the diamonds dataset
library(ggplot2)
## log-log linear model
fit <- lm(log(price) ~ log(carat), data = diamonds)
## for prediction
newdat <- data.frame(carat = 3)
## evaluate the design matrix for prediction
Xp <- model.matrix(delete.response(terms(fit)), data = newdat)
# (Intercept) log(carat)
#1 1 1.098612
See it? carat = 3 is automatically evaluated to log(carat) = log(3).
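As an extra check, we can reproduce the point estimate of Choice 2 by hand from this design matrix (a small sketch using the objects defined above):
## sanity check: the design matrix row times the coefficients
## gives the predicted log(price) at carat = 3
log_price_hat <- drop(Xp %*% coef(fit))
## back-transform; this matches the 'fit' column of Choice 2
exp(log_price_hat)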
I'm trying to subset a series of models dredged from a global model that has both linear and non-linear terms, with no interactions, e.g.:
Glblm <- Y ~ X1 + X2 + X3 + I(X3^2) + X4 + X5 + X6 + I(X6^2) + X7 + I(X7^2)
I want to specify that X3^2 should never appear without X3, but X3 could appear alone without X3^2 (and the same for X6 & X7).
I have tried the following as I understood from the documentation:
ssm <- dredge(Glblm, subset = (X3 | !I(X3^2)) && (X6 | !I(X6^2)) && (X7 | !I(X7^2)))
I also tried making a subset first, as described in https://stackoverflow.com/questions/55252019/dredge-in-mumin-r-keeps-models-with-higher-order-terms-without-their-respectiv, e.g.:
hbfsubset <- expression( dc(X3, `I(X3^2)`) & dc(`X6`, `I(X6^2)`)& dc(`X7`, `I(X7^2)`))
ssm <- dredge(Glblm, subset = hbfsubset)
Neither has produced a subset of models; instead, the full list of models is returned when inspecting 'ssm' using:
model.sel(ssm)
Any help would be greatly appreciated.
A reproducible example is needed to pinpoint the issue, along with the type of model you are fitting.
In simple linear models (lm, like the examples provided in the MuMIn documentation), the names of the fitted terms are exactly what you typed in the global model, but this may not be the case in more complex models (e.g. glmmTMB).
Here is an example:
library(MuMIn)
library(glmmTMB)
# a simple linear model, using Cement data from MuMIn
m1 <- lm(y ~ X1 + I(X1^2) + X2 + I(X2^2), data = Cement, na.action = "na.fail")
# dredge without a subset
d1 <- dredge(m1)
# 16 models produced
# dredge with a subset
d1_sub <- dredge(m1, subset = dc(`X1`, `I(X1^2)`) & dc(`X2`, `I(X2^2)`))
# 9 models produced, works totally fine
# a glmmTMB linear model
m2 <- glmmTMB(y ~ X1 + I(X1^2) + X2 + I(X2^2), data = Cement, na.action = "na.fail")
# dredge without a subset
d2 <- dredge(m2)
# 16 models produced
# dredge with a subset
d2_sub <- dredge(m2, subset = dc(`X1`, `I(X1^2)`) & dc(`X2`, `I(X2^2)`))
# 16 models produced, subset didn't work and no warning or error produced
# this is because the term names of a glmmTMB object in dredge() are no longer the same as in the typed global model:
names(d2_sub)
# [1] "cond((Int))" "disp((Int))" "cond(X1)" "cond(I(X1^2))" "cond(X2)" "cond(I(X2^2))" "df" "logLik" "AICc"
# [10] "delta" "weight"
# e.g., now the X1 in the typed global model is actually called cond(X1)
# what will work for glmmTMB:
d2_sub <- dredge(m2, subset = dc(`cond(X1)`, `cond(I(X1^2))`) & dc(`cond(X2)`, `cond(I(X2^2))`))
# 9 models produced
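In short: run dredge() once without a subset, inspect names() on the result, and use those exact term names (backtick-quoted) inside dc() in your subset expression.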
I want to estimate a structural equation model using lavaan in R with a categorical mediator. A wrinkle is that three of the exogenous variables are linearly dependent. However, this shouldn't be a problem since I'm using the categorical mediator to achieve identification a la Judea Pearl's front-door criterion. That is, mathematically each particular equation is identified (see the R code below).
With lavaan in R I can obtain estimates when the mediator is numeric, but not when it is categorical. With a categorical mediator I obtain the following error:
Error in lav_samplestats_step1(Y = Data, ov.names = ov.names, ov.types = ov.types, :
  lavaan ERROR: linear regression failed for y; X may not be of full rank in group 1
Any advice on how to obtain estimates with a categorical mediator using lavaan?
Code:
# simulating the dataset
set.seed(1234) # seed for replication
x1 <- rep(seq(1:4), 100) # variable 1
x2 <- rep(1:4, each=100) # variable 2
x3 <- x2 - x1 + 4 # linear dependence
m <- sample(0:1, size = 400, replace = TRUE) # mediator
df <- data.frame(cbind(x1,x2,x3,m)) # dataframe
df$y <- 6.5 + x1*(0.5) + x2*(0.2) + m*(-0.4) + x3*(-1) + rnorm(400, 0, 1) # outcome
# structural equation model using pearl's front-door criterion
sem.formula <- 'y ~ 1 + x1 + x2 + m
m ~ 1 + x3'
# continuous mediator: works!
fit <- lavaan::sem(sem.formula, data=df, estimator="WLSMV",
se="none", control=list(iter.max=500))
# categorical mediator: doesn't work
fit <- lavaan::sem(sem.formula, data=df, estimator="WLSMV",
se="none", control=list(iter.max=500),
ordered = "m")
I want to run a multinomial logit in R and have used two libraries, nnet and mlogit, which produce different results and report different types of statistics. My questions are:
What is the source of the discrepancy between the coefficients and standard errors reported by nnet and those reported by mlogit?
I would like to report my results in a LaTeX file using stargazer. When doing so, there is a problematic tradeoff:
If I use the results from mlogit then I get the statistics I want, such as pseudo R-squared; however, the output is in long format (see example below).
If I use the results from nnet then the format is as expected, but it reports statistics I am not interested in, such as AIC, and does not include, for example, pseudo R-squared.
I would like to have the statistics reported by mlogit in the formatting of nnet when I use stargazer.
Here is a reproducible example, with three choice alternatives:
library(mlogit)
df = data.frame(c(0,1,1,2,0,1,0), c(1,6,7,4,2,2,1), c(683,276,756,487,776,100,982))
colnames(df) <- c('y', 'col1', 'col2')
mydata = df
mldata <- mlogit.data(mydata, choice="y", shape="wide")
mlogit.model1 <- mlogit(y ~ 1| col1+col2, data=mldata)
The compiled TeX output (image not shown) is in what I refer to as "long" format, which I find undesirable.
Now, using nnet:
library(nnet)
library(stargazer)
mlogit.model2 <- multinom(y ~ 1 + col1 + col2, data = mydata)
stargazer(mlogit.model2)
This gives TeX output (image not shown) in the "wide" format that I want. Note the different coefficients and standard errors.
To my knowledge, there are three R packages that allow estimation of the multinomial logistic regression model: mlogit, nnet and globaltest (from Bioconductor). I do not consider here the mnlogit package, a faster and more efficient implementation of mlogit.
All the above packages use different algorithms that, for small samples, give different results. These differences vanish for moderate sample sizes (try with n <- 100).
Consider the following data-generating process, taken from James Keirstead's blog:
n <- 40
set.seed(4321)
df1 <- data.frame(x1=runif(n,0,100), x2=runif(n,0,100))
df1 <- transform(df1, y=1+ifelse(100 - x1 - x2 + rnorm(n,sd=10) < 0, 0,
ifelse(100 - 2*x2 + rnorm(n,sd=10) < 0, 1, 2)))
str(df1)
'data.frame': 40 obs. of 3 variables:
$ x1: num 33.48 90.91 41.15 4.38 76.35 ...
$ x2: num 68.6 42.6 49.9 36.1 49.6 ...
$ y : num 1 1 3 3 1 1 1 1 3 3 ...
table(df1$y)
1 2 3
19 8 13
The model parameters estimated by the three packages are respectively:
library(mlogit)
df2 <- mlogit.data(df1, choice="y", shape="wide")
mlogit.mod <- mlogit(y ~ 1 | x1+x2, data=df2)
(mlogit.cf <- coef(mlogit.mod))
2:(intercept) 3:(intercept) 2:x1 3:x1 2:x2 3:x2
42.7874653 80.9453734 -0.5158189 -0.6412020 -0.3972774 -1.0666809
#######
library(nnet)
nnet.mod <- multinom(y ~ x1 + x2, df1)
(nnet.cf <- coef(nnet.mod))
(Intercept) x1 x2
2 41.51697 -0.5005992 -0.3854199
3 77.57715 -0.6144179 -1.0213375
#######
library(globaltest)
glbtest.mod <- globaltest::mlogit(y ~ x1+x2, data=df1)
(cf <- glbtest.mod@coefficients)
1 2 3
(Intercept) -41.2442934 1.5431814 39.7011119
x1 0.3856738 -0.1301452 -0.2555285
x2 0.4879862 0.0907088 -0.5786950
The mlogit command of globaltest fits the model without using a reference outcome category, hence the usual parameters can be calculated as follows:
(glbtest.cf <- rbind(cf[,2]-cf[,1],cf[,3]-cf[,1]))
(Intercept) x1 x2
[1,] 42.78747 -0.5158190 -0.3972774
[2,] 80.94541 -0.6412023 -1.0666813
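Note that these differences match the mlogit::mlogit estimates above almost exactly.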
Concerning the estimation of the parameters in the three packages, the method used in mlogit::mlogit is explained in detail here.
In nnet::multinom the model is a neural network with no hidden layers, no bias nodes and a softmax output layer; in our case there are 3 input units and 3 output units:
nnet:::summary.nnet(nnet.mod)
a 3-0-3 network with 12 weights
options were - skip-layer connections softmax modelling
b->o1 i1->o1 i2->o1 i3->o1
0.00 0.00 0.00 0.00
b->o2 i1->o2 i2->o2 i3->o2
0.00 41.52 -0.50 -0.39
b->o3 i1->o3 i2->o3 i3->o3
0.00 77.58 -0.61 -1.02
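To read this output: for each non-reference outcome (o2 and o3), the i1 weight carries the intercept while i2 and i3 carry the x1 and x2 coefficients, which is why these numbers match coef(nnet.mod) above.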
Maximum conditional likelihood is the method used in multinom for model fitting.
The parameters of multinomial logit models are estimated in globaltest::mlogit using maximum likelihood and working with an equivalent log-linear model and the Poisson likelihood. The method is described here.
For models estimated by multinom, McFadden's pseudo R-squared can easily be calculated as follows:
nnet.mod.loglik <- nnet:::logLik.multinom(nnet.mod)
nnet.mod0 <- multinom(y ~ 1, df1)
nnet.mod0.loglik <- nnet:::logLik.multinom(nnet.mod0)
(nnet.mod.mfr2 <- as.numeric(1 - nnet.mod.loglik/nnet.mod0.loglik))
[1] 0.8483931
At this point, using stargazer, I generate a report for the model estimated by mlogit::mlogit that is as similar as possible to the report for multinom. The basic idea is to substitute the estimated coefficients and probabilities in the object created by multinom with the corresponding estimates from mlogit.
# Substitution of coefficients
nnet.mod2 <- nnet.mod
# reshape the 12-element weight vector into a (bias + 3 inputs) x (3 outputs) matrix
cf <- matrix(nnet.mod2$wts, nrow = 4)
# overwrite the input weights of the non-reference outputs with the mlogit estimates
cf[2:nrow(cf), 2:ncol(cf)] <- t(matrix(mlogit.cf, nrow = 2))
nnet.mod2$wts <- c(cf)
# Substitution of probabilities
nnet.mod2$fitted.values <- mlogit.mod$probabilities
Here is the result:
library(stargazer)
stargazer(nnet.mod2, type="text")
==============================================
Dependent variable:
----------------------------
2 3
(1) (2)
----------------------------------------------
x1 -0.516** -0.641**
(0.212) (0.305)
x2 -0.397** -1.067**
(0.176) (0.519)
Constant 42.787** 80.945**
(18.282) (38.161)
----------------------------------------------
Akaike Inf. Crit. 24.623 24.623
==============================================
Note: *p<0.1; **p<0.05; ***p<0.01
Now I am working on the last issue: how to visualize loglik, pseudo R2 and other information in the above stargazer output.
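One possible route is stargazer's add.lines argument, which appends raw rows to the table (a sketch; the labels and rounding are my own choices, and each row needs one value per model column):
stargazer(nnet.mod2, type = "text",
          add.lines = list(
            c("Log Likelihood", rep(round(as.numeric(nnet.mod.loglik), 2), 2)),
            c("McFadden pseudo R2", rep(round(nnet.mod.mfr2, 3), 2))))
The unwanted AIC row can be dropped in the same call with omit.stat = "aic".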
If you are using stargazer you can use omit to remove unwanted rows or references. Here is a quick example; hopefully it will point you in the right direction.
N.B. I assume you are using RStudio and R Markdown with knitr.
```{r, echo=FALSE}
library(mlogit)
df = data.frame(c(0,1,1,2,0,1,0), c(1,6,7,4,2,2,1), c(683,276,756,487,776,100,982))
colnames(df) <- c('y', 'col1', 'col2')
mydata = df
mldata <- mlogit.data(mydata, choice = "y", shape="wide")
mlogit.model1 <- mlogit(y ~ 1| col1+col2, data=mldata)
mlogit.col1 <- mlogit(y ~ 1 | col1, data = mldata)
mlogit.col2 <- mlogit(y ~ 1 | col2, data = mldata)
```
# MLOGIT
```{r echo = FALSE, message = TRUE, error = TRUE, warning = FALSE, results = 'asis'}
library(stargazer)
stargazer(mlogit.model1, type = "html")
stargazer(mlogit.col1,
mlogit.col2,
type = "html",
omit=c("1:col1","2:col1","1:col2","2:col2"))
```
Result (images not shown): note that the second table omits 1:col1, 2:col1, 1:col2 and 2:col2.