coefficients in vglm(), orm(), polr() and clm() for ologit - r

I am fitting an ordered logit (ologit) model with four different packages, VGAM, rms, MASS and ordinal, using the wine data set from the ordinal package.
First is vglm():
library(VGAM)
vglmfit <- vglm(rating ~ temp * contact, data = wine,
family=cumulative(parallel=TRUE, reverse=TRUE))
The coefficients are:
(Intercept):1 (Intercept):2 (Intercept):3 (Intercept):4
1.4112568 -1.1435551 -3.3770742 -4.9419773
tempwarm contactyes tempwarm:contactyes
2.3212033 1.3474598 0.3595241
Second is orm():
library(rms)
ormfit <- orm(rating ~ temp * contact, data = wine)
Coef:
Coef S.E. Wald Z Pr(>|Z|)
y>=2 1.4113 0.5454 2.59 0.0097
y>=3 -1.1436 0.5097 -2.24 0.0248
y>=4 -3.3771 0.6382 -5.29 <0.0001
y>=5 -4.9420 0.7509 -6.58 <0.0001
temp=warm 2.3212 0.7009 3.31 0.0009
contact=yes 1.3475 0.6604 2.04 0.0413
temp=warm * contact=yes 0.3595 0.9238 0.39 0.6971
Third, polr:
library(MASS)
polrfit <- polr(rating ~ temp * contact, method="logistic", data = wine)
coef:
Coefficients:
tempwarm contactyes tempwarm:contactyes
2.3211214 1.3474055 0.3596357
Intercepts:
1|2 2|3 3|4 4|5
-1.411278 1.143507 3.377005 4.941901
Last, clm():
library(ordinal)
clmfit <- clm(rating ~ temp * contact, link="logit", data = wine)
coef:
Coefficients:
tempwarm contactyes tempwarm:contactyes
2.3212 1.3475 0.3595
Threshold coefficients:
1|2 2|3 3|4 4|5
-1.411 1.144 3.377 4.942
Besides, when reverse=FALSE in vglm(),
library(VGAM)
vglmfit <- vglm(rating ~ temp * contact, data = wine,
family=cumulative(parallel=TRUE, reverse=FALSE))
Coefficients:
(Intercept):1 (Intercept):2 (Intercept):3 (Intercept):4
-1.4112568 1.1435551 3.3770742 4.9419773
tempwarm contactyes tempwarm:contactyes
-2.3212033 -1.3474598 -0.3595241
You may notice that the coefficients from vglm() with reverse=TRUE and from orm() are the same, and that those from polr() and clm() are the same. So there are two sets of coefficients, and the only difference between them is the sign of the intercepts.
When I set reverse=FALSE, it does reverse the intercepts, but it also reverses the signs of the covariate coefficients, which I don't want.
What is going on here? How can I get exactly the same result, or how should I explain the difference?

This is all just a matter of parametrizations. One classical way to introduce the ordered logistic regression model is to assume that there is a latent continuous response
y* = x'b + e
where e has a standard logistic distribution. Then it is assumed that y* itself is not observed, but only a discretized category y = j, recorded when y* falls between the cut-offs a_(j-1) and a_j. This leads to the model equation:
logit(P(y <= j)) = a_j - x'b
Other motivations lead to similar equations but with P(y >= j) and/or a_j + x'b. This just leads to switches in the signs of the a and/or b coefficients that you observe in the different implementations. The corresponding models and predictions are equivalent, of course. Which interpretation you find easier is mostly a matter of taste.
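To make the equivalence concrete, here is a minimal base-R sketch that plugs the estimates printed in the question into the three parametrizations and recovers the same category probabilities each time. The covariate vector x (warm temperature, contact = yes) is a hypothetical observation chosen only for illustration.

```r
# Estimates copied from the vglm() output in the question
a <- c(-1.4112568, 1.1435551, 3.3770742, 4.9419773)   # cut-offs a_j, polr()/clm() sign
b <- c(2.3212033, 1.3474598, 0.3595241)               # tempwarm, contactyes, interaction

# Hypothetical observation: temp = warm, contact = yes
x   <- c(1, 1, 1)
eta <- sum(x * b)

# polr()/clm(): logit P(y <= j) = a_j - x'b
p_polr <- diff(c(0, plogis(a - eta), 1))

# vglm(reverse = TRUE)/orm(): logit P(y >= j + 1) = -a_j + x'b
p_orm <- -diff(c(1, plogis(-a + eta), 0))

# vglm(reverse = FALSE): logit P(y <= j) = a_j + x'b*, with b* = -b
p_vglm <- diff(c(0, plogis(a + sum(x * (-b))), 1))

# Category probabilities P(y = 1), ..., P(y = 5) under each parametrization
round(rbind(p_polr, p_orm, p_vglm), 4)
```

All three rows agree, so the sign switches change how the coefficients are reported but not the fitted model.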

Related

R: Different result from glm and mle2 package in R

I want to estimate the parameters using glm() and compare them with the estimates from mle2() in the bbmle package.
Here's my code for GLM
d <- read.delim("http://dnett.github.io/S510/Disease.txt")
d$disease=factor(d$disease)
d$ses=factor(d$ses)
d$sector=factor(d$sector)
str(d)
glm2 <- glm(disease~ses+sector, family=binomial(link=logit), data=d)
summary(glm2)
And my code for mle2()
y<-as.numeric(as.character(d$disease))
x1<-as.numeric(as.character(d$age))
x2<-as.numeric(as.character(d$sector))
x3<-as.numeric(as.character(d$ses))
library(bbmle)
nlldbin=function(A,B,C,D){
eta<-A+B*(x3==2)+C*(x3==3)+D*(x2==2)
p<-1/(1+exp(-eta))
joint.pdf= (p^y)*((1-p)^(1-y))
-sum(joint.pdf, log=TRUE ,na.rm=TRUE)
}
st <- list(A=0.0001,B=0.0001,C=0.0001,D=0.0001)
est_mle2<-mle2(start=st,nlldbin,hessian=TRUE)
summary(est_mle2)
But the results are quite different. Please help me fix this, thank you!
> summary(est_mle2)
Maximum likelihood estimation
Call:
mle2(minuslogl = nlldbin, start = st, hessian.opts = TRUE)
Coefficients:
Estimate Std. Error z value Pr(z)
A -20.4999 5775.1484 -0.0035 0.9972
B -5.2499 120578.9515 0.0000 1.0000
C -7.9999 722637.2670 0.0000 1.0000
D -2.2499 39746.6639 -0.0001 1.0000
> summary(glm2)
Call:
glm(formula = disease ~ ses + sector, family = binomial(link = logit),
data = d)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.52001 0.33514 -4.535 5.75e-06 ***
ses2 -0.08525 0.41744 -0.204 0.838177
ses3 0.16086 0.39261 0.410 0.682019
sector2 1.28098 0.34140 3.752 0.000175 ***
I'm not sure your definition of eta is correct. I would use the model matrix.
X <- model.matrix(~ ses + sector, data = d)
nlldbin <- function(A,B,C,D){
eta <- X %*% c(A, B, C, D)
p <- 1/(1+exp(-eta))
logpdf <- y*log(p) + (1-y)*log(1-p)
-sum(logpdf)
}
This line
-sum(joint.pdf, log=TRUE ,na.rm=TRUE)
is wrong. sum doesn't have a special log argument; what you're doing is adding the value TRUE (which gets converted to 1) to the pdf.
What you want is
-sum(log(joint.pdf), na.rm=TRUE)
but this is also not very good for numerical reasons, as the pdf is likely to underflow. A better way of writing it would be
logpdf <- y*log(p) + (1-y)*log(1-p)
-sum(logpdf, na.rm=TRUE)
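Both points can be checked without the original data. The sketch below uses simulated data (the sample size, true coefficients and variable names are made up for illustration) and base R's optim() in place of mle2(), so it illustrates the idea rather than reproducing the asker's exact setup.

```r
# sum() has no `log` argument: the named value is just one more term to add
sum(c(0.5, 0.5), log = TRUE)   # returns 2, because TRUE is coerced to 1

# Simulated logistic-regression data (hypothetical)
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))
X <- cbind(1, x)

# Negative log-likelihood written on the log scale;
# plogis(eta, log.p = TRUE) is log(p) and plogis(-eta, log.p = TRUE) is log(1 - p)
nll <- function(beta) {
  eta <- drop(X %*% beta)
  -sum(y * plogis(eta, log.p = TRUE) + (1 - y) * plogis(-eta, log.p = TRUE))
}

fit_optim <- optim(c(0, 0), nll, method = "BFGS")
fit_glm   <- glm(y ~ x, family = binomial)

# The two sets of estimates should agree up to optimizer tolerance
cbind(optim = fit_optim$par, glm = coef(fit_glm))
```

Working on the log scale throughout avoids forming p^y * (1 - p)^(1 - y) explicitly, which is where underflow would otherwise creep in.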

plot coefficients of a model in R

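I am fitting training data with glm() and want to plot the coefficients, but I have no idea how to produce the right plot. My code is as follows:
library(caret)  # createDataPartition(), train(), trainControl()
library(dplyr)  # %>% and mutate()
set.seed(1)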
trn_index = createDataPartition(y = development$EQUAL_PAY, p = 0.80, list = FALSE)
trn_pay = development[trn_index, ]
tst_pay = development[-trn_index, ]
trn_pay_f <- trn_pay %>%
mutate(EQUAL_PAY = relevel(factor(EQUAL_PAY),ref = "YES"))
pay_lgr = train(EQUAL_PAY ~ .- EQUAL_WORK - COUNTRY, method = "glm", family = binomial(link = "logit"), data = trn_pay_f,trControl = trainControl(method = 'cv', number = 10))
summary(pay_lgr)
##Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.560e+00 2.552e+00 -1.003 0.3158
GDP_PER_CAP -5.253e-05 3.348e-05 -1.569 0.1167
CO2_PER_CAP 1.695e-01 7.882e-02 2.151 0.0315 *
PERC_ACCESS_ELECTRICITY -7.833e-03 1.249e-02 -0.627 0.5304
ATMS_PER_1E5 -2.473e-03 8.012e-03 -0.309 0.7576
PERC_INTERNET_USERS -2.451e-02 2.047e-02 -1.198 0.2310
SCIENTIFIC_ARTICLES_PER_YR 2.698e-05 1.519e-05 1.776 0.0757 .
PERC_FEMALE_SECONDARY_EDU 1.126e-01 5.934e-02 1.897 0.0578 .
PERC_FEMALE_LABOR_FORCE -6.559e-03 1.477e-02 -0.444 0.6569
PERC_FEMALE_PARLIAMENT -4.786e-02 2.191e-02 -2.184 0.0289 *
## extract all parameters in a dataframe
pay_lgrFrame <- data.frame(COEFFICIENT = rownames(summary(pay_lgr)$coef),
p_value = summary(pay_lgr)$coef[,4],
z_value = summary(pay_lgr)$coef[,3],
SE = summary(pay_lgr)$coef[,2],
Estimate = summary(pay_lgr)$coef[,1])
## I am stuck on making a plot like the one in the image linked above.
Pulling in your summary table (you can get this directly as ss <- coef(summary(pay_lgr)), but I don't have your data set):
## whitespace-separated text; the first field of each row becomes the row name
ss <- read.table(row.names=1,check.names=FALSE,
col.names=c("term","Estimate","Std. Error","z value","Pr(>|z|)"),text="
(Intercept) -2.560e+00 2.552e+00 -1.003 0.3158
GDP_PER_CAP -5.253e-05 3.348e-05 -1.569 0.1167
CO2_PER_CAP 1.695e-01 7.882e-02 2.151 0.0315
PERC_ACCESS_ELECTRICITY -7.833e-03 1.249e-02 -0.627 0.5304
ATMS_PER_1E5 -2.473e-03 8.012e-03 -0.309 0.7576
PERC_INTERNET_USERS -2.451e-02 2.047e-02 -1.198 0.2310
SCIENTIFIC_ARTICLES_PER_YR 2.698e-05 1.519e-05 1.776 0.0757
PERC_FEMALE_SECONDARY_EDU 1.126e-01 5.934e-02 1.897 0.0578
PERC_FEMALE_LABOR_FORCE -6.559e-03 1.477e-02 -0.444 0.6569
PERC_FEMALE_PARLIAMENT -4.786e-02 2.191e-02 -2.184 0.0289")
Convert row names to a column called term:
ss2 <- tibble::rownames_to_column(ss,"term")
Draw the barplot:
library(ggplot2)
ggplot(ss2, aes(term,Estimate))+
geom_bar(stat="identity")+
coord_flip()
ggsave("bar.png")
As others have commented, there are probably better (both easier and preferable in terms of visual communication) ways to plot the coefficients. The dotwhisker::dwplot() function does several convenient things:
automatically extracts coefficients and plots them
automatically scales continuous predictors by 2*std dev, to enable comparison between coefficients (use by_2sd=FALSE if you don't want this)
automatically leaves out the intercept, which is on a different scale from the other parameters and is rarely of inferential interest
library(dotwhisker)
dwplot(lm(Murder/Population ~ ., data=as.data.frame(state.x77)))
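If you would rather stay with plain ggplot2, a point-and-interval plot is one alternative to the bar chart. The sketch below is only an illustration: it copies four of the estimates and standard errors from the summary table in the question, leaves out the intercept, and draws Wald 95% intervals (estimate plus or minus 1.96 standard errors). The data frame cf and its column names are made up here. Unlike dwplot(), nothing is rescaled, so predictors measured on very different scales are not directly comparable.

```r
library(ggplot2)

# A few estimates and standard errors copied from the summary table in the question
cf <- data.frame(
  term     = c("CO2_PER_CAP", "PERC_FEMALE_SECONDARY_EDU",
               "PERC_FEMALE_PARLIAMENT", "PERC_INTERNET_USERS"),
  estimate = c(1.695e-01, 1.126e-01, -4.786e-02, -2.451e-02),
  se       = c(7.882e-02, 5.934e-02, 2.191e-02, 2.047e-02)
)

# Wald 95% confidence limits
cf$lower <- cf$estimate - qnorm(0.975) * cf$se
cf$upper <- cf$estimate + qnorm(0.975) * cf$se

# Points with interval whiskers, plus a reference line at zero
p <- ggplot(cf, aes(x = term, y = estimate, ymin = lower, ymax = upper)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_pointrange() +
  coord_flip()
p
```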

Robust standard errors for negative binomial regression in R do not match those from Stata

I am replicating a negative binomial regression model in R. When calculating robust standard errors, the output does not match Stata output of standard errors.
The original Stata code is
nbreg displaced eei lcostofwar cfughh roadskm lpopdensity ltkilled, robust nolog
I have attempted both manual calculation and vcovHC from sandwich. However, neither produces the same results.
My regression model is as follows:
mod1 <- glm.nb(displaced ~ eei + costofwar_log + cfughh + roadskm + popdensity_log + tkilled_log, data = mod1_df)
With vcovHC I have tried every option from HC0 to HC5.
Attempt 1:
cov_m1 <- vcovHC(mod1, type = "HC0", sandwich = T)
se <- sqrt(diag(cov_m1))
Attempt 2:
mod1_rob <- coeftest(mod1, vcovHC = vcov(mod1, type = "HC0"))
The most successful has been HC0 and vcov = sandwich but no SEs are correct.
Any suggestions?
EDIT
My output is as follows (using HC0):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.3281183 1.5441312 0.8601 0.389730
eei -0.0435529 0.0183359 -2.3753 0.017536 *
costofwar_log 0.2984376 0.1350518 2.2098 0.027119 *
cfughh -0.0380690 0.0130254 -2.9227 0.003470 **
roadskm 0.0020812 0.0010864 1.9156 0.055421 .
popdensity_log -0.4661079 0.1748682 -2.6655 0.007688 **
tkilled_log 1.0949084 0.2159161 5.0710 3.958e-07 ***
The Stata output I am attempting to replicate is:
Estimate Std. Error
(Intercept) 1.328 1.272
eei -0.044 0.015
costofwar_log 0.298 0.123
cfughh -0.038 0.018
roadskm 0.002 0.0001
popdensity_log -0.466 0.208
tkilled_log 1.095 0.209
The dataset is found here and the recoded variables are:
mod1_df <- table %>%
select(displaced, eei_01, costofwar, cfughh, roadskm, popdensity,
tkilled)
mod1_df$popdensity_log <- log(mod1_df$popdensity + 1)
mod1_df$tkilled_log <- log(mod1_df$tkilled + 1)
mod1_df$costofwar_log <- log(mod1_df$costofwar + 1)
mod1_df$eei <- mod1_df$eei_01*100
Stata uses the observed Hessian for its computations, whereas glm.nb() uses the expected Hessian. Therefore, the default bread() employed by the sandwich() function is different, leading to different results. There are other R packages that employ the observed Hessian for their variance-covariance estimates (e.g., gamlss), but these do not supply an estfun() method for the sandwich package.
Hence, below I simply set up a dedicated bread_obs() function that extracts the ML estimates from a negbin object, sets up the negative log-likelihood, computes the observed Hessian numerically via numDeriv::hessian() and computes the "bread" from it (omitting the estimate for log(theta)):
bread_obs <- function(object, method = "BFGS", maxit = 5000, reltol = 1e-12, ...) {
## data and estimated parameters
Y <- model.response(model.frame(object))
X <- model.matrix(object)
par <- c(coef(object), "log(theta)" = log(object$theta))
## dimensions
n <- NROW(X)
k <- length(par)
## nb log-likelihood
nll <- function(par) suppressWarnings(-sum(dnbinom(Y,
mu = as.vector(exp(X %*% head(par, -1))),
size = exp(tail(par, 1)), log = TRUE)))
## covariance based on observed Hessian
rval <- numDeriv::hessian(nll, par)
rval <- solve(rval) * n
rval[-k, -k]
}
With that function I can compare the sandwich() output (based on the expected Hessian) with the output using the bread_obs() (based on the observed Hessian).
s_exp <- sandwich(mod1)
s_obs <- sandwich(mod1, bread. = bread_obs)
cbind("Coef" = coef(mod1), "SE (Exp)" = sqrt(diag(s_exp)), "SE (Obs)" = sqrt(diag(s_obs)))
## Coef SE (Exp) SE (Obs)
## (Intercept) 1.328 1.259 1.259
## eei -0.044 0.017 0.015
## costofwar_log 0.298 0.160 0.121
## cfughh -0.038 0.015 0.018
## roadskm 0.002 0.001 0.001
## popdensity_log -0.466 0.135 0.207
## tkilled_log 1.095 0.179 0.208
This still has slight differences compared to Stata but these are likely numerical differences from the optimization etc.
If you create a new dedicated bread() method for negbin objects
bread.negbin <- bread_obs
then the method dispatch will use this if you do sandwich(mod1).
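As a side note on the question's Attempt 2: coeftest() takes the covariance through its vcov. argument, and vcovHC is not one of its argument names. A minimal sketch of the intended call is below. It uses the quine data shipped with MASS and a made-up formula as stand-ins, since the original data set is not available here, so the numbers are unrelated to the Stata output.

```r
library(MASS)      # glm.nb() and the quine data
library(sandwich)  # sandwich()
library(lmtest)    # coeftest()

# Stand-in negative binomial model (not the model from the question)
fit <- glm.nb(Days ~ Sex + Age, data = quine)

# Model-based standard errors
ct_model <- coeftest(fit)

# Robust (sandwich) standard errors: pass the covariance function via vcov.
ct_robust <- coeftest(fit, vcov. = sandwich)

# Same estimates, different standard errors
cbind(Estimate    = ct_model[, "Estimate"],
      "SE model"  = ct_model[, "Std. Error"],
      "SE robust" = ct_robust[, "Std. Error"])
```

Without a bread() method for negbin objects, sandwich() here falls back on the expected-Hessian bread described above.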
In R you need to manually provide a degree of freedom correction, so try this which I borrowed from this source:
dfa <- (G/(G - 1)) * (N - 1)/pm1$df.residual
# display with cluster VCE and df-adjustment
firm_c_vcov <- dfa * vcovHC(pm1, type = "HC0", cluster = "group", adjust = T)
coeftest(pm1, vcov = firm_c_vcov)
Here G is the number of Panels in your data set, N is the number of observations and pm1 is your model estimated. Obviously, you could drop the clustering.

R - Plm and lm - Fixed effects

I have a balanced panel data set, df, that essentially consists of three variables, A, B and Y, that vary over time for a bunch of uniquely identified regions. I would like to run a regression that includes both regional (region in the equation below) and time (year) fixed effects. If I'm not mistaken, I can achieve this in different ways:
lm(Y ~ A + B + factor(region) + factor(year), data = df)
or
library(plm)
plm(Y ~ A + B,
data = df, index = c('region', 'year'), model = 'within',
effect = 'twoways')
In the second equation I specify indices (region and year), the model type ('within', FE), and the nature of FE ('twoways', meaning that I'm including both region and time FE).
Although I seem to be doing things correctly, I get extremely different results. The problem disappears when I do not consider time fixed effects and use the argument effect = 'individual'.
What's the deal here? Am I missing something? Are there any other R packages that would let me run the same analysis?
Perhaps posting an example of your data would help answer the question. I am getting the same coefficients for some made up data. You can also use felm from the package lfe to do the same thing:
N <- 10000
df <- data.frame(a = rnorm(N), b = rnorm(N),
region = rep(1:100, each = 100), year = rep(1:100, 100))
df$y <- 2 * df$a - 1.5 * df$b + rnorm(N)
model.a <- lm(y ~ a + b + factor(year) + factor(region), data = df)
summary(model.a)
# (Intercept) -0.0522691 0.1422052 -0.368 0.7132
# a 1.9982165 0.0101501 196.866 <2e-16 ***
# b -1.4787359 0.0101666 -145.450 <2e-16 ***
library(plm)
pdf <- pdata.frame(df, index = c("region", "year"))
model.b <- plm(y ~ a + b, data = pdf, model = "within", effect = "twoways")
summary(model.b)
# Coefficients :
# Estimate Std. Error t-value Pr(>|t|)
# a 1.998217 0.010150 196.87 < 2.2e-16 ***
# b -1.478736 0.010167 -145.45 < 2.2e-16 ***
library(lfe)
model.c <- felm(y ~ a + b | factor(region) + factor(year), data = df)
summary(model.c)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# a 1.99822 0.01015 196.9 <2e-16 ***
# b -1.47874 0.01017 -145.4 <2e-16 ***
This does not seem to be a data issue.
I'm doing computer exercises in R from Wooldridge (2012) Introductory Econometrics. Specifically Chapter 14 CE.1 (data is the rental file at: https://www.cengage.com/cgi-wadsworth/course_products_wp.pl?fid=M20b&product_isbn_issn=9781111531041)
I computed the model in differences (in Python)
model_diff = smf.ols(formula='diff_lrent ~ diff_lpop + diff_lavginc + diff_pctstu', data=rental).fit()
OLS Regression Results
==============================================================================
Dep. Variable: diff_lrent R-squared: 0.322
Model: OLS Adj. R-squared: 0.288
Method: Least Squares F-statistic: 9.510
Date: Sun, 05 Nov 2017 Prob (F-statistic): 3.14e-05
Time: 00:46:55 Log-Likelihood: 65.272
No. Observations: 64 AIC: -122.5
Df Residuals: 60 BIC: -113.9
Df Model: 3
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 0.3855 0.037 10.469 0.000 0.312 0.459
diff_lpop 0.0722 0.088 0.818 0.417 -0.104 0.249
diff_lavginc 0.3100 0.066 4.663 0.000 0.177 0.443
diff_pctstu 0.0112 0.004 2.711 0.009 0.003 0.019
==============================================================================
Omnibus: 2.653 Durbin-Watson: 1.655
Prob(Omnibus): 0.265 Jarque-Bera (JB): 2.335
Skew: 0.467 Prob(JB): 0.311
Kurtosis: 2.934 Cond. No. 23.0
==============================================================================
Now, the PLM package in R gives the same results for the first-difference models:
library(plm)
modelfd <- plm(lrent ~ lpop + lavginc + pctstu,
data = data, model = "fd")
No problem so far. However, the fixed effect reports different estimates.
modelfx <- plm(lrent ~ lpop + lavginc + pctstu, data = data,
model = "within", effect = "time")
summary(modelfx)
The FE results should not be any different. In fact, the Computer Exercise question is:
(iv) Estimate the model by fixed effects to verify that you get identical estimates and standard errors to those in part (iii).
My best guess is that I am misunderstanding something about the R package.
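One thing that may be worth checking: the textbook equivalence between first differences and fixed effects with two periods concerns unit (individual) fixed effects together with a time dummy, whereas effect = "time" in plm removes time effects only. The base-R sketch below uses simulated two-period data (all names and parameter values are hypothetical, and this is not the rental file) to show that the unit-plus-time dummy regression matches first differences, while the time-only regression does not.

```r
set.seed(42)
n     <- 50                               # number of units (hypothetical)
id    <- rep(seq_len(n), each = 2)
year  <- rep(c(1, 2), times = n)
alpha <- rep(rnorm(n), each = 2)          # unobserved unit effects
x     <- rnorm(2 * n) + alpha             # regressor correlated with the unit effect
y     <- 1 + 0.5 * x + 0.3 * (year == 2) + alpha + rnorm(2 * n)
panel <- data.frame(id, year, x, y)

# First differences (rows are sorted by id, so the two periods line up)
dx <- x[year == 2] - x[year == 1]
dy <- y[year == 2] - y[year == 1]
fd <- lm(dy ~ dx)

# Fixed effects as dummies: unit effects plus a time dummy
fe_twoways <- lm(y ~ x + factor(id) + factor(year), data = panel)

# Time effects only, the analogue of effect = "time"
fe_time <- lm(y ~ x + factor(year), data = panel)

b_fd   <- unname(coef(fd)["dx"])
b_fe   <- unname(coef(fe_twoways)["x"])
b_time <- unname(coef(fe_time)["x"])
se_fd  <- coef(summary(fd))["dx", "Std. Error"]
se_fe  <- coef(summary(fe_twoways))["x", "Std. Error"]

c(fd = b_fd, twoways = b_fe, time_only = b_time)
c(se_fd = se_fd, se_twoways = se_fe)
```

If the same pattern holds for the rental data, switching to effect = "twoways" (or effect = "individual" with a year dummy in the formula) would be the natural thing to try.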

Mean variance of a difference of BLUEs or BLUPs in `lme4`

Given below is the code for analysis of a resolvable alpha design (alpha lattice design) using the R package asreml.
# load the data
library(agridat)
data(john.alpha)
dat <- john.alpha
# load asreml
library(asreml)
# model1 - random `gen`
#----------------------
# fitting the model
model1 <- asreml(yield ~ 1 + rep, data=dat, random=~ gen + rep:block)
# variance due to `gen`
sg2 <- summary(model1 )$varcomp[1,'component']
# mean variance of a difference of two BLUPs
vblup <- predict(model1 , classify="gen")$avsed ^ 2
# model2 - fixed `gen`
#----------------------
model2 <- asreml(yield ~ 1 + gen + rep, data=dat, random = ~ rep:block)
# mean variance of a difference of two adjusted treatment means (BLUE)
vblue <- predict(model2 , classify="gen")$avsed ^ 2
# H^2 = .803
sg2 / (sg2 + vblue/2)
# H^2c = .809
1-(vblup / 2 / sg2)
I am trying to replicate the above using the R package lme4.
# model1 - random `gen`
#----------------------
# fitting the model
model1 <- lmer(yield ~ 1 + (1|gen) + rep + (1|rep:block), dat)
# variance due to `gen`
varcomp <- VarCorr(model1)
varcomp <- data.frame(print(varcomp, comp = "Variance"))
sg2 <- varcomp[varcomp$grp == "gen",]$vcov
# model2 - fixed `gen`
#----------------------
model2 <- lmer(yield ~ 1 + gen + rep + (1|rep:block), dat)
How to compute the vblup and vblue (mean variance of difference) in lme4 equivalent to predict()$avsed ^ 2 of asreml ?
I'm not that familiar with this variance partitioning stuff, but I'll take a shot.
library(lme4)
model1 <- lmer(yield ~ 1 + rep + (1|gen) + (1|rep:block), john.alpha)
model2 <- update(model1, . ~ . + gen - (1|gen))
## variance due to `gen`
sg2 <- c(VarCorr(model1)[["gen"]]) ## 0.142902
Get conditional variances of BLUPs:
rr1 <- ranef(model1,condVar=TRUE)
vv1 <- attr(rr1$gen,"postVar")
str(vv1)
## num [1, 1, 1:24] 0.0289 0.0289 0.0289 0.0289 0.0289 ...
This is a 1x1x24 array (effectively just a vector of variances; we could collapse using c() if we needed to). They're not all the same, but they're pretty close ... I don't know whether they should all be identical (and this is a roundoff issue)
(uv <- unique(vv1))
## [1] 0.02887451 0.02885887 0.02885887
The relative variation is approximately 5.4e-4 ...
If these were all the same then the mean variance of a difference of any two would be just twice the variance (Var(x-y) = Var(x)+Var(y); by construction the BLUPs are all independent). I'm going to go ahead and use this.
vblup <- 2*mean(vv1)
For the model with gen fitted as a fixed effect, let's extract the variances of the parameters relating to genotypes (which are differences in the expected value from the first level):
vv2 <- diag(vcov(model2))[-(1:3)]
summary(vv2)
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.06631 0.06678 0.07189 0.07013 0.07246 0.07286
I'm going to take the means of these values (not double the values, since these are already the variances of differences)
vblue <- mean(vv2)
sg2/(sg2+vblue/2) ## 0.8029779
1-(vblup/2/sg2) ## 0.7979965
The H^2 estimate looks right on, but the H^2c estimate is a little different (0.798 vs. 0.809, a relative difference of about 1.4%); I don't know if that is big enough to be of concern or not.
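The averages above ignore covariances between estimates. For reference, the variance of a difference is Var(x - y) = Var(x) + Var(y) - 2 Cov(x, y), so given a full covariance matrix one can average that quantity over all pairs. The helper below, mean_var_diff(), is a hypothetical name introduced for this sketch and is checked only on toy matrices; whether it closes the gap to asreml's avsed is not verified here.

```r
# Mean variance of a difference over all pairs, from a covariance matrix V:
# Var(x_i - x_j) = V[i, i] + V[j, j] - 2 * V[i, j]
mean_var_diff <- function(V) {
  idx <- utils::combn(nrow(V), 2)          # all pairs i < j, one per column
  d   <- diag(V)
  mean(d[idx[1, ]] + d[idx[2, ]] - 2 * V[cbind(idx[1, ], idx[2, ])])
}

# Toy checks: uncorrelated unit variances, then correlation 0.5
V_indep <- diag(2)
V_corr  <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)
mean_var_diff(V_indep)
mean_var_diff(V_corr)
```

With treatment contrasts, the genotype block of vcov(model2) could be padded with a zero row and column for the reference level before calling the helper, since a contrast against the reference has variance V[i, i].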

Resources