Simple logistic regression example.
set.seed(1)
df <- data.frame(out=c(0,1,0,1,0,1,0,1,0),
y=rep(c('A', 'B', 'C'), 3))
result <-glm(out~factor(y), family = 'binomial', data=df)
summary(result)
#Call:
#glm(formula = out ~ factor(y), family = "binomial", data = df)
#Deviance Residuals:
# Min 1Q Median 3Q Max
#-1.4823 -0.9005 -0.9005 0.9005 1.4823
#Coefficients:
# Estimate Std. Error z value Pr(>|z|)
#(Intercept) -6.931e-01 1.225e+00 -0.566 0.571
#factor(y)B 1.386e+00 1.732e+00 0.800 0.423
#factor(y)C 3.950e-16 1.732e+00 0.000 1.000
#(Dispersion parameter for binomial family taken to be 1)
# Null deviance: 12.365 on 8 degrees of freedom
#Residual deviance: 11.457 on 6 degrees of freedom
#AIC: 17.457
#Number of Fisher Scoring iterations: 4
My reference category is now A; results for B and C relative to A are given. I would also like to get the results when B and C are the reference. One can change the reference manually by using levels = in factor(); but this would require fitting 3 models. Is it possible to do this in one go? Or what would be a more efficient approach?
If you want to do all pairwise comparisons, you should usually also do a correction for alpha-error inflation due to multiple testing. You can easily do a Tukey test with package multcomp.
set.seed(1)
df <- data.frame(out=c(0,1,0,1,0,1,0,1,0),
y=rep(c('A', 'B', 'C'), 3))
#y is already a factor, if not, coerce before the model fit
result <-glm(out~y, family = 'binomial', data=df)
summary(result)
library(multcomp)
comps <- glht(result, linfct = mcp(y = "Tukey"))
summary(comps)
#Simultaneous Tests for General Linear Hypotheses
#
#Multiple Comparisons of Means: Tukey Contrasts
#
#
#Fit: glm(formula = out ~ y, family = "binomial", data = df)
#
#Linear Hypotheses:
# Estimate Std. Error z value Pr(>|z|)
#B - A == 0 1.386e+00 1.732e+00 0.8 0.703
#C - A == 0 1.923e-16 1.732e+00 0.0 1.000
#C - B == 0 -1.386e+00 1.732e+00 -0.8 0.703
#(Adjusted p values reported -- single-step method)
#letter notation often used in graphs and tables
cld(comps)
# A B C
#"a" "a" "a"
Related
In glm() it is possible to model bernoulli [0,1] outcomes with a logistic regression using the following sort of syntax.
glm(bin ~ x, df, family = "binomial")
However you can also perform aggregated binomial regression, where each observation represents a count of target events from a certain fixed number of bernoulli trials. For example see the following data:
set.seed(1)
n <- 50
cov <- 10
x <- c(rep(0,n/2), rep(1, n/2))
p <- 0.4 + 0.2*x
y <- rbinom(n, cov, p)
With these sort of data you use slightly different syntax in glm()
mod <- glm(cbind(y, cov-y) ~ x, family="binomial")
mod
# output
# Call: glm(formula = cbind(y, cov - y) ~ x, family = "binomial")
#
# Coefficients:
# (Intercept) x
# -0.3064 0.6786
#
# Degrees of Freedom: 49 Total (i.e. Null); 48 Residual
# Null Deviance: 53.72
# Residual Deviance: 39.54 AIC: 178
I was wondering is it possible to model this type of aggregated binomial data in the glmnet package? If so, what is the syntax?
Yes you can do it as the following
set.seed(1)
n <- 50
cov <- 10
x <- c(rep(0,n/2), rep(1, n/2))
x = cbind(x, xx = c(rep(0.5,20), rep(0.7, 20), rep(1,10)))
p <- 0.4 + 0.2*x
y <- rbinom(n, cov, p)
I added another covariate here called xx as glmnet accepts minimum of two covariates
In glm as you have it in your post
mod <- glm(cbind(y, cov-y) ~ x, family="binomial")
mod
# output
# Call: glm(formula = cbind(y, cov - y) ~ x, family = "binomial")
# Coefficients:
# (Intercept) xx xxx
# 0.04366 0.86126 -0.64862
# Degrees of Freedom: 49 Total (i.e. Null); 47 Residual
# Null Deviance: 53.72
# Residual Deviance: 38.82 AIC: 179.3
In glmnet, without regularization (lambda=0) to reproduce similar results as in glm
library(glmnet)
fit = glmnet(x, cbind(cov-y,y), family="binomial", lambda=0)
coef(fit)
# output
# 3 x 1 sparse Matrix of class "dgCMatrix"
# s0
# (Intercept) 0.04352689
# x 0.86111234
# xx -0.64831806
Does anyone know if it is possible to use lmFit or lm in R to calculate a linear model with categorical variables while including all possible comparisons between the categories? For example in the test data created here:
set.seed(25)
f <- gl(n = 3, k = 20, labels = c("control", "low", "high"))
mat <- model.matrix(~f, data = data.frame(f = f))
beta <- c(12, 3, 6) #these are the simulated regression coefficient
y <- rnorm(n = 60, mean = mat %*% beta, sd = 2)
m <- lm(y ~ f)
I get the summary:
summary(m)
Call:
lm(formula = y ~ f)
Residuals:
Min 1Q Median 3Q Max
-4.3505 -1.6114 0.1608 1.1615 5.2010
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.4976 0.4629 24.840 < 2e-16 ***
flow 3.0370 0.6546 4.639 2.09e-05 ***
fhigh 6.1630 0.6546 9.415 3.27e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.07 on 57 degrees of freedom
Multiple R-squared: 0.6086, Adjusted R-squared: 0.5949
F-statistic: 44.32 on 2 and 57 DF, p-value: 2.446e-12
which is because the contrasts term ("contr.treatment") compares "high" to "control" and "low" to "control".
Is it possible to get also the comparison between "high" and "low"?
If you use aov instead of lm, you can use the TukeyHSD function from the stats package:
fit <- aov(y ~ f)
TukeyHSD(fit)
# Tukey multiple comparisons of means
# 95% family-wise confidence level
# Fit: aov(formula = y ~ f)
# $f
# diff lwr upr p adj
# low-control 3.036957 1.461707 4.612207 6.15e-05
# high-control 6.163009 4.587759 7.738259 0.00e+00
# high-low 3.126052 1.550802 4.701302 3.81e-05
If you want to use an lm object, you can use the TukeyHSD function from the mosaic package:
library(mosaic)
TukeyHSD(m)
Or, as #ben-bolker suggests,
library(emmeans)
e1 <- emmeans(m, specs = "f")
pairs(e1)
# contrast estimate SE df t.ratio p.value
# control - low -3.036957 0.6546036 57 -4.639 0.0001
# control - high -6.163009 0.6546036 57 -9.415 <.0001
# low - high -3.126052 0.6546036 57 -4.775 <.0001
# P value adjustment: tukey method for comparing a family of 3 estimates
With lmFit:
library(limma)
design <- model.matrix(~0 + f)
colnames(design) <- levels(f)
fit <- lmFit(y, design)
contrast.matrix <- makeContrasts(control-low, control-high, low-high,
levels = design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2)
round(t(rbind(fit2$coefficients, fit2$t, fit2$p.value)), 5)
# [,1] [,2] [,3]
# control - low -3.03696 -4.63938 2e-05
# control - high -6.16301 -9.41487 0e+00
# low - high -3.12605 -4.77549 1e-05
Also see Multiple t-test comparisons for more information.
I have run the Hosmer Lemeshow statistic in R, but I have obtained an p-value of 1. This seems strange to me. I know that a high p-valvalue means that we do not reject the null hypothesis that observed and expected are the same, but is it possible i have an error somewhere?
How do i interpret such p-value?
Below is the code i have used to run the test. I also attach how my model looks like. Response variable is a count variable, while all regressors are continous. I have run a negative binomial model, due to detected overdispersion in my initial poisson model.
> hosmerlem <- function(y, yhat, g=10)
+ {cutyhat <- cut(yhat, breaks = quantile(yhat, probs=seq(0,1, 1/g)), include.lowest=TRUE)
+ obs <- xtabs(cbind(1 - y, y) ~ cutyhat)
+ expect <- xtabs(cbind(1 - yhat, yhat) ~ cutyhat)
+ chisq <- sum((obs - expect)^2/expect)
+ P <- 1 - pchisq(chisq, g - 2)
+ return(list(chisq=chisq,p.value=P))}
> hosmerlem(y=TOT.N, yhat=fitted(final.model))
$chisq
[1] -2.529054
$p.value
[1] 1
> final.model <-glm.nb(TOT.N ~ D.PARK + OPEN.L + L.WAT.C + sqrt(L.P.ROAD))
> summary(final.model)
Call:
glm.nb(formula = TOT.N ~ D.PARK + OPEN.L + L.WAT.C + sqrt(L.P.ROAD),
init.theta = 4.979895131, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.08218 -0.70494 -0.09268 0.55575 1.67860
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.032e+00 3.363e-01 11.989 < 2e-16 ***
D.PARK -1.154e-04 1.061e-05 -10.878 < 2e-16 ***
OPEN.L -1.085e-02 3.122e-03 -3.475 0.00051 ***
L.WAT.C 1.597e-01 7.852e-02 2.034 0.04195 *
sqrt(L.P.ROAD) 4.924e-01 3.101e-01 1.588 0.11231
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(4.9799) family taken to be 1)
Null deviance: 197.574 on 51 degrees of freedom
Residual deviance: 51.329 on 47 degrees of freedom
AIC: 383.54
Number of Fisher Scoring iterations: 1
Theta: 4.98
Std. Err.: 1.22
2 x log-likelihood: -371.542
As correctly pointed out by #BenBolker, Hosmer-Lemeshow is a test for logistic regression, not for a negative binomial generalized linear model.
If we consider to apply the test to a logistic regression,
the inputs of the function hosmerlem (a copy of the hoslem.test function in the package ResourceSelection) should be:
- y = a numeric vector of observations, binary (0/1)
- yhat = expected values (probabilities)
Here is an illustrative example that shows how to get the correct inputs:
set.seed(123)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.1 + 0.5*x))
logmod <- glm(y ~ x, family=binomial)
# Important: use the type="response" option
yhat <- predict(logmod, type="response")
hosmerlem(y, yhat)
########
$chisq
[1] 4.522719
$p.value
[1] 0.8071559
The same result is given by the function hoslem.test:
library(ResourceSelection)
hoslem.test(y, yhat)
########
Hosmer and Lemeshow goodness of fit (GOF) test
data: y, yhat
X-squared = 4.5227, df = 8, p-value = 0.8072
As already mentioned, HL-test is not appropriate for the specified model. It is also important to know that a large p-value doesn't necessarily mean a good fit. It could also be that there isn't enough evidence to prove it's a poor fit.
Meanwhile, the gofcat package implementation of the HL-test provides for passing model objects directly to the function without necessarily supplying the observed and predicted values. For the simulated data one has:
library(gofcat)
set.seed(123)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.1 + 0.5*x))
logmod <- glm(y ~ x, family=binomial)
hosmerlem(logmod, group = 10)
Hosmer-Lemeshow Test:
Chi-sq df pr(>chi)
binary(Hosmerlem) 4.5227 8 0.8072
H0: No lack of fit dictated
rho: 100%
I would like to remove all of the as.factor elements from the output of an ordinary least squares lm() model in R. The last line doesn't work, but for example:
frame <- data.frame(y = rnorm(100), x= rnorm(100), block = sample(c("A", "B", "C", "D"), 100, replace = TRUE))
mod <- lm(y ~ x + as.factor(block), data = frame)
summary(mod)
summary(mod)$coefficients[3:5,] <- NULL
Is there a way to remove all of these elements so that the saved `lm' object no longer has them? Thanks.
One option is to use felm function in lfe package.
As stated in the package:
The package is intended for linear models with multiple group fixed effects, i.e. with 2 or more factors with a large number of levels. It performs similar functions as lm, but it uses a special method for projecting out multiple group fixed effects from the normal equations, hence it is faster.
set.seed(123)
frame <- data.frame(y = rnorm(100), x= rnorm(100), block = sample(c("A", "B", "C", "D"), 100, replace = TRUE))
id<-as.factor(frame$block)
mod <- lm(y ~ x + id, data = frame) #lm
summary(mod)
Call:
lm(formula = y ~ x + id, data = frame)
Residuals:
Min 1Q Median 3Q Max
-2.53394 -0.68372 0.04072 0.67805 2.00777
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.18115 0.17201 1.053 0.2950
x -0.08310 0.09604 -0.865 0.3891
idB 0.04834 0.24645 0.196 0.8449
idC -0.51265 0.25052 -2.046 0.0435 *
idD 0.04905 0.26073 0.188 0.8512
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9002 on 95 degrees of freedom
Multiple R-squared: 0.06677, Adjusted R-squared: 0.02747
F-statistic: 1.699 on 4 and 95 DF, p-value: 0.1566
library(lfe)
est <- felm(y ~ x| id)
summary(est)
Call:
felm(formula = y ~ x | id, data = frame)
Residuals:
Min 1Q Median 3Q Max
-2.53394 -0.68372 0.04072 0.67805 2.00777
Coefficients:
Estimate Std. Error t value Pr(>|t|)
x -0.08310 0.09604 -0.865 0.389
Residual standard error: 0.9002 on 95 degrees of freedom
Multiple R-squared(full model): 0.06677 Adjusted R-squared: 0.02747
Multiple R-squared(proj model): 0.00782 Adjusted R-squared: -0.03396
F-statistic(full model):1.699 on 4 and 95 DF, p-value: 0.1566
F-statistic(proj model): 0.7487 on 1 and 95 DF, p-value: 0.3891
P.S. A similar program for Stata is reghdfe.
Suppose we have two variables that we wish to build a model from:
set.seed(10239)
x <- rnorm(seq(1,100,1))
y <- rnorm(seq(1,100,1))
model <- lm(x~y)
class(model)
# [1] "lm"
summary(model)
#
# Call:
# lm(formula = x ~ y)
#
# Residuals:
# Min 1Q Median 3Q Max
# -3.08676 -0.63022 -0.01115 0.75280 2.35169
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -0.07188 0.11375 -0.632 0.529
# y 0.06999 0.12076 0.580 0.564
#
# Residual standard error: 1.117 on 98 degrees of freedom
# Multiple R-squared: 0.003416, Adjusted R-squared: -0.006754
# F-statistic: 0.3359 on 1 and 98 DF, p-value: 0.5635
How do you plot the F-distribution of the model object?
If you check the structure of the summary of your model str(summary(model)), you'll notice that the parameters for the F-distribution of interest can be found by calling summary(model)$fstatistic. The first element in the list is the F-statistic and the following two element are the numerator degrees of freedom and the denominator degrees of freedom, in that order. So to plot the F-distribution, try something like the following
df <- summary(model)$fstatistic
curve(df(x, df1 = df[2], df2 = df[3]), from = 0, to = 100)
Alternatively, you can also get the parameters for the F-distribution of interest from the model itself. The numerator degrees of freedom is one less than the number of coefficients in the model and the denominator degrees of freedom is the total number of observations less one more than the number of coefficients in the model.
df1 <- length(model$coefficients) - 1
df2 <- length(model$residuals) - df1 - 1
curve(df(x, df1 = df1, df2 = df2), from = 0, to = 100)
I prefer the following way to show the p-value of the F distribution
fstat <- summary(model)$fstatistic
library(HH)
old.omd <- par(omd=c(.05,.88, .05,1))
F.setup(df1=fstat['numdf'], df2=fstat['dendf'])
F.curve(df1=fstat['numdf'], df2=fstat['dendf'], col='blue')
F.observed(fstat['value'], df1=fstat['numdf'], df2=fstat['dendf'])
par(old.omd)