Set intercept to zero when using predict.glm - r

How do I remove the intercept from the prediction when using predict.glm? I'm not talking about the model itself, just in the prediction.
For example, I want to get the difference, and its standard error, between x=1 and x=3.
I tried passing newdata=list(x=2) and intercept=NULL to predict.glm, but it doesn't work.
So for example:
m <- glm(speed ~ dist, data=cars, family=gaussian(link="identity"))
prediction <- predict.glm(m, newdata=list(dist=c(2)), type="response", se.fit=T, intercept=NULL)

I'm not sure whether this is implemented in predict itself, but you could use the following trick.
Add a manual intercept column (i.e. a vector of 1s) to the data and use it in the model, while adding 0 to the right-hand side of the formula (to remove the "automatic" intercept).
cars$intercept <- 1L
m <- glm(speed ~ 0 + intercept + dist, family=gaussian, data=cars)
This gives us an intercept column in the model.frame, internally used by predict,
model.frame(m)
# speed intercept dist
# 1 4 1 2
# 2 4 1 10
# 3 7 1 4
# 4 7 1 22
# ...
which allows us to set it to an arbitrary value such as zero.
predict.glm(m, newdata=list(dist=2, intercept=0), type="response", se.fit=TRUE)
# $fit
# 1
# 0.3311351
#
# $se.fit
# [1] 0.03498896
#
# $residual.scale
# [1] 3.155753
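As a cross-check on the trick above, here is a minimal sketch that gets the same kind of number straight from the coefficient table. It assumes a plain linear predictor with an identity link, so the difference between two values of dist is just the slope times the gap, and its standard error is the slope's standard error times the absolute gap. The name delta is introduced here for illustration only.

```r
# Sketch: intercept-free contrast computed from coef() and vcov().
# Assumes the identity-link model from the question; `delta` is a
# hypothetical name for the gap between the two predictor values.
m <- glm(speed ~ dist, data = cars, family = gaussian(link = "identity"))

delta <- 3 - 1  # e.g. comparing dist = 3 against dist = 1

# The intercept cancels in the difference, so only the slope matters.
diff_fit <- unname(coef(m)["dist"]) * delta
diff_se  <- sqrt(vcov(m)["dist", "dist"]) * abs(delta)

c(fit = diff_fit, se = diff_se)
```

With delta equal to 2 this should agree with predicting at dist=2 and intercept=0 in the refitted model. For a non-identity link the difference on the response scale does not reduce to a single coefficient, so this shortcut would not apply there.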

Related

How to only print (adjusted) R-squared of regression model?

I am a beginner with R. I have a data set on air pollution. The columns are site, measured concentration, and 80 variables (v1-v80) that might influence the concentration.
I want to build a model with forward stepwise regression based on (adjusted) R-squared using my own code (so I do not want to use something like step() or regsubsets()). The dependent variable is concentration, and v1-v80 are the independent variables. I wrote the following code for the first step (the data set is simplified):
site concentration v1 v2 v3
1 1 -0.84085548 1.7114409 -0.2857736 -1.0803926
2 2 1.38435934 -0.6029080 0.1381082 -0.1575344
3 3 -1.25549186 -0.4721664 1.2276303 -1.0717600
for (j in names(df)){
model <- lm(concentration ~ df[[j]], data = df)
print(j)
print(summary(model))
}
This works well, but I am only interested in R-squared and adjusted R-squared. I tried to only have (adjusted) R-squared printed with:
for (j in names(df)){
model <- lm(concentration ~ df[[j]], data = df)
print(j)
print(summary(model$r.squared))
print(summary(model$adj.r.squared))
}
But then I get as output (this is only a part):
[1] "v1"
Length Class Mode
0 NULL NULL
Length Class Mode
0 NULL NULL
[1] "v2"
Length Class Mode
0 NULL NULL
Length Class Mode
0 NULL NULL
Etcetera.
How can I get as output only the name of the relevant variable and (adjusted) R-squared for every model that is produced in the for-loop?
Thanks!
library(broom)
glance(model)[c(1,2)]
Input = ("site concentration v1 v2 v3
1 1 -0.84085548 1.7114409 -0.2857736 -1.0803926
2 2 1.38435934 -0.6029080 0.1381082 -0.1575344
3 3 -1.25549186 -0.4721664 1.2276303 -1.0717600")
df = read.table(textConnection(Input),header=TRUE)
for (j in names(df)){
model <- lm(concentration ~ df[[j]], data = df)
print(j)
print(glance(model)[c(1,2)])
}
[1] "site"
r.squared adj.r.squared
1 0.02132635 -0.9573473
[1] "concentration"
r.squared adj.r.squared
1 1 1
[1] "v1"
r.squared adj.r.squared
1 0.1717716 -0.6564568
[1] "v2"
r.squared adj.r.squared
1 0.1482473 -0.7035055
[1] "v3"
r.squared adj.r.squared
1 0.9762587 0.9525174
Warning message:
In stats::summary.lm(x) :
essentially perfect fit: summary may be unreliable
Using base R
summary(model)$adj.r.squared
summary(model)$r.squared
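The base R approach can be folded back into the loop. The sketch below is one possible way to do it, not the only one: it builds each formula with reformulate() and skips site and concentration, which the loops above also fit as predictors. The names predictors and fits are introduced here for illustration.

```r
# Sketch: collect (adjusted) R-squared for each candidate predictor in base R.
df <- read.table(text = "site concentration v1 v2 v3
1 1 -0.84085548 1.7114409 -0.2857736 -1.0803926
2 2 1.38435934 -0.6029080 0.1381082 -0.1575344
3 3 -1.25549186 -0.4721664 1.2276303 -1.0717600", header = TRUE)

# Only v1, v2, ... are candidates; the response and the site id are excluded.
predictors <- setdiff(names(df), c("site", "concentration"))

fits <- t(sapply(predictors, function(v) {
  # reformulate() builds concentration ~ v1, concentration ~ v2, ...
  s <- summary(lm(reformulate(v, response = "concentration"), data = df))
  c(r.squared = s$r.squared, adj.r.squared = s$adj.r.squared)
}))

fits  # one row per predictor, named by the variable
```

Because the result is a matrix with the variable names as row names, the best candidate for the first forward step can be picked with something like rownames(fits)[which.max(fits[, "adj.r.squared"])].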

Regression in R using poly() function

The function poly() in R produces orthogonal polynomial terms, which can help when interpreting coefficient significance. However, I don't see the point of using it for prediction. In my view, the following two models (model_1 and model_2) should produce the same predictions.
q=1:11
v=c(3,5,7,9.2,14,20,26,34,50,59,80)
model_1=lm(v~poly(q,2))
model_2=lm(v~1+q+q^2)
predict(model_1)
predict(model_2)
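But they don't. Why?
Because they are not the same model. Your second one has only one covariate, while the first has two. In an R formula, ^ denotes crossing of terms rather than arithmetic exponentiation, so q^2 reduces to q.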
> model_2
Call:
lm(formula = v ~ 1 + q + q^2)
Coefficients:
(Intercept) q
-15.251 7.196
You should wrap the term in I() so that the formula treats q^2 as an arithmetic expression and includes it as a covariate:
model_2=lm(v~1+q+I(q^2))
> model_2
Call:
lm(formula = v ~ 1 + q + I(q^2))
Coefficients:
(Intercept) q I(q^2)
7.5612 -3.3323 0.8774
This gives the same predictions:
> predict(model_1)
1 2 3 4 5 6 7 8 9 10 11
5.106294 4.406154 5.460793 8.270210 12.834406 19.153380 27.227133 37.055664 48.638974 61.977063 77.069930
> predict(model_2)
1 2 3 4 5 6 7 8 9 10 11
5.106294 4.406154 5.460793 8.270210 12.834406 19.153380 27.227133 37.055664 48.638974 61.977063 77.069930
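The point about the formula can be checked directly. The sketch below inspects the terms R extracts from each formula and also compares raw polynomial terms, via poly(q, 2, raw = TRUE), against the I(q^2) version. The object names labels_plain, model_raw and model_I are introduced here for illustration.

```r
# Sketch: how R parses q^2 inside a formula, and three equivalent quadratic fits.
q <- 1:11
v <- c(3, 5, 7, 9.2, 14, 20, 26, 34, 50, 59, 80)

# Without I(), q^2 is "q crossed with itself", which is just q.
labels_plain <- attr(terms(v ~ 1 + q + q^2), "term.labels")
labels_plain

model_orth <- lm(v ~ poly(q, 2))              # orthogonal polynomials
model_raw  <- lm(v ~ poly(q, 2, raw = TRUE))  # raw powers q and q^2
model_I    <- lm(v ~ q + I(q^2))              # explicit arithmetic term

# Raw polynomial terms give the same coefficients as the I() version.
all.equal(unname(coef(model_raw)), unname(coef(model_I)))

# The orthogonal basis changes the coefficients but not the fitted values.
all.equal(fitted(model_orth), fitted(model_I))
```

The orthogonal and raw parameterisations span the same column space, which is why the predictions coincide even though the coefficient tables look different.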

How can I force dropping intercept or equivalent in this linear model?

Consider the following table:
DB <- data.frame(
Y =rnorm(6),
X1=c(T, T, F, T, F, F),
X2=c(T, F, T, F, T, T)
)
Y X1 X2
1 1.8376852 TRUE TRUE
2 -2.1173739 TRUE FALSE
3 1.3054450 FALSE TRUE
4 -0.3476706 TRUE FALSE
5 1.3219099 FALSE TRUE
6 0.6781750 FALSE TRUE
I'd like to explain my quantitative variable Y by two binary variables (TRUE or FALSE), without an intercept.
The reason for this choice is that, in my study, we can't observe X1=FALSE and X2=FALSE at the same time, so it doesn't make sense to have a mean other than 0 for this level.
With intercept
m1 <- lm(Y~X1+X2, data=DB)
summary(m1)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.9684 1.0590 -1.859 0.1600
X1TRUE 0.7358 0.9032 0.815 0.4749
X2TRUE 3.0702 0.9579 3.205 0.0491 *
Without intercept
m0 <- lm(Y~0+X1+X2, data=DB)
summary(m0)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
X1FALSE -1.9684 1.0590 -1.859 0.1600
X1TRUE -1.2325 0.5531 -2.229 0.1122
X2TRUE 3.0702 0.9579 3.205 0.0491 *
I can't explain why two coefficients are estimated for the variable X1. The X1FALSE coefficient seems to be equivalent to the intercept in the model with an intercept.
Same results
When we display the estimation for all the combinations of variables, the two models are the same.
DisplayLevel <- function(m){
R <- outer(
unique(DB$X1),
unique(DB$X2),
function(a, b) predict(m,data.frame(X1=a, X2=b))
)
colnames(R) <- paste0('X2:', unique(DB$X2))
rownames(R) <- paste0('X1:', unique(DB$X1))
return(R)
}
DisplayLevel(m1)
X2:TRUE X2:FALSE
X1:TRUE 1.837685 -1.232522
X1:FALSE 1.101843 -1.968364
DisplayLevel(m0)
X2:TRUE X2:FALSE
X1:TRUE 1.837685 -1.232522
X1:FALSE 1.101843 -1.968364
So the two models are equivalent.
Question
My question is: can we estimate just one coefficient for the first effect? Can we force R to assign a value of 0 to the combination X1=FALSE and X2=FALSE?
Yes, we can, by converting the logical columns to numeric:
DB <- as.data.frame(data.matrix(DB))
## or you can do:
## DB$X1 <- as.integer(DB$X1)
## DB$X2 <- as.integer(DB$X2)
# Y X1 X2
# 1 -0.5059575 1 1
# 2 1.3430388 1 0
# 3 -0.2145794 0 1
# 4 -0.1795565 1 0
# 5 -0.1001907 0 1
# 6 0.7126663 0 1
## a linear model without intercept
m0 <- lm(Y ~ 0 + X1 + X2, data = DB)
DisplayLevel(m0)
# X2:1 X2:0
# X1:1 0.15967744 0.2489237
# X1:0 -0.08924625 0.0000000
I have explicitly coerced your TRUE/FALSE binary variables into numeric 1/0, so that lm() does not apply any contrast coding to them.
The data in my answer are different from yours, because you did not use set.seed(?) before rnorm() for reproducibility. But this is not an issue here.
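To see why the coercion matters, the sketch below compares the design-matrix columns R builds for the logical version of the data with the coefficients of the numeric version. The seed value and the name DBn are arbitrary choices made here for illustration, so the numbers will not match either post.

```r
# Sketch: logical predictors get contrast-coded, numeric 0/1 predictors do not.
set.seed(42)  # arbitrary seed, only for reproducibility of this sketch
DB <- data.frame(
  Y  = rnorm(6),
  X1 = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE),
  X2 = c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE)
)

# With logicals and no intercept, the first variable gets a column per level.
cols_logical <- colnames(model.matrix(Y ~ 0 + X1 + X2, data = DB))
cols_logical

# With integers, each variable is a single numeric column.
DBn <- transform(DB, X1 = as.integer(X1), X2 = as.integer(X2))
m0  <- lm(Y ~ 0 + X1 + X2, data = DBn)
names(coef(m0))

# The combination X1 = 0, X2 = 0 is now pinned to zero.
pred00 <- predict(m0, data.frame(X1 = 0L, X2 = 0L))
pred00
```

Note that the numeric model has one parameter fewer than the logical one, so the two are no longer equivalent reparameterisations: the fitted values for the observed combinations can change as well.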

Order of predictions from merTools predictInterval()

I'm encountering an issue with predictInterval() from merTools. The predictions seem to be out of order when compared to the data and midpoint predictions using the standard predict() method for lme4. I can't reproduce the problem with simulated data, so the best I can do is show the lmerMod object and some of my data.
> # display input data to the model
> head(inputData)
id y x z
1 calibration19 1.336 0.531 001
2 calibration20 1.336 0.433 001
3 calibration22 0.042 0.432 001
4 calibration23 0.042 0.423 001
5 calibration16 3.300 0.491 001
6 calibration17 3.300 0.465 001
> sapply(inputData, class)
id y x z
"factor" "numeric" "numeric" "factor"
>
> # fit mixed effects regression with random intercept on z
> lmeFit = lmer(y ~ x + (1 | z), inputData)
>
> # display lmerMod object
> lmeFit
Linear mixed model fit by REML ['lmerMod']
Formula: y ~ x + (1 | z)
Data: inputData
REML criterion at convergence: 444.245
Random effects:
Groups Name Std.Dev.
z (Intercept) 0.3097
Residual 0.9682
Number of obs: 157, groups: z, 17
Fixed Effects:
(Intercept) x
-0.4291 5.5638
>
> # display new data to predict in
> head(predData)
id x z
1 29999900108 0.343 001
2 29999900207 0.315 001
3 29999900306 0.336 001
4 29999900405 0.408 001
5 29999900504 0.369 001
6 29999900603 0.282 001
> sapply(predData, class)
id x z
"factor" "numeric" "factor"
>
> # estimate fitted values using predict()
> set.seed(1)
> preds_mid = predict(lmeFit, newdata=predData)
>
> # estimate fitted values using predictInterval()
> set.seed(1)
> preds_interval = predictInterval(lmeFit, newdata=predData, n.sims=1000) # wrong order
>
> # estimate fitted values just for the first observation to confirm that it should be similar to preds_mid
> set.seed(1)
> preds_interval_first_row = predictInterval(lmeFit, newdata=predData[1,], n.sims=1000)
>
> # display results
> head(preds_mid) # correct prediction
1 2 3 4 5 6
1.256860 1.101074 1.217913 1.618505 1.401518 0.917470
> head(preds_interval) # incorrect order
fit upr lwr
1 1.512410 2.694813 0.133571198
2 1.273143 2.521899 0.009878347
3 1.398273 2.785358 0.232501376
4 1.878165 3.188086 0.625161201
5 1.605049 2.813737 0.379167003
6 1.147415 2.417980 -0.108547846
> preds_interval_first_row # correct prediction
fit upr lwr
1 1.244366 2.537451 -0.04911808
> preds_interval[round(preds_interval$fit,3)==round(preds_interval_first_row$fit,3),] # the correct prediction ends up as observation 1033
fit upr lwr
1033 1.244261 2.457012 -0.0001299777
>
To put this into words, the first observation of my data frame predData should have a fitted value around 1.25 according to the predict() method, but it has a value around 1.5 using the predictInterval() method. This does not seem to be simply due to differences in the prediction approaches, because if I restrict the newdata argument to the first row of predData, the resulting fitted value is around 1.25, as expected.
The fact that I can't reproduce the problem with simulated data leads me to believe it has to do with an attribute of my input or prediction data. I've tried reclassifying the factor variable as character, enforcing the order of the rows prior to fitting the model, between fitting the model and predicting, but found no success.
Is this a known issue? What can I do to avoid it?
I have attempted to make a minimal reproducible example of this issue, but have been unsuccessful.
library(merTools)
d <- data.frame(x = rnorm(1000), z = sample(1:25L, 1000, replace=TRUE),
id = sample(LETTERS, 1000, replace = TRUE))
d$z <- as.factor(d$z)
d$id <- factor(d$id)
d$y <- simulate(~x+(1|z),family = gaussian,
newdata=d,
newparams=list(beta=c(2, -1.1), theta=c(.25),
sigma = c(.23)), seed =463)[[1]]
lmeFit <- lmer(y ~ x + (1|z), data = d)
predData <- data.frame(x = rnorm(25), z = sample(1:25L, 25, replace=TRUE),
id = sample(LETTERS, 25, replace = TRUE))
predData$z <- as.factor(predData$z)
predData$id <- factor(predData$id)
predict(lmeFit, predData)
predictInterval(lmeFit, predData)
predictInterval(lmeFit, predData[1, ])
But, playing around with this code, I was not able to recreate the error observed above. Can you post a synthetic example, or see whether you can create one?
Alternatively, can you first coerce the factors to characters and check whether you see the same re-ordering issue?

How can I run a logistic regression loop across all independent variables, pairs and trios

I would like to regress the dependent variable of a logistic regression (in my data set it is dat$admit) on all available variables, pairs and trios (3 independent variables), with each regression using a different set of independent variables. The outcome I would like to get back is a list with one row per regression summary: coefficients, p-values, AUC and 95% CI. Using the data set below, there should be 7 regressions:
dat$admit vs dat$female
dat$admit vs dat$apcalc
dat$admit vs dat$num
dat$admit vs dat$female + dat$apcalc
dat$admit vs dat$female + dat$num
dat$admit vs dat$apcalc + dat$num
dat$admit vs dat$female + dat$apcalc + dat$num
Here is a sample data set (where dat$admit is the logistic regression dependent variable) :
dat <- read.table(text = " female apcalc admit num
0 0 0 7
0 0 1 1
0 1 0 3
0 1 1 7
1 0 0 5
1 0 1 1
1 1 0 0
1 1 1 6",header = TRUE)
Per @marek's comment, the output should look like this (for female alone and for female and apcalc):
# Intercept Estimate P-Value (Intercept) P-Value (Estimate) AUC
# female 0.000000e+00 0.000000e+00 1 1 0.5
# female+apcalc 0.000000e+00 0.000000e+00 1 1 0.5
@David Arenburg wrote some good code that produces the statistics, but it does not create the models for pairs and trios, so I would like to know how to add that.
Here is David Arenburg's code:
library(caTools)
ResFunc <- function(x) {
temp <- glm(reformulate(x,response="admit"), data=dat,family=binomial)
c(summary(temp)$coefficients[,1],
summary(temp)$coefficients[,4],
colAUC(predict(temp, type = "response"), dat$admit))
}
temp <- as.data.frame(t(sapply(setdiff(names(dat),"admit"), ResFunc)))
colnames(temp) <- c("Intercept", "Estimate", "P-Value (Intercept)", "P-Value (Estimate)", "AUC")
temp
# Intercept Estimate P-Value (Intercept) P-Value (Estimate) AUC
# female 0.000000e+00 0.000000e+00 1 1 0.5
# apcalc 0.000000e+00 0.000000e+00 1 1 0.5
# num 5.177403e-16 -1.171295e-16 1 1 0.5
Any idea how to create this list? Thanks, Ron
A simple solution is to make the list of models by hand:
results <- list(
"female" = glm(admit~female , family=binomial, dat)
,"apcalc" = glm(admit~apcalc , family=binomial, dat)
,"num" = glm(admit~num , family=binomial, dat)
,"female + apcalc" = glm(admit~female + apcalc, family=binomial, dat)
,"female + num" = glm(admit~female + num , family=binomial, dat)
,"apcalc + num" = glm(admit~apcalc + num , family=binomial, dat)
,"all" = glm(admit~female + apcalc + num, family=binomial, dat)
)
Then you can inspect the models by applying a function over the list with lapply:
lapply(results, summary)
Or more advanced (coefficient statistics):
require(plyr)
ldply(results, function(m) {
name_rows(as.data.frame(summary(m)$coefficients))
})
In a similar way you can extract any information you want. Just write a function that takes a glm model as its argument and returns the statistics you need:
get_everything_i_want <- function(model) {
#... do what i want ...
# eg:
list(AIC = AIC(model))
}
and then apply to each model:
lapply(results, get_everything_i_want)
# $female
# $female$AIC
# [1] 15.0904
# $apcalc
# $apcalc$AIC
# [1] 15.0904
# $num
# $num$AIC
# [1] 15.0904
# $`female + apcalc`
# $`female + apcalc`$AIC
# [1] 17.0904
# $`female + num`
# $`female + num`$AIC
# [1] 17.0904
# $`apcalc + num`
# $`apcalc + num`$AIC
# [1] 17.0904
# $all
# $all$AIC
# [1] 19.0904
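If writing the list by hand becomes tedious, the same list can be generated. The sketch below is one possible approach, using combn() to enumerate the variable subsets and reformulate() to build each formula. The names vars, max_size and combos are introduced here for illustration, and max_size = 3 reflects the "up to trios" requirement from the question.

```r
# Sketch: build the list of models for all singles, pairs and trios.
dat <- read.table(text = "female apcalc admit num
0 0 0 7
0 0 1 1
0 1 0 3
0 1 1 7
1 0 0 5
1 0 1 1
1 1 0 0
1 1 1 6", header = TRUE)

vars     <- setdiff(names(dat), "admit")  # candidate predictors
max_size <- min(3, length(vars))          # singles, pairs, trios

# One character vector of variable names per subset.
combos <- unlist(
  lapply(seq_len(max_size), function(k) combn(vars, k, simplify = FALSE)),
  recursive = FALSE
)
names(combos) <- vapply(combos, paste, character(1), collapse = " + ")

# Fit one logistic regression per subset.
results <- lapply(combos, function(x) {
  glm(reformulate(x, response = "admit"), family = binomial, data = dat)
})

sapply(results, AIC)
```

The resulting list has the same shape as the handwritten one, so lapply(results, summary) or any extractor function written for a single glm model can be applied to it unchanged.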

Resources