Using specified regressors in glm() in R

Question related to R, glm() function:
I have a dataset obtained as:
mydata <- read.csv("data.csv", header = TRUE)
which contains the variable 'y' (binary, 0 or 1) and 60 regressors. Three of these regressors are 'avg', 'age' and 'income' (all three numerical).
I want to use glm function for logistic regression, as below:
model <- glm(y ~ ., data = mydata, family = binomial)
Can you tell me how I may proceed if I don't want to use the three specified variables (avg, age and income) in the glm() function, and use only the remaining 57 variables?

You can simply exclude those three variables from mydata before running glm().
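For your data that could look like the following (a sketch, assuming the columns are named exactly 'avg', 'age' and 'income'):
# Drop the three unwanted columns, then fit on everything that remains
mydata_sub <- mydata[, !(names(mydata) %in% c("avg", "age", "income"))]
model <- glm(y ~ ., data = mydata_sub, family = binomial)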
Here I create some sample data:
set.seed(1)
mydata <- replicate(10, rnorm(100, 300, 50))   # 10 numeric regressors
mydata <- data.frame(dv = sample(c(0, 1), 100, replace = TRUE), mydata)   # binary response
> head(mydata)
dv X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 1 268.6773 268.9817 320.4701 344.6837 353.7220 303.8652 282.9467 264.6216 245.6546 222.9299
2 1 309.1822 302.1058 384.4437 247.6351 394.7827 285.1566 375.1212 398.5786 208.6958 309.7161
3 1 258.2186 254.4539 379.3294 398.5669 269.8501 240.8379 326.4154 295.5001 349.7641 313.2211
4 0 379.7640 307.9014 283.4546 280.8184 280.4566 300.5646 327.1096 299.2991 299.4069 244.0632
5 0 316.4754 267.2708 185.7382 382.7073 279.1889 349.5801 293.1663 243.8272 270.0186 332.5476
6 0 258.9766 388.3644 424.8831 375.6106 281.2171 379.6984 243.1633 232.7935 291.1026 248.3550
If I run your specified model on the data as it is, all the variables end up on the right-hand side:
model <- glm(dv ~ ., data = mydata, family = binomial(link = "logit"))
> summary(model)
Call:
glm(formula = dv ~ ., family = binomial(link = "logit"), data = mydata)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8891 -1.0853 -0.5163 1.0237 1.8303
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.4330825 4.1437180 -0.587 0.5571
X1 -0.0020482 0.0049025 -0.418 0.6761
X2 -0.0059021 0.0046298 -1.275 0.2024
X3 0.0123246 0.0047991 2.568 0.0102 *
X4 0.0024804 0.0046856 0.529 0.5966
X5 0.0025348 0.0039545 0.641 0.5215
X6 -0.0005905 0.0047417 -0.125 0.9009
X7 -0.0001758 0.0040737 -0.043 0.9656
X8 0.0042362 0.0041170 1.029 0.3035
X9 -0.0007664 0.0042471 -0.180 0.8568
X10 -0.0042089 0.0043094 -0.977 0.3287
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 138.59 on 99 degrees of freedom
Residual deviance: 125.11 on 89 degrees of freedom
AIC: 147.11
Number of Fisher Scoring iterations: 4
Now I exclude X1 and X2 from mydata and run the model again:
mydata2 <- mydata[, -match(c("X1", "X2"), colnames(mydata))]
model2 <- glm(dv ~ ., data = mydata2, family = binomial(link = "logit"))
> summary(model2)
Call:
glm(formula = dv ~ ., family = binomial(link = "logit"), data = mydata2)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8983 -1.0724 -0.4521 1.1132 1.7792
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.8725545 3.6357314 -1.340 0.18019
X3 0.0124982 0.0047930 2.608 0.00912 **
X4 0.0031911 0.0045971 0.694 0.48758
X5 0.0015992 0.0038101 0.420 0.67467
X6 -0.0003295 0.0046554 -0.071 0.94357
X7 0.0003372 0.0039961 0.084 0.93275
X8 0.0038889 0.0040737 0.955 0.33977
X9 -0.0010014 0.0042078 -0.238 0.81189
X10 -0.0041691 0.0042232 -0.987 0.32356
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 138.59 on 99 degrees of freedom
Residual deviance: 126.93 on 91 degrees of freedom
AIC: 144.93
Number of Fisher Scoring iterations: 4

The . ("everything") on the right side of the formula can be modified by subtracting terms:
model <- glm(y ~ . - avg - age - income, data = mydata,
             family = binomial)
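As a quick check, here is a sketch reusing the toy data from the answer above: subtracting terms in the formula gives the same fit as dropping the columns first.
# Same fit two ways: subtract terms in the formula vs. drop the columns
model_a <- glm(dv ~ . - X1 - X2, data = mydata, family = binomial)
model_b <- glm(dv ~ ., data = mydata[, setdiff(names(mydata), c("X1", "X2"))],
               family = binomial)
all.equal(coef(model_a), coef(model_b))   # should be TRUE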

Related

How to get an R-squared and F-ratio for a mixed effects model in R?

The data fit a mixed effects model with nested random effects. How do I get the R-squared and F-ratio for this model?
library(lmerTest)  # loads lme4; summary() then reports Satterthwaite t-tests
set.seed(111)
df <- data.frame(level = rep(c("A", "B"), times = 8),
                 time = rep(c("1", "2", "3", "4"), each = 4),
                 x1 = rnorm(16, 3, 1),
                 x2 = rnorm(16, 3, 1))
mod <- lmer(x1 ~ x2 + I(x2^2) + (1 | time/level), df)
summary(mod)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: x1 ~ x2 + I(x2^2) + (1 | time/level)
   Data: df

REML criterion at convergence: 47.9

Scaled residuals:
     Min       1Q   Median       3Q      Max
-1.72702 -0.41979  0.00653  0.43709  2.36393

Random effects:
 Groups     Name        Variance Std.Dev.
 level:time (Intercept) 0.00     0.00
 time       (Intercept) 0.00     0.00
 Residual               1.02     1.01
Number of obs: 16, groups:  level:time, 8; time, 4

Fixed effects:
            Estimate Std. Error       df t value Pr(>|t|)
(Intercept)  3.58299    0.81911 13.00000   4.374 0.000753 ***
x2          -0.59777    0.54562 13.00000  -1.096 0.293147
I(x2^2)      0.07686    0.09356 13.00000   0.822 0.426136
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
        (Intr) x2
x2      -0.868
I(x2^2)  0.660 -0.928
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see ?isSingular
For the R-squared you can use the r.squaredLR function from the MuMIn package:
library(MuMIn)
r.squaredLR(mod)
Output:
[1] 0.1017782
attr(,"adj.r.squared")
[1] 0.1086741
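If you prefer the marginal/conditional decomposition instead, MuMIn also provides r.squaredGLMM(), which reports the variance explained by the fixed effects alone and by fixed plus random effects together:
r.squaredGLMM(mod)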
For the F-ratio, maybe you want this:
anova(mod)
Output:
Analysis of Variance Table
npar Sum Sq Mean Sq F value
x2 1 0.81407 0.81407 0.7981
I(x2^2) 1 0.68850 0.68850 0.6750
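The table above is the lme4-style one, with no denominator degrees of freedom and hence no p-values. Since the model was fitted with lmerTest loaded, its anova() method can also supply Satterthwaite approximations; a sketch:
anova(mod, ddf = "Satterthwaite")  # adds NumDF, DenDF and Pr(>F) columns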

R polynomial regression or group values and test between groups + outcome interpretation

I am trying to model the scar acquisition rate of a wild population of animals over time; I have already calculated yearly rates.
Looking at the plot, it seems to me that rates rise through the middle of the period and then fall again. I have tried to fit a polynomial LM with the code
model1 <- lm(Rate~poly(year, 2, raw = TRUE),data=yearlyratesub)
summary(model1)
model1
I have plotted using:
g <- ggplot(yearlyratesub, aes(year, Rate)) +
  geom_point(shape = 1) +
  geom_smooth(method = lm, formula = y ~ poly(x, 2, raw = TRUE))
g
The model output was:
Call:
lm(formula = Rate ~ poly(year, 2, raw = TRUE), data = yearlyratesub)
Residuals:
Min 1Q Median 3Q Max
-0.126332 -0.037683 -0.002602 0.053222 0.083503
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.796e+03 3.566e+03 -2.467 0.0297 *
poly(year, 2, raw = TRUE)1 8.747e+00 3.545e+00 2.467 0.0297 *
poly(year, 2, raw = TRUE)2 -2.174e-03 8.813e-04 -2.467 0.0297 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.0666 on 12 degrees of freedom
Multiple R-squared: 0.3369, Adjusted R-squared: 0.2264
F-statistic: 3.048 on 2 and 12 DF, p-value: 0.08503
How can I interpret that now? The overall model p-value is not significant, but the intercept and the individual slopes are?
Should I try a fit other than x², or group the values and test between groups, e.g. with an ANOVA? I know the LM has a low fit, but I guess that is because I have few values, and maybe x² is not the right form?
I would be happy about input regarding the model and the interpretation of the outcome.
Grouping
Since the data was not provided (next time please provide a complete reproducible question including all inputs), we used the data in the Note at the end. We see that the model is highly significant if we group the points using the indicated breakpoints.
g <- factor(findInterval(yearlyratesub$year, c(2007.5, 2014.5))+1); g
## [1] 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3
## Levels: 1 2 3
fm <- lm(rate ~ g, yearlyratesub)
summary(fm)
giving
Call:
lm(formula = rate ~ g, data = yearlyratesub)
Residuals:
Min 1Q Median 3Q Max
-0.064618 -0.018491 0.006091 0.029684 0.046831
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.110854 0.019694 5.629 0.000111 ***
g2 0.127783 0.024687 5.176 0.000231 ***
g3 -0.006714 0.027851 -0.241 0.813574
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.03939 on 12 degrees of freedom
Multiple R-squared: 0.7755, Adjusted R-squared: 0.738
F-statistic: 20.72 on 2 and 12 DF, p-value: 0.0001281
We could consider combining the outer two groups.
g2 <- factor(g == 2)
fm2 <- lm(rate ~ g2, yearlyratesub)
summary(fm2)
giving:
Call:
lm(formula = rate ~ g2, data = yearlyratesub)
Residuals:
Min 1Q Median 3Q Max
-0.064618 -0.016813 0.007096 0.031363 0.046831
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.10750 0.01341 8.015 2.19e-06 ***
g2TRUE 0.13114 0.01963 6.680 1.52e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.03793 on 13 degrees of freedom
Multiple R-squared: 0.7744, Adjusted R-squared: 0.757
F-statistic: 44.62 on 1 and 13 DF, p-value: 1.517e-05
Sinusoid
Looking at the graph, the points seem to turn up at the left and right edges, suggesting a sinusoidal fit: a + b * cos(c * year). With algorithm = "plinear", nls needs a starting value only for the nonlinear parameter c; the coefficients of the columns supplied via cbind (a and b) are estimated linearly at each iteration.
fm3 <- nls(rate ~ cbind(a = 1, b = cos(c * year)), yearlyratesub,
           start = list(c = 0.5), algorithm = "plinear")
summary(fm3)
giving
Formula: rate ~ cbind(a = 1, b = cos(c * year))
Parameters:
Estimate Std. Error t value Pr(>|t|)
c 0.4999618 0.0001449 3449.654 < 2e-16 ***
.lin.a 0.1787200 0.0150659 11.863 5.5e-08 ***
.lin.b 0.0753754 0.0205818 3.662 0.00325 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.05688 on 12 degrees of freedom
Number of iterations to convergence: 2
Achieved convergence tolerance: 5.241e-08
Comparison
Plotting the fits and looking at their residual sum of squares and AIC we have
plot(yearlyratesub)
# fm0 from Note at end; fm and fm2 are the grouping models; fm3 is the sinusoid
L <- list(fm0 = fm0, fm = fm, fm2 = fm2, fm3 = fm3)
for (i in seq_along(L)) {
  lines(fitted(L[[i]]) ~ year, yearlyratesub, col = i, lwd = 2)
}
legend("topright", names(L), col = seq_along(L), lwd = 2)
giving the following plot, where lower residual sum of squares and AIC (which takes the number of parameters into account) are better. fm fits most closely by residual sum of squares, with fm2 fitting almost as well; once the number of parameters is accounted for via AIC, fm2 has the lowest value and so is most favored by that criterion.
cbind(RSS = sapply(L, deviance), AIC = sapply(L, AIC))
## RSS AIC
## fm0 0.05488031 -33.59161
## fm 0.01861659 -49.80813
## fm2 0.01870674 -51.73567
## fm3 0.04024237 -38.24512
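Since fm2 (two groups) is nested in fm (three groups), the two can also be compared with a formal F test; a sketch:
anova(fm2, fm)  # tests whether separating the outer groups improves the fit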
Note
yearlyratesub <-
structure(list(year = c(2004, 2005, 2006, 2007, 2008, 2009, 2010,
2011, 2012, 2013, 2014, 2015, 2017, 2018, 2019), rate = c(0.14099813521287,
0.0949946651016247, 0.0904788394070601, 0.11694517831575, 0.26786193592875,
0.256346628540479, 0.222029818828298, 0.180116679856725, 0.285467976459104,
0.174019208113095, 0.28461698734932, 0.0574827955982996, 0.103378448084776,
0.114593695172686, 0.141105952837639)), row.names = c(NA, -15L
), class = "data.frame")
fm0 <- lm(rate ~ poly(year, 2, raw = TRUE), yearlyratesub)
summary(fm0)
giving
Call:
lm(formula = rate ~ poly(year, 2, raw = TRUE), data = yearlyratesub)
Residuals:
Min 1Q Median 3Q Max
-0.128335 -0.038289 -0.002715 0.054090 0.084792
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.930e+03 3.621e+03 -2.466 0.0297 *
poly(year, 2, raw = TRUE)1 8.880e+00 3.600e+00 2.467 0.0297 *
poly(year, 2, raw = TRUE)2 -2.207e-03 8.949e-04 -2.467 0.0297 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.06763 on 12 degrees of freedom
Multiple R-squared: 0.3381, Adjusted R-squared: 0.2278
F-statistic: 3.065 on 2 and 12 DF, p-value: 0.0841

Linear model having 4 predictors

I am trying to fit a linear model with 4 predictors. The problem I am facing is that my code does not estimate one of the parameters: whichever variable I put last in the lm formula gets no estimate. My code is:
AllData <- read.csv("AllBandReflectance.csv", header = TRUE)
Swir2ref <- AllData$band7
x1 <- AllData$X1
x2 <- AllData$X2
y1 <- AllData$Y1
y2 <- AllData$Y2
linear.model <- lm(Swir2ref ~ x1 + y1 + x2 + y2, data = AllData)
summary(linear.model)
Call:
lm(formula = Swir2ref ~ x1 + y1 + x2 + y2, data = AllData)
Residuals:
Min 1Q Median 3Q Max
-0.027277 -0.008793 -0.000689 0.010085 0.035097
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.595593 0.002006 296.964 <2e-16 ***
x1 0.002175 0.003462 0.628 0.532
y1 0.001498 0.003638 0.412 0.682
x2 0.022671 0.018786 1.207 0.232
y2 NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.01437 on 67 degrees of freedom
Multiple R-squared: 0.02876, Adjusted R-squared: -0.01473
F-statistic: 0.6613 on 3 and 67 DF, p-value: 0.5787
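The note "(1 not defined because of singularities)" means the predictor reported as NA is an exact linear combination of the others, so lm() drops it. A sketch of how to confirm this (assuming the same AllData as above):
# Show which coefficient is aliased, and the dependency that causes it
alias(linear.model)

# Or check the rank of the design matrix directly
X <- model.matrix(linear.model)   # full design matrix, including the dropped column
qr(X)$rank                        # less than ncol(X) when a column is redundant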

Difference between lm(y~x1/x2) and aov(y~x1+Error(x2))

I have trouble understanding the difference between these two notations.
According to the R intro manual, y ~ x1/x2 indicates that x2 is nested within x1. If x1 is a factor and x2 a continuous variable, is lm(y ~ x1/x2) a correct representation of a nested ANCOVA?
What is confusing is that some online help topics suggest using aov(y ~ x1 + Error(x2)) to represent a nested ANOVA. Yet those two calls give completely different results.
For example:
x2 = rnorm(1000,2)
x1 = rep( c("A","B"), each=500)
y = x2*3+rnorm(1000)
Under this scenario I would expect x2 to be significant and x1 to be non-significant.
summary(aov(y~x1+Error(x2)))
Error: x2
Df Sum Sq Mean Sq
x1 1 9262 9262
Error: Within
Df Sum Sq Mean Sq F value Pr(>F)
x1 1 0.0 0.0003 0 0.985
Residuals 997 967.9 0.9708
aov() works as expected. However, lm()....
summary(lm( y~x1/x2))
Call:
lm(formula = y ~ x1/x2)
Residuals:
Min 1Q Median 3Q Max
-3.4468 -0.6352 0.0092 0.6526 2.8294
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.08727 0.09566 0.912 0.3618
x1B -0.24501 0.13715 -1.786 0.0743 .
x1A:x2 2.94012 0.04362 67.401 <2e-16 ***
x1B:x2 3.06272 0.04326 70.806 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9838 on 996 degrees of freedom
Multiple R-squared: 0.9058, Adjusted R-squared: 0.9055
F-statistic: 3191 on 3 and 996 DF, p-value: < 2.2e-16
Here x1 is marginally significant, and in many iterations it is highly significant. How can these results be so different?
What am I missing? Aren't those two formulas supposed to represent the same thing, or am I misunderstanding something in the underlying statistics?
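For reference, a sketch showing what the nesting operator actually expands to: x1/x2 is shorthand for x1 + x1:x2, i.e. a separate slope of x2 within each level of x1, which is why lm(y ~ x1/x2) reports x1A:x2 and x1B:x2.
# Inspect the design matrix columns produced by the nesting operator
head(model.matrix(~ x1/x2, data = data.frame(x1 = factor(x1), x2 = x2)))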

How to pull out Dispersion parameter in R

Call:
glm(formula = Y1 ~ 0 + x1 + x2 + x3 + x4 + x5, family = quasibinomial(link = cauchit))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5415 0.2132 0.3988 0.6614 1.8426
Coefficients:
Estimate Std. Error t value Pr(>|t|)
x1 -0.7280 0.3509 -2.075 0.03884 *
x2 -0.9108 0.3491 -2.609 0.00951 **
x3 0.2377 0.1592 1.494 0.13629
x4 -0.2106 0.1573 -1.339 0.18151
x5 3.6982 0.8658 4.271 2.57e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for quasibinomial family taken to be 0.8782731)
Null deviance: 443.61 on 320 degrees of freedom
Residual deviance: 270.17 on 315 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 12
Here is the output from glm in R.
Do you know a way to pull out the dispersion parameter (0.8782731 in this case) instead of just copying and pasting it? Thanks.
You can extract it from the output of summary:
data(iris)
mod <- glm((Petal.Length > 5) ~ Sepal.Width, data=iris)
summary(mod)
#
# Call:
# glm(formula = (Petal.Length > 5) ~ Sepal.Width, data = iris)
#
# Deviance Residuals:
# Min 1Q Median 3Q Max
# -0.3176 -0.2856 -0.2714 0.7073 0.7464
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.38887 0.26220 1.483 0.140
# Sepal.Width -0.03561 0.08491 -0.419 0.676
#
# (Dispersion parameter for gaussian family taken to be 0.2040818)
#
# Null deviance: 30.240 on 149 degrees of freedom
# Residual deviance: 30.204 on 148 degrees of freedom
# AIC: 191.28
#
# Number of Fisher Scoring iterations: 2
summary(mod)$dispersion
# [1] 0.2040818
The str function in R is often helpful for solving these sorts of questions; for instance, I looked at str(summary(mod)) to answer this one.
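Equivalently, the value summary() reports for gaussian and quasi families can be computed by hand as the Pearson chi-square statistic divided by the residual degrees of freedom; a sketch:
# Pearson-based dispersion estimate; should match summary(mod)$dispersion
sum(residuals(mod, type = "pearson")^2) / df.residual(mod)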
