I am trying to model the relationship between year and the scar acquisition rate of a wild population of animals; I have already calculated yearly rates. Looking at the plot below, it seems to me that rates rise through the middle of the period and then fall again. I have tried to fit a polynomial LM with the code
model1 <- lm(Rate~poly(year, 2, raw = TRUE),data=yearlyratesub)
summary(model1)
model1
I have plotted using:
g <-ggplot(yearlyratesub, aes(year, Rate)) + geom_point(shape=1) + geom_smooth(method = lm, formula = y ~ poly(x, 2, raw = TRUE))
g
The model output was:
Call:
lm(formula = Rate ~ poly(year, 2, raw = TRUE), data = yearlyratesub)
Residuals:
Min 1Q Median 3Q Max
-0.126332 -0.037683 -0.002602 0.053222 0.083503
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.796e+03 3.566e+03 -2.467 0.0297 *
poly(year, 2, raw = TRUE)1 8.747e+00 3.545e+00 2.467 0.0297 *
poly(year, 2, raw = TRUE)2 -2.174e-03 8.813e-04 -2.467 0.0297 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.0666 on 12 degrees of freedom
Multiple R-squared: 0.3369, Adjusted R-squared: 0.2264
F-statistic: 3.048 on 2 and 12 DF, p-value: 0.08503
How can I interpret that now? The overall model p-value is not significant, but the intercept and the individual slope terms are?
Should I try a different fit than x², or even group the values and test between groups, e.g. with an ANOVA? I know the LM has a low fit, but I guess that's because I have few data points, and maybe x² is not the right form...?
I would be happy about any input regarding the model and the interpretation of its outcome.
Grouping
Since the data was not provided (next time please provide a complete reproducible question including all inputs), we used the data in the Note at the end. We see that the model is highly significant if we group the points using the indicated breakpoints.
g <- factor(findInterval(yearlyratesub$year, c(2007.5, 2014.5))+1); g
## [1] 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3
## Levels: 1 2 3
fm <- lm(rate ~ g, yearlyratesub)
summary(fm)
giving
Call:
lm(formula = rate ~ g, data = yearlyratesub)
Residuals:
Min 1Q Median 3Q Max
-0.064618 -0.018491 0.006091 0.029684 0.046831
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.110854 0.019694 5.629 0.000111 ***
g2 0.127783 0.024687 5.176 0.000231 ***
g3 -0.006714 0.027851 -0.241 0.813574
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.03939 on 12 degrees of freedom
Multiple R-squared: 0.7755, Adjusted R-squared: 0.738
F-statistic: 20.72 on 2 and 12 DF, p-value: 0.0001281
We could consider combining the outer two groups.
g2 <- factor(g == 2)
fm2 <- lm(rate ~ g2, yearlyratesub)
summary(fm2)
giving:
Call:
lm(formula = rate ~ g2, data = yearlyratesub)
Residuals:
Min 1Q Median 3Q Max
-0.064618 -0.016813 0.007096 0.031363 0.046831
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.10750 0.01341 8.015 2.19e-06 ***
g2TRUE 0.13114 0.01963 6.680 1.52e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.03793 on 13 degrees of freedom
Multiple R-squared: 0.7744, Adjusted R-squared: 0.757
F-statistic: 44.62 on 1 and 13 DF, p-value: 1.517e-05
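Since fm2 is nested within fm (it simply merges the outer two groups), the two fits can also be compared with a formal F test. A minimal sketch, not part of the original output:
# F test of fm2 (two group means) against fm (three group means);
# a large p-value supports merging the outer groups
anova(fm2, fm)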
Sinusoid
Looking at the graph, it seems that the points are turning up at the left and right edges, suggesting a sinusoidal fit: a + b * cos(c * year)
fm3 <- nls(rate ~ cbind(a = 1, b = cos(c * year)),
yearlyratesub, start = list(c = 0.5), algorithm = "plinear")
summary(fm3)
giving
Formula: rate ~ cbind(a = 1, b = cos(c * year))
Parameters:
Estimate Std. Error t value Pr(>|t|)
c 0.4999618 0.0001449 3449.654 < 2e-16 ***
.lin.a 0.1787200 0.0150659 11.863 5.5e-08 ***
.lin.b 0.0753754 0.0205818 3.662 0.00325 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.05688 on 12 degrees of freedom
Number of iterations to convergence: 2
Achieved convergence tolerance: 5.241e-08
Comparison
Plotting the fits and looking at their residual sums of squares and AIC, we have
plot(yearlyratesub)
# fm0 from Note at end, fm and fm2 are grouping models, fm3 is sinusoidal
L <- list(fm0 = fm0, fm = fm, fm2 = fm2, fm3 = fm3)
for(i in seq_along(L)) {
lines(fitted(L[[i]]) ~ year, yearlyratesub, col = i, lwd = 2)
}
legend("topright", names(L), col = seq_along(L), lwd = 2)
giving the following, where a lower residual sum of squares and a lower AIC (which takes the number of parameters into account) are better. We see that fm fits most closely based on residual sum of squares, with fm2 fitting almost as well; however, once the number of parameters is taken into account via AIC, fm2 has the lowest value and so is most favored by that criterion.
cbind(RSS = sapply(L, deviance), AIC = sapply(L, AIC))
## RSS AIC
## fm0 0.05488031 -33.59161
## fm 0.01861659 -49.80813
## fm2 0.01870674 -51.73567
## fm3 0.04024237 -38.24512
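With only 15 points, one might also check BIC, which penalizes extra parameters more heavily than AIC. A quick sketch reusing the same list of fitted models:
# BIC for each model (stats::BIC works for both lm and nls fits)
sapply(L, BIC)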
Note
yearlyratesub <-
structure(list(year = c(2004, 2005, 2006, 2007, 2008, 2009, 2010,
2011, 2012, 2013, 2014, 2015, 2017, 2018, 2019), rate = c(0.14099813521287,
0.0949946651016247, 0.0904788394070601, 0.11694517831575, 0.26786193592875,
0.256346628540479, 0.222029818828298, 0.180116679856725, 0.285467976459104,
0.174019208113095, 0.28461698734932, 0.0574827955982996, 0.103378448084776,
0.114593695172686, 0.141105952837639)), row.names = c(NA, -15L
), class = "data.frame")
fm0 <- lm(rate ~ poly(year, 2, raw = TRUE), yearlyratesub)
summary(fm0)
giving
Call:
lm(formula = rate ~ poly(year, 2, raw = TRUE), data = yearlyratesub)
Residuals:
Min 1Q Median 3Q Max
-0.128335 -0.038289 -0.002715 0.054090 0.084792
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.930e+03 3.621e+03 -2.466 0.0297 *
poly(year, 2, raw = TRUE)1 8.880e+00 3.600e+00 2.467 0.0297 *
poly(year, 2, raw = TRUE)2 -2.207e-03 8.949e-04 -2.467 0.0297 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.06763 on 12 degrees of freedom
Multiple R-squared: 0.3381, Adjusted R-squared: 0.2278
F-statistic: 3.065 on 2 and 12 DF, p-value: 0.0841
I want to fit this linear trend function to my data:
Y_t = a + b*t + X_t
This is based on time series data.
I believe writing lm(y ~ time) will return the equivalent of Y_t = a + X_t, but I'm confused about how to include the b*t term in this linear trend function in R.
You can simply include the time index as an explanatory variable:
library(data.table)
d <- data.table(id = 1)
d <- d[, .(year=1:200), by=id]
d[, x1 := runif(200)]
# add an error term
d[, e := rnorm(200, 23, 7)]
# add the dependent variable
d[, y := 3.5*x1 + 0.5*year + e ]
m <- lm(y ~ x1 + year, d)
summary(m)
Call:
lm(formula = y ~ x1 + year, data = d)
Residuals:
Min 1Q Median 3Q Max
-19.2008 -4.4356 0.3986 5.2283 16.6819
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.064776 1.519766 13.203 <2e-16 ***
x1 3.114048 1.914318 1.627 0.105
year 0.523195 0.009187 56.947 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.469 on 197 degrees of freedom
Multiple R-squared: 0.943, Adjusted R-squared: 0.9424
F-statistic: 1628 on 2 and 197 DF, p-value: < 2.2e-16
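If your data is already a ts object, the same idea works in base R without data.table: time() extracts the series' time index, so lm(y ~ time(y)) fits Y_t = a + b*t directly. A minimal sketch with a made-up series:
# simulated linear trend plus noise, stored as a ts object
set.seed(1)
y <- ts(0.5 * (1:100) + rnorm(100), start = 2000)
# regress the series on its own time index
trend <- lm(y ~ time(y))
summary(trend)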
I run the following code in R, but I do not get a coefficient for the intercept. How can I get the coefficient for the intercept?
library(lfe)
#create covariates
x <- rnorm(4000)
x2 <- rnorm(length(x))
#create individual and firm
id <- factor(sample(500,length(x),replace=TRUE))
firm <- factor(sample(300,length(x),replace=TRUE))
#effects
id.eff <- rlnorm(nlevels(id))
firm.eff <- rexp(nlevels(firm))
#left hand side
y <- 50000 + x + 0.25*x2 + id.eff[id] + firm.eff[firm] + rnorm(length(x))
#estimate and print result
est <- felm(y ~ x+x2 | id + firm)
summary(est)
which gives me
Call:
felm(formula = y ~ x + x2 | id + firm)
Residuals:
Min 1Q Median 3Q Max
-3.3129 -0.6147 -0.0009 0.6131 3.2878
Coefficients: Estimate Std. Error t value Pr(>|t|)
x 1.00276 0.01834 54.66 <2e-16 ***
x2 0.26190 0.01802 14.54 <2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.02 on 3199 degrees of freedom
Multiple R-squared (full model): 0.8778, Adjusted R-squared: 0.8472
Multiple R-squared (proj model): 0.4988, Adjusted R-squared: 0.3735
F-statistic (full model): 28.72 on 800 and 3199 DF, p-value: < 2.2e-16
F-statistic (proj model): 1592 on 2 and 3199 DF, p-value: < 2.2e-16
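For what it's worth: felm() projects out the id and firm fixed effects, so a global intercept is not separately identified; it is absorbed into those effects. One common way to recover the estimated effects themselves is lfe's getfe() (a sketch, not from the original post):
# recover the estimated id and firm fixed effects,
# which between them absorb the overall intercept
fe <- getfe(est)
head(fe)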
Question related to R's glm() function:
I have a dataset obtained as:
mydata <- read.csv("data.csv", header = TRUE)
which contains the variable 'y' (binary, 0 or 1) and 60 regressors. Three of these regressors are 'avg', 'age' and 'income' (all three are numerical).
I want to use glm function for logistic regression, as below:
model <-glm(y~., data = mydata, family = binomial)
Can you tell me how I may proceed if I don't want to use the three specified variables (avg, age and income) in the glm() function, and use only the remaining 57 variables?
You can simply exclude those three variables from mydata before running glm().
Here I create some sample data:
set.seed(1)
mydata<-replicate(10,rnorm(100,300,50))
mydata<-data.frame(dv=sample(c(0,1),100,replace = TRUE),mydata)
> head(mydata)
dv X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 1 268.6773 268.9817 320.4701 344.6837 353.7220 303.8652 282.9467 264.6216 245.6546 222.9299
2 1 309.1822 302.1058 384.4437 247.6351 394.7827 285.1566 375.1212 398.5786 208.6958 309.7161
3 1 258.2186 254.4539 379.3294 398.5669 269.8501 240.8379 326.4154 295.5001 349.7641 313.2211
4 0 379.7640 307.9014 283.4546 280.8184 280.4566 300.5646 327.1096 299.2991 299.4069 244.0632
5 0 316.4754 267.2708 185.7382 382.7073 279.1889 349.5801 293.1663 243.8272 270.0186 332.5476
6 0 258.9766 388.3644 424.8831 375.6106 281.2171 379.6984 243.1633 232.7935 291.1026 248.3550
If I run your specified model on the data as-is, it uses all the variables on the right-hand side:
model<-glm(data=mydata, dv~.,family=binomial(link = 'logit'))
> summary(model)
Call:
glm(formula = dv ~ ., family = binomial(link = "logit"), data = mydata)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8891 -1.0853 -0.5163 1.0237 1.8303
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.4330825 4.1437180 -0.587 0.5571
X1 -0.0020482 0.0049025 -0.418 0.6761
X2 -0.0059021 0.0046298 -1.275 0.2024
X3 0.0123246 0.0047991 2.568 0.0102 *
X4 0.0024804 0.0046856 0.529 0.5966
X5 0.0025348 0.0039545 0.641 0.5215
X6 -0.0005905 0.0047417 -0.125 0.9009
X7 -0.0001758 0.0040737 -0.043 0.9656
X8 0.0042362 0.0041170 1.029 0.3035
X9 -0.0007664 0.0042471 -0.180 0.8568
X10 -0.0042089 0.0043094 -0.977 0.3287
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 138.59 on 99 degrees of freedom
Residual deviance: 125.11 on 89 degrees of freedom
AIC: 147.11
Number of Fisher Scoring iterations: 4
Now I exclude X1 and X2 from mydata and run the model again:
mydata2<-mydata[,-match(c('X1','X2'), colnames(mydata))]
model2<-glm(data=mydata2, dv~.,family=binomial(link = 'logit'))
> summary(model2)
Call:
glm(formula = dv ~ ., family = binomial(link = "logit"), data = mydata2)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8983 -1.0724 -0.4521 1.1132 1.7792
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.8725545 3.6357314 -1.340 0.18019
X3 0.0124982 0.0047930 2.608 0.00912 **
X4 0.0031911 0.0045971 0.694 0.48758
X5 0.0015992 0.0038101 0.420 0.67467
X6 -0.0003295 0.0046554 -0.071 0.94357
X7 0.0003372 0.0039961 0.084 0.93275
X8 0.0038889 0.0040737 0.955 0.33977
X9 -0.0010014 0.0042078 -0.238 0.81189
X10 -0.0041691 0.0042232 -0.987 0.32356
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 138.59 on 99 degrees of freedom
Residual deviance: 126.93 on 91 degrees of freedom
AIC: 144.93
Number of Fisher Scoring iterations: 4
The . ("everything") on the right side of the formula can be modified by subtracting terms:
model <- glm(y~ . - avg - age - income, data = mydata,
family = binomial)
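Applied to the simulated data from the previous answer, this formula-subtraction approach reproduces model2 exactly. A quick check, reusing mydata from above:
# subtracting terms from the formula matches dropping the columns
model2b <- glm(dv ~ . - X1 - X2, data = mydata, family = binomial(link = 'logit'))
all.equal(coef(model2), coef(model2b))  # should be TRUE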
Here is the output from glm in R:
Call:
glm(formula = Y1 ~ 0 + x1 + x2 + x3 + x4 + x5, family = quasibinomial(link = cauchit))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5415 0.2132 0.3988 0.6614 1.8426
Coefficients:
Estimate Std. Error t value Pr(>|t|)
x1 -0.7280 0.3509 -2.075 0.03884 *
x2 -0.9108 0.3491 -2.609 0.00951 **
x3 0.2377 0.1592 1.494 0.13629
x4 -0.2106 0.1573 -1.339 0.18151
x5 3.6982 0.8658 4.271 2.57e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for quasibinomial family taken to be 0.8782731)
Null deviance: 443.61 on 320 degrees of freedom
Residual deviance: 270.17 on 315 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 12
Do you know a way to pull out the dispersion parameter, which is 0.8782731 in this case, programmatically instead of just copying and pasting it? Thanks.
You can extract it from the output of summary:
data(iris)
mod <- glm((Petal.Length > 5) ~ Sepal.Width, data=iris)
summary(mod)
#
# Call:
# glm(formula = (Petal.Length > 5) ~ Sepal.Width, data = iris)
#
# Deviance Residuals:
# Min 1Q Median 3Q Max
# -0.3176 -0.2856 -0.2714 0.7073 0.7464
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.38887 0.26220 1.483 0.140
# Sepal.Width -0.03561 0.08491 -0.419 0.676
#
# (Dispersion parameter for gaussian family taken to be 0.2040818)
#
# Null deviance: 30.240 on 149 degrees of freedom
# Residual deviance: 30.204 on 148 degrees of freedom
# AIC: 191.28
#
# Number of Fisher Scoring iterations: 2
summary(mod)$dispersion
# [1] 0.2040818
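Alternatively, it can be computed directly: for gaussian and quasi families, summary.glm estimates the dispersion as the sum of squared Pearson residuals divided by the residual degrees of freedom. A sketch using the same mod:
# Pearson estimate of dispersion, matching summary(mod)$dispersion
sum(residuals(mod, type = "pearson")^2) / df.residual(mod)
# [1] 0.2040818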
The str function in R is often helpful for solving these sorts of questions. For instance, I looked at str(summary(mod)) to answer this one.