Fitting a linear regression model in R

Fitting a linear regression model in R - r

I have a question regarding linear regression analysis in R:
I have several independent variables (about 20-30) and one dependent variable. To reach the best model, I try "all" relevant combinations of independent variables in order to maximize my adjusted R^2. However, this is a lot of work. So my question is: Is there a way to automatically fit a regression model in R, i.e. an automatic selection of these independent variables stored in a data frame, which yield the best description of the variation in the dependent variable?
Thank you for your help!

You can use step function, however analysis done with this approach may hit some bumps on the road if whoever is checking your work is against data dredging. Here is an example from step.
> summary(lm1 <- lm(Fertility ~ ., data = swiss))
Call:
lm(formula = Fertility ~ ., data = swiss)
Residuals:
Min 1Q Median 3Q Max
-15.2743 -5.2617 0.5032 4.1198 15.3213
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 66.91518 10.70604 6.250 1.91e-07 ***
Agriculture -0.17211 0.07030 -2.448 0.01873 *
Examination -0.25801 0.25388 -1.016 0.31546
Education -0.87094 0.18303 -4.758 2.43e-05 ***
Catholic 0.10412 0.03526 2.953 0.00519 **
Infant.Mortality 1.07705 0.38172 2.822 0.00734 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.165 on 41 degrees of freedom
Multiple R-squared: 0.7067, Adjusted R-squared: 0.671
F-statistic: 19.76 on 5 and 41 DF, p-value: 5.594e-10
> slm1 <- step(lm1)
Start: AIC=190.69
Fertility ~ Agriculture + Examination + Education + Catholic +
Infant.Mortality
Df Sum of Sq RSS AIC
- Examination 1 53.03 2158.1 189.86
<none> 2105.0 190.69
- Agriculture 1 307.72 2412.8 195.10
- Infant.Mortality 1 408.75 2513.8 197.03
- Catholic 1 447.71 2552.8 197.75
- Education 1 1162.56 3267.6 209.36
Step: AIC=189.86
Fertility ~ Agriculture + Education + Catholic + Infant.Mortality
Df Sum of Sq RSS AIC
<none> 2158.1 189.86
- Agriculture 1 264.18 2422.2 193.29
- Infant.Mortality 1 409.81 2567.9 196.03
- Catholic 1 956.57 3114.6 205.10
- Education 1 2249.97 4408.0 221.43
> summary(slm1)
Call:
lm(formula = Fertility ~ Agriculture + Education + Catholic +
Infant.Mortality, data = swiss)
Residuals:
Min 1Q Median 3Q Max
-14.6765 -6.0522 0.7514 3.1664 16.1422
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 62.10131 9.60489 6.466 8.49e-08 ***
Agriculture -0.15462 0.06819 -2.267 0.02857 *
Education -0.98026 0.14814 -6.617 5.14e-08 ***
Catholic 0.12467 0.02889 4.315 9.50e-05 ***
Infant.Mortality 1.07844 0.38187 2.824 0.00722 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.168 on 42 degrees of freedom
Multiple R-squared: 0.6993, Adjusted R-squared: 0.6707
F-statistic: 24.42 on 4 and 42 DF, p-value: 1.717e-10

Related

Test if intercepts in ancova model are significantly different in R

I ran a model explaining the weight of some plant as a function of time and trying to incorporate the treatment effect.
mod <- lm(weight ~time + treatment)
The model looks like this:
with model summary being:
Call:
lm(formula = weight ~ time + treatment, data = df)
Residuals:
Min 1Q Median 3Q Max
-21.952 -7.674 0.770 6.851 21.514
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -37.5790 3.2897 -11.423 < 2e-16 ***
time 4.7478 0.2541 18.688 < 2e-16 ***
treatmentB 8.2000 2.4545 3.341 0.00113 **
treatmentC 5.4633 2.4545 2.226 0.02797 *
treatmentD 20.3533 2.4545 8.292 2.36e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.506 on 115 degrees of freedom
Multiple R-squared: 0.7862, Adjusted R-squared: 0.7788
F-statistic: 105.7 on 4 and 115 DF, p-value: < 2.2e-16
ANOVA table
Analysis of Variance Table
Response: weight
Df Sum Sq Mean Sq F value Pr(>F)
time 1 31558.1 31558.1 349.227 < 2.2e-16 ***
treatment 3 6661.9 2220.6 24.574 2.328e-12 ***
Residuals 115 10392.0 90.4
I want to test the H0 that intercept1=intercept2=intercept3=intercept4. Is this done by simply interpreting the t-value and p-value for the intercept ( I guess not because this is the baseline (treatment A in this case))? I'm a bit puzzled by this as not much attention is paid on difference in intercept on most sources i looked up.

Model outcome = mortality (count), exposure = climate (continuous), Rstudio

I have ran Poisson distribution model with quasi Poisson error in Rstudio
glm(formula = MI ~ corr_data$Temperature + corr_data$Humidity +
corr_data$Sun + corr_data$Rain, family = quasipoisson(),
data = corr_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.5323 -1.1149 -0.1346 0.8591 3.2585
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.9494713 1.2068332 3.273 0.00144 **
corr_data$Temperature -0.0281248 0.0144238 -1.950 0.05381 .
corr_data$Humidity -0.0099800 0.0144047 -0.693 0.48992
corr_data$Sun -0.0767811 0.0414464 -1.853 0.06670 .
corr_data$Rain -0.0003076 0.0004211 -0.731 0.46662
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for quasipoisson family taken to be 1.873611)
Null deviance: 249.16 on 111 degrees of freedom
Residual deviance: 206.36 on 107 degrees of freedom
(24 observations deleted due to missingness)
I have read that the dispersion parameter should be ideally close to 1
I have some zero values in my cumulative rainfall measures.
How best to I go about finding the appropriate model?
I next tried negative binomial
Call:
glm.nb(formula = Incidence ~ Humidity + Sun + Rain, data = corr_data,
init.theta = 22.16822882, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.53274 -0.85380 -0.08705 0.73230 2.48643
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.3626266 1.0970701 1.242 0.2142
Humidity 0.0111537 0.0124768 0.894 0.3713
Sun -0.0295395 0.0345469 -0.855 0.3925
Rain -0.0006170 0.0003007 -2.052 0.0402 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(22.1682) family taken to be 1)
Null deviance: 120.09 on 111 degrees of freedom
Residual deviance: 113.57 on 108 degrees of freedom
(24 observations deleted due to missingness)
AIC: 578.3
Number of Fisher Scoring iterations: 1
Theta: 22.2
Std. Err.: 11.8
2 x log-likelihood: -568.299
Any advice would be very much appreciated. I am new to R and to modelling!

Display category labels in regression output in R

Using the this R linear modelling tutorial I'm finding the format of the model output is annoyingly different to that provided in the text and I can't for the life of me work out why. For example, here is the code:
pitch = c(233,204,242,130,112,142)
sex = c(rep("female",3),rep("male",3))
my.df = data.frame(sex,pitch)
xmdl = lm(pitch ~ sex, my.df)
summary(xmdl)
Here is the output I get:
Call:
lm(formula = pitch ~ sex, data = my.df)
Residuals:
1 2 3 4 5 6
6.667 -22.333 15.667 2.000 -16.000 14.000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 177.167 7.201 24.601 1.62e-05 ***
sex1 49.167 7.201 6.827 0.00241 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 17.64 on 4 degrees of freedom
Multiple R-squared: 0.921, Adjusted R-squared: 0.9012
F-statistic: 46.61 on 1 and 4 DF, p-value: 0.002407
In the tutorial the line for Coefficients has "sexmale" instead of "sex1". What setting do I need to activate to achieve this?

How does lm() know which predictors are categorical?

Normally, me and you(assuming you're not a bot) are easily able to identify whether a predictor is categorical or quantitative. Like, for example, gender is obviously categorical. Your last vote can be classified categorically.
Basically, we can identify categorical predictors easily. But what happens when we input some data in R, and it's lm function makes dummy variables for a predictor? How does it do that?
Somewhat related Question on StackOverflow.

Search R factor function. Here is a small demo, first model uses number of cylinder as a numerical valuable. Second model uses it as a categorical variable.
> summary(lm(mpg~cyl,mtcars))
Call:
lm(formula = mpg ~ cyl, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.9814 -2.1185 0.2217 1.0717 7.5186
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.8846 2.0738 18.27 < 2e-16 ***
cyl -2.8758 0.3224 -8.92 6.11e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.206 on 30 degrees of freedom
Multiple R-squared: 0.7262, Adjusted R-squared: 0.7171
F-statistic: 79.56 on 1 and 30 DF, p-value: 6.113e-10
> summary(lm(mpg~factor(cyl),mtcars))
Call:
lm(formula = mpg ~ factor(cyl), data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-5.2636 -1.8357 0.0286 1.3893 7.2364
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.6636 0.9718 27.437 < 2e-16 ***
factor(cyl)6 -6.9208 1.5583 -4.441 0.000119 ***
factor(cyl)8 -11.5636 1.2986 -8.905 8.57e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.223 on 29 degrees of freedom
Multiple R-squared: 0.7325, Adjusted R-squared: 0.714
F-statistic: 39.7 on 2 and 29 DF, p-value: 4.979e-09

Hxd1011 adressed the more difficult case, when a categorical variable is stored as a number and therefore R understands by default that it is a numerical value - and if this is not the desired behaviour we must use factor function.
Your example with predictor ShelveLoc in dataset Carseats is easier because it's a text (character) variable, and therefore it can only be a categorical variable.
> head(Carseats$ShelveLoc)
[1] Bad Good Medium Medium Bad Bad
Levels: Bad Good Medium

R decides that thing from the features type. You can check that by using the str(dataset).If the feature is factor type then it will create dummies for that feature.

Polynomial model to data in R [duplicate]

This question already has answers here:
Fitting polynomial model to data in R
(5 answers)
Closed 5 years ago.
Year <- c(1000,1500,1600,1700,1750,1800,1850,1900,1950,1955,1960,1965,1970,1975,1980,1985,1990,1995,2000,2005,2010,2015)
Africa <- c(70,86,114,106,106,107,111,133,229,254,285,322,366,416,478,550,632,720,814,920,1044,1186)
How can I find the population for the years: 1925, 1963, 1978, 1988, 1998 using Polynomial Linear Regression.

Here is a starting point for the solution of your problem.
Year <- c(1000,1500,1600,1700,1750,1800,1850,1900,1950,1955,1960,1965,
1970,1975,1980,1985,1990,1995,2000,2005,2010,2015)
Africa <- c(70,86,114,106,106,107,111,133,229,254,285,322,366,416,478,550,
632,720,814,920,1044,1186)
df <- data.frame(Year, Africa)
# Polynomial linear regression of order 5
model1 <- lm(Africa ~ poly(Year,5), data=df)
summary(model1)
###########
Call:
lm(formula = Africa ~ poly(Year, 5), data = df)
Residuals:
Min 1Q Median 3Q Max
-59.639 -27.119 -12.397 9.149 97.398
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 411.32 10.12 40.643 < 2e-16 ***
poly(Year, 5)1 881.26 47.47 18.565 3.01e-12 ***
poly(Year, 5)2 768.50 47.47 16.190 2.42e-11 ***
poly(Year, 5)3 709.43 47.47 14.945 8.07e-11 ***
poly(Year, 5)4 628.45 47.47 13.239 4.89e-10 ***
poly(Year, 5)5 359.04 47.47 7.564 1.14e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 47.47 on 16 degrees of freedom
Multiple R-squared: 0.9852, Adjusted R-squared: 0.9805
F-statistic: 212.5 on 5 and 16 DF, p-value: 4.859e-14
#############
pred <- predict(model1)
plot(Year, Africa, type="o", xlab="Year", ylab="Africa")
lines(Year, pred, lwd=2, col="red")
The model estimated above shows a bad fit for Years < 1900. It is therefore preferable to estimate a model only for data after 1900.
# Polynomial linear regression of order 2
df2 <- subset(df,Year>1900)
model2 <- lm(Africa ~ poly(Year,2), data=df2)
summary(model2)
###########
Call:
lm(formula = Africa ~ poly(Year, 2), data = df2)
Residuals:
Min 1Q Median 3Q Max
-9.267 -2.489 -0.011 3.334 12.482
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 586.857 1.677 349.93 < 2e-16 ***
poly(Year, 2)1 1086.646 6.275 173.17 < 2e-16 ***
poly(Year, 2)2 245.687 6.275 39.15 3.65e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.275 on 11 degrees of freedom
Multiple R-squared: 0.9997, Adjusted R-squared: 0.9996
F-statistic: 1.576e+04 on 2 and 11 DF, p-value: < 2.2e-16
###########
df2$pred <- predict(model2)
plot(df2$Year, df2$Africa, type="o", xlab="Year", ylab="Africa")
lines(df2$Year, df2$pred, lwd=2, col="red")
The fit of this second model is clearly better:
At last, we get model prediction for the years 1925, 1963, 1978, 1988, 1998.
df3 <- data.frame(Year=c(1925, 1963, 1978, 1988, 1998))
df3$pred <- predict(model2, newdata=df3)
df3
Year pred
1 1925 286.4863
2 1963 301.1507
3 1978 451.7210
4 1988 597.6301
5 1998 779.9623

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Fitting a linear regression model in R - r

Related

Test if intercepts in ancova model are significantly different in R

Model outcome = mortality (count), exposure = climate (continuous), Rstudio

Display category labels in regression output in R

How does lm() know which predictors are categorical?

Polynomial model to data in R [duplicate]

Categories

Resources