Fitting a polynomial model to data in R [duplicate]

This question already has answers here:
Fitting polynomial model to data in R
(5 answers)
Closed 5 years ago.
Year <- c(1000,1500,1600,1700,1750,1800,1850,1900,1950,1955,1960,1965,1970,1975,1980,1985,1990,1995,2000,2005,2010,2015)
Africa <- c(70,86,114,106,106,107,111,133,229,254,285,322,366,416,478,550,632,720,814,920,1044,1186)
How can I estimate the population for the years 1925, 1963, 1978, 1988, and 1998 using polynomial linear regression?

Here is a starting point for the solution of your problem.
Year <- c(1000,1500,1600,1700,1750,1800,1850,1900,1950,1955,1960,1965,
1970,1975,1980,1985,1990,1995,2000,2005,2010,2015)
Africa <- c(70,86,114,106,106,107,111,133,229,254,285,322,366,416,478,550,
632,720,814,920,1044,1186)
df <- data.frame(Year, Africa)
# Polynomial linear regression of order 5
model1 <- lm(Africa ~ poly(Year,5), data=df)
summary(model1)
###########
Call:
lm(formula = Africa ~ poly(Year, 5), data = df)
Residuals:
Min 1Q Median 3Q Max
-59.639 -27.119 -12.397 9.149 97.398
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 411.32 10.12 40.643 < 2e-16 ***
poly(Year, 5)1 881.26 47.47 18.565 3.01e-12 ***
poly(Year, 5)2 768.50 47.47 16.190 2.42e-11 ***
poly(Year, 5)3 709.43 47.47 14.945 8.07e-11 ***
poly(Year, 5)4 628.45 47.47 13.239 4.89e-10 ***
poly(Year, 5)5 359.04 47.47 7.564 1.14e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 47.47 on 16 degrees of freedom
Multiple R-squared: 0.9852, Adjusted R-squared: 0.9805
F-statistic: 212.5 on 5 and 16 DF, p-value: 4.859e-14
#############
pred <- predict(model1)
plot(Year, Africa, type="o", xlab="Year", ylab="Africa")
lines(Year, pred, lwd=2, col="red")
The model estimated above fits poorly for years before 1900. It is therefore preferable to fit a model using only the data after 1900.
# Polynomial linear regression of order 2
df2 <- subset(df,Year>1900)
model2 <- lm(Africa ~ poly(Year,2), data=df2)
summary(model2)
###########
Call:
lm(formula = Africa ~ poly(Year, 2), data = df2)
Residuals:
Min 1Q Median 3Q Max
-9.267 -2.489 -0.011 3.334 12.482
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 586.857 1.677 349.93 < 2e-16 ***
poly(Year, 2)1 1086.646 6.275 173.17 < 2e-16 ***
poly(Year, 2)2 245.687 6.275 39.15 3.65e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.275 on 11 degrees of freedom
Multiple R-squared: 0.9997, Adjusted R-squared: 0.9996
F-statistic: 1.576e+04 on 2 and 11 DF, p-value: < 2.2e-16
###########
df2$pred <- predict(model2)
plot(df2$Year, df2$Africa, type="o", xlab="Year", ylab="Africa")
lines(df2$Year, df2$pred, lwd=2, col="red")
The fit of this second model is clearly better.
Finally, we get the model predictions for the years 1925, 1963, 1978, 1988, and 1998.
df3 <- data.frame(Year=c(1925, 1963, 1978, 1988, 1998))
df3$pred <- predict(model2, newdata=df3)
df3
Year pred
1 1925 286.4863
2 1963 301.1507
3 1978 451.7210
4 1988 597.6301
5 1998 779.9623
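As an optional extension (not part of the original answer), predict() can also return prediction intervals, which give a sense of the uncertainty around these point estimates:
# 95% prediction intervals for the same years (reusing model2 and df3 from above)
predict(model2, newdata = df3, interval = "prediction")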

Related

R polynomial regression or group values and test between groups + outcome interpretation

I am trying to model the scar acquisition rate of a wild population of animals, for which I have calculated yearly rates.
As the plot shows, it seems to me that the rates rise through the middle of the period and then fall again. I have tried to fit a polynomial LM with the code
model1 <- lm(Rate~poly(year, 2, raw = TRUE),data=yearlyratesub)
summary(model1)
model1
I have plotted using:
g <-ggplot(yearlyratesub, aes(year, Rate)) + geom_point(shape=1) + geom_smooth(method = lm, formula = y ~ poly(x, 2, raw = TRUE))
g
The model output was:
Call:
lm(formula = Rate ~ poly(year, 2, raw = TRUE), data = yearlyratesub)
Residuals:
Min 1Q Median 3Q Max
-0.126332 -0.037683 -0.002602 0.053222 0.083503
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.796e+03 3.566e+03 -2.467 0.0297 *
poly(year, 2, raw = TRUE)1 8.747e+00 3.545e+00 2.467 0.0297 *
poly(year, 2, raw = TRUE)2 -2.174e-03 8.813e-04 -2.467 0.0297 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.0666 on 12 degrees of freedom
Multiple R-squared: 0.3369, Adjusted R-squared: 0.2264
F-statistic: 3.048 on 2 and 12 DF, p-value: 0.08503
How can I interpret that now? The overall model p-value is not significant, but the intercept and the individual slopes are?
Should I rather try another fit than x², or group the values and test between groups, e.g. with an ANOVA? I know the LM has a low fit, but I guess that's because I have few values, and maybe x² is not the right form...?
Would be happy about input regarding model and outcome interpretation.
Grouping
Since the data was not provided (next time please provide a complete reproducible question including all inputs), we used the data in the Note at the end. We see that the model is highly significant if we group the points using the indicated breakpoints.
g <- factor(findInterval(yearlyratesub$year, c(2007.5, 2014.5))+1); g
## [1] 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3
## Levels: 1 2 3
fm <- lm(rate ~ g, yearlyratesub)
summary(fm)
giving
Call:
lm(formula = rate ~ g, data = yearlyratesub)
Residuals:
Min 1Q Median 3Q Max
-0.064618 -0.018491 0.006091 0.029684 0.046831
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.110854 0.019694 5.629 0.000111 ***
g2 0.127783 0.024687 5.176 0.000231 ***
g3 -0.006714 0.027851 -0.241 0.813574
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.03939 on 12 degrees of freedom
Multiple R-squared: 0.7755, Adjusted R-squared: 0.738
F-statistic: 20.72 on 2 and 12 DF, p-value: 0.0001281
We could consider combining the outer two groups.
g2 <- factor(g == 2)
fm2 <- lm(rate ~ g2, yearlyratesub)
summary(fm2)
giving:
Call:
lm(formula = rate ~ g2, data = yearlyratesub)
Residuals:
Min 1Q Median 3Q Max
-0.064618 -0.016813 0.007096 0.031363 0.046831
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.10750 0.01341 8.015 2.19e-06 ***
g2TRUE 0.13114 0.01963 6.680 1.52e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.03793 on 13 degrees of freedom
Multiple R-squared: 0.7744, Adjusted R-squared: 0.757
F-statistic: 44.62 on 1 and 13 DF, p-value: 1.517e-05
Sinusoid
Looking at the graph, the points seem to turn up at the left and right edges, suggesting a sinusoidal fit: a + b * cos(c * year).
fm3 <- nls(rate ~ cbind(a = 1, b = cos(c * year)),
           yearlyratesub, start = list(c = 0.5), algorithm = "plinear")
summary(fm3)
giving
Formula: rate ~ cbind(a = 1, b = cos(c * year))
Parameters:
Estimate Std. Error t value Pr(>|t|)
c 0.4999618 0.0001449 3449.654 < 2e-16 ***
.lin.a 0.1787200 0.0150659 11.863 5.5e-08 ***
.lin.b 0.0753754 0.0205818 3.662 0.00325 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.05688 on 12 degrees of freedom
Number of iterations to convergence: 2
Achieved convergence tolerance: 5.241e-08
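As a quick sanity check on this fit (an addition, not from the original answer), the estimated frequency c implies a cycle length of 2 * pi / c:
# Implied period of the fitted sinusoid, in years
2 * pi / coef(fm3)[["c"]]   # about 12.6 years per cycle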
Comparison
Plotting the fits and looking at their residual sums of squares and AIC, we have
plot(yearlyratesub)
# fm0 from Note at end, fm and fm2 are grouping models, fm3 is sinusoidal
L <- list(fm0 = fm0, fm = fm, fm2 = fm2, fm3 = fm3)
for(i in seq_along(L)) {
  lines(fitted(L[[i]]) ~ year, yearlyratesub, col = i, lwd = 2)
}
legend("topright", names(L), col = seq_along(L), lwd = 2)
giving the following, where a lower residual sum of squares and a lower AIC (which takes the number of parameters into account) are better. We see that fm fits most closely by residual sum of squares, with fm2 fitting almost as well; however, once the number of parameters is taken into account via AIC, fm2 has the lowest value and so is most favored by that criterion.
cbind(RSS = sapply(L, deviance), AIC = sapply(L, AIC))
## RSS AIC
## fm0 0.05488031 -33.59161
## fm 0.01861659 -49.80813
## fm2 0.01870674 -51.73567
## fm3 0.04024237 -38.24512
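Since fm2 is nested within fm (the two-group split merges the outer groups of the three-group split), an F test offers one more comparison; a small sketch added here, not part of the original answer:
# Does the third group significantly improve on the two-group model?
anova(fm2, fm)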
Note
yearlyratesub <-
structure(list(year = c(2004, 2005, 2006, 2007, 2008, 2009, 2010,
2011, 2012, 2013, 2014, 2015, 2017, 2018, 2019), rate = c(0.14099813521287,
0.0949946651016247, 0.0904788394070601, 0.11694517831575, 0.26786193592875,
0.256346628540479, 0.222029818828298, 0.180116679856725, 0.285467976459104,
0.174019208113095, 0.28461698734932, 0.0574827955982996, 0.103378448084776,
0.114593695172686, 0.141105952837639)), row.names = c(NA, -15L
), class = "data.frame")
fm0 <- lm(rate ~ poly(year, 2, raw = TRUE), yearlyratesub)
summary(fm0)
giving
Call:
lm(formula = rate ~ poly(year, 2, raw = TRUE), data = yearlyratesub)
Residuals:
Min 1Q Median 3Q Max
-0.128335 -0.038289 -0.002715 0.054090 0.084792
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.930e+03 3.621e+03 -2.466 0.0297 *
poly(year, 2, raw = TRUE)1 8.880e+00 3.600e+00 2.467 0.0297 *
poly(year, 2, raw = TRUE)2 -2.207e-03 8.949e-04 -2.467 0.0297 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.06763 on 12 degrees of freedom
Multiple R-squared: 0.3381, Adjusted R-squared: 0.2278
F-statistic: 3.065 on 2 and 12 DF, p-value: 0.0841

PCA with new Factors in R

My objective is to fit a linear model with the same response, but with the predictors replaced by factors/scores. I am trying to find out which principal components to include in such a linear model if I want to achieve an R^2 of at least 0.9 times the r.squared of my original model.
Which predictors should I choose?
model1 <- lm(Resp~.,data=test_dat)
> summary(model1)
Call:
lm(formula = Resp ~ ., data = test_dat)
Residuals:
Min 1Q Median 3Q Max
-0.35934 -0.07729 0.00330 0.08204 0.38709
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.18858 0.06926 -46.039 <2e-16 ***
Pred1 4.32083 0.03767 114.708 <2e-16 ***
Pred2 2.42110 0.04740 51.077 <2e-16 ***
Pred3 -1.00507 0.04435 -22.664 <2e-16 ***
Pred4 -3.19480 0.09147 -34.927 <2e-16 ***
Pred5 2.77779 0.05368 51.748 <2e-16 ***
Pred6 1.22923 0.05427 22.648 <2e-16 ***
Pred7 -1.21338 0.04562 -26.595 <2e-16 ***
Pred8 0.02485 0.05937 0.419 0.676
Pred9 -0.67831 0.05308 -12.778 <2e-16 ***
Pred10 1.69947 0.02628 64.672 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1193 on 489 degrees of freedom
Multiple R-squared: 0.997, Adjusted R-squared: 0.997
F-statistic: 1.645e+04 on 10 and 489 DF, p-value: < 2.2e-16
My new model should have an R^2 >= 0.897:
r.sqrd <- summary(model1)$r.squared
(threshold <- 0.9 * r.sqrd)
[1] 0.8973323
library(relaimpo)  # needed for calc.relimp()
metrics.swiss <- calc.relimp(model1, type = c("lmg", "first", "last", "betasq", "pratt"))
metrics.swiss
metrics.swiss@lmg.rank
Pred1 Pred2 Pred3 Pred4 Pred5 Pred6 Pred7 Pred8 Pred9 Pred10
2 8 3 6 1 10 5 4 7 9
sum(metrics.swiss@lmg)
orderComponents <- c(5, 1, 3, 8, 7, 4, 9, 2, 10, 6)
PCAFactors <- Project.Data.PCA$scores
Rotated <- as.data.frame(cbind(Resp = test_dat$Resp, PCAFactors))
swissRotatedReordered <- Rotated[, c(1, orderComponents + 1)]
(nestedRSquared <- sapply(2:11, function(z)
  summary(lm(Resp ~ ., data = swissRotatedReordered[, 1:z]))$r.squared))
[1] 0.001041492 0.622569992 0.689046489 0.690319839 0.715051745 0.732286987
[7] 0.742441421 0.991291253 0.995263470 0.997035905
You run a linear model with your PCA scores as the predictors.
"lmg" will show which factors made the largest contribution, and those are the factors you should keep; in my case it was the top 3 factors.
predictors <- test_dat[-1]
Project.Data.PCA <- princomp(predictors)
summary(Project.Data.PCA)
PCAFactors <- Project.Data.PCA$scores
Rotated <- as.data.frame(cbind(Resp = test_dat$Resp, PCAFactors))
linModPCA <- lm(Resp ~ ., data = Rotated)
metrics.swiss <- calc.relimp(linModPCA,
                             type = c("lmg", "first", "last", "betasq", "pratt"))
metrics.swiss
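As a complementary check (a sketch added here, reusing threshold, nestedRSquared, and swissRotatedReordered from the question's code), the smallest model clearing the R^2 threshold can also be picked programmatically:
# Number of ranked components needed to reach the target R^2
(k <- which(nestedRSquared >= threshold)[1])
# Refit with just those components; column 1 is Resp
summary(lm(Resp ~ ., data = swissRotatedReordered[, 1:(k + 1)]))$r.squared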

How to get coefficient for intercept in felm

I run the following code in R, but I do not get a coefficient for the intercept. How can I get the intercept coefficient?
library(lfe)  # provides felm()
# create covariates
x <- rnorm(4000)
x2 <- rnorm(length(x))
#create individual and firm
id <- factor(sample(500,length(x),replace=TRUE))
firm <- factor(sample(300,length(x),replace=TRUE))
#effects
id.eff <- rlnorm(nlevels(id))
firm.eff <- rexp(nlevels(firm))
#left hand side
y <- 50000 + x + 0.25*x2 + id.eff[id] + firm.eff[firm] + rnorm(length(x))
#estimate and print result
est <- felm(y ~ x+x2 | id + firm)
summary(est)
which gives me
Call: felm(formula = y ~ x + x2 | id + firm)
Residuals:
     Min      1Q  Median      3Q     Max
-3.3129 -0.6147 -0.0009  0.6131  3.2878
Coefficients:
   Estimate Std. Error t value Pr(>|t|)
x   1.00276    0.01834   54.66   <2e-16 ***
x2  0.26190    0.01802   14.54   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.02 on 3199 degrees of freedom
Multiple R-squared(full model): 0.8778   Adjusted R-squared: 0.8472
Multiple R-squared(proj model): 0.4988   Adjusted R-squared: 0.3735
F-statistic(full model): 28.72 on 800 and 3199 DF, p-value: < 2.2e-16
F-statistic(proj model): 1592 on 2 and 3199 DF, p-value: < 2.2e-16
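One possible approach (a sketch, not an answer from the original thread): felm() projects out the id and firm fixed effects, so no global intercept is reported; the intercept is absorbed into the group effects, which getfe() from the same lfe package recovers:
# Recover the estimated fixed effects that felm() swept out;
# the overall intercept is absorbed into these group effects
fe <- getfe(est)
head(fe)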

R - How to save console data into a row/matrix or data frame for future use?

I'm performing multiple regression to find the best model to predict prices; see the output in the R console below. I'd like to store the first column (Estimate) in a row/matrix or data frame for future use, such as deploying on the web with R Shiny:
Price = 698.8 + 0.116*Voltage - 70.72*VendorCHICONY - 36.6*VendorDELTA - 66.8*VendorLITEON - 14.86*H
Can somebody kindly advise? Thanks in advance.
Call:
lm(formula = Price ~ Voltage + Vendor + H, data = PSU2)
Residuals:
Min 1Q Median 3Q Max
-10.9950 -0.6251 0.0000 3.0134 11.0360
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 698.821309 276.240098 2.530 0.0280 *
Voltage 0.116958 0.005126 22.818 1.29e-10 ***
VendorCHICONY -70.721088 9.308563 -7.597 1.06e-05 ***
VendorDELTA -36.639685 5.866688 -6.245 6.30e-05 ***
VendorLITEON -66.796531 6.120925 -10.913 3.07e-07 ***
H -14.869478 6.897259 -2.156 0.0541 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.307 on 11 degrees of freedom
Multiple R-squared: 0.9861, Adjusted R-squared: 0.9799
F-statistic: 156.6 on 5 and 11 DF, p-value: 7.766e-10
Use coef() on your lm output, e.g.
m <- lm(Sepal.Length ~ Sepal.Width + Species, iris)
summary(m)
# Call:
# lm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
# Residuals:
# Min 1Q Median 3Q Max
# -1.30711 -0.25713 -0.05325 0.19542 1.41253
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 2.2514 0.3698 6.089 9.57e-09 ***
# Sepal.Width 0.8036 0.1063 7.557 4.19e-12 ***
# Speciesversicolor 1.4587 0.1121 13.012 < 2e-16 ***
# Speciesvirginica 1.9468 0.1000 19.465 < 2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.438 on 146 degrees of freedom
# Multiple R-squared: 0.7259, Adjusted R-squared: 0.7203
# F-statistic: 128.9 on 3 and 146 DF, p-value: < 2.2e-16
coef(m)
# (Intercept) Sepal.Width Speciesversicolor Speciesvirginica
# 2.2513932 0.8035609 1.4587431 1.9468166
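If you want the estimates in a data frame for later use (e.g. in a Shiny app), a minimal sketch:
# One row per term; easy to persist with saveRDS() or write.csv()
coefs <- data.frame(term = names(coef(m)), estimate = unname(coef(m)))
coefs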
See also names(m), which shows some things you can extract, e.g. m$residuals (or, equivalently, resid(m)).
And methods(class='lm') will show some other functions that work on an lm object.
> methods(class='lm')
[1] add1 alias anova case.names coerce confint cooks.distance deviance dfbeta dfbetas drop1 dummy.coef effects extractAIC family
[16] formula hatvalues influence initialize kappa labels logLik model.frame model.matrix nobs plot predict print proj qr
[31] residuals rstandard rstudent show simulate slotsFromS3 summary variable.names vcov
(oddly, 'coef' is not in there? ah well)
Besides, I'd like to know if there is a command to show the "residual percentage", i.e. (actual value - fitted value) / actual value; currently the residuals() command only shows the info below, but I need the percentage instead.
residuals(fit3ab)
1 2 3 4 5 6
-5.625491e-01 -5.625491e-01 7.676578e-15 -8.293815e+00 -5.646900e+00 3.443652e+00
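There is no built-in command for that, but the percentage can be computed from quantities the model already stores; a minimal sketch, reusing the fitted object fit3ab from above:
# (actual - fitted) / actual, where actual = fitted + residual
residuals(fit3ab) / (fitted(fit3ab) + residuals(fit3ab))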

Fitting a linear regression model in R

I have a question regarding linear regression analysis in R:
I have several independent variables (about 20-30) and one dependent variable. To reach the best model, I try "all" relevant combinations of independent variables in order to maximize my adjusted R^2. However, this is a lot of work. So my question is: is there a way to automatically fit a regression model in R, i.e. an automatic selection of the independent variables stored in a data frame that yields the best description of the variation in the dependent variable?
Thank you for your help!
You can use the step() function; however, an analysis done with this approach may hit some bumps on the road if whoever checks your work is against data dredging. Here is the example from ?step.
> summary(lm1 <- lm(Fertility ~ ., data = swiss))
Call:
lm(formula = Fertility ~ ., data = swiss)
Residuals:
Min 1Q Median 3Q Max
-15.2743 -5.2617 0.5032 4.1198 15.3213
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 66.91518 10.70604 6.250 1.91e-07 ***
Agriculture -0.17211 0.07030 -2.448 0.01873 *
Examination -0.25801 0.25388 -1.016 0.31546
Education -0.87094 0.18303 -4.758 2.43e-05 ***
Catholic 0.10412 0.03526 2.953 0.00519 **
Infant.Mortality 1.07705 0.38172 2.822 0.00734 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.165 on 41 degrees of freedom
Multiple R-squared: 0.7067, Adjusted R-squared: 0.671
F-statistic: 19.76 on 5 and 41 DF, p-value: 5.594e-10
> slm1 <- step(lm1)
Start: AIC=190.69
Fertility ~ Agriculture + Examination + Education + Catholic +
Infant.Mortality
Df Sum of Sq RSS AIC
- Examination 1 53.03 2158.1 189.86
<none> 2105.0 190.69
- Agriculture 1 307.72 2412.8 195.10
- Infant.Mortality 1 408.75 2513.8 197.03
- Catholic 1 447.71 2552.8 197.75
- Education 1 1162.56 3267.6 209.36
Step: AIC=189.86
Fertility ~ Agriculture + Education + Catholic + Infant.Mortality
Df Sum of Sq RSS AIC
<none> 2158.1 189.86
- Agriculture 1 264.18 2422.2 193.29
- Infant.Mortality 1 409.81 2567.9 196.03
- Catholic 1 956.57 3114.6 205.10
- Education 1 2249.97 4408.0 221.43
> summary(slm1)
Call:
lm(formula = Fertility ~ Agriculture + Education + Catholic +
Infant.Mortality, data = swiss)
Residuals:
Min 1Q Median 3Q Max
-14.6765 -6.0522 0.7514 3.1664 16.1422
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 62.10131 9.60489 6.466 8.49e-08 ***
Agriculture -0.15462 0.06819 -2.267 0.02857 *
Education -0.98026 0.14814 -6.617 5.14e-08 ***
Catholic 0.12467 0.02889 4.315 9.50e-05 ***
Infant.Mortality 1.07844 0.38187 2.824 0.00722 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.168 on 42 degrees of freedom
Multiple R-squared: 0.6993, Adjusted R-squared: 0.6707
F-statistic: 24.42 on 4 and 42 DF, p-value: 1.717e-10
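Since the stated goal is to maximize adjusted R^2 rather than AIC, an exhaustive best-subset search is another option; a sketch using the leaps package (an addition, not part of the original answer), illustrated on the same swiss data:
library(leaps)
# Evaluate subsets of predictors and rank them by adjusted R^2
subsets <- regsubsets(Fertility ~ ., data = swiss)
s <- summary(subsets)
which.max(s$adjr2)                 # best subset size by adjusted R^2
coef(subsets, which.max(s$adjr2))  # coefficients of that subset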
