For the Boston data set I want to perform polynomial regression with degrees 5, 4, 3 and 2. I want to use a loop, but I get this error:
Error in `[.data.frame`(data, 0, cols, drop = FALSE) :
  undefined columns selected
library(caret)
train_control <- trainControl(method = "cv", number=10)
#set.seed(5)
cv <- rep(NA, 4)
n <- c(5, 4, 3, 2)
for (i in n) {
  cv[i] <- train(nox ~ poly(dis, degree = i), data = Boston, trncontrol = train_control, method = "lm")
}
Outside the loop,
train(nox ~ poly(dis, degree = i), data = Boston, trncontrol = train_control, method = "lm")
works fine.
Since you are using poly(..., raw = FALSE), you are getting orthogonal polynomial contrasts. Hence there is no need for a for-loop: fit the maximum degree once, since the coefficients and standard errors of the lower-degree terms will not change.
Check the quick example below using lm and the iris dataset:
summary(lm(Sepal.Length~poly(Sepal.Width, 2), iris))
Call:
lm(formula = Sepal.Length ~ poly(Sepal.Width, 2), data = iris)
Residuals:
Min 1Q Median 3Q Max
-1.63153 -0.62177 -0.08282 0.50531 2.33336
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.84333 0.06692 87.316 <2e-16 ***
poly(Sepal.Width, 2)1 -1.18838 0.81962 -1.450 0.1492
poly(Sepal.Width, 2)2 -1.41578 0.81962 -1.727 0.0862 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8196 on 147 degrees of freedom
Multiple R-squared: 0.03344, Adjusted R-squared: 0.02029
F-statistic: 2.543 on 2 and 147 DF, p-value: 0.08209
> summary(lm(Sepal.Length~poly(Sepal.Width, 3), iris))
Call:
lm(formula = Sepal.Length ~ poly(Sepal.Width, 3), data = iris)
Residuals:
Min 1Q Median 3Q Max
-1.6876 -0.5001 -0.0876 0.5493 2.4600
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.84333 0.06588 88.696 <2e-16 ***
poly(Sepal.Width, 3)1 -1.18838 0.80687 -1.473 0.1430
poly(Sepal.Width, 3)2 -1.41578 0.80687 -1.755 0.0814 .
poly(Sepal.Width, 3)3 1.92349 0.80687 2.384 0.0184 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8069 on 146 degrees of freedom
Multiple R-squared: 0.06965, Adjusted R-squared: 0.05054
F-statistic: 3.644 on 3 and 146 DF, p-value: 0.01425
Take a look at the two summary tables: everything is the same, except that the poly(Sepal.Width, 3)3 term is added when a degree of 3 is used. So if we fit degree 3, we can already tell what the degree-2 fit looks like, hence no need for a for-loop.
Note that you could also use several variables in poly, e.g. poly(cbind(Sepal.Width, Petal.Length, Petal.Width), 4), and still be able to easily recover poly(Sepal.Width, 2).
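For reference, here is a rough sketch of how the loop itself could be repaired, in case you do want one cross-validated fit per degree. The two likely culprits are the misspelled argument (train() expects trControl, not trncontrol) and storing a whole train object in a numeric vector; the names degrees and cv_fits below are my own, and the exact CV numbers will depend on the random folds.
library(caret)
library(MASS)   # provides the Boston data set

train_control <- trainControl(method = "cv", number = 10)
degrees <- c(5, 4, 3, 2)
cv_fits <- vector("list", length(degrees))

for (i in seq_along(degrees)) {
  # paste the degree into the formula so nothing depends on the loop variable later
  form <- as.formula(paste0("nox ~ poly(dis, degree = ", degrees[i], ")"))
  cv_fits[[i]] <- train(form, data = Boston,
                        trControl = train_control,   # note the spelling
                        method = "lm")
}

Or, following the suggestion above, fit a single model at the maximum degree:
fit5 <- train(nox ~ poly(dis, degree = 5), data = Boston,
              trControl = train_control, method = "lm")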
I am conducting model selection on my dataset using the package MASS and the function stepAIC. This is the current code I am using:
mod <- lm(Distance~DiffAge + DiffR + DiffSize + DiffRep + DiffSeason +
Diff.Bkp + Diff.Fzp + Diff.AO + Diff.Aow +
Diff.Lag.NAOw + Diff.Lag.NAO + Diff.Lag.AO + Diff.Lag.Aow, data=data,
na.action="na.exclude")
library(MASS)
step.model<-stepAIC(mod, direction = "both",
trace = FALSE)
summary(step.model)
This gives me the following output:
Call:
lm(formula = Distance ~ Diff.Lag.NAOw + Diff.Lag.AO + DiffSeason,
data = data, na.action = "na.exclude")
Residuals:
Min 1Q Median 3Q Max
-146.984 -48.397 -9.533 42.169 194.950
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 77.944 20.247 3.850 0.000184 ***
Diff.Lag.NAOw 11.868 6.261 1.896 0.060209 .
Diff.Lag.AO 24.696 17.475 1.413 0.159947
DiffSeasonEW-LW 41.891 18.607 2.251 0.026014 *
DiffSeasonLW-LW 22.863 20.791 1.100 0.273465
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 67.2 on 132 degrees of freedom
Multiple R-squared: 0.06031, Adjusted R-squared: 0.03183
F-statistic: 2.118 on 4 and 132 DF, p-value: 0.08209
If I am reading this right, the output only shows me the top model (Let me know if this is incorrect!). I would like to see the other, lower-ranked models as well, with their accompanying AIC scores.
Any suggestions on how I can achieve this? Should I modify my code in any way?
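One thing worth knowing (a sketch, not a complete answer): the object returned by stepAIC() keeps the search path in its anova component, and setting trace = TRUE prints every candidate model with its AIC as the search runs.
library(MASS)

# same call as above, but print each step and then inspect the stored path
step.model <- stepAIC(mod, direction = "both", trace = TRUE)
step.model$anova   # the sequence of models visited, with the AIC at each step

If you want an AIC ranking over all candidate subsets rather than just the models visited by the stepwise search, the dredge() function in the MuMIn package is a common alternative (not shown here).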
data("hprice2")
reg1 <- lm(price ~ rooms + crime + nox, hprice2)
summary(reg1)
Call:
lm(formula = price ~ rooms + crime + nox, data = hprice2)
Residuals:
Min 1Q Median 3Q Max
-18311 -3218 -772 2418 39164
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -19371.47 3250.94 -5.959 4.79e-09 ***
rooms 7933.18 407.87 19.450 < 2e-16 ***
crime -199.70 35.05 -5.697 2.08e-08 ***
nox -1306.06 266.14 -4.907 1.25e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6103 on 502 degrees of freedom
Multiple R-squared: 0.5634, Adjusted R-squared: 0.5608
F-statistic: 215.9 on 3 and 502 DF, p-value: < 2.2e-16
Question 1.
Run two alternative (two-sided) t-tests for: H0: B1 = 8000
predict(reg1, data.frame(rooms=8000, crime = -199.70, nox = -1306.06), interval = .99)
Report your t-statistic and whether you reject or fail to reject the null at 90, 95, and/or 99 percent confidence levels.
I suppose that by beta1 you mean the coefficient on rooms here. Each t value in the summary tests the coefficient against the null value 0, i.e. t = (estimate - hypothesised value) / standard error;
so, using nox as an example:
tstat <- (-1306.06 - 0)/266.14
tstat
[1] -4.907417
and the p-value is
2*pt(-abs(tstat), 502)
[1] 1.251945e-06
In your case the null value is 8000, and you test rooms = 8000:
tstat = (7933.18 - 8000)/407.87
2*pt(-abs(tstat),502)
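Worked out numerically (my own arithmetic, using the estimates from your summary), the statistic is tiny and you fail to reject at every level:
tstat <- (7933.18 - 8000) / 407.87
tstat                                  # roughly -0.164
2 * pt(-abs(tstat), df = 502)          # roughly 0.87

# two-sided critical values for the 90%, 95% and 99% confidence levels
qt(c(0.95, 0.975, 0.995), df = 502)
# |t| is far below all of them, so fail to reject H0: B1 = 8000 at each level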
You can also use linearHypothesis from the car package to do the above:
library(car)
linearHypothesis(reg1, c("rooms = 8000"))
I'm trying to understand the role of the base function I() in R when fitting a polynomial linear model, either with explicit terms or with the poly function. When I build the model using
q + q^2
q + I(q^2)
poly(q, 2)
I get different answers.
Here is an example:
set.seed(20)
q <- seq(from=0, to=20, by=0.1)
y <- 500 + .1 * (q-5)^2
noise <- rnorm(length(q), mean=10, sd=80)
noisy.y <- y + noise
model3 <- lm(noisy.y ~ poly(q,2))
model1 <- lm(noisy.y ~ q + I(q^2))
model2 <- lm(noisy.y ~ q + q^2)
I(q^2)==I(q)^2
I(q^2)==q^2
summary(model1)
summary(model2)
summary(model3)
Here is the output:
> summary(model1)
Call:
lm(formula = noisy.y ~ q + I(q^2))
Residuals:
Min 1Q Median 3Q Max
-211.592 -50.609 4.742 61.983 165.792
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 489.3723 16.5982 29.483 <2e-16 ***
q 5.0560 3.8344 1.319 0.189
I(q^2) -0.1530 0.1856 -0.824 0.411
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 79.22 on 198 degrees of freedom
Multiple R-squared: 0.02451, Adjusted R-squared: 0.01466
F-statistic: 2.488 on 2 and 198 DF, p-value: 0.08568
> summary(model2)
Call:
lm(formula = noisy.y ~ q + q^2)
Residuals:
Min 1Q Median 3Q Max
-219.96 -54.42 3.30 61.06 170.79
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 499.5209 11.1252 44.900 <2e-16 ***
q 1.9961 0.9623 2.074 0.0393 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 79.16 on 199 degrees of freedom
Multiple R-squared: 0.02117, Adjusted R-squared: 0.01625
F-statistic: 4.303 on 1 and 199 DF, p-value: 0.03933
> summary(model3)
Call:
lm(formula = noisy.y ~ poly(q, 2))
Residuals:
Min 1Q Median 3Q Max
-211.592 -50.609 4.742 61.983 165.792
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 519.482 5.588 92.966 <2e-16 ***
poly(q, 2)1 164.202 79.222 2.073 0.0395 *
poly(q, 2)2 -65.314 79.222 -0.824 0.4107
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 79.22 on 198 degrees of freedom
Multiple R-squared: 0.02451, Adjusted R-squared: 0.01466
F-statistic: 2.488 on 2 and 198 DF, p-value: 0.08568
Why is I() necessary when fitting a polynomial model in R?
Also, is it normal that the poly function doesn't give the same result as q + I(q^2)?
The formula syntax in R is described in the ?formula help page. Inside a formula the ^ symbol is not given its usual meaning of exponentiation; rather, it expands interactions among the terms at its base. For example
y ~ (a+b)^2
is the same as
y ~ a + b + a:b
But if you do
y ~ a + b^2
it is the same as
y ~ a + b
because there is no way to "interact" b with itself, so the caret simply keeps the b term. Hence ^ and * inside formulas have nothing to do with multiplication, just as + does not mean arithmetic addition of the variables in the usual sense.
If you want the usual meaning of ^2, you need to wrap the term in the "as is" function, I(q^2). Otherwise you are not fitting a squared term at all.
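A small illustration of that point (my own check, not part of the original answer): model.matrix() shows which columns each formula actually produces.
q <- seq(0, 20, by = 0.1)
colnames(model.matrix(~ q + q^2))      # "(Intercept)" "q"          -- the q^2 term vanishes
colnames(model.matrix(~ q + I(q^2)))   # "(Intercept)" "q" "I(q^2)" -- a real squared column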
And the poly() function by default returns orthogonal polynomials, as described on its help page; this helps reduce collinearity among the covariates. If you don't want the orthogonal versions and just want the "raw" polynomial terms, pass raw = TRUE to your poly call. For example
lm(noisy.y ~ poly(q,2, raw=TRUE))
will return the same coefficient estimates as model1.
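A quick way to verify this (a sketch, assuming q and noisy.y are defined as in the question): the raw-polynomial fit reproduces model1's coefficients, and the orthogonal version differs only in parameterisation, not in fitted values.
model1_raw <- lm(noisy.y ~ poly(q, 2, raw = TRUE))
all.equal(unname(coef(model1_raw)), unname(coef(model1)))   # TRUE: same estimates
all.equal(fitted(model1_raw), fitted(model3))               # TRUE: same fitted values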
I'm a newbie in R and I have this fitted model:
> mqo_reg_g <- lm(G ~ factor(year), data = data)
> summary(mqo_reg_g)
Call:
lm(formula = G ~ factor(year), data = data)
Residuals:
Min 1Q Median 3Q Max
-0.11134 -0.06793 -0.04239 0.01324 0.85213
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.111339 0.005253 21.197 < 2e-16 ***
factor(year)2002 -0.015388 0.007428 -2.071 0.038418 *
factor(year)2006 -0.016980 0.007428 -2.286 0.022343 *
factor(year)2010 -0.024432 0.007496 -3.259 0.001131 **
factor(year)2014 -0.025750 0.007436 -3.463 0.000543 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.119 on 2540 degrees of freedom
Multiple R-squared: 0.005952, Adjusted R-squared: 0.004387
F-statistic: 3.802 on 4 and 2540 DF, p-value: 0.004361
I want to test the difference between the coefficients of factor(year)2002 and the Intercept; of factor(year)2006 and factor(year)2002; and so on.
In Stata I know people use the command test, which performs Wald tests on the parameters of the fitted model, but I could not find out how to do this in R.
How can I do it?
Thanks!
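One option (a sketch, not the only way) is the linearHypothesis() function from the car package, which performs the same kind of Wald tests as Stata's test command. Note that because year is dummy-coded, the printed coefficient factor(year)2002 is already the difference between 2002 and the baseline year, so its t-test in the summary is the test of 2002 versus the reference level.
library(car)

# is the 2006 effect different from the 2002 effect?
linearHypothesis(mqo_reg_g, "factor(year)2006 = factor(year)2002")

# is the 2002 coefficient equal to the intercept?
linearHypothesis(mqo_reg_g, "factor(year)2002 = (Intercept)")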
I want to use partial least squares regression to find the most representative variables for predicting my data.
Here is my code:
library(pls)
potion<-read.table("potion-insomnie.txt",header=T)
potionTrain <- potion[1:182,]
potionTest <- potion[183:192,]
potion1 <- plsr(Sommeil ~ Aubepine + Bave + Poudre + Pavot, data = potionTrain, validation = "LOO")
summary(lm(potion1)) gives me this output:
Call:
lm(formula = potion1)
Residuals:
Min 1Q Median 3Q Max
-14.9475 -5.3961 0.0056 5.2321 20.5847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.63931 1.67955 22.410 < 2e-16 ***
Aubepine -0.28226 0.05195 -5.434 1.81e-07 ***
Bave -1.79894 0.26849 -6.700 2.68e-10 ***
Poudre 0.35420 0.72849 0.486 0.627
Pavot -0.47678 0.52027 -0.916 0.361
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.845 on 177 degrees of freedom
Multiple R-squared: 0.293, Adjusted R-squared: 0.277
F-statistic: 18.34 on 4 and 177 DF, p-value: 1.271e-12
I deduced that only the variables Aubepine and Bave are representative, so I refitted the model with just these two variables:
potion1 <- plsr(Sommeil ~ Aubepine + Bave, data = potionTrain, validation = "LOO")
And I plot:
plot(potion1, ncomp = 2, asp = 1, line = TRUE)
Here is the plot of predicted vs measured values:
The problem is that I can see the regression line on the plot, but I cannot get its equation or its R². Is that possible?
Also, is the first part the same as a multiple linear regression (ANOVA)?
pacman::p_load(pls)
data(mtcars)
potion <- mtcars
potionTrain <- potion[1:28,]
potionTest <- potion[29:32,]
potion1 <- plsr(mpg ~ cyl + disp + hp + drat, data = potionTrain, validation = "LOO")
coef(potion1)   # coefficients
scores(potion1) # scores
## R^2:
R2(potion1, estimate = "train")
## cross-validated R^2:
R2(potion1)
## Both:
R2(potion1, estimate = "all")
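As for the equation and R² of a line through the predicted-versus-measured points: as far as I can tell, the line drawn by line = TRUE is just the identity (target) line, but you can fit a least-squares line yourself and read off its coefficients. A sketch using the mtcars stand-in above (the names pred, obs and line_fit are my own):
pred <- drop(predict(potion1, ncomp = 2))   # predict() returns a 3-d array, hence drop()
obs  <- potionTrain$mpg

line_fit <- lm(pred ~ obs)        # least-squares line through the plotted points
coef(line_fit)                    # intercept and slope of that line
summary(line_fit)$r.squared       # its R^2

plot(potion1, ncomp = 2, asp = 1, line = TRUE)
abline(line_fit, lty = 2)         # overlay the fitted line on the prediction plot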