Polynomial Regression nonsense Predictions - r

Suppose I want to fit a linear regression model with a degree-two (orthogonal) polynomial and then predict the response. Here is the code for the first model (m1):
x=1:100
y=-2+3*x-5*x^2+rnorm(100)
m1=lm(y~poly(x,2))
prd.1=predict(m1,newdata=data.frame(x=105:110))
Now let's try the same model, but instead of using poly(x,2) directly, I will use its columns:
m2=lm(y~poly(x,2)[,1]+poly(x,2)[,2])
prd.2=predict(m2,newdata=data.frame(x=105:110))
Let's look at the summaries of m1 and m2.
> summary(m1)
Call:
lm(formula = y ~ poly(x, 2))
Residuals:
Min 1Q Median 3Q Max
-2.50347 -0.48752 -0.07085 0.53624 2.96516
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.677e+04 9.912e-02 -169168 <2e-16 ***
poly(x, 2)1 -1.449e+05 9.912e-01 -146195 <2e-16 ***
poly(x, 2)2 -3.726e+04 9.912e-01 -37588 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9912 on 97 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 1.139e+10 on 2 and 97 DF, p-value: < 2.2e-16
> summary(m2)
Call:
lm(formula = y ~ poly(x, 2)[, 1] + poly(x, 2)[, 2])
Residuals:
Min 1Q Median 3Q Max
-2.50347 -0.48752 -0.07085 0.53624 2.96516
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.677e+04 9.912e-02 -169168 <2e-16 ***
poly(x, 2)[, 1] -1.449e+05 9.912e-01 -146195 <2e-16 ***
poly(x, 2)[, 2] -3.726e+04 9.912e-01 -37588 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9912 on 97 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 1.139e+10 on 2 and 97 DF, p-value: < 2.2e-16
So m1 and m2 are basically the same. Now let's look at the predictions prd.1 and prd.2
> prd.1
1 2 3 4 5 6
-54811.60 -55863.58 -56925.56 -57997.54 -59079.52 -60171.50
> prd.2
1 2 3 4 5 6
49505.92 39256.72 16812.28 -17827.42 -64662.35 -123692.53
Q1: Why is prd.2 so different from prd.1?
Q2: How can I obtain prd.1 using the model m2?

m1 is the right way to do this. m2 is entering a whole world of pain...
To do predictions from m2, the model needs to know it was fitted to an orthogonal set of basis functions, so that it uses the same basis functions for the extrapolated new data values. Compare: poly(1:10,2)[,2] with poly(1:12,2)[,2] - the first ten values are not the same. If you fit the model explicitly with poly(x,2) then predict understands all that and does the right thing.
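The comparison can be run directly. This is a minimal sketch in base R; the names b10, b12, p10 and ext are made up for the illustration. As far as I understand it, subsetting with [,2] returns a plain numeric vector, so the centring and scaling constants stored on the poly object are lost, and predict has no choice but to rebuild the basis from whatever x it is handed.

```r
# Same column of the orthogonal basis, built from two different x ranges
b10 <- poly(1:10, 2)[, 2]
b12 <- poly(1:12, 2)[, 2]

# The first ten values differ, because the basis depends on the whole x vector
same_basis <- isTRUE(all.equal(b10, b12[1:10]))
same_basis   # FALSE

# Keeping the poly object lets predict() reuse the original basis
p10 <- poly(1:10, 2)
ext <- predict(p10, 1:12)          # dispatches to predict.poly
reused <- isTRUE(all.equal(unname(ext[1:10, 2]), unname(b10)))
reused       # TRUE
```

Rows 11 and 12 of ext are then the original basis functions evaluated at the new points, which is what extrapolation needs.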
What you have to do is make sure your prediction locations are transformed using the same set of basis functions that were used to create the model in the first place. You can use predict.poly for this (note that I call my explanatory variables x1 and x2 so that it's easy to match the names up):
px = poly(x,2)
x1 = px[,1]
x2 = px[,2]
m3 = lm(y~x1+x2)
newx = 90:110
pnew = predict(px,newx) # px is the previous poly object, so this calls predict.poly
prd.3 = predict(m3, newdata=data.frame(x1=pnew[,1],x2=pnew[,2]))
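To check that this route reproduces the predictions from m1, here is a self-contained sketch. The seed is an assumption added only to make the run repeatable, and newx is set to 105:110 to match the question rather than 90:110.

```r
set.seed(42)                       # assumption: not in the original post
x <- 1:100
y <- -2 + 3 * x - 5 * x^2 + rnorm(100)

m1 <- lm(y ~ poly(x, 2))           # the "right way"

px <- poly(x, 2)                   # keep the poly object around
x1 <- px[, 1]
x2 <- px[, 2]
m3 <- lm(y ~ x1 + x2)

newx <- 105:110
pnew <- predict(px, newx)          # original basis evaluated at the new points

prd.1 <- predict(m1, newdata = data.frame(x = newx))
prd.3 <- predict(m3, newdata = data.frame(x1 = pnew[, 1], x2 = pnew[, 2]))

# The two sets of predictions agree up to numerical tolerance
agree <- isTRUE(all.equal(unname(prd.1), unname(prd.3)))
agree
```

The same idea answers Q2: a model fitted on separate columns can only predict correctly if the new columns come from predict(px, newx), not from a fresh call to poly on the new x.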

Related

Extracting output from linear regression in r

I want to extract one selected value from the output of the lm function. This is the code I have:
fastfood <- openintro::fastfood
L1 = lm(formula = calories~sat_fat +fiber+sugar, fastfood)
summary(L1)
This is the output
Call:
lm(formula = calories ~ sat_fat + fiber + sugar, data = fastfood)
Residuals:
Min 1Q Median 3Q Max
-680.18 -88.97 -24.65 57.46 1501.07
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 113.334 15.760 7.191 2.36e-12 ***
sat_fat 30.839 1.180 26.132 < 2e-16 ***
fiber 24.396 2.444 9.983 < 2e-16 ***
sugar 8.890 1.120 7.938 1.37e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 160.8 on 499 degrees of freedom
(12 observations deleted due to missingness)
Multiple R-squared: 0.6726, Adjusted R-squared: 0.6707
F-statistic: 341.7 on 3 and 499 DF, p-value: < 2.2e-16
I need to extract only the following from the above output. How do I get it?
sat_fat 30.839
Most commonly, coef() is used to return the coefficients, e.g.:
coef(L1)
coef(L1)['sat_fat']
You may also want to look at tidy in the broom package which returns a nice summary as a dataframe, with coefficients in the estimate column.
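For a base-R version that also reaches the standard error and p-value, a sketch along these lines should work. It uses the built-in mtcars data as a stand-in, since the fastfood data needs the openintro package; the model and the name fit are assumptions for the illustration.

```r
# mtcars stands in for fastfood here; wt plays the role of sat_fat
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Point estimate only, selected by name
est <- coef(fit)["wt"]

# The full coefficient table is a matrix with row and column names
tab <- coef(summary(fit))
est_from_table <- tab["wt", "Estimate"]
row_wt <- tab["wt", ]              # Estimate, Std. Error, t value, Pr(>|t|)
row_wt
```

Indexing by name rather than by position keeps the code working if terms are later added to or removed from the formula.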

Swap x and y variables in lm() function in R

I am trying to get the summary output of a linear model created with the lm() function in R, but no matter how I set it up, my desired y variable ends up as the x input. I want the summary of the model where Winnings is the response (y) and averagedist is the predictor (x). This is my current output:
Call:
lm(formula = Winnings ~ averagedist, data = combineddata)
Residuals:
Min 1Q Median 3Q Max
-20.4978 -5.2992 -0.3824 6.0887 23.4764
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.882e+02 7.577e-01 380.281 < 2e-16 ***
Winnings 1.293e-06 2.023e-07 6.391 8.97e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.343 on 232 degrees of freedom
Multiple R-squared: 0.1497, Adjusted R-squared: 0.146
F-statistic: 40.84 on 1 and 232 DF, p-value: 8.967e-10
I have tried flipping the order and defining the variables using y = winnings, x = averagedist, but I always get the same output.
Using summary(lm(Winnings ~ averagedist,combineddata)) as an alternative way to set it up seemed to do the trick, as opposed to the two-step method:
str<-lm(Winnings ~ averagedist,combineddata)
summary(str)
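In an lm() formula the variable on the left of ~ is always the response, so a coefficient row named Winnings means the fitted object had Winnings on the right. One possible explanation, which the post does not confirm, is that summary was being called on an older object fitted the other way round. A quick way to check which way a stored model was fitted is sketched below; the data are made up and only the column names follow the question.

```r
set.seed(1)                                   # assumption: made-up data
combineddata <- data.frame(averagedist = rnorm(50, 290, 9))
combineddata$Winnings <- 1e5 * (combineddata$averagedist - 260) +
  rnorm(50, 0, 5e5)

fit_w <- lm(Winnings ~ averagedist, data = combineddata)  # Winnings is y
fit_d <- lm(averagedist ~ Winnings, data = combineddata)  # averagedist is y

# The non-intercept coefficient is named after the predictor
names(coef(fit_w))            # "(Intercept)" "averagedist"
names(coef(fit_d))            # "(Intercept)" "Winnings"

# The first variable in the stored formula is the response
all.vars(formula(fit_w))[1]   # "Winnings"
```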

Summary Extract Correlation Coefficient

I am using lm() on a large data set in R. Using summary(), one can get a lot of details about the linear regression between two variables.
The part I am confused about is which value in the Coefficients: section of the summary is the correct one to use as the correlation coefficient.
Sample Data
c1 <- c(1:10)
c2 <- c(10:19)
output <- summary(lm(c1 ~ c2))
Summary
Call:
lm(formula = c1 ~ c2)
Residuals:
Min 1Q Median 3Q Max
-2.280e-15 -8.925e-16 -2.144e-16 4.221e-16 4.051e-15
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.000e+00 2.902e-15 -3.101e+15 <2e-16 ***
c2 1.000e+00 1.963e-16 5.093e+15 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.783e-15 on 8 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 2.594e+31 on 1 and 8 DF, p-value: < 2.2e-16
Is this the correlation coefficient I should use?
output$coefficients[2,1]
1
Please suggest, thanks.
The full variance-covariance matrix of the coefficient estimates is:
fm <- lm(c1 ~ c2)
vcov(fm)
and in particular sqrt(diag(vcov(fm))) equals coef(summary(fm))[, 2], the Std. Error column.
The corresponding correlation matrix is:
cov2cor(vcov(fm))
The correlation between the coefficient estimates is:
cov2cor(vcov(fm))[1, 2]
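Note that this is the correlation between the two coefficient estimates, not between c1 and c2. If what is wanted is the Pearson correlation between the variables themselves, the slope in output$coefficients[2,1] is generally not it; in the sample data it happens to be 1 only because c1 is an exact linear function of c2 with slope 1. A minimal sketch with noisy data (the seed and the noise are assumptions for the illustration):

```r
set.seed(7)                        # assumption: made-up noisy data
c2 <- 10:19
c1 <- 2 * c2 + rnorm(10)

fm <- lm(c1 ~ c2)
slope <- unname(coef(fm)["c2"])    # close to 2, clearly not a correlation

# Pearson correlation between the two variables
r_direct <- cor(c1, c2)

# In a one-predictor model, r is the signed square root of R-squared
r_from_fit <- sign(slope) * sqrt(summary(fm)$r.squared)

c(slope = slope, r_direct = r_direct, r_from_fit = r_from_fit)
```

The last two values agree, and both stay within [-1, 1], while the slope depends on the units of c1 and c2.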

linear regression r comparing multiple observations vs single observation

Based on the answers to my earlier question, I should get the same intercept and regression coefficient from the two models below. But they are not the same. What is going on?
Is something wrong with my code? Or is the original answer wrong?
#linear regression average qty per price point vs all quantities
x1=rnorm(30,20,1);y1=rep(3,30)
x2=rnorm(30,17,1.5);y2=rep(4,30)
x3=rnorm(30,12,2);y3=rep(4.5,30)
x4=rnorm(30,6,3);y4=rep(5.5,30)
x=c(x1,x2,x3,x4)
y=c(y1,y2,y3,y4)
plot(y,x)
cor(y,x)
fit=lm(x~y)
attributes(fit)
summary(fit)
xdum=c(20,17,12,6)
ydum=c(3,4,4.5,5.5)
plot(ydum,xdum)
cor(ydum,xdum)
fit1=lm(xdum~ydum)
attributes(fit1)
summary(fit1)
> summary(fit)
Call:
lm(formula = x ~ y)
Residuals:
Min 1Q Median 3Q Max
-8.3572 -1.6069 -0.1007 2.0222 6.4904
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 40.0952 1.1570 34.65 <2e-16 ***
y -6.1932 0.2663 -23.25 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.63 on 118 degrees of freedom
Multiple R-squared: 0.8209, Adjusted R-squared: 0.8194
F-statistic: 540.8 on 1 and 118 DF, p-value: < 2.2e-16
> summary(fit1)
Call:
lm(formula = xdum ~ ydum)
Residuals:
1 2 3 4
-0.9615 1.8077 -0.3077 -0.5385
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 38.2692 3.6456 10.497 0.00895 **
ydum -5.7692 0.8391 -6.875 0.02051 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.513 on 2 degrees of freedom
Multiple R-squared: 0.9594, Adjusted R-squared: 0.9391
F-statistic: 47.27 on 1 and 2 DF, p-value: 0.02051
You are not calculating xdum and ydum in a comparable fashion because rnorm will only approximate the mean value you specify, particularly when you are sampling only 30 cases. This is easily fixed however:
coef(fit)
#(Intercept) y
# 39.618472 -6.128739
xdum <- c(mean(x1),mean(x2),mean(x3),mean(x4))
ydum <- c(mean(y1),mean(y2),mean(y3),mean(y4))
coef(lm(xdum~ydum))
#(Intercept) ydum
# 39.618472 -6.128739
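The same check as a self-contained, repeatable sketch. The seed and the names fit_all, fit_means, xbar and ybar are assumptions for the illustration. The agreement relies on every group having the same size (30 here).

```r
set.seed(123)                      # assumption: any seed would do
# Four groups of 30 responses; the predictor is constant within each group
x <- c(rnorm(30, 20, 1), rnorm(30, 17, 1.5), rnorm(30, 12, 2), rnorm(30, 6, 3))
y <- rep(c(3, 4, 4.5, 5.5), each = 30)

fit_all <- lm(x ~ y)               # all 120 observations

# One point per group: the observed group mean, not the nominal rnorm mean
xbar <- as.vector(tapply(x, y, mean))
ybar <- sort(unique(y))
fit_means <- lm(xbar ~ ybar)       # four points

same_coefs <- isTRUE(all.equal(unname(coef(fit_all)), unname(coef(fit_means))))
same_coefs
```

Only the coefficients match; the standard errors and R-squared differ, because the four-point fit no longer sees the within-group scatter.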
In theory they should be the same if (and only if) each point in the latter model equals the mean of the corresponding group in the former model.
This is not the case in your models, so the results are slightly different. For example, the mean of x1:
x1=rnorm(30,20,1)
mean(x1)
20.08353
where the point version is 20.
There are similar tiny differences from your other rnorm samples:
> mean(x2)
[1] 17.0451
> mean(x3)
[1] 11.72307
> mean(x4)
[1] 5.913274
Not that this really matters, but just FYI the standard nomenclature is that Y is the dependent variable and X is the independent variable, which you reversed. Makes no difference of course, but just so you know.

Select regression coefs by name

After running a regression, how can I select the variable name and corresponding parameter estimate?
For example, after running the following regression, I obtain:
set.seed(1)
n=1000
x=rnorm(n,0,1)
y=.6*x+rnorm(n,0,sqrt(1-.6)^2)
(reg1=summary(lm(y~x)))
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-1.2994 -0.2688 -0.0055 0.3022 1.4577
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.006475 0.013162 -0.492 0.623
x 0.602573 0.012723 47.359 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4162 on 998 degrees of freedom
Multiple R-squared: 0.6921, Adjusted R-squared: 0.6918
F-statistic: 2243 on 1 and 998 DF, p-value: < 2.2e-16
I would like to be able to select the coefficient by the variable names (e.g., (Intercept) -0.006475)
I have tried the following but nothing works...
attr(reg1$coefficients,"terms")
names(reg1$coefficients)
Note: reg1$coefficients[1,1] works, but I want to be able to select the value by name rather than by row / column index.
The package broom tidies a lot of regression models very nicely.
require(broom)
set.seed(1)
n=1000
x=rnorm(n,0,1)
y=.6*x+rnorm(n,0,sqrt(1-.6)^2)
model = lm(y~x)
tt <- tidy(model, conf.int=TRUE)
subset(tt,term=="x")
## term estimate std.error statistic p.value conf.low conf.high
## 2 x 0.602573 0.01272349 47.35908 1.687125e-257 0.5776051 0.6275409
with(tt,tt[term=="(Intercept)","estimate"])
## [1] -0.006474794
So, your code doesn't run the way you have it. I changed it a bit:
set.seed(1)
n=1000
x=rnorm(n,0,1)
y=.6*x+rnorm(n,0,sqrt(1-.6)^2)
model = lm(y~x)
Now, I can call coef(model)["x"] or coef(model)["(Intercept)"] and get the values.
> coef(model)["x"]
x
0.602573
> coef(model)["(Intercept)"]
(Intercept)
-0.006474794
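If you prefer to keep the summary object from the question (reg1), the coefficient table inside it is a matrix, so names() returns NULL but it can still be indexed by row and column name. A minimal sketch reusing the question's setup:

```r
set.seed(1)
n <- 1000
x <- rnorm(n, 0, 1)
y <- .6 * x + rnorm(n, 0, sqrt(1 - .6)^2)
reg1 <- summary(lm(y ~ x))

# A matrix has dimnames rather than names
names(reg1$coefficients)           # NULL
rownames(reg1$coefficients)        # "(Intercept)" "x"

# Select by row name and column name
b0 <- reg1$coefficients["(Intercept)", "Estimate"]
bx <- reg1$coefficients["x", c("Estimate", "Std. Error")]
b0
bx
```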
