Extracting output from linear regression in r - r

I want to extract a selected output from the lm function. This is the code i have,
fastfood <- openintro::fastfood
L1 = lm(formula = calories~sat_fat +fiber+sugar, fastfood)
summary(L1)
This is the output
Call:
lm(formula = calories ~ sat_fat + fiber + sugar, data = fastfood)
Residuals:
Min 1Q Median 3Q Max
-680.18 -88.97 -24.65 57.46 1501.07
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 113.334 15.760 7.191 2.36e-12 ***
sat_fat 30.839 1.180 26.132 < 2e-16 ***
fiber 24.396 2.444 9.983 < 2e-16 ***
sugar 8.890 1.120 7.938 1.37e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 160.8 on 499 degrees of freedom
(12 observations deleted due to missingness)
Multiple R-squared: 0.6726, Adjusted R-squared: 0.6707
F-statistic: 341.7 on 3 and 499 DF, p-value: < 2.2e-16
I need to extract only the follwing from above output ? How do i get to this?
sat_fat 30.839.

Most commonly coef() is used to return the coefficients e.g.
coef(L1)
coef(L1)['sat_fat']
You may also want to look at tidy in the broom package which returns a nice summary as a dataframe, with coefficients in the estimate column.

Related

Function to determine if f statistic is significant

Is there a function in R to calculate the critical value of F-statistic and compare it to the F-statistic to determine if it is significant or not? I have to calculate thousands of linear models and at the end create a dataframe with the r squared values, p-values, f-statistic, coefficients etc. for each linear model.
> summary(mod)
Call:
lm(formula = log2umi ~ Age + Sex, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.01173 -0.01173 -0.01173 -0.01152 0.98848
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0115203 0.0018178 6.337 2.47e-10 ***
Age -0.0002679 0.0006053 -0.443 0.658
SexM 0.0002059 0.0024710 0.083 0.934
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1071 on 7579 degrees of freedom
Multiple R-squared: 2.644e-05, Adjusted R-squared: -0.0002374
F-statistic: 0.1002 on 2 and 7579 DF, p-value: 0.9047
I am aware of this question: How do I get R to spit out the critical value for F-statistic based on ANOVA?
But is there one function on its own that will compare the two values and spit out True or False?
EDIT:
I wrote this, but just out of curiosity if anyone knows a better way please let me know.
f_sig is a named vector that I will later add to the dataframe
model <- lm(log2umi~Age + Sex, df)
f_crit <- qf(1-0.05, summary(model)$fstatistic[2], summary(model)$fstatistic[3] )
f <- summary(mod)$fstatistic[1]
if (f > f_crit) {
f_sig[gen] = 0 #True
} else {
f_sig[gen] = 1 #False
}

Swap x and y variables in lm() function in R

Trying to get a summary output of the linear model created with the lm() function in R but no matter which way I set it up I get my desired y value as my x input. My desired output is the summary of the model where Winnings is the y output and averagedist is the input. This is my current output:
Call:
lm(formula = Winnings ~ averagedist, data = combineddata)
Residuals:
Min 1Q Median 3Q Max
-20.4978 -5.2992 -0.3824 6.0887 23.4764
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.882e+02 7.577e-01 380.281 < 2e-16 ***
Winnings 1.293e-06 2.023e-07 6.391 8.97e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.343 on 232 degrees of freedom
Multiple R-squared: 0.1497, Adjusted R-squared: 0.146
F-statistic: 40.84 on 1 and 232 DF, p-value: 8.967e-10
Have tried flipping the order and defining the variables using y = winnings, x = averagedist but always get the same output.
Used summary(lm(Winnings ~ averagedist,combineddata)) as an alternative way to set it up and that seemed to do the trick, as opposed the two step method:
str<-lm(Winnings ~ averagedist,combineddata)
summary(str)

linear regression r comparing multiple observations vs single observation

Based upon answers of my question, I am supposed to get same values of intercept and the regression coefficient for below 2 models. But they are not the same. What is going on?
is something wrong with my code? Or is the original answer wrong?
#linear regression average qty per price point vs all quantities
x1=rnorm(30,20,1);y1=rep(3,30)
x2=rnorm(30,17,1.5);y2=rep(4,30)
x3=rnorm(30,12,2);y3=rep(4.5,30)
x4=rnorm(30,6,3);y4=rep(5.5,30)
x=c(x1,x2,x3,x4)
y=c(y1,y2,y3,y4)
plot(y,x)
cor(y,x)
fit=lm(x~y)
attributes(fit)
summary(fit)
xdum=c(20,17,12,6)
ydum=c(3,4,4.5,5.5)
plot(ydum,xdum)
cor(ydum,xdum)
fit1=lm(xdum~ydum)
attributes(fit1)
summary(fit1)
> summary(fit)
Call:
lm(formula = x ~ y)
Residuals:
Min 1Q Median 3Q Max
-8.3572 -1.6069 -0.1007 2.0222 6.4904
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 40.0952 1.1570 34.65 <2e-16 ***
y -6.1932 0.2663 -23.25 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.63 on 118 degrees of freedom
Multiple R-squared: 0.8209, Adjusted R-squared: 0.8194
F-statistic: 540.8 on 1 and 118 DF, p-value: < 2.2e-16
> summary(fit1)
Call:
lm(formula = xdum ~ ydum)
Residuals:
1 2 3 4
-0.9615 1.8077 -0.3077 -0.5385
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 38.2692 3.6456 10.497 0.00895 **
ydum -5.7692 0.8391 -6.875 0.02051 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.513 on 2 degrees of freedom
Multiple R-squared: 0.9594, Adjusted R-squared: 0.9391
F-statistic: 47.27 on 1 and 2 DF, p-value: 0.02051
You are not calculating xdum and ydum in a comparable fashion because rnorm will only approximate the mean value you specify, particularly when you are sampling only 30 cases. This is easily fixed however:
coef(fit)
#(Intercept) y
# 39.618472 -6.128739
xdum <- c(mean(x1),mean(x2),mean(x3),mean(x4))
ydum <- c(mean(y1),mean(y2),mean(y3),mean(y4))
coef(lm(xdum~ydum))
#(Intercept) ydum
# 39.618472 -6.128739
In theory they should be the same if (and only if) the mean of the former model is equal to the point in the latter model.
This is not the case in your models, so the results are slightly different. For example the mean of x1:
x1=rnorm(30,20,1)
mean(x1)
20.08353
where the point version is 20.
There are similar tiny differences from your other rnorm samples:
> mean(x2)
[1] 17.0451
> mean(x3)
[1] 11.72307
> mean(x4)
[1] 5.913274
Not that this really matters, but just FYI the standard nomenclature is that Y is the dependent variable and X is the independent variable, which you reversed. Makes no difference of course, but just so you know.

Equation for Standard Error

All,
I am trying to figure out the formula that computes the std.error for the factors for the below regression and how to compute that using the mean and sd functions. (std.error=2.015). Please help.
Thanks.
Rik
k=5;n=4;
s1=4;s2=8;mu1=75
factor1=as.factor(rep(1:k,n))
sim1=rep(rnorm(k,mu1,s2),n)
sim2=rep(rnorm(k*n,0,s1))
sim=sim1+sim2
options(contrasts=c("contr.sum","contr.poly"))
lm1=lm(sim~factor1)
> summary(lm1)
Call:
lm(formula = sim ~ factor1)
Residuals:
Min 1Q Median 3Q Max
-8.2234 -2.3561 0.7269 2.9855 7.9084
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 79.513 1.007 78.923 < 2e-16 ***
factor11 6.216 2.015 3.085 0.007545 **
factor12 -1.051 2.015 -0.522 0.609399
factor13 9.101 2.015 4.517 0.000409 ***
factor14 -4.543 2.015 -2.255 0.039534 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.506 on 15 degrees of freedom
Multiple R-squared: 0.7575, Adjusted R-squared: 0.6928
F-statistic: 11.71 on 4 and 15 DF, p-value: 0.0001624
Try any of these to get the std error:
sqrt(sum(resid(lm1)^2)/(length(factor1) - nlevels(factor1)))
sqrt(deviance(lm1)/(length(factor1) - nlevels(factor1)))
summary.lm(lm1)$sigma
library(broom); glance(lm1)$sigma
If you want the std error of the coefficients then if se is any of the above then:
sqrt(diag(vcov(lm1)))
se * sqrt(diag(solve(crossprod(model.matrix(lm1)))))
se * sqrt(diag(summary.lm(lm1)$cov))
coef(summary(lm1))[, 2]
library(broom); tidy(lm1)$std.error
Note that (1) since the question did not use set.seed to set the random number generator to a known state the data is not reproducible and (2) As mentioned in the comments summary.lm source code will give the details of how it does it which may not be precisely the same as we showed here but would be equivalent except for numerical error.

Polynomial Regression nonsense Predictions

Suppose I want to fit a linear regression model with degree two (orthogonal) polynomial and then predict the response. Here are the codes for the first model (m1)
x=1:100
y=-2+3*x-5*x^2+rnorm(100)
m1=lm(y~poly(x,2))
prd.1=predict(m1,newdata=data.frame(x=105:110))
Now let's try the same model but instead of using $poly(x,2)$, I will use its columns like:
m2=lm(y~poly(x,2)[,1]+poly(x,2)[,2])
prd.2=predict(m2,newdata=data.frame(x=105:110))
Let's look at the summaries of m1 and m2.
> summary(m1)
Call:
lm(formula = y ~ poly(x, 2))
Residuals:
Min 1Q Median 3Q Max
-2.50347 -0.48752 -0.07085 0.53624 2.96516
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.677e+04 9.912e-02 -169168 <2e-16 ***
poly(x, 2)1 -1.449e+05 9.912e-01 -146195 <2e-16 ***
poly(x, 2)2 -3.726e+04 9.912e-01 -37588 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9912 on 97 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 1.139e+10 on 2 and 97 DF, p-value: < 2.2e-16
> summary(m2)
Call:
lm(formula = y ~ poly(x, 2)[, 1] + poly(x, 2)[, 2])
Residuals:
Min 1Q Median 3Q Max
-2.50347 -0.48752 -0.07085 0.53624 2.96516
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.677e+04 9.912e-02 -169168 <2e-16 ***
poly(x, 2)[, 1] -1.449e+05 9.912e-01 -146195 <2e-16 ***
poly(x, 2)[, 2] -3.726e+04 9.912e-01 -37588 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9912 on 97 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 1.139e+10 on 2 and 97 DF, p-value: < 2.2e-16
So m1 and m2 are basically the same. Now let's look at the predictions prd.1 and prd.2
> prd.1
1 2 3 4 5 6
-54811.60 -55863.58 -56925.56 -57997.54 -59079.52 -60171.50
> prd.2
1 2 3 4 5 6
49505.92 39256.72 16812.28 -17827.42 -64662.35 -123692.53
Q1: Why prd.2 is significantly different from prd.1?
Q2: How can I obtain prd.1 using the model m2?
m1 is the right way to do this. m2 is entering a whole world of pain...
To do predictions from m2, the model needs to know it was fitted to an orthogonal set of basis functions, so that it uses the same basis functions for the extrapolated new data values. Compare: poly(1:10,2)[,2] with poly(1:12,2)[,2] - the first ten values are not the same. If you fit the model explicitly with poly(x,2) then predict understands all that and does the right thing.
What you have to do is make sure your predicted locations are transformed using the same set of basis functions as used to create the model in the first place. You can use predict.poly for this (note I call my explanatory variables x1 and x2 so that its easy to match the names up):
px = poly(x,2)
x1 = px[,1]
x2 = px[,2]
m3 = lm(y~x1+x2)
newx = 90:110
pnew = predict(px,newx) # px is the previous poly object, so this calls predict.poly
prd.3 = predict(m3, newdata=data.frame(x1=pnew[,1],x2=pnew[,2]))

Resources