Swap x and y variables in lm() function in R

I'm trying to get summary output for a linear model created with the lm() function in R, but no matter how I set it up, my desired y variable ends up as the x input. I want the summary of the model where Winnings is the response (y) and averagedist is the predictor (x). This is my current output:
Call:
lm(formula = averagedist ~ Winnings, data = combineddata)
Residuals:
     Min       1Q   Median       3Q      Max 
-20.4978  -5.2992  -0.3824   6.0887  23.4764 
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.882e+02  7.577e-01 380.281  < 2e-16 ***
Winnings    1.293e-06  2.023e-07   6.391 8.97e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.343 on 232 degrees of freedom
Multiple R-squared:  0.1497,  Adjusted R-squared:  0.146
F-statistic: 40.84 on 1 and 232 DF,  p-value: 8.967e-10
I have tried flipping the order and defining the variables explicitly (y = Winnings, x = averagedist), but I always get the same output.

Using summary(lm(Winnings ~ averagedist, combineddata)) as an alternative way to set it up seemed to do the trick, as opposed to the two-step method:
str <- lm(Winnings ~ averagedist, combineddata)
summary(str)
(Both forms fit the same model; note also that str is a poor variable name here, since it masks the base function str().)
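A minimal sketch of the intended setup, for reference (assuming combineddata contains Winnings and averagedist columns); whichever form you use, the variable on the left of ~ is the response:
fit <- lm(Winnings ~ averagedist, data = combineddata)  # response ~ predictor
summary(fit)      # Winnings is y; averagedist is x
names(coef(fit))  # "(Intercept)" "averagedist" -- confirms the predictor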

Related

Extracting output from linear regression in R

I want to extract a selected output from the lm function. This is the code I have:
fastfood <- openintro::fastfood
L1 <- lm(calories ~ sat_fat + fiber + sugar, data = fastfood)
summary(L1)
This is the output
Call:
lm(formula = calories ~ sat_fat + fiber + sugar, data = fastfood)
Residuals:
    Min      1Q  Median      3Q     Max 
-680.18  -88.97  -24.65   57.46 1501.07 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  113.334     15.760   7.191 2.36e-12 ***
sat_fat       30.839      1.180  26.132  < 2e-16 ***
fiber         24.396      2.444   9.983  < 2e-16 ***
sugar          8.890      1.120   7.938 1.37e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 160.8 on 499 degrees of freedom
  (12 observations deleted due to missingness)
Multiple R-squared:  0.6726,  Adjusted R-squared:  0.6707
F-statistic: 341.7 on 3 and 499 DF,  p-value: < 2.2e-16
I need to extract only the following from the above output. How do I get to this?
sat_fat 30.839
Most commonly, coef() is used to return the coefficients, e.g.
coef(L1)
coef(L1)['sat_fat']
You may also want to look at tidy() in the broom package, which returns a tidy summary as a data frame, with the coefficients in the estimate column.
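For instance, a short sketch of both approaches, using the L1 model fitted above:
coef(L1)["sat_fat"]                              # named numeric: sat_fat 30.839
summary(L1)$coefficients["sat_fat", "Estimate"]  # same value via the summary table
library(broom)
tidy(L1)                                         # one row per term, as a data frame
subset(tidy(L1), term == "sat_fat")$estimate     # just the sat_fat estimate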

Function to determine if f statistic is significant

Is there a function in R that calculates the critical value of the F-statistic and compares it to the observed F-statistic to determine whether it is significant? I have to fit thousands of linear models and at the end build a dataframe with the r-squared values, p-values, F-statistics, coefficients, etc. for each model.
> summary(mod)
Call:
lm(formula = log2umi ~ Age + Sex, data = df)
Residuals:
     Min       1Q   Median       3Q      Max 
-0.01173 -0.01173 -0.01173 -0.01152  0.98848 
Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.0115203  0.0018178   6.337 2.47e-10 ***
Age         -0.0002679  0.0006053  -0.443    0.658    
SexM         0.0002059  0.0024710   0.083    0.934    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1071 on 7579 degrees of freedom
Multiple R-squared:  2.644e-05, Adjusted R-squared:  -0.0002374
F-statistic: 0.1002 on 2 and 7579 DF,  p-value: 0.9047
I am aware of this question: How do I get R to spit out the critical value for F-statistic based on ANOVA?
But is there a single function that will compare the two values and spit out TRUE or FALSE?
EDIT:
I wrote this, but just out of curiosity, if anyone knows a better way please let me know.
f_sig is a named vector that I will later add to the dataframe:
model <- lm(log2umi ~ Age + Sex, df)
fstat <- summary(model)$fstatistic          # value, numdf, dendf
f_crit <- qf(1 - 0.05, fstat[2], fstat[3])  # critical value at alpha = 0.05
f_sig[gen] <- fstat[1] > f_crit             # TRUE if the F-statistic is significant
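A cleaner alternative (a sketch; f_significant is a hypothetical helper name) is to skip the critical value entirely and compare the model's overall p-value to alpha, computing it with pf() from the stored F-statistic:
f_significant <- function(fit, alpha = 0.05) {
  fstat <- summary(fit)$fstatistic  # value, numdf, dendf
  p <- pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)
  unname(p < alpha)                 # TRUE if the overall F-test is significant
}
f_significant(lm(log2umi ~ Age + Sex, df))  # FALSE for the model above (p = 0.9047)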

Linear regression in R: comparing multiple observations vs single observation

Based on the answers to my earlier question, I am supposed to get the same intercept and regression coefficient for the two models below, but they are not the same. What is going on?
Is something wrong with my code, or is the original answer wrong?
# linear regression: average qty per price point vs all quantities
x1 <- rnorm(30, 20, 1);   y1 <- rep(3, 30)
x2 <- rnorm(30, 17, 1.5); y2 <- rep(4, 30)
x3 <- rnorm(30, 12, 2);   y3 <- rep(4.5, 30)
x4 <- rnorm(30, 6, 3);    y4 <- rep(5.5, 30)
x <- c(x1, x2, x3, x4)
y <- c(y1, y2, y3, y4)
plot(y, x)
cor(y, x)
fit <- lm(x ~ y)
attributes(fit)
summary(fit)
xdum <- c(20, 17, 12, 6)
ydum <- c(3, 4, 4.5, 5.5)
plot(ydum, xdum)
cor(ydum, xdum)
fit1 <- lm(xdum ~ ydum)
attributes(fit1)
summary(fit1)
> summary(fit)
Call:
lm(formula = x ~ y)
Residuals:
    Min      1Q  Median      3Q     Max 
-8.3572 -1.6069 -0.1007  2.0222  6.4904 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  40.0952     1.1570   34.65   <2e-16 ***
y            -6.1932     0.2663  -23.25   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.63 on 118 degrees of freedom
Multiple R-squared:  0.8209,  Adjusted R-squared:  0.8194
F-statistic: 540.8 on 1 and 118 DF,  p-value: < 2.2e-16
> summary(fit1)
Call:
lm(formula = xdum ~ ydum)
Residuals:
      1       2       3       4 
-0.9615  1.8077 -0.3077 -0.5385 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  38.2692     3.6456  10.497  0.00895 **
ydum         -5.7692     0.8391  -6.875  0.02051 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.513 on 2 degrees of freedom
Multiple R-squared:  0.9594,  Adjusted R-squared:  0.9391
F-statistic: 47.27 on 1 and 2 DF,  p-value: 0.02051
You are not calculating xdum and ydum in a comparable fashion, because rnorm only approximates the mean value you specify, particularly when you sample only 30 cases. This is easily fixed, however:
coef(fit)
# (Intercept)           y
#   39.618472   -6.128739
xdum <- c(mean(x1), mean(x2), mean(x3), mean(x4))
ydum <- c(mean(y1), mean(y2), mean(y3), mean(y4))
coef(lm(xdum ~ ydum))
# (Intercept)        ydum
#   39.618472   -6.128739
In theory they should be the same if (and only if) the group means used in the former model equal the single points used in the latter. That is not the case in your models, so the results are slightly different. For example, the mean of x1:
x1 <- rnorm(30, 20, 1)
mean(x1)
# [1] 20.08353
whereas the corresponding point value is exactly 20.
There are similar tiny differences in your other rnorm samples:
> mean(x2)
[1] 17.0451
> mean(x3)
[1] 11.72307
> mean(x4)
[1] 5.913274
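If you'd rather not type the means out, here is a compact check (a sketch, assuming the x and y vectors from the question are still in the workspace): with equally sized groups, fitting on the group means reproduces the full-data fit exactly.
grp <- tapply(x, y, mean)     # mean of x within each distinct y value
ym <- as.numeric(names(grp))  # the distinct y values themselves
xm <- as.numeric(grp)
all.equal(unname(coef(lm(x ~ y))), unname(coef(lm(xm ~ ym))))
# [1] TRUE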
Not that this really matters, but FYI: the standard nomenclature is that Y is the dependent variable and X is the independent variable, which you have reversed. It makes no difference to the fit, of course, but just so you know.

R: Translate the results from lm() to an equation

I'm using R and I want to translate the results from lm() to an equation.
My model is:
Residuals:
      Min        1Q    Median        3Q       Max 
-0.048110 -0.023948 -0.000376  0.024511  0.044190 
Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)           3.17691    0.00909  349.50  < 2e-16 ***
poly(QPB2_REF1, 2)1   0.64947    0.03015   21.54 2.66e-14 ***
poly(QPB2_REF1, 2)2   0.10824    0.03015    3.59  0.00209 ** 
B2DBSA_REF1DONSON    -0.20959    0.01286  -16.30 3.17e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.03015 on 18 degrees of freedom
Multiple R-squared:  0.9763,  Adjusted R-squared:  0.9724
F-statistic: 247.6 on 3 and 18 DF,  p-value: 8.098e-15
Do you have any idea how to do this?
I tried something like
f <- function(x) {3.17691 + 0.64947*x + 0.10824*x^2 - 0.20959*1 + 0.03015^2}
but when I plug in an x, the value of f(x) is incorrect.
Your output indicates that the model was fitted using the poly function, which by default orthogonalizes the polynomials (this includes centering the x values, among other things). Your hand-written formula does no orthogonalization, and that is the likely source of the difference. You can refit the model using raw = TRUE in the call to poly to get raw coefficients that can be multiplied by x and x^2.
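For example, a sketch of the refit (dat and y are hypothetical stand-ins for the data frame and response, which are not shown in the question):
fit_raw <- lm(y ~ poly(QPB2_REF1, 2, raw = TRUE) + B2DBSA_REF1, data = dat)
b <- coef(fit_raw)  # intercept, x, x^2, then the B2DBSA_REF1DONSON dummy
f <- function(x, donson = 1) b[1] + b[2]*x + b[3]*x^2 + b[4]*donson
Also note that the equation should not contain a residual-standard-error term: the 0.03015^2 in your attempt does not belong there.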
You may also be interested in the Function function in the rms package, which automates creating functions from fitted models.
Edit
Here is an example:
library(rms)
xx <- 1:25
yy <- 5 - 1.5*xx + 0.1*xx^2 + rnorm(25)
plot(xx,yy)
fit <- ols(yy ~ pol(xx,2))
mypred <- Function(fit)
curve(mypred, add=TRUE)
mypred( c(1,25, 3, 3.5))
Note that you need to use the rms functions for fitting (ols and pol in this example, instead of lm and poly).
If you want to calculate y-hat based on the model, you can just use predict!
Example:
set.seed(123)
my_dat <- data.frame(x = 1:10, e = rnorm(10))
my_dat$y <- with(my_dat, x * 2 + e)
my_lm <- lm(y ~ x, data = my_dat)
summary(my_lm)
Result:
Call:
lm(formula = y ~ x, data = my_dat)
Residuals:
    Min      1Q  Median      3Q     Max 
-1.1348 -0.5624 -0.1393  0.3854  1.6814 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.5255     0.6673   0.787    0.454    
x             1.9180     0.1075  17.835    1e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9768 on 8 degrees of freedom
Multiple R-squared:  0.9755,  Adjusted R-squared:  0.9724
F-statistic: 318.1 on 1 and 8 DF,  p-value: 1e-07
Now, instead of building a function like 0.5255 + x * 1.9180 by hand, I just call predict on my_lm:
predict(my_lm, data.frame(x=11:20))
Same result as computing it by hand (up to minor rounding of the slope/intercept estimates):
0.5255 + (11:20) * 1.9180
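As a side note, calling predict with no newdata returns the fitted values for the original data, identical to fitted():
all.equal(predict(my_lm), fitted(my_lm))
# [1] TRUE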
If you want to actually visualize or write out a complex equation (e.g. one with restricted cubic spline transformations), I recommend fitting your model with the rms package and using its latex function to see the equation in LaTeX:
my_lm <- ols(y~x, data=my_dat)
latex(my_lm)
Note that you will need to render the LaTeX code to see your equation. There are websites that will do this for you and, if you are using a Mac, the MacTeX software.

Strange abline behavior when inverting X and Y

I'm trying to draw a regression line for two variables, WMC and BUG.
When BUG is on the X axis, the regression line seems perfect.
However, when BUG is on the Y axis and WMC on the X axis, the line behaves strangely and doesn't seem to fit the plot at all. What am I doing wrong?
reg1 <- lm(WMC ~ BUG)
plot(BUG, WMC)
abline(reg1)
reg1 <- lm(BUG ~ WMC)
plot(WMC, BUG)
abline(reg1)
Yeah, I'm a stats noob.
Ignoring the rationale for your models for the moment: there is nothing wrong with how the lines are plotted. They seem to fit differently because you are estimating two different models.
So let's start with your first model, where you regress wmc on bug:
m1 <- lm(wmc ~ bug, df); summary(m1)
Call:
lm(formula = wmc ~ bug, data = df)
Residuals:
     Min       1Q   Median       3Q      Max 
-2.17555 -0.55069  0.00892  0.46091  2.23740 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.93699    0.03057   63.37   <2e-16 ***
bug          0.84808    0.05926   14.31   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7508 on 743 degrees of freedom
Multiple R-squared:  0.2161,  Adjusted R-squared:  0.215
F-statistic: 204.8 on 1 and 743 DF,  p-value: < 2.2e-16
This model tells us that, when regressing wmc on bug, a one-unit increase in bug corresponds to a 0.85 increase in wmc (I don't know the original units of your measurements). This is reflected in the plot: look at the intercept value and the slope of the line over a one-unit increase in bug.
Now in the second model you do the opposite and regress bug on wmc:
m2 <- lm(bug ~ wmc, df); summary(m2)
Call:
lm(formula = bug ~ wmc, data = df)
Residuals:
     Min       1Q   Median       3Q      Max 
-0.74635 -0.26947 -0.09287  0.14058  1.67470 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.31717    0.04077  -7.779 2.44e-14 ***
wmc          0.25477    0.01780  14.310  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4115 on 743 degrees of freedom
Multiple R-squared:  0.2161,  Adjusted R-squared:  0.215
F-statistic: 204.8 on 1 and 743 DF,  p-value: < 2.2e-16
So in this case a one-unit increase in wmc corresponds to a 0.25 increase in bug. This is also reflected in the plot; again, note the value of the intercept and the slope of the line with regard to a one-unit increase in wmc.
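The two fitted lines are not inverses of each other, and they should not be: the OLS slope of y on x is r·sd(y)/sd(x), so the two slopes multiply to r², not 1. Indeed, 0.84808 × 0.25477 ≈ 0.2161, exactly the Multiple R-squared reported in both summaries. A quick sketch with simulated stand-ins for the data (which are not shown in the question):
set.seed(42)
bug <- rnorm(745)
wmc <- 1.94 + 0.85 * bug + rnorm(745, sd = 0.75)
b1 <- coef(lm(wmc ~ bug))[2]  # slope of wmc on bug
b2 <- coef(lm(bug ~ wmc))[2]  # slope of bug on wmc
b1 * b2                       # equals cor(wmc, bug)^2, not 1
cor(wmc, bug)^2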
