R: Translate the results from lm() to an equation

I'm using R and I want to translate the results from lm() to an equation.
My model is:
Residuals:
      Min        1Q    Median        3Q       Max 
-0.048110 -0.023948 -0.000376  0.024511  0.044190 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)           3.17691    0.00909  349.50  < 2e-16 ***
poly(QPB2_REF1, 2)1   0.64947    0.03015   21.54 2.66e-14 ***
poly(QPB2_REF1, 2)2   0.10824    0.03015    3.59  0.00209 ** 
B2DBSA_REF1DONSON    -0.20959    0.01286  -16.30 3.17e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.03015 on 18 degrees of freedom
Multiple R-squared:  0.9763,  Adjusted R-squared:  0.9724
F-statistic: 247.6 on 3 and 18 DF,  p-value: 8.098e-15
Do you have any idea?
I tried to write something like
f <- function(x) {3.17691 + 0.64947*x + 0.10824*x^2 - 0.20959*1 + 0.03015^2}
but when I set an x, the f(x) value is incorrect.

Your output indicates that the model includes use of the poly function, which by default orthogonalizes the polynomials (this includes centering the x's, among other things). In your formula no orthogonalization is done, and that is the likely difference. You can refit the model using raw=TRUE in the call to poly to get raw coefficients that can be multiplied by x and x^2. (Note also that the residual standard error term 0.03015^2 in your function is not part of the prediction equation.)
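For example, a minimal sketch of such a refit; the response name y, the data frame name mydata, and the factor name B2DBSA_REF1 are assumptions inferred from your output, not confirmed names:
# refit with raw (non-orthogonal) polynomial terms; y, mydata, B2DBSA_REF1 are assumed names
fit_raw <- lm(y ~ poly(QPB2_REF1, 2, raw = TRUE) + B2DBSA_REF1, data = mydata)
coef(fit_raw)
# these coefficients plug directly into
# y = b0 + b1*x + b2*x^2 + b3*(B2DBSA_REF1 == "DONSON")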
You may also be interested in the Function function in the rms package which automates creating functions from fitted models.
Edit
Here is an example:
library(rms)
xx <- 1:25
yy <- 5 - 1.5*xx + 0.1*xx^2 + rnorm(25)  # quadratic trend plus noise
plot(xx, yy)
fit <- ols(yy ~ pol(xx, 2))   # rms analogues of lm() and poly()
mypred <- Function(fit)       # turn the fitted model into an R function
curve(mypred, add=TRUE)       # overlay the fitted curve on the scatterplot
mypred(c(1, 25, 3, 3.5))      # evaluate the fitted equation at chosen x values
Note that you need to use the rms functions for fitting (ols and pol in this example, instead of lm and poly).

If you want to calculate y-hat based on the model, you can just use predict!
Example:
set.seed(123)
my_dat <- data.frame(x=1:10, e=rnorm(10))
my_dat$y <- with(my_dat, x*2 + e)
my_lm <- lm(y~x, data=my_dat)
summary(my_lm)
Result:
Call:
lm(formula = y ~ x, data = my_dat)
Residuals:
    Min      1Q  Median      3Q     Max 
-1.1348 -0.5624 -0.1393  0.3854  1.6814 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.5255     0.6673   0.787    0.454    
x             1.9180     0.1075  17.835    1e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9768 on 8 degrees of freedom
Multiple R-squared:  0.9755,  Adjusted R-squared:  0.9724
F-statistic: 318.1 on 1 and 8 DF,  p-value: 1e-07
Now, instead of making a function like 0.5255 + x * 1.9180 manually, I just call predict for my_lm:
predict(my_lm, data.frame(x=11:20))
Same result as this (not counting minor errors from rounding the slope/intercept estimates):
0.5255 + (11:20) * 1.9180
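Equivalently, you can pull the unrounded estimates out of the fit with coef() instead of typing them in:
b <- coef(my_lm)        # unrounded intercept and slope
b[1] + (11:20) * b[2]   # matches predict() exactly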

If you want to actually visualize or write out a complex equation (e.g. one involving restricted cubic spline transformations), I recommend using the rms package: fit your model, then use the latex function to see it in LaTeX:
my_lm <- ols(y ~ x, data=my_dat)
latex(my_lm)
Note that you will need to render the LaTeX code to see your equation. There are websites that will do this for you and, if you are using a Mac, the MacTeX software.

Related

Creating a Line of Best Fit in R

Hi, I was wondering if anyone could help me with how to write a function in R, as I am totally new to computer programming. So apologies if this seems like a silly question to this audience.
I am currently trying to add a line of best fit to my scatter graph, but I'm not quite understanding how to do this. I've tried several functions like "abline" and "lm", but I'm not sure whether I am using the right approach or passing the right numbers to the functions.
I have cleared the workspace and left just my graph and previous workings so that it looks neater.
Thanks in advance for the help.
reproducible data example:
set.seed(2348907)
x <- rnorm(100)
y <- 2*(x+rnorm(100))
then this makes a linear model for the intercept and slope:
lmodel <- lm(y ~ x)
which now contains the intercept (Intercept term) and the slope (as the coefficient of the x variable in the model). Here is a summary of the model:
summary(lmodel)
Call:
lm(formula = y ~ x)
Residuals:
    Min      1Q  Median      3Q     Max 
-3.8412 -1.0767 -0.1808  1.2216  4.1540 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.0454     0.1851   0.245    0.807    
x             2.1087     0.1814  11.627   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.848 on 98 degrees of freedom
Multiple R-squared:  0.5797,  Adjusted R-squared:  0.5754
F-statistic: 135.2 on 1 and 98 DF,  p-value: < 2.2e-16
then make the plot using the coef() function to pull out the intercept and slope from the linear model:
plot(x,y) # plots the points
abline(a=coef(lmodel)[1], b=coef(lmodel)[2]) # plots the line, a=intercept, b=slope
Personally, I prefer ggplot2 for such things, which would look like this:
library(ggplot2)
ggplot() + geom_point(aes(x, y)) +
  geom_smooth(aes(x, y), method="lm", se=FALSE)

R: Same values for confidence and prediction intervals using predict()

When I try to construct prediction and confidence intervals around a linear regression model with two continuous variables and two categorical variables (which may act as dummy variables), the results for the two intervals are exactly the same. I'm using the predict() function.
I've already tried with other datasets that have continuous and discrete variables, but no categorical or dichotomous ones, and there the intervals are different. I tried removing some variables from the regression model, and the intervals are still the same. I've also compared my data.frame with the ones used as examples in the R documentation, and I don't think the problem is there.
# linear regression model: modeloReducido
summary(modeloReducido)

Call: lm(formula = V ~ T * W + P * G, data = Datos)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.5579 -1.6222  0.3286  1.6175 10.4773 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.937674   3.710133   0.253 0.800922    
T           -12.864441   2.955519  -4.353 2.91e-05 ***
W             0.013926   0.001432   9.722  < 2e-16 ***
P            12.142109   1.431102   8.484 8.14e-14 ***
GBaja        15.953421   4.513963   3.534 0.000588 ***
GMedia        0.597568   4.546935   0.131 0.895669    
T:W           0.014283   0.001994   7.162 7.82e-11 ***
P:GBaja      -3.249681   2.194803  -1.481 0.141418    
P:GMedia     -5.093860   2.147673  -2.372 0.019348 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.237 on 116 degrees of freedom
Multiple R-squared:  0.9354,  Adjusted R-squared:  0.931
F-statistic: 210 on 8 and 116 DF,  p-value: < 2.2e-16
# Prediction interval
newdata1.2 <- data.frame(T=1, W=1040, P=10000, G="Media")
opt1.PI <- predict.lm(modeloReducido, newdata1.2,
                      interval="prediction", level=.95)

# Confidence interval
newdata1.1 <- data.frame(T=1, W=1040, P=10000, G="Media")
opt1.CI <- predict(modeloReducido, newdata1.1,
                   interval="confidence", level=.95)

opt1.CI
#        fit      lwr      upr
# 1 70500.51 38260.24 102740.8
opt1.PI
#        fit      lwr      upr
# 1 70500.51 38260.24 102740.8
opt1.PI and opt1.CI should be different.
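For comparison, a minimal sketch with toy data showing the expected behavior: the prediction interval should be wider than the confidence interval, because it also accounts for the residual variance of a new observation:
set.seed(1)
d <- data.frame(x = 1:20)
d$y <- 2 + 3*d$x + rnorm(20)
m <- lm(y ~ x, data = d)
nd <- data.frame(x = 10)
predict(m, nd, interval = "confidence")   # narrower: uncertainty of the mean response
predict(m, nd, interval = "prediction")   # wider: adds residual variance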
The Excel file that I was given is at the following link:
https://www.filehosting.org/file/details/830581/Datos%20Tarea%204.xlsx

Function to determine if f statistic is significant

Is there a function in R to calculate the critical value of the F-statistic and compare it to the observed F-statistic to determine whether it is significant? I have to fit thousands of linear models and at the end create a data frame with the R-squared values, p-values, F-statistics, coefficients, etc. for each model.
> summary(mod)
Call:
lm(formula = log2umi ~ Age + Sex, data = df)
Residuals:
     Min       1Q   Median       3Q      Max 
-0.01173 -0.01173 -0.01173 -0.01152  0.98848 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.0115203  0.0018178   6.337 2.47e-10 ***
Age         -0.0002679  0.0006053  -0.443    0.658    
SexM         0.0002059  0.0024710   0.083    0.934    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1071 on 7579 degrees of freedom
Multiple R-squared:  2.644e-05,  Adjusted R-squared:  -0.0002374
F-statistic: 0.1002 on 2 and 7579 DF,  p-value: 0.9047
I am aware of this question: How do I get R to spit out the critical value for F-statistic based on ANOVA?
But is there one function on its own that will compare the two values and spit out True or False?
EDIT:
I wrote this, but just out of curiosity, if anyone knows a better way please let me know. f_sig is a named vector that I will later add to the data frame:
model <- lm(log2umi ~ Age + Sex, df)
f_crit <- qf(1 - 0.05, summary(model)$fstatistic[2], summary(model)$fstatistic[3])
f <- summary(model)$fstatistic[1]
if (f > f_crit) {
  f_sig[gen] <- 0 # True
} else {
  f_sig[gen] <- 1 # False
}
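A minimal alternative sketch, using the same summary() components: compute the overall p-value directly with pf() and compare it to the significance level, which gives TRUE/FALSE in one expression:
fs <- summary(model)$fstatistic                       # value, numdf, dendf
p_val <- pf(fs[1], fs[2], fs[3], lower.tail = FALSE)  # p-value of the overall F-test
f_sig[gen] <- p_val < 0.05                            # TRUE if significant at the 5% level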

plm vs lm - different results?

I tried several times to use lm and plm to run a regression, and I get different results.
First, I used lm as follows:
fixed.Region1 <- lm(CapNormChange ~ Policychanges + factor(Region),
data=Panel)
Then I used plm in the following way:
fixed.Region2 <- plm(CapNormChange ~ Policychanges+ factor(Region),
data=Panel, index=c("Region", "Year"), model="within", effect="individual")
I think there is something wrong with plm, because I don't see an intercept in the results (see below).
Furthermore, I am not entirely sure whether + factor(Region) is necessary; however, if it is not there, I don't see the coefficients (and significance) for the dummies.
So, my questions are:
Am I using the plm function wrong (or what is wrong about it)?
If not, how can it be that the results are different?
If somebody could give me a hint, I would really appreciate it.
Results from lm:
Call:
lm(formula = CapNormChange ~ Policychanges + factor(Region),
    data = Panel)

Residuals:
    Min      1Q  Median      3Q     Max 
-31.141  -4.856  -0.642   1.262 192.803 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      17.3488     4.9134   3.531 0.000558 ***
Policychanges                     0.6412     0.1215   5.277 4.77e-07 ***
factor(Region)Asia              -19.3377     6.7804  -2.852 0.004989 ** 
factor(Region)C America + Carib   0.1147     6.8049   0.017 0.986578    
factor(Region)Eurasia           -17.6476     6.8294  -2.584 0.010767 *  
factor(Region)Europe            -20.7759     8.8993  -2.335 0.020959 *  
factor(Region)Middle East       -17.3348     6.8285  -2.539 0.012200 *  
factor(Region)N America         -17.5932     6.8064  -2.585 0.010745 *  
factor(Region)Oceania           -14.0440     6.8417  -2.053 0.041925 *  
factor(Region)S America         -14.3580     6.7781  -2.118 0.035878 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 19.72 on 143 degrees of freedom
Multiple R-squared:  0.3455,  Adjusted R-squared:  0.3043
F-statistic: 8.386 on 9 and 143 DF,  p-value: 5.444e-10
Results from plm:
Call:
plm(formula = CapNormChange ~ Policychanges, data = Panel, effect = "individual",
    model = "within", index = c("Region", "Year"))

Balanced Panel: n = 9, T = 17, N = 153

Residuals:
     Min.   1st Qu.    Median   3rd Qu.      Max. 
-31.14147  -4.85551  -0.64177   1.26236 192.80277 

Coefficients:
              Estimate Std. Error t-value  Pr(>|t|)    
Policychanges  0.64118    0.12150   5.277 4.769e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares:    66459
Residual Sum of Squares: 55627
R-Squared:      0.16299
Adj. R-Squared: 0.11031
F-statistic: 27.8465 on 1 and 143 DF, p-value: 4.7687e-07
You would need to leave out + factor(Region) in your formula to get what you want from the within model in plm.
Within models do not have an intercept, but some software packages (esp. Stata and Gretl) report one. You can estimate it with plm by running within_intercept on your estimated model. The help page has the details about this somewhat artificial intercept.
If you want the individual effects and their significance, use summary(fixef(<your_plm_model>)). Use pFtest to check whether the within specification seems worthwhile.
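A minimal sketch of those calls, reusing the data from the question; the within model is refit without factor(Region) as recommended above, and a pooling model is fit only so pFtest has a comparison:
library(plm)
fe <- plm(CapNormChange ~ Policychanges, data = Panel,
          index = c("Region", "Year"), model = "within", effect = "individual")
within_intercept(fe)   # the somewhat artificial overall intercept
summary(fixef(fe))     # individual (Region) effects with significance tests
pooled <- plm(CapNormChange ~ Policychanges, data = Panel,
              index = c("Region", "Year"), model = "pooling")
pFtest(fe, pooled)     # F test: are the individual effects jointly significant?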
The R-squareds diverge between the lm model and the plm model. This is because the lm model (used like this, with the dummies, it is usually called the LSDV model, for least squares dummy variables) gives what is sometimes called the overall R-squared, while plm gives the R-squared of the demeaned regression, sometimes called the within R-squared. Stata's documentation has some details about this: https://www.stata.com/manuals/xtxtreg.pdf

Swap x and y variables in lm() function in R

I'm trying to get a summary of a linear model created with the lm() function in R, but no matter which way I set it up, I get my desired y value as my x input. My desired output is the summary of the model where Winnings is the y output and averagedist is the input. This is my current output:
Call:
lm(formula = Winnings ~ averagedist, data = combineddata)
Residuals:
     Min       1Q   Median       3Q      Max 
-20.4978  -5.2992  -0.3824   6.0887  23.4764 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.882e+02  7.577e-01 380.281  < 2e-16 ***
Winnings    1.293e-06  2.023e-07   6.391 8.97e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.343 on 232 degrees of freedom
Multiple R-squared:  0.1497,  Adjusted R-squared:  0.146
F-statistic: 40.84 on 1 and 232 DF,  p-value: 8.967e-10
I have tried flipping the order and defining the variables using y = Winnings, x = averagedist, but I always get the same output.
Using summary(lm(Winnings ~ averagedist, combineddata)) as an alternative way to set it up seemed to do the trick, as opposed to the two-step method:
str <- lm(Winnings ~ averagedist, combineddata)
summary(str)
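For reference, a minimal sketch of how the formula controls which variable is the response (variable names taken from the question): whatever is left of the ~ is always y.
m1 <- lm(Winnings ~ averagedist, data = combineddata)   # Winnings is y, averagedist is x
m2 <- lm(averagedist ~ Winnings, data = combineddata)   # averagedist is y, Winnings is x
summary(m1)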
