Creating a Line of Best Fit in R - r

Hi, I was wondering if anyone could help me with how to complete a task in R, as I am totally new to computer programming. Apologies if this seems like a silly question to this audience.
I am currently trying to add a line of best fit to my scatter graph, but I don't quite understand how to do this. I've tried functions like "abline" and "lm", but I'm not sure whether I am using the right ideas or whether I am putting incorrect numbers into the functions.
I have cleared the workspace and just left my graph and previous workings so that it looks neater.
Thanks in advance for the help.

Here is a reproducible data example:
set.seed(2348907)
x <- rnorm(100)
y <- 2*(x+rnorm(100))
Then this fits a linear model to estimate the intercept and slope:
lmodel <- lm(y ~ x)
The lmodel object now contains the intercept (the "(Intercept)" term) and the slope (the coefficient of the x variable in the model). Here is a summary of the model:
summary(lmodel)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-3.8412 -1.0767 -0.1808 1.2216 4.1540
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0454 0.1851 0.245 0.807
x 2.1087 0.1814 11.627 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.848 on 98 degrees of freedom
Multiple R-squared: 0.5797, Adjusted R-squared: 0.5754
F-statistic: 135.2 on 1 and 98 DF, p-value: < 2.2e-16
Then make the plot, using the coef() function to pull the intercept and slope out of the linear model:
plot(x,y) # plots the points
abline(a=coef(lmodel)[1], b=coef(lmodel)[2]) # plots the line, a=intercept, b=slope
Personally, I prefer ggplot2 for such things, which would look like this:
library(ggplot2)
ggplot() + geom_point(aes(x,y)) +
  geom_smooth(aes(x,y), method="lm", se=FALSE)
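For what it's worth, base R's abline() also accepts the fitted model object itself, so the coef() extraction can be skipped for a simple one-predictor fit. A minimal sketch using the same simulated data (the red colour is only an illustrative choice):

```r
# Same simulated data as in the answer above
set.seed(2348907)
x <- rnorm(100)
y <- 2 * (x + rnorm(100))

# Fit the straight-line model
lmodel <- lm(y ~ x)

# Scatter plot, then draw the fitted line directly from the model object
plot(x, y)
abline(lmodel, col = "red")
```

This only works for a model with one intercept and one slope; with more coefficients abline() uses just the first two and warns about it.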

Related

Why is this regression plot only plotting 2 of the 4 regression coefficients? [duplicate]

I have the following set of data: https://archive.ics.uci.edu/ml/datasets/abalone
I am trying to plot a regression for the whole weight against the diameter.
A scatter plot of the data is clearly not a linear function. (I am unable to attach it for some reason.)
Consider a quadratic regression model. I set it up like so:
abalone <- read.csv("abalone.data")
diameter <- abalone$Diameter
diameter2 <- diameter^2
whole <- abalone$Whole.weight
quadraticModel <- lm( whole ~ diameter + diameter2)
This is fine and gives me the following when calling quadraticModel:
Call:
lm(formula = whole ~ diameter + diameter2)
Coefficients:
(Intercept) diameter diameter2
0.3477 -3.3555 10.4968
However, when I plot:
abline(quadraticModel)
I get the following warning:
Warning message:
In abline(quadraticModel) :
only using the first two of 3 regression coefficients
which means that I am getting a straight line plot which isn't what I am aiming for. Can someone please explain to me why this is happening and possible ways around it? I am also having the same issue with cubic plots etc. (They always just plot the first two coefficients.)
You cannot use abline to plot a fitted polynomial regression, because abline only draws straight lines (an intercept and one slope). Try this instead:
x<-sort(diameter)
y<-quadraticModel$fitted.values[order(diameter)]
lines(x, y)
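Your model is a quadratic fit in the sense that it is a linear model using diameter and the squared diameter as predictors. You can build the same kind of model in one step with poly(), which by default uses orthogonal polynomial terms (so the coefficients differ, but the fitted values do not). Try this instead: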
library(stats)
df <- read.csv("abalone.data")
var_names <-
c(
"Sex",
"Length",
"Diameter",
"Height",
"Whole_weight",
"Shucked_weight",
"Viscera_weight",
"Shell_weight",
"Rings"
)
colnames(df) <- var_names
fit <- lm(df$Whole_weight ~ poly(df$Diameter, 2))
summary(fit)
diameter <- df$Diameter
predicted_weight <- fitted(fit) # the formula refers to df$Diameter directly, so predict() would ignore a newdata column named x
plot(diameter, predicted_weight)
> summary(fit)
Call:
lm(formula = df$Whole_weight ~ poly(df$Diameter, 2))
Residuals:
Min 1Q Median 3Q Max
-0.66800 -0.06579 -0.00611 0.04590 0.97396
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.828818 0.002054 403.44 <2e-16 ***
poly(df$Diameter, 2)1 29.326043 0.132759 220.90 <2e-16 ***
poly(df$Diameter, 2)2 8.401508 0.132759 63.28 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1328 on 4173 degrees of freedom
Multiple R-squared: 0.9268, Adjusted R-squared: 0.9267
F-statistic: 2.64e+04 on 2 and 4173 DF, p-value: < 2.2e-16
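A general way to draw the fitted curve is to fit with the data= argument, predict on an evenly spaced grid, and draw the result with lines(). The sketch below uses R's built-in cars data rather than the abalone file, so speed and dist are stand-in variable names and not part of the original question:

```r
# Quadratic fit; referencing columns through data= lets predict() use newdata
fit <- lm(dist ~ speed + I(speed^2), data = cars)

# Evenly spaced grid over the observed range of the predictor
grid <- data.frame(speed = seq(min(cars$speed), max(cars$speed), length.out = 200))
grid$dist_hat <- predict(fit, newdata = grid)

# Scatter plot of the data with the fitted curve on top
plot(dist ~ speed, data = cars)
lines(grid$speed, grid$dist_hat, col = "blue")
```

Because the grid is already sorted, no ordering step is needed before calling lines().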

R - Regression Analysis for Logarithmic

I am performing regression analysis and trying to find the best-fitting model for the diamonds data set (diamonds.csv) from ggplot2. I model price (the response variable) against carat, and I have tried linear, quadratic, and cubic regression. None of these lines fits well. In Excel, the logarithmic trendline gives the best fit. However, I couldn't figure out how to code this in R to get the logarithmic fitted line. Can anyone help?
Comparing Price vs Carat
model<-lm(price~carat, data = diamonds)
Model 2 uses the polynomial to compare
model2<-lm(price~carat + I(carat^2), data = diamonds)
use cubic in model3
model3 <- lm(price~carat + I(carat^2) + I(carat^3), data = diamonds)
How can I code the log in R to get same result as excel?
y = 0.4299ln(x) - 2.5495
R² = 0.8468
Thanks!
The result you report from Excel, y = 0.4299ln(x) - 2.5495, does not contain any quadratic or cubic terms. What are you trying to do? price is very skewed and, as with, say, income, it is common practice to take its log. This also gives the R² you are referring to, but very different coefficients for the intercept and the carat parameter.
m1 <- lm(log(price) ~ carat, data = diamonds)
summary(m1)
Call:
lm(formula = log(price) ~ carat, data = diamonds)
Residuals:
Min 1Q Median 3Q Max
-6.2844 -0.2449 0.0335 0.2578 1.5642
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.215021 0.003348 1856 <2e-16 ***
carat 1.969757 0.003608 546 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3972 on 53938 degrees of freedom
Multiple R-squared: 0.8468, Adjusted R-squared: 0.8468
F-statistic: 2.981e+05 on 1 and 53938 DF, p-value: < 2.2e-16
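One possible explanation for the differing coefficients, offered as an inference from the numbers rather than something stated in the question: the Excel trendline may have been fitted with price on the x-axis and carat on the y-axis, i.e. carat as a function of ln(price). A minimal sketch of that reversed model (the name m_excel is hypothetical):

```r
library(ggplot2)  # provides the diamonds data set

# Assumption: Excel's y is carat and its x is price, so y = a * ln(x) + b
m_excel <- lm(carat ~ log(price), data = diamonds)

coef(m_excel)               # intercept and slope on log(price)
summary(m_excel)$r.squared  # R-squared of the reversed fit
```

In a simple regression R² is the squared correlation, so it is the same whichever variable is treated as the response. That would explain why m1 above matches Excel's R² of 0.8468 while its coefficients do not match.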

R: Translate the results from lm() to an equation

I'm using R and I want to translate the results from lm() to an equation.
My model is:
Residuals:
Min 1Q Median 3Q Max
-0.048110 -0.023948 -0.000376 0.024511 0.044190
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.17691 0.00909 349.50 < 2e-16 ***
poly(QPB2_REF1, 2)1 0.64947 0.03015 21.54 2.66e-14 ***
poly(QPB2_REF1, 2)2 0.10824 0.03015 3.59 0.00209 **
B2DBSA_REF1DONSON -0.20959 0.01286 -16.30 3.17e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.03015 on 18 degrees of freedom
Multiple R-squared: 0.9763, Adjusted R-squared: 0.9724
F-statistic: 247.6 on 3 and 18 DF, p-value: 8.098e-15
Do you have any idea?
I tried to have something like
f <- function(x) {3.17691 + 0.64947*x +0.10824*x^2 -0.20959*1 + 0.03015^2}
but when I plug in a value for x, the f(x) value is incorrect.
Your output indicates that the model uses the poly function, which by default orthogonalizes the polynomial terms (this includes centering the x values, among other things). Your formula does no orthogonalization, and that is the likely source of the difference. You can refit the model using raw=TRUE in the call to poly to get raw coefficients that can be multiplied by $x$ and $x^2$.
You may also be interested in the Function function in the rms package which automates creating functions from fitted models.
Edit
Here is an example:
library(rms)
xx <- 1:25
yy <- 5 - 1.5*xx + 0.1*xx^2 + rnorm(25)
plot(xx,yy)
fit <- ols(yy ~ pol(xx,2))
mypred <- Function(fit)
curve(mypred, add=TRUE)
mypred( c(1,25, 3, 3.5))
You need to use the rms functions for fitting (ols and pol for this example instead of lm and poly).
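To illustrate the raw=TRUE point with plain lm and poly, here is a small sketch on simulated data (the data and the names fit_orth, fit_raw and f are made up for illustration). The hand-written equation built from the raw coefficients should reproduce the fitted values of the orthogonal fit:

```r
# Simulated data, purely illustrative
set.seed(1)
x <- seq(1, 10, length.out = 30)
y <- 3 + 0.5 * x + 0.1 * x^2 + rnorm(30, sd = 0.1)

fit_orth <- lm(y ~ poly(x, 2))              # default: orthogonal polynomials
fit_raw  <- lm(y ~ poly(x, 2, raw = TRUE))  # raw powers x and x^2

# Hand-written equation from the raw coefficients
b <- unname(coef(fit_raw))
f <- function(x) b[1] + b[2] * x + b[3] * x^2

# Both parameterisations describe the same fitted curve
all.equal(f(x), unname(fitted(fit_orth)))
```

The coefficients of fit_orth cannot be plugged into such an equation directly, which is the problem described in the question.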
If you want to calculate y-hat based on the model, you can just use predict!
Example:
set.seed(123)
my_dat <- data.frame(x=1:10, e=rnorm(10))
my_dat$y <- with(my_dat, x*2 + e)
my_lm <- lm(y~x, data=my_dat)
summary(my_lm)
Result:
Call:
lm(formula = y ~ x, data = my_dat)
Residuals:
Min 1Q Median 3Q Max
-1.1348 -0.5624 -0.1393 0.3854 1.6814
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.5255 0.6673 0.787 0.454
x 1.9180 0.1075 17.835 1e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9768 on 8 degrees of freedom
Multiple R-squared: 0.9755, Adjusted R-squared: 0.9724
F-statistic: 318.1 on 1 and 8 DF, p-value: 1e-07
Now, instead of making a function like 0.5255 + x * 1.9180 manually, I just call predict for my_lm:
predict(my_lm, data.frame(x=11:20))
Same result as this (not counting minor errors from rounding the slope/intercept estimates):
0.5255 + (11:20) * 1.9180
If you actually want to visualize or write out a complex equation (e.g. one with restricted cubic spline transformations), I recommend using the rms package: fit your model with it, then use the latex function to see the equation in LaTeX.
my_lm <- ols(y~x, data=my_dat)
latex(my_lm)
Note that you will need to render the LaTeX code to see your equation. There are websites that will render it for you, as will the MacTeX software if you are using a Mac.

Linear multivariate regression in R

I want to model that a factory takes an input of, say, x tonnes of raw material, which is then processed. In the first step waste materials are removed, and a product P1 is created. For the "rest" of the material, it is processed once again and another product P2 is created.
The problem is that I want to know how much raw material it takes to produce, say, 1 tonne of product P1 and how much raw material it takes to produce 1 tonne of P2.
I know the amount of raw materials, the amount of finished product P1 and P2 but nothing more.
In my mind, this can be modelled through multivariate regression, using P1 and P2 as dependent variables and the total raw material as the independent variable and find the factors <1 for each finished product. Does this seem right?
Also, how can this be achieved using R? From googling, I've found how to conduct multivariable regression, but not multivariate regression in R.
EDIT:
Trying to use:
datas <- read.table("datass.csv",header = TRUE, sep=",")
rawMat <- matrix(datas[,1])
P1 <- matrix(datas[,2])
P2 <- matrix(datas[,3])
fit <- lm(formula = P1 ~ rawMat)
fit
fit2 <-lm(formula = P2 ~ rawMat)
fit2
gave me results that are certainly not aligned with reality. fit2, for instance, returned 0.1381, when it should have a value around 0.8. How can I factor in Y1 as well? fit2 more or less gave me the average P2/rawMat, but rawMat is the same raw material used to produce both products, so I would like to get something like 0.8 as the factor for P1, and around the same for the factor of P2.
The R output was only:
Coefficients:
(Intercept) rawMat
-65.6702 0.1381
for fit2. Why doesn't it include "rawMat1", "rawMat2" as in J.R.'s solution?
EDIT2: datass.csv contains 3 columns. The first holds the raw material required to produce both products P1 and P2, the second the tonnes of P1 produced, and the last the tonnes of P2 produced.
Multivariate multiple regression can be done with lm(). This is very well documented, but here is a small example:
rawMat <- matrix(rnorm(200), ncol=2)
noise <- matrix(rnorm(200, 0, 0.2), ncol=2)
B <- matrix( 1:4, ncol=2)
P <- t( B %*% t(rawMat)) + noise
fit <- lm(P ~ rawMat)
summary( fit )
with summary output:
Response Y1 :
Call:
lm(formula = Y1 ~ rawMat)
Residuals:
Min 1Q Median 3Q Max
-0.50710 -0.14475 -0.02501 0.11955 0.51882
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.007812 0.019801 -0.395 0.694
rawMat1 1.002428 0.020141 49.770 <2e-16 ***
rawMat2 3.032761 0.020293 149.445 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1978 on 97 degrees of freedom
Multiple R-squared: 0.9964, Adjusted R-squared: 0.9963
F-statistic: 1.335e+04 on 2 and 97 DF, p-value: < 2.2e-16
Response Y2 :
Call:
lm(formula = Y2 ~ rawMat)
Residuals:
Min 1Q Median 3Q Max
-0.60435 -0.11004 0.02105 0.11929 0.42539
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.02287 0.01930 1.185 0.239
rawMat1 2.05474 0.01964 104.638 <2e-16 ***
rawMat2 4.00162 0.01978 202.256 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1929 on 97 degrees of freedom
Multiple R-squared: 0.9983, Adjusted R-squared: 0.9983
F-statistic: 2.852e+04 on 2 and 97 DF, p-value: < 2.2e-16
EDIT!:
In your case with a data.frame named datas you could do something like:
datas <- data.frame( y1 = P[,1], y2=P[,2], x1 = rawMat[,1], x2 = rawMat[,2])
fit <- lm( as.matrix(datas[ ,1:2]) ~ as.matrix(datas[,3:4]) )
or instead:
fit <- with(datas, lm( cbind(y1,y2) ~ x1+x2 ))
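For a layout closer to the question (one raw-material column and two product columns), a minimal sketch on simulated data follows. The data frame sim and the yield factors 0.5 and 0.3 are invented for illustration and say nothing about the real factory data:

```r
# Simulated stand-in for a file with columns rawMat, P1, P2
set.seed(42)
raw <- runif(50, min = 100, max = 200)
sim <- data.frame(
  rawMat = raw,
  P1 = 0.5 * raw + rnorm(50),  # hypothetical yield of P1 per tonne of input
  P2 = 0.3 * raw + rnorm(50)   # hypothetical yield of P2 per tonne of input
)

# Both responses on the left-hand side, the single predictor on the right
fit <- lm(cbind(P1, P2) ~ rawMat, data = sim)
coef(fit)  # one column of coefficients per response
```

With a single predictor column there is only one rawMat row in the coefficient matrix; rawMat1 and rawMat2 appear in the example above because rawMat there is a two-column matrix.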

How to plot quadratic regression in R?

The following code fits a polynomial regression in R.
lm.out3 = lm(listOfDataFrames1$avgTime ~ listOfDataFrames1$betaexit + I(listOfDataFrames1$betaexit^2) + I(listOfDataFrames1$betaexit^3))
summary(lm.out3)
Call:
lm(formula = listOfDataFrames1$avgTime ~ listOfDataFrames1$betaexit +
I(listOfDataFrames1$betaexit^2) + I(listOfDataFrames1$betaexit^3))
Residuals:
Min 1Q Median 3Q Max
-14.168 -2.923 -1.435 2.459 28.429
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 199.41 11.13 17.913 < 2e-16 ***
listOfDataFrames1$betaexit -3982.03 449.49 -8.859 1.14e-12 ***
I(listOfDataFrames1$betaexit^2) 32630.86 5370.27 6.076 7.87e-08 ***
I(listOfDataFrames1$betaexit^3) -93042.90 19521.05 -4.766 1.15e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.254 on 63 degrees of freedom
Multiple R-squared: 0.9302, Adjusted R-squared: 0.9269
F-statistic: 279.8 on 3 and 63 DF, p-value: < 2.2e-16
But I am confused about how to plot the curve on the graph.
To get graph:
plot(listOfDataFrames1$avgTime~listOfDataFrames1$betaexit)
But curve?
Is there any way to do it without manually copying the values?
Doing it the way mso suggested does work, though.
This should work.
# not tested
lm.out3 = lm(avgTime ~ poly(betaexit,3,raw=TRUE),listOfDataFrames1)
plot(avgTime~betaexit,listOfDataFrames1)
curve(predict(lm.out3,newdata=data.frame(betaexit=x)),add=TRUE)
Since you didn't provide any data, here is a working example using the built-in mtcars dataset.
fit <- lm(mpg~poly(wt,3,raw=TRUE),mtcars)
plot(mpg~wt,mtcars)
curve(predict(fit,newdata=data.frame(wt=x)),add=T)
Some notes:
(1) It is a really bad idea to reference external data structures in the formula=... argument to lm(...). Instead, reference columns of a data frame passed in the data=... argument, as above and as @mso points out.
(2) You can specify the formula as @mso suggests, or you can use the poly(...) function with raw=TRUE.
(3) The curve(...) function takes an expression as its first argument. This expression has to contain a variable x, which will be populated automatically with values from the x-axis of the graph. So in this example, the expression is:
predict(fit,newdata=data.frame(wt=x))
which uses predict(...) on the model with a dataframe having wt (the predictor variable) given by x.
Try with ggplot:
library(ggplot2)
ggplot(listOfDataFrames1, aes(x=betaexit, y=avgTime)) + geom_point()+stat_smooth(se=FALSE)
Using mtcars data:
ggplot(mtcars, aes(x=wt, y=mpg)) + geom_point()+stat_smooth(se=FALSE, method='lm', formula=y~poly(x,3))
Try:
with(listOfDataFrames1, plot(betaexit, avgTime))
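# sort by betaexit first so that lines() draws a single smooth curve
with(listOfDataFrames1[order(listOfDataFrames1$betaexit), ], lines(betaexit, 199.41-3982.03*betaexit+32630.86*betaexit^2-93042.90*betaexit^3))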
