R - Regression Analysis for a Logarithmic Fit

I am performing regression analysis to find the best-fitting model for the diamonds dataset in ggplot2, using price (the response variable) vs carat. I have fitted linear, quadratic, and cubic regressions, but none of these lines fit well. A logarithmic trendline in Excel gives the best fit, but I couldn't figure out how to code the logarithmic fit in R. Can anyone help?
# Model 1: linear, price vs carat
model <- lm(price ~ carat, data = diamonds)
# Model 2: add a quadratic term
model2 <- lm(price ~ carat + I(carat^2), data = diamonds)
# Model 3: add a cubic term
model3 <- lm(price ~ carat + I(carat^2) + I(carat^3), data = diamonds)
How can I code the logarithmic fit in R to get the same result as Excel?
y = 0.4299ln(x) - 2.5495
R² = 0.8468
Thanks!

The result you report from Excel, y = 0.4299ln(x) - 2.5495, does not contain any polynomial or cubic terms, so what exactly are you trying to do? price is very skewed, and as with, say, income, it is common practice to take its log. Doing so also reproduces the R² you are referring to, but with very different coefficients for the intercept and the carat parameter.
m1 <- lm(log(price) ~ carat, data = diamonds)
summary(m1)
Call:
lm(formula = log(price) ~ carat, data = diamonds)
Residuals:
Min 1Q Median 3Q Max
-6.2844 -0.2449 0.0335 0.2578 1.5642
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.215021 0.003348 1856 <2e-16 ***
carat 1.969757 0.003608 546 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3972 on 53938 degrees of freedom
Multiple R-squared: 0.8468, Adjusted R-squared: 0.8468
F-statistic: 2.981e+05 on 1 and 53938 DF, p-value: < 2.2e-16
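As an aside: Excel's logarithmic trendline fits the form y = a·ln(x) + b, i.e. the log is applied to the predictor rather than the response. If that exact functional form is what you want, a minimal sketch of the direct equivalent would be:
library(ggplot2)  # provides the diamonds data set
m2 <- lm(price ~ log(carat), data = diamonds)
summary(m2)  # coefficients correspond to y = a*ln(x) + b
Whether this reproduces your Excel coefficients depends on how the chart data were scaled, so treat it as something to check rather than a guaranteed match.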

Related

Creating a Line of Best Fit in R

Hi, I was wondering if anyone could help me with completing a function in R, as I am totally new to computer programming. So apologies if this seems like a silly question to this audience.
I am currently trying to add a line of best fit to my scatter graph, but I'm not quite understanding how to do this. I've tried functions like "abline" and "lm", but I'm not sure whether I am using the right approach or whether I am passing incorrect values to the functions.
I have cleared the workspace and just left my graph and previous workings so that it looks neater.
Thanks in advance for the help.
reproducible data example:
set.seed(2348907)
x <- rnorm(100)
y <- 2*(x+rnorm(100))
then this fits a linear model, estimating the intercept and slope:
lmodel <- lm(y ~ x)
which now contains the intercept (the (Intercept) term) and the slope (the coefficient of the x variable in the model). Here is a summary of the model:
summary(lmodel)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-3.8412 -1.0767 -0.1808 1.2216 4.1540
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0454 0.1851 0.245 0.807
x 2.1087 0.1814 11.627 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.848 on 98 degrees of freedom
Multiple R-squared: 0.5797, Adjusted R-squared: 0.5754
F-statistic: 135.2 on 1 and 98 DF, p-value: < 2.2e-16
then make the plot using the coef() function to pull out the intercept and slope from the linear model:
plot(x,y) # plots the points
abline(a=coef(lmodel)[1], b=coef(lmodel)[2]) # plots the line, a=intercept, b=slope
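As a shortcut, abline(lmodel) also works here: for a simple one-predictor model, abline() accepts the fitted lm object directly and extracts the intercept and slope itself.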
personally, I prefer ggplot2 for such things, which would look like this:
library(ggplot2)
ggplot() + geom_point(aes(x, y)) +
  geom_smooth(aes(x, y), method = "lm", se = FALSE)

How to fit a known linear equation to my data in R?

I used a linear model, the lm() function, to obtain the best fit to my data.
From the literature I know that the optimal fit would be a linear regression with slope = 1 and intercept = 0. I would like to see how well this equation (y = x) fits my data. How do I proceed in order to find an R^2, as well as a p-value?
This is my data
(y = modelled, x = measured)
measured<-c(67.39369,28.73695,60.18499,49.32405,166.39318,222.29022,271.83573,241.72247, 368.46304,220.27018,169.92343,56.49579,38.18381,49.33753,130.91752,161.63536,294.14740,363.91029,358.32905,239.84112,129.65078,32.76462,30.13952,52.83656,67.35427,132.23034,366.87857,247.40125,273.19316,278.27902,123.24256,45.98363,83.50199,240.99459,266.95707,308.69814,228.34256,220.51319,83.97942,58.32171,57.93815,94.64370,264.78007,274.25863,245.72940,155.41777,77.45236,70.44223,104.22838,294.01645,312.42321,122.80831,41.65770,242.22661,300.07147,291.59902,230.54478,89.42498,55.81760,55.60525,111.64263,305.76432,264.27192,233.28214,192.75603,75.60803,63.75376)
modelled<-c(42.58318,71.64667,111.08853,67.06974,156.47303,240.41188,238.25893,196.42247,404.28974,138.73164,116.73998,55.21672,82.71556,64.27752,145.84891,133.67465,295.01014,335.25432,253.01847,166.69241,68.84971,26.03600,45.04720,75.56405,109.55975,202.57084,288.52887,140.58476,152.20510,153.99427,75.70720,92.56287,144.93923,335.90871,NA,264.25732,141.93407,122.80440,83.23812,42.18676,107.97732,123.96824,270.52620,388.93979,308.35117,100.79047,127.70644,91.23133,162.53323,NA ,276.46554,100.79440,81.10756,272.17680,387.28700,208.29715,152.91548,62.54459,31.98732,74.26625,115.50051,324.91248,210.14204,168.29598,157.30373,45.76027,76.07370)
Now I would like to see how well the equation y = x fits the data presented above (R^2 and p-value).
I would be very grateful if somebody could help me with this (basic) problem, as I found no answers to my question on Stack Overflow.
Best regards, Cyril
Let's be clear what you are asking here. You have an existing model, which is "the modelled values are the expected value of the measured values", or in other words, measured = modelled + e, where e are the normally distributed residuals.
You say that the "optimal fit" should be a straight line with intercept 0 and slope 1, which is another way of saying the same thing.
The thing is, this "optimal fit" is not the optimal fit for your actual data, as we can easily see by doing:
summary(lm(measured ~ modelled))
#>
#> Call:
#> lm(formula = measured ~ modelled)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -103.328 -39.130 -4.881 40.428 114.829
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 23.09461 13.11026 1.762 0.083 .
#> modelled 0.91143 0.07052 12.924 <2e-16 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Residual standard error: 55.13 on 63 degrees of freedom
#> Multiple R-squared: 0.7261, Adjusted R-squared: 0.7218
#> F-statistic: 167 on 1 and 63 DF, p-value: < 2.2e-16
This shows us the line that would produce the optimal fit to your data in terms of reducing the sum of the squared residuals.
But I guess what you are asking is "How well do my data fit the model measured = modelled + e ?"
Trying to coerce lm into giving you a fixed intercept and slope probably isn't the best way to answer this question. Remember, the p value for the slope only tells you whether the actual slope is significantly different from 0. The above model already confirms that. If you want to know the r-squared of measured = modelled + e, you just need to know the proportion of the variance of measured that is explained by modelled. In other words:
1 - var(measured - modelled, na.rm = TRUE) / var(measured, na.rm = TRUE)
#> [1] 0.7192672
This is pretty close to the r squared from the lm call.
I think you have sufficient evidence to say that your data is consistent with the model measured = modelled, in that the slope in the lm model includes the value 1 within its 95% confidence interval, and the intercept contains the value 0 within its 95% confidence interval.
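A quick way to check those confidence intervals directly (a small addition to the model fitted above):
fit <- lm(measured ~ modelled)
confint(fit)  # 95% CIs: the (Intercept) row should cover 0, the modelled row should cover 1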
As mentioned in the comments, you can use the lm() function, but it estimates the slope and intercept for you, whereas what you want is something different.
If slope = 1 and intercept = 0, you essentially already have a fit: modelled is itself the predicted value. You need the R squared of this fit. R squared is defined as:
R^2 = MSS/TSS = (TSS − RSS)/TSS
See this link for the definitions of RSS (the residual sum of squares) and TSS (the total sum of squares).
We can only work with observations that are complete (non-NA), so first keep only the complete pairs, then calculate each sum of squares:
nonNA <- !is.na(modelled) & !is.na(measured)
# residual sum of squares of your prediction (y = x)
RSS <- sum((modelled[nonNA] - measured[nonNA])^2)
# total sum of squares of the data
TSS <- sum((measured[nonNA] - mean(measured[nonNA]))^2)
1 - RSS/TSS
[1] 0.7116585
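(Equivalently, nonNA <- complete.cases(measured, modelled) is the idiomatic shortcut for the non-NA filter above.)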
If measured and modelled are supposed to represent the actual and fitted values of an undisclosed model, as discussed in the comments below another answer, and fm is the lm object for that undisclosed model, then
summary(fm)
will show the R^2 and p value of that model.
The R squared value can actually be calculated using only measured and modelled, but the formula differs depending on whether or not the undisclosed model has an intercept. The signs are that there is no intercept: if there were one, sum(modelled - measured, na.rm = TRUE) should be 0, but in fact it is far from it.
In any case, R^2 and the p value are shown in the output of summary(fm), where fm is the undisclosed linear model, so there is no point in restricting the discussion to measured and modelled if you have the lm object of the undisclosed model.
For example, if the undisclosed model were the following, using the built-in CO2 data frame:
fm <- lm(uptake ~ Type + conc, CO2)
summary(fm)
we get this output, where the last two lines show the R squared and p value:
Call:
lm(formula = uptake ~ Type + conc, data = CO2)
Residuals:
Min 1Q Median 3Q Max
-18.2145 -4.2549 0.5479 5.3048 12.9968
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.830052 1.579918 16.349 < 2e-16 ***
TypeMississippi -12.659524 1.544261 -8.198 3.06e-12 ***
conc 0.017731 0.002625 6.755 2.00e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.077 on 81 degrees of freedom
Multiple R-squared: 0.5821, Adjusted R-squared: 0.5718
F-statistic: 56.42 on 2 and 81 DF, p-value: 4.498e-16

plm vs lm - different results?

I tried several times to run the same regression with both lm and plm, and I get different results.
First, I used lm as follows:
fixed.Region1 <- lm(CapNormChange ~ Policychanges + factor(Region),
data=Panel)
Then I used plm in the following way:
fixed.Region2 <- plm(CapNormChange ~ Policychanges+ factor(Region),
data=Panel, index=c("Region", "Year"), model="within", effect="individual")
I think there is something wrong with plm, because I don't see an intercept in the results (see below).
Furthermore, I am not entirely sure whether + factor(Region) is necessary; however, if it is not there, I don't see the coefficients (and significance) for the dummies.
So, my questions are:
Am I using the plm function wrong? (Or what is wrong about it?)
If not, how can it be that the results are different?
If somebody could give me a hint, I would really appreciate it.
Results from LM:
Call:
lm(formula = CapNormChange ~ Policychanges + factor(Region),
data = Panel)
Residuals:
Min 1Q Median 3Q Max
-31.141 -4.856 -0.642 1.262 192.803
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.3488 4.9134 3.531 0.000558 ***
Policychanges 0.6412 0.1215 5.277 4.77e-07 ***
factor(Region)Asia -19.3377 6.7804 -2.852 0.004989 **
factor(Region)C America + Carib 0.1147 6.8049 0.017 0.986578
factor(Region)Eurasia -17.6476 6.8294 -2.584 0.010767 *
factor(Region)Europe -20.7759 8.8993 -2.335 0.020959 *
factor(Region)Middle East -17.3348 6.8285 -2.539 0.012200 *
factor(Region)N America -17.5932 6.8064 -2.585 0.010745 *
factor(Region)Oceania -14.0440 6.8417 -2.053 0.041925 *
factor(Region)S America -14.3580 6.7781 -2.118 0.035878 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 19.72 on 143 degrees of freedom
Multiple R-squared: 0.3455, Adjusted R-squared: 0.3043
F-statistic: 8.386 on 9 and 143 DF, p-value: 5.444e-10
Results from PLM:
Call:
plm(formula = CapNormChange ~ Policychanges, data = Panel, effect = "individual",
model = "within", index = c("Region", "Year"))
Balanced Panel: n = 9, T = 17, N = 153
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-31.14147 -4.85551 -0.64177 1.26236 192.80277
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
Policychanges 0.64118 0.12150 5.277 4.769e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 66459
Residual Sum of Squares: 55627
R-Squared: 0.16299
Adj. R-Squared: 0.11031
F-statistic: 27.8465 on 1 and 143 DF, p-value: 4.7687e-07
You would need to leave out + factor(Region) in your formula to get what you want from the within model in plm.
Within models do not have an intercept, but some software packages (esp. Stata and Gretl) report one. You can estimate it with plm by running within_intercept() on your estimated model; the help page has the details about this somewhat artificial intercept.
If you want the individual effects and their significance, use summary(fixef(<your_plm_model>)). Use pFtest() to check whether the within specification seems worthwhile.
The R squareds diverge between the lm model and the plm model. This is because the lm model (used like this with the dummies, it is usually called the LSDV model, for least squares dummy variables) gives what is sometimes called the overall R squared, while plm gives you the R squared of the demeaned regression, sometimes called the within R squared. Stata's documentation has some details about this: https://www.stata.com/manuals/xtxtreg.pdf
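Putting that together, a sketch of the corrected call and the follow-up checks (assuming Panel is your data frame as above):
library(plm)
# leave out factor(Region): the within transformation absorbs the Region effects
fixed.Region2 <- plm(CapNormChange ~ Policychanges, data = Panel,
                     index = c("Region", "Year"),
                     model = "within", effect = "individual")
summary(fixef(fixed.Region2))    # individual (Region) effects and their significance
within_intercept(fixed.Region2)  # the somewhat artificial overall intercept
pooling <- plm(CapNormChange ~ Policychanges, data = Panel,
               index = c("Region", "Year"), model = "pooling")
pFtest(fixed.Region2, pooling)   # does the within specification beat pooled OLS?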

Why is this regression plot only plotting 2 of the 4 regression coefficients? [duplicate]

I have the following set of data: https://archive.ics.uci.edu/ml/datasets/abalone
I am trying to plot a regression for the whole weight against the diameter.
A scatter plot of the data clearly does not follow a linear function. (I am unable to attach it for some reason.)
Consider a quadratic regression model. I set it up like so:
abalone <- read.csv("abalone.data")
diameter <- abalone$Diameter
diameter2 <- diameter^2
whole <- abalone$Whole.weight
quadraticModel <- lm(whole ~ diameter + diameter2)
This is fine and gives me the following when calling quadraticModel:
Call:
lm(formula = whole ~ diameter + diameter2)
Coefficients:
(Intercept) diameter diameter2
0.3477 -3.3555 10.4968
However, when I plot:
abline(quadraticModel)
I get the following warning:
Warning message:
In abline(quadraticModel) :
only using the first two of 3 regression coefficients
which means that I am getting a straight line plot which isn't what I am aiming for. Can someone please explain to me why this is happening and possible ways around it? I am also having the same issue with cubic plots etc. (They always just plot the first two coefficients.)
You cannot use abline() to plot a fitted polynomial regression; it only draws straight lines. Instead, add the fitted curve to the existing scatter plot with lines(), ordering the points by the x variable:
x <- sort(diameter)
y <- quadraticModel$fitted.values[order(diameter)]
lines(x, y)  # draws the fitted curve on the open plot
What you have is in fact a quadratic fit, just built from the raw diameter and squared-diameter regressors; a cleaner and numerically more stable route is poly(), which uses orthogonal polynomials. Try this instead:
library(stats)
df <- read.csv("abalone.data") # note: the UCI abalone.data file has no header row, so header = FALSE is likely needed to keep the first record
var_names <-
c(
"Sex",
"Length",
"Diameter",
"Height",
"Whole_weight",
"Shucked_weight",
"Viscera_weight",
"Shell_weight",
"Rings"
)
colnames(df) <- var_names
fit <- lm(df$Whole_weight ~ poly(df$Diameter, 2))
summary(fit)
diameter <- df$Diameter
# the formula references df$Diameter directly, so predict() would ignore any newdata;
# predict(fit) simply returns the fitted values, which is what we want here
predicted_weight <- predict(fit)
plot(diameter, predicted_weight)
> summary(fit)
Call:
lm(formula = df$Whole_weight ~ poly(df$Diameter, 2))
Residuals:
Min 1Q Median 3Q Max
-0.66800 -0.06579 -0.00611 0.04590 0.97396
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.828818 0.002054 403.44 <2e-16 ***
poly(df$Diameter, 2)1 29.326043 0.132759 220.90 <2e-16 ***
poly(df$Diameter, 2)2 8.401508 0.132759 63.28 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1328 on 4173 degrees of freedom
Multiple R-squared: 0.9268, Adjusted R-squared: 0.9267
F-statistic: 2.64e+04 on 2 and 4173 DF, p-value: < 2.2e-16
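If you would rather overlay the fitted curve on the raw data than plot the predictions on their own, a sketch:
plot(df$Diameter, df$Whole_weight)         # raw scatter
ord <- order(df$Diameter)
lines(df$Diameter[ord], fitted(fit)[ord])  # fitted quadratic curve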

What is the function of I()?

I found the following code on the internet
mod1 <- lm(mpg ~ weight + I(weight^2) + foreign, auto)
What is the function I()? It seems that the result of weight^2 is the same as I(weight^2).
The function of I() is to isolate terms in formulae from the usual formula parsing and syntax. There are other uses of I() in data frames, where it helps create objects that have, or inherit from, class "AsIs", which allows embedding objects without the usual conversion that takes place.
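A quick illustration of that data-frame use (a side note to your question):
# without I(), data.frame() would split the matrix into separate columns m.1 and m.2
d <- data.frame(x = 1:3, m = I(matrix(1:6, nrow = 3)))
str(d)  # the column m keeps its matrix form and gains class "AsIs"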
In the formula case, which is what you specifically ask about, ^ is a special formula operator that indicates crossing of the terms to the nth degree, where n is given following the operator, as in ^n. As such, ^ does not have its usual arithmetic interpretation in a formula. (Likewise, the -, +, / and * operators also have special formula meanings, and as a result I() is required to isolate them from the formula parsing tools.)
In the specific example you give (which I'll illustrate using the built-in data set trees), if you forget to use I() around the quadratic term, R will in this case ignore that term completely: Volume (weight in your example) is already in the model, and you are asking for multi-way interactions of the variable with itself, which is not a quadratic term.
First without I():
> lm(Height ~ Volume + Volume^2, data = trees)
Call:
lm(formula = Height ~ Volume + Volume^2, data = trees)
Coefficients:
(Intercept) Volume
69.0034 0.2319
Notice how only the Volume term made it into the fitted model? The correct specification for a quadratic model (actually it may not be; see below) is
> lm(Height ~ Volume + I(Volume^2), data = trees)
Call:
lm(formula = Height ~ Volume + I(Volume^2), data = trees)
Coefficients:
(Intercept) Volume I(Volume^2)
65.33587 0.47540 -0.00314
I said it may not be correct; this is due to correlation between Volume and Volume^2. An identical but more stable fit can be achieved by the use of orthogonal polynomials, which poly() can produce for you. So the more stable specification would be:
> lm(Height ~ poly(Volume, 2), data = trees)
Call:
lm(formula = Height ~ poly(Volume, 2), data = trees)
Coefficients:
(Intercept) poly(Volume, 2)1 poly(Volume, 2)2
76.000 20.879 -5.278
Note that the fit is identical to the earlier model, though with different coefficient estimates, as the input data are different (orthogonal vs raw polynomials). You can see this from their summary() output if you don't believe me:
> summary(lm(Height ~ poly(Volume, 2), data = trees))
Call:
lm(formula = Height ~ poly(Volume, 2), data = trees)
Residuals:
Min 1Q Median 3Q Max
-11.2266 -3.6728 -0.0745 2.4073 9.9954
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 76.0000 0.9322 81.531 < 2e-16 ***
poly(Volume, 2)1 20.8788 5.1900 4.023 0.000395 ***
poly(Volume, 2)2 -5.2780 5.1900 -1.017 0.317880
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.19 on 28 degrees of freedom
Multiple R-squared: 0.3808, Adjusted R-squared: 0.3365
F-statistic: 8.609 on 2 and 28 DF, p-value: 0.001219
> summary(lm(Height ~ Volume + I(Volume^2), data = trees))
Call:
lm(formula = Height ~ Volume + I(Volume^2), data = trees)
Residuals:
Min 1Q Median 3Q Max
-11.2266 -3.6728 -0.0745 2.4073 9.9954
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 65.335867 4.110886 15.893 1.52e-15 ***
Volume 0.475398 0.246279 1.930 0.0638 .
I(Volume^2) -0.003140 0.003087 -1.017 0.3179
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.19 on 28 degrees of freedom
Multiple R-squared: 0.3808, Adjusted R-squared: 0.3365
F-statistic: 8.609 on 2 and 28 DF, p-value: 0.001219
Note the difference in the t-tests for the linear and quadratic terms in the models. This is where the orthogonality of the input polynomial terms helps.
To really see what ^ does in a formula (in case the terminology in ?formula is not something you are familiar with), consider this model:
> lm(Height ~ (Girth + Volume)^2, data = trees)
Call:
lm(formula = Height ~ (Girth + Volume)^2, data = trees)
Coefficients:
(Intercept) Girth Volume Girth:Volume
75.40148 -2.29632 1.86095 -0.05608
As there are two terms in (...)^2, the formula parsing code converts that into the main effects of the two variables plus their 2nd-order interaction. That model could be more succinctly written as Height ~ Girth * Volume, but ^ helps when you want higher-order interactions or interactions among a larger number of variables.
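One way to see that expansion directly is to ask the formula parser for its term labels:
# crossing three variables to degree 2: main effects plus all pairwise interactions
attr(terms(~ (Girth + Height + Volume)^2), "term.labels")
# "Girth" "Height" "Volume" "Girth:Height" "Girth:Volume" "Height:Volume"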
