I am starting to use R and have a bit of a problem.
I have a dataset, called ADC_dark, containing 20 points at which leaf temperature and respiration were measured.
I expect an exponential relationship, where an increase in leaf temperature results in increased respiration.
Then I plotted an exponential curve through the data:
ADC_dark %>%
  ggplot(aes(x = Tleaf, y = abs_A)) +
  geom_point() +
  stat_smooth(method = 'lm', formula = log(y) ~ x) +
  labs(title = "Respiration and leaf temperature", x = "Tleaf", y = "abs_A")
This is not looking very good. From the lm() output below, the fitted line is log(y) = -2.70206 + 0.11743*x, i.e. y = e^(-2.70206 + 0.11743*x):
Call:
lm(formula = log(ADC_dark$abs_A) ~ ADC_dark$Tleaf)
Residuals:
Min 1Q Median 3Q Max
-2.0185 -0.1059 0.1148 0.2698 0.6825
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.70206 0.51255 -5.272 5.18e-05 ***
ADC_dark$Tleaf 0.11743 0.02161 5.435 3.66e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5468 on 18 degrees of freedom
Multiple R-squared: 0.6213, Adjusted R-squared: 0.6003
F-statistic: 29.54 on 1 and 18 DF, p-value: 3.659e-05
When I use the same data in Excel I get this:
As you can see, the intercepts of these suggested exponential relationships differ.
Just looking at the pictures, I would say that Excel is doing a better job.
How can I 'train' R to make a better fitted curve through my data, or am I misinterpreting something?
The problem is that when you fit inside ggplot2's stat_smooth() using log(y) ~ x, the scales of your data points and of the fitted line end up different. Basically you plot y and predictions of log(y) on the same y scale, and since y > log(y) for any positive y, the fitted line is shifted below your data points.
You have several options, such as tweaking the axes and scales (for instance fitting on the log scale yourself and back-transforming the predictions, sketched after the example below), or simply using glm(), a generalized linear model with a log link, instead of lm(). In that case the scales are preserved and no additional tweaking is needed.
library(ggplot2)
library(dplyr)   # provides the %>% pipe used below
set.seed(123)
# simulate data resembling the question's ADC_dark
ADC_dark <- data.frame(Tleaf = 1:20)
ADC_dark$abs_A <- exp(0.11 * ADC_dark$Tleaf - 2.7 + rnorm(20) / 10)
ADC_dark %>%
  ggplot(aes(x = Tleaf, y = abs_A)) +
  geom_point() +
  geom_smooth(method = "glm", formula = y ~ x,
              method.args = list(family = gaussian(link = "log"))) +
  labs(title = "Respiration and leaf temperature", x = "Tleaf", y = "abs_A")
Output:
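For completeness, the back-transform option mentioned above can be sketched as follows; this is a minimal sketch that reuses the simulated ADC_dark from the chunk above and assumes the column names Tleaf and abs_A from the question:
# fit on the log scale with lm(), then back-transform the predictions
fit_log <- lm(log(abs_A) ~ Tleaf, data = ADC_dark)
ADC_dark$pred <- exp(predict(fit_log))
ggplot(ADC_dark, aes(x = Tleaf, y = abs_A)) +
  geom_point() +
  geom_line(aes(y = pred), colour = "blue") +
  labs(title = "Respiration and leaf temperature", x = "Tleaf", y = "abs_A")
Exponentiating the predictions puts the curve back on the original response scale, so the points and the line share one axis.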
I'm fitting a linear regression with lm() like
model<-lm(y~x+a, data=dat)
where a is a blocking variable with multiple factor levels.
summary(model)
Call:
lm(formula = y ~ x + a, data = dat)
Residuals:
Min 1Q Median 3Q Max
-1.45006 -0.20737 0.04593 0.26337 0.91628
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.704042 1.088024 -7.081 1.08e-10 ***
x 0.248889 0.036436 6.831 3.81e-10 ***
a1 0.002695 0.150530 0.018 0.98575
a2 0.491749 0.152378 3.227 0.00162 **
a3 0.349772 0.145024 2.412 0.01740 *
a4 -0.009058 0.138717 -0.065 0.94805
a5 0.428085 0.128041 3.343 0.00111 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4505 on 119 degrees of freedom
Multiple R-squared: 0.4228, Adjusted R-squared: 0.3937
F-statistic: 14.53 on 6 and 119 DF, p-value: 2.19e-12
I'm trying to display the same equation and R2 I would get with summary(model) when plotting the raw data and the regression line using ggplot, but because I'm not actually providing a, it isn't taken into account in the fit performed by stat_poly_eq().
ggplot(data = dat, aes(x, y)) +
  geom_point() +
  geom_abline(slope = coef(model)[2], intercept = coef(model)[1], color = 'red') +
  stat_poly_eq(data = dat, formula = y ~ x,
               aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")),
               parse = TRUE, size = 3, colour = "red")
Naturally, because lm() and stat_poly_eq() fit the model differently, the resulting parameter estimates and R2 are different.
Is it possible to include the blocking variable in stat_poly_eq and if so, how?
Since factor a has six levels, you have fitted six parallel lines, so it does not make much sense to show only one line and one equation. If factor a describes blocks, then using lme() to fit a mixed-effects model is possible, and it will give you only one estimate for the line. You also have to consider the contrasts used by default in R: the first level of a (here a0) is the "reference", so the line plotted in your example is for block level a0 and is not valid for the dataset as a whole.
stat_poly_eq() supports only lm(). stat_poly_eq() works in the same way as stat_smooth(method = "lm") as it is intended to be used together with it. If you are fitting the model outside of ggplot then you will need to build a suitable label manually using plotmath syntax, and add it in an annotation layer using annotate(geom = "text", x = 27, y = 1, label = "<your string>", parse = TRUE). To create the string that I show with the placeholder <your string>, you can extract the coefficient estimates in a similar way as you do in geom_abline() in your plot example, and use paste() or sprintf() to assemble the equation. You can also use coef() with a model fitted with lme().
Other statistics in package 'ggpmisc' let you fit a model with lme(), but you would still need to assemble the label manually. If you will be producing many plots, you may find it worthwhile checking the User Guide of package 'ggpmisc' for the details.
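A minimal sketch of the manual-label route described above, assuming the blocking factor a is treated as a random effect with nlme::lme(); the annotate() coordinates are the placeholders from the answer and will need adjusting to your data:
library(nlme)
library(ggplot2)
model <- lme(y ~ x, random = ~ 1 | a, data = dat)
cf <- fixef(model)  # fixed-effect intercept and slope: one line for the whole dataset
# assemble a plotmath label for the fitted equation
eq_label <- sprintf("italic(y) == %.3f + %.3f * italic(x)", cf[1], cf[2])
ggplot(dat, aes(x, y)) +
  geom_point() +
  geom_abline(intercept = cf[1], slope = cf[2], colour = "red") +
  annotate(geom = "text", x = 27, y = 1, label = eq_label,
           parse = TRUE, size = 3, colour = "red")
Note that an R-squared is not uniquely defined for a mixed model, so this sketch only shows the equation.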
Say I have the data frame (image below) and I want to split it into two new data frames based on region, so one would be BC and the other NZ. How do I achieve this in R?
[image of the data frame]
Here is an example with the mtcars data where we use the transmission variable am to plot different groups in a scatterplot with the ggplot2 package.
We will create a scatterplot with the displacement variable on the x axis and miles per gallon on the y axis. Since cars with larger engine displacement typically consume more gasoline than those with smaller displacement, we expect to see a negative relationship (mpg is higher with low values of displacement) in the chart.
First, we convert am to a factor variable so the legend prints two categories instead of a continuum between 0 and 1. Then we use ggplot() and geom_point() to set the point color based on the value of am.
library(ggplot2)
mtcars$am <- factor(mtcars$am,labels = c("automatic","manual"))
ggplot(mtcars,aes(disp,mpg,group = am)) +
geom_point(aes(color = am))
...and the output:
Separating charts by group with facets
We can use ggplot2 directly to generate separate charts by a grouping variable. In ggplot2 this is known as a facetted chart. We use facet_wrap() to split the data by values of am as follows.
ggplot(mtcars,aes(disp,mpg,group = am)) +
geom_point() +
facet_wrap(~ am, ncol = 2)
...and the output:
Adding regression line and confidence intervals
Given the comments in the original question, we can add a regression line to the plot by using the geom_smooth() function, which defaults to loess smoothing.
ggplot(mtcars,aes(disp,mpg,group = am)) +
geom_point() +
facet_wrap(~ am, ncol = 2) +
geom_smooth(span = 1)
...and the output:
To use a simple linear regression instead of loess smoothing, we use the method argument of geom_smooth() and set it to "lm".
ggplot(mtcars,aes(disp,mpg,group = am)) +
geom_point() +
facet_wrap(~ am, ncol = 2) +
geom_smooth(method = "lm")
...and the output:
Generate regression models by group
Here we split the data frame by values of am, and use lapply() to generate regression models for each group.
carsList <- split(mtcars,mtcars$am)
lapply(carsList,function(x){
summary(lm(mpg ~ disp,data = x))
})
...and the output:
$automatic
Call:
lm(formula = mpg ~ disp, data = x)
Residuals:
Min 1Q Median 3Q Max
-2.7341 -1.6546 -0.8855 1.6032 5.0764
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.157064 1.592922 15.79 1.36e-11 ***
disp -0.027584 0.005146 -5.36 5.19e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.405 on 17 degrees of freedom
Multiple R-squared: 0.6283, Adjusted R-squared: 0.6064
F-statistic: 28.73 on 1 and 17 DF, p-value: 5.194e-05
$manual
Call:
lm(formula = mpg ~ disp, data = x)
Residuals:
Min 1Q Median 3Q Max
-4.6056 -2.4200 -0.0956 3.1484 5.2315
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 32.86614 1.95033 16.852 3.33e-09 ***
disp -0.05904 0.01174 -5.031 0.000383 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.545 on 11 degrees of freedom
Multiple R-squared: 0.6971, Adjusted R-squared: 0.6695
F-statistic: 25.31 on 1 and 11 DF, p-value: 0.0003834
NOTE: since this is an example illustrating the code necessary to generate a regression analysis with a split variable, we won't go into the details about whether the data here conforms to modeling assumptions for Ordinary Least Squares regression.
Modeling the groups in one regression model
As I noted in the comments, we can account for the differences between automatic and manual transmissions in one regression model if we specify the am effect as well as an interaction effect between am and disp.
summary(lm(mpg ~ disp + am + am * disp,data=mtcars))
We can demonstrate that this model generates the same predictions as the split model for manual transmissions by generating predictions from each model as follows.
data <- data.frame(am = c(1,1,0),
disp = c(157,248,300))
data$am <- factor(data$am,labels = c("automatic","manual"))
mod1 <- lm(mpg ~ disp + am + am * disp,data=mtcars)
predict(mod1,data)
mod2 <- lm(mpg ~ disp,data = mtcars[mtcars$am == "manual",])
predict(mod2,data[data$am == "manual",])
...and the output:
> data <- data.frame(am = c(1,1,0),
+ disp = c(157,248,300))
> data$am <- factor(data$am,labels = c("automatic","manual"))
> mod1 <- lm(mpg ~ disp + am + am * disp,data=mtcars)
> predict(mod1,data)
1 2 3
23.59711 18.22461 16.88199
> mod2 <- lm(mpg ~ disp,data = mtcars[mtcars$am == "manual",])
> predict(mod2,data[data$am == "manual",])
1 2
23.59711 18.22461
We subset the data prior to predict() for the split model in order to generate predictions only for observations that had manual transmissions. Since the predictions match, this shows that building separate models by transmission type is no different from a fully specified model that includes both the categorical am effect and the am * disp interaction.
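Coming back to the original question, the same split() idea applies directly. A sketch, assuming the data frame is called df and the grouping column is named Region with values "BC" and "NZ" (adjust the names to match your data):
# split into a named list of data frames, one per region
region_list <- split(df, df$Region)
bc_data <- region_list[["BC"]]
nz_data <- region_list[["NZ"]]
# equivalently, with base subsetting
bc_data <- df[df$Region == "BC", ]
nz_data <- df[df$Region == "NZ", ]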
I have the following set of data: https://archive.ics.uci.edu/ml/datasets/abalone
I am trying to plot a regression for the whole weight against the diameter.
A scatter plot of the data is clearly not a linear function. (I am unable to attach it for some reason.)
Consider a quadratic regression model. I set it up like so:
abalone <- read.csv("abalone.data")
diameter <- abalone$Diameter
diameter2 <- diameter^2
whole <- abalone$Whole.weight
quadraticModel <- lm( whole ~ diameter + diameter2)
This is fine and gives me the following when calling quadraticModel:
Call:
lm(formula = whole ~ diameter + diameter2)
Coefficients:
(Intercept) diameter diameter2
0.3477 -3.3555 10.4968
However, when I plot:
abline(quadraticModel)
I get the following warning:
Warning message:
In abline(quadraticModel) :
only using the first two of 3 regression coefficients
which means that I am getting a straight line plot which isn't what I am aiming for. Can someone please explain to me why this is happening and possible ways around it? I am also having the same issue with cubic plots etc. (They always just plot the first two coefficients.)
You cannot use abline() to plot a fitted polynomial regression. Try this:
# sort the x values and reorder the fitted values to match,
# so the curve is drawn from left to right
x <- sort(diameter)
y <- quadraticModel$fitted.values[order(diameter)]
lines(x, y)
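Alternatively, you can predict over a regular grid of diameters and draw that; a sketch assuming quadraticModel and the scatter plot from the question already exist:
grid <- seq(min(diameter), max(diameter), length.out = 200)
pred <- predict(quadraticModel,
                newdata = data.frame(diameter = grid, diameter2 = grid^2))
lines(grid, pred, col = "red")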
A linear model in diameter and the squared diameter is in fact a quadratic fit, but building the squared term by hand makes prediction and plotting awkward. Try poly() instead:
library(stats)
df <- read.csv("abalone.data")
var_names <-
c(
"Sex",
"Length",
"Diameter",
"Height",
"Whole_weight",
"Shucked_weight",
"Viscera_weight",
"Shell_weight",
"Rings"
)
colnames(df) <- var_names
fit <- lm(df$Whole_weight ~ poly(df$Diameter, 2))
summary(fit)
diameter <- df$Diameter
predicted_weight <- fitted(fit)   # fitted values at the observed diameters
plot(diameter, predicted_weight)
> summary(fit)
Call:
lm(formula = df$Whole_weight ~ poly(df$Diameter, 2))
Residuals:
Min 1Q Median 3Q Max
-0.66800 -0.06579 -0.00611 0.04590 0.97396
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.828818 0.002054 403.44 <2e-16 ***
poly(df$Diameter, 2)1 29.326043 0.132759 220.90 <2e-16 ***
poly(df$Diameter, 2)2 8.401508 0.132759 63.28 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1328 on 4173 degrees of freedom
Multiple R-squared: 0.9268, Adjusted R-squared: 0.9267
F-statistic: 2.64e+04 on 2 and 4173 DF, p-value: < 2.2e-16
I am performing regression analysis and trying to find the best-fitting model for the diamonds dataset in ggplot2. Using price (response variable) vs carat, I fit linear, quadratic, and cubic regressions. The line is not the best fit. I realize the logarithmic trendline from Excel fits best, but I couldn't figure out how to code the logarithmic fit in R. Can anyone help?
Comparing Price vs Carat
model<-lm(price~carat, data = diamonds)
Model 2 uses the polynomial to compare
model2<-lm(price~carat + I(carat^2), data = diamonds)
use cubic in model3
model3 <- lm(price~carat + I(carat^2) + I(carat^3), data = diamonds)
How can I code the log in R to get the same result as Excel?
y = 0.4299ln(x) - 2.5495
R² = 0.8468
Thanks!
The result you report from Excel, y = 0.4299 ln(x) - 2.5495, does not contain any polynomial or cubic terms, so what are you trying to do? price is very skewed, and as with variables such as income it is common practice to take its log. This also reproduces the R² you are referring to, but with very different coefficients for the intercept and the carat parameter.
m1 <- lm(log(price) ~ carat, data = diamonds)
summary(m1)
Call:
lm(formula = log(price) ~ carat, data = diamonds)
Residuals:
Min 1Q Median 3Q Max
-6.2844 -0.2449 0.0335 0.2578 1.5642
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.215021 0.003348 1856 <2e-16 ***
carat 1.969757 0.003608 546 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3972 on 53938 degrees of freedom
Multiple R-squared: 0.8468, Adjusted R-squared: 0.8468
F-statistic: 2.981e+05 on 1 and 53938 DF, p-value: < 2.2e-16
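If the goal is literally the functional form Excel reports (price as a linear function of ln(carat)), that model can also be fitted directly. A sketch; note the coefficients are then on the raw price scale, so they will not equal the Excel numbers quoted in the question:
m2 <- lm(price ~ log(carat), data = diamonds)
summary(m2)
# drawn with ggplot2
library(ggplot2)
ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 0.1) +
  stat_smooth(method = "lm", formula = y ~ log(x))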
How can you get R's glm() to match polynomial data? I've tried several iterations of 'family=AAA(link="BBB")' but I can't seem to get trivial predictions to match.
For example, consider this simple set of polynomial data:
x=seq(-6,6,2)
y=x*x
parabola=data.frame(x,y)
plot(parabola)
model=glm(y~x,dat=parabola)
test=data.frame(x=seq(-5,5,2))
test$y=predict(model,test)
plot(test)
The plot(parabola) looks as expected, but I can't find the incantation of glm() that will make plot(test) look parabolic.
I think you need to step back and start to think about a model and how you represent this in R. In your example, y is a quadratic function of x, so you need to include x and x^2 in the model formula, i.e. as predictors you need to estimate the effect of x and x^2 on the response given the data to hand.
If y is Gaussian, conditional upon the model, then you can do this with lm() and either
y ~ x + I(x^2)
or
y ~ poly(x, 2)
In the first, we wrap the quadratic term in I() because the ^ operator has a special meaning (not its mathematical one) in an R model formula. The latter version uses orthogonal polynomials, so the x and x^2 terms won't be correlated, which can help with fitting; however, in some cases interpreting the coefficients is trickier with poly().
Putting it all together we have (note that I add some random error to y so as to not predict it perfectly as the example I use is more common in reality):
x <- seq(-6, 6, 2)
y <- x^2 + rnorm(length(x), sd = 2)
parabola <- data.frame(x = x, y = y)
mod <- lm(y ~ poly(x, 2), data = parabola)
plot(parabola)
lines(fitted(mod) ~ x, data = parabola, col = "red")
The plot produced is:
An additional issue is whether y is Gaussian. If y can't be negative (e.g. a count), and/or is discrete, modelling using lm() is going to be wrong. That's where glm() might come in: you might fit a curve without needing x^2 (although if the data really are a parabola, then x on its own isn't going to fit the response), as there is an explicit transformation of the data from the linear predictor onto the scale of the response.
It is better to think about the properties of the data and the sort of model you want to fit, and then build up the degree of polynomial within that modelling framework, rather than jumping in and trying various incantations simply to curve-fit the data.
The match is now perfect. A slightly more interesting parabola:
x <- seq(-16, 16, 2)
y <- 4*x*x + 10*x + 6
parabola <- data.frame(x, y)
plot(parabola)
model <- lm(y ~ poly(x, 2), data = parabola)
summary(model)
test <- data.frame(x = seq(-15, 15, 2))
test$y <- predict(model, test)
points(test, pch = 3)
An amateur (like me) might expect the coefficients of the model to be (4, 10, 6), matching 4*x*x + 10*x + 6:
Call:
lm(formula = y ~ poly(x, 2), data = parabola)
Residuals:
Min 1Q Median 3Q Max
-3.646e-13 -8.748e-14 -3.691e-14 4.929e-14 6.387e-13
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.900e+02 5.192e-14 7.511e+15 <2e-16 ***
poly(x, 2)1 4.040e+02 2.141e-13 1.887e+15 <2e-16 ***
poly(x, 2)2 1.409e+03 2.141e-13 6.581e+15 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.141e-13 on 14 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 2.343e+31 on 2 and 14 DF, p-value: < 2.2e-16
Why would the coefficients be (390,404,1409)?
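Those are the coefficients of the orthogonal polynomial basis that poly() builds by default (the point made in the earlier answer about orthogonal polynomials), not the coefficients of x and x^2 themselves. A quick sketch with the same parabola shows how to recover the raw-scale coefficients using raw = TRUE:
model_raw <- lm(y ~ poly(x, 2, raw = TRUE), data = parabola)
coef(model_raw)
# returns (up to tiny numerical error) 6, 10 and 4:
# the intercept, the x coefficient and the x^2 coefficient of y = 4*x^2 + 10*x + 6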