Adding a blocking factor in stat_poly_eq() - r

I'm fitting a linear regression with lm(), like
model <- lm(y ~ x + a, data = dat)
where a is a blocking variable with multiple factor levels.
summary(model)
Call:
lm(formula = y ~ x + a, data = dat)
Residuals:
Min 1Q Median 3Q Max
-1.45006 -0.20737 0.04593 0.26337 0.91628
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.704042 1.088024 -7.081 1.08e-10 ***
x 0.248889 0.036436 6.831 3.81e-10 ***
a1 0.002695 0.150530 0.018 0.98575
a2 0.491749 0.152378 3.227 0.00162 **
a3 0.349772 0.145024 2.412 0.01740 *
a4 -0.009058 0.138717 -0.065 0.94805
a5 0.428085 0.128041 3.343 0.00111 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4505 on 119 degrees of freedom
Multiple R-squared: 0.4228, Adjusted R-squared: 0.3937
F-statistic: 14.53 on 6 and 119 DF, p-value: 2.19e-12
I'm trying to display the same equation and R² that summary(model) reports when plotting the raw data and the regression line with ggplot. But because I'm not actually providing a, it is not taken into account in the fit done by stat_poly_eq():
ggplot(data = dat, aes(x, y)) +
  geom_point() +
  geom_abline(slope = coef(model)[2], intercept = coef(model)[1], color = 'red') +
  stat_poly_eq(data = dat, formula = y ~ x,
               aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")),
               parse = TRUE, size = 3, colour = "red")
Naturally, because lm() and stat_poly_eq() are fitting different models, the resulting parameter estimates and R² differ.
Is it possible to include the blocking variable in stat_poly_eq and if so, how?

With factor a having six levels, you have fitted six parallel lines, so it does not make much sense to show only one line and one equation. If factor a describes blocks, then fitting a mixed-effects model with lme() is possible, and it will give you a single estimate for the line. You also have to consider the contrasts used by default in R: the first level of a (here a0) is the reference, so the line plotted in your example is the one for block a0 and is not valid for the dataset as a whole.
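A minimal sketch of that mixed-effects approach, assuming block a enters as a random intercept:

library(nlme)

# One fixed line for the whole dataset; block-to-block variation
# is absorbed by a random intercept for each level of a
model_lme <- lme(y ~ x, random = ~ 1 | a, data = dat)
fixef(model_lme)  # a single intercept and slope estimate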
stat_poly_eq() supports only lm(). It works in the same way as stat_smooth(method = "lm"), as it is intended to be used together with it. If you are fitting the model outside of ggplot, you will need to build a suitable label manually using plotmath syntax and add it in an annotation layer with annotate(geom = "text", x = 27, y = 1, label = "<your string>", parse = TRUE). To create the string that I show with the placeholder <your string>, extract the coefficient estimates in the same way as you do in geom_abline() in your plot example, and use paste() or sprintf() to assemble the equation. You can also use coef() with a model fitted with lme().
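A minimal sketch of that manual-label approach, reusing the lm() fit from the question (the x = 27, y = 1 placement is just the placeholder position from the annotate() call above; the label shows the reference-block line):

cf <- coef(model)
r2 <- summary(model)$r.squared
# Build a plotmath string; sign handling for a negative slope is ignored for brevity
lbl <- sprintf("list(italic(y) == %.3f + %.3f~italic(x), italic(R)^2 == %.2f)",
               cf[1], cf[2], r2)
ggplot(dat, aes(x, y)) +
  geom_point() +
  geom_abline(slope = cf[2], intercept = cf[1], colour = "red") +
  annotate(geom = "text", x = 27, y = 1, label = lbl,
           parse = TRUE, size = 3, colour = "red")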
Other statistics in package 'ggpmisc' let you fit a model with lme(), but you would still need to assemble the label manually. If you will be producing many plots, you may find it worthwhile checking the User Guide of package 'ggpmisc' for the details.

Related

Fitting an exponential curve through scatterplot

I am starting to use R and have a bit of a problem.
I have a dataset called ADC_dark containing 20 points where leaf temperature and respiration are measured.
I expect an exponential relationship, where an increase in leaf temperature results in increased respiration.
I plotted an exponential curve through the data:
ADC_dark %>%
  ggplot(aes(x = Tleaf, y = abs_A)) +
  geom_point() +
  stat_smooth(method = 'lm', formula = log(y) ~ x) +
  labs(title = "Respiration and leaf temperature", x = "Tleaf", y = "abs_A")
This is not looking very good. The formula matching this line is y = -2.70206 * e^(0.11743*x)
Call:
lm(formula = log(ADC_dark$abs_A) ~ ADC_dark$Tleaf)
Residuals:
Min 1Q Median 3Q Max
-2.0185 -0.1059 0.1148 0.2698 0.6825
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.70206 0.51255 -5.272 5.18e-05 ***
ADC_dark$Tleaf 0.11743 0.02161 5.435 3.66e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5468 on 18 degrees of freedom
Multiple R-squared: 0.6213, Adjusted R-squared: 0.6003
F-statistic: 29.54 on 1 and 18 DF, p-value: 3.659e-05
When I use the same data in Excel, I get a different curve.
As you can see, the intercepts of these suggested exponential relationships differ.
Just looking at the pictures, I would say that Excel is doing a better job.
How can I 'train' R to fit a better curve through my data, or am I misinterpreting something?
The problem is that when you fit inside ggplot2's stat_smooth() using log(y) ~ x, the scales of your data points and of the fitted line end up different. Basically, you plot y and log(y) on the same y scale, and since y > log(y) for any positive y, the fitted line is shifted below your data points.
You have several options, such as tweaking the axes and scales, or simply using glm (a generalized linear model) with a log link instead of lm. In that case the scales are preserved, with no additional tweaking:
library(ggplot2)
library(dplyr)

set.seed(123)
x <- 1:20
ADC_dark <- data.frame(Tleaf = x,
                       abs_A = exp(0.11 * x - 2.7 + rnorm(20) / 10))

ADC_dark %>%
  ggplot(aes(x = Tleaf, y = abs_A)) +
  geom_point() +
  geom_smooth(method = "glm", formula = y ~ x,
              method.args = list(family = gaussian(link = "log"))) +
  labs(title = "Respiration and leaf temperature", x = "Tleaf", y = "abs_A")
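For completeness, here is a sketch of the other option mentioned above (tweaking the scales). ggplot2 applies scale transformations before the smoother is computed, so with a log10 y axis an ordinary lm() smoother lines up with the points:

ADC_dark %>%
  ggplot(aes(x = Tleaf, y = abs_A)) +
  geom_point() +
  stat_smooth(method = 'lm', formula = y ~ x) +  # fitted on the log10 scale
  scale_y_log10()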

In R, the output of my linear model shows a positive correlation but my ggplot graph indicates a negative correlation?

I'm trying to identify how Sycamore_biomass affects the day on which a bird lays its first egg (First_egg). My model output indicates a weak positive relationship, i.e. as sycamore biomass increases, the day of the first egg being laid should increase (i.e. should be later). Note that I'm including confounding factors in this model:
Call:
lm(formula = First_egg ~ Sycamore_biomass + Distance_to_road +
Distance_to_light + Anthropogenic_cover + Canopy_cover, data = egglay_date)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.61055 16.21391 2.443 0.0347 *
Sycamore_biomass 0.15123 0.53977 0.280 0.7851
Distance_to_road 0.01773 0.46323 0.038 0.9702
Distance_to_light -0.02626 0.44225 -0.059 0.9538
Anthropogenic_cover -0.13879 0.28306 -0.490 0.6345
Canopy_cover -0.30219 0.20057 -1.507 0.1628
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 12.99 on 10 degrees of freedom
Multiple R-squared: 0.2363, Adjusted R-squared: -0.1455
F-statistic: 0.6189 on 5 and 10 DF, p-value: 0.6891
However, when I plot this using ggplot, the regression line indicates a negative relationship. Can anyone help me out with what is happening here?
ggplot(egglay_date, aes(x = Sycamore_biomass, y = First_egg)) +
  geom_point(shape = 19, alpha = 1/4) +
  geom_smooth(method = lm)
[Plot: Sycamore biomass against first egg date, with the lm smooth]
I suppose this is because you are looking at the raw data you fed into the model, not at the model predictions. In the plot you don't "isolate" a single predictor; you look at the result of all predictors acting on the response variable together. The effect of this predictor is presumably "overshadowed" by the effects of the other predictors.
To look at the effect of solely one predictor, you need to predict new values from the model while fixing all other predictors. You can try something along the lines of:
preds <- predict(yourmodel, newdata = data.frame(
  Sycamore_biomass = 0:25,
  Distance_to_road = mean(egglay_date$Distance_to_road),
  Distance_to_light = mean(egglay_date$Distance_to_light),
  Anthropogenic_cover = mean(egglay_date$Anthropogenic_cover),
  Canopy_cover = mean(egglay_date$Canopy_cover)))

new_data <- data.frame(Sycamore_biomass = 0:25,
                       First_egg = preds)

ggplot(new_data, aes(x = Sycamore_biomass, y = First_egg)) +
  geom_point(shape = 19, alpha = 1/4) +
  geom_smooth(method = lm)
This should give you the predictions of your model when only considering the effect of the one predictor.
The answer to your question is quite simple (but I understand why it may seem complex at first).
First off, your model indicates a positive relationship because you have included all your other variables. Keep in mind that the best-fit line through your data is the one that minimises the sum of squared residuals over all data points (and, as a consequence of fitting an intercept, the residuals themselves sum to zero).
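A quick sketch illustrating that property, using the mtcars data introduced below:

fit <- lm(mpg ~ cyl, data = mtcars)
sum(residuals(fit))    # ~0: residuals sum to zero when an intercept is fitted
sum(residuals(fit)^2)  # the quantity the least-squares fit actually minimises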
Since you didn't provide your data (please do in future posts, or at least provide something to work with), I will illustrate my point with the mtcars data built into R:
data("mtcars")
df <- mtcars
This dataset has many variables; to see them all, just type names(df).
Let's just work with three of them, to see if miles per gallon (mpg) is explained by:
1) cyl: number of cylinders
2) hp: horsepower
3) drat: rear axle ratio
model <- lm(mpg ~ cyl + hp + drat, data = df)
summary(model)
Let's say I just want to plot the relationship between cylinders and mpg (for you, it would be Sycamore_biomass and First_egg). Here, from our model summary, we see that the relationship is negative (a negative estimate, a.k.a. coefficient) and that the intercept is 22.5.
So I do what you just did and plot mpg ~ cyl on its own (without considering my other variables):
plot(mpg ~ cyl, data = df, pch = 15, col = "blue", cex = 2, cex.axis = 2,
     ylab = "MPG", xlab = "Number of Cylinders", cex.lab = 1.5)
abline(lm(mpg ~ cyl, data = df), lwd = 2, col = "red")
First off, we see that the y-intercept is not 22.5, but rather above 25.
If I were to do the math from the first model, with 4 cylinders I should predict:
22.51406 + (4 * -1.3606) = 17.07
So let's see if our prediction is correct on our graph.
Definitely not.
So let's run a new model (which is what you need to do), where we model just mpg ~ cyl:
reduced_model <- lm(mpg ~ cyl, data = df)
summary(reduced_model)
See how the intercept and coefficient (the estimates) changed? Yours will too when you run a reduced model. Let's see if the plot now makes sense, following the same steps as above and predicting for 4 cylinders:
37.8846 + (4 * -2.8758)  # 26.38
plot(mpg ~ cyl, data = df, pch = 15, col = "blue", cex = 2, cex.axis = 2,
     ylab = "MPG", xlab = "Number of Cylinders", cex.lab = 1.5)
abline(lm(mpg ~ cyl, data = df), lwd = 2, col = "red")
abline(h = 26.38, v = 4, lwd = 2, col = "green")
Looks like everything checks out.
Summary: you need to run a simple model with just your two variables of interest if you want to interpret your plot correctly.

Estimate SE for all factor levels with zero-inflated model

I have a fairly complicated ZINB model. I have tried to replicate the basic structure of what I'm trying to do:
library(pscl)

MyDat <- cbind.data.frame(
  fac1 = rep(c("A", "B", "C", "D"), 10),
  fac2 = c(rep("X", 20), rep("Y", 20)),
  offset = c(runif(20, 50, 60), runif(20, 150, 165)),
  fac3 = rep(c(rep("a1", 4), rep("a2", 4), rep("a3", 4), rep("a4", 4), rep("a5", 4)), 2),
  Y = c(0,0,0,1,0,0,11,10,0,0,0,5,0,0,0,35,60,0,0,0,
        0,2,0,0,16,0,0,0,0,0,3,88,0,0,0,0,0,0,27,0))

f <- formula(Y ~ fac1 + offset(log(offset)) | fac3 + fac2)
ZINB <- zeroinfl(f, dist = "negbin", link = "logit", data = MyDat)
summary(ZINB)
The primary goal of this model is to look at the effect of fac1 across its four levels. The other variables are really just artifacts of the sampling process.
Here is the output:
Call:
zeroinfl(formula = f, data = MyDat, dist = "negbin", link = "logit")
Pearson residuals:
Min 1Q Median 3Q Max
-0.418748 -0.338875 -0.265109 -0.001566 2.682920
Count model coefficients (negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.7192 0.9220 -1.865 0.062239 .
fac1B -4.4161 1.4700 -3.004 0.002663 **
fac1C -1.2008 1.2896 -0.931 0.351778
fac1D 0.1928 1.3003 0.148 0.882157
Log(theta) -1.7349 0.4558 -3.806 0.000141 ***
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -11.5899 210.8434 -0.055 0.956
fac3a2 -0.4775 2.4608 -0.194 0.846
fac3a3 -11.2284 427.5200 -0.026 0.979
fac3a4 10.7771 210.8056 0.051 0.959
fac3a5 -0.3135 2.3358 -0.134 0.893
fac2Y 11.8292 210.8298 0.056 0.955
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta = 0.1764
Number of iterations in BFGS optimization: 76
Log-likelihood: -63.82 on 11 Df
I have consulted papers, stats books, and forums, but I'm still not sure how to present this information. What I really want is a bar plot showing the effects on the y-axis and the four levels of fac1 on the x-axis.
If I understand correctly, level A of fac1 is currently set to 0 and is my reference level (please correct me if I'm wrong here). So I can make a plot of the four levels (including level A as zero), but this doesn't seem ideal; I would really like to have 95% CIs for all levels.
I can also use the predict function; however, predict.zeroinfl does not give error estimates, and I'm unsure how to interpret the effect of the offset.
Similar papers have just put a boxplot of the original data next to a boxplot of the predictions and compared them. I feel like I should be able to do better.
Below is the code and plot to create the predicted values:
MyDat$phat <- predict(ZINB, type = "response")
MyDat$phat_os <- MyDat$phat / MyDat$offset
plot(phat ~ fac1, MyDat)
[Plot: predicted values by level of fac1]
Is bootstrapping the way to go? I have tried this and run into all kinds of trouble for something I'm not sure is necessary.
Thank you in advance, and please go easy on me if I'm making a silly oversight/assumption. I'm still learning, but these stats feel a bit out of my reach.
For starters, you can plot the model coefficients with their confidence intervals. The arm package has the coefplot function, but it doesn't have a method for zeroinfl models, so I've created a simple coefficient plot below using ggplot2. The predict method for zeroinfl models doesn't provide confidence intervals for predictions, but this answer to a question on CrossValidated shows how to construct bootstrapped confidence intervals for zeroinfl models.
Regarding the levels of fac1: A is the reference level, so the coefficients for the other levels are relative to fac1 = "A".
library(pscl)
library(ggplot2)

MyDat <- cbind.data.frame(
  fac1 = rep(c("A", "B", "C", "D"), 10),
  fac2 = c(rep("X", 20), rep("Y", 20)),
  offset = c(runif(20, 50, 60), runif(20, 150, 165)),
  fac3 = rep(c(rep("a1", 4), rep("a2", 4), rep("a3", 4), rep("a4", 4), rep("a5", 4)), 2),
  Y = c(0,0,0,1,0,0,11,10,0,0,0,5,0,0,0,35,60,0,0,0,
        0,2,0,0,16,0,0,0,0,0,3,88,0,0,0,0,0,0,27,0))

f <- formula(Y ~ fac1 + offset(log(offset)) | fac3 + fac2)
ZINB <- zeroinfl(f, dist = "negbin", link = "logit", data = MyDat)

# Extract coefficients and standard errors from the count-model summary
coefs <- as.data.frame(summary(ZINB)$coefficients$count[, 1:2])
names(coefs)[2] <- "se"
coefs$vars <- rownames(coefs)

# Coefficient plot
ggplot(coefs, aes(vars, Estimate)) +
  geom_hline(yintercept = 0, lty = 2, lwd = 1, colour = "grey50") +
  geom_errorbar(aes(ymin = Estimate - 1.96*se, ymax = Estimate + 1.96*se),
                lwd = 1, colour = "red", width = 0) +
  geom_errorbar(aes(ymin = Estimate - se, ymax = Estimate + se),
                lwd = 2.5, colour = "blue", width = 0) +
  geom_point(size = 4, pch = 21, fill = "yellow") +
  theme_bw()
And here's what the plot looks like.
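If you do want bootstrapped intervals, here is a rough sketch of one common approach (an assumption on my part: nonparametric resampling of rows, in the spirit of the CrossValidated answer mentioned above; refits can fail or drop levels on some resamples, which the code guards against):

set.seed(1)
B <- 500
nm <- c("count_(Intercept)", "count_fac1B", "count_fac1C", "count_fac1D")
boot_coefs <- replicate(B, {
  i <- sample(nrow(MyDat), replace = TRUE)
  fit <- try(zeroinfl(f, dist = "negbin", link = "logit",
                      data = MyDat[i, ]), silent = TRUE)
  if (inherits(fit, "try-error")) setNames(rep(NA_real_, 4), nm)
  else coef(fit)[nm]  # indexing by name keeps length 4 even if a level drops out
})
# Percentile 95% CIs for the count-model coefficients
apply(boot_coefs, 1, quantile, probs = c(0.025, 0.975), na.rm = TRUE)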

R: Finding the coefficients of an expression which produce the largest R-squared value?

Let's say I've got data entered into a data frame like so:
df <- data.frame(x = c(1, 2, 3, 4, 5, 10, 15, 25, 50),
                 y = c(.57, .75, .82, .87, .89, .95, .97, .98, .99))
df
and I wish to fit the expression:
y = ((x/a)^b)/(1+(x/a)^b)
where a and b are unknown parameters.
I have plotted the points and drawn a fitted line by guessing the values of a and b:
library(ggplot2)
a <- 0.50
b <- 1.00
guesstimate <- function(x) ((x/a)^b) / (1 + (x/a)^b)
graph <- ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  stat_function(fun = guesstimate)
graph
However, I'd like to find the values of a and b that produce the highest R² value, i.e. the best possible mathematical fit to the data.
Question:
Short of manually guessing values for a and b and checking the fit by eye, is there a way to get R to find the 'best' a and b values, along with an R-squared value confirming that the chosen a and b really are the best possible fit?
You can use the nls() (non-linear least squares) function:
m1 <- nls(y ~ (x/a)^b / (1 + (x/a)^b), start = list(a = 1, b = 1), data = df)
summary(m1)
Formula: y ~ (x/a)^b/(1 + (x/a)^b)
Parameters:
Estimate Std. Error t value Pr(>|t|)
a 0.779291 0.009444 82.51 1.01e-11 ***
b 1.145174 0.012733 89.94 5.53e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.003086 on 7 degrees of freedom
Number of iterations to convergence: 4
Achieved convergence tolerance: 5.949e-08
ggplot(df, aes(x, y)) +
  geom_point() +
  geom_line(data = data.frame(x = seq(1, 50, 0.1),
                              y = predict(m1, newdata = list(x = seq(1, 50, 0.1)))),
            aes(x, y), colour = "red")
nls does not provide an r-squared value, because, as discussed in this thread on R-help, r-squared is not necessarily meaningful for a non-linear model. nls does, however, find the parameter values that minimize the residual sum-of-squares, so in that sense these parameters provide the best fit for the given data and model. That doesn't mean that there isn't another model specification that gives a better fit, though in this case the model fit is virtually perfect.
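That said, if you want a descriptive pseudo-R² for the nls fit anyway (with the caveats above in mind), you can compute one by hand:

# 1 - RSS/TSS, computed manually; treat it as descriptive only for nls fits
rss <- sum(residuals(m1)^2)
tss <- sum((df$y - mean(df$y))^2)
1 - rss / tss  # essentially 1 here, since the fit is nearly perfect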
Even if it is not obvious, a linear model can be applied here using just basic algebra. Starting from 1/y = (1 + (x/a)^b)/(x/a)^b and rearranging a little, you arrive at:
log(1/y - 1) = -b*log(x) + b*log(a)
which is a linear model in the variables Y = log(1/y - 1) and X = log(x). From here, you can use lm():
df2 <- data.frame(Y = log(1/df$y - 1), X = log(df$x))
coeffs <- lm(Y ~ X, data = df2)$coefficients
a <- exp(-coeffs[1] / coeffs[2])
# 0.7491387
b <- -coeffs[2]
# 1.116111
which are similar to those obtained with nls.

Polynomial data and R's glm()

How can you get R's glm() to fit polynomial data? I've tried several iterations of family = AAA(link = "BBB"), but I can't seem to get trivial predictions to match.
For example, consider this attempt to fit a parabola:
x <- seq(-6, 6, 2)
y <- x * x
parabola <- data.frame(x, y)
plot(parabola)

model <- glm(y ~ x, data = parabola)

test <- data.frame(x = seq(-5, 5, 2))
test$y <- predict(model, test)
plot(test)
The plot(parabola) looks as expected, but I can't find the incantation of glm() that will make plot(test) look parabolic.
I think you need to step back and start to think about a model and how you represent this in R. In your example, y is a quadratic function of x, so you need to include x and x^2 in the model formula, i.e. as predictors you need to estimate the effect of x and x^2 on the response given the data to hand.
If y is Gaussian, conditional upon the model, then you can do this with lm() and either
y ~ x + I(x^2)
or
y ~ poly(x, 2)
In the first, we wrap the quadratic term in I() because the ^ operator has a special meaning (not its mathematical one) in an R model formula. The latter version gives orthogonal polynomials, so the x and x^2 terms won't be correlated, which can help with fitting; however, in some cases interpreting the coefficients is trickier with poly().
Putting it all together, we have (note that I add some random error to y so as not to fit it perfectly, since noisy data are more common in reality):
x <- seq(-6, 6, 2)
y <- x^2 + rnorm(length(x), sd = 2)
parabola <- data.frame(x = x, y = y)
mod <- lm(y ~ poly(x, 2), data = parabola)
plot(parabola)
lines(fitted(mod) ~ x, data = parabola, col = "red")
[Plot: the noisy parabola with the fitted quadratic shown in red]
An additional issue is whether y is Gaussian. If y can't be negative (e.g. a count) and/or is discrete, modelling with lm() is going to be wrong. That's where glm() might come in: there, an explicit transformation links the linear predictor to the scale of the response, so you might fit a curve without needing x^2 (although if the data really are a parabola, then x on its own isn't going to fit the response).
It is better to think about the properties of the data and the sort of model you want to fit, and then build up the degree of the polynomial within that modelling framework, rather than jumping in and trying various incantations to simply curve-fit the data.
The match is now perfect. A slightly more interesting parabola:
x <- seq(-16, 16, 2)
y <- 4*x*x + 10*x + 6
parabola <- data.frame(x, y)
plot(parabola)

model <- lm(y ~ poly(x, 2), data = parabola)
summary(model)

test <- data.frame(x = seq(-15, 15, 2))
test$y <- predict(model, test)
points(test, pch = 3)
An amateur (like me) might expect the coefficients of the model to be (4, 10, 6), to match 4*x*x + 10*x + 6:
Call:
lm(formula = y ~ poly(x, 2), data = parabola)
Residuals:
Min 1Q Median 3Q Max
-3.646e-13 -8.748e-14 -3.691e-14 4.929e-14 6.387e-13
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.900e+02 5.192e-14 7.511e+15 <2e-16 ***
poly(x, 2)1 4.040e+02 2.141e-13 1.887e+15 <2e-16 ***
poly(x, 2)2 1.409e+03 2.141e-13 6.581e+15 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.141e-13 on 14 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 2.343e+31 on 2 and 14 DF, p-value: < 2.2e-16
Why would the coefficients be (390,404,1409)?
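As explained in the answer above, poly() uses orthogonal polynomials by default, so these estimates are expressed in the orthogonal basis (the intercept, for instance, is simply mean(y) = 390), not as coefficients of 1, x and x^2; the fitted curve is identical either way. A quick sketch showing how to recover the familiar parameterisation with raw polynomials:

# raw = TRUE fits the ordinary 1, x, x^2 basis, so the estimates come back
# as 6, 10 and 4 (intercept, x and x^2), matching y = 4*x^2 + 10*x + 6
model_raw <- lm(y ~ poly(x, 2, raw = TRUE), data = parabola)
coef(model_raw)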
