How to easily show the equation behind ggplot's geom_smooth - r

Is there a simple command to show the geom_smooth equation of a non-linear relationship? Something as simple as "show.equation". The equation has to be somewhere; I just want to call the equation used by default.
ggplot(dataset, aes(x=variablex, y=variabley)) +
geom_point()+
geom_smooth()+
theme_bw()

If you look at the documentation for geom_smooth and stat_smooth you can see that it uses stats::loess for small data sets (fewer than 1,000 observations) and mgcv::gam otherwise:
For method = NULL the smoothing method is chosen based on the size of
the largest group (across all panels). stats::loess() is used for less
than 1,000 observations; otherwise mgcv::gam() is used with formula = y ~ s(x, bs = "cs") with method = "REML". Somewhat anecdotally, loess
gives a better appearance, but is O(N^2) in memory, so does not work
for larger datasets.
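geom_smooth does not store a separate equation: loess is a local regression, so there is no single closed-form formula to print. If you want to use the model implied by the geom_smooth fit, you can call the underlying method yourself (e.g. stats::loess(variabley ~ variablex, data = dataset)) and then use the predict method to calculate values for new data.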
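A minimal sketch of that idea, using the built-in cars data as a stand-in for dataset (an assumption, since the original data is not shown). It relies on the defaults of stats::loess, which should correspond to what geom_smooth uses for small data sets; the names fit and new_x are illustrative only.

```r
# Fit the same kind of model geom_smooth() picks for small data sets.
# 'cars' stands in for the asker's dataset (assumption).
fit <- stats::loess(dist ~ speed, data = cars)

# Evaluate the fitted curve on a grid of new x values.
new_x <- data.frame(speed = seq(min(cars$speed), max(cars$speed), length.out = 50))
new_x$dist_hat <- predict(fit, newdata = new_x)

head(new_x)
```

The result is a table of fitted values rather than an equation, which is all a loess fit can offer.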

Related

Scenario development with GAM models

I'm working with a mgcv::gam model in R to generate predictions in which the relationship between time (year) and the outcome variable (out) varies. For example, in one scenario, I'd like to force time to affect the outcome variable in a linear manner, in another a marginally decreasing manner, and in another, I'd like to specify specific slopes of the time-outcome interaction. I'm unsure how to force the prediction to treat the interaction between time and the outcome variable in a specific manner:
res <- gam(out ~ s(time) + s(GEOID, bs='re'), data = df, method = "REML")
pred <- predict(res, newdata = ndf, type = "response", se.fit = TRUE)
There isn't an interaction between time and out; here time has a potentially non-linear effect on out.
Are we talking about trying to force certain shapes for the function of time? If so, you will need to estimate different models; use time if you want a linear effect:
res_lin <- gam(out ~ time + s(GEOID, bs='re'), data = df, method = "REML")
and look at shape constrained P-splines to enforce monotonicity or concave/convex relationships.
The scam package has these sorts of constraints and uses mgcv with GCV smoothness selection to fit the shape constrained models.
As for specifying a specific slope for the linear effect of time, I think you'll need to include time as an offset in the model. An offset has, by definition, a coefficient fixed at 1, so if the slope you want is 0.5, I think you need to add + offset(I(0.5*time)) to the formula in place of the time term. I would double check this though, as I might have messed up my thinking here.
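A rough way to check the offset idea on simulated data (the data, the true slope of 0.5, and the names res_fixed and slope are all made up for illustration). Because the offset is written in the formula, predict should carry it through to new data, so the predicted change per unit of time ought to equal the imposed slope.

```r
library(mgcv)

set.seed(1)
n <- 300
df <- data.frame(
  time  = runif(n, 0, 10),
  GEOID = factor(sample(letters[1:10], n, replace = TRUE))
)
geo_eff <- rnorm(10)  # simulated random intercept per GEOID
df$out <- 2 + 0.5 * df$time + geo_eff[as.integer(df$GEOID)] + rnorm(n, sd = 0.3)

# Slope of time is fixed at 0.5 through the offset; nothing is estimated for time.
res_fixed <- gam(out ~ offset(0.5 * time) + s(GEOID, bs = "re"),
                 data = df, method = "REML")

# Predict for one GEOID at two time points and recover the slope.
ndf <- data.frame(time = c(0, 10),
                  GEOID = factor("a", levels = levels(df$GEOID)))
pred <- as.numeric(predict(res_fixed, newdata = ndf, type = "response"))
slope <- (pred[2] - pred[1]) / 10
slope
```

Changing the constant inside offset() gives a different scenario without refitting any smooth of time.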

Scatterplot for multiple regression results in R

I am trying to find a way to get a scatterplot in R of actual values vs. regressed values. Example:
fit = lm(y ~ a + x + z)
I get the results y ~ 2*a + 3*x - 7*z + 4
Now how do I make a scatterplot of y against 2*a + 3*x - 7*z + 4, with a trendline?
(And, by the way, I tried the plot() function. It didn't seem to have what I need)
Look at plot(fit), or the help for lm, which you can access using ?lm.
From your question, it sounds like you want to plot your actual values against the fitted values. The plot method for lm draws diagnostic plots out of the box (residuals against fitted values, among others), which may already be close to what you need.
You could always build it yourself, say in ggplot2, by accessing the fitted values. Check out your object using str(fit) to see all of the data that is captured during the regression.
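Building it yourself could look roughly like this in base graphics. The data frame d is simulated to resemble the coefficients in the question (an assumption, as the real data is not shown); fitted(fit) returns the regressed values.

```r
set.seed(1)
n <- 100
d <- data.frame(a = rnorm(n), x = rnorm(n), z = rnorm(n))
d$y <- 2 * d$a + 3 * d$x - 7 * d$z + 4 + rnorm(n)  # simulated response

fit <- lm(y ~ a + x + z, data = d)
y_hat <- fitted(fit)  # the regressed (fitted) values

# Observed values against fitted values, with a trend line.
plot(y_hat, d$y, xlab = "Fitted values", ylab = "Observed y")
abline(lm(d$y ~ y_hat), col = "red")  # trend line through the points
abline(0, 1, lty = 2)                 # reference line: observed == fitted
```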

Loess regression goes to negative values

I'm currently trying to fit a loess regression to my dataset (latitudinal distribution of biomasses). I used the following code:
ggplot(data=test)+
geom_point(aes(y=log10(value+1), x=lat, colour=variable), alpha=0.5)+
stat_smooth(aes(y=log10(value+1), x=lat, colour=variable, fill=variable), size=1, alpha=0.1)+
scale_y_continuous("Depth-integrated biomass (mgC.m-2)")+
scale_x_continuous("Latitude", limits=c(-70, 80), breaks=seq(-70, 80, 10))+
coord_flip()+
theme_bw()+
theme(legend.background = element_rect(colour = "black"))
The problem is that the regression goes below 0 while I have no values below 0...
Is there a way to force the regression not to cross 0 ?
I tried changing the "span" value; it's better, but some parts of the loess curve still go negative. Setting xlim = c(0, X) was no good since it cut the curves.
Thanks.
The loess methods assume an unbounded distribution, so the fit can easily go below 0 if you have data near 0. One option is to work on the log scale (fit the model to the log of the y-values, then exponentiate the predicted values for plotting, etc.)
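A small sketch of the log-scale approach on simulated, strictly positive data (the variables lat and value are invented to echo the question). If the data contain exact zeros, a shifted transform such as log10(value + 1), as in the question, would be needed instead of a plain log.

```r
set.seed(42)
lat <- seq(-70, 80, length.out = 150)
# Simulated positive "biomass" values, peaking near the equator.
value <- rlnorm(150, meanlog = -(lat / 30)^2, sdlog = 0.5)

# Fit loess to log(value), then back-transform the predictions.
fit_log <- loess(log(value) ~ lat)
pred <- exp(predict(fit_log, newdata = data.frame(lat = lat)))

plot(lat, value, xlab = "Latitude", ylab = "value")
lines(lat, pred, col = "blue")  # back-transformed curve cannot go below 0
```

Because the curve is exponentiated after fitting, it stays positive whatever the span.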
Why would you set xlim if you want to restrict the y-values? Either way, though, xlim and ylim only filter the underlying dataset, so that won't solve your problem. An alternative way to avoid negative values would be to use a different model: a linear regression is less likely to produce negative values inside the observed range if all observed values are positive. Or, maybe something like a logistic regression would be appropriate for your data?
Adding these types of fits to the plot is actually pretty easy: for example, pass method = "glm" and method.args = list(family = binomial) to stat_smooth.

how to plot estimates through model in R

I'm trying to use R to do some modelling. I've started with the BodyWeight dataset, since I've seen some examples online, just to understand and get used to the commands.
I've arrived at my final model, with estimates, and I was wondering how to plot these estimates, but I haven't seen anything online.
Is there a way to plot the estimates as a line, with dots for the values of each observation?
Where can I find information about how to do this? Do I have to extract the values myself, or is it possible to ask R to plot the estimates of this model?
I'm only starting with R. Any help is welcome.
Thank you
There is no function that just plots the output of a model, since there are usually many different possible ways of plotting the output.
Take a look at the predict function for whatever model type you are using (for example, linear regressions using lm have a predict.lm function).
Then choose a plotting system (you will likely want different panels for different levels of diet, so use either ggplot2 or lattice). Then see if you can describe more clearly in words how you want the plot to look. Then update your question if you get stuck.
Now we've identified which dataset you are using, here's a possible plot:
library(nlme)     # provides lme() and the BodyWeight data
library(ggplot2)
#Run your model
model <- lme(weight ~ Time + Diet, data = BodyWeight, random = ~ 1 | Rat)
summary(model)
#Predict the values
#predict.lme is a pain because you have to specify which rat
#you are interested in, but we don't want that
#manually predicting things instead
times <- seq.int(0, 65, 0.1)
mcf <- model$coefficients$fixed
predicted <-
  mcf["(Intercept)"] +
  rep.int(mcf["Time"] * times, nlevels(BodyWeight$Diet)) +
  rep(c(0, mcf["Diet2"], mcf["Diet3"]), each = length(times))
prediction_data <- data.frame(
  weight = predicted,
  Time = rep.int(times, nlevels(BodyWeight$Diet)),
  Diet = rep(levels(BodyWeight$Diet), each = length(times))
)
#Draw the plot (using ggplot2)
(p <- ggplot(BodyWeight, aes(Time, weight, colour = Diet)) +
  geom_point() +
  geom_line(data = prediction_data)
)
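As a possible shortcut, predict for lme objects accepts level = 0, which should return population-level (fixed-effects only) predictions without naming a rat. The sketch below compares that against the manual calculation; newdata and manual are illustrative names.

```r
library(nlme)

model <- lme(weight ~ Time + Diet, data = BodyWeight, random = ~ 1 | Rat)

# Every combination of time point and diet; no Rat column is supplied.
newdata <- expand.grid(Time = seq(0, 65, by = 0.1),
                       Diet = levels(BodyWeight$Diet))
newdata$weight <- as.numeric(predict(model, newdata = newdata, level = 0))

# Same quantity computed by hand from the fixed effects, for comparison.
mcf <- fixef(model)
diet_shift <- c(0, mcf["Diet2"], mcf["Diet3"])[as.integer(newdata$Diet)]
manual <- unname(mcf["(Intercept)"] + mcf["Time"] * newdata$Time + diet_shift)

max(abs(manual - newdata$weight))  # difference between the two approaches
```

The newdata frame can be passed to geom_line(data = newdata) in place of the hand-built prediction data.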

How can I superimpose modified loess lines on a ggplot2 qplot?

Background
Right now, I'm creating a multiple-predictor linear model and generating diagnostic plots to assess regression assumptions. (It's for a multiple regression analysis stats class that I'm loving at the moment :-)
My textbook (Cohen, Cohen, West, and Aiken 2003) recommends plotting each predictor against the residuals to make sure that:
(1) The residuals don't systematically covary with the predictor
(2) The residuals are homoscedastic with respect to each predictor in the model
On point (2), my textbook has this to say:
Some statistical packages allow the analyst to plot lowess fit lines at the mean of the residuals (0-line), 1 standard deviation above the mean, and 1 standard deviation below the mean of the residuals....In the present case {their example}, the two lines {mean + 1sd and mean - 1sd} remain roughly parallel to the lowess {0} line, consistent with the interpretation that the variance of the residuals does not change as a function of X. (p. 131)
How can I modify loess lines?
I know how to generate a scatterplot with a "0-line,":
# First, I'll make a simple linear model and get its diagnostic stats
library(ggplot2)
data(cars)
mod <- fortify(lm(speed ~ dist, data = cars))
attach(mod)
str(mod)
# Now I want to make sure the residuals are homoscedastic
qplot (x = dist, y = .resid, data = mod) +
geom_smooth(se = FALSE) # "se = FALSE" Removes the standard error bands
But does anyone know how I can use ggplot2 and qplot to generate plots where the 0-line, "mean + 1sd" AND "mean - 1sd" lines would be superimposed? Is that a weird/complex question to be asking?
Apology
Folks, I want to apologize for my ignorance. Hadley is absolutely right, and the answer was right in front of me all along. As I suspected, my question was born of statistical, rather than programmatic ignorance.
We get the 68% Confidence Interval for Free
geom_smooth() defaults to loess smoothing, and it superimposes a confidence band around the fit as part of the deal. That's what Hadley meant when he said "Isn't that just a 68% confidence interval?" I just completely forgot that's what the 68% interval is, and kept searching for something that I already knew how to do. It didn't help that I'd actually turned the confidence intervals off in my code by specifying geom_smooth(se = FALSE). One caveat: the band drawn by default is a 95% interval (level = 0.95), so level = 0.68 has to be set to get roughly plus or minus one standard error.
What my Sample Code Should Have Looked Like
# First, I'll make a simple linear model and get its diagnostic stats.
library(ggplot2)
data(cars)
mod <- fortify(lm(speed ~ dist, data = cars))
attach(mod)
str(mod)
# Now I want to make sure the residuals are homoscedastic.
# geom_smooth uses loess here; level = 0.68 requests the 68% band
# (the default is level = 0.95).
qplot(x = dist, y = .resid, data = mod) +
  geom_abline(slope = 0, intercept = 0) +
  geom_smooth(level = 0.68)
What I've Learned
Hadley implemented a really beautiful and simple way to get what I'd wanted all along. But because I was focused on loess lines, I lost sight of the fact that the 68% confidence interval was bounded by the very lines I needed. Sorry for the trouble, everyone.
Could you calculate the +/- standard deviation values from the data and add a fitted curve of them to the plot?
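One way to compute such lines by hand, assuming that the band wanted is the loess fit plus or minus one standard error of the fit (which is a narrower thing than one standard deviation of the residuals). This is a sketch in base R on the cars data; the names d, lo, grid and band are illustrative.

```r
# Residuals from the linear model in the question.
m <- lm(speed ~ dist, data = cars)
d <- data.frame(dist = cars$dist, resid = resid(m))

# Smooth the residuals and ask predict() for standard errors.
lo <- loess(resid ~ dist, data = d)
grid <- data.frame(dist = seq(min(d$dist), max(d$dist), length.out = 100))
p <- predict(lo, newdata = grid, se = TRUE)

band <- data.frame(dist  = grid$dist,
                   fit   = p$fit,
                   lower = p$fit - p$se.fit,   # fit minus 1 standard error
                   upper = p$fit + p$se.fit)   # fit plus 1 standard error

plot(d$dist, d$resid, xlab = "dist", ylab = "Residual")
abline(h = 0, lty = 3)
lines(band$dist, band$fit)
lines(band$dist, band$lower, lty = 2)
lines(band$dist, band$upper, lty = 2)
```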
Have a look at my question "modify lm or loess function.."
I am not sure I followed your question very well, but maybe a:
+ stat_smooth(method=yourfunction)
will work, provided that you define your function as described here.
