I generated two random variables, X and Y, with 1000 observations each and put them in a data frame, Data = data.frame(X, Y), so that I could fit a spline smoother. Now I want to repeat exactly that B = 1000 times and plot the resulting smoothing functions (b = 1, ..., 1000) to compare their variability:
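library(splines)  # provides ns()

simulation <- function(d) {
  X <- runif(1000, 0, 10)
  Y <- rpois(1000, lambda = 2 * X + 0.2 * X * sin(X))
  Data <- data.frame(X, Y)  # lm() needs a data frame, not a matrix
  smoothing_sim <- lm(Y ~ ns(x = X, df = d), data = Data)
  new_x2 <- seq(min(X), max(X), length.out = 100)
  adjusted_sim <- predict(object = smoothing_sim, newdata = data.frame(X = new_x2))
  # return the predictions rather than the fitted model object
  return(data.frame(new_x2, adjusted_sim))
}
# call the function inside replicate(); d = 5 is only an illustrative choice
simulation2 <- replicate(n = 1000, simulation(d = 5), simplify = FALSE)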
I'm not sure whether my method is sound, and I'm also not sure how to plot the functions after the simulation. Would anyone care to comment? Thanks!
If you use ggplot, you can make the smooths right in geom_smooth. As ggplot demands long form, using list columns and tidyr::unnest is a useful substitute for replicate, though there are lots of ways to accomplish the data generation step.
library(tidyverse)
set.seed(47)
# A nice theme with a white background to help make low-opacity objects visible
theme_set(hrbrthemes::theme_ipsum_tw())
df <- tibble(replication = seq(100), # scaled down a little
             x = map(replication, ~runif(1000, 0, 10)),
             y = map(x, ~rpois(1000, lambda = 2*.x + 0.2*.x*sin(.x)))) %>%
    unnest(c(x, y))  # name the list columns explicitly; a bare unnest() is deprecated
# base plot with aesthetics and points
point_plot <- ggplot(df, aes(x, y, group = replication)) +
geom_point(alpha = 0.01, stroke = 0)
point_plot +
geom_smooth(method = lm, formula = y ~ splines::ns(x), size = .1, se = FALSE)
Controlling the line's alpha can be really helpful for this sort of plot, but the alpha parameter in geom_smooth controls the opacity of the standard error ribbon. To set the alpha of the line itself, use stat_smooth with geom = 'line':
point_plot +
stat_smooth(geom = 'line', method = lm, formula = y ~ splines::ns(x),
color = 'blue', alpha = 0.03)
Currently, the smooth isn't doing much more than OLS here. To make it more flexible, set the degrees of freedom:
point_plot +
stat_smooth(geom = 'line', method = lm, formula = y ~ splines::ns(x, df = 5),
color = 'blue', alpha = 0.03)
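To see why the default ns(x) adds little over OLS, the basis can be inspected directly. The following is a minimal sketch in base R with the splines package; the simulated data mirror the question, and the object names are illustrative only.

```r
library(splines)

set.seed(47)
x <- runif(1000, 0, 10)
y <- rpois(1000, lambda = 2 * x + 0.2 * x * sin(x))

# With neither df nor knots given, ns() has no interior knots, so the
# basis has a single column and can only represent a straight line
basis_default <- ns(x)
basis_df5 <- ns(x, df = 5)
ncol(basis_default)
ncol(basis_df5)

# The default natural-spline fit therefore reproduces ordinary least squares
fit_ols <- lm(y ~ x)
fit_ns  <- lm(y ~ ns(x))
max_diff <- max(abs(fitted(fit_ols) - fitted(fit_ns)))
max_diff
```

Raising df adds interior knots, which is what lets the curve bend.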
Given the response is Poisson, it may be worth scaling up to Poisson regression with glm. The largest impact here is that when x is small, y doesn't dip all the way to 0:
point_plot +
stat_smooth(geom = 'line', method = glm, method.args = list(family = 'poisson'),
formula = y ~ splines::ns(x, df = 5), color = 'blue', alpha = 0.03)
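The same contrast can be checked outside ggplot. This sketch, which assumes the simulation set-up from the question and uses illustrative object names, fits both models by hand and compares their predictions on a grid; because of the log link, the Poisson predictions are always strictly positive.

```r
library(splines)

set.seed(47)
x <- runif(1000, 0, 10)
y <- rpois(1000, lambda = 2 * x + 0.2 * x * sin(x))
sim <- data.frame(x, y)

# Same spline basis, Gaussian versus Poisson response
fit_lm  <- lm(y ~ ns(x, df = 5), data = sim)
fit_glm <- glm(y ~ ns(x, df = 5), family = poisson, data = sim)

grid <- data.frame(x = seq(min(x), max(x), length.out = 100))
pred_lm  <- predict(fit_lm, newdata = grid)
pred_glm <- predict(fit_glm, newdata = grid, type = "response")

# Compare the ranges of the two sets of predictions
range(pred_lm)
range(pred_glm)
```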
Adjust further as you like.
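If you would rather stay in base R, here is a rough sketch of the replicate-based approach from the question. It assumes df = 5 and a common prediction grid (both are my choices, not part of the original code), so that the fitted curves can be stacked in a matrix and their pointwise spread summarised.

```r
library(splines)

set.seed(47)
grid <- seq(0.5, 9.5, length.out = 100)  # common grid inside the range of x

# One replicate: simulate data, fit the spline, predict on the common grid
one_fit <- function(d = 5, n = 1000) {
  x <- runif(n, 0, 10)
  y <- rpois(n, lambda = 2 * x + 0.2 * x * sin(x))
  fit <- lm(y ~ ns(x, df = d), data = data.frame(x, y))
  predict(fit, newdata = data.frame(x = grid))
}

B <- 100  # scaled down; the question uses 1000
curves <- replicate(B, one_fit())  # matrix with one column per replicate

# Overlay all fitted curves, then summarise the pointwise spread
matplot(grid, curves, type = "l", lty = 1, col = rgb(0, 0, 1, 0.1),
        xlab = "x", ylab = "fitted y")
pointwise_sd <- apply(curves, 1, sd)
```

Plotting pointwise_sd against grid shows where the fits vary most across replicates.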
Is there a way to set the x-axis limits when plotting the predicted fits for GAM models? More specifically, I'm fitting a smoother for each level of a factor using 'by = ', however, each factor level has a different range of values. Plotting the variable in ggplot results in an x-axis that automatically accommodates the different ranges of 'x'; however, after fitting a GAM (mgcv::gam()) the default behavior of plot.gam() appears to be predicting values across a shared x-axis limit.
The dummy data below has some continuous variable for 'x', but in my real data, 'x' is Time (year), and 'group' is sampling location. Because I did not collect data from each site across the same time range, I feel it is inappropriate to show a model fit in these empty years.
library(tidyverse)
library(mgcv)
library(gratia)
theme_set(theme_classic())
## simulate data with a grouping variable of three levels:
d = data.frame(group = rep(c('A','B','C'), each = 100),
x = c(seq(0,1,length=100),
seq(.2,1,length=100),
seq(0,.5,length=100))) %>%
mutate(y = sin(2*pi*x) + rnorm(100, sd=0.3),
group = as.factor(group))
## Look at data
ggplot(d, aes(x = x, y = y, colour = group))+
facet_wrap(~group)+
geom_point()+
geom_smooth()
Here is the raw data with loess smoother in ggplot:
## fit simple GAM with smoother for X
m1 = mgcv::gam(y ~ s(x, by = group), data = d)
## base R plot
par(mfrow = c(2,2), bty = 'l', las = 1, mai = c(.6,.6,.2,.1), mgp = c(2,.5,0))
plot(m1)
## Gavin's neat plotter
gratia::draw(m1)
Here is the predicted GAM fit that spans the same range (0,1) for all three groups:
Can I limit the prediction/plot to actual values of 'x'?
If you install the current development version (>= 0.6.0.9111) from GitHub, {gratia} will now do what you want, sort of. I added some functionality to smooth_estimates() that I had planned to add eventually, but your post kicked it to the top of the ToDo list and motivated me to add it now.
You can use smooth_estimates() to evaluate the smooths at the observed (or any user-supplied) data only and then a bit of ggplot() recreates most of the plot.
remotes::install_github("gavinsimpson/gratia")
library('mgcv')
library('gratia')
library('dplyr')
library('ggplot2')
d <- data.frame(group = rep(c('A','B','C'), each = 100),
x = c(seq(0,1,length=100),
seq(.2,1,length=100),
seq(0,.5,length=100))) %>%
mutate(y = sin(2*pi*x) + rnorm(100, sd=0.3),
group = as.factor(group))
m <- gam(y ~ group + s(x, by = group), data = d, method = 'REML')
sm <- smooth_estimates(m, data = d) %>%
add_confint()
ggplot(sm, aes(x = x, y = est, colour = group)) +
geom_ribbon(aes(ymin = lower_ci, ymax = upper_ci, colour = NULL, fill = group),
alpha = 0.2) +
geom_line() +
facet_wrap(~ group)
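If installing a development version is not an option, a similar result can be sketched with predict() from {mgcv} alone. This is only an approximation of the approach above: it predicts from the full model (group means included) rather than evaluating each smooth on its own, and the grid size of 50 and the interval of roughly two standard errors are arbitrary choices.

```r
library(mgcv)

set.seed(1)
d <- data.frame(group = factor(rep(c("A", "B", "C"), each = 100)),
                x = c(seq(0, 1, length.out = 100),
                      seq(0.2, 1, length.out = 100),
                      seq(0, 0.5, length.out = 100)))
d$y <- sin(2 * pi * d$x) + rnorm(nrow(d), sd = 0.3)

m <- gam(y ~ group + s(x, by = group), data = d, method = "REML")

# Build a prediction grid per group that spans only that group's observed x
newd <- do.call(rbind, lapply(split(d, d$group), function(g) {
  data.frame(group = g$group[1],
             x = seq(min(g$x), max(g$x), length.out = 50))
}))

pred <- predict(m, newdata = newd, se.fit = TRUE)
newd$fit   <- as.numeric(pred$fit)
newd$lower <- newd$fit - 2 * as.numeric(pred$se.fit)
newd$upper <- newd$fit + 2 * as.numeric(pred$se.fit)

# Each group's predictions stay within its own sampled range
tapply(newd$x, newd$group, range)
```

The resulting newd can be passed to the same geom_ribbon() and geom_line() layers, with fit, lower and upper in place of est, lower_ci and upper_ci.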
Edit: following interactions in the responses below, I believe there may be some issues with the plot() or plot.gam() functions when dealing with gam outputs. See responses below.
I am running a non parametric regression model <- gam(y ~ x, bs = "cs", data = data).
My data looks like what follows, where x is in logs. I have 273 observations
y x
[1,] 0.010234756 10.87952
[2,] 0.009165001 10.98407
[3,] 0.001330975 11.26850
[4,] 0.008000957 10.97803
[5,] 0.008579472 10.94924
[6,] 0.009746714 11.01823
I would like to plot the output of the model, basically the fitted curve. When I do
# graph
plot(model)
or
ggplot(data = data, mapping = aes(x = x, y = y)) +
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"))
I get the desired output graphs (apologies for the original labels):
However, the two plotted curves are not exactly the same and I did not manage to find the parameters to tweak to remove the differences. Therefore I would like to plot the curve manually.
Here is my current attempt.
model <- gam(y~ s(x), bs = "cs", data = data)
names(model)
# summary(model)
model_fit <- as.data.frame(cbind(model$y, model$fitted.values,
model$linear.predictors, data$x,
model$residuals))
names(model_fit) <- c("y", "y_fit", "linear_pred", "x", "res")
### here the plotting
ggplot(model_fit) +
geom_point(aes(x = x, y = y_fit), size = 0.5, alpha = 0.5) +
geom_line(aes(x = x, y = y_fit))
However I get the following warning
geom_path: Each group consists of only one observation. Do you need to adjust the group aesthetic?
and wrong output graph
I do not seem to be able to fix the last graph (it seems the error is in geom_point() ) and add the confidence intervals, nor to find where to tweak the first two to make them exactly the same.
The difference is likely due to the two fits using different smoothness selection methods. The default in gam() is (currently) method = "GCV.Cp", even though the recommended option is method = "REML". stat_smooth() uses method = "REML". GCV-based smoothness selection is known to undersmooth in some circumstances, and that seems to be the case here, with the REML solution being a much smoother curve.
If you change to method = "REML" in your gam() call, the differences should disappear.
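As a rough illustration of how the two criteria can be compared, the sketch below fits the same smooth both ways and prints the selected smoothing parameters and effective degrees of freedom. The data are simulated stand-ins (only the sample size of 273 comes from the question), so the size of any difference here says nothing about your data.

```r
library(mgcv)

set.seed(2)
n <- 273  # same sample size as in the question; the values are simulated
x <- runif(n, 10.5, 11.5)
y <- sin(3 * x) + rnorm(n, sd = 0.3)
dat <- data.frame(x, y)

m_gcv  <- gam(y ~ s(x, bs = "cs"), data = dat)                   # default: GCV.Cp
m_reml <- gam(y ~ s(x, bs = "cs"), data = dat, method = "REML")  # REML selection

# Compare the selected smoothing parameters and effective degrees of freedom
c(gcv = unname(m_gcv$sp), reml = unname(m_reml$sp))
c(gcv = sum(m_gcv$edf), reml = sum(m_reml$edf))
```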
That said, you really shouldn't be ripping things out of model objects like that. For a start, $residuals is not what you think it is: those are the working residuals from the PIRLS algorithm, which are not useful in this context. Use the extractor functions such as fitted(), residuals(), etc.
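To make the distinction concrete, here is a small sketch on simulated Poisson data (my example, not the data from the question). A non-Gaussian family is used because that is where working and response residuals visibly differ.

```r
library(mgcv)

set.seed(3)
x <- runif(200)
y <- rpois(200, lambda = exp(1 + sin(2 * pi * x)))
m <- gam(y ~ s(x), family = poisson, method = "REML")

working_res  <- m$residuals                      # working residuals from the fitting algorithm
response_res <- residuals(m, type = "response")  # observed minus fitted, response scale
deviance_res <- residuals(m)                     # the extractor's default type is "deviance"

# The raw component matches type = "working", not the response residuals
same_as_working  <- isTRUE(all.equal(as.numeric(working_res),
                                     as.numeric(residuals(m, type = "working"))))
same_as_response <- isTRUE(all.equal(as.numeric(working_res),
                                     as.numeric(response_res)))
c(working = same_as_working, response = same_as_response)
```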
The easiest way to plot your own version of that drawn by plot.gam() is to capture the object returned by plot.gam() and then use that object to draw what you need.
Via plot.gam()
library('mgcv')
library('gratia')   # for data_sim()
library('ggplot2')

df <- data_sim("eg1", seed = 2)
m <- gam(y ~ s(x2), data = df, method = "REML")
p_obj <- plot(m, residuals = TRUE)
p_obj <- p_obj[[1]] # just one smooth so select the first component
sm_df <- as.data.frame(p_obj[c("x", "se", "fit")])
data_df <- as.data.frame(p_obj[c("raw", "p.resid")])
## plot
ggplot(sm_df, aes(x = x, y = fit)) +
  geom_rug(data = data_df, mapping = aes(x = raw, y = NULL),
           sides = "b") +
  geom_point(data = data_df, mapping = aes(x = raw, y = p.resid)) +
  geom_ribbon(aes(ymin = fit - se, ymax = fit + se, y = NULL),
              alpha = 0.3) +
  geom_line() +
  labs(x = p_obj$xlab, y = p_obj$ylab)
Which produces
Alternatively, you might look at my {gratia} package or the {mgcViz} package of Matteo Fasiolo as options that will do this all for you.
{gratia} example
For example with {gratia}
library('gratia')
draw(m, residuals = TRUE)
which produces
The solution provided by @Gavin Simpson here partially solves the issue: to make the two curves match, one needs to add method = "REML". The two curves then have the same shape.
However, for some reason, when plotting the output of a gam() model using either plot() or plot.gam(), the curve does not fit properly the original data as it should. The same happens by manually plotting the graph by extracting the elements from the object returned by plot.gam(). I am not sure why this happens. In my case, the fitted curve is shifted downwards, clearly "missing" the data points it is supposed to fit. Below the code and the corresponding output graph, the latter being the same you get in plot() or plot.gam() with the addition of the original data points to the plots.
b <- plot(model_1)  # capture the plotting data that plot.gam() returns invisibly
# or b <- plot.gam(model_1)
data.plot = as.data.frame(cbind(b[[1]]$x, b[[1]]$fit, b[[1]]$se))
ggplot(data = data.plot, mapping = aes(x = V1, y = V2)) +
  geom_line(aes(x = V1, y = V2)) +
  geom_line(aes(x = V1, y = V2 + V3), linetype = "dashed") +
  geom_line(aes(x = V1, y = V2 - V3), linetype = "dashed") +
  geom_point(data = df_abs, aes(x = log(prd_l_1999), y = prd_gr), size = 0.5, alpha = 0.5)
Misplaced graphs
Note that ggplot draws the plot correctly. Therefore, my uninformed guess is that this may be an issue with the plotting method.
Working solution
I am not able to prove that the issue is with the plotting functions, but it turns out that this is the same issue as in this question and the partial solution provided by the OP fixes the plotting while still using the gam() function. Below (his) code adapted to my case and the corresponding output graph. As you can see, the graph is plotted properly and the curve fits the data as it is supposed to do. I'd say this may corroborate my hypothesis even though I cannot prove it as I am not knowledgeable enough.
library(data.table)
model_1 <- gam(prd_gr ~ s(log(prd_l_1999)), bs = "cs", data = df_abs, method = "REML")
preds <- predict(model_1,se.fit=TRUE)
my_data <- data.frame(mu=preds$fit, low =(preds$fit - 1.96 * preds$se.fit), high = (preds$fit + 1.96 * preds$se.fit))
ggplot()+
geom_line(data = my_data, aes(x=log(df_abs$prd_l_1999), y=mu), size=1, col="blue")+
geom_smooth(data=my_data,aes(ymin = low, ymax = high, x=log(df_abs$prd_l_1999), y = mu), stat = "identity", col="green")
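One possible explanation for the vertical shift, offered as an assumption rather than something established in this thread: plot.gam() draws each smooth as a centred partial effect that excludes the model intercept, whereas predict() returns fitted values that include it. The sketch below uses simulated data with a non-zero mean level to show the relationship; the variable names and data are illustrative only.

```r
library(mgcv)

set.seed(4)
x <- runif(273, 10.5, 11.5)
y <- 5 + sin(3 * x) + rnorm(273, sd = 0.3)  # note the non-zero mean level
m <- gam(y ~ s(x), method = "REML")

smooth_only <- predict(m, type = "terms")[, "s(x)"]  # centred smooth, as drawn by plot.gam()
intercept   <- unname(coef(m)["(Intercept)"])
fitted_vals <- as.numeric(fitted(m))

# Smooth plus intercept recovers the fitted values for this
# Gaussian model with a single smooth
gap <- max(abs(smooth_only + intercept - fitted_vals))
gap

# plot.gam() can add the intercept back through its shift argument
plot(m, shift = intercept)
points(x, y, cex = 0.4)
```

If this is indeed the cause, adding shift = coef(model_1)[1] to the plot() call should line the curve up with the raw data.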
I am trying to reproduce someone else's work using a probit model. Unfortunately, I don't have much information about their methods, only their starting data and a plot of their model.
When I plot the data in ggplot and fit a line using geom_smooth(method = "glm", ...), I am able to reproduce the prior work. However, when I try to fit (what I think is) an identical model outside of ggplot using glm(), I get different predictions. I feel like I am making some silly mistake, but I can't quite pin it down.
Here is a reproducible example:
library(tidyverse)
set.seed(123)
df <- tibble(x = c(0.006, 0.014, 0.025, 0.05, 0.15, 0.3, 0.5),
y = c(0.4, 0.733, 0.875, 1, 1, 1, 1))
probit_model <- glm(y ~ x, data = df, family = quasibinomial(link = "probit"))
df <- df %>%
add_row(x = 0.001, y = NA) %>% # To underline that these models are different
mutate(y_pred = predict(probit_model, newdata = ., type = "response"))
df %>%
ggplot(aes(x, y)) +
geom_point(size = 4) +
geom_line(aes(y = y_pred), color = "red", lwd = 1) +
geom_smooth(formula = y ~ x, color = "blue",
method = "glm", fullrange = TRUE,
method.args = list(family = quasibinomial(link = "probit"))) +
scale_x_log10(limits = c(0.001, 1))
And here is the plot it produces. Note that the blue line and the red line describe different fits. I believe they should be the same (ignoring the piecewise nature of the red line), given they use the same model and data.
I've read quite a few threads in the process of troubleshooting, and many responses suggest that geom_smooth() is not a replacement for modelling. Broadly, I agree. That said, I am explicitly trying to figure out what geom_smooth() is doing here, and then reproduce it outside of ggplot.
My questions are:
Why are these two models different?
How is geom_smooth() calling glm()?
How can I call glm() myself to reproduce the model that geom_smooth() is using?
The models are actually the same. You can see this if you set, say, xlim(0, 0.1) and remove scale_x_log10. Then you'll see the fits coincide.
I think the behavior you're seeing is because scale_x_log10 performs the axis transformation before any statistical summaries (such as geom_smooth). So, when you run scale_x_log10, geom_smooth is effectively fitting the model y ~ log10(x), rather than y ~ x. If you use coord_trans(x="log10") instead of scale_x_log10, you'll also see that the models coincide, since coord_trans does the transformation after any statistical summaries.
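Following that reasoning, the blue line should be reproducible outside ggplot by fitting the model on log10(x). A sketch using the data from the question (the grid and object names are my own; glm() may emit warnings on such a small data set):

```r
df <- data.frame(x = c(0.006, 0.014, 0.025, 0.05, 0.15, 0.3, 0.5),
                 y = c(0.4, 0.733, 0.875, 1, 1, 1, 1))
fam <- quasibinomial(link = "probit")

# Model on the raw scale (what the question fitted by hand)
fit_raw <- glm(y ~ x, data = df, family = fam)
# Model on the log10 scale (what geom_smooth sees after scale_x_log10)
fit_log <- glm(y ~ log10(x), data = df, family = fam)

grid <- data.frame(x = 10^seq(-3, 0, length.out = 50))
pred_raw <- predict(fit_raw, newdata = grid, type = "response")
pred_log <- predict(fit_log, newdata = grid, type = "response")

# Side-by-side predictions; the two curves differ most at small x
head(cbind(x = grid$x, raw = pred_raw, log10 = pred_log))
```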
I am running an analysis in R on the effect of canopy cover (OverheadCover) and the number of carcasses placed at the same location (CarcassNumber) on the proportion of carrion eaten by birds (ProportionBirdsScavenging). The interaction effect OverheadCover * CarcassNumber is significant and I would like to visualise this using ggplot as explained here: https://sebastiansauer.github.io/vis_interaction_effects/. I won't be using method = "lm" as in the example, but method = glmmTMB::glmmTMB. I've added the extra arguments formula = and method.args = to make sure R computes the smooth correctly.
This is how it should look, but I prefer the graph to be made with ggplot because then all my graphs will be in the same style.
glmm_interaction <- glmmTMB(ProportionBirdsScavenging ~ OverheadCover * CarcassNumber + (1|Area), data = data_both, beta_family(link = "logit"), weights = pointWeight_scaled)
plot_model(glmm_interaction, type = "int", ci.lvl = 0.682) # conf. int. of 68.3% -> ± standard error
This is the code I'm trying to run, but I can't get it to work. It keeps giving me errors, such as object 'pointWeight_scaled' not found. Does anyone have an idea what I'm doing wrong here?
qplot(x = OverheadCover, y = ProportionBirdsScavenging, color = CarcassNumber, data = data_both) +
geom_smooth(method = glmmTMB::glmmTMB,
formula = ProportionBirdsScavenging ~ OverheadCover * CarcassNumber,
method.args = list(data = data_both, beta_family(link = "logit"), weights = pointWeight_scaled))
I know that it might be easier to just individually run the models and plot them on the same graph. I've done that, and it works. However, my calculated standard errors are larger than the ones in the plot_model(), so I wanted to see how these standard errors look if R does all the work, hence my intention to plot it this way.
This is how it should look, but I prefer the graph to be made with ggplot
The plot returned by plot_model() is a ggplot-object, which you can modify as you like. You could also use the ggeffects-package, which returns the underlying data that can be used to create the plot. There are many examples in the vignettes, both on how to create own plots or how to modify plots returned by plot(), e.g. here or here.
Here is a toy example:
library(ggplot2)
library(ggeffects)
library(lme4)
#> Loading required package: Matrix
set.seed(123)
dat <- data.frame(
outcome = rbinom(n = 500, size = 1, prob = 0.25),
var_binom = as.factor(rbinom(n = 500, size = 1, prob = 0.3)),
var_cont = rnorm(n = 500, mean = 10, sd = 3),
group = sample(letters[1:4], size =500, replace = TRUE)
)
model <- glmer(
outcome ~ var_binom * poly(var_cont, 2) + (1 | group),
data = dat,
family = binomial(link = "logit")
)
predictions <- ggpredict(model, c("var_cont [all]", "var_binom"))
# plot-function from ggeffects
plot(predictions)
# self made ggplot
ggplot(
predictions,
aes(x = x, y = predicted, ymin = conf.low, ymax = conf.high, colour = group, fill = group)
) +
geom_line() +
geom_ribbon(alpha = .1, colour = NA) +
theme_minimal()
Created on 2020-02-06 by the reprex package (v0.3.0)
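For readers who want to see roughly what such a prediction table contains, here is a hand-rolled sketch in base R. It is a simplification and not how {ggeffects} is implemented: it uses a plain glm() without the random intercept, and builds the interval on the link scale before back-transforming.

```r
set.seed(123)
dat <- data.frame(
  outcome   = rbinom(n = 500, size = 1, prob = 0.25),
  var_binom = as.factor(rbinom(n = 500, size = 1, prob = 0.3)),
  var_cont  = rnorm(n = 500, mean = 10, sd = 3)
)
# Simplification: fixed effects only, no (1 | group) term
model <- glm(outcome ~ var_binom * poly(var_cont, 2), data = dat,
             family = binomial(link = "logit"))

# Grid over the continuous predictor for each level of the factor
grid <- expand.grid(
  var_cont  = seq(min(dat$var_cont), max(dat$var_cont), length.out = 50),
  var_binom = levels(dat$var_binom)
)

# Predict on the link scale, build the interval there, then back-transform
pr <- predict(model, newdata = grid, type = "link", se.fit = TRUE)
grid$predicted <- plogis(pr$fit)
grid$conf.low  <- plogis(pr$fit - qnorm(0.975) * pr$se.fit)
grid$conf.high <- plogis(pr$fit + qnorm(0.975) * pr$se.fit)
head(grid)
```

The grid data frame can then be plotted with geom_line() and geom_ribbon() in the same way as the predictions object above.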
I usually plot the loess estimate of a set of points, along with its confidence interval, using the geom_smooth function.
Now I need to change the method by which the confidence bounds are computed (i.e. I need to change the shape of the shaded band). Is there a way to do that in geom_smooth?
Or, how can I emulate it with ggplot2? How can I draw such a band myself?
If you need to plot something that isn't one of the options in geom_smooth, your best bet is to fit the model yourself.
You haven't said what method you need.
But here is an example of fitting a loess with family = 'symmetric' and computing its standard errors.
library(ggplot2)

d <- data.frame(x = rnorm(100), y = rnorm(100))
# The original plot using the default loess method
p <- ggplot(d, aes(x, y)) + geom_smooth(method = 'loess', se = TRUE)
# Fit loess model with family = 'symmetric'
# Replace the next line with whatever different method you need
loess_smooth <- loess(y ~ x, data = d, family = 'symmetric')
# Predict the model over the range of data you are interested in.
x_grid <- seq(min(d$x), max(d$x), length.out = 1000)
loess_pred <- predict(loess_smooth,
                      newdata = data.frame(x = x_grid),
                      se = TRUE)
# Band of +/- one standard error around the fit
loess.df <- data.frame(fit = loess_pred$fit,
                       x = x_grid,
                       upper = loess_pred$fit + loess_pred$se.fit,
                       lower = loess_pred$fit - loess_pred$se.fit)
# plot to compare
p +
  geom_ribbon(data = loess.df, aes(x = x, y = fit, ymax = upper, ymin = lower), alpha = 0.6) +
  geom_line(data = loess.df, aes(x = x, y = fit))
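As one example of a different way to compute the bounds, here is a sketch of a pointwise bootstrap percentile band for a loess fit. The choice of a bootstrap, the 200 resamples and the interior grid are all my assumptions, since the question does not say which method is needed.

```r
set.seed(1)
d <- data.frame(x = rnorm(100), y = rnorm(100))

# Evaluate on an interior grid so that bootstrap resamples almost always cover it
x_grid <- seq(as.numeric(quantile(d$x, 0.05)),
              as.numeric(quantile(d$x, 0.95)),
              length.out = 200)

# Refit loess on resampled rows and predict on the common grid
boot_fit <- function() {
  idx <- sample(nrow(d), replace = TRUE)
  fit <- loess(y ~ x, data = d[idx, ])
  predict(fit, newdata = data.frame(x = x_grid))
}
boot_curves <- replicate(200, boot_fit())  # one column per resample

# Pointwise 2.5% and 97.5% percentiles form the band
band <- data.frame(
  x     = x_grid,
  fit   = predict(loess(y ~ x, data = d), newdata = data.frame(x = x_grid)),
  lower = apply(boot_curves, 1, quantile, probs = 0.025, na.rm = TRUE),
  upper = apply(boot_curves, 1, quantile, probs = 0.975, na.rm = TRUE)
)
head(band)
```

The band data frame has the same columns as loess.df above, so it can be drawn with the same geom_ribbon() and geom_line() layers.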