Using the 'iris' dataset (slightly modified as below), I plot the results of an LME.
PLEASE NOTE: I am only using the iris dataset as mock data for the purpose of plotting, so please do not critique the appropriateness of this test. I'm not interested in the statistics, rather the plotting.
When I use the ggpredict function and plot the results, the plot extends the predictions beyond the range of the data. Is there a systematic way to plot predictions only within the range of the data in each facet?
I can plot each facet separately, limit the axis per plot manually, and combine them again with cowplot, but if there is a way to say 'predict only between the min. and max. of the data for that group', that would be great.
Given that these are facets of a single model, perhaps not showing the predictions for different groups is in fact misleading, and I should rather create three different models if I only want predictions within those data subsets?
library(lme4)
library(ggeffects)
library(ggplot2)
library(dplyr) # provides %>%, glimpse(), group_by() and mutate() used below
data(iris)
glimpse(iris)
df = iris
glimpse(df)
df_ed = df %>% group_by(Species) %>% mutate(Sepal.Length = ifelse(Species == "setosa",Sepal.Length+10,Sepal.Length+0))
df_ed = df_ed %>% group_by(Species) %>% mutate(Sepal.Length = ifelse(Species == "versicolor",Sepal.Length-3,Sepal.Length+0))
glimpse(df_ed)
m_test =
lmer(Sepal.Width ~ Sepal.Length * Species +
(1|Petal.Width),
data = df_ed, REML = T)
summary(m_test)
test_plot = ggpredict(m_test, c("Sepal.Length", "Species"), type = "re") %>% plot(rawdata = T, dot.alpha = 0.6, facet = T, alpha = 0.3)
As per the OP's comment, I think this will provide a solution. In this example, I use data from the sleepstudy dataset that comes with the lme4 package. First, we have to postulate a mixed model, which I generically call fit.
Note that I do not perform any hypothesis test to formally select an appropriate random-effects structure. Of course, this is essential to adequately capture the correlations in the repeated measurements, but falls outside the scope of this post.
library(lme4)
library(splines)
# quantiles of Days
quantile(sleepstudy$Days, c(0.05, 0.95))
# 5% 95%
# 0 9
# mixed model
fit <- lmer(Reaction ~ ns(Days, df = 2, Boundary.knots = c(0, 9)) +
(Days | Subject), data = sleepstudy)
# new data.frame for prediction
ND <- with(sleepstudy, expand.grid(Days = seq(0L, 9L, len = 50)))
Then, we need a function that enables us to obtain predictions from fit for certain values of the covariates. The function effectPlot_lmer() takes the following arguments:
object: the fitted merMod object (the mixed model).
ND: the new data.frame, which specifies the values of the covariates for which we want to obtain predictions.
orig_data: the data.frame on which the mixed model was fitted.
# function to obtain predicted reaction times
effectPlot_lmer <- function (object, ND, orig_data) {
form <- formula(object, fixed.only = TRUE)
namesVars <- all.vars(form)
betas <- fixef(object)
V <- vcov(object)
orig_data <- orig_data[complete.cases(orig_data[namesVars]), ]
Terms <- delete.response(terms(form))
mfX <- model.frame(Terms, data = orig_data)
Terms_new <- attr(mfX, "terms")
mfX_new <- model.frame(Terms_new, ND, xlev = .getXlevels(Terms, mfX))
X <- model.matrix(Terms_new, mfX_new)
pred <- c(X %*% betas)
ses <- sqrt(diag(X %*% V %*% t(X)))
ND$pred <- pred
ND$low <- pred - 1.96 * ses
ND$upp <- pred + 1.96 * ses
return(ND)
}
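The core of the function is the pair of computations X %*% betas and sqrt(diag(X %*% V %*% t(X))). As a rough sanity check of that recipe, here is a minimal sketch on an ordinary lm, where predict() gives reference values to compare against. The mtcars model and the names fit_lm and nd are illustrative assumptions, not part of the original example.

```r
# Sketch: check the X %*% beta / sqrt(diag(X V X')) recipe against predict()
# on an ordinary linear model (mtcars ships with R).
fit_lm <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical prediction grid: wt varies, hp is held at its mean
nd <- data.frame(wt = seq(2, 5, length.out = 5), hp = mean(mtcars$hp))

# Design matrix for the new data, built from the model's own terms
X <- model.matrix(delete.response(terms(fit_lm)), data = nd)

pred <- drop(X %*% coef(fit_lm))                  # point predictions
ses  <- sqrt(diag(X %*% vcov(fit_lm) %*% t(X)))   # standard errors of the fit

# Reference values from predict()
ref <- predict(fit_lm, newdata = nd, se.fit = TRUE)
all.equal(unname(pred), unname(ref$fit))
all.equal(unname(ses), unname(ref$se.fit))
```

Note that effectPlot_lmer() builds its bands with 1.96, a normal approximation, whereas predict.lm(..., interval = "confidence") uses t quantiles.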
Finally, we can make an effect plot with ggplot.
# effect plot
library(ggplot2)
ggplot(effectPlot_lmer(fit, ND, orig_data = sleepstudy),
aes(x = Days, y = pred)) +
geom_line(size = 1.2, colour = 'blue4') +
geom_ribbon(aes(ymin = low, ymax = upp), colour = NA,
fill = adjustcolor('blue4', 0.2)) +
theme_bw() + ylab('Expected Reaction (ms)')
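The grid ND above covers the observed range of Days. For the faceted case in the question, the same idea could be applied per group: build one grid per species that spans only that species' observed values, and predict on that. The following is a sketch only, using lm on the unmodified iris data as a stand-in for the mixed model; fit_lm and grid are illustrative names.

```r
# Sketch: a prediction grid that stays inside each group's observed range.
fit_lm <- lm(Sepal.Width ~ Sepal.Length * Species, data = iris)

# One grid per species, spanning only that species' observed Sepal.Length
grid <- do.call(rbind, lapply(split(iris, iris$Species), function(d) {
  data.frame(
    Species = d$Species[1],
    Sepal.Length = seq(min(d$Sepal.Length), max(d$Sepal.Length), length.out = 25)
  )
}))

# Predictions exist only where each species actually has data
grid$pred <- predict(fit_lm, newdata = grid)
```

A data frame like grid can then be passed to geom_line() together with facet_wrap(~ Species), so that each panel's line stops at the limits of its own data.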
I have the following data and created a model with the glmmTMB package in R for plant diameter ~ plant density (number of plants) with a random plot effect:
library(glmmTMB)
d <- data.frame (diameter = c(17,16,15,13,11, 19,17,15,11,11, 19,15,14,11,8),
plant_density = c(1000,2000,3000,4000,5000, 1000,2000,3000,4000,5000, 1000,2000,3000,4000,5000),
plot = c(1,1,1,1,1, 2,2,2,2,2, 3,3,3,3,3))
glmm.model <- glmmTMB(diameter ~ plant_density + (1|plot),
data = d,
na.action = na.omit,
family="gaussian",
ziformula = ~ 0)
My intention was to create a plot of predicted diameter for different plant densities that takes the random plot effect into account. So I tried to predict the data:
new.dat <- data.frame(diameter= d$diameter,
plant_density = d$plant_density,
plot= d$plot)
new.dat$prediction <- predict(glmm.model, newdata = new.dat, # the argument is 'newdata', not 'new.data'
type = "response", re.form = NA)
Unfortunately, I get an output for every plot, but I wanted a generalized prediction for diameter ~ plant density.
My goal is to create a plot like here, but with a regression model from glmmTMB that accounts for the random effect.
Thanks for your help!
The ggeffects package makes this type of thing very easy to implement and customize.
For example
library('ggplot2')
library('glmmTMB')
library('ggeffects')
d <- data.frame (diameter = c(17,16,15,13,11, 19,17,15,11,11, 19,15,14,11,8),
plant_density = c(1000,2000,3000,4000,5000, 1000,2000,3000,4000,5000, 1000,2000,3000,4000,5000),
plotx = as.factor( c(1,1,1,1,1, 2,2,2,2,2, 3,3,3,3,3)))
glmm.model <- glmmTMB(diameter ~ plant_density + (1|plotx),
data = d,
family="gaussian")
# basically what you're looking for
plot(ggpredict(glmm.model, terms = "plant_density"))
# additionally, with changed limits on the y-axis
plot(ggpredict(glmm.model, terms = "plant_density")) +
scale_y_continuous(limits = c(0, 20))
You can really do whatever you'd like with it from there: change colors, themes, and scales. The package has some nice vignettes as well.
I am trying to plot the fits of a zero-inflated model and a zero-inflated mixed model using ggplot2, without success. This is what I tried:
#Packages
library(pscl)
library(glmmTMB)
library(ggplot2)
library(gridExtra)
# Artificial data set
set.seed(007)
n <- 100 # number of subjects
K <- 8 # number of measurements per subject
t_max <- 5 # maximum follow-up time
DF <- data.frame(id = rep(seq_len(n), each = K),
time = c(replicate(n, c(0, sort(runif(K - 1, 0, t_max))))),
sex = rep(gl(2, n/2, labels = c("male", "female")), each = K))
DF$y <- rnbinom(n * K, size = 2, mu = exp(1.552966))
str(DF)
Using a zero-inflated Poisson model with the pscl package
DF$time2 <- DF$time^2 # store the squared term in DF so that data=DF can find it
mZIP <- zeroinfl(y~time+time2+sex|time+sex, data=DF)
summary(mZIP)
Assuming that all coefficients are significant
# Y estimated
pred.data1 = data.frame(
time = DF$time, # use '=' so the columns get proper names
time2 = (DF$time)^2,
sex = DF$sex)
pred.data1$y = predict(mZIP, newdata=pred.data1, type="response")
Now using a zero-inflated Poisson mixed model with the glmmTMB package
mZIPmix<- glmmTMB(y~time+time2+sex+(1|id),
data=DF, ziformula=~1,family=poisson)
summary(mZIPmix)
#
# new Y estimated
pred.data2 = data.frame(
time = DF$time, # use '=' so the columns get proper names
time2 = (DF$time)^2,
sex = DF$sex,
id = DF$id)
pred.data2$y = predict(mZIPmix, newdata=pred.data2, type="response")
Plot the zero-inflated Poisson model and the zero-inflated mixed Poisson model
par(mfrow=c(1,2))
plot1<-ggplot(DF, aes(time, y, colour=sex)) +
labs(title="Zero inflated model") +
geom_point() +
geom_line(data=pred.data1) +
stat_smooth(method="glm", method.args=list(family=poisson(link="log")), formula = y~poly(x,2),fullrange=TRUE)
plot2<-ggplot(DF, aes(time, y, colour=sex)) +
labs(title="Zero inflated mixed model") +
geom_point() +
geom_line(data=pred.data2) +
stat_smooth(method="glm", method.args=list(family=poisson(link="log")), formula = y~poly(x,2),fullrange=TRUE)## here I can't find any method for a mixed GLM
grid.arrange(plot1, plot2, ncol=2)
#-
This doesn't work, for sure. Is it possible to do this using ggplot2?
Thanks in advance
I'm not sure, but it looks to me that you're looking for marginal effects. You can do this with the ggeffects-package. Here are two examples, using your simulated data, that create a ggplot-object, one with and one w/o raw data.
library(glmmTMB)
library(ggeffects)
library(magrittr) # provides %>% in case it is not already attached
mZIPmix<- glmmTMB(y~poly(time,2)+sex+(1|id), data=DF, ziformula=~1,family=poisson)
# compute marginal effects and create a plot.
# the tag "[all]" is useful for polynomial terms, to produce smoother plots
ggpredict(mZIPmix, c("time [all]", "sex")) %>% plot(rawdata = TRUE, jitter = .01)
ggpredict(mZIPmix, c("time [all]", "sex")) %>% plot(rawdata = FALSE)
Created on 2019-05-16 by the reprex package (v0.2.1)
Note that sex only has an "additive" effect. Maybe you want to model an interaction between time and sex?
mZIPmix<- glmmTMB(y~poly(time,2)*sex+(1|id), data=DF, ziformula=~1,family=poisson)
ggpredict(mZIPmix, c("time [all]", "sex")) %>% plot(rawdata = TRUE, jitter = .01)
ggpredict(mZIPmix, c("time [all]", "sex")) %>% plot()
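To make the "additive" remark concrete: in an additive model with a log link, the predicted curves for the two sexes are parallel on the link scale, while an interaction lets the gap change over time. Here is a minimal sketch using an ordinary Poisson glm on freshly simulated data. The data, coefficients and names (sim, m_add, m_int, gap) are assumptions for illustration, not the poster's model.

```r
# Sketch with simulated data: in an additive Poisson GLM the gap between
# groups is constant on the log (link) scale.
set.seed(1)
sim <- data.frame(time = runif(200, 0, 5),
                  sex  = gl(2, 100, labels = c("male", "female")))
sim$y <- rpois(200, exp(0.5 + 0.2 * sim$time + 0.3 * (sim$sex == "female")))

m_add <- glm(y ~ poly(time, 2) + sex, family = poisson, data = sim)
m_int <- glm(y ~ poly(time, 2) * sex, family = poisson, data = sim)

# Small grid: three time points for each sex
nd <- expand.grid(time = c(1, 2, 4), sex = levels(sim$sex))

# Female minus male difference on the link scale at each time point
gap <- function(m) {
  eta <- predict(m, newdata = nd, type = "link")
  unname(eta[nd$sex == "female"] - eta[nd$sex == "male"])
}
gap(m_add)  # the same value at every time point (the sex coefficient)
gap(m_int)  # generally differs between time points
```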
After variable selection, I usually end up with a model that contains a polynomial (2nd or 3rd degree) of a numerical covariate. I would like to plot the fitted curve, preferably using the emmeans package. Is there a way of doing it?
I can do it using predict:
m1 <- lm(mpg ~ poly(disp,2), data = mtcars)
df <- cbind(disp = mtcars$disp, predict.lm(m1, interval = "confidence"))
df <- as.data.frame(df)
ggplot(data = df, aes(x = disp, y = fit)) +
geom_line() +
geom_ribbon(aes(ymin = lwr, ymax = upr, x = disp, y = fit),alpha = 0.2)
I haven't figured out a way of doing it using either emmip or emtrends.
For illustration purposes, how could I do it for a mixed model fitted with lme?
library(nlme)
m1 <- lme(mpg ~ poly(disp,2), random = ~1|factor(am), data = mtcars)
I suspect that your issue is due to the fact that, by default, covariates are reduced to their means in emmeans. You can use the at or cov.reduce arguments to specify a larger number of values. See the documentation for ref_grid and vignette("basics", "emmeans"), or the index of vignette topics.
Using sjPlot:
plot_model(m1, terms = "disp [all]", type = "pred")
gives the same graphic.
Using emmeans:
em1 <- ref_grid(m1, at = list(disp = seq(min(mtcars$disp), max(mtcars$disp), 1)))
emmip(em1, ~disp, CIs = T)
returns a graphic with a small difference in layout. An alternative is to store the result in an object and plot it the way I want:
d1 <- emmip(em1, ~disp, CIs = T, plotit = F)
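The point about covariates being reduced to their means can be illustrated without emmeans: predicting at one mean value gives a single point, whereas predicting over a grid of disp values traces the whole curve. A minimal base R sketch of the grid version follows; m_poly and grid are illustrative names, and the model is the lm from the question rather than the lme fit.

```r
# Sketch: evaluate the fitted curve on an evenly spaced grid of disp,
# rather than at a single mean value or only at the observed points.
m_poly <- lm(mpg ~ poly(disp, 2), data = mtcars)

grid <- data.frame(disp = seq(min(mtcars$disp), max(mtcars$disp),
                              length.out = 100))
ci <- predict(m_poly, newdata = grid, interval = "confidence")
grid <- cbind(grid, ci)  # adds the columns fit, lwr and upr

head(grid)
```

The resulting data frame can be plotted with geom_line() and geom_ribbon(aes(ymin = lwr, ymax = upr)), in the same way as the predict-based plot in the question.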
I've been using ggplot2 to plot the results of bootstrapping various statistical outputs such as correlation coefficients. Most recently, I bootstrapped the slope of a linear regression model. Here's how that looks using the plot() function from the graphics package:
plot(main="Relationship Between Eruption Length and Wait Time at\nOld Faithful With Bootstrapped Regression Lines",
xlab = "Eruption Length (minutes)",
ylab = "Wait Time (minutes)",
waiting ~ eruptions,
data = faithful,
col = spot_color,
pch = 19)
index <- 1:nrow(faithful)
for (i in 1:10000) {
index_boot <- sample(index, replace = TRUE) #getting a bootstrap sample (of indices)
faithful_boot <- faithful[index_boot, ]
# Fitting the linear model to the bootstrapped data:
fit.boot <- lm(waiting ~ eruptions, data = faithful_boot)
abline(fit.boot, lwd = 0.1, col = rgb(0, 0.1, 0.25, alpha = 0.05)) # Add line to plot
}
fit <- lm(waiting ~ eruptions, data=faithful)
abline(fit, lwd = 2.5, col = "blue")
That works, but it depends on a workflow where we first create a plot, then add the lines in a loop. I'd rather create a list of slopes with a function and then plot all of them in ggplot2.
For example, the function might look something like this:
set.seed(777) # included so the following output is reproducible
n_resample <- 10000 # set the number of times to resample the data
# First argument is the data; second is the number of resampled datasets
bootstrap <- function(df, n_resample) {
slope_resample <- matrix(NA, nrow = n_resample) # initialize vector
index <- 1:nrow(df) # create an index for supplied table
for (i in 1:n_resample) {
index_boot <- sample(index, replace = TRUE) # sample row numbers, with replacement
df_boot <- df[index_boot, ] # create a bootstrap sample from original data
a <- lm(waiting ~ eruptions, data=df_boot) # compute linear model
slope_resample[i] <- slope <- a$coefficients[2] # take the slope
}
return(slope_resample) # Return a one-column matrix of bootstrapped slopes
}
bootstrapped_slopes <- bootstrap(faithful, 10000)
But how to get geom_line() or geom_smooth() to take the data from bootstrapped_slopes? Any assistance is much appreciated.
EDIT: More direct adaptation from the OP
For plotting, I presume you want both the slopes and the intercepts, so here's a modified bootstrap function:
bootstrap <- function(df, n_resample) {
# Note 2 columns here, one for the intercept and one for the slope
slope_resample <- matrix(NA, nrow = n_resample, ncol = 2) # initialize matrix
index <- 1:nrow(df) # create an index for supplied table
for (i in 1:n_resample) {
index_boot <- sample(index, replace = TRUE) # sample row numbers, with replacement
df_boot <- df[index_boot, ] # create a bootstrap sample from original data
a <- lm(waiting ~ eruptions, data=df_boot) # compute linear model
slope_resample[i, 1] <- a$coefficients[1] # take the intercept (becomes column V1)
slope_resample[i, 2] <- a$coefficients[2] # take the slope (becomes column V2)
}
# Return a data frame with all the slopes and intercepts
return(as.data.frame(slope_resample))
}
Then run it and plot the lines from that data frame:
bootstrapped_slopes <- bootstrap(faithful, 10000)
library(dplyr); library(ggplot2)
ggplot(faithful, aes(eruptions, waiting)) +
geom_abline(data = bootstrapped_slopes %>%
sample_n(1000), # 10k lines look about the same as 1k, just darker and slower
aes(slope = V2, intercept = V1),
alpha = 0.01) +
geom_point(shape = 19, color = "red")
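As a variation on the function above, the loop can be replaced by replicate(), and the same data frame can be summarised numerically as well as plotted. This is a sketch only: boot_coefs and the column names intercept and slope are illustrative, and 1000 resamples are used to keep it quick.

```r
# Sketch: bootstrap intercepts and slopes without an explicit loop,
# then summarise the slopes with a percentile interval.
set.seed(777)
boot_coefs <- function(df, n_resample) {
  out <- replicate(n_resample, {
    idx <- sample(nrow(df), replace = TRUE)   # resample row numbers
    coef(lm(waiting ~ eruptions, data = df[idx, ]))
  })
  # replicate() returns a 2 x n_resample matrix; transpose to one row per fit
  setNames(as.data.frame(t(out)), c("intercept", "slope"))
}

coefs <- boot_coefs(faithful, 1000)
slope_ci <- quantile(coefs$slope, c(0.025, 0.975))  # 95% percentile interval
slope_ci
```

With these column names, the plotting call becomes geom_abline(data = coefs, aes(slope = slope, intercept = intercept), alpha = 0.01).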
Alternative solution
This could also be done using modelr and broom to simplify some of the bootstrapping. Based on the main help example for modelr::bootstrap, we can do the following:
library(purrr); library(modelr); library(broom); library(dplyr)
library(tidyr); library(ggplot2) # spread() and ggplot() are used below
set.seed(777)
# Creates bootstrap object with 10k extracts from faithful
boot <- modelr::bootstrap(faithful, 10000)
# Applies the linear regression to each
models <- map(boot$strap, ~ lm(waiting ~ eruptions, data = .))
# Extracts the model results into a tidy format
tidied <- map_df(models, broom::tidy, .id = "id")
# We just need the slope and intercept here
tidied_wide <- tidied %>% select(id, term, estimate) %>% spread(term, estimate)
ggplot(faithful, aes(eruptions, waiting)) +
geom_abline(data = tidied_wide %>%
sample_n(1000), # 10k lines look about the same as 1k, just darker and slower
aes(slope = eruptions, intercept = `(Intercept)`, group = id), alpha = 0.05) +
geom_point(shape = 19, color = "red") # spot_color wasn't provided in OP
I'm running a binomial GLM to predict the probability of a species occurrence, where I am training on one dataset and testing the model on another dataset:
TrainingData<-read.csv("TrainingData.csv")[,-1]
TrainingData[,1]<-as.factor(TrainingData[,1])
TrainingData[,4]<-as.factor(TrainingData[,4])
TestData<-read.csv("TestData.csv")[,-1]
TestData[,1]<-as.factor(TestData[,1])
TestData[,4]<-as.factor(TestData[,4])
mod<-glm(presence~var1+var2+var3, family=binomial, data=TrainingData)
probs=predict(mod, TestData, type="response")
What is the best way (or function) to create response curves to plot the relationship between the probability of presence and each predictor variable?
Thanks!
The marginal probabilities can be calculated from predict.glm with type = "terms",
since each term is calculated with the remaining variables set at their mean values.
This is converted back to the probability scale with plogis(term + intercept).
Second, because your data set contains a combination of continuous and factor
predictor variables, separate plots were made for each type and combined
with grid.arrange.
Although this answers your question directly based on the glm model you presented,
I would still recommend examining the spatial autocorrelation of both your predictor
and response variables, as this could well have an impact on your final model.
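To show what type = "terms" returns before applying it to the species data, here is a small sketch on mtcars (an assumption for illustration, since the poster's CSV files are not available). It checks that the centred terms plus the "constant" attribute add up to the full linear predictor, and then converts one term to the probability scale.

```r
# Sketch on mtcars: how type = "terms" decomposes the linear predictor
# of a binomial GLM.
m_bin <- glm(am ~ wt + hp, family = binomial, data = mtcars)

tt <- predict(m_bin, type = "terms")   # one centred column per model term
const <- attr(tt, "constant")          # linear predictor at the covariate means

# Summing the terms and the constant recovers the full linear predictor
eta <- predict(m_bin, type = "link")
all.equal(unname(rowSums(tt) + const), unname(eta))

# Probability as a function of wt alone, with hp held at its mean
p_wt <- plogis(tt[, "wt"] + const)
```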
library(reshape2)
library(dplyr)
library(tidyr)
library(ggplot2)
library(gridExtra)
TrainingData <- read.csv("~/Downloads/TrainingData.csv", header = TRUE)
TrainingData[['presence']] <- as.factor(TrainingData[['presence']])
TrainingData[['var3']] <- as.factor(TrainingData[['var3']])
TrainingData[['X']] <- NULL # Not used in the model
TestData <- read.csv("~/Downloads/TestData.csv", header = TRUE)
TestData[['presence']] <- as.factor(TestData[['presence']])
TestData[['var3']] <- as.factor(TestData[['var3']])
TestData[['X']] <- NULL
Presence/Absence model
mod <- glm(presence ~ var1 + var2 + var3, family = binomial, data = TrainingData)
Get predicted probabilities for each of the centered variables (i.e., with the remaining variables set to their means).
mod_terms <- predict(mod, newdata = TestData, type = "terms")
mod_prob <- data.frame(idx = 1:nrow(TestData), plogis(mod_terms +
attr(mod_terms, "constant")))
mod_probg <- mod_prob %>% gather(variable, probability, -idx)
Melt the Test data into long format
TestData['idx'] <- 1:nrow(TestData) # Add index to data
TestData[['X']] <- NULL # Drop the X variable since it was not used in the model
data_long <- melt(TestData, id = c("presence","idx"))
data_long[['value']] <- as.numeric(data_long[['value']]) # data_df does not exist yet at this point
Merge the test data with the predictions, then separate the continuous predictors (var1 and var2) from the factor (var3).
# Merge Testdata with predictions
data_df <- merge(data_long, mod_probg, by = c("idx", "variable"))
data_df <- data_df %>% arrange(variable, value)
data_continuous <- data_df %>% filter(., variable != "var3") %>%
transform(value = as.numeric(value)) %>% arrange(variable, value)
data_factor <- data_df %>% filter(., variable == "var3") %>%
transform(value = as.factor(value))%>%
arrange(idx)
ggplot output
g_continuous <- ggplot(data_continuous, aes(x = value, y = probability)) + geom_point()+
facet_wrap(~variable, scales = "free_x")
g_factor <- ggplot(data = data_factor, aes(x = value, y = probability)) + geom_boxplot() +
facet_wrap(~variable)
grid.arrange(g_continuous, g_factor, nrow = 1)