I have predicted values, via:
glm0 <- glm(use ~ as.factor(decision), data = decision_use, family = binomial(link = "logit"))
predicted_glm <- predict(glm0, newdata = decision_use, type = "response", interval = "confidence", se = TRUE)
predict <- predicted_glm$fit
predict <- predict + 1
head(predict)
1 2 3 4 5 6
0.3715847 0.3095335 0.3095335 0.3095335 0.3095335 0.5000000
Now when I plot a box plot using ggplot2,
ggplot(decision_use, aes(x = decision, y = predict)) +
geom_boxplot(aes(fill = factor(decision)), alpha = .2)
I get a box plot with one horizontal line per categorical variable. If you look at the predict data, it's same for each categorical variable, so makes sense.
But I want a box plot with the range. How can I get that? When I use "use" instead of predict, I get boxes stretching from end to end (1 to 0). So I suppose that's not it. Thank you in advance.
To clarify, predicted_glm includes se.fit values. I wonder how to incorporate those.
It doesn't really make sense to do a boxplot here. A boxplot shows the range and spread of a continuous variable within groups. Your dependent variable is binary, so the values are all 0 or 1. Since you are plotting predictions for each group, your plot would have just a single point representing the expected value (i.e. the probability) for each group.
The closest you can come is probably to plot the prediction with 95% confidence bars around it.
You haven't provided any sample data, so I'll make some up here:
set.seed(100)
df <- data.frame(outcome = rbinom(200, 1, c(0.1, 0.9)), var1 = rep(c("A", "B"), 100))
Now we'll create our model and get the prediction for each level of my predictor variable using the newdata parameter of predict. I'm going to specify type = "link" because I want the log odds, and I'm also going to specify se.fit = TRUE so I can get the standard error of these predictions:
mod <- glm(outcome ~ var1, data = df, family = binomial)
prediction <- predict(mod, list(var1 = c("A", "B")), se.fit = TRUE, type = "link")
Now I can work out the 95% confidence intervals for my predictions:
prediction$lower <- prediction$fit - prediction$se.fit * 1.96
prediction$upper <- prediction$fit + prediction$se.fit * 1.96
Finally, I transform the fit and confidence intervals from log odds into probabilities:
prediction <- lapply(prediction, function(logodds) exp(logodds)/(1 + exp(logodds)))
plotdf <- data.frame(Group = c("A", "B"), fit = prediction$fit,
upper = prediction$upper, lower = prediction$lower)
plotdf
#> Group fit upper lower
#> 1 A 0.13 0.2111260 0.07700412
#> 2 B 0.92 0.9594884 0.84811360
Now I am ready to plot. I will use geom_points for the probability estimates and geom_errorbars for the confidence intervals :
library(ggplot2)
ggplot(plotdf, aes(x = Group, y = fit, colour = Group)) +
geom_errorbar(aes(ymin = lower, ymax = upper), size = 2, width = 0.5) +
geom_point(size = 3, colour = "black") +
scale_y_continuous(limits = c(0, 1)) +
labs(title = "Probability estimate with 95% CI", y = "Probability")
Created on 2020-05-11 by the reprex package (v0.3.0)
Related
I want to estimate predicted values for the mean (mu) and standard deviation (sigma) based on a gamlss model. However, it is not clear to me how to extract a standard deviation for given values of x
The data frame I am using looks like this:
#> head(abdom)
# y x
# 59 12.29
# 64 12.29
# 56 12.29
Here is the code to fit a gamlss model:
library(gamlss)
fit = gamlss(y ~ cs(x), sigma.formula = ~ cs(x), data = abdom, family = BCPE)
I want to calculate z-scores based on this model using the following approach: z = (y - mu)/sigma . Therefore, I use this code to calculate mu and sigma for each value of y and calculate the z scores. 95% of the z scores should lie between -2 and 2.
using predict function
mu = predict(fit, newdata = abdom, type = "response", what = "mu")
si = predict(fit, newdata = abdom, type = "response", what = "sigma")
z_score1 = (abdom$y - mu) / si
hist(z_score1)
using centiles.pred function
z_score2 = centiles.pred(fit, xname = "x", xvalues = abdom$x, yval = abdom$y, type = "z-scores")
hist(z_score2)
This leads to the following plots:
for z_score1, most scores are not even close to lie between -2 and 2.
Another way to approach this is by plotting the mean and standard deviation:
# calculating mu +/- 2*sigma
pred_dat = data.frame(x = 10:45)
mu = predict(fit, newdata = pred_dat, type = "response", what = "mu")
si = predict(fit, newdata = pred_dat, type = "response", what = "sigma")
hi = mu + (2 * si)
lo = mu - (2 * si)
pred_dat$mu = mu
pred_dat$hi = hi
pred_dat$lo = lo
# plotting
ggplot(data = pred_dat, aes(x = x)) +
geom_point(data = abdom, aes(x = x, y = y)) +
geom_line(aes(y = mu), colour = "red") +
geom_line(aes(y = hi), colour = "blue") +
geom_line(aes(y = lo), colour = "blue")
yielding the following plot:
Again, 95% of the values should lie between the two blue lines (hi and lo). But the values of the standard deviations are so low that there seems to be only one line.
So my questions are:
first question: what do the values derived from predict represent if not the standard deviation conditional to x?
second question: how can I predict the standard deviation for a given x-value?
The gamlss package provides distribution functions for the BCPE distribution, including qBCPE. If you plug the coefficients from your model into this function at pnorm(1), then you will get the predicted value of y at 1 standard deviation above the predicted mean. Since you can get the predicted mean with predict(fit), then you can easily get the standard deviation. The difficult part is getting the parameters from your model into qBCPE. Here's a reprex:
library(gamlss)
library(ggplot2)
fit <- gamlss(y ~ cs(x), sigma.formula = ~ cs(x), data = abdom, family = BCPE)
Q <- qBCPE(pnorm(1),
mu = predict(fit),
sigma = exp(fit$sigma.coefficients[1] +
fit$sigma.coefficients[2] * cs(abdom$x)),
nu = fit$nu.coefficients,
tau = exp(fit$tau.coefficients))
SD <- c(Q - predict(fit))
Here, SD gives the vector of standard deviations at each value of x:
head(SD)
#> [1] 4.092467 4.092467 4.092467 4.203738 4.425361 4.425361
To show this is correct, let's plot 1.96 standard deviations on either side of the prediction line:
ggplot(data = data.frame(x = abdom$x, y = predict(fit),
upper = predict(fit) + 1.96 * SD,
lower = predict(fit) - 1.96 * SD), aes(x, y)) +
geom_point(data = abdom) +
geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.3) +
geom_line(color = "blue", linewidth = 1)
This looks good. Let's confirm that about 5% of observations lie outside 1.96 standard deviations of the mean:
(sum(abdom$y > predict(fit) + 1.96 * SD) +
sum(abdom$y < predict(fit) - 1.96 * SD)) / nrow(abdom)
#> [1] 0.0557377
And let's show that the calculated Z scores follow a standard normal distribution:
Z <- (abdom$y - predict(fit))/SD
hist(Z, breaks = 20, freq = FALSE)
lines(seq(-4, 4, 0.1), dnorm(seq(-4, 4, 0.1)))
This looks pretty good.
Created on 2023-01-09 with reprex v2.0.2
For BCPE the z-scores are not (y-mu)/sigma.
For any gamlss fit, the z-scores are exactly equal to the residuals of the fitted model, i.e. for your model fit
resid(fit)
or
fit$residuals
I am trying to plot 95% confidence intervals on some simulated values but am running into so issues when i am trying to plot the CIs using the geom_ribbon() argument. The trouble I'm having it that my model does not show the CIs when i plot them, like so;
I have included all of my code below if anyone knows where i have gone wrong here;
set.seed(20220520)
#simulating 200 values between 0 and 1 from a uniform distribution
x = runif(200, min = 0, max = 1)
lam = exp(0.3+5*x)
y = rpois(200, lambda = lam)
#before we do this each Yi may contain zeros so we need to add a small constant
y <- y + .1
#combining x and y into a dataframe so we can plot
df = data.frame(x, y)
#fitting a Poisson GLM
model2 <- glm(y ~ x,
data = df,
family = poisson(link='log'))
#make predictions (this may be the same as predictions_mod2)
preds <- predict(model2, type = "response")
#making CI predictions
predictions_mod2 = predict(model2, df, se.fit = TRUE, type = 'response')
#calculate confidence intervals limit
upper_mod2 = predictions_mod2$fit+1.96*predictions_mod2$se.fit
lower_mod2 = predictions_mod2$fit-1.96*predictions_mod2$se.fit
#transform the CI limit to get one at the level of the mean
upper_mod2 = exp(upper_mod2)/(1+exp(upper_mod2))
lower_mod2 = exp(lower_mod2)/(1+exp(lower_mod2))
#combining into a df
predframe = data.frame(lwr=lower_mod2,upr=upper_mod2, x = df$x, y = df$y)
#plot model with 95% confidence intervals using ggplot
ggplot(df, aes(x, y)) +
geom_ribbon(data = predframe, aes(ymin=lwr, ymax=upr), alpha = 0.4) +
geom_point() +
geom_line(aes(x, preds2), col = 'blue')
In a comment to the question, it's asked why not to logit transform the predicted values. The reason why is that the type of prediction asked for is "response". From the documentation, my emphasis.
type
the type of prediction required. The default is on the scale of the linear predictors; the alternative "response" is on the scale of the response variable. Thus for a default binomial model the default predictions are of log-odds (probabilities on logit scale) and type = "response" gives the predicted probabilities. The "terms" option returns a matrix giving the fitted values of each term in the model formula on the linear predictor scale.
There is a good way to answer, to show the code.
library(ggplot2, quietly = TRUE)
set.seed(20220520)
#simulating 200 values between 0 and 1 from a uniform distribution
x = runif(200, min = 0, max = 1)
lam = exp(0.3+5*x)
y = rpois(200, lambda = lam)
#before we do this each Yi may contain zeros so we need to add a small constant
y <- y + 0.1
#combining x and y into a dataframe so we can plot
df = data.frame(x, y)
#fitting a Poisson GLM
suppressWarnings(
model2 <- glm(y ~ x,
data = df,
family = poisson(link='log'))
)
#make predictions (this may be the same as predictions_mod2)
preds <- predict(model2, type = "response")
#making CI predictions
predictions_mod2 = predict(model2, df, se.fit = TRUE, type = 'response')
#calculate confidence intervals limit
upper_mod2 = predictions_mod2$fit+1.96*predictions_mod2$se.fit
lower_mod2 = predictions_mod2$fit-1.96*predictions_mod2$se.fit
#combining into a df
predframe = data.frame(lwr=lower_mod2,upr=upper_mod2, x = df$x, y = df$y)
#plot model with 95% confidence intervals using ggplot
ggplot(df, aes(x, y)) +
geom_ribbon(data = predframe, aes(ymin=lwr, ymax=upr), alpha = 0.4) +
geom_point() +
geom_line(aes(x, preds), col = 'blue')
Created on 2022-05-29 by the reprex package (v2.0.1)
Is there a way to set the x-axis limits when plotting the predicted fits for GAM models? More specifically, I'm fitting a smoother for each level of a factor using 'by = ', however, each factor level has a different range of values. Plotting the variable in ggplot results in an x-axis that automatically accommodates the different ranges of 'x'; however, after fitting a GAM (mgcv::gam()) the default behavior of plot.gam() appears to be predicting values across a shared x-axis limit.
The dummy data below has some continuous variable for 'x', but in my real data, 'x' is Time (year), and 'group' is sampling location. Because I did not collect data from each site across the same time range, I feel it is inappropriate to show a model fit in these empty years.
library(tidyverse)
library(mgcv)
library(gratia)
theme_set(theme_classic())
## simulate data with a grouping variable of three levels:
d = data.frame(group = rep(c('A','B','C'), each = 100),
x = c(seq(0,1,length=100),
seq(.2,1,length=100),
seq(0,.5,length=100))) %>%
mutate(y = sin(2*pi*x) + rnorm(100, sd=0.3),
group = as.factor(group))
## Look at data
ggplot(d, aes(x = x, y = y, colour = group))+
facet_wrap(~group)+
geom_point()+
geom_smooth()
Here is the raw data with loess smoother in ggplot:
## fit simple GAM with smoother for X
m1 = mgcv::gam(y ~ s(x, by = group), data = d)
## base R plot
par(mfrow = c(2,2), bty = 'l', las = 1, mai = c(.6,.6,.2,.1), mgp = c(2,.5,0))
plot(m1)
## Gavin's neat plotter
gratia::draw(m1)
Here is the predicted GAM fit that spans the same range (0,1) for all three groups:
Can I limit the prediction/plot to actual values of 'x'?
If you install the current development version (>= 0.6.0.9111) from GitHub, {gratia} will now do what you want, sort of. I added some functionality to smooth_estimates() that I had planned to add eventually but your post kicked it the top of the ToDo list and motivated me to add it now.
You can use smooth_estimates() to evaluate the smooths at the observed (or any user-supplied) data only and then a bit of ggplot() recreates most of the plot.
remotes::install_github("gavinsimpson/gratia")
library('mgcv')
library('gratia')
library('dplyr')
library('ggplot2')
d <- data.frame(group = rep(c('A','B','C'), each = 100),
x = c(seq(0,1,length=100),
seq(.2,1,length=100),
seq(0,.5,length=100))) %>%
mutate(y = sin(2*pi*x) + rnorm(100, sd=0.3),
group = as.factor(group))
m <- gam(y ~ group + s(x, by = group), data = d, method = 'REML')
sm <- smooth_estimates(m, data = d) %>%
add_confint()
ggplot(sm, aes(x = x, y = est, colour = group)) +
geom_ribbon(aes(ymin = lower_ci, ymax = upper_ci, colour = NULL, fill = group),
alpha = 0.2) +
geom_line() +
facet_wrap(~ group)
I am working with R. Using a tutorial, I was able to create a statistical model and produce visual plots for some of the outputs:
#load libraries
library(survival)
library(dplyr)
library(ranger)
library(data.table)
library(ggplot2)
#use the built in "lung" data set
#remove missing values (dataset is called "a")
a <- na.omit(lung)
#create id variable
a$ID <- seq_along(a[,1])
#create test set with only the first 3 rows
new <- a[1:3,]
#create a training set by removing first three rows
a <- a[-c(1:3),]
#fit survival model (random survival forest)
r_fit <- ranger(Surv(time,status) ~ age + sex + ph.ecog + ph.karno + pat.karno + meal.cal + wt.loss, data = a, mtry = 4, importance = "permutation", splitrule = "extratrees", verbose = TRUE)
#create new intermediate variables required for the survival curves
death_times <- r_fit$unique.death.times
surv_prob <- data.frame(r_fit$survival)
avg_prob <- sapply(surv_prob, mean)
#use survival model to produce estimated survival curves for the first three observations
pred <- predict(r_fit, new, type = 'response')$survival
pred <- data.table(pred)
colnames(pred) <- as.character(r_fit$unique.death.times)
#plot the results for these 3 patients
plot(r_fit$unique.death.times, pred[1,], type = "l", col = "red")
lines(r_fit$unique.death.times, pred[2,], type = "l", col = "green")
lines(r_fit$unique.death.times, pred[3,], type = "l", col = "blue")
Now, I am trying to convert the above plot into ggplot format (and add 95% confidence intervals):
ggplot(r_fit) + geom_line(aes(x = r_fit$unique.death.times, y = pred[1,], group = 1), color = red) + geom_ribbon(aes(ymin = 0.95 * pred[1,], ymax = - 0.95 * pred[1,]), fill = "red") + geom_line(aes(x = r_fit$unique.death.times, y = pred[2,], group = 1), color = blue) + geom_ribbon(aes(ymin = 0.95 * pred[2,], ymax = - 0.95 * pred[2,]), fill = "blue") + geom_line(aes(x = r_fit$unique.death.times, y = pred[3,], group = 1), color = green) + geom_ribbon(aes(ymin = 0.95 * pred[3,], ymax = - 0.95 * pred[3,]), fill = "green") + theme(axis.text.x = element_text(angle = 90)) + ggtitle("sample graph")
But this produces the following error:
Error: `data` must be a data frame, or other object coercible by `fortify()`, not an S3 object with class ranger
Run `rlang::last_error()` to see where the error occurred.
What is the reason for this error? Can someone please show me how to fix this problem?
Thanks
As per the ggplot2 documentation, you need to provide a data.frame() or object that can be converted (coerced) to a data.frame(). In this case, if you want to reproduce the plot above in ggplot2, you will need to manually set up the data frame yourself.
Below is an example of how you could set up the data to display the plot in ggplot2.
Data Frame
First we create a data.frame() with the variables that we want to plot. The easiest way to do this is to just group them all in as separate columns. Note that I have used the as.numeric() function to first coerce the predicted values to a vector, because they were previously a data.table row, and if you don't convert them they are maintained as rows.
ggplot_data <- data.frame(unique.death.times = r_fit$unique.death.times,
pred1 = as.numeric(pred[1,]),
pred2 = as.numeric(pred[2,]),
pred3 = as.numeric(pred[3,]))
head(ggplot_data)
## unique.death.times pred1 pred2 pred3
## 1 5 0.9986676 1.0000000 0.9973369
## 2 11 0.9984678 1.0000000 0.9824642
## 3 12 0.9984678 0.9998182 0.9764154
## 4 13 0.9984678 0.9998182 0.9627118
## 5 15 0.9731656 0.9959416 0.9527424
## 6 26 0.9731656 0.9959416 0.9093876
Pivot the data
This format is still not ideal, because in order to plot the data and colour by the correct column (variable), we need to 'pivot' the data. We need to load the tidyr package for this.
library(tidyr)
ggplot_data <- ggplot_data %>%
pivot_longer(cols = !unique.death.times,
names_to = "category", values_to = "predicted.value")
Plotting
Now the data is in a form that makes it really easy to plot in ggplot2.
plot <- ggplot(ggplot_data, aes(x = unique.death.times, y = predicted.value, colour = category)) +
geom_line()
plot
If you really want to match the look of the base plot, you can add theme_classic():
plot + theme_classic()
Additional notes
Note that this doesn't include 95% confidence intervals, so they would have to be calculated separately. Be aware though, that a 95% confidence interval is not just 95% of the y value at a given x value. There are calculations that will give you the correct values of the confidence interval, including functions built into R.
For a quick view of a trend line with prediction intervals, you can use the geom_smooth() function in ggplot2, but in this case it adds a loess curve by default, and the intervals provided by that function.
plot + theme_classic() + geom_smooth()
I need to graph the predicted probabilities of a logit regression in ggplot2. Essentially, I am trying to graph a glm by each treatment condition within the same graph. However, I am getting quite confused about how to do this seeing that my treat variable (i.e. the x I am interested in) is categorical.This means that when I try to graph the treatment effects using ggplot I just get a bunch of points at 0, 1, and 2 but no lines.
My question is... How could I graph the logit prediction lines in this case? Thanks in advance!
set.seed(96)
df <- data.frame(
vote = sample(0:1, 200, replace = T),
treat = sample(0:3, 200, replace = T))
glm_output <- glm(vote ~ as.factor(treat), data = df, family = binomial(link = "logit"))
predicted_vote <- predict(glm_output, newdata = df, type = "link", interval = "confidence", se = TRUE)
df <- cbind(df, data.frame(predicted_vote))
Since the explanatory variable treat is categorical, it will make more sense if you use boxplot instead like the following:
ggplot(df, aes(x = treat, y = predicted_prob)) +
geom_boxplot(aes(fill = factor(treat)), alpha = .2)
If you want to see the predicted probabilities by glm across different values of some of other explanatory variables you may try this:
ggplot(df, aes(x = treat, y = predicted_prob)) +
geom_boxplot(aes(fill = factor(treat)), alpha = .2) + facet_wrap(~gender)
# create age groups
df$age_group <- cut(df$age, breaks=seq(0,100,20))
ggplot(df, aes(x = treat, y = predicted_prob)) +
geom_boxplot(aes(fill = factor(treat)), alpha = .2) + facet_grid(age_group~gender)