Using modelr::add_predictions for glm - r

I am trying to calculate logistic regression predictions for a data set using the tidyverse and modelr packages. Clearly I am doing something wrong in add_predictions, as I am not getting the "response" of the logistic function as I would with the predict function in stats. This should be simple, but I can't figure it out, and multiple searches yielded little.
library(tidyverse)
library(modelr)
options(na.action = na.warn)
library(ISLR)
d <- as_tibble(ISLR::Default)
model <- glm(default ~ balance, data = d, family = binomial)
grid <- d %>% data_grid(balance) %>% add_predictions(model)
ggplot(d, aes(x = balance)) +
  geom_point(aes(y = default)) +
  geom_line(data = grid, aes(y = pred))

predict.glm's type parameter defaults to "link". At the time of writing, add_predictions neither changes that default nor gives you any way to switch to the almost-certainly desired "response". (A GitHub issue exists; add your nice reprex to it if you like. Later modelr releases added a type argument to add_predictions, so check your installed version.) That said, it's not hard to use predict directly within the tidyverse via dplyr::mutate.
Also note that ggplot is coercing default (a factor) to numeric in order to plot the line, which is fine, except that "No" and "Yes" are replaced by 1 and 2, while the probabilities returned by predict will be between 0 and 1. Explicitly coercing to numeric and subtracting one fixes the plot, though an extra scale_y_continuous call is required to fix the labels.
library(tidyverse)
library(modelr)
d <- as_tibble(ISLR::Default)
model <- glm(default ~ balance, data = d, family = binomial)
grid <- d %>%
  data_grid(balance) %>%
  mutate(pred = predict(model, newdata = ., type = 'response'))
ggplot(d, aes(x = balance)) +
  geom_point(aes(y = as.numeric(default) - 1)) +
  geom_line(data = grid, aes(y = pred)) +
  scale_y_continuous('default', breaks = 0:1, labels = levels(d$default))
Also note that if all you want is a plot, geom_smooth can calculate predictions directly for you:
ggplot(d, aes(balance, as.numeric(default) - 1)) +
  geom_point() +
  geom_smooth(method = 'glm', method.args = list(family = 'binomial')) +
  scale_y_continuous('default', breaks = 0:1, labels = levels(d$default))
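To make the "link" versus "response" distinction concrete, here is a minimal base-R sketch. It uses simulated data as a stand-in for ISLR::Default (the coefficients -10 and 0.0055 and the names sim, fit and grid are assumptions chosen for illustration), and checks that applying the inverse logit to the link-scale predictions gives the response-scale ones.

```r
# Simulated stand-in for the Default data (hypothetical values, base R only)
set.seed(42)
balance <- runif(500, 0, 2500)
default <- rbinom(500, 1, plogis(-10 + 0.0055 * balance))
sim <- data.frame(balance, default)

fit <- glm(default ~ balance, data = sim, family = binomial)
grid <- data.frame(balance = seq(0, 2500, length.out = 50))

# Default type = "link": predictions are log-odds and are unbounded
grid$link <- predict(fit, newdata = grid)
# type = "response": predictions are probabilities in [0, 1]
grid$prob <- predict(fit, newdata = grid, type = "response")

# The inverse logit (plogis) maps the link scale onto the response scale
all.equal(plogis(grid$link), grid$prob, check.attributes = FALSE)
```

So a line that looks wrong when plotted against 0/1 data is often just the log-odds; it can be converted with plogis() after the fact if re-predicting is inconvenient.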

Related

How can I create a ggplot with a regression line based on the predicted values of a glm?

I am creating a ggplot and would like to add a regression line. I have tried geom_smooth, but I want the line to use the predicted values of my response variable, which I obtained from the predict() function.
Does anyone know how I could do this? I have tried adding:
geom_smooth(predicted.values ~ data$predictor1 + data$predictor2)
But this doesn't seem to be working. It returns:
Warning messages: 1: Removed 2 rows containing non-finite values
(stat_smooth). 2: Computation failed in stat_smooth(): variable
lengths differ (found for 'data$predictor1') 3: Removed 2 rows
containing missing values (geom_point).
Thank you for the help!
If you want to add a regression line from a glm, you can do it directly with geom_smooth, provided that you supply a list of appropriate arguments to the method.args parameter.
For example, suppose I have the following count data and wish to carry out a Poisson regression using glm:
set.seed(1)
df <- data.frame(x = 1:100,
                 y = rpois(100, seq(1, 5, length.out = 100)))
Then my model would look like this:
model <- glm(y ~ x, data = df, family = poisson)
And if I want to plot my data with a regression line, I can do:
library(ggplot2)
ggplot(df, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "glm", method.args = list(family = poisson),
              fill = "deepskyblue4", color = "deepskyblue4", linetype = 2)
If instead you want to use the output of predict directly, you could do something like:
predicted.values <- predict(model, type = "response")
ggplot(df, aes(x, y)) +
  geom_point() +
  geom_line(aes(y = predicted.values), linetype = 2)
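The predict route draws only the fitted line. If you also want an uncertainty band, one common approach (sketched here under the assumption that a Wald-type interval on the link scale is acceptable) is to predict on the link scale with standard errors and back-transform with the inverse link. The name band and the 1.96 multiplier are illustrative choices, not part of the original answer.

```r
set.seed(1)
df <- data.frame(x = 1:100,
                 y = rpois(100, seq(1, 5, length.out = 100)))
model <- glm(y ~ x, data = df, family = poisson)

# Predict on the link (log) scale, together with standard errors
p <- predict(model, type = "link", se.fit = TRUE)

# Back-transform the fit and an approximate 95% band with the inverse link
inv <- family(model)$linkinv
band <- data.frame(x   = df$x,
                   fit = inv(p$fit),
                   lwr = inv(p$fit - 1.96 * p$se.fit),
                   upr = inv(p$fit + 1.96 * p$se.fit))
head(band)
```

The band data frame could then be drawn with something like geom_ribbon(data = band, aes(x, ymin = lwr, ymax = upr), inherit.aes = FALSE, alpha = 0.2) alongside geom_line(data = band, aes(x, fit)).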

How to extract stat_smooth exponential fit parameters ggplot2

I've already tried many of the suggestions found here, but I simply can't figure it out.
Is it possible to extract the equation (y = a*exp(-b*x)) from a line fitted with stat_smooth?
This is a data example:
library(ggplot2)
library(tibble)
# tibble() replaces the deprecated data_frame()
df <- tibble(Time = c(0.5, 1, 2, 4, 8, 16, 24),
             Concentration = c(1, 0.5, 0.2, 0.05, 0.02, 0.01, 0.001))
Plot <- ggplot(df, aes(x = Time, y = Concentration)) +
  geom_point(size = 2) +
  stat_smooth(method = nls, formula = y ~ a * exp(-b * x),
              se = FALSE,
              method.args = list(start = c(a = 10, b = 0.01))) +
  theme_classic(base_size = 15) +
  labs(x = expression(Time (h)),
       y = expression(C[t]/C[0]))
I tried to use stat_regline_equation, but it does not work when I add the exponential function.
To extract data from a ggplot object you can use ggplot_build().
The values computed by stat_smooth() are in ggplot_build(Plot)$data[[2]], because the smooth is the second layer of the plot.
You can assign them to an object: build <- ggplot_build(Plot)$data[[2]]
Both snippets below then produce the same plot:
Plot <- ggplot(df, aes(x = Time, y = Concentration)) +
  geom_point(size = 2) +
  stat_smooth(method = nls, formula = y ~ a * exp(-b * x), se = FALSE,
              method.args = list(start = c(a = 10, b = 0.01)))
and
Plot <- ggplot(df, aes(x = Time, y = Concentration)) +
  geom_point(size = 2) +
  geom_line(data = build, aes(x = x, y = y), color = "blue")
I don't think it's possible. (1) I poked around in the guts of the object generated by ggplot_build(Plot) and didn't find anything likely (that doesn't prove it isn't there but...) (2) If you poke around in the source code of the ggpubr::stat_regline_equation() function you can see that rather than poke around in the stored information from the smooth it has to call a package function that re-fits the linear model so it can extract the coefficients and construct the equation.
You probably just have to re-fit the model yourself:
nls_fit <- nls(formula = Concentration ~ a * exp(-b * Time),
               start = c(a = 10, b = 0.01), data = df)
coef(nls_fit)
(You might find the format returned by broom::tidy(nls_fit) convenient.)
For this particular model you can also get the coefficients via
cc <- coef(glm(Concentration ~ Time, data = df, family = gaussian(link= "log")))
c(exp(cc[1]), -cc[2])
You could in principle write your own stat_ function mirroring stat_regline_equation that would encapsulate this functionality, but it would be a lot more work/wouldn't be worth it unless you were doing this operation very routinely or wanted to make it easy for others to do ...
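Once the model has been re-fitted, turning the coefficients into an equation label takes only a couple of lines. The sketch below uses hypothetical decay data (generated from a = 2 and b = 0.5 with a small deterministic perturbation) rather than the asker's data, and the names decay, cf and eq_label are made up for illustration.

```r
# Hypothetical decay data: a = 2, b = 0.5, plus a small perturbation
x <- 0:10
y <- 2 * exp(-0.5 * x) + 0.01 * sin(x)
decay <- data.frame(x, y)

# Re-fit the same model form that stat_smooth() was asked to draw
fit <- nls(y ~ a * exp(-b * x), data = decay, start = c(a = 1.5, b = 0.4))
cf <- coef(fit)

# Format the coefficients as a text label for the plot
eq_label <- sprintf("y = %.2f * exp(-%.2f * x)", cf[["a"]], cf[["b"]])
eq_label
```

The resulting string could be placed on the plot with, for example, annotate("text", x = Inf, y = Inf, label = eq_label, hjust = 1.1, vjust = 1.5) or labs(subtitle = eq_label).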

Fit a regression line with categorical variable in ggplot?

I have simulated a dataset and stored it in a tibble:
library(tidyverse)
set.seed(2002)
tre.sett <- rnorm(n = 12, mean = 41, sd = 5) # 12 individuals
ett.sett <- rnorm(n = 12, mean = 21, sd = 5) # 12 individuals
dat <- tibble(individ = 1:24,
              gruppe = rep(c("tre.sett", "ett.sett"), c(length(tre.sett), length(ett.sett))),
              rm = c(tre.sett, ett.sett))
Next I can create a basic plot of rm and gruppe using ggplot from tidyverse.
ggplot(dat, aes(gruppe, rm)) +
  geom_point() +
  theme_bw()
This gives me the following figure:
I want to add a regression line between the two groups, but I'm struggling to implement one. If I use geom_smooth(), nothing appears in the figure. The intercept and slope from my model are 21.900 and 20.524, respectively.
One solution has been given in the comments: re-encode the categories as integers before using geom_smooth.
Another solution. Since the "regression line" just connects the mean of the two groups, you can use stat_summary:
dat %>%
  ggplot(aes(gruppe, rm)) +
  geom_point() +
  stat_summary(geom = "line", fun = mean, group = 1) +
  theme_bw()
You might also want to look at the sjPlot package which uses the plot_model function to visualise regression models. It would be used something like this:
library(sjPlot)
lm1 <- lm(rm ~ gruppe, data = dat)
lm1 %>%
  plot_model(type = "pred",
             terms = "gruppe",
             show.data = TRUE) +
  geom_line() +
  theme_bw()
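The integer re-encoding mentioned in the comments, and the claim that the "regression line" just connects the two group means, can both be checked with a short base-R sketch. The column name gruppe_num is an assumption introduced here, and the simulated values will not match the question's tibble exactly because the random draws are made in a different order.

```r
set.seed(2002)
dat <- data.frame(
  gruppe = rep(c("tre.sett", "ett.sett"), each = 12),
  rm     = c(rnorm(12, mean = 41, sd = 5), rnorm(12, mean = 21, sd = 5))
)

# Encode the two groups as 0/1 so that lm (and geom_smooth) sees a numeric x
# Factor levels sort alphabetically: ett.sett = 0, tre.sett = 1
dat$gruppe_num <- as.integer(factor(dat$gruppe)) - 1L

fit <- lm(rm ~ gruppe_num, data = dat)
group_means <- tapply(dat$rm, dat$gruppe, mean)

# Intercept = mean of the reference group; slope = difference in group means
coef(fit)
group_means[["tre.sett"]] - group_means[["ett.sett"]]
```

With a numeric column like this on the x axis, geom_smooth(method = "lm") has something to regress on, which is why the re-encoding makes the line appear.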

annotate r squared to ggplot by using facet_wrap

I just joined the community and am looking forward to getting some help with the data analysis for my master's thesis.
At the moment I have the following problem:
I plotted 42 varieties with ggplot using facet_wrap:
ggplot(sumfvvar, aes(x = TemperaturCmean, y = Fv.Fm, col = treatment)) +
  geom_point(shape = 1, size = 1) +
  geom_smooth(method = lm) +
  scale_color_brewer(palette = "Set1") +
  facet_wrap(. ~ Variety)
That works very well, but I would like to annotate the plot with the R-squared values of the regression lines. I have two treatments and 42 varieties, and therefore 84 regression lines.
Is there any way to calculate all the R-squared values and integrate them into the ggplot? I already found the function
ggplotRegression <- function (fit) {
  require(ggplot2)
  ggplot(fit$model, aes_string(x = names(fit$model)[2], y = names(fit$model)[1])) +
    geom_point() +
    stat_smooth(method = "lm") +
    labs(title = paste("Adj R2 = ", signif(summary(fit)$adj.r.squared, 5),
                       "Intercept =", signif(fit$coef[[1]], 5),
                       " Slope =", signif(fit$coef[[2]], 5),
                       " P =", signif(summary(fit)$coef[2, 4], 5)))
}
but that works for just one variety and one treatment. Could a loop over the lm() function be an option?
Here is an example with the ggpmisc package:
library(ggpmisc)
set.seed(4321)
x <- 1:100
y <- (x + x^2 + x^3) + rnorm(length(x), mean = 0, sd = mean(x^3) / 4)
my.data <- data.frame(x = x,
                      y = y,
                      group = c("A", "B"))
formula <- y ~ poly(x, 1, raw = TRUE)
ggplot(my.data, aes(x, y)) +
  facet_wrap(~ group) +
  geom_point() +
  geom_smooth(method = "lm", formula = formula) +
  # after_stat() replaces the deprecated stat() in current ggplot2
  stat_poly_eq(formula = formula, parse = TRUE,
               mapping = aes(label = after_stat(rr.label)))
You can't apply different labels to different facets unless you add an R-squared column to your data. One way is to use geom_text, but you need to calculate the statistics first. Below is an example with iris; for your case, just change Species to Variety, and so on.
library(tidyverse)
# simulate data for 2 treatments
# d2 is just shifted up from d1
d1 <- data.frame(iris, Treatment = "A")
d2 <- data.frame(iris, Treatment = "B") %>%
  mutate(Sepal.Length = Sepal.Length + rnorm(nrow(iris), 1, 0.5))
# combine datasets
DF <- rbind(d1, d2) %>% rename(Variety = Species)
# plot like you did
# note I use "free" scales, if scales very different between Species
# your facet plots will be squished
g <- ggplot(DF, aes(x = Sepal.Width, y = Sepal.Length, col = Treatment)) +
  geom_point(shape = 1, size = 1) +
  geom_smooth(method = lm) +
  scale_color_brewer(palette = "Set1") +
  facet_wrap(. ~ Variety, scales = "free")
# rsq function
RSQ = function(y, x) {signif(summary(lm(y ~ x))$adj.r.squared, 3)}
# calculate rsq for variety + treatment
STATS <- DF %>%
  group_by(Variety, Treatment) %>%
  summarise(Rsq = RSQ(Sepal.Length, Sepal.Width)) %>%
  # make a label
  # one other option is to use stringr::str_wrap in geom_text
  mutate(Label = paste("Treat", Treatment, ", Rsq=", Rsq))
# set vertical position of rsq
VJUST = ifelse(STATS$Treatment == "A", 1.5, 3)
# finally the plot function
g + geom_text(data = STATS, aes(x = -Inf, y = +Inf, label = Label),
              hjust = -0.1, vjust = VJUST, size = 3)
In the last geom_text() call, the labels are kept from overlapping by giving each treatment a different vjust (the VJUST vector). You might need to adjust these values depending on your plot.
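On the asker's question of whether a loop over lm() is an option: it is, and base R can do it without any extra packages. The sketch below splits iris by Species (standing in for Variety; the names groups, rsq and labels are illustrative) and fits one model per piece.

```r
# One lm per group, collecting R-squared (iris stands in for the asker's data)
groups <- split(iris, iris$Species)
rsq <- sapply(groups, function(d) {
  summary(lm(Sepal.Length ~ Sepal.Width, data = d))$r.squared
})

# A small label table with one row per facet, suitable as geom_text() data
labels <- data.frame(Species = names(rsq),
                     label   = sprintf("R^2 = %.2f", rsq))
labels
```

For two grouping variables, splitting on interaction(Variety, treatment) would follow the same pattern; the dplyr version above is arguably tidier once the grouping columns need to be kept for faceting.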

ggplot2 geom_smooth, extended model for method=lm

I would like to use geom_smooth to get a fitted line from a certain linear regression model.
It seems to me that the formula can only take x and y and not any additional parameter.
To show more clearly what I want:
library(dplyr)
library(ggplot2)
set.seed(35413)
df <- data.frame(pred = runif(100, 10, 100),
                 factor = sample(c("A", "B"), 100, replace = TRUE)) %>%
  mutate(
    outcome = 100 + 10 * pred +
      ifelse(factor == "B", 200, 0) +
      ifelse(factor == "B", 4, 0) * pred +
      rnorm(100, 0, 60))
With
ggplot(df, aes(x = pred, y = outcome, color = factor)) +
  geom_point(aes(color = factor)) +
  geom_smooth(method = "lm") +
  theme_bw()
I produce fitted lines that, due to the color=factor option, are basically the output of the linear model lm(outcome ~ pred*factor, df)
In some cases, however, I prefer the lines to be the output of a different model fit, like lm(outcome ~ pred + factor, df), for which I can use something like:
fit <- lm(outcome ~ pred + factor, df)
predval <- expand.grid(
  pred = seq(
    min(df$pred), max(df$pred), length.out = 1000),
  factor = unique(df$factor)) %>%
  mutate(outcome = predict(fit, newdata = .))
ggplot(df, aes(x = pred, y = outcome, color = factor)) +
  geom_point() +
  geom_line(data = predval) +
  theme_bw()
which results in :
My question: is there a way to produce the latter graph exploiting the geom_smooth instead? I know there is a formula = - option in geom_smooth but I can't make something like formula = y ~ x + factor or formula = y ~ x + color (as I defined color = factor) work.
This is a very interesting question. Probably the main reason why geom_smooth is so "resistant" to allowing custom models of multiple variables is that it is limited to producing 2-D curves; consequently, its arguments are designed for handling two-dimensional data (i.e. formula = response variable ~ independent variable).
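The trick to getting what you requested is to use the mapping argument within geom_smooth instead of formula. As you've probably seen from the documentation, formula only lets you specify the mathematical structure of the model in terms of x and y (e.g. linear, quadratic, etc.). The mapping argument, by contrast, lets you supply new y-values directly, such as the output of a custom linear model obtained with predict(). Because predictions from the additive model are exactly linear in pred within each level of factor, the per-group lm that geom_smooth then fits reproduces them; note, however, that the shaded band comes from that refit and does not reflect the additive model's uncertainty.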
Note that, by default, inherit.aes is set to TRUE, so your plotted regressions will be coloured appropriately by your categorical variable. Here's the code:
# original plot
plot1 <- ggplot(df, aes(x = pred, y = outcome, color = factor)) +
  geom_point(aes(color = factor)) +
  geom_smooth(method = "lm") +
  ggtitle("outcome ~ pred") +
  theme_bw()
# declare new model here
plm <- lm(formula = outcome ~ pred + factor, data = df)
# plot with lm for outcome ~ pred + factor
plot2 <- ggplot(df, aes(x = pred, y = outcome, color = factor)) +
  geom_point(aes(color = factor)) +
  geom_smooth(method = "lm", mapping = aes(y = predict(plm, df))) +
  ggtitle("outcome ~ pred + factor") +
  theme_bw()
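One caveat with mapping predictions into geom_smooth is that the smoother is then fitted to the predictions rather than to the data. The base-R sketch below (the names refit_A, newd and ci are assumptions for illustration) shows that such a refit has essentially zero residuals, and how predict(..., interval = "confidence") gives limits that do come from the additive model.

```r
set.seed(35413)
df <- data.frame(pred = runif(100, 10, 100),
                 factor = sample(c("A", "B"), 100, replace = TRUE))
df$outcome <- 100 + 10 * df$pred +
  ifelse(df$factor == "B", 200, 0) +
  ifelse(df$factor == "B", 4, 0) * df$pred +
  rnorm(100, 0, 60)

plm <- lm(outcome ~ pred + factor, data = df)
df$fitted <- predict(plm, df)

# Within one group the predictions lie exactly on a line, so an lm fitted
# to them leaves no residual and any band around that refit is degenerate
refit_A <- lm(fitted ~ pred, data = subset(df, factor == "A"))
max(abs(residuals(refit_A)))

# Confidence limits that reflect the additive model itself
newd <- expand.grid(pred = seq(min(df$pred), max(df$pred), length.out = 50),
                    factor = c("A", "B"),
                    stringsAsFactors = FALSE)
ci <- cbind(newd, predict(plm, newdata = newd, interval = "confidence"))
head(ci)
```

The ci data frame has fit, lwr and upr columns, so it could be drawn with geom_line(aes(y = fit)) and geom_ribbon(aes(ymin = lwr, ymax = upr)) if an honest band is wanted.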
