Fit a regression line with categorical variable in ggplot? - r

I have simulated a dataset and stored it in a tibble:
library(tidyverse)
set.seed(2002)
tre.sett <- rnorm(n = 12, mean = 41, sd = 5) # 12 individuals
ett.sett <- rnorm(n = 12, mean = 21, sd = 5) # 12 individuals
dat <- tibble(individ = 1:24,
gruppe = rep(c("tre.sett", "ett.sett"), c(length(tre.sett), length(ett.sett))),
rm = c(tre.sett, ett.sett))
Next I can create a basic plot of rm and gruppe using ggplot from tidyverse.
ggplot(dat, aes(gruppe, rm)) +
geom_point()+
theme_bw()
This gives me the following figure:
I want to add a regression line between the two groups, but I'm struggling to implement one. If I use geom_smooth(), nothing appears in the figure. The intercept and slope from my model are 21.900 and 20.524, respectively.

One solution has been given in the comments: re-encode the categories as integers before using geom_smooth.
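A minimal sketch of that re-encoding idea, assuming data shaped like the question's dat. The column gruppe_num is a hypothetical helper name, not something from the original post. Because the groups are coded 0 and 1, the fitted line has the same intercept and slope as lm(rm ~ gruppe):

```r
library(ggplot2)

set.seed(2002)
dat <- data.frame(
  gruppe = rep(c("tre.sett", "ett.sett"), each = 12),
  rm = c(rnorm(12, mean = 41, sd = 5), rnorm(12, mean = 21, sd = 5))
)

# Hypothetical helper column: 0 = ett.sett, 1 = tre.sett (alphabetical factor order)
dat$gruppe_num <- as.integer(factor(dat$gruppe)) - 1L

p <- ggplot(dat, aes(gruppe_num, rm)) +
  geom_point() +
  # With a numeric x, geom_smooth() can fit and draw the linear model
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  # Put the group names back on the axis
  scale_x_continuous(breaks = 0:1, labels = levels(factor(dat$gruppe))) +
  theme_bw()
p
```

The axis is now continuous, so the breaks and labels have to be set by hand to keep the group names.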
Another solution: since the "regression line" just connects the means of the two groups, you can use stat_summary:
dat %>%
ggplot(aes(gruppe, rm)) +
geom_point() +
stat_summary(geom = "line", fun = mean, group = 1) +
theme_bw()
Result:
You might also want to look at the sjPlot package which uses the plot_model function to visualise regression models. It would be used something like this:
library(sjPlot)
lm1 <- lm(rm ~ gruppe, data = dat)
lm1 %>%
plot_model(type = "pred",
terms = "gruppe",
show.data = TRUE) +
geom_line() +
theme_bw()
Result:

Related

ggplot2 geom_qq change theoretical data

I have a set of p-values, i.e. 0 <= pval <= 1.
I want to draw a Q-Q plot of them using ggplot2.
The following code from the documentation draws a Q-Q plot, but since my data are p-values I want the theoretical values to be probabilities as well, i.e. 0 <= theoretical value <= 1.
df <- data.frame(y = rt(200, df = 5))
p <- ggplot(df, aes(sample = y))
p + stat_qq() + stat_qq_line()
I am aware of qqplot.pvalues from the gaston package. It does the job, but the plot is not as customizable as the ggplot version.
In the gaston package the theoretical data are plotted as -log10((n:1)/(n + 1)), where n is the number of p-values. How can I pass these values to ggplot as the theoretical data?
Assuming you have some p-values, say from a normal distribution, you could create the plot manually:
library(ggplot2)
data <- data.frame(outcome = rnorm(150))
data$pval <- pnorm(data$outcome)
data <- data[order(data$pval),]
ggplot(data = data, aes(y = pval, x = pnorm(qnorm(ppoints(nrow(data)))))) +
geom_point() +
geom_abline(slope = 1) +
labs(x = 'theoretical p-val', y = 'observed p-val', title = 'qqplot (pval-scale)')
Although I am not sure this plot is sensible to use for conclusions.
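If the -log10 scale mentioned in the question is what you are after, the same manual approach should carry over. This is only a sketch: the p-values are simulated from a uniform distribution, the names qq, expected and observed are made up for illustration, and the expected quantiles use the -log10((n:1)/(n + 1)) formula quoted in the question rather than anything checked against gaston itself.

```r
library(ggplot2)

set.seed(1)
pval <- runif(200)   # simulated p-values, stand-in for real data
n <- length(pval)

qq <- data.frame(
  # expected quantiles as quoted in the question (ascending in -log10 units)
  expected = -log10((n:1) / (n + 1)),
  # observed values sorted ascending so they pair up with the expected ones
  observed = sort(-log10(pval))
)

p <- ggplot(qq, aes(expected, observed)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1) +
  labs(x = "expected -log10(p)", y = "observed -log10(p)")
p
```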

annotate r squared to ggplot by using facet_wrap

I just joined the community and am looking forward to getting some help with the data analysis for my master's thesis.
At the moment I have the following problem:
I plotted 42 varieties with ggplot using facet_wrap:
ggplot(sumfvvar,aes(x=TemperaturCmean,y=Fv.Fm,col=treatment))+
geom_point(shape=1,size=1)+
geom_smooth(method=lm)+
scale_color_brewer(palette = "Set1")+
facet_wrap(.~Variety)
That works very well, but I would like to annotate the plot with the R-squared values of the regression lines. I have two treatments and 42 varieties, and therefore 84 regression lines.
Is there any way to calculate all the R-squared values and add them to the ggplot? I have already found this function:
ggplotRegression <- function (fit) {
require(ggplot2)
ggplot(fit$model, aes_string(x = names(fit$model)[2], y = names(fit$model)[1])) +
geom_point() +
stat_smooth(method = "lm") +
labs(title = paste("Adj R2 = ",signif(summary(fit)$adj.r.squared, 5),
"Intercept =",signif(fit$coef[[1]],5 ),
" Slope =",signif(fit$coef[[2]], 5),
" P =",signif(summary(fit)$coef[2,4], 5)))
}
but that only works for one variety and one treatment. Could a loop over the lm() function be an option?
Here is an example with the ggpmisc package:
library(ggpmisc)
set.seed(4321)
x <- 1:100
y <- (x + x^2 + x^3) + rnorm(length(x), mean = 0, sd = mean(x^3) / 4)
my.data <- data.frame(x = x,
y = y,
group = c("A", "B"))
formula <- y ~ poly(x, 1, raw = TRUE)
ggplot(my.data, aes(x, y)) +
facet_wrap(~ group) +
geom_point() +
geom_smooth(method = "lm", formula = formula) +
stat_poly_eq(formula = formula, parse = TRUE,
mapping = aes(label = after_stat(rr.label)))
You can't apply different labels to different facets unless you add another R-squared column to your data. One way is to use geom_text, but you need to calculate the statistics first. Below is an example with iris; for your case, just swap Species for Variety, and so on.
library(tidyverse)
# simulate data for 2 treatments
# d2 is just shifted up from d1
d1 <- data.frame(iris,Treatment="A")
d2 <- data.frame(iris,Treatment="B") %>%
mutate(Sepal.Length=Sepal.Length+rnorm(nrow(iris),1,0.5))
# combine datasets
DF <- rbind(d1,d2) %>% rename(Variety = Species)
# plot like you did
# note I use "free" scales, if scales very different between Species
# your facet plots will be squished
g <- ggplot(DF,aes(x=Sepal.Width,y=Sepal.Length,col=Treatment))+
geom_point(shape=1,size=1)+
geom_smooth(method=lm)+
scale_color_brewer(palette = "Set1")+
facet_wrap(.~Variety,scales="free")
# rsq function
RSQ = function(y,x){signif(summary(lm(y ~ x))$adj.r.squared, 3)}
#calculate rsq for variety + treatment
STATS <- DF %>%
group_by(Variety,Treatment) %>%
summarise(Rsq=RSQ(Sepal.Length,Sepal.Width)) %>%
# make a label
# one other option is to use stringr::str_wrap in geom_text
mutate(Label=paste("Treat",Treatment,", Rsq=",Rsq))
# set vertical position of rsq
VJUST = ifelse(STATS$Treatment=="A",1.5,3)
# finally the plot function
g + geom_text(data=STATS,aes(x=-Inf,y=+Inf,label=Label),
hjust = -0.1, vjust = VJUST,size=3)
In the final geom_text() call, the labels are offset vertically by giving each Treatment a different vjust (the VJUST vector), so the two labels in a facet do not overlap. You might need to adjust these values depending on your plot.

Plotting standard error bars

I have a long-format dataset with 3 variables. I'm plotting two of the variables and faceting by the third, using ggplot2. I'd also like to plot the standard error bars of the observations in each facet, but I have no idea how. Does anyone know?
Here's a picture of what I've got. I'd like to have the standard error bars on each facet. Thanks!
Edit: here's some example data and the plot.
data <- data.frame(rep(c("1","2","3","4","5","6","7","8","9","10",
"11","12","13","14","15","16","17","18","19","20",
"21","22","23","24","25","26","27","28","29","30",
"31","32"), 2),
rep(c("a","b","c","d","e","f","g","h","i","j","k","l"), 32),
rnorm(n = 384))
colnames(data) <- c("estado","sector","VA")
ggplot(data, aes(x = estado, y = VA, col = sector)) +
facet_grid(.~sector) +
geom_point()
If all you want is the mean & standard error bar associated with each "estado"-"sector" combination, you can leave ggplot to do all the work, by replacing the geom_point() line with stat_summary():
ggplot(data,
aes(x = estado, y = VA, col = sector)) +
facet_grid(. ~ sector) +
stat_summary(fun.data = mean_se)
See ?mean_se from the ggplot2 package for more details on the function. The default parameter option gives you the mean as well as the range for 1 standard error above & below the mean.
If you want to show the original points, just add back the geom_point() line. (Though I think the plot would be rather cluttered for the reader, in that case...)
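To see what mean_se hands to stat_summary, it can be called directly on a vector. A small sketch with made-up numbers (va, manual and auto are illustrative names), comparing it with the usual sd(x)/sqrt(n) standard error:

```r
library(ggplot2)

va <- c(1, 2, 3, 4, 5)   # made-up values, stand-in for one estado-sector group

# Mean plus/minus one standard error, computed by hand
se <- sd(va) / sqrt(length(va))
manual <- data.frame(y = mean(va), ymin = mean(va) - se, ymax = mean(va) + se)

# The same summary as returned by ggplot2's helper
auto <- mean_se(va)

manual
auto
```

Both data frames have the columns y, ymin and ymax, which are the aesthetics the default pointrange geom of stat_summary draws.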
Maybe you could try something like below?
set.seed(1)
library(dplyr)
library(ggplot2)
dat = data.frame(estado = factor(rep(1:32, 2)),
sector = rep(letters[1:12], 32),
VA = rnorm(384))
se = function(x) {
sd(x)/sqrt(length(x))
}
dat_sum = dat %>% group_by(estado, sector) %>%
summarise(mu = mean(VA), se = se(VA))
dat_plot = full_join(dat, dat_sum, by = c("estado", "sector"))
ggplot(dat_plot, aes(estado, y = VA, color = sector)) +
geom_jitter() +
geom_errorbar(aes(estado, y = mu, color = sector,
ymin = mu - se, ymax = mu + se)) +
facet_grid(.~sector)

Using modelr::add_predictions for glm

I am trying to calculate the logistic regression prediction for a set of data using the tidyverse and modelr packages. Clearly I am doing something wrong in the add_predictions as I am not receiving the "response" of the logistic function as I would if I were using the 'predict' function in stats. This should be simple, but I can't figure it out and multiple searches yielded little.
library(tidyverse)
library(modelr)
options(na.action = na.warn)
library(ISLR)
d <- as_tibble(ISLR::Default)
model <- glm(default ~ balance, data = d, family = binomial)
grid <- d %>% data_grid(balance) %>% add_predictions(model)
ggplot(d, aes(x=balance)) +
geom_point(aes(y = default)) +
geom_line(data = grid, aes(y = pred))
predict.glm's type parameter defaults to "link", which add_predictions does not change by default, nor provide you with any way to change to the almost-certainly desired "response". (A GitHub issue exists; add your nice reprex on it if you like.) That said, it's not hard to just use predict directly within the tidyverse via dplyr::mutate.
Also note that ggplot is coercing default (a factor) to numeric in order to plot the line, which is fine, except that "No" and "Yes" are replaced by 1 and 2, while the probabilities returned by predict will be between 0 and 1. Explicitly coercing to numeric and subtracting one fixes the plot, though an extra scale_y_continuous call is required to fix the labels.
library(tidyverse)
library(modelr)
d <- as_tibble(ISLR::Default)
model <- glm(default ~ balance, data = d, family = binomial)
grid <- d %>% data_grid(balance) %>%
mutate(pred = predict(model, newdata = ., type = 'response'))
ggplot(d, aes(x = balance)) +
geom_point(aes(y = as.numeric(default) - 1)) +
geom_line(data = grid, aes(y = pred)) +
scale_y_continuous('default', breaks = 0:1, labels = levels(d$default))
Also note that if all you want is a plot, geom_smooth can calculate predictions directly for you:
ggplot(d, aes(balance, as.numeric(default) - 1)) +
geom_point() +
geom_smooth(method = 'glm', method.args = list(family = 'binomial')) +
scale_y_continuous('default', breaks = 0:1, labels = levels(d$default))
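In more recent modelr releases, add_predictions() appears to accept a type argument that is passed on to predict(), so the response scale may be available directly. A sketch, assuming your installed modelr version has that argument, and using the built-in mtcars data instead of ISLR::Default to stay self-contained:

```r
library(modelr)

# Logistic regression on a built-in dataset (stand-in for ISLR::Default)
model <- glm(am ~ wt, data = mtcars, family = binomial)

# type = "response" asks for probabilities instead of the link (log-odds) scale;
# this assumes a modelr version whose add_predictions() has a `type` argument
grid <- data_grid(mtcars, wt)
grid <- add_predictions(grid, model, type = "response")

head(grid)
```

If the type argument is not available in your version, the mutate() plus predict() approach above does the same job.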

Density plots in lattice that are proportional to the total group

I am trying to create density plots in lattice where each estimate is proportional to the total group and not just to the subset.
This is possible using histograms, like so:
library(lattice)
set.seed(1)
sex <- factor(sample(c("Men", "Women"), 200, replace = TRUE, prob = c(.3, .7)))
age <- rnorm(200, 50, 10)
histogram(~age | sex, type = "count")
But how do I manage this using densityplot()? The type = "count" argument does not seem to work here.
Edit: This is the result I'm looking for:
library(ggplot2)
data <- data.frame(sex, age)
ggplot(data, aes(age)) +
geom_density(aes(y = after_stat(count))) +
facet_grid(~ sex)
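One possible approach in lattice, as a rough sketch rather than a tested recipe: compute the density yourself, multiply it by the group size (which is what ggplot2's count does), and supply matching prepanel and panel functions. The helper scaled_density is a hypothetical name, and the default density() bandwidth is assumed.

```r
library(lattice)

set.seed(1)
sex <- factor(sample(c("Men", "Women"), 200, replace = TRUE, prob = c(.3, .7)))
age <- rnorm(200, 50, 10)

# Hypothetical helper: kernel density scaled by the number of observations,
# so the area under each curve equals the group size
scaled_density <- function(x) {
  d <- density(x)
  d$y <- d$y * length(x)
  d
}

p <- densityplot(~ age | sex,
  # the prepanel function sets axis limits that fit the scaled curves
  prepanel = function(x, ...) {
    d <- scaled_density(x)
    list(xlim = range(d$x), ylim = c(0, max(d$y)))
  },
  # the panel function draws the scaled curve instead of the default density
  panel = function(x, ...) {
    d <- scaled_density(x)
    panel.lines(d$x, d$y)
  },
  ylab = "Density x count"
)
print(p)
```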
