Comparing mean values by group between several variables - r

I am trying to reproduce a graph from Stata in R. I have several variables and want to display the mean of each one in each of two treatment groups. The Stata graph is as follows:
This coefficient plot is not actually a plot of coefficients, but of the mean value of each variable by treatment. The df looks something like this:
workable data

It is difficult to answer your question without reproducible data.
However, the following might give you what you want using only the means:
library(dplyr)
library(ggplot2) # provides ggplot() and the mpg data set
mpg %>%
select(manufacturer, cty, trans) %>%
group_by(manufacturer, trans) %>%
summarize(cty_mean = mean(cty)) %>%
ggplot(aes(x=cty_mean, y=reorder(manufacturer, cty_mean), color=trans)) +
geom_point()
If you also wish to include coefficients or standard errors, you can compute them with additional expressions inside summarize().
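As a minimal sketch of that suggestion, assuming the standard error of the mean is what is wanted (sd divided by the square root of the group size), summarize() can return both statistics at once. The names mpg_summary, cty_se and p_se are illustrative only.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data set

# Mean and standard error of city mileage per manufacturer and transmission.
# Groups with a single row get an NA standard error, because sd() of one value is NA.
mpg_summary <- mpg %>%
  group_by(manufacturer, trans) %>%
  summarize(cty_mean = mean(cty),
            cty_se   = sd(cty) / sqrt(n()),
            .groups  = "drop")

# Points with +/- 1 SE ranges; rows with an NA range are dropped with a warning.
p_se <- ggplot(mpg_summary,
               aes(x = reorder(manufacturer, cty_mean), y = cty_mean, color = trans)) +
  geom_pointrange(aes(ymin = cty_mean - cty_se, ymax = cty_mean + cty_se),
                  position = position_dodge(width = 0.5)) +
  coord_flip()
```

Printing p_se draws the plot; the summary table itself can be inspected first to check the numbers.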

I figured out geom_pointrange() is probably what you are looking for:
library("ggplot2")
set.seed(111018)
interval1 <- -qnorm((1-0.9)/2)
means_treatment_1 <- rnorm(2)
se_treatment_1 <- abs(rnorm(2)) # abs() keeps the simulated standard errors non-negative
df_treatment_1 <- data.frame("Mean" = means_treatment_1,
"lower" = means_treatment_1 - se_treatment_1*interval1,
"upper" = means_treatment_1 + se_treatment_1*interval1,
"Variable" = c("medicare_spending_dummy",
"job_training_dummy"),
"Treatment" = "a")
means_treatment_2 <- rnorm(2)
se_treatment_2 <- abs(rnorm(2)) # abs() keeps the simulated standard errors non-negative
df_treatment_2 <- data.frame("Mean" = means_treatment_2,
"lower" = means_treatment_2 - se_treatment_2*interval1,
"upper" = means_treatment_2 + se_treatment_2*interval1,
"Variable" = c("medicare_spending_dummy",
"job_training_dummy"),
"Treatment" = "b")
df_tot<-rbind(df_treatment_1, df_treatment_2)
# Plot
ggplot(df_tot, aes(colour = Treatment)) +
geom_hline(yintercept = 0, colour = gray(1/2), lty = 2) +
geom_pointrange(aes(x = Variable, y = Mean, ymin = lower, ymax = upper ),lwd = 1, position = position_dodge(width = 1/2)) +
coord_flip() +
theme_bw()
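The example above simulates the means and standard errors directly. Starting from raw data, a sketch along the following lines computes the mean and a 90% normal-approximation interval per variable and treatment before plotting. It assumes one row per observation, a two-level Treatment column and 0/1 outcome columns; the data frame raw and its simulated values are hypothetical.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(111018)
# Hypothetical raw data: one row per unit, two treatment arms, two dummy outcomes.
raw <- data.frame(
  Treatment = rep(c("a", "b"), each = 100),
  medicare_spending_dummy = rbinom(200, 1, 0.4),
  job_training_dummy = rbinom(200, 1, 0.6)
)

z <- qnorm(0.95)  # multiplier for a two-sided 90% interval

means_by_group <- raw %>%
  pivot_longer(-Treatment, names_to = "Variable", values_to = "value") %>%
  group_by(Variable, Treatment) %>%
  summarise(Mean = mean(value),
            se = sd(value) / sqrt(n()),
            .groups = "drop") %>%
  mutate(lower = Mean - z * se,
         upper = Mean + z * se)

p_means <- ggplot(means_by_group, aes(x = Variable, y = Mean, colour = Treatment)) +
  geom_pointrange(aes(ymin = lower, ymax = upper),
                  position = position_dodge(width = 1/2)) +
  coord_flip() +
  theme_bw()
```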

Related

Referring to the input data of ggplot and use that in a custom function within a geom

I'm using ggplot geom_vline in combination with a custom function to plot certain values on top of a histogram.
The example function below e.g. returns a vector of three values (the mean and x sds below or above the mean). I can now plot these values in geom_vline(xintercept) and see them in my graph.
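#example function
library(dplyr)   # tibble(), pull(), bind_rows(), %>%
library(tidyr)   # pivot_longer()
library(ggplot2)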
sds_around_the_mean <- function(x, multiplier = 1) {
mean <- mean(x, na.rm = TRUE)
sd <- sd(x, na.rm = TRUE)
tibble(low = mean - multiplier * sd,
mean = mean,
high = mean + multiplier * sd) %>%
pivot_longer(cols = everything()) %>%
pull(value)
}
Reproducible data
#data
set.seed(123)
normal <- tibble(data = rnorm(1000, mean = 100, sd = 5))
outliers <- tibble(data = runif(5, min = 150, max = 200))
df <- bind_rows(lst(normal, outliers), .id = "type")
df %>%
ggplot(aes(x = data)) +
geom_histogram(bins = 100) +
geom_vline(xintercept = sds_around_the_mean(df$data, multiplier = 3),
linetype = "dashed", color = "red") +
geom_vline(xintercept = sds_around_the_mean(df$data, multiplier = 2),
linetype = "dashed")
The problem is that, as you can see, I have to reference df$data in several places.
This becomes more error-prone when I apply any change to the original df that I pipe into ggplot, e.g. filtering out outliers before plotting. I would have to apply the same changes again at multiple places.
E.g.
df %>% filter(type == "normal")
#also requires
df$data
#to be changed to
df$data[df$type == "normal"]
#in geom_vline to obtain the correct input values for the xintercept.
So instead, how could I replace the df$data argument with the respective column of whatever has been piped into ggplot() in the first place? Something similar to the "." operator, I assume. I've also tried stat_summary with geom = "vline" to achieve this, but without the desired effect.
You can enclose the ggplot part in curly brackets and reference the incoming dataset with the . symbol both in the ggplot command and when calculating the sds_around_the_mean. This will make it dynamic.
df %>%
{ggplot(data = ., aes(x = data)) +
geom_histogram(bins = 100) +
geom_vline(xintercept = sds_around_the_mean(.$data, multiplier = 3),
linetype = "dashed", color = "red") +
geom_vline(xintercept = sds_around_the_mean(.$data, multiplier = 2),
linetype = "dashed")}
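Another option, sketched here under the assumption that a reasonably recent ggplot2 is installed, relies on the fact that a layer's data argument may be a function: ggplot2 calls it with the plot data and uses the returned data frame for that layer. The helper sd_lines() and its column x are hypothetical, and sds_around_the_mean() is simplified to return a plain vector.

```r
library(dplyr)
library(ggplot2)

sds_around_the_mean <- function(x, multiplier = 1) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  c(m - multiplier * s, m, m + multiplier * s)
}

# Hypothetical helper: returns a function suitable for a layer's `data` argument.
# The returned function receives whatever data frame was passed to ggplot().
sd_lines <- function(multiplier) {
  function(d) data.frame(x = sds_around_the_mean(d$data, multiplier))
}

set.seed(123)
df <- bind_rows(
  normal   = tibble(data = rnorm(1000, mean = 100, sd = 5)),
  outliers = tibble(data = runif(5, min = 150, max = 200)),
  .id = "type"
)

# The filter is written once; both vline layers see the filtered data.
p_lines <- df %>%
  filter(type == "normal") %>%
  ggplot(aes(x = data)) +
  geom_histogram(bins = 100) +
  geom_vline(data = sd_lines(3), aes(xintercept = x),
             linetype = "dashed", color = "red") +
  geom_vline(data = sd_lines(2), aes(xintercept = x),
             linetype = "dashed")
```

Compared with the curly-bracket approach, this keeps the usual pipe-into-ggplot() shape, at the cost of one small wrapper function.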

Elegant ggplot to report summary data and trend at each time point in an RCT

I am analysing an RCT and I wish to report summary statistics (mean with 95%CI) for a number of variables at three time points stratified by treatment allocation. Below is my code so far which only yields this figure.
set.seed(42)
n <- 100
dat1 <- data.frame(id=1:n,
treat = factor(sample(c('Trt','Ctrl'), n, rep=TRUE, prob=c(.5, .5))),
time = factor("T1"),
outcome1=rbinom(n = 100, size = 1, prob = 0.3),
st=runif(n, min=24, max=60),
qt=runif(n, min=.24, max=.60),
zt=runif(n, min=124, max=360)
)
dat2 <- data.frame(id=1:n,
treat = dat1$treat,
time = factor("T2"),
outcome1=dat1$outcome1,
st=runif(n, min=34, max=80),
qt=runif(n, min=.44, max=.90),
zt=runif(n, min=214, max=460)
)
dat3 <- data.frame(id=1:n,
treat = dat1$treat,
time = factor("T3"),
outcome1=dat1$outcome1,
st=runif(n, min=44, max=90),
qt=runif(n, min=.74, max=1.60),
zt=runif(n, min=324, max=1760)
)
dat <- rbind(dat1,dat2, dat3)
ggplot(dat,aes(x=mean(zt), y=time)) + geom_point(aes(colour=treat)) + coord_flip() + geom_line(aes(colour=treat))
I have three questions:
Can a line be added connecting T1 to T2 to T3 to show the trend?
Can the 95% CI for the mean be added to each point without having to calculate a "ymin" and "ymax" for all my response variables?
If I have multiple response variables (in this example "st", "qt" and "zt"), is there a way to produce these all at once as some sort of facet?
pivot_longer() should do most of what you need. Pivot st, qt and zt (and whatever other response variables you have) into long format; here the names go into a column called response_variables and the values into value. You can then facet_wrap() by response_variables. stat_summary() adds the line, the mean and the confidence interval (mean_cl_normal, which requires Hmisc), with group and colour mapped to treat. I opted for scales = "free" in facet_wrap(); otherwise you won't see much, because zt dominates with its larger range.
library(dplyr)
library(ggplot2)
library(Hmisc)
library(tidyr)
dat %>%
pivot_longer(-(1:4), names_to = "response_variables") %>%
ggplot(.,aes(x=value, y=time, group = treat, color = treat)) +
facet_wrap(~response_variables, scales = "free") +
coord_flip() +
stat_summary(fun.data = mean_cl_normal,
geom = "errorbar") +
stat_summary(fun = mean,
geom = "line") +
stat_summary(fun = mean,
geom = "point")
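For reference, mean_cl_normal returns the mean with a t-based confidence interval. A minimal sketch of the same calculation done by hand, which may be useful if the numbers are also needed in a table, is shown below. The data frame toy is a hypothetical, smaller stand-in with the same column layout as dat, and ci_table, n_obs and mean_value are illustrative names.

```r
library(dplyr)
library(tidyr)

set.seed(42)
n <- 100
# Hypothetical stand-in for `dat`: two time points, one treatment factor, two responses.
toy <- data.frame(
  id = rep(1:n, 2),
  treat = rep(sample(c("Trt", "Ctrl"), n, replace = TRUE), 2),
  time = rep(c("T1", "T2"), each = n),
  st = c(runif(n, 24, 60), runif(n, 34, 80)),
  qt = c(runif(n, 0.24, 0.60), runif(n, 0.44, 0.90))
)

ci_table <- toy %>%
  pivot_longer(c(st, qt), names_to = "response_variables") %>%
  group_by(response_variables, time, treat) %>%
  summarise(n_obs = n(),
            mean_value = mean(value),
            se = sd(value) / sqrt(n_obs),
            .groups = "drop") %>%
  # t-based 95% interval: mean +/- t quantile (n - 1 df) times the standard error
  mutate(lower = mean_value - stats::qt(0.975, df = n_obs - 1) * se,
         upper = mean_value + stats::qt(0.975, df = n_obs - 1) * se)
```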

Adding trend lines across groups and setting tick labels in a grouped violin plot or box plot

I have xy grouped data that I'm plotting using R's ggplot2 geom_violin adding regression trend lines:
Here are the data:
library(dplyr)
library(plotly)
library(ggplot2)
set.seed(1)
df <- data.frame(value = c(rnorm(500,8,1),rnorm(600,6,1.5),rnorm(400,4,0.5),rnorm(500,2,2),rnorm(400,4,1),rnorm(600,7,0.5),rnorm(500,3,1),rnorm(500,3,1),rnorm(500,3,1)),
age = c(rep("d3",500),rep("d8",600),rep("d24",400),rep("d3",500),rep("d8",400),rep("d24",600),rep("d3",500),rep("d8",500),rep("d24",500)),
group = c(rep("A",1500),rep("B",1500),rep("C",1500))) %>%
dplyr::mutate(time = as.integer(gsub("d", "", age))) %>% # strip the "d" prefix; as.integer("d3") would be NA
dplyr::arrange(group,time) %>%
dplyr::mutate(group_age=paste0(group,"_",age))
df$group_age <- factor(df$group_age,levels=unique(df$group_age))
And my current plot:
ggplot(df,aes(x=group_age,y=value,fill=age,color=age,alpha=0.5)) +
geom_violin() + geom_boxplot(width=0.1,aes(fill=age,color=age,middle=mean(value))) +
geom_smooth(data=df,mapping=aes(x=group_age,y=value,group=group),color="black",method='lm',size=1,se=T) + theme_minimal()
My questions are:
How do I get rid of the alpha part of the legend?
I would like the x-axis ticks to be df$group rather than df$group_age, i.e. one tick per group, centred on that group and labelled with the group name. Consider a situation where not all groups have all ages: if a certain group has only two of the ages, I'm fairly sure ggplot will present only those two, and I'd like the tick to still be centred between them.
One more question:
It would also be nice to have the p-values of each fitted slope plotted on top of each group.
I tried:
library(ggpmisc)
my.formula <- value ~ group_age
ggplot(df,aes(x=group_age,y=value,fill=age,color=age,alpha=0.5)) +
geom_violin() + geom_boxplot(width=0.1,aes(fill=age,color=age,middle=mean(value))) +
geom_smooth(data=df,mapping=aes(x=group_age,y=value,group=group),color="black",method='lm',size=1,se=T) + theme_minimal() +
stat_poly_eq(formula = my.formula,aes(label=stat(p.value.label)),parse=T)
But I get the same plot as above with the following warning message:
Warning message:
Computation failed in `stat_poly_eq()`:
argument "x" is missing, with no default
geom_smooth() fits a line, while stat_poly_eq() issues an error. A factor is a categorical variable with unordered levels, so a trend against a factor is undefined. geom_smooth() may be taking the levels and converting them to "arbitrary" numerical values, but these values are just indexes rather than meaningful values.
To obtain a plot similar to the one described in the question, but with code that gives correct linear regression lines and the corresponding p-values, I would use the code below. The main change is that the numerical variable time is mapped to x, which makes fitting a regression a valid operation. To allow for a linear fit, an x-scale with a log10 transformation is used, with breaks and labels at the ages for which data are available.
library(dplyr)
library(ggplot2)
library(ggpmisc)
set.seed(1)
df <-
data.frame(
value = c(
rnorm(500, 8, 1), rnorm(600, 6, 1.5), rnorm(400, 4, 0.5),
rnorm(500, 2, 2), rnorm(400, 4, 1), rnorm(600, 7, 0.5),
rnorm(500, 3, 1), rnorm(500, 3, 1), rnorm(500, 3, 1)
),
age = c(
rep("d3", 500), rep("d8", 600), rep("d24", 400),
rep("d3", 500), rep("d8", 400), rep("d24", 600),
rep("d3", 500), rep("d8", 500), rep("d24", 500)
),
group = c(rep("A", 1500), rep("B", 1500), rep("C", 1500))
) %>%
mutate(time = as.integer(gsub("d", "", age))) %>%
arrange(group, time) %>%
mutate(age = factor(age, levels = c("d3", "d8", "d24")),
group = factor(group))
my_formula = y ~ x
ggplot(df, aes(x = time, y = value)) +
geom_violin(aes(fill = age, color = age), alpha = 0.3) +
geom_boxplot(width = 0.1,
aes(color = age), fill = NA) +
geom_smooth(color = "black", formula = my_formula, method = 'lm') +
stat_poly_eq(aes(label = after_stat(p.value.label)),
formula = my_formula, parse = TRUE,
npcx = "center", npcy = "bottom") +
scale_x_log10(name = "Age", breaks = c(3, 8, 24)) +
facet_wrap(~group) +
theme_minimal()
Which creates the following figure:
Here is a solution. The alpha legend issue is easy. Anything you place inside aes() gets a legend. That is what you want when a feature of the data should be used as an aesthetic. Putting alpha outside of aes() removes it from the legend.
I'm not sure the x-axis labelling is what you wanted, but I did it manually, so it should be easy to configure.
Regarding the p-values, I ran separate linear regressions and stored each p-value in its own variable, which can then be added to the ggplot with annotate(). For two of the groups the p-value was < .001, so round() rounds it to 0. Therefore, I just wrote "pvalue: <.001" for those.
Good luck with this!
library(dplyr)
library(ggplot2)
set.seed(1)
df <- data.frame(value = c(rnorm(500,8,1),rnorm(600,6,1.5),rnorm(400,4,0.5),rnorm(500,2,2),rnorm(400,4,1),rnorm(600,7,0.5),rnorm(500,3,1),rnorm(500,3,1),rnorm(500,3,1)),
age = c(rep("d3",500),rep("d8",600),rep("d24",400),rep("d3",500),rep("d8",400),rep("d24",600),rep("d3",500),rep("d8",500),rep("d24",500)),
group = c(rep("A",1500),rep("B",1500),rep("C",1500))) %>%
dplyr::mutate(time = as.integer(gsub("d", "", age))) %>% # numeric age; as.integer("d3") would be NA and lm() below would fail
dplyr::arrange(group,time) %>%
dplyr::mutate(group_age=paste0(group,"_",age))
df$group_age <- factor(df$group_age,levels=unique(df$group_age))
mod1 <- lm(value ~ time,df[df$group == 'A',])
mod1 <- summary(mod1)$coefficients[8] %>% round(2)
mod2 <- lm(value ~ time,df[df$group == 'B',])
mod2 <- summary(mod2)$coefficients[8] %>% round(2)
mod3 <- lm(value ~ time,df[df$group == 'C',])
mod3 <- summary(mod3)$coefficients[8] %>% round(2)
ggplot(df,aes(x=group_age,y=value,fill=age,color=age)) +
geom_violin(alpha=0.5) +
geom_boxplot(width=0.1,aes(fill=age,color=age,middle=mean(value))) +
geom_smooth(mapping=aes(x=group_age,y=value,group=group),color="black",method='lm',size=1,se=T) +
scale_x_discrete(labels = c('','A','','','B','','','C','')) +
annotate('text',x = 2,y = -1,label = paste('pvalue: <.001')) +
annotate('text',x = 6,y = 10,label = paste('pvalue: <.001')) +
annotate('text',x = 8,y = -1.2,label = paste('pvalue:',mod3))+
theme_minimal()
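The three near-identical lm() calls and the hard-coded "<.001" labels can be condensed. The following is a sketch only, assuming a regression of value on numeric time within each group is what is wanted; slope_p() is a hypothetical helper and toy is a smaller simulated stand-in for df.

```r
set.seed(1)
# Hypothetical stand-in: group A has a clear downward trend, group C is flat.
toy <- data.frame(
  value = c(rnorm(50, 8), rnorm(50, 6), rnorm(50, 4),
            rnorm(50, 3), rnorm(50, 3), rnorm(50, 3)),
  time  = rep(rep(c(3, 8, 24), each = 50), 2),
  group = rep(c("A", "C"), each = 150)
)

# Hypothetical helper: p-value of the slope from a simple linear regression.
slope_p <- function(d) {
  summary(lm(value ~ time, data = d))$coefficients["time", "Pr(>|t|)"]
}

# One p-value per group, named by group.
p_values <- sapply(split(toy, toy$group), slope_p)

# format.pval() reports values below eps as "<eps" instead of rounding them to 0.
p_labels <- paste("p-value:", format.pval(p_values, digits = 2, eps = 0.001))
```

The resulting p_labels vector can be passed to annotate("text", ...) in place of the hand-written strings.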

show 2 standard deviations on a ggplot2 control chart (in addition to the normal 3)

First I create the data:
library(ggplot2)
library(ggQC)
set.seed(5555)
Golden_Egg_df <- data.frame(month=1:12, egg_diameter = rnorm(n = 12, mean = 1.5, sd = 0.2))
Then I setup the base ggplot.
XmR_Plot <- ggplot(Golden_Egg_df, aes(x = month, y = egg_diameter)) +
geom_point() + geom_line()
I can create a simple control chart with the ggQC package, in the following manner.
XmR_Plot + stat_QC(method = "XmR")
I can facet the control chart to show different levels of standard deviation (in this example, between 1-3).
XmR_Plot + stat_qc_violations(method = "XmR")
What I want is to be able to see both 2 and 3 standard deviations on the same chart, not faceted. My imagined syntax would be
XmR_Plot + stat_QC(method = "XmR", stand.dev = c(2, 3))
or something like that, but it obviously does not work. How do I get multiple standard deviations to show on one chart? It'd look something like this:
I highly recommend calculating your summary statistics yourself. You'll get a lot more control over the plot!
library(ggplot2)
library(dplyr)
library(tidyr)
set.seed(5555)
golden.egg.df = data.frame(month=1:12,
egg_diameter = rnorm(n = 12,
mean = 1.5,
sd = 0.2)
)
lines.df = golden.egg.df %>%
# Calculate all the summary stats
mutate(mean = mean(egg_diameter),
sd = sd(egg_diameter),
plus_one = mean + sd,
plus_two = mean + 2 * sd,
plus_three = mean + 3 * sd,
minus_one = mean - sd,
minus_two = mean - 2 * sd,
minus_three = mean - 3 * sd
) %>%
# Remove what we don't want to plot
select(-month, -egg_diameter, -sd) %>%
# Filter so the dataframe is now one unique row
unique() %>%
# Make the table tall for plotting
gather(key = stat,
value = value) %>%
# Add a new column which indicates how many SDs a line is from
# the mean
mutate(linetype = gsub("[\\s\\S]+?_", "", stat, perl = TRUE))
ggplot(golden.egg.df,
aes(x = month, y = egg_diameter)) +
geom_hline(data = lines.df,
aes(yintercept = value, linetype = linetype)) +
geom_point() +
geom_line()
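Note that the answer above draws lines at multiples of the overall sample standard deviation, whereas an XmR (individuals) chart conventionally estimates sigma from the average moving range divided by the constant d2 = 1.128. A sketch of that variant follows; whether it reproduces ggQC's limits exactly is an assumption I have not verified, and the names golden_egg, sigma_hat and xmr_lines are illustrative.

```r
library(ggplot2)

set.seed(5555)
golden_egg <- data.frame(month = 1:12,
                         egg_diameter = rnorm(n = 12, mean = 1.5, sd = 0.2))

centre <- mean(golden_egg$egg_diameter)
# Average absolute difference between consecutive observations (moving range).
mr_bar <- mean(abs(diff(golden_egg$egg_diameter)))
# d2 = 1.128 is the control-chart constant for moving ranges of two points.
sigma_hat <- mr_bar / 1.128

# Centre line plus the 2- and 3-sigma limits on both sides.
k <- c(-3, -2, 0, 2, 3)
xmr_lines <- data.frame(
  sds   = factor(abs(k)),
  value = centre + k * sigma_hat
)

p_xmr <- ggplot(golden_egg, aes(x = month, y = egg_diameter)) +
  geom_hline(data = xmr_lines, aes(yintercept = value, linetype = sds)) +
  geom_point() +
  geom_line()
```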

Bar plot of group means with lines of individual results overlaid

This is my first Stack Overflow post and I am a relatively new R user, so please go gently!
I have a data frame with three columns, a participant identifier, a condition (factor with 2 levels either Placebo or Experimental), and an outcome score.
set.seed(1)
dat <- data.frame(Condition = c(rep("Placebo",10),rep("Experimental",10)),
Outcome = rnorm(20,15,2),
ID = factor(rep(1:10,2)))
I would like to construct a bar plot with two bars showing the mean outcome score for each condition, with the standard deviation as an error bar. I would then like to overlay lines connecting the points for each participant's score in each condition, so that the plot displays the individual responses as well as the group means. If possible, I would also like to include an axis break.
I don't seem to be able to find any advice in other threads, apologies if I am repeating a question.
Many Thanks.
p.s. I realise that presenting data in this way will not be to everyone's taste. It is for a specific requirement!
This ought to work:
library(ggplot2)
library(dplyr)
dat.summ <- dat %>% group_by(Condition) %>%
summarize(mean.outcome = mean(Outcome),
sd.outcome = sd(Outcome))
ggplot(dat.summ, aes(x = Condition, y = mean.outcome)) +
geom_bar(stat = "identity") +
geom_errorbar(aes(ymin = mean.outcome - sd.outcome,
ymax = mean.outcome + sd.outcome),
color = "dodgerblue", width = 0.3) +
geom_point(data = dat, aes(x = Condition, y = Outcome),
color = "firebrick", size = 1.2) +
geom_line(data = dat, aes(x = Condition, y = Outcome, group = ID),
color = "firebrick", size = 1.2, alpha = 0.5) +
scale_y_continuous(limits = c(0, max(dat$Outcome)))
Some people are better with ggplot's stat functions and arguments than I am and might do it differently. I prefer to just transform my data first.
set.seed(1)
dat <- data.frame(Condition = c(rep("Placebo",10),rep("Experimental",10)),
Outcome = rnorm(20,15,2),
ID = factor(rep(1:10,2)))
dat.w <- reshape(dat, direction = 'wide', idvar = 'ID', timevar = 'Condition')
means <- colMeans(dat.w[, 2:3])
sds <- apply(dat.w[, 2:3], 2, sd)
ci.l <- means - sds
ci.u <- means + sds
ci.width <- .25
bp <- barplot(means, ylim = c(0,20))
segments(bp, ci.l, bp, ci.u)
segments(bp - ci.width, ci.u, bp + ci.width, ci.u)
segments(bp - ci.width, ci.l, bp + ci.width, ci.l)
segments(x0 = bp[1], x1 = bp[2], y0 = dat.w[, 2], y1 = dat.w[, 3], col = 1:10)
points(c(rep(bp[1], 10), rep(bp[2], 10)), dat$Outcome, col = 1:10, pch = 19)
Here is a method using the transformations inside ggplot2 (fun replaces the fun.y argument, which is deprecated as of ggplot2 3.3.0):
ggplot(dat) +
stat_summary(aes(x=Condition, y=Outcome, group=Condition), fun="mean", geom="bar") +
stat_summary(aes(x=Condition, y=Outcome, group=Condition), fun.data="mean_se", geom="errorbar", col="green", width=.8, size=2) +
geom_line(aes(x=Condition, y=Outcome, group=ID), col="red")
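The question asks for standard-deviation error bars, while mean_se draws standard errors. A sketch with a small custom summary function passed to fun.data, which must return a data frame with columns y, ymin and ymax, is shown below. The name mean_sd is hypothetical, and the fun argument assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)

set.seed(1)
dat <- data.frame(Condition = rep(c("Placebo", "Experimental"), each = 10),
                  Outcome = rnorm(20, 15, 2),
                  ID = factor(rep(1:10, 2)))

# Hypothetical summary function: mean with +/- 1 standard deviation.
mean_sd <- function(v) {
  data.frame(y = mean(v), ymin = mean(v) - sd(v), ymax = mean(v) + sd(v))
}

p_sd <- ggplot(dat, aes(x = Condition, y = Outcome)) +
  stat_summary(fun = mean, geom = "bar", fill = "grey70") +
  stat_summary(fun.data = mean_sd, geom = "errorbar", width = 0.3) +
  geom_line(aes(group = ID), colour = "red") +   # one line per participant
  geom_point(colour = "red")
```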
