How to plot additional statistics in boxplot for each group? - r

I would like to see boxplots of combination of factors and I was told to use lattice for that. I tried it and it looks like this:
Now I would also like to add an ANOVA statistic to each group. Ideally the p-value would be displayed in each panel (in the white area below the strip label, e.g. "Australia"). How can I do this in lattice? Note that I don't insist on lattice at all...
Example code:
library(lattice)
set.seed(123)
n <- 300
country <- sample(c("Europe", "Africa", "Asia", "Australia"), n, replace = TRUE)
type <- sample(c("city", "river", "village"), n, replace = TRUE)
month <- sample(c("may", "june", "july"), n, replace = TRUE)
x <- rnorm(n)
df <- data.frame(x, country, type, month)
bwplot(x ~ type|country+month, data = df, panel=function(...) {
panel.abline(h=0, col="green")
panel.bwplot(...)
})
The code to perform ANOVA for one of the groups and to extract p-value is this:
model <- aov(x ~ type, data = df[df$country == 'Africa' & df$month == 'may',])
p_value <- summary(model)[[1]][["Pr(>F)"]][1] # first entry is the `type` term; the second (Residuals) is NA

Here's one way using ggplot2. First we compute the p-values separately for every month/country combination (I use data.table, but you can use whichever way you're comfortable with). Then we add geom_text with pvalue as the label and specify the x and y coordinates where the text should appear within each facet.
library(data.table)
library(ggplot2)
dt <- data.table(df)
# one ANOVA p-value per country/month combination
pval <- dt[, list(pvalue = paste0("pval = ", sprintf("%.3f",
    summary(aov(x ~ type))[[1]][["Pr(>F)"]][1]))),
    by = list(country, month)]
ggplot(data = df, aes(x = type, y = x)) + geom_boxplot() +
  geom_text(data = pval, aes(label = pvalue, x = "river", y = 2.5)) +
  facet_grid(country ~ month) + theme_bw() +
  theme(panel.spacing = grid::unit(0, "lines"), # thanks to @DieterMenne (panel.margin in ggplot2 < 2.2.0)
        strip.background = element_rect(fill = NA),
        panel.grid.major = element_line(colour = NA),
        panel.grid.minor = element_line(colour = NA))
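For readers who would rather not depend on data.table, the per-panel p-values can also be computed in base R. The following is only a sketch: it regenerates data shaped like the question's example, and the names anova_p, groups and pval_df are made up for illustration.

```r
# Rebuild data shaped like the question's example
set.seed(123)
n <- 300
df <- data.frame(
  x = rnorm(n),
  country = sample(c("Europe", "Africa", "Asia", "Australia"), n, replace = TRUE),
  type = sample(c("city", "river", "village"), n, replace = TRUE),
  month = sample(c("may", "june", "july"), n, replace = TRUE)
)

# p-value of the `type` term in a one-way ANOVA on one subset
anova_p <- function(d) summary(aov(x ~ type, data = d))[[1]][["Pr(>F)"]][1]

# One data frame per country/month combination, named like "Africa.july"
groups <- split(df, list(df$country, df$month), drop = TRUE)
pvals <- sapply(groups, anova_p)

# Label table with the same columns as the facets
pval_df <- data.frame(
  country = sub("\\..*$", "", names(pvals)),
  month = sub("^.*\\.", "", names(pvals)),
  pvalue = sprintf("pval = %.3f", pvals)
)
```

A table like pval_df could then be passed as the data argument of geom_text, in the same way as pval above.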

Related

for loop run over factor levels to plot several plots not working

I am trying to run a for loop over the levels of a factor (treatment in this case) to plot a graph for each level using a function. My goal is to save several graphs in my working directory, one for each treatment level.
Problem: The outcome is always one single messed up barplot with all the variables and errorbars included.
dataset looks something like this:
set.seed(108)
test <- data.frame(
n = 1:12,
treatment = factor(paste("trt", 1:2)),
rep= factor(paste("rep", 1:2)),
type = sample(LETTERS, 3),
mean= sample(1:100, 12),
sd= sample(1:50, 12),
var3 = sample(1:100, 12),
var4 = sample(1:100, 12))
I believe that I'm missing something on my for loop code:
df$treatment<- as.factor(df$treatment)
treatment_levels<- unique(levels(df$treatment))
for(i in 1:length(treatment_levels)){
df <- df[treatment_levels[i],]
x <- df$type
avg <- df$mean
sd <- df$sd
grp<- df$rep
title<- treatment_levels[i]
xtitle<- "type"
ytitle<- " "
fig_name <- paste(title,"_bp")
bpfunction(df, x, avg, sd, grp, title, xtitle, ytitle, fig_name)
}
my function to plot a barplot is:
bpfunction <- function(df, x, avg, sd, grp, title, xtitle, ytitle, fig_name)
{
bp <- ggplot(df, aes(x = x, y = avg, fill = grp)) +
geom_bar(stat = 'identity', aes(fill = grp), size = 1) +
geom_errorbar(aes(ymin=avg-sd , ymax=avg +sd))+
labs(x = x, y = avg, title = title)
ggsave(paste(fig_name, "png", sep = "."), plot = bp)
}
There are several issues with your implementation:
Your row index is a factor level (a character string), not a logical vector, so it is looked up as a row name and matches nothing:
df <- df[treatment_levels[i],]
Therefore, adjust to proper filtering by column:
sub_df <- df[df$treatment == treatment_levels[i],]
Better yet, avoid the bookkeeping of unique treatment factor levels and use by (an object-oriented wrapper for tapply) or split + lapply.
You reassign df inside the for loop. After the first iteration, df can no longer be subsetted for the other treatment values. Therefore, use a different object name. (Actually, avoid df altogether in favour of a more descriptive name.)
Your labs call receives numeric vectors of many values as axis labels instead of single strings.
Avoid passing vectors extracted from data frame columns into aes. Instead, pass column names as strings and look them up with the .data[[ ]] pronoun (tidy evaluation in aes requires ggplot2 v3.0.0+). Note that the double curly brace {{ }} is meant for unquoted column names, so it does not work with strings:
bpfunction <- function(treatment_df, x, avg, sd, grp, title, xtitle, ytitle, fig_name)
{
  bp <- ggplot(treatment_df, aes(x = .data[[x]], y = .data[[avg]], fill = .data[[grp]])) +
    geom_bar(stat = 'identity', aes(fill = .data[[grp]]), size = 1) +
    # x, avg, sd and grp are column names given as strings
    geom_errorbar(aes(ymin = .data[[avg]] - .data[[sd]], ymax = .data[[avg]] + .data[[sd]])) +
    labs(x = x, y = avg, title = title)
  ggsave(fig_name, plot = bp)
  return(bp)
}
# REFACTOR USING by
treatment_plots <- by(df, df$treatment, function(sub_df)
bpfunction(
sub_df,
x = "type",
avg = "mean",
sd = "sd",
grp = "rep",
title = as.character(sub_df$treatment[1]),
xtitle = "type",
ytitle = " ",
fig_name = paste0(sub_df$treatment[1], "_bp.png")
)
)
# REFACTOR USING split + lapply
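treatment_plots <- split(df, df$treatment) |> lapply(
  function(sub_df) bpfunction(
    sub_df, x = "type", avg = "mean", sd = "sd", grp = "rep",
    title = as.character(sub_df$treatment[1]), xtitle = "type", ytitle = " ",
    fig_name = paste0(sub_df$treatment[1], "_bp.png")
  )
)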
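The indexing problem described in the first point is easy to reproduce without ggplot2. The sketch below uses a made-up data frame (toy) rather than the asker's data; it contrasts indexing rows by a level name with filtering by a logical condition, and shows that split avoids the level bookkeeping altogether.

```r
# Hypothetical toy data with two treatment levels
toy <- data.frame(
  treatment = factor(rep(c("trt 1", "trt 2"), 6)),
  mean = 1:12
)
lev <- levels(toy$treatment)

# A character row index is matched against row names ("1", "2", ...),
# so nothing matches and a single all-NA row comes back
wrong <- toy[lev[1], ]

# A logical condition on the column keeps the rows that belong to the level
right <- toy[toy$treatment == lev[1], ]

# split() returns one data frame per level, named after the levels
pieces <- split(toy, toy$treatment)
```

Here wrong has one row of NA values, right has the six rows of the first treatment, and pieces can be handed directly to lapply.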

Elegant ggplot to report summary data and trend at each time point in an RCT

I am analysing an RCT and I wish to report summary statistics (mean with 95%CI) for a number of variables at three time points stratified by treatment allocation. Below is my code so far which only yields this figure.
set.seed(42)
n <- 100
dat1 <- data.frame(id=1:n,
treat = factor(sample(c('Trt','Ctrl'), n, rep=TRUE, prob=c(.5, .5))),
time = factor("T1"),
outcome1=rbinom(n = 100, size = 1, prob = 0.3),
st=runif(n, min=24, max=60),
qt=runif(n, min=.24, max=.60),
zt=runif(n, min=124, max=360)
)
dat2 <- data.frame(id=1:n,
treat = dat1$treat,
time = factor("T2"),
outcome1=dat1$outcome1,
st=runif(n, min=34, max=80),
qt=runif(n, min=.44, max=.90),
zt=runif(n, min=214, max=460)
)
dat3 <- data.frame(id=1:n,
treat = dat1$treat,
time = factor("T3"),
outcome1=dat1$outcome1,
st=runif(n, min=44, max=90),
qt=runif(n, min=.74, max=1.60),
zt=runif(n, min=324, max=1760)
)
dat <- rbind(dat1,dat2, dat3)
ggplot(dat,aes(x=mean(zt), y=time)) + geom_point(aes(colour=treat)) + coord_flip() + geom_line(aes(colour=treat))
I have three questions:
1. Can a line be added connecting T1 to T2 to T3 to show the trend?
2. Can the 95% CI for the mean be added to each point without having to calculate a "ymin" and "ymax" for all my response variables?
3. If I have multiple response variables (in this example "st", "qt" and "zt"), is there a way to produce these all at once as some sort of facet?
pivot_longer should do most of what you need. Pivot your st, qt, and zt columns (and whatever other response variables you need) into long format. Here I've put their names in a column called "response_variables" and their values in a column called value. You can then facet_wrap by response_variables. After grouping and coloring by treat, stat_summary adds the line, the mean, and the 95% confidence interval (via mean_cl_normal). I opted for scales = "free" in facet_wrap, otherwise you won't see much going on because zt dominates with its larger range.
library(dplyr)
library(ggplot2)
library(Hmisc)
library(tidyr)
dat %>%
pivot_longer(-(1:4), names_to = "response_variables") %>%
ggplot(.,aes(x=value, y=time, group = treat, color = treat)) +
facet_wrap(~response_variables, scales = "free") +
coord_flip() +
stat_summary(fun.data = mean_cl_normal,
geom = "errorbar") +
stat_summary(fun = mean,
geom = "line") +
stat_summary(fun = mean,
geom = "point")
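To make explicit what the error bars show, here is a sketch of the interval computed by hand in base R. It assumes that mean_cl_normal returns the mean with a t-based confidence interval (mean plus or minus the t quantile times the standard error); the data frame toy and the helper ci_normal are hypothetical and are not part of the answer above.

```r
# Hypothetical toy data: one response, two arms
set.seed(42)
toy <- data.frame(
  treat = rep(c("Trt", "Ctrl"), each = 50),
  zt = runif(100, min = 124, max = 360)
)

# Mean with a t-based confidence interval, returned in ggplot2's y/ymin/ymax layout
ci_normal <- function(v, conf = 0.95) {
  n <- length(v)
  half <- qt((1 + conf) / 2, df = n - 1) * sd(v) / sqrt(n)
  c(y = mean(v), ymin = mean(v) - half, ymax = mean(v) + half)
}

# One row per treatment arm
ci_by_treat <- t(sapply(split(toy$zt, toy$treat), ci_normal))
```

The limits agree with the confidence interval reported by t.test for the same subset, which is a convenient sanity check.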

Adding the median value to every box (not ggplot)

I have a boxplot from the code below and I want to add median values.
boxplot(ndvi_pct_sep~edge_direction, data= data_sample, subset = edge_direction %in% c(64,4, 1,16),ylab="NDVI2028-2016", xlab="Forest edge direction",names=c("north", "south", "east", "west"))
I want to add the median values to the boxplots, any idea how to do it?
It will likely involve using legends. Since I don't have your data I can't make it perfect, but the code below should get you started using the ToothGrowth data that ships with R. I am showing a base R and a ggplot example (I know you said no ggplot, but others may use it).
# Load libraries
library(dplyr); library(ggplot2)
# get median data
mediandata <- ToothGrowth %>% group_by(dose) %>% summarise(median = median(len, na.rm = TRUE))
l <- unname(unlist(mediandata))
tg <- ToothGrowth # for convenience
tg$dose <- as.factor(tg$dose)
### Base R approach
boxplot(len ~ dose, data = tg,
main = "Guinea Pigs' Tooth Growth",
xlab = "Vitamin C dose mg",
ylab = "tooth length", col = "red")
for (i in 1:3){
legend(i-0.65,l[i+3]+5, legend = paste0("Median: ",l[i+3]), bty = "n")
}
### ggplot approach
ggplot(data = tg, aes(dose, len)) +
theme_classic() + theme(legend.position = "none") +
geom_boxplot()+
annotate("text",
x = c(1,2,3),
y = l[4:6]+1, # shift so you can read it
label = l[4:6])
Here's a straightforward solution using text and no for loop:
Toy data:
set.seed(12)
df <- data.frame(
var1 = sample(LETTERS[1:4], 100, replace = TRUE),
var2 = rnorm(100)
)
Calculate the medians:
library(dplyr)
med <- df %>%
group_by(var1) %>%
summarise(medians = median(var2)) %>%
pull(medians)
Alternatively, in base R:
bx <- boxplot(df$var2 ~ df$var1)
med <- bx$stats[3,1:4]
Boxplot:
boxplot(df$var2 ~ df$var1)
Annotate boxplots:
text(1:4, med, round(med,3), pos = 3, cex = 0.6)
You can do
b <- boxplot(count ~ spray, data = InsectSprays, col = "lightgray", boxwex=.2)
s <- b$stats
text(1:ncol(s)+.4, s[3,], round(s[3,],1), col="red")
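The base R answers above rely on the fact that the third row of the stats component returned by boxplot holds the medians. A small sketch on the built-in ToothGrowth data checks this against medians computed with tapply; the variable names med_from_stats and med_by_group are chosen here for illustration.

```r
# boxplot() draws the plot and invisibly returns its summary statistics
b <- boxplot(len ~ dose, data = ToothGrowth,
             xlab = "dose", ylab = "len")

# Row 3 of $stats is the median of each box
med_from_stats <- b$stats[3, ]

# The same medians computed directly per group
med_by_group <- tapply(ToothGrowth$len, ToothGrowth$dose, median)

# Write each median just above its median line
text(seq_along(med_from_stats), med_from_stats,
     labels = med_from_stats, pos = 3, cex = 0.8)
```

Either vector can be used for the labels; taking it from the boxplot object avoids a second grouping step.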

How to specify groups with colors in qqplot()?

I have created a qqplot (with quantiles of beta distribution) from a dataset including two groups. To visualize, which points belong to which group, I would like to color them. I have tried the following:
res <- beta.mle(data$values) #estimate parameters of beta distribution
qqplot(qbeta(ppoints(500),res$param[1], res$param[2]),data$values,
col = data$group,
ylab = "Quantiles of data",
xlab = "Quantiles of Beta Distribution")
the result is shown here:
I have seen solutions specifying a "col" vector for qqnorm; however, this does not seem to work with qqplot, as simply half the points are colored in either color, regardless of group. Is there a way to fix this?
I simulated some data just to show how to add color in ggplot.
Libraries
library(tidyverse)
# install.packages("Rfast")
Data
#Simulating data from beta distribution
x <- rbeta(n = 1000,shape1 = .5,shape2 = .5)
#Estimating parameters
res <- Rfast::beta.mle(x)
data <-
tibble(
simulated_data = sort(x),
quantile_data = qbeta(ppoints(length(x)),res$param[1], res$param[2])
) %>%
#Creating a group variable using quartiles
mutate(group = cut(x = simulated_data,
quantile(simulated_data,seq(0,1,.25)),
include.lowest = TRUE))
Code
data %>%
# Adding group variable as color
ggplot(aes( x = quantile_data, y = simulated_data, col = group))+
geom_point()
For those who are wondering, how to work with pre-defined groups, this is the code that worked for me:
library(tidyverse)
library(Rfast)
res <- beta.mle(x)
# make sure groups are not numerical
# (else the color scale might turn out continuous)
g <- plyr::mapvalues(g, c("1", "2"), c("Group1", "Group2"))
data <-
tibble(
my_data = sort(x),
quantile_data = qbeta(ppoints(length(x)),res$param[1], res$param[2]),
group = g[order(x)]
)
data %>%
# Adding group variable as color
ggplot(aes( x = quantile_data, y = my_data, col = group))+
geom_point()
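The same idea works in base graphics. qqplot sorts both samples internally, so a col vector given in the original row order no longer lines up with the plotted points; reordering the groups by the order of the data fixes that. The sketch below uses simulated data, and for simplicity it assumes the beta shape parameters are known (shape1 and shape2 are placeholders for the estimates that beta.mle would provide).

```r
# Simulated data with a pre-defined grouping (hypothetical example)
set.seed(1)
y <- rbeta(200, shape1 = 2, shape2 = 5)
g <- factor(sample(c("Group1", "Group2"), 200, replace = TRUE))

# Assumed parameters; in practice these would come from an estimate
shape1 <- 2
shape2 <- 5

# Sort the data and carry the group labels along in the same order
ord <- order(y)
theo <- qbeta(ppoints(length(y)), shape1, shape2)
palette_cols <- c("steelblue", "firebrick")

plot(theo, y[ord], col = palette_cols[g[ord]],
     xlab = "Quantiles of Beta Distribution",
     ylab = "Quantiles of data")
legend("topleft", legend = levels(g), col = palette_cols, pch = 1)
```

Because the group labels are indexed with the same ord as the data, each point keeps the color of its own group.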

R: Automatically Producing Histograms

I am using the R programming language. I created the following data set for this example:
var_1 <- rnorm(1000,10,10)
var_2 <- rnorm(1000, 5, 5)
var_3 <- rnorm(1000, 6,18)
favorite_food <- c("pizza","ice cream", "sushi", "carrots", "onions", "broccoli", "spinach", "artichoke", "lima beans", "asparagus", "eggplant", "lettuce", "cucumbers")
favorite_food <- sample(favorite_food, 1000, replace=TRUE, prob=c(0.5, 0.45, 0.04, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001))
response <- c("a","b")
response <- sample(response, 1000, replace=TRUE, prob=c(0.3, 0.7))
data = data.frame( var_1, var_2, var_3, favorite_food, response)
data$favorite_food = as.factor(data$favorite_food)
data$response = as.factor(data$response)
From here, I want to make histograms for the two categorical variables in this data set and put them on the same page:
#make histograms and put them on the same page (note: I don't know why the "par(mfrow = c(1,2))" statement is not working)
par(mfrow = c(1,2))
histogram(data$response, main = "response")
histogram(data$favorite_food, main = "favorite food")
My question: Is it possible to automatically produce histograms for all categorical variables in a given data set (without manually writing the "histogram()" statement for each variable) and print them on the same page? Is it better to use the "ggplot2" library for this problem?
I can manually write the "histogram()" statement for each individual categorical variables in the data set, but I was looking for a quicker way to do this. Is it possible to do this with a "for loop"?
Thanks
A ggplot2/tidyverse solution is to pivot the columns into long format and then use faceting to plot them all on the same page:
(with edit to plot only factor variables)
factor_vars <- sapply(data, is.factor)
varnames <- names(data)
deselect_not_factors <- varnames[!factor_vars]
library(tidyr)
library(ggplot2)
data_long <- data %>%
pivot_longer(
cols = -all_of(deselect_not_factors),
names_to = "category",
values_to = "value"
)
ggplot(data_long) +
geom_bar(
aes(x = value)
) +
facet_wrap(~category, scales = "free")
Here's a base R alternative using barplot in for loop :
cols <- names(data)[sapply(data, is.factor)]
#This would need some manual adjustment if number of columns increase
par(mfrow = c(1,length(cols)))
for(i in cols) {
barplot(table(data[[i]]), main = i)
}
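The manual adjustment mentioned in the comment can be avoided by letting R pick the panel layout. The following sketch uses grDevices::n2mfrow, which suggests a rows-by-columns layout for a given number of plots; the data frame toy is made up for illustration.

```r
# Hypothetical data with one numeric and three factor columns
set.seed(1)
toy <- data.frame(
  num = rnorm(100),
  response = factor(sample(c("a", "b"), 100, replace = TRUE)),
  food = factor(sample(c("pizza", "sushi", "carrots"), 100, replace = TRUE)),
  colour = factor(sample(c("red", "green"), 100, replace = TRUE))
)

# Names of the factor columns only
cols <- names(toy)[sapply(toy, is.factor)]

# Let R choose a layout large enough for all bar plots
grid_dims <- grDevices::n2mfrow(length(cols))
old_par <- par(mfrow = grid_dims)
for (v in cols) {
  barplot(table(toy[[v]]), main = v)
}
par(old_par) # restore the previous layout
```

Saving and restoring the old par settings keeps later plots from inheriting the multi-panel layout.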
As an alternative, you can capitalize on the fantastic DataExplorer package.
Note that histograms are for continuous variables, so what you want for your categorical variables are bar plots. This can be done as follows:
if(require(DataExplorer)==FALSE) install.packages("DataExplorer"); library(DataExplorer)
DataExplorer::plot_histogram(data) # plots histograms for continuous variables
DataExplorer::plot_bar(data) # bar plots for categorical variables
Please refer to the package manual for more details.
Here is an attempt using cowplot and ggplot2:
library(ggplot2)
library(dplyr)
library(foreach)
library(cowplot)
list_variables <- c("response", "favorite_food")
all_plot <- foreach(current_var = c(list_variables)) %do% {
# need to do this to avoid ggplot reference to same summary data afterward.
data_summary_name <- paste0(current_var, "_summary")
eval(substitute(
{
graph_data <- data %>%
group_by(!!sym(current_var)) %>%
summarize(count = n(), .groups = "drop") %>%
mutate(share = count / sum(count))
plot <- ggplot(graph_data) +
geom_bar(mapping = aes(x = !!sym(current_var), y = share), width = 1,
fill = "#00FFFF", color = "#000000", stat = "identity") +
scale_y_continuous(labels = scales::percent) +
ggtitle(current_var) + ylab("Percent of Total") +
theme_bw()
}, list(graph_data = as.name(data_summary_name))
))
return(plot)
}
plot_grid(plotlist = all_plot, ncol = 2)
Note: For the reason why I use eval and substitute, see the question about ggplot2 generating the same plot for different variables in a for loop.
An alternative using facet_wrap, similar to QuishSwash's approach, but with the data expressed as shares instead:
list_variables <- c("response", "favorite_food")
# Calculate share for chosen variables defined in list_variables
# You can adjust by having some variables selection based on some condition
summary_df <- bind_rows(foreach(current_var = c(list_variables)) %do% {
data %>%
group_by(variable = !!sym(current_var)) %>%
summarize(count = n(), .groups = "drop") %>%
mutate(share = count / sum(count),
variable_name = current_var)
})
ggplot(summary_df) +
geom_bar(
aes(x = variable, y = share),
fill = "#00FFFF", color = "#000000", stat = "identity") +
facet_wrap(~variable_name, scales = "free") +
scale_y_continuous(labels = scales::percent) +
theme_bw()
Created on 2021-04-29 by the reprex package (v2.0.0)
