I am using the R programming language. I created the following data set for this example:
var_1 <- rnorm(1000,10,10)
var_2 <- rnorm(1000, 5, 5)
var_3 <- rnorm(1000, 6,18)
favorite_food <- c("pizza","ice cream", "sushi", "carrots", "onions", "broccoli", "spinach", "artichoke", "lima beans", "asparagus", "eggplant", "lettuce", "cucumbers")
favorite_food <- sample(favorite_food, 1000, replace=TRUE, prob=c(0.5, 0.45, 0.04, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001))
response <- c("a","b")
response <- sample(response, 1000, replace=TRUE, prob=c(0.3, 0.7))
data = data.frame( var_1, var_2, var_3, favorite_food, response)
data$favorite_food = as.factor(data$favorite_food)
data$response = as.factor(data$response)
From here, I want to make histograms for the two categorical variables in this data set and put them on the same page:
#make histograms and put them on the same page (note: I don't know why the "par(mfrow = c(1,2))" statement is not working)
par(mfrow = c(1,2))
histogram(data$response, main = "response")
histogram(data$favorite_food, main = "favorite food")
My question: Is it possible to automatically produce histograms for all categorical variables in a given data set (without manually writing the "histogram()" statement for each variable) and print them on the same page? Is it better to use the "ggplot2" library for this problem?
I can manually write the "histogram()" statement for each individual categorical variable in the data set, but I was looking for a quicker way to do this. Is it possible to do this with a "for loop"?
Thanks
A ggplot2/tidyverse solution is to pivot the factor columns into long format and then use faceting to plot them all on the same page:
(with edit to plot only factor variables)
factor_vars <- sapply(data, is.factor)
varnames <- names(data)
deselect_not_factors <- varnames[!factor_vars]
library(tidyr)
library(ggplot2)
data_long <- data %>%
pivot_longer(
cols = -all_of(deselect_not_factors), # all_of() avoids the ambiguous-external-vector warning
names_to = "category",
values_to = "value"
)
ggplot(data_long) +
geom_bar(
aes(x = value)
) +
facet_wrap(~category, scales = "free")
Here's a base R alternative using barplot in a for loop:
cols <- names(data)[sapply(data, is.factor)]
# A single row is used here; adjust the layout manually if the number of columns increases
par(mfrow = c(1,length(cols)))
for(i in cols) {
barplot(table(data[[i]]), main = i)
}
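The manual adjustment mentioned in the comment can be avoided by deriving the grid from the number of factor columns. This is a minimal sketch, assuming a near-square layout is acceptable; grid_dims() is a hypothetical helper (not part of base R) and the small data frame is made up for illustration:

```r
# Hypothetical helper: choose a near-square grid (rows, columns) for n panels.
grid_dims <- function(n) {
  n_col <- ceiling(sqrt(n))
  n_row <- ceiling(n / n_col)
  c(n_row, n_col)
}

# Made-up data with three factor columns and one numeric column.
set.seed(42)
data <- data.frame(
  var_1 = rnorm(200),
  favorite_food = factor(sample(c("pizza", "sushi", "carrots"), 200, replace = TRUE)),
  response = factor(sample(c("a", "b"), 200, replace = TRUE)),
  region = factor(sample(c("north", "south"), 200, replace = TRUE))
)

cols <- names(data)[sapply(data, is.factor)]

# par() returns the previous settings, so they can be restored afterwards.
old_par <- par(mfrow = grid_dims(length(cols)))
for (i in cols) {
  barplot(table(data[[i]]), main = i)
}
par(old_par)
```

With three factor columns this gives a 2 x 2 grid with one empty cell; a different rule (for example a fixed number of columns) may suit wide pages better.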
As an alternative, you can capitalize on the fantastic DataExplorer package.
Note that histograms are for continuous variables, so what you actually want for your categorical variables are bar plots. This can be done as follows:
if(require(DataExplorer)==FALSE) install.packages("DataExplorer"); library(DataExplorer)
DataExplorer::plot_histogram(data) # plots histograms for continuous variables
DataExplorer::plot_bar(data) # bar plots for categorical variables
Please refer to the package manual for more details.
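The histogram versus bar plot distinction can also be sketched in base R without any extra package. This is only an illustration of the idea, not how DataExplorer works internally; plot_column() and the toy data frame are hypothetical names introduced here:

```r
# Hypothetical helper: pick the plot type from the column type.
plot_column <- function(x, title) {
  if (is.numeric(x)) {
    hist(x, main = title, xlab = title)   # continuous: bin the values
    "histogram"
  } else {
    barplot(table(x), main = title)       # categorical: count each level
    "barplot"
  }
}

# Made-up data: one numeric and one factor column.
set.seed(1)
toy <- data.frame(
  var_1 = rnorm(100, 10, 10),
  response = factor(sample(c("a", "b"), 100, replace = TRUE))
)

old_par <- par(mfrow = c(1, ncol(toy)))
# mapply() walks over the columns and their names in parallel.
kinds <- mapply(plot_column, toy, names(toy))
par(old_par)
```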
Here is an attempt using cowplot and ggplot2:
library(ggplot2)
library(dplyr)
library(foreach)
library(cowplot)
list_variables <- c("response", "favorite_food")
all_plot <- foreach(current_var = c(list_variables)) %do% {
# need to do this to avoid ggplot reference to same summary data afterward.
data_summary_name <- paste0(current_var, "_summary")
eval(substitute(
{
graph_data <- data %>%
group_by(!!sym(current_var)) %>%
summarize(count = n(), .groups = "drop") %>%
mutate(share = count / sum(count))
plot <- ggplot(graph_data) +
geom_bar(mapping = aes(x = !!sym(current_var), y = share), width = 1,
fill = "#00FFFF", color = "#000000", stat = "identity") +
scale_y_continuous(labels = scales::percent) +
ggtitle(current_var) + ylab("Percent of Total") +
theme_bw()
}, list(graph_data = as.name(data_summary_name))
))
return(plot)
}
plot_grid(plotlist = all_plot, ncol = 2)
Note: For the reason why I use eval & substitute, see this question on ggplot2 generating the same plot for different variables in a for loop.
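If the only goal is one plot per variable, a simpler sketch that in many cases sidesteps the lazy-evaluation issue is to build the plots with lapply(), since each function call gets its own copy of the variable name. This is an alternative to the eval/substitute approach above, not a description of it; it plots raw counts rather than shares, and the toy data frame is made up:

```r
library(ggplot2)

# Made-up data with two factor columns.
set.seed(1)
toy <- data.frame(
  response = factor(sample(c("a", "b"), 200, replace = TRUE)),
  favorite_food = factor(sample(c("pizza", "sushi", "carrots"), 200, replace = TRUE))
)

list_variables <- c("response", "favorite_food")

# Each call of the anonymous function has its own `v`, so every plot
# keeps pointing at its own column when it is drawn later.
all_plot <- lapply(list_variables, function(v) {
  ggplot(toy, aes(x = .data[[v]])) +
    geom_bar(fill = "#00FFFF", color = "#000000") +
    labs(title = v, x = v, y = "Count") +
    theme_bw()
})
```

The resulting list can be passed to cowplot::plot_grid(plotlist = all_plot, ncol = 2) in the same way as in the answer.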
Using facet_wrap, an approach similar to QuishSwash's, but with the data summarised as shares instead of counts:
list_variables <- c("response", "favorite_food")
# Calculate share for chosen variables defined in list_variables
# You can adjust by having some variables selection based on some condition
summary_df <- bind_rows(foreach(current_var = c(list_variables)) %do% {
data %>%
group_by(variable = !!sym(current_var)) %>%
summarize(count = n(), .groups = "drop") %>%
mutate(share = count / sum(count),
variable_name = current_var)
})
ggplot(summary_df) +
geom_bar(
aes(x = variable, y = share),
fill = "#00FFFF", color = "#000000", stat = "identity") +
facet_wrap(~variable_name, scales = "free") +
scale_y_continuous(labels = scales::percent) +
theme_bw()
Created on 2021-04-29 by the reprex package (v2.0.0)
Related
I'm using ggplot geom_vline in combination with a custom function to plot certain values on top of a histogram.
The example function below e.g. returns a vector of three values (the mean and x sds below or above the mean). I can now plot these values in geom_vline(xintercept) and see them in my graph.
library(tidyverse) # provides tibble(), pivot_longer(), pull(), bind_rows(), lst() and ggplot()
#example function
sds_around_the_mean <- function(x, multiplier = 1) {
mean <- mean(x, na.rm = TRUE)
sd <- sd(x, na.rm = TRUE)
tibble(low = mean - multiplier * sd,
mean = mean,
high = mean + multiplier * sd) %>%
pivot_longer(cols = everything()) %>%
pull(value)
}
Reproducible data
#data
set.seed(123)
normal <- tibble(data = rnorm(1000, mean = 100, sd = 5))
outliers <- tibble(data = runif(5, min = 150, max = 200))
df <- bind_rows(lst(normal, outliers), .id = "type")
df %>%
ggplot(aes(x = data)) +
geom_histogram(bins = 100) +
geom_vline(xintercept = sds_around_the_mean(df$data, multiplier = 3),
linetype = "dashed", color = "red") +
geom_vline(xintercept = sds_around_the_mean(df$data, multiplier = 2),
linetype = "dashed")
The problem is that, as you can see, I have to reference df$data in several places.
This becomes more error-prone when I apply any change to the original df that I pipe into ggplot, e.g. filtering out outliers before plotting. I would have to apply the same changes again at multiple places.
E.g.
df %>% filter(type == "normal")
#also requires
df$data
#to be changed to
df$data[df$type == "normal"]
#in geom_vline to obtain the correct input values for the xintercept.
So instead, how could I replace the df$data argument with the respective column of whatever has been piped into ggplot() in the first place? Something similar to the "." operator, I assume. I've also tried stat_summary with geom = "vline" to achieve this, but without the desired effect.
You can enclose the ggplot part in curly brackets and reference the incoming dataset with the . symbol both in the ggplot command and when calculating the sds_around_the_mean. This will make it dynamic.
df %>%
{ggplot(data = ., aes(x = data)) +
geom_histogram(bins = 100) +
geom_vline(xintercept = sds_around_the_mean(.$data, multiplier = 3),
linetype = "dashed", color = "red") +
geom_vline(xintercept = sds_around_the_mean(.$data, multiplier = 2),
linetype = "dashed")}
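Another option that avoids the curly braces is to pass a function as the data argument of the layer: ggplot2 calls it with the plot's data and uses the returned data frame for that layer. The sketch below assumes that behaviour and uses a simplified base R version of sds_around_the_mean(); vline_data() is a hypothetical helper introduced for illustration:

```r
library(ggplot2)

# Simplified version of the question's function, returning a plain vector.
sds_around_the_mean <- function(x, multiplier = 1) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  c(low = m - multiplier * s, mean = m, high = m + multiplier * s)
}

# Hypothetical helper: returns a function that ggplot2 applies to the
# plot data to build the data frame for a geom_vline() layer.
vline_data <- function(multiplier) {
  function(d) data.frame(x = unname(sds_around_the_mean(d$data, multiplier)))
}

set.seed(123)
df <- data.frame(
  type = rep(c("normal", "outliers"), c(1000, 5)),
  data = c(rnorm(1000, mean = 100, sd = 5), runif(5, min = 150, max = 200))
)

# The filter is applied once; both vline layers see the filtered data.
p <- ggplot(subset(df, type == "normal"), aes(x = data)) +
  geom_histogram(bins = 100) +
  geom_vline(data = vline_data(3), aes(xintercept = x),
             linetype = "dashed", color = "red") +
  geom_vline(data = vline_data(2), aes(xintercept = x),
             linetype = "dashed")
```

Changing the filter in the ggplot() call is then enough; the intercepts are recomputed from whatever data the plot receives.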
I have a boxplot from the code below and I want to add median values.
boxplot(ndvi_pct_sep~edge_direction, data= data_sample, subset = edge_direction %in% c(64,4, 1,16),ylab="NDVI2028-2016", xlab="Forest edge direction",names=c("north", "south", "east", "west"))
I want to add the median values to the boxplots, any idea how to do it?
It will likely involve using legends. Since I don't have your data I can't make it perfect, but the code below should get you started using the ToothGrowth data that ships with R. I am showing a base R and a ggplot example (I know you said no ggplot, but others may use it).
# Load libraries
library(dplyr); library(ggplot2)
# get median data
mediandata <- ToothGrowth %>% group_by(dose) %>% summarise(median = median(len, na.rm = TRUE))
l <- unname(unlist(mediandata))
tg <- ToothGrowth # for convenience
tg$dose <- as.factor(tg$dose)
### Base R approach
boxplot(len ~ dose, data = tg,
main = "Guinea Pigs' Tooth Growth",
xlab = "Vitamin C dose mg",
ylab = "tooth length", col = "red")
for (i in 1:3){
legend(i-0.65,l[i+3]+5, legend = paste0("Median: ",l[i+3]), bty = "n")
}
### ggplot approach
ggplot(data = tg, aes(dose, len)) +
theme_classic() + theme(legend.position = "none") +
geom_boxplot()+
annotate("text",
x = c(1,2,3),
y = l[4:6]+1, # shift so you can read it
label = l[4:6])
Base R: (plot image not included)
ggplot: (plot image not included)
Here's a straightforward solution with text() and without a for loop:
Toy data:
set.seed(12)
df <- data.frame(
var1 = sample(LETTERS[1:4], 100, replace = TRUE),
var2 = rnorm(100)
)
Calculate the medians:
library(dplyr)
med <- df %>%
group_by(var1) %>%
summarise(medians = median(var2)) %>%
pull(medians)
Alternatively, in base R:
bx <- boxplot(df$var2 ~ df$var1)
med <- bx$stats[3,1:4]
Boxplot:
boxplot(df$var2 ~ df$var1)
Annotate boxplots:
text(1:4, med, round(med,3), pos = 3, cex = 0.6)
You can do
b <- boxplot(count ~ spray, data = InsectSprays, col = "lightgray", boxwex=.2)
s <- b$stats
text(1:ncol(s)+.4, s[3,], round(s[3,],1), col="red")
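The snippet above relies on the third row of the stats matrix returned by boxplot() holding the medians. A small sketch that makes this explicit, using plot = FALSE so nothing is drawn; the row labels are added here only for readability:

```r
# Compute the boxplot statistics without drawing anything.
b <- boxplot(count ~ spray, data = InsectSprays, plot = FALSE)

# Rows of $stats: lower whisker, lower hinge, median, upper hinge, upper whisker.
rownames(b$stats) <- c("lower whisker", "lower hinge", "median",
                       "upper hinge", "upper whisker")
colnames(b$stats) <- b$names

med_from_stats <- b$stats["median", ]

# Cross-check against medians computed directly per group.
med_direct <- tapply(InsectSprays$count, InsectSprays$spray, median)
```

Both vectors should agree, so either can be passed to text() as the label positions and values.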
I am trying to create a script that automates the creation of a set of plots (faceted and grouped), with p-values calculated and plotted using the ggpubr and rstatix packages.
set.seed(1234)
create the dataset
data_set <-
data.frame(
var1 = rep(c("N", "N", "Y", "Y"),4),
var2 = c(rep("type1",8), rep("type2", 8)),
var3 = c(rep("type1",4),rep("type2",8),rep("type1",4)),
x = rnorm(16),
y = rnorm(16),
z = rnorm(16)
)
Perform a t-test for variable x vs. var2, grouped by var3 and faceted by var1 (see below), and store the results as a data frame using rstatix functions:
stat.test <- data_set %>%
group_by(var2, var1) %>%
t_test( x ~ var3) %>%
adjust_pvalue(method = "bonferroni") %>%
add_significance("p.adj") %>%
add_xy_position(x = "var2", dodge = 0.8)
Perform another t-test on variable x vs. var3, this time using data grouped by var2 and faceted again by var1, and use mutate to alter some variables so they align correctly when plotted using the function below:
stat.test.1 <- data_set %>%
group_by(var3, var1) %>%
t_test( x ~ var2) %>%
adjust_pvalue(method = "bonferroni") %>%
add_significance("p.adj") %>%
add_xy_position(x = "var3", dodge = 0.8) %>%
mutate(
xmin = xmin + c(0, 0, -0.6, -0.6),
xmax = xmax + c(0.6, 0.6, 0, 0),
y.position = y.position + c(1, 1, 2, 2)
)
Plot using ggboxplot
ggboxplot(
data_set,
x = "var2",
add = "mean_sd",
y = "x",
color = "var3",
facet.by = "var1"
) +
stat_pvalue_manual(stat.test,
label = "p.adj",
tip.length = 0.01,
hide.ns = FALSE) +
stat_pvalue_manual(
stat.test.1,
label = "p.adj",
tip.length = 0.01,
hide.ns = FALSE
) +
scale_y_continuous(expand = expansion(mult = c(0.01, 0.1)))
All of this works to my expectations and I get the plot I want along with significance values plotted (though not perfect, needs some adjustments to the y positions of the significance bars).
What I want to do is create a function or script using tidy approach to create a similar set of boxplots for all numeric variables (x, y and z) grouped and faceted in the same manner as this plot. I am able to get the plots themselves, but having difficulty with generating the stats dataframes and using them to add the p values and significance bars into the plots. Thanks.
I have xy grouped data that I'm plotting using R's ggplot2 geom_violin adding regression trend lines:
Here are the data:
library(dplyr)
library(plotly)
library(ggplot2)
set.seed(1)
df <- data.frame(value = c(rnorm(500,8,1),rnorm(600,6,1.5),rnorm(400,4,0.5),rnorm(500,2,2),rnorm(400,4,1),rnorm(600,7,0.5),rnorm(500,3,1),rnorm(500,3,1),rnorm(500,3,1)),
age = c(rep("d3",500),rep("d8",600),rep("d24",400),rep("d3",500),rep("d8",400),rep("d24",600),rep("d3",500),rep("d8",500),rep("d24",500)),
group = c(rep("A",1500),rep("B",1500),rep("C",1500))) %>%
dplyr::mutate(time = as.integer(gsub("d", "", age))) %>% # strip the "d" prefix; as.integer("d3") would give NA
dplyr::arrange(group,time) %>%
dplyr::mutate(group_age=paste0(group,"_",age))
df$group_age <- factor(df$group_age,levels=unique(df$group_age))
And my current plot:
ggplot(df,aes(x=group_age,y=value,fill=age,color=age,alpha=0.5)) +
geom_violin() + geom_boxplot(width=0.1,aes(fill=age,color=age,middle=mean(value))) +
geom_smooth(data=df,mapping=aes(x=group_age,y=value,group=group),color="black",method='lm',size=1,se=T) + theme_minimal()
My questions are:
How do I get rid of the alpha part of the legend?
I would like the x-axis ticks to be df$group rather than df$group_age, which means a tick per each group at the center of that group where the label is group. Consider a situation where not all groups have all ages - for example, if a certain group has only two of the ages and I'm pretty sure ggplot will only present only these two ages, I'd like the tick to still be centered between their two ages.
One more question:
It would also be nice to have the p-values of each fitted slope plotted on top of each group.
I tried:
library(ggpmisc)
my.formula <- value ~ group_age
ggplot(df,aes(x=group_age,y=value,fill=age,color=age,alpha=0.5)) +
geom_violin() + geom_boxplot(width=0.1,aes(fill=age,color=age,middle=mean(value))) +
geom_smooth(data=df,mapping=aes(x=group_age,y=value,group=group),color="black",method='lm',size=1,se=T) + theme_minimal() +
stat_poly_eq(formula = my.formula,aes(label=stat(p.value.label)),parse=T)
But I get the same plot as above with the following warning message:
Warning message:
Computation failed in `stat_poly_eq()`:
argument "x" is missing, with no default
geom_smooth() fits a line, while stat_poly_eq() issues an error. A factor is a categorical variable with unordered levels. A trend against a factor is undefined. geom_smooth() may be taking the levels and converting them to "arbitrary" numerical values, but these values are just indexes rather than meaningful values.
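The point about level indexes can be checked directly in base R. A minimal sketch using the three age labels from the question:

```r
age <- c("d3", "d8", "d24")

# Default factor levels are sorted as text ("d24" < "d3" < "d8"),
# so the integer codes do not follow the actual ages.
codes <- as.integer(factor(age))            # 2 3 1

# Calling as.integer() on the raw strings gives NA (with a warning).
direct <- suppressWarnings(as.integer(age)) # NA NA NA

# Stripping the "d" prefix recovers the numeric time in days.
days <- as.integer(gsub("d", "", age))      # 3 8 24
```

Only the last vector is a meaningful numeric predictor for a regression.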
To obtain a plot similar to what is described in the question but using code that provides correct linear regression lines and the corresponding p-values I would use the code below. The main change is that the numerical variable time is mapped to x making the fitting of a regression a valid operation. To allow for a linear fit an x-scale with a log10 transformation is used, with breaks and labels at the ages for which data is available.
library(dplyr)
library(ggplot2)
library(ggpmisc)
set.seed(1)
df <-
data.frame(
value = c(
rnorm(500, 8, 1), rnorm(600, 6, 1.5), rnorm(400, 4, 0.5),
rnorm(500, 2, 2), rnorm(400, 4, 1), rnorm(600, 7, 0.5),
rnorm(500, 3, 1), rnorm(500, 3, 1), rnorm(500, 3, 1)
),
age = c(
rep("d3", 500), rep("d8", 600), rep("d24", 400),
rep("d3", 500), rep("d8", 400), rep("d24", 600),
rep("d3", 500), rep("d8", 500), rep("d24", 500)
),
group = c(rep("A", 1500), rep("B", 1500), rep("C", 1500))
) %>%
mutate(time = as.integer(gsub("d", "", age))) %>%
arrange(group, time) %>%
mutate(age = factor(age, levels = c("d3", "d8", "d24")),
group = factor(group))
my_formula = y ~ x
ggplot(df, aes(x = time, y = value)) +
geom_violin(aes(fill = age, color = age), alpha = 0.3) +
geom_boxplot(width = 0.1,
aes(color = age), fill = NA) +
geom_smooth(color = "black", formula = my_formula, method = 'lm') +
stat_poly_eq(aes(label = stat(p.value.label)),
formula = my_formula, parse = TRUE,
npcx = "center", npcy = "bottom") +
scale_x_log10(name = "Age", breaks = c(3, 8, 24)) +
facet_wrap(~group) +
theme_minimal()
Which creates the following figure:
Here is a solution. The alpha legend issue is easy: anything you place inside aes() gets a legend entry. aes() should be used when you want a feature of the data to be mapped to an aesthetic. Putting alpha outside of aes() removes it from the legend.
I'm not sure the x-axis labelling is what you wanted, but I did it manually so it should be easy to configure.
Regarding the p-values, I ran separate linear regressions and stored each p-value in its own variable, which can be passed to the plot using annotate(). For two of the groups the p-value was < .001, so round() rounds it to 0; for those I just wrote 'pvalue: <.001'.
Good luck with this!
library(dplyr)
library(ggplot2)
set.seed(1)
df <- data.frame(value = c(rnorm(500,8,1),rnorm(600,6,1.5),rnorm(400,4,0.5),rnorm(500,2,2),rnorm(400,4,1),rnorm(600,7,0.5),rnorm(500,3,1),rnorm(500,3,1),rnorm(500,3,1)),
age = c(rep("d3",500),rep("d8",600),rep("d24",400),rep("d3",500),rep("d8",400),rep("d24",600),rep("d3",500),rep("d8",500),rep("d24",500)),
group = c(rep("A",1500),rep("B",1500),rep("C",1500))) %>%
dplyr::mutate(time = as.integer(gsub("d", "", age))) %>% # strip the "d" prefix so time is numeric rather than NA
dplyr::arrange(group,time) %>%
dplyr::mutate(group_age=paste0(group,"_",age))
df$group_age <- factor(df$group_age,levels=unique(df$group_age))
mod1 <- lm(value ~ time, df[df$group == 'A', ])
mod1 <- summary(mod1)$coefficients[8] %>% round(2)
mod2 <- lm(value ~ time, df[df$group == 'B', ])
mod2 <- summary(mod2)$coefficients[8] %>% round(2)
mod3 <- lm(value ~ time, df[df$group == 'C', ])
mod3 <- summary(mod3)$coefficients[8] %>% round(2)
ggplot(df,aes(x=group_age,y=value,fill=age,color=age)) +
geom_violin(alpha=0.5) +
geom_boxplot(width=0.1,aes(fill=age,color=age,middle=mean(value))) +
geom_smooth(mapping=aes(x=group_age,y=value,group=group),color="black",method='lm',size=1,se=T) +
scale_x_discrete(labels = c('','A','','','B','','','C','')) +
annotate('text',x = 2,y = -1,label = paste('pvalue: <.001')) +
annotate('text',x = 6,y = 10,label = paste('pvalue: <.001')) +
annotate('text',x = 8,y = -1.2,label = paste('pvalue:',mod3))+
theme_minimal()
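The point about alpha and aes() made in this answer can be seen in isolation with a built-in data set. This is a minimal sketch using ToothGrowth rather than the question's data:

```r
library(ggplot2)

# alpha inside aes(): treated as a mapped variable, so it gets a legend entry.
p_mapped <- ggplot(ToothGrowth,
                   aes(x = factor(dose), y = len, fill = supp, alpha = 0.5)) +
  geom_violin()

# alpha as a fixed layer parameter: same transparency idea, no alpha legend.
p_fixed <- ggplot(ToothGrowth,
                  aes(x = factor(dose), y = len, fill = supp)) +
  geom_violin(alpha = 0.5)
```

Printing both plots side by side shows the extra "alpha" legend only for p_mapped.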
I would like to be able to plot each of "X1 by grpA", "X2 by grpA", "X3 by grpA", "X1 by grpB", "X2 by grpB", and "X3 by grpB" using ggplot2::ggplot() in conjunction with a for loop.
So far, I can get it to almost work, but the argument for the column of the grouping variable in the facet_grid() function does not resolve correctly when I try to use tidy_eval properties. It does work, however, when I type the column name explicitly, but of course, having to type the name explicitly would make it so I would not be able to dynamically change the grouping variable.
I provide the following data-set returned by the following code snippet to give context to my question:
set.seed(1)
dfr <- tibble(x1 = factor(sample(letters[1:7], 50, replace = T), levels=letters[1:7]),
x2 = factor(sample(letters[1:7], 50, replace = T), levels=letters[1:7]),
x3 = factor(sample(letters[1:7], 50, replace = T), levels=letters[1:7]),
grpA = factor(sample(c("grp1","grp2"),50, prob=c(0.3, 0.7) ,replace=T), levels = c("grp1", "grp2")),
grpB = factor(sample(c("grp1","grp2"),50, prob=c(0.6, 0.4) ,replace=T), levels = c("grp1", "grp2"))
)
head(df)
I also provide a function that creates the plotting data I need to make the grouped plots. It accepts strings as arguments for the parameters 'groupvar' and 'mainvar':
plot_data_prepr <- function(dat, groupvar, mainvar){
groupvar <- sym(groupvar)
mainvar <- sym(mainvar)
plot_data <- dat %>%
group_by(!!groupvar) %>%
count(!!mainvar, .drop = F) %>% drop_na() %>%
mutate(pct = n/sum(n),
pct2 = ifelse(n == 0, 0.005, n/sum(n)),
grp_tot = sum(n),
pct_lab = paste0(format(pct*100, digits = 1),'%'),
pct_pos = pct2 + .02)
return(plot_data)
}
here is normal usage of the function:
plot_data_prepr(dat = dfr, groupvar = "grpA", mainvar = "x1")
Now I share my for loop that fails when I try to use tidy_eval in the facet_grid() function in the context of ggplot(); the returned error = "Error in !sgvar : invalid argument type"
"FAILING EXAMPLE:"
for (i in seq_along(names(dfr)[1:3])){
mvar <- names(dfr)[i]
print(mvar)
gvar <- names(dfr[4])
print(gvar)
smvar <- sym(mvar)
sgvar <- sym(gvar)
plot <- ggplot(data=plot_data_prepr(dfr, gvar, mvar),
mapping = aes(x=!!smvar, y = pct2, fill = !!smvar)) +
geom_bar(stat = 'identity') +
ylim(0,1) +
geom_text(aes(x=!!smvar, label=pct_lab, y = pct_pos + .02)) +
facet_grid(. ~ !!sgvar) +
ggtitle(paste0(mvar," by ",gvar))
print(plot)
}
When I run the loop by explicitly typing grpA in place of !!sgvar in the facet_grid() function, it works for some reason:
"FUNCTIONING BUT NOT WHAT I WANT EXAMPLE:"
for (i in seq_along(names(dfr)[1:3])){
mvar <- names(dfr)[i]
print(mvar)
gvar <- names(dfr[4])
print(gvar)
smvar <- sym(mvar)
sgvar <- sym(gvar)
plot <- ggplot(data=plot_data_prepr(dfr, gvar, mvar),
mapping = aes(x=!!smvar, y = pct2, fill = !!smvar)) +
geom_bar(stat = 'identity') +
ylim(0,1) +
geom_text(aes(x=!!smvar, label=pct_lab, y = pct_pos + .02)) +
facet_grid(. ~ grpA) +
ggtitle(paste0(mvar," by ",gvar))
print(plot)
}
Of course, if I wanted to loop through a set of grouping variables, then needing to explicitly type each one would not allow for looping. Could someone explain why my code with the 'bang bang' operator inside facet_grid() doesn't work properly in the 'FAILING EXAMPLE' and also suggest how to remedy this error?
Thank you.
It's difficult to piece together exactly what you're looking for, since your example code has errors, unassigned variable names and pieces of code missing. However, I think you're wanting the loop to print all of the pairs of grouping variables and main variables by cycling through the names of your data frame.
So that there is no dubiety, here is a full reprex:
Load packages and create reproducible data:
library(dplyr)
library(ggplot2)
set.seed(1)
df <- tibble(x1 = factor(sample(letters[1:7], 50, replace = TRUE)),
x2 = factor(sample(letters[1:7], 50, replace = TRUE)),
x3 = factor(sample(letters[1:7], 50, replace = TRUE)),
grpA = factor(sample(c("grp1", "grp2"), 50,
prob = c(0.3, 0.7), replace=TRUE)),
grpB = factor(sample(c("grp1", "grp2"), 50,
prob = c(0.6, 0.4), replace=TRUE)))
Define data preparation function
plot_data_prepr <- function(dat, groupvar, mainvar)
{
groupvar <- sym(groupvar)
mainvar <- sym(mainvar)
plot_data <- dat %>%
group_by(!!groupvar) %>%
count(!!mainvar, .drop = F) %>% tidyr::drop_na() %>%
mutate(pct = n/sum(n),
pct2 = ifelse(n == 0, 0.005, n/sum(n)),
grp_tot = sum(n),
pct_lab = paste0(format(pct*100, digits = 1),'%'),
pct_pos = pct2 + .02)
return(plot_data)
}
Loop to create all 6 plots
for(gvar in names(df)[4:5]){
for(mvar in names(df)[1:3])
{
print(ggplot(plot_data_prepr(df, gvar, mvar),
aes(x = !!sym(mvar), y = pct2, fill = !!sym(mvar))) +
geom_bar(stat = 'identity') +
ylim(0,1) +
geom_text(aes(label=pct_lab, y = pct_pos + .02)) +
facet_grid(as.formula(paste0(".~", gvar))) +
ggtitle(paste0(mvar, " by ", gvar))
)
}
}
Output:
Created on 2020-06-30 by the reprex package (v0.3.0)
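As to why the original attempt failed: a formula built with ~ stores its right-hand side unevaluated and does not process !!, so as far as I can tell facet_grid() ends up evaluating !sgvar on a symbol, which gives the "invalid argument type" error. Besides building the formula from a string as above, an alternative is to pass the facet variable through vars(), which does support !!. A minimal sketch, assuming the rlang package (a ggplot2 dependency) is available; the toy data frame is made up:

```r
library(ggplot2)

# Made-up data with one main variable and two grouping variables.
set.seed(1)
toy <- data.frame(
  x1 = factor(sample(letters[1:3], 60, replace = TRUE)),
  grpA = factor(sample(c("grp1", "grp2"), 60, replace = TRUE)),
  grpB = factor(sample(c("grp1", "grp2"), 60, replace = TRUE))
)

plots <- list()
for (gvar in c("grpA", "grpB")) {
  plots[[gvar]] <- ggplot(toy, aes(x = x1, fill = x1)) +
    geom_bar() +
    # !! injects the symbol immediately, so each stored plot keeps its
    # own facet variable even though `gvar` changes on the next iteration.
    facet_grid(cols = vars(!!rlang::sym(gvar))) +
    ggtitle(paste0("x1 by ", gvar))
}
```

Storing the plots in a list and printing them afterwards is safe here because nothing in the plot refers back to the loop variable.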