Separated Boxplots for each column of a dataset - r

I have a data frame with 79 columns.
For each column, I am trying to produce an entirely separated boxplot.
I have tried
apply(integers, 2,function(x) boxplot(x, main = colnames(integers["x"])))
However, I cannot add the title of each column to the respective boxplot.

library(tidyverse)
plot_function <- function(column_name, data_in) {
  # aes_string() is deprecated; .data[[...]] looks the column up by name
  plot_out <- ggplot(data_in, aes(y = .data[[column_name]])) +
    geom_boxplot() +
    labs(title = column_name, y = column_name)
  return(plot_out)
}
plot_columns <- names(iris)[1:4]
plot <- lapply(plot_columns, function(x, y) plot_function(x, y), y = iris)
plot[[1]]
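For the base-graphics attempt in the question, the title goes missing because inside apply() the function only sees the column's values, and integers["x"] looks up a column literally named "x" rather than the current column. A minimal sketch of one workaround is to loop over the column names instead. The helper name boxplot_each_column is made up for illustration, and iris stands in for the 79-column data frame:

```r
# Hypothetical helper: draw one titled boxplot per numeric column.
boxplot_each_column <- function(data) {
  numeric_cols <- names(data)[vapply(data, is.numeric, logical(1))]
  for (col in numeric_cols) {
    # data[[col]] supplies the values; col itself becomes the title
    boxplot(data[[col]], main = col)
  }
  invisible(numeric_cols)
}

titles <- boxplot_each_column(iris)
```

Each call to boxplot() starts a new plot, so every column ends up on its own page of the graphics device.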

Related

add one legend with all variables for combined graphs

I'm trying to plot two graphs side-by-side with one common legend that incorporates all the variables between both graphs (some vars are different between the graphs).
Here's a mock example of what I've been attempting:
#make relative abundance values for n rows
makeData <- function(n){
  x <- runif(n, 0, 1)
  x / sum(x)  # relative abundances that sum to 1
}
#make random matrices filled with relative abundance values
makeDF <- function(col, rw){
df <- matrix(ncol=col, nrow=rw)
for(i in 1:ncol(df)){
df[,i] <- makeData(nrow(df))
}
return(df)
}
#create df1 and assign col names
df1 <- makeDF(4, 5)
colSums(df1) #verify relative abundance values = 1
df1 <- as.data.frame(df1)
colnames(df1) <- c("taxa","s1", "s2", "s3")
df1$taxa <- c("ASV1", "ASV2", "ASV3", "ASV4", "ASV5")
#repeat for df2
df2 <- makeDF(4,5)
df2 <- as.data.frame(df2)
colnames(df2) <- c("taxa","s1", "s2", "s3")
df2$taxa <- c("ASV1", "ASV5", "ASV6", "ASV7", "ASV8")
# convert wide data format to long format -- for plotting
library(reshape2)
makeLong <- function(df){
df.long <- melt(df, id.vars="taxa",
measure.vars=grep("s\\d+", names(df), val=T),
variable.name="sample",
value.name="value")
return(df.long)
}
df1 <- makeLong(df1)
df2 <- makeLong(df2)
#generate distinct colours for each asv
taxas <- union(df1$taxa, df2$taxa)
library("RColorBrewer")
qual_col_pals = brewer.pal.info[brewer.pal.info$category == 'qual',]
colpals <- qual_col_pals[c("Set1", "Dark2", "Set3"),] #select colour palettes
col_vector = unlist(mapply(brewer.pal, colpals$maxcolors, rownames(colpals)))
taxa.col=sample(col_vector, length(taxas))
names(taxa.col) <- taxas
# plot using ggplot
library(ggplot2)
plotdf2 <- ggplot(df2, aes(x=sample, y=value, fill=taxa)) +
geom_bar(stat="identity")+
scale_fill_manual("ASV", values = taxa.col)
plotdf1 <- ggplot(df1, aes(x=sample, y=value, fill=taxa)) +
geom_bar(stat="identity")+
scale_fill_manual("ASV", values = taxa.col)
#combine plots to one figure and merge legend
library(ggpubr)
ggpubr::ggarrange(plotdf1, plotdf2, ncol=2, nrow=1, common.legend = T, legend="bottom")
(if you have suggestions on how to generate better mock data, by all means!)
When I run my code, I am able to get the two graphs in one figure, but the legend does not incorporate all variables from both plots:
I ideally would like to avoid having repeat variables in the legend, such as:
From what I've searched online, the legend only works when the variables are the same between graphs, but in my case I have similar and different variables.
Thanks for any help!
Maybe this is what you are looking for:
Convert your taxa variables to factor with the levels equal to your taxas variable, i.e. to include all levels from both datasets.
Add argument drop=FALSE to both scale_fill_manual to prevent dropping of unused factor levels.
Note: I only added the relevant parts of the code and set the seed to 42 at the beginning of the script.
set.seed(42)
df1$taxa <- factor(df1$taxa, taxas)
df2$taxa <- factor(df2$taxa, taxas)
# plot using ggplot
library(ggplot2)
plotdf2 <- ggplot(df2, aes(x=sample, y=value, fill=taxa)) +
geom_bar(stat="identity") +
scale_fill_manual("ASV", values = taxa.col, drop = FALSE)
plotdf1 <- ggplot(df1, aes(x=sample, y=value, fill=taxa)) +
geom_bar(stat="identity")+
scale_fill_manual("ASV", values = taxa.col, drop = FALSE)
#combine plots to one figure and merge legend
library(ggpubr)
ggpubr::ggarrange(plotdf1, plotdf2, ncol=2, nrow=1, common.legend = T, legend="bottom")
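To see why both steps matter, here is a small base R sketch. The vectors are made-up stand-ins for the taxa columns, not the data from the question. A factor built with the full set of levels keeps the levels that never occur in that data set, and drop = FALSE asks the scale to keep showing them in the legend:

```r
# Made-up stand-ins for the taxa found in each data set
taxa1 <- c("ASV1", "ASV2", "ASV3", "ASV4", "ASV5")
taxa2 <- c("ASV1", "ASV5", "ASV6", "ASV7", "ASV8")
all_taxa <- union(taxa1, taxa2)

# Factor with the shared level set: unused levels are retained
f1 <- factor(taxa1, levels = all_taxa)
levels(f1)               # all eight ASVs, including those only in taxa2
table(f1)                # ASV6, ASV7 and ASV8 have a count of zero
nlevels(droplevels(f1))  # only the levels actually present remain
```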

Multiple plot in R in a single page

I'm having trouble displaying multiple graphs on the same page. I have a data frame with 18 numerical columns. For each column, I need to show its histogram and boxplot on the same page in a 4*9 grid. The following is what I tried, but I also need to show the boxplots alongside the histograms, through a for loop if possible. Can someone please help me do it?
library(gridExtra)
library(ggplot2)
library(tidyr) # for gather()
p <- list()
for(i in 1:18){
x <- my_data[,i]
p[[i]] <- ggplot(gather(x), aes(value)) +
geom_histogram(bins = 10) +
facet_wrap(~key, scales = 'free_x')
}
do.call(grid.arrange,p)
I received the following graph.
When I try the following, I get the graphs on separate pages:
library(dplyr)
dat2 <- my_data %>% mutate_all(scale)
# Boxplot from the R trees dataset
boxplot(dat2, col = rainbow(ncol(dat2)))
par(mfrow = c(2, 2)) # Set up a 2 x 2 plotting space
# Create the loop.vector (all the columns)
loop.vector <- 1:4
p <- list()
for (i in loop.vector) { # Loop over loop.vector
# store data in column.i as x
x <- my_data[,i]
# Plot histogram of x
p[[i]] <-hist(x,
main = paste("Question", i),
xlab = "Scores",
xlim = c(0, 100))
plot_grid(p, label_size = 12)
}
You can assemble the base R boxplot and the ggplot object generated with facet_wrap together using the R package patchwork:
library(ggplot2)
library(patchwork)
p <- ggplot(mtcars, aes(x = mpg)) +
geom_histogram() +
facet_wrap(~gear)
wrap_elements(~boxplot(split(mtcars$mpg, mtcars$gear))) / p
ggsave('test.png', width = 6, height = 8, units = 'in')
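If a plain loop is preferred, as asked in the question, here is a minimal base-graphics sketch; it is an alternative to the patchwork approach, not part of it. The function name and the use of mtcars as stand-in data are assumptions, and my_data from the question would be passed instead:

```r
# Hypothetical helper: histogram + boxplot for every column on one page.
plot_hist_and_box <- function(data, n_row, n_col) {
  stopifnot(2 * ncol(data) <= n_row * n_col)  # two panels per column
  old_par <- par(mfrow = c(n_row, n_col), mar = c(2, 2, 2, 1))
  on.exit(par(old_par))                       # restore settings afterwards
  for (col in names(data)) {
    hist(data[[col]], main = col, xlab = "")
    boxplot(data[[col]], main = col, horizontal = TRUE)
  }
  invisible(2 * ncol(data))                   # number of panels drawn
}

# Stand-in data: first four columns of mtcars in a 2 x 4 grid
plot_hist_and_box(mtcars[1:4], n_row = 2, n_col = 4)
```

For the 18-column case, plot_hist_and_box(my_data, 4, 9) would fill a 4*9 grid. A large device (for example png() with a generous width and height) is likely needed to avoid a "figure margins too large" error.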

put correlation coefficient on ggplot scatter plot after faceting

I'm having trouble putting the correlation coefficient on my scatter plot after faceting by another variable with facet_wrap.
Below is the example I made using the mtcars dataset for illustration purposes.
When I plot it, both plots show the same correlation number. It seems the correlation coefficient is not calculated for each facet. I could not figure out a way to achieve that. I'd really appreciate it if anyone could help with that.
library(ggplot2)
library(dplyr)
corr_eqn <- function(x,y, method='pearson', digits = 2) {
corr_coef <- round(cor.test(x, y, method=method)$estimate, digits = digits)
corr_pval <- tryCatch(format(cor.test(x,y, method=method)$p.value,
scientific=TRUE),
error=function(e) NA)
paste(method, 'r = ', corr_coef, ',', 'pval =', corr_pval)
}
sca.plot <- function (cor.coef=TRUE) {
df<- mtcars %>% filter(vs==1)
p<- df %>%
ggplot(aes(x=hp, y=mpg))+
geom_point()+
geom_smooth()+
facet_wrap(~cyl, ncol=3)
if (cor.coef) {
p<- p+geom_text(x=0.9*max(df$hp, na.rm=TRUE),
y=0.9*max(df$mpg, na.rm=TRUE),
label = corr_eqn(df[['hp']],df[['mpg']],
method='pearson'))
}
return (p)
}
sca.plot(cor.coef=TRUE)
Pass the faceting variable through inputFacet, loop over its values to calculate corr_eqn for each facet, and plot the facets using the variable name with get.
In shiny you'll probably have user input as input$facet; here it's called inputFacet. We draw the main plot using this variable in facet_wrap(~ get(inputFacet), ncol = 3). Next we loop over all facet values with for(i in seq_along(resCor$facets)) and store the results in resCor.
This should solve the "correlation coef is not calculated for each facet" problem.
library(dplyr)
library(ggplot2)
inputFacet <- "cyl"
cor.coef = TRUE
df <- mtcars
p <- df %>%
ggplot(aes(hp, mpg))+
geom_point()+
geom_smooth()+
facet_wrap(~ get(inputFacet), ncol = 3)
if (cor.coef) {
resCor <- data.frame(facets = unique(mtcars[, inputFacet]))
for(i in seq_along(resCor$facets)) {
foo <- mtcars[mtcars[, inputFacet] == resCor$facets[i], ]
resCor$text[i] <- corr_eqn(foo$hp, foo$mpg)
}
colnames(resCor)[1] <- inputFacet
p <- p + geom_text(data = resCor,
aes(0.9 * max(df$hp, na.rm = TRUE),
0.9 * max(df$mpg, na.rm = TRUE),
label = text))
}
p
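The per-facet step can also be isolated in a small function. This is a sketch in base R only, with a made-up helper name (facet_cor_labels) and a plain Pearson coefficient instead of the full corr_eqn text:

```r
# Hypothetical helper: one correlation label per facet level.
facet_cor_labels <- function(data, x, y, facet, digits = 2) {
  keys <- sort(unique(data[[facet]]))
  r <- vapply(seq_along(keys), function(i) {
    rows <- data[[facet]] == keys[i]       # rows belonging to this facet
    cor(data[[x]][rows], data[[y]][rows])  # Pearson correlation within it
  }, numeric(1))
  out <- data.frame(keys, paste("r =", round(r, digits)),
                    stringsAsFactors = FALSE)
  names(out) <- c(facet, "label")  # same column name as the faceting variable
  out
}

labels <- facet_cor_labels(mtcars, "hp", "mpg", "cyl")
labels
```

Because the first column carries the name of the faceting variable, the result can be passed as data to geom_text() in the same way resCor is used above.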

ggplot2: Adding sample size information to x-axis tick labels

This question is related to
Create custom geom to compute summary statistics and display them *outside* the plotting region
(NOTE: All functions have been simplified; no error checks for correct objects types, NAs, etc.)
In base R, it is quite easy to create a function that produces a stripchart with the sample size indicated below each level of the grouping variable: you can add the sample size information using the mtext() function:
stripchart_w_n_ver1 <- function(data, x.var, y.var) {
x <- factor(data[, x.var])
y <- data[, y.var]
# Need to call plot.default() instead of plot because
# plot() produces boxplots when x is a factor.
plot.default(x, y, xaxt = "n", xlab = x.var, ylab = y.var)
levels.x <- levels(x)
x.ticks <- 1:length(levels(x))
axis(1, at = x.ticks, labels = levels.x)
n <- sapply(split(y, x), length)
mtext(paste0("N=", n), side = 1, line = 2, at = x.ticks)
}
stripchart_w_n_ver1(mtcars, "cyl", "mpg")
or you can add the sample size information to the x-axis tick labels using the axis() function:
stripchart_w_n_ver2 <- function(data, x.var, y.var) {
x <- factor(data[, x.var])
y <- data[, y.var]
# Need to set the second element of mgp to 1.5
# to allow room for two lines for the x-axis tick labels.
o.par <- par(mgp = c(3, 1.5, 0))
on.exit(par(o.par))
# Need to call plot.default() instead of plot because
# plot() produces boxplots when x is a factor.
plot.default(x, y, xaxt = "n", xlab = x.var, ylab = y.var)
n <- sapply(split(y, x), length)
levels.x <- levels(x)
axis(1, at = 1:length(levels.x), labels = paste0(levels.x, "\nN=", n))
}
stripchart_w_n_ver2(mtcars, "cyl", "mpg")
While this is a very easy task in base R, it is maddeningly complex in ggplot2 because it is very hard to get at the data being used to generate the plot, and while there are functions equivalent to axis() (e.g., scale_x_discrete, etc.), there is no equivalent to mtext() that lets you easily place text at specified coordinates within the margins.
I tried using the built-in stat_summary() function to compute the sample sizes (i.e., fun.y = "length") and then place that information on the x-axis tick labels. As far as I can tell, however, you can't extract the sample sizes and then add them to the x-axis tick labels using scale_x_discrete(); you have to tell stat_summary() what geom you want it to use. You could set geom="text", but then you have to supply the labels, and the point is that the labels should be the sample sizes, which is what stat_summary() is computing but which you can't get at. You would also have to specify where the text should be placed, and again it is difficult to work out a position that lies directly underneath the x-axis tick labels.
The vignette "Extending ggplot2" (http://docs.ggplot2.org/dev/vignettes/extending-ggplot2.html) shows you how to create your own stat function that allows you to get directly at the data, but the problem is that you always have to define a geom to go with your stat function (i.e., ggplot thinks you want to plot this information within the plot, not in the margins); as far as I can tell, you can't take the information you compute in your custom stat function, not plot anything in the plot area, and instead pass the information to a scales function like scale_x_discrete(). Here was my try at doing it this way; the best I could do was place the sample size information at the minimum value of y for each group:
StatN <- ggproto("StatN", Stat,
required_aes = c("x", "y"),
compute_group = function(data, scales) {
y <- data$y
y <- y[!is.na(y)]
n <- length(y)
data.frame(x = data$x[1], y = min(y), label = paste0("n=", n))
}
)
stat_n <- function(mapping = NULL, data = NULL, geom = "text",
position = "identity", inherit.aes = TRUE, show.legend = NA,
na.rm = FALSE, ...) {
ggplot2::layer(stat = StatN, mapping = mapping, data = data, geom = geom,
position = position, inherit.aes = inherit.aes, show.legend = show.legend,
params = list(na.rm = na.rm, ...))
}
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) + geom_point() + stat_n()
I thought I had solved the problem by simply creating a wrapper function to ggplot:
ggstripchart <- function(data, x.name, y.name,
point.params = list(),
x.axis.params = list(labels = levels(x)),
y.axis.params = list(), ...) {
if(!is.factor(data[, x.name]))
data[, x.name] <- factor(data[, x.name])
x <- data[, x.name]
y <- data[, y.name]
params <- list(...)
point.params <- modifyList(params, point.params)
x.axis.params <- modifyList(params, x.axis.params)
y.axis.params <- modifyList(params, y.axis.params)
point <- do.call("geom_point", point.params)
stripchart.list <- list(
point,
theme(legend.position = "none")
)
n <- sapply(split(y, x), length)
x.axis.params$labels <- paste0(x.axis.params$labels, "\nN=", n)
x.axis <- do.call("scale_x_discrete", x.axis.params)
y.axis <- do.call("scale_y_continuous", y.axis.params)
stripchart.list <- c(stripchart.list, x.axis, y.axis)
ggplot(data = data, mapping = aes_string(x = x.name, y = y.name)) + stripchart.list
}
ggstripchart(mtcars, "cyl", "mpg")
However, this function does not work correctly with faceting. For example:
ggstripchart(mtcars, "cyl", "mpg") + facet_wrap(~am)
shows the sample sizes for both facets combined in each facet. I would have to build faceting into the wrapper function, which defeats the point of trying to use everything ggplot has to offer.
If anyone has any insights to this problem I would be grateful. Thanks so much for your time!
I have updated the EnvStats package to include a stat called stat_n_text which will add the sample size (the number of unique y-values) below each unique x-value. See the help file for stat_n_text for more information and a list of examples. Below is a simple example:
library(ggplot2)
library(EnvStats)
p <- ggplot(mtcars,
aes(x = factor(cyl), y = mpg, color = factor(cyl))) +
theme(legend.position = "none")
p + geom_point() +
stat_n_text() +
labs(x = "Number of Cylinders", y = "Miles per Gallon")
My solution might be a little simple but it works well.
Given an example with faceting by am, I start by creating labels using paste and \n.
library(dplyr)
mtcars2 <- mtcars %>%
  group_by(cyl, am) %>% mutate(n = n()) %>%
  mutate(label = paste0(cyl,'\nN = ',n)) %>%
  ungroup()
I then use these labels instead of cyl in the ggplot code
ggplot(mtcars2,
aes(x = factor(label), y = mpg, color = factor(label))) +
geom_point() +
xlab('cyl') +
facet_wrap(~am, scales = 'free_x') +
theme(legend.position = "none")
To produce something like the figure below.
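The same labels can be built without dplyr. This is a sketch using base R's ave(), not the code used for the answer above:

```r
# Count rows per (cyl, am) combination; ave() returns one value per row
mtcars2 <- mtcars
mtcars2$n <- ave(mtcars2$mpg, mtcars2$cyl, mtcars2$am, FUN = length)
mtcars2$label <- paste0(mtcars2$cyl, "\nN = ", mtcars2$n)

# One row per combination, to inspect the labels
unique(mtcars2[order(mtcars2$am, mtcars2$cyl), c("cyl", "am", "label")])
```

The label column can then take the place of cyl on the x-axis, as in the ggplot call above.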
You can print the counts below the x-axis labels using geom_text if you turn off clipping, but you'll probably have to tweak the placement. I've included a "nudge" parameter for that in the code below. Also, the method below is intended for cases where all the facets (if any) are column facets.
I realize you ultimately want code that will work inside a new geom, but perhaps the examples below can be adapted for use in a geom.
library(ggplot2)
library(dplyr)
library(grid) # for grid.draw()
pgg = function(dat, x, y, facet=NULL, nudge=0.17) {
# Convert x-variable to a factor
dat[,x] = as.factor(dat[,x])
# Plot points
p = ggplot(dat, aes_string(x, y)) +
geom_point(position=position_jitter(w=0.3, h=0)) + theme_bw()
# Summarise data to get counts by x-variable and (if present) facet variables
# group_by_() is deprecated; select the grouping columns by name instead
nn = dat %>% group_by(across(all_of(c(facet, x)))) %>% tally()
# If there are facets, add them to the plot
if (!is.null(facet)) {
p = p + facet_grid(paste("~", paste(facet, collapse="+")))
}
# Add counts as text labels
p = p + geom_text(data=nn, aes(label=paste0("N = ", n)),
y=min(dat[,y]) - nudge*1.05*diff(range(dat[,y])),
colour="grey20", size=3.5) +
theme(axis.title.x=element_text(margin=margin(t=1.5, unit="lines")))
# Turn off clipping and return plot
p <- ggplot_gtable(ggplot_build(p))
p$layout$clip[p$layout$name=="panel"] <- "off"
grid.draw(p)
}
pgg(mtcars, "cyl", "mpg")
pgg(mtcars, "cyl", "mpg", facet=c("am","vs"))
Another, potentially more flexible, option is to add the counts to the bottom of the plot panel. For example:
pgg = function(dat, x, y, facet_r=NULL, facet_c=NULL) {
# Convert x-variable to a factor
dat[,x] = as.factor(dat[,x])
# Plot points
p = ggplot(dat, aes_string(x, y)) +
geom_point(position=position_jitter(w=0.3, h=0)) + theme_bw()
# Summarise data to get counts by x-variable and (if present) facet variables
# group_by_() is deprecated; select the grouping columns by name instead
nn = dat %>% group_by(across(all_of(c(facet_r, facet_c, x)))) %>% tally()
# If there are facets, add them to the plot
if (!is.null(facet_r) | !is.null(facet_c)) {
facets = paste(ifelse(is.null(facet_r),".",facet_r), " ~ " ,
ifelse(is.null(facet_c),".",facet_c))
p = p + facet_grid(facets)
}
# Add counts as text labels
p + geom_text(data=nn, aes(label=paste0("N = ", nn$n)),
y=min(dat[,y]) - 0.15*min(dat[,y]), colour="grey20", size=3) +
scale_y_continuous(limits=range(dat[,y]) + c(-0.1*min(dat[,y]), 0.01*max(dat[,y])))
}
pgg(mtcars, "cyl", "mpg")
pgg(mtcars, "cyl", "mpg", facet_c="am")
pgg(mtcars, "cyl", "mpg", facet_c="am", facet_r="vs")

Custom scatterplot matrix using facet_grid in ggplot2

I'm trying to write a custom scatterplot matrix function in ggplot2 using facet_grid. My data have two categorical variables and one numeric variable.
I'd like to facet (make the scatterplot rows/cols) according to one of the categorical variables and change the plotting symbol according to the other categorical.
I do so by first constructing a larger dataset that includes all combinations (combs) of the categorical variable from which I'm creating the scatterplot panels.
My questions are:
How to use geom_rect to white-out the diagonal and upper panels in facet_grid (I can only make the middle ones black so far)?
How can you move the titles of the facets to the bottom and left hand sides respectively?
How does one remove tick axes and labels for the top left and bottom right facets?
Thanks in advance.
require(ggplot2)
# Data
nC <- 5
nM <- 4
dat <- data.frame(
Control = rep(LETTERS[1:nC], nM),
measure = rep(letters[1:nM], each = nC),
value = runif(nC*nM))
# Change factors to characters
dat <- within(dat, {
Control <- as.character(Control)
measure <- as.character(measure)
})
# Check, lapply(dat, class)
# Define scatterplot() function
scatterplotmatrix <- function(data,...){
controls <- with(data, unique(Control))
measures <- with(data, unique(measure))
combs <- expand.grid(1:length(controls), 1:length(measures), 1:length(measures))
# Add columns for values
combs$value1 = 1
combs$value2 = 0
for ( i in 1:NROW(combs)){
combs[i, "value1"] <- subset(data, subset = Control==controls[combs[i,1]] & measure == measures[combs[i,2]], select = value)
combs[i, "value2"] <- subset(data, subset = Control==controls[combs[i,1]] & measure == measures[combs[i,3]], select = value)
}
for ( i in 1:NROW(combs)){
combs[i,"Control"] <- controls[combs[i,1]]
combs[i,"Measure1"] <- measures[combs[i,2]]
combs[i,"Measure2"] <- measures[combs[i,3]]
}
# Final pairs plot
plt <- ggplot(combs, aes(x = value1, y = value2, shape = Control)) +
geom_point(size = 8, colour = "#F8766D") +
facet_grid(Measure2 ~ Measure1) +
ylab("") +
xlab("") +
scale_x_continuous(breaks = c(0,0.5,1), labels = c("0", "0.5", "1"), limits = c(-0.05, 1.05)) +
scale_y_continuous(breaks = c(0,0.5,1), labels = c("0", "0.5", "1"), limits = c(-0.05, 1.05)) +
geom_rect(data = subset(combs, subset = Measure1 == Measure2), colour='white', xmin = -Inf, xmax = Inf,ymin = -Inf,ymax = Inf)
return(plt)
}
# Call
plt1 <- scatterplotmatrix(dat)
plt1
I'm not aware of a way to move the panel strips (the labels) to the bottom or left. Also, it's not possible to format the individual panels separately (e.g., turn off the tick marks for just one facet). So if you really need these features, you will probably have to use something other than, or in addition to ggplot. You should really look into GGally, although I've never had much success with it.
As far as leaving some of the panels blank, here is a way.
nC <- 5; nM <- 4
set.seed(1) # for reproducible example
dat <- data.frame(Control = rep(LETTERS[1:nC], nM),
measure = rep(letters[1:nM], each = nC),
value = runif(nC*nM),
stringsAsFactors = TRUE) # the subset below relies on factor columns
scatterplotmatrix <- function(data,...){
require(ggplot2)
require(data.table)
require(plyr) # for .(...)
DT <- data.table(data,key="Control")
gg <- DT[DT,allow.cartesian=T]
setnames(gg,c("Control","H","x","V","y"))
fmt <- function(x) format(x,nsmall=1)
plt <- ggplot(gg, aes(x,y,shape = Control)) +
geom_point(subset=.(as.numeric(H)<as.numeric(V)),size=5, colour="#F8766D") +
facet_grid(V ~ H) +
ylab("") + xlab("") +
scale_x_continuous(breaks=c(0,0.5,1), labels=fmt, limits=c(-0.05, 1.05)) +
scale_y_continuous(breaks=c(0,0.5,1), labels=fmt, limits=c(-0.05, 1.05))
return(plt)
}
scatterplotmatrix(dat)
The main feature of this is the use of subset=.(as.numeric(H)<as.numeric(V)) in the call to geom_point(...). This subsets the dataset so you only get a point layer when the condition is met, i.e. in facets where as.numeric(H)<as.numeric(V). This works because I've left the H and V columns as factors, and as.numeric(...) operating on a factor returns the integer level codes, not the labels.
The rest is just a more compact (and much faster) way of creating what you called combs.
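The self-join and the lower-triangle filter can be tried in base R as well. This is only a sketch: merge() stands in for the data.table join, and the measure column is made a factor explicitly, since from R 4.0 on data.frame() no longer converts strings to factors by default:

```r
nC <- 5; nM <- 4
set.seed(1)
dat <- data.frame(Control = rep(LETTERS[1:nC], nM),
                  measure = factor(rep(letters[1:nM], each = nC)),
                  value   = runif(nC * nM))

# Self-join on Control: every pair of measures for each control
pairs <- merge(dat, dat, by = "Control", suffixes = c(".x", ".y"))
names(pairs) <- c("Control", "H", "x", "V", "y")

# as.numeric() on a factor gives the integer codes (a = 1, b = 2, ...)
lower <- pairs[as.numeric(pairs$H) < as.numeric(pairs$V), ]
nrow(pairs)  # all measure pairs for all controls
nrow(lower)  # only the pairs below the diagonal
```

As far as I know, recent ggplot2 versions no longer accept a subset argument in layers, so a pre-filtered table such as lower would be passed instead, e.g. geom_point(data = lower, ...), while the plot itself is built from the full pairs table so that the empty panels are still drawn.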
