Ordered bar graphs using ggplot2 and facet - r

I have a data.frame that looks something like this:
HSP90AA1 SSH2 ACTB TotalTranscripts
ESC_11_TTCGCCAAATCC 8.053308 12.038484 10.557234 33367.23
ESC_10_TTGAGCTGCACT 9.430003 10.687959 10.437068 30285.41
ESC_11_GCCGCGTTATAA 7.953726 9.918988 10.078192 30133.94
ESC_11_GCATTCTGGCTC 11.184402 11.056144 8.316846 24857.07
ESC_11_GTTACATTTCAC 11.943733 11.004500 9.240883 23629.00
ESC_11_CCGTTGCCCCTC 7.441695 9.774733 7.566619 22792.18
The TotalTranscripts column is sorted in descending order. What I'd like to do is generate three bar graphs using ggplot2, one for each column of the data.frame except TotalTranscripts. I'd like the bars to be ordered by TotalTranscripts, just as the data.frame is. It would be ideal to have these bar graphs on one plot using a facet wrap.
Any help would be greatly appreciated! Thank you!
EDIT: Here is my current code using barplot().
cells = "ESC"
genes = c("HSP90AA1", "SSH2", "ACTB")
g = data[genes,grep(cells, colnames(data))]
g = data.frame(t(g), colSums(data)[grep(cells, colnames(data))])
colnames(g)[ncol(g)] = "TotalTranscripts"
g = g[order(g$TotalTranscripts, decreasing=T), , drop=F]
barplot(as.matrix(g[1]), beside=TRUE, names.arg=paste(rownames(g)," (",g$TotalTranscripts,")",sep=""), las=2, col="light blue", cex.names=0.3, main=paste(colnames(g)[1], "\nCells sorted by total number of transcripts (colSums)", sep=""))
This will generate a plot that looks like this.
Again, the problem I seem to be having here is how to have multiple of these plots on the same image. I would like to add 20+ columns to this data.frame but I've cut this down to 3 for the sake of simplicity.
EDIT: Current code incorporating the answer below
cells = "ESC"
genes = rownames(data[x,])[1:8]
# genes = c("HSP90AA1", "SSH2", "ACTB")
g = data[genes,grep(cells, colnames(data))]
g = data.frame(t(g), colSums(data)[grep(cells, colnames(data))])
colnames(g)[ncol(g)] = "TotalTranscripts"
g = g[order(g$TotalTranscripts, decreasing=T), , drop=F]
g$rowz <- row.names(g)
g$Cells <- reorder(g$rowz, rev(g$TotalTranscripts))
df1 <- melt(g, id.vars = c("Cells", "TotalTranscripts"), measure.vars=genes)
ggplot(df1, aes(x = Cells, y = value)) + geom_bar(stat = "identity") +
theme(axis.title.x=element_blank(), axis.text.x = element_blank()) +
facet_wrap(~ variable, scales = "free") +
theme_bw() + theme(axis.text.x = element_text(angle = 90))

Here is the example data for anybody else:
df <- structure(list(HSP90AA1 = c(8.053308, 9.430003, 7.953726, 11.184402,
11.943733, 7.441695), SSH2 = c(12.038484, 10.687959, 9.918988,
11.056144, 11.0045, 9.774733), ACTB = c(10.557234, 10.437068,
10.078192, 8.316846, 9.240883, 7.566619), TotalTranscripts = c(33367.23,
30285.41, 30133.94, 24857.07, 23629, 22792.18)), .Names = c("HSP90AA1",
"SSH2", "ACTB", "TotalTranscripts"), class = "data.frame", row.names = c("ESC_11_TTCGCCAAATCC",
"ESC_10_TTGAGCTGCACT", "ESC_11_GCCGCGTTATAA", "ESC_11_GCATTCTGGCTC",
"ESC_11_GTTACATTTCAC", "ESC_11_CCGTTGCCCCTC"))
And here is a solution:
#New column for row names so they can be used as x-axis elements
df$rowz <- row.names(df)
#Explicitly order the rows by descending TotalTranscripts (see the Kohske link)
#Negating the values works even if the data.frame is not already sorted
df$rowz1 <- reorder(df$rowz, -df$TotalTranscripts)
library(reshape2)
#Melt the data from wide to long
df1 <- melt(df, id.vars = c("rowz1", "TotalTranscripts"),
measure.vars = c("HSP90AA1", "SSH2", "ACTB"))
library(ggplot2)
gp <- ggplot(df1, aes(x = rowz1, y = value)) + geom_bar(stat = "identity") +
facet_wrap(~ variable, scales = "free") +
theme_bw()
gp + theme(axis.text.x = element_text(angle = 90))
This example by Kohske is a constant reference for me on ordering elements in ggplot2.
If you have many columns, but the same six ESC complexes, you can switch the groupings, i.e. x = variable and facet_wrap(~ rowz1), but this fundamentally changes how you are visualizing/comparing your data. Also, consider facet_grid(row ~ column) if you can organize the columns by 2 components (Columns being the data that are melted into 'variable' and 'value').
And this additional SO solution isn't related to your question, but it is an elegant way to reorder elements in each facet by their values (for future reference).
Finally, the method that will give you the finest control is to plot each graph separately and combine the grobs. Baptiste's packages like gridExtra and gtable are useful for these tasks.
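As a rough sketch of that last option (not part of the original answer): build one plot per gene and hand the list to gridExtra. It assumes ggplot2 >= 3.0.0 for the .data pronoun; the cell column and the three-row data.frame are illustrative, taken from the first rows of the example data.

```r
library(ggplot2)
library(gridExtra)

# First three rows of the example data, with the row names as a column
df <- data.frame(
  cell = c("ESC_11_TTCGCCAAATCC", "ESC_10_TTGAGCTGCACT", "ESC_11_GCCGCGTTATAA"),
  HSP90AA1 = c(8.053308, 9.430003, 7.953726),
  SSH2 = c(12.038484, 10.687959, 9.918988),
  ACTB = c(10.557234, 10.437068, 10.078192),
  TotalTranscripts = c(33367.23, 30285.41, 30133.94)
)

# Order the cells by descending TotalTranscripts
df$cell <- reorder(df$cell, -df$TotalTranscripts)

genes <- c("HSP90AA1", "SSH2", "ACTB")

# One independent ggplot per gene; each can be themed separately
plots <- lapply(genes, function(gene) {
  ggplot(df, aes(x = cell, y = .data[[gene]])) +
    geom_col() +
    labs(title = gene, x = NULL, y = NULL) +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90))
})

# Combine the plots into one gtable and draw it
combined <- arrangeGrob(grobs = plots, ncol = 3)
grid::grid.draw(combined)
```

Because every panel is its own plot, you can change scales, titles or themes per gene, at the cost of losing the shared facet strips.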
**EDIT in response to new information from OP**
The OP has subsequently asked how to visualize the data, especially when there are more ESC categorical variables (up to 600+).
Here are some examples, with the big caveat that with many categorical variables, they should be grouped or converted to a continuous variable somehow.
#Plot colour to a few discrete, categorical variables
gp + aes(fill = rowz1) +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) +
labs(x = NULL, fill = "Cell", title = "Discrete categorical variables")
#Plot colour on a continuous scale.
#Ultimately, not appropriate for this example! (but shown for reference)
#More appropriate: fill = TotalTranscripts
gp + aes(fill = as.numeric(rowz1)) +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) +
labs(x = NULL, title = "Continuous variables (legend won't work for many values)") +
scale_fill_gradient2(name = "Cell",
breaks = as.numeric(df1$rowz1),
labels = df1$rowz1,
midpoint=median(as.numeric(df1$rowz1)))
#x is continuous, colour plotted to the categorical variable.
#Same caveats as earlier.
gp1 <- ggplot(df1, aes(x = TotalTranscripts/1000, y = value, colour = rowz1)) +
geom_point(size=3) + facet_wrap(~ variable, scales = "free") +
labs(title = "X is an actual continuous variable") +
theme_bw() + labs(x = bquote("Total Transcripts,"~10^3), colour = "Cell")
gp1
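The "More appropriate: fill = TotalTranscripts" comment above could look something like the sketch below. It is self-contained, so it rebuilds a small version of the data (first three rows only); the cell column name and the choice of gradient colours are my own assumptions, not something from the question.

```r
library(ggplot2)
library(reshape2)

# First three rows of the example data
df <- data.frame(
  HSP90AA1 = c(8.053308, 9.430003, 7.953726),
  SSH2 = c(12.038484, 10.687959, 9.918988),
  ACTB = c(10.557234, 10.437068, 10.078192),
  TotalTranscripts = c(33367.23, 30285.41, 30133.94),
  row.names = c("ESC_11_TTCGCCAAATCC", "ESC_10_TTGAGCTGCACT", "ESC_11_GCCGCGTTATAA")
)

# Cells as a factor ordered by descending TotalTranscripts
df$cell <- reorder(rownames(df), -df$TotalTranscripts)

# Wide to long: one row per cell/gene pair
df1 <- melt(df, id.vars = c("cell", "TotalTranscripts"),
            measure.vars = c("HSP90AA1", "SSH2", "ACTB"))

# Fill maps to the continuous TotalTranscripts value, so the legend
# stays a single colour bar no matter how many cells there are
p <- ggplot(df1, aes(x = cell, y = value, fill = TotalTranscripts)) +
  geom_col() +
  facet_wrap(~ variable, scales = "free") +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  theme_bw() +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) +
  labs(x = NULL, fill = "Total transcripts")
p
```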


How can I plot 2 related variables on the same axis using ggplot? [duplicate]

Edit: This question has been marked as duplicated, but the responses here have been tried and did not work because the case in question is a line chart, not a bar chart. Applying those methods produces a chart with 5 lines, 1 for each year - not useful. Did anyone who voted to mark as duplicate actually try those approaches on the sample dataset supplied with this question? If so please post as an answer.
Original Question:
There's a feature in Excel pivot charts which allows multilevel categorical axes. I'm trying to find a way to do the same thing with ggplot (or any other plotting package in R).
Consider the following dataset:
set.seed(1)
df=data.frame(year=rep(2009:2013,each=4),
quarter=rep(c("Q1","Q2","Q3","Q4"),5),
sales=40:59+rnorm(20,sd=5))
If this is imported to an Excel pivot table, it is straightforward to create the following chart:
Note how the x-axis has two levels, one for quarter and one for the grouping variable, year. Are multilevel axes possible with ggplot?
NB: There is a hack with facets that produces something similar, but this is not what I'm looking for.
library(ggplot2)
ggplot(df) +
geom_line(aes(x=quarter,y=sales,group=year))+
facet_grid(.~year,scales="free")
New labels are added using annotate(geom = "text", ...). Turn off clipping of the x axis labels with clip = "off" in coord_cartesian.
Use theme to add extra margins (plot.margin) and to remove (element_blank()) the x axis title and text (axis.title.x, axis.text.x) and the vertical grid lines (panel.grid.major.x, panel.grid.minor.x).
library(ggplot2)
ggplot(data = df, aes(x = interaction(year, quarter, lex.order = TRUE),
y = sales, group = 1)) +
geom_line(colour = "blue") +
annotate(geom = "text", x = seq_len(nrow(df)), y = 34, label = df$quarter, size = 4) +
annotate(geom = "text", x = 2.5 + 4 * (0:4), y = 32, label = unique(df$year), size = 6) +
coord_cartesian(ylim = c(35, 65), expand = FALSE, clip = "off") +
theme_bw() +
theme(plot.margin = unit(c(1, 1, 4, 1), "lines"),
axis.title.x = element_blank(),
axis.text.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank())
See also the nice answer by @eipi10 here: Axis labels on two lines with nested x variables (year below months)
The code suggested by Henrik works and helped me a lot! I think the solution is very valuable. But please be aware that there is a small mistake in the first line of the code, which puts the data in the wrong order.
Instead of
... aes(x = interaction(year,quarter), ...
it should be
... aes(x = interaction(quarter,year), ...
The resulting graphic has the data in the right order.
P.S. I suggested an edit (which has been rejected so far) and, being slightly short of reputation, I am not allowed to comment, which I would rather have done.
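To see why the argument order matters, it helps to print the factor levels that interaction() produces, since ggplot lays out a discrete x axis in level order. This is only a quick base R check using the same year/quarter layout as the question's data; the variable names are mine.

```r
# Same year/quarter layout as the question's data (sales are not needed here)
year <- rep(2009:2013, each = 4)
quarter <- rep(c("Q1", "Q2", "Q3", "Q4"), 5)

# Default: the levels of the FIRST argument vary fastest
default_order <- levels(interaction(year, quarter))
head(default_order, 4)
# "2009.Q1" "2010.Q1" "2011.Q1" "2012.Q1"

# lex.order = TRUE: the levels of the LAST argument vary fastest
lex_order <- levels(interaction(year, quarter, lex.order = TRUE))
head(lex_order, 4)
# "2009.Q1" "2009.Q2" "2009.Q3" "2009.Q4"

# Swapping the arguments also gives chronological order
swapped <- levels(interaction(quarter, year))
head(swapped, 4)
# "Q1.2009" "Q2.2009" "Q3.2009" "Q4.2009"
```

So either interaction(quarter, year) or interaction(year, quarter, lex.order = TRUE) puts the quarters of each year next to each other on the axis.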
User Tung had a great answer on this thread
library(tidyverse)
library(lubridate)
library(scales)
set.seed(123)
df <- tibble(
date = as.Date(41000:42000, origin = "1899-12-30"),
value = c(rnorm(500, 5), rnorm(501, 10))
)
# create year column for facet
df <- df %>%
mutate(year = as.factor(year(date)))
p <- ggplot(df, aes(date, value)) +
geom_line() +
geom_vline(xintercept = as.numeric(df$date[yday(df$date) == 1]), color = "grey60") +
scale_x_date(date_labels = "%b",
breaks = pretty_breaks(),
expand = c(0, 0)) +
# switch the facet strip label to the bottom
facet_grid(.~ year, space = 'free_x', scales = 'free_x', switch = 'x') +
labs(x = "") +
theme_classic(base_size = 14, base_family = 'mono') +
theme(panel.grid.minor.x = element_blank()) +
# remove facet spacing on x-direction
theme(panel.spacing.x = unit(0,"line")) +
# switch the facet strip label to outside
# remove background color
theme(strip.placement = 'outside',
strip.background.x = element_blank())
p

How to show important values on a graph with ggplot?

How do I show the specific values of variables on a graph?
For example:
ggplot(data=df)+
geom_bar(mapping=aes(x=var))
How do I get it to have the actual count on the bar chart?
I believe this question has been asked before but I couldn't find a duplicate quickly.
Here is an example how to annotate the columns of a bar chart with the counts:
n_row <- 100L
set.seed(123L)
df <- data.frame(var = sample(LETTERS[1:5], n_row, TRUE, 5:1))
library(ggplot2)
ggplot(data = df) + aes(x = var) +
geom_bar() +
stat_count(geom = "text", aes(label = ..count..), vjust = "bottom")
Alternatively, we can write
ggplot(data = df) + aes(x = var, label = ..count..) +
geom_bar() +
geom_text(stat = "count", vjust = "bottom")
Some geoms and stats compute variables which can be accessed using special names like ..count.. . To plot labels, the x and y positions and the text need to be specified. The x position is taken from the data as specified in aes(). The y position seems to be taken automatically from the statistical transformation, but the text needs to be specified explicitly.
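In ggplot2 3.3.0 and later the same computed variable can be written as after_stat(count), and newer releases warn about the ..count.. notation. A minimal sketch of the second variant in that spelling, using the same simulated data; the layer_data() call at the end is only there to inspect what the stat computed.

```r
library(ggplot2)

# Same simulated data as above
set.seed(123L)
df <- data.frame(var = sample(LETTERS[1:5], 100L, TRUE, 5:1))

p <- ggplot(data = df) + aes(x = var) +
  geom_bar() +
  # after_stat(count) refers to the count computed by stat = "count"
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = "bottom")
p

# Inspect the computed data behind the text layer (layer 2)
labels <- layer_data(p, 2)
labels[, c("x", "count", "label")]
```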
Suggested reading:
Statistical transformations in R for Data Science
ggplot2 homepage

R Side-by-side grouped boxplot

I have temporal data of gas emissions from two species of plant, both of which have been subjected to the same treatments. With some previous help to get this code together [edit]:
soilflux = read.csv("soil_fluxes.csv")
library(ggplot2)
soilflux$Treatment <- factor(soilflux$Treatment,levels=c("L-","C","L+"))
soilplot = ggplot(soilflux, aes(factor(Week), Flux, fill=Species, alpha=Treatment)) + stat_boxplot(geom ='errorbar') + geom_boxplot()
soilplot = soilplot + labs(x = "Week", y = "Flux (mg m-2 d-1)") + theme_bw(base_size = 12, base_family = "Helvetica")
soilplot
This produces the plot below, which works well but has its flaws.
Whilst it conveys all the information I need it to, despite Google trawls and looking through here I just couldn't get the 'Treatment' part of the legend to show that L- is light and L+ darkest. I've also been told that a monochrome colour scheme is easier to differentiate hence I'm trying to get something like this where the legend is clear.
(source: biomedcentral.com)
As a workaround you could create a combined factor from species and treatment and assign the fill colors manually:
library(ggplot2)
library(RColorBrewer)
d <- expand.grid(week = factor(1:4), species = factor(c("Heisteria", "Simarouba")),
trt = factor(c("C", "L-", "L+"), levels = c("L-", "C", "L+")))
d <- d[rep(1:24, each = 30), ]
d$flux <- runif(NROW(d))
# Create a combined factor for coding the color
d$spec.trt <- interaction(d$species, d$trt, lex.order = TRUE, sep = " - ")
ggplot(d, aes(x = week, y = flux, fill = spec.trt)) +
stat_boxplot(geom ='errorbar') + geom_boxplot() +
scale_fill_manual(values = c(brewer.pal(3, "Greens"), brewer.pal(3, "Reds")))

Histogram of stacked boxes in ggplot2

The graph I'm currently trying to make falls a little between two stools. I want to make a histogram that is composed of stacked and labelled boxes. Here's an example of exactly the sort of thing I'm talking about, taken from a recent article in the New York Times:
http://farm8.staticflickr.com/7109/7026409819_1d2aaacd0a.jpg
Is it possible to achieve this using ggplot2?
To amplify the question somewhat, so far what I have is:
dfr <- data.frame(
name = LETTERS[1:26],
percent = rnorm(26, mean=15)
)
ggplot(dfr, aes(x=percent, fill=name)) + geom_bar() +
stat_bin(geom="text", aes(label=name))
...which I'm clearly doing all wrong. Ultimately what I'd ideally like is something along the lines of the manually-modified graph below, with (say) letters A to M filled one shade and N to Z filled another.
http://farm8.staticflickr.com/7116/7026536711_4df9a1aa12.jpg
Here you go!
library(plyr)     # ddply(), mutate()
library(ggplot2)
set.seed(3421)
# added type to mimic which candidate is supported
dfr <- data.frame(
  name = LETTERS[1:26],
  percent = rnorm(26, mean=15),
  type = sample(c("A", "B"), 26, replace = TRUE)
)
# easier to prepare data in advance. uses two ideas
# 1. calculate histogram bins (quite flexible)
# 2. calculate frequencies and label positions
dfr <- transform(dfr, perc_bin = cut(percent, 5))
dfr <- ddply(dfr, .(perc_bin), mutate,
  freq = length(name), pos = cumsum(freq) - 0.5*freq)
# start plotting. key steps are
# 1. plot bars, filled by type and grouped by name
#    (stat = "identity" draws y as given; reverse = TRUE stacks the
#    first name at the bottom so the boxes line up with pos)
# 2. plot labels using name at position pos
# 3. get rid of grid, border, background, y axis text and labels
ggplot(dfr, aes(x = perc_bin)) +
  geom_bar(aes(y = freq, group = name, fill = type), colour = 'gray',
    stat = "identity", position = position_stack(reverse = TRUE),
    show.legend = FALSE) +
  geom_text(aes(y = pos, label = name), colour = 'white') +
  scale_fill_manual(values = c('red', 'orange')) +
  theme_bw() + xlab("") + ylab("") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
    axis.ticks = element_blank(), panel.border = element_blank(),
    axis.text.y = element_blank())

Heatmaps in R using ggplot function - how to cluster rows?

I am currently generating heatmaps in R using the ggplot function. In the code below, I first read the data into a dataframe, remove any duplicate rows, factorise the timestamp field, melt the dataframe (according to 'timestamp'), scale all variables between 0 and 1, then plot the heatmap.
In the resulting heatmap, time is plotted on the x axis and each iostat-sda variable (see sample data below) is plotted along the y axis. Note: If you want to try out the R code – you can paste the sample data below into a file called iostat-sda.csv.
However, I really need to be able to cluster the rows within this heatmap. Does anyone know how this can be achieved using the ggplot function?
Any help would be very much appreciated!!
############################## The code
library(ggplot2)
library(reshape2)  # melt()
library(plyr)      # ddply()
library(scales)    # rescale()
fileToAnalyse <- read.csv(file="iostat-sda.csv", header=TRUE, sep=",")
fileToAnalyse <- subset(fileToAnalyse, !duplicated(timestamp))
fileToAnalyse[,1] <- factor(fileToAnalyse[,1])
fileToAnalyse.m <- melt(fileToAnalyse, id=1)
fileToAnalyse.s <- ddply(fileToAnalyse.m, .(variable), transform, rescale = rescale(value) ) #scales each variable between 0 and 1
base_size <- 9
ggplot(fileToAnalyse.s, aes(timestamp, variable)) +
  geom_tile(aes(fill = rescale), colour = "black") +
  scale_fill_gradient(low = "black", high = "white") +
  theme_grey(base_size = base_size) + labs(title = "Heatmap", x = "Time", y = "") +
  theme(legend.position = "right", axis.text.x = element_blank(), axis.ticks = element_blank()) +
  scale_y_discrete(expand = c(0, 0)) + scale_x_discrete(expand = c(0, 0))
########################## Sample data from iostat-sda.csv
timestamp,DSKRRQM,DSKWRQM,DSKR,DSKW,DSKRMB,DSKWMB,DSKARQS,DSKAQUS,DSKAWAIT,DSKSVCTM,DSKUtil
1319204905,0.33,0.98,10.35,2.37,0.72,0.02,120.00,0.01,0.40,0.31,0.39
1319204906,1.00,4841.00,682.00,489.00,60.09,40.68,176.23,2.91,2.42,0.50,59.00
1319204907,0.00,1600.00,293.00,192.00,32.64,13.89,196.45,5.48,10.76,2.04,99.00
1319204908,0.00,3309.00,1807.00,304.00,217.39,26.82,236.93,4.84,2.41,0.45,96.00
1319204909,0.00,5110.00,93.00,427.00,0.72,43.31,173.43,4.43,8.67,1.90,99.00
1319204910,0.00,6345.00,115.00,496.00,0.96,52.25,178.34,4.00,6.32,1.62,99.00
1319204911,0.00,6793.00,129.00,666.00,1.33,57.22,150.83,4.74,6.16,1.26,100.00
1319204912,0.00,6444.00,115.00,500.00,0.93,53.06,179.77,4.20,6.83,1.58,97.00
1319204913,0.00,1923.00,835.00,215.00,78.45,16.68,185.55,4.81,4.58,0.91,96.00
1319204914,0.00,0.00,788.00,0.00,83.51,0.00,217.04,0.45,0.57,0.25,20.00
1319204915,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
1319204916,0.00,4.00,2.00,4.00,0.01,0.04,17.67,0.00,0.00,0.00,0.00
1319204917,0.00,8.00,4.00,8.00,0.02,0.09,17.83,0.00,0.00,0.00,0.00
1319204918,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
1319204919,0.00,2.00,113.00,4.00,11.96,0.03,209.93,0.06,0.51,0.43,5.00
1319204920,0.00,59.00,147.00,54.00,11.15,0.63,120.02,0.04,0.20,0.15,3.00
1319204921,1.00,19.00,57.00,18.00,4.68,0.20,133.47,0.07,0.93,0.67,5.00
There is a nice package called NeatMap which simplifies generating heatmaps in ggplot2. Some of the row clustering methods include Multidimensional Scaling, PCA, or hierarchical clustering. Things to watch out for are:
Data to make.heatmap1 has to be in wide format
Data has to be a matrix, not a dataframe
Assign rownames to the wide-format matrix before generating the plot
I've changed your code slightly to avoid naming variables the same as existing functions (i.e. rescale):
fileToAnalyse.s <- ddply(fileToAnalyse.m, .(variable), transform, rescale.x = rescale(value) ) #scales each variable between 0 and 1
fileToAnalyse.w <- dcast(fileToAnalyse.s, timestamp ~ variable, value.var="rescale.x")
rownames(fileToAnalyse.w) <- as.character(fileToAnalyse.w[, 1])
ggheatmap <- make.heatmap1(as.matrix(fileToAnalyse.w[, -1]), row.method = "complete.linkage", row.metric="euclidean", column.cluster.method ="none", row.labels = rownames(fileToAnalyse.w)) +
scale_fill_gradient(low = "black", high = "white") + labs(x = "Time", y = "") + ggtitle("Heatmap")
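If you would rather stay with plain ggplot2, one alternative (not the NeatMap approach above, just a sketch of the general idea) is to cluster with base R's hclust() and use the resulting order for the factor levels of the y axis. The matrix here is random placeholder data standing in for the rescaled iostat variables, and all object names are made up for the example.

```r
library(ggplot2)

# Placeholder data: 10 timestamps (rows) x 6 variables (columns), values in [0, 1]
set.seed(1)
mat <- matrix(runif(60), nrow = 10,
              dimnames = list(paste0("t", 1:10), paste0("V", 1:6)))

# Hierarchical clustering of the variables (the heatmap rows);
# dist() works on rows, so the matrix is transposed first
hc <- hclust(dist(t(mat)), method = "complete")

# Long format; as.vector() reads the matrix column by column, which
# matches rep(..., times) for timestamps and rep(..., each) for variables
long <- data.frame(
  timestamp = factor(rep(rownames(mat), times = ncol(mat)), levels = rownames(mat)),
  variable = factor(rep(colnames(mat), each = nrow(mat)),
                    levels = colnames(mat)[hc$order]),  # clustered order
  value = as.vector(mat)
)

p <- ggplot(long, aes(timestamp, variable)) +
  geom_tile(aes(fill = value), colour = "black") +
  scale_fill_gradient(low = "black", high = "white") +
  labs(title = "Heatmap", x = "Time", y = "")
p
```

The heatmap rows then follow the dendrogram's leaf order, so similar variables sit next to each other, although no dendrogram is drawn.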
