The graph I'm currently trying to make falls a little between two stools. I want to make a histogram that is composed of stacked and labelled boxes. Here's an example of exactly the sort of thing I'm talking about, taken from a recent article in the New York Times:
http://farm8.staticflickr.com/7109/7026409819_1d2aaacd0a.jpg
Is it possible to achieve this using ggplot2?
To amplify the question somewhat, so far what I have is:
dfr <- data.frame(
name = LETTERS[1:26],
percent = rnorm(26, mean=15)
)
ggplot(dfr, aes(x=percent, fill=name)) + geom_bar() +
stat_bin(geom="text", aes(label=name))
...which I'm clearly doing all wrong. Ultimately what I'd ideally like is something along the lines of the manually-modified graph below, with (say) letters A to M filled one shade and N to Z filled another.
http://farm8.staticflickr.com/7116/7026536711_4df9a1aa12.jpg
Here you go!
library(plyr)     # ddply() and mutate()
library(ggplot2)
set.seed(3421)
# added type to mimic which candidate is supported
dfr <- data.frame(
name = LETTERS[1:26],
percent = rnorm(26, mean=15),
type = sample(c("A", "B"), 26, replace = TRUE)
)
# easier to prepare data in advance. uses two ideas
# 1. calculate histogram bins (quite flexible)
# 2. calculate frequencies and label positions
dfr <- transform(dfr, perc_bin = cut(percent, 5))
dfr <- ddply(dfr, .(perc_bin), mutate,
freq = length(name), pos = cumsum(freq) - 0.5*freq)
# start plotting. key steps are
# 1. plot bars, filled by type and grouped by name
# 2. plot labels using name at position pos
# 3. get rid of grid, border, background, y axis text and labels
# note for current ggplot2: theme()/element_blank() replace opts()/theme_blank(),
# show.legend replaces show_guide, stat = "identity" is required when y is mapped,
# and reverse = TRUE stacks the boxes in the same order used to compute pos
ggplot(dfr, aes(x = perc_bin)) +
geom_bar(aes(y = freq, group = name, fill = type), stat = "identity",
position = position_stack(reverse = TRUE), colour = 'gray',
show.legend = FALSE) +
geom_text(aes(y = pos, label = name), colour = 'white') +
scale_fill_manual(values = c('red', 'orange')) +
theme_bw() + xlab("") + ylab("") +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
axis.ticks = element_blank(), panel.border = element_blank(),
axis.text.y = element_blank())
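A variation on the data preparation above, offered as a sketch rather than a replacement. Because freq is the bin count repeated on every row, each box is drawn freq units tall, so a bin holding n names reaches a height of n × n. If the bin heights should equal the number of names instead, one option is to give every box a height of 1 and draw it with geom_tile(). The column name row_in_bin is made up for this example.

```r
library(ggplot2)

set.seed(3421)
dfr <- data.frame(
  name = LETTERS[1:26],
  percent = rnorm(26, mean = 15),
  type = sample(c("A", "B"), 26, replace = TRUE)
)

# Histogram bins, as in the answer
dfr$perc_bin <- cut(dfr$percent, 5)

# Number the rows inside each bin (1, 2, 3, ...); this becomes the y position,
# so every box is exactly one unit tall
dfr$row_in_bin <- ave(seq_along(dfr$name), dfr$perc_bin, FUN = seq_along)

p <- ggplot(dfr, aes(x = perc_bin, y = row_in_bin)) +
  geom_tile(aes(fill = type), width = 0.9, colour = "gray", show.legend = FALSE) +
  geom_text(aes(label = name), colour = "white") +
  scale_fill_manual(values = c("red", "orange")) +
  theme_bw() + xlab("") + ylab("") +
  theme(panel.grid = element_blank(), panel.border = element_blank(),
        axis.ticks = element_blank(), axis.text.y = element_blank())
p
```

Since the label and the box share the same x and y, no separate label position needs to be computed.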
Related
I need to make 5 plots of bacteria species. Each plot has a different number of species present in a range of 30-90. I want each bacteria to always have the same color in all plots, therefore I need to set an assigned color to each name.
I tried to use scale_colour_manual to create a color set, but the resulting scale has only 16 colors. How can I increase the number of colors it contains?
The code I am using can be replicated as follows:
colour_genus <- stringi::stri_rand_strings(90, 5) #to be random names
nb.cols = nrow(colour_genus) #to set the length of my string
MyPalette = colorRampPalette(brewer.pal(12,"Set1"))(nb.cols) # the palette of choice
colGenus <- scale_color_manual(name = colour_genus, values = MyPalette)
The output formed contains only 16 values, so when I try to apply it to a figure with 90 factors, it complains I have only 16 values
abundance <- runif(90, min = 10, max = 100)
my_data <- data.frame(colour_genus, abundance)
p <- ggplot(my_data, aes(x = colour_genus, y= abundance)) +
geom_bar(aes(color = colour_genus, fill = colour_genus), stat = "identity", position = "stack") +
labs(x = "", y = "Relative Abundance\n") +
theme(panel.background = element_blank())
p + theme(legend.text= element_text(size=7, face="bold"), axis.text.x = element_text(angle = 90)) + guides(fill=guide_legend(ncol=2)) + scale_fill_manual(values=colGenus)
The following error shows:
Error: Insufficient values in manual scale. 90 needed but only 16 provided.
Thank you very much for your help.
If you know all 90 bacterial names before plotting, you can try the following.
library(ggplot2)
set.seed(123)
# random names; the vector is sorted only to make the output easier to read (optional)
colour_genus <- sort(stringi::stri_rand_strings(90, 5))
MyPalette <- sample(colors(), length(colour_genus))
# named vector for scale_fill
names(MyPalette) <- colour_genus
# data
abundance <- runif(90, min = 10, max = 100)
my_data <- data.frame(colour_genus, abundance)
# two sets to show results
set1 <- my_data[20:30,]
set2 <- my_data[25:35,]
ggplot(set1, aes(x = colour_genus, y= abundance)) +
geom_col(aes(fill = colour_genus)) +
scale_fill_manual(values = MyPalette)
ggplot(set2, aes(x = colour_genus, y= abundance)) +
geom_col(aes(fill = colour_genus)) +
scale_fill_manual(values = MyPalette)
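If interpolated rather than randomly sampled colours are preferred, the same named-vector idea can be sketched in base R alone. Two things in the question's code are worth noting: nrow() returns NULL for a plain character vector (length() is the right call), and scale_fill_manual(values = ...) expects a vector of colours, not another scale object. The names genus_names, anchor_cols and palette_90 and the five anchor colours below are arbitrary choices for this illustration.

```r
# Hypothetical genus names standing in for the real ones
genus_names <- sprintf("genus_%02d", 1:90)

# nrow() is NULL for a vector, so colorRampPalette() would get no usable count
is.null(nrow(genus_names))

# Arbitrary anchor colours to interpolate between
anchor_cols <- c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3", "#FF7F00")

n_cols <- length(genus_names)
palette_90 <- setNames(colorRampPalette(anchor_cols)(n_cols), genus_names)

# A named vector like this can be passed to scale_fill_manual(values = palette_90);
# each subset of the data then picks up its colours by name
head(palette_90, 3)
```

Interpolating 90 colours from a handful of anchors gives many near-identical neighbours, so this is a trade-off against the randomly sampled palette in the answer.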
I work with some RNA-seq data, and I need to plot a heatmap with dots on particular gene transcripts. I cannot figure out how to do this with ggplot or pheatmap, so I have to use Inkscape to manually put every dot on the plot. It's exhausting and a waste of time. Below is the image from Inkscape:
I've made the basic plot with this code:
pal <- colorRampPalette(c("blue","white","red"))
a <- pal(200)
my_sample_col <- data.frame(Condition = c("ALZxCon","PAxCon","PSPxCon"))
rownames(my_sample_col) <- colnames(transcript.table[,1:3])
my_colour <- list(Condition = c(ALZxCon = "lightblue", PAxCon = "pink", PSPxCon = "yellow"))
pheatmap(transcript.table[,1:3], annotation_col = my_sample_col,
annotation_colors = my_colour[1], color = a, show_colnames = FALSE,
cellheight = 15, cex = 1, cluster_rows = FALSE, cluster_cols = FALSE,
fontsize_row = 10, gaps_col = c(1,2), cellwidth = 15)
Where transcript table is something like this:
log2FC(AZ) log2FC(PA) log2FC(PSP) Sig(AZ) Sig(PA) Sig(PSP)
ABCA7_ENST000002633094 -0.2 -0.3 -0.2 Not Sig FDR<0.05 FDR<0.05
ABCA7_ENST0000043319 -0.6 -0.37 -0.7 FDR<0.05 FDR<0.05 FDR<0.05
I want to generate a heatmap where the square of the transcripts with FDR < 0.05 gets a black dot. Can you guys help with this?
I'm personally not an enormous fan of functions such as pheatmap, precisely because you can't customise every detail you would want. I'll show an alternative with ggplot2.
First things first, ggplot likes data in a long format, which I would do as follows:
# Loading in your data
z <- "log2FC(AZ),log2FC(PA),log2FC(PSP),Sig(AZ),Sig(PA),Sig(PSP)
ABCA7_ENST000002633094,-0.2,-0.3,-0.2,Not Sig,FDR<0.05,FDR<0.05
ABCA7_ENST0000043319,-0.6,-0.37,-0.7,FDR<0.05,FDR<0.05,FDR<0.05"
tab <- read.table(text=z, header = T, sep = ",")
# Converting to long format
lfc <- tab[,1:3]
pval <- tab[,4:6]
colnames(lfc) <- colnames(pval) <- c("AZ", "PA", "PSP")
lfc <- reshape2::melt(as.matrix(lfc))
pval <- reshape2::melt(as.matrix(pval))
df <- cbind(lfc, pval = pval$value)
Which will get us our main ingredients for the heatmap and the significance dots, but we would need a little extra data.frame for some annotation:
anno <- data.frame(x = levels(df$Var2),
y = "Condition")
Now the trick in getting this annotation to work nicely with the heatmap is a package called ggnewscale, which will allow us to set both a continuous fill for the heatmap and a discrete fill for the annotation. What remains is to make the actual plot, wherein I've tried to conserve some aspects of the pheatmap function in your example.
library(ggplot2)
library(ggnewscale)
ggplot(df, aes(Var2, Var1)) +
# Important for ggnewscale is to specify a fill in the layer/geom itself
geom_tile(aes(fill = value),
width = 0.9, colour = "grey50") +
geom_point(data = df[df$pval == "FDR<0.05",]) +
scale_fill_gradientn(colours = c("blue", "white", "red"),
limits = c(-1,1)*max(abs(df$value)),
name = expression(atop("Log"[2]*" Fold","Change"))) +
# Set new scale fill after you've specified the scale for the heatmap
new_scale_fill() +
geom_tile(data = anno, aes(x, y, fill = x),
width = 0.9, height = 0.8, colour = "grey50") +
scale_fill_discrete(name = "Condition") +
scale_x_discrete(name = "", expand = c(0,0)) +
scale_y_discrete(name = "", expand = c(0,0),
limits = c(levels(df$Var1), "Condition"),
position = "right") +
coord_equal() +
theme(panel.background = element_blank(),
axis.ticks = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_text(face = c(rep("plain", nlevels(df$Var1)), "bold")))
Which looks like this:
Mix and match the ggplot code as you please.
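For readers who would rather stay with pheatmap, a possible alternative is its display_numbers argument, which accepts a character matrix with the same dimensions as the data and prints its contents in the cells. The sketch below rebuilds the two example rows from the question by hand; the object names lfc, sig and dots are made up here, and the pheatmap call is guarded so the data preparation runs even where the package is not installed.

```r
# Values copied from the two example rows in the question
lfc <- matrix(c(-0.2, -0.3,  -0.2,
                -0.6, -0.37, -0.7),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("ABCA7_ENST000002633094", "ABCA7_ENST0000043319"),
                              c("AZ", "PA", "PSP")))
sig <- matrix(c("Not Sig",  "FDR<0.05", "FDR<0.05",
                "FDR<0.05", "FDR<0.05", "FDR<0.05"),
              nrow = 2, byrow = TRUE, dimnames = dimnames(lfc))

# A dot where the transcript is significant, an empty string elsewhere
dots <- ifelse(sig == "FDR<0.05", "\u25CF", "")

if (requireNamespace("pheatmap", quietly = TRUE)) {
  pheatmap::pheatmap(lfc,
                     color = colorRampPalette(c("blue", "white", "red"))(200),
                     display_numbers = dots, number_color = "black",
                     cluster_rows = FALSE, cluster_cols = FALSE,
                     cellwidth = 15, cellheight = 15)
}
```

The annotation arguments from the question can be added back to the call unchanged.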
I'm plotting a time series value with its percentages using facet_wrap in ggplot:
In the plot below, the upper panel shows the value and the lower panel shows the percentage change. I would like the y-axis labels of the lower panel to be formatted as percentages. Normally in ggplot I would do something like
+ scale_y_continuous(labels = scales::percent)
But since I'm using facet_wrap, how do I specify that I only want one of the 2 plots' y-axis label to be percentages?
P.S. Here is the code to generate this plot:
library(data.table)
library(ggplot2)
library(scales)
library(dplyr)
pct <- function(x) {x/lag(x)-1}
Dates = seq(from = as.Date("2000-01-01"),
to =as.Date("2018-10-01"),
by = "1 month")
set.seed(1024)
this_raw = data.frame(CM = Dates,
value = rnorm(n = length(Dates)),
variable = rep("FAKE",length(Dates)))
# mutate_each() and funs() are deprecated in current dplyr; mutate() does the same here
this_diff = na.omit(as.data.table(this_raw %>%
group_by(variable) %>%
mutate(value = pct(value))))
this_diff$type = "PerCng"
this_raw$type = "RAW"
plot_all = rbindlist(list(this_raw,this_diff))
plot_all$type = factor(plot_all$type, levels = c("RAW", "PerCng"))
out_gg = plot_all %>%
ggplot(aes(x=CM, y=value)) +
geom_line(color = "royalblue3") +
theme(legend.position='bottom')+
ggtitle("FAKE DATA") +
facet_wrap(~ type, scales = "free_y", nrow = 2,
strip.position = "left",
labeller = as_labeller(c(RAW = "Original", PerCng = "% Change") ) )+
scale_x_date(date_breaks = "12 month", date_labels = "%Y-%m",
date_minor_breaks = "3 month")+
ylab("")+
theme(plot.title = element_text(hjust = 0.5,size = 12),
axis.text.x = element_text(size = 6,angle = 45, hjust = 1),
axis.text.y = element_text(size = 6),
axis.title.y = element_text(size = 6)) +
theme(strip.background = element_blank(),
strip.placement = "outside")+
theme(legend.title=element_blank())
print(out_gg)
I agree with the above comments that facets are really not intended for this use case. Aligning separate plots is the orthodox way to go.
That said, if you already have a bunch of nicely formatted ggplot objects, and really don't want to refactor the code just for axis labels, you can convert them to grob objects and dig underneath the hood:
library(grid)
# Convert from ggplot object to grob object
gp <- ggplotGrob(out_gg)
# Optional: Plot out the grob version to verify that nothing has changed (yet)
grid.draw(gp)
# Also optional: Examine the underlying grob structure to figure out which grob name
# corresponds to the appropriate y-axis label. In this case, it's "axis-l-2-1": axis
# to the left of plot panels, 2nd row / 1st column of the facet matrix.
gp[["layout"]]
gtable::gtable_show_layout(gp)
# Some of gp's grobs only generate their contents at drawing time.
# Using grid.force replaces such grobs with their drawing time content (if you check
# your global environment, the size of gp should increase significantly after running
# the grid.force line).
# This step is necessary in order to use gPath() to generate the path to nested grobs
# (& the text grob for y-axis labels is nested rather deeply inside the rabbit hole).
gp <- grid.force(gp)
path.to.label <- gPath("axis-l-2", "axis", "axis", "GRID.text")
# Get original label
old.label <- getGrob(gTree = gp,
gPath = path.to.label,
grep = TRUE)[["label"]]
# Edit label values
new.label <- percent(as.numeric(old.label))
# Overwrite ggplot grob, replacing old label with new
gp = editGrob(grob = gp,
gPath = path.to.label,
label = new.label,
grep = TRUE)
# plot
grid.draw(gp)
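For comparison, here is a minimal sketch of the "align separate plots" route mentioned at the top of this answer. It assumes the patchwork package is installed, which the original post does not use, and it rebuilds a simplified version of the question's data in base R. Each plot gets its own y scale, so only the lower one is formatted as a percentage.

```r
library(ggplot2)
library(patchwork)  # assumption: not part of the original post

set.seed(1024)
dates <- seq(as.Date("2000-01-01"), as.Date("2018-10-01"), by = "1 month")
raw <- data.frame(CM = dates, value = rnorm(length(dates)))

# Same definition as pct() in the question: x / lag(x) - 1 (first value undefined)
raw$pct_change <- c(NA, raw$value[-1] / raw$value[-nrow(raw)] - 1)

p_raw <- ggplot(raw, aes(CM, value)) +
  geom_line(colour = "royalblue3") +
  labs(x = NULL, y = "Original")

p_pct <- ggplot(raw[-1, ], aes(CM, pct_change)) +
  geom_line(colour = "royalblue3") +
  scale_y_continuous(labels = scales::label_percent()) +
  labs(x = NULL, y = "% Change")

# "/" stacks the two plots vertically and aligns their panels
combined <- p_raw / p_pct
combined
```

The theme and x-axis settings from the question can be added to both plots in the usual way.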
Edit: This question has been marked as duplicated, but the responses here have been tried and did not work because the case in question is a line chart, not a bar chart. Applying those methods produces a chart with 5 lines, 1 for each year - not useful. Did anyone who voted to mark as duplicate actually try those approaches on the sample dataset supplied with this question? If so please post as an answer.
Original Question:
There's a feature in Excel pivot charts which allows multilevel categorical axes. I'm trying to find a way to do the same thing with ggplot (or any other plotting package in R).
Consider the following dataset:
set.seed(1)
df=data.frame(year=rep(2009:2013,each=4),
quarter=rep(c("Q1","Q2","Q3","Q4"),5),
sales=40:59+rnorm(20,sd=5))
If this is imported to an Excel pivot table, it is straightforward to create the following chart:
Note how the x-axis has two levels, one for quarter and one for the grouping variable, year. Are multilevel axes possible with ggplot?
NB: There is a hack with facets that produces something similar, but this is not what I'm looking for.
library(ggplot2)
ggplot(df) +
geom_line(aes(x=quarter,y=sales,group=year))+
facet_grid(.~year,scales="free")
New labels are added with annotate(geom = "text", ...). Clipping of the labels below the x axis is turned off with clip = "off" in coord_cartesian.
Use theme to add extra margin (plot.margin) and to remove (element_blank()) the x axis title and text (axis.title.x, axis.text.x) and the vertical grid lines (panel.grid.major.x, panel.grid.minor.x).
library(ggplot2)
ggplot(data = df, aes(x = interaction(year, quarter, lex.order = TRUE),
y = sales, group = 1)) +
geom_line(colour = "blue") +
annotate(geom = "text", x = seq_len(nrow(df)), y = 34, label = df$quarter, size = 4) +
annotate(geom = "text", x = 2.5 + 4 * (0:4), y = 32, label = unique(df$year), size = 6) +
coord_cartesian(ylim = c(35, 65), expand = FALSE, clip = "off") +
theme_bw() +
theme(plot.margin = unit(c(1, 1, 4, 1), "lines"),
axis.title.x = element_blank(),
axis.text.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank())
See also the nice answer by @eipi10 here: Axis labels on two lines with nested x variables (year below months)
The code suggested by Henrik does work and helped me a lot! I think the solution is very valuable. But please be aware that there is a small mistake in the first line of the code, which results in the data being in the wrong order.
Instead of
... aes(x = interaction(year,quarter), ...
it should be
... aes(x = interaction(quarter,year), ...
The resulting graphic has the data in the right order.
P.S. I suggested an edit (which has been rejected so far) and, because I do not yet have enough reputation, I am not allowed to comment, which I would rather have done.
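The ordering issue discussed above can be checked directly in base R, without plotting. By default interaction() lets the levels of its first argument vary fastest, so the argument order (or lex.order = TRUE) decides whether the quarters of one year stay together. A small sketch with two years and two quarters:

```r
year    <- rep(2009:2010, each = 2)
quarter <- rep(c("Q1", "Q2"), times = 2)

# First argument varies fastest: quarters of the same year are split apart
levels(interaction(year, quarter))
# "2009.Q1" "2010.Q1" "2009.Q2" "2010.Q2"

# Swapping the arguments keeps each year's quarters together
levels(interaction(quarter, year))
# "Q1.2009" "Q2.2009" "Q1.2010" "Q2.2010"

# So does lex.order = TRUE with the original argument order
levels(interaction(year, quarter, lex.order = TRUE))
# "2009.Q1" "2009.Q2" "2010.Q1" "2010.Q2"
```

Either of the last two forms puts the points on the x axis in chronological order.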
User Tung had a great answer on this thread
library(tidyverse)
library(lubridate)
library(scales)
set.seed(123)
df <- tibble(
date = as.Date(41000:42000, origin = "1899-12-30"),
value = c(rnorm(500, 5), rnorm(501, 10))
)
# create year column for facet
df <- df %>%
mutate(year = as.factor(year(date)))
p <- ggplot(df, aes(date, value)) +
geom_line() +
geom_vline(xintercept = as.numeric(df$date[yday(df$date) == 1]), color = "grey60") +
scale_x_date(date_labels = "%b",
breaks = pretty_breaks(),
expand = c(0, 0)) +
# switch the facet strip label to the bottom
facet_grid(.~ year, space = 'free_x', scales = 'free_x', switch = 'x') +
labs(x = "") +
theme_classic(base_size = 14, base_family = 'mono') +
theme(panel.grid.minor.x = element_blank()) +
# remove facet spacing on x-direction
theme(panel.spacing.x = unit(0,"line")) +
# switch the facet strip label to outside
# remove background color
theme(strip.placement = 'outside',
strip.background.x = element_blank())
p
I have a data.frame that looks something like this:
HSP90AA1 SSH2 ACTB TotalTranscripts
ESC_11_TTCGCCAAATCC 8.053308 12.038484 10.557234 33367.23
ESC_10_TTGAGCTGCACT 9.430003 10.687959 10.437068 30285.41
ESC_11_GCCGCGTTATAA 7.953726 9.918988 10.078192 30133.94
ESC_11_GCATTCTGGCTC 11.184402 11.056144 8.316846 24857.07
ESC_11_GTTACATTTCAC 11.943733 11.004500 9.240883 23629.00
ESC_11_CCGTTGCCCCTC 7.441695 9.774733 7.566619 22792.18
The TotalTranscripts column is sorted in descending order. What I'd like to do is generate three bar graphs using ggplot2, one for each column of the data.frame except TotalTranscripts. I'd like the bars to be ordered by TotalTranscripts, just as in the data.frame. It would be ideal to have these bar graphs on one plot using a facet wrap.
Any help would be greatly appreciated! Thank you!
EDIT: Here is my current code using barplot().
cells = "ESC"
genes = c("HSP90AA1", "SSH2", "ACTB")
g = data[genes,grep(cells, colnames(data))]
g = data.frame(t(g), colSums(data)[grep(cells, colnames(data))])
colnames(g)[ncol(g)] = "TotalTranscripts"
g = g[order(g$TotalTranscripts, decreasing=T), , drop=F]
barplot(as.matrix(g[1]), beside=TRUE, names.arg=paste(rownames(g)," (",g$TotalTranscripts,")",sep=""), las=2, col="light blue", cex.names=0.3, main=paste(colnames(g)[1], "\nCells sorted by total number of transcripts (colSums)", sep=""))
This will generate a plot that looks like this.
Again, the problem I seem to be having here is how to have multiple of these plots on the same image. I would like to add 20+ columns to this data.frame but I've cut this down to 3 for the sake of simplicity.
EDIT: Current code incorporating the answer below
cells = "ESC"
genes = rownames(data[x,])[1:8]
# genes = c("HSP90AA1", "SSH2", "ACTB")
g = data[genes,grep(cells, colnames(data))]
g = data.frame(t(g), colSums(data)[grep(cells, colnames(data))])
colnames(g)[ncol(g)] = "TotalTranscripts"
g = g[order(g$TotalTranscripts, decreasing=T), , drop=F]
g$rowz <- row.names(g)
g$Cells <- reorder(g$rowz, rev(g$TotalTranscripts))
df1 <- melt(g, id.vars = c("Cells", "TotalTranscripts"), measure.vars=genes)
ggplot(df1, aes(x = Cells, y = value)) + geom_bar(stat = "identity") +
theme(axis.title.x=element_blank(), axis.text.x = element_blank()) +
facet_wrap(~ variable, scales = "free") +
theme_bw() + theme(axis.text.x = element_text(angle = 90))
Here is the example data for anybody else:
df <- structure(list(HSP90AA1 = c(8.053308, 9.430003, 7.953726, 11.184402,
11.943733, 7.441695), SSH2 = c(12.038484, 10.687959, 9.918988,
11.056144, 11.0045, 9.774733), ACTB = c(10.557234, 10.437068,
10.078192, 8.316846, 9.240883, 7.566619), TotalTranscripts = c(33367.23,
30285.41, 30133.94, 24857.07, 23629, 22792.18)), .Names = c("HSP90AA1",
"SSH2", "ACTB", "TotalTranscripts"), class = "data.frame", row.names = c("ESC_11_TTCGCCAAATCC",
"ESC_10_TTGAGCTGCACT", "ESC_11_GCCGCGTTATAA", "ESC_11_GCATTCTGGCTC",
"ESC_11_GTTACATTTCAC", "ESC_11_CCGTTGCCCCTC"))
And here is a solution:
#New column for row names so they can be used as x-axis elements
df$rowz <- row.names(df)
#Explicitly order the rows (see the Kohske link)
df$rowz1 <- reorder(df$rowz, rev(df$TotalTranscripts))
library(reshape2)
#Melt the data from wide to long
df1 <- melt(df, id.vars = c("rowz1", "TotalTranscripts"),
measure.vars = c("HSP90AA1", "SSH2", "ACTB"))
library(ggplot2)
gp <- ggplot(df1, aes(x = rowz1, y = value)) + geom_bar(stat = "identity") +
facet_wrap(~ variable, scales = "free") +
theme_bw()
gp + theme(axis.text.x = element_text(angle = 90))
This example by Kohske is a constant reference for me on ordering elements in ggplot2.
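One caveat about the ordering step, shown as a small base R sketch: reorder(df$rowz, rev(df$TotalTranscripts)) gives the intended order only because the data frame is already sorted by TotalTranscripts in descending order. Negating the score does not depend on the row order. The data frame cells and its values below are made up for the illustration.

```r
cells <- data.frame(
  cell = c("c1", "c2", "c3", "c4"),          # hypothetical cell names
  TotalTranscripts = c(250, 900, 100, 600)   # deliberately unsorted
)

# Levels are sorted by increasing score, so negating puts the largest total first
cells$cell_ordered <- reorder(cells$cell, -cells$TotalTranscripts)
levels(cells$cell_ordered)
# "c2" "c4" "c1" "c3"
```

Using cell_ordered on the x axis then draws the bars from the highest to the lowest total, whatever order the rows are stored in.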
If you have many columns, but the same six ESC complexes, you can switch the groupings, i.e. x = variable and facet_wrap(~ rowz1), but this fundamentally changes how you are visualizing/comparing your data. Also, consider facet_grid(row ~ column) if you can organize the columns by 2 components (Columns being the data that are melted into 'variable' and 'value').
And this additional SO solution isn't related to your question, but it is an elegant way to reorder elements in each facet by their values (for future reference).
Finally, the method that will give you the finest control is to plot each graph separately and combine the grobs. Baptiste's packages like gridExtra and gtable are useful for these tasks.
**EDIT in response to new information from OP**
The OP has subsequently asked how to visualize the data, especially when there are more ESC categorical variables (up to 600+).
Here are some examples, with the big caveat that with many categorical variables, they should be grouped or converted to a continuous variable somehow.
#Plot colour to a few discrete, categorical variables
gp + aes(fill = rowz1) +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) +
labs(x = NULL, fill = "Cell", title = "Discrete categorical variables")
#Plot colour on a continuous scale.
#Ultimately, not appropriate for this example! (but shown for reference)
#More appropriate: fill = TotalTranscripts
gp + aes(fill = as.numeric(rowz1)) +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) +
labs(x = NULL, title = "Continuous variables (legend won't work for many values)") +
scale_fill_gradient2(name = "Cell",
breaks = as.numeric(df1$rowz1),
labels = df1$rowz1,
midpoint=median(as.numeric(df1$rowz1)))
#x is continuous, colour plotted to the categorical variable.
#Same caveats as earlier.
gp1 <- ggplot(df1, aes(x = TotalTranscripts/1000, y = value, colour = rowz1)) +
geom_point(size=3) + facet_wrap(~ variable, scales = "free") +
labs(title = "X is an actual continuous variable") +
theme_bw() + labs(x = bquote("Total Transcripts,"~10^3), colour = "Cell")
gp1