ggplot2 stacked barplots, formatting, and grids

ggplot2 stacked barplots, formatting, and grids - r

In the data that I am attempting to plot, each sample belongs in one of several groups, that will be plotted on their own grids. I am plotting stacked bar plots for each sample that will be ordered in increasing number of sequences, which is an id attribute of each sample.
Currently, the plot (with some random data) looks like this:
(Since I don't have the required 10 rep for images, I am linking it here)
There are couple things I need to accomplish. And I don't know where to start.
I would like the bars not to be placed at its corresponding nseqs value, rather placed next to each other in ascending nseqs order.
I don't want each grid to have the same scale. Everything needs to fit snugly.
I have tried to set scales and size to for facet_grid to free_x, but this results in an unused argument error. I think this is related to the fact that I have not been able to get the scales library loaded properly (it keeps saying not available).
Code that deals with plotting:
ggfdata <- melt(fdata, id.var=c('group','nseqs','sample'))
p <- ggplot(ggfdata, aes(x=nseqs, y=value, fill = variable)) +
geom_bar(stat='identity') +
facet_grid(~group) +
scale_y_continuous() +
opts(title=paste('Taxonomic Distribution - grouped by',colnames(meta.frame)[i]))

Try this:
update.packages()
## I'm assuming your ggplot2 is out of date because you use opts()
## If the scales library is unavailable, you might need to update R
ggfdata <- melt(fdata, id.var=c('group','nseqs','sample'))
ggfdata$nseqs <- factor(ggfdata$nseqs)
## Making nseqs a factor will stop ggplot from treating it as a numeric,
## which sounds like what you want
p <- ggplot(ggfdata, aes(x=nseqs, y=value, fill = variable)) +
geom_bar(stat='identity') +
facet_wrap(~group, scales="free_x") + ## No need for facet_grid with only one variable
labs(title = paste('Taxonomic Distribution - grouped by',colnames(meta.frame)[i]))

Related

What is the purpose of using facet_grid(variable ~ .) instead of just using facet_wrap?

So I'm self-teaching myself R right now using this online resource: "https://r4ds.had.co.nz/data-visualisation.html#facets"
This particular section is going over the use of facet_wrap and facet_grid. It's clear to me that facet_grid is primarily used when wanting to visualize a plot along two additional dimensions, rather than just one. What I don't understand is why you can use facet_grid(.~variable) or facet_grid(variable~.) to basically achieve the same result as facet_wrap. Putting a "." in place of a variable results in just not faceting along the row or column dimension, or in other words showing 1 additional variable just as facet_wrap would do.
If anyone can shed some light on this, thank you!

If you use facet_grid, the facets will always be in one row/column. They will never wrap to make a rectangle. But really if you just have one variable with few levels, it doesn't much matter.
You can also see that facet_grid(.~variable) and facet_grid(variable~.) will put the facet labels in different places (row headings vs column headings)
mg <- ggplot(mtcars, aes(x = mpg, y = wt)) + geom_point()
mg + facet_grid(vs~ .) + labs(title="facet_grid(vs~ .)"),
mg + facet_grid(.~ vs) + labs(title="facet_grid(.~ vs)")
So in the most simple of cases, there's nothing that different between them. The main reason to use facet_grid is to have a single, common axis for all facets so you can easily scan across all panels to make a direct comparison of data.

Actually, the same result is not produced all the time...
The number of facets which appear across the graphs pane is fixed with facet_grid (always the number of unique values in the variable) where as facet_wrap, like its name suggests, wraps the facets around the graphics pane. In this way the functions only result in the same graph when the number of facets produced is small.
Both facet_grid and facet_wrap take their arguments in the form row~columns, and nowdays we don't need to use the dot with facet_grid.
In order to compare their differences let's add a new variable with 8 unqiue values to the mtcars data set:
library(tidyverse)
mtcars$example <- rep(1:8, length.out = 32)
ggplot()+
geom_point(data = mtcars, aes(x = mpg, y = wt))+
facet_grid(~example, labeller = label_both)
Which results in a cluttered plot:
Compared to:
ggplot()+
geom_point(data = mtcars, aes(x = mpg, y = wt))+
facet_wrap(~example, labeller = label_both)
Which results in:

How to make stacked bar chart with count values on y axis>

I'm trying to create a stacked barchart with gene sequencing data, where for each gene there is a tRF.type and Amino.Acid value. An example data set looks like this:
tRF <- c('tRF-26-OB1690PQR3E', 'tRF-27-OB1690PQR3P', 'tRF-30-MIF91SS2P46I')
tRF.type <- c('5-tRF', 'i-tRF', '3-tRF')
Amino.Acid <- c('Ser', 'Lys', 'Ser')
tRF.data <- data.frame(tRF, tRF.type, Amino.Acid)
I would like the x-axis to represent the amino acid type, the y-axis the number of counts of each tRF type and the the fill of the bars to represent each tRF type.
My code is:
ggplot(chart_data, aes(x = Amino.Acid, y = tRF.type, fill = tRF.type)) +
geom_bar(stat="identity") +
ggtitle("LAN5 - 4 days post CNTF treatment") +
xlab("Amino Acid") +
ylab("tRF type")
However, it generates this graph, where the y-axis is labelled with the categories of tRF type. How can I change my code so that the y-axis scale is numerical and represents the counts of each tRF type?
Barchart

OP and Welcome to SO. In future questions, please, be sure to provide a minimal reproducible example - meaning provide code, an image (if possible), and at least a representative dataset that can demonstrate your question or problem clearly.
TL;DR - don't use stat="identity", just use geom_bar() without providing a stat, since default is to use the counts. This should work:
ggplot(chart_data, aes(x = Amino.Acid, fill = tRF.type)) + geom_bar()
The dataset provided doesn't adequately demonstrate your issue, so here's one that can work. The example data herein consists of 100 observations and two columns: one called Capitals for randomly-selected uppercase letters and one Lowercase for randomly-selected lowercase letters.
library(ggplot2)
set.seed(1234)
df <- data.frame(
Capitals=sample(LETTERS, 100, replace=TRUE),
Lowercase=sample(letters, 100, replace=TRUE)
)
If I plot similar to your code, you can see the result:
ggplot(df, aes(x=Capitals, y=Lowercase, fill=Lowercase)) +
geom_bar(stat="identity")
You can see, the bars are stacked, but the y axis is all smooshed down. The reason is related to understanding the difference between geom_bar() and geom_col(). Checking the documentation for these functions, you can see that the main difference is that geom_col() will plot bars with heights equal to the y aesthetic, whereas geom_bar() plots by default according to stat="count". In fact, using geom_bar(stat="identity") is really just a complicated way of saying geom_col().
Since your y aesthetic is not numeric, ggplot still tries to treat the discrete levels numerically. It doesn't really work out well, and it's the reason why your axis gets smooshed down like that. What you want, is geom_bar(stat="count").... which is the same as just using geom_bar() without providing a stat=.
The one problem is that geom_bar() only accepts an x or a y aesthetic. This means you should only give it one of them. This fixes the issue and now you get the proper chart:
ggplot(df, aes(x=Capitals, fill=Lowercase)) + geom_bar()

You want your y-axis to be a count, not tRF.type. This code should give you the correct plot: I've removed the y = tRF.type from ggplot(), and stat = "identity from geom_bar() (it is using the default value of stat = "count instead).
ggplot(tRF.data, aes(x = Amino.Acid, fill = tRF.type)) +
geom_bar() +
ggtitle("LAN5 - 4 days post CNTF treatment") +
xlab("Amino Acid") +
ylab("tRF type")

increase distance between stack of geom_line()

I have some diffraction data from XRD. I'd like to plot it all in one chart but stacked. Because the range of y is quite large, stacking is not so straight forward. there's a link to data if you wish to play and the simple script is below
https://www.dropbox.com/s/b9kyubzncwxge9j/xrd.csv?dl=0
library(dplyr)
library(ggplot2)
#load it up
xrd <- read.csv("xrd.csv")
#melt it
xrd.m = melt(xrd, id.var="Degrees_2_Theta")
# Reorder so factor levels are grouped together
xrd.m$variable = factor(xrd.m$variable,
levels=sort(unique(as.character(xrd.m$variable))))
names(xrd.m)[names(xrd.m) == "variable"] <- "Sample"
names(xrd.m)[names(xrd.m) == "Degrees_2_Theta"] <- "angle"
#colours use for nearly everything
cbPalette <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
#plot
ggplot(xrd.m, aes(angle, value, colour=Sample, group=Sample)) +
geom_line(position = "stack") +
scale_colour_manual(values=cbPalette) +
theme_linedraw() +
theme(legend.position = "none",
axis.text.y=element_blank(),
axis.ticks.y=element_blank()) +
labs(x="Degrees 2-theta", y="Intensity - stacked for clarity")
Here is the plot- as you can see it's not quite stacked
Here is something I had in excel a way back. ugly - but slightly better
I'm not sure that I will actually use the stacked plot function from R because I find it always looks off from past experience and instead might use the same data manipulation I used from excel.

It seems that you have a different understanding of the result of applying position="stack" on your geom_line() than what actually is happening. What you're looking to do is probably best served by either using faceting or creating a ridgeline plot. I will give you solutions for both of those approaches here with some example data (sorry, I don't click dropbox links and they will eventually break anyway).
What does position="stack" actually do?
The result of position="stack" will be that your y values of each line will be added, or "stacked", together in the resulting plot. That means that the lines as drawn will only actually accurately reflect the actual value in the data for one of the lines, and the other will be "added on top" of that (stacked). The behavior is best illustrated via an example:
ex <- data.frame(x=c(1,1,2,2,3,3), y=c(1,5,1,2,1,1), grp=rep(c('A','B'),3))
ggplot(ex, aes(x,y, color=grp)) + geom_line()
The y values for "A" are equal to 1 at all values of x. This is the same as indicating position="identity". Now, let's see what happens if we use position="stack":
ggplot(ex, aes(x,y, color=grp)) + geom_line(position="stack")
You should see, the value of y plotted for "B" is equal to B, whereas the y value for "A" is actually the value for "A" added to the value for "B". Hope that makes sense.
Faceting
What you're trying to do is take the overlapping lines you have and "separate" them vertically, right? That's not quite stacking, as you likely want to maintain their y values as position="identity" (the default). One way to do that quite easily is to use faceting, which creates what you could call "stacked plots" according to one or two variables in your dataset. In this case, I'm using example data (for reasons outlined above), but you can use this to understand how you want to arrange your own data.
set.seed(1919191)
df <- data.frame(
x=rep(1:100, 5),
y=c(rnorm(100,0,0.1), rnorm(100,0,0.2), rnorm(100,0,0.3), rnorm(100,0,0.4), rnorm(100,0,0.5)),
sample_name=c(rep('A',100), rep('B',100), rep('C',100), rep('D',100), rep('E',100)))
# plot code
p <- ggplot(df, aes(x,y, color=sample_name))
p + geom_line() + facet_grid(sample_name ~ .)
Create a Ridgeline Plot
The other way that kind of does the same thing is to create what is known as a ridgeline plot. You can do this via the package ggridges and here's an example using geom_ridgeline():
p + geom_ridgeline(
aes(y=sample_name, height=y),
fill=NA, scale=1, min_height=-Inf)
The idea here is to understand that geom_ridgeline() changes your y axis to be the grouping variable (so we actually have to redefine that in aes()), and the actual y value for each of those groups should be assigned to the height= aesthetic. If you have data that has negative y values (now height= values), you'll also want to set the min_height=, or it will cut them off at 0 by default. You can also change how much each of the groups are separated by playing with scale= (does not always change in the way you think it would, btw).

compare boxplots with a single value

I want to compare the distribution of several variables (here X1 and X2) with a single value (here bm). The issue is that these variables are too many (about a dozen) to use a single boxplot.
Additionaly the levels are too different to use one plot. I need to use facets to make things more organised:
However with this plot my benchmark category (bm), which is a single value in X1 and X2, does not appear in X1 and seems to have several values in X2. I want it to be only this green line, which it is in the first plot. Any ideas why it changes? Is there any good workaround? I tried the options of facet_wrap/facet_grid, but nothing there delivered the right result.
I also tried combining a bar plot with bm and three empty categories with the boxplot. But firstly it looked terrible and secondly it got similarly screwed up in the facetting. Basically any work around would help.
Below the code to create the minimal example displayed here:
# Creating some sample data & loading libraries
library(ggplot2)
library(RColorBrewer)
set.seed(10111)
x=matrix(rnorm(40),20,2)
y=rep(c(-1,1),c(10,10))
x[y==1,]=x[y==1,]+1
x[,2]=x[,2]+20
df=data.frame(x,y)
# creating a benchmark point
benchmark=data.frame(y=rep("bm",2),key=c("X1","X2"),value=c(-0.216936,20.526312))
# melting the data frame, rbinding it with the benchmark
test_dat=rbind(tidyr::gather(df,key,value,-y),benchmark)
# Creating a plot
p_box <- ggplot(data = test_dat, aes(x=key, y=value,color=as.factor(test_dat$y))) +
geom_boxplot() + scale_color_manual(name="Cluster",values=brewer.pal(8,"Set1"))
# The first line delivers the first plot, the second line the second plot
p_box
p_box + facet_wrap(~key,scales = "free",drop = FALSE) + theme(legend.position = "bottom")

The problem only lies int the use of test_dat$y inside the color aes. Never use $ in aes, ggplot will mess up.
Anyway, I think you plot would improve if you use a geom_hline for the benchmark, instead of hacking in a single value boxplot:
library(ggplot2)
library(RColorBrewer)
ggplot(tidyr::gather(df,key,value,-y)) +
geom_boxplot(aes(x=key, y=value, color=as.factor(y))) +
geom_hline(data = benchmark, aes(yintercept = value), color = '#4DAF4A', size = 1) +
scale_color_manual(name="Cluster",values=brewer.pal(8,"Set1")) +
facet_wrap(~key,scales = "free",drop = FALSE) +
theme(legend.position = "bottom")

Adding text to facetted histogram

Using ggplot2 I have made facetted histograms using the following code.
library(ggplot2)
library(plyr)
df1 <- data.frame(monthNo = rep(month.abb[1:5],20),
classifier = c(rep("a",50),rep("b",50)),
values = c(seq(1,10,length.out=50),seq(11,20,length.out=50))
)
means <- ddply (df1,
c(.(monthNo),.(classifier)),
summarize,
Mean=mean(values)
)
ggplot(df1,
aes(x=values, colour=as.factor(classifier))) +
geom_histogram() +
facet_wrap(~monthNo,ncol=1) +
geom_vline(data=means, aes(xintercept=Mean, colour=as.factor(classifier)),
linetype="dashed", size=1)
The vertical line showing means per month is to stay.
But I want to also add text over these vertical lines displaying the mean values for each month. These means are from the 'means' data frame.
I have looked at geom_text and I can add text to plots. But it appears my circumstance is a little different and not so easy. It's a lot simpler to add text in some cases where you just add values of the plotted data points. But cases like this when you want to add the mean and not the value of the histograms I just can't find the solution.
Please help. Thanks.

Having noted the possible duplicate (another answer of mine), the solution here might not be as (initially/intuitively) obvious. You can do what you need if you split the geom_text call into two (for each classifier):
ggplot(df1, aes(x=values, fill=as.factor(classifier))) +
geom_histogram() +
facet_wrap(~monthNo, ncol=1) +
geom_vline(data=means, aes(xintercept=Mean, colour=as.factor(classifier)),
linetype="dashed", size=1) +
geom_text(y=0.5, aes(x=Mean, label=Mean),
data=means[means$classifier=="a",]) +
geom_text(y=0.5, aes(x=Mean, label=Mean),
data=means[means$classifier=="b",])
I'm assuming you can format the numbers to the appropriate precision and place them on the y-axis where you need to with this code.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

ggplot2 stacked barplots, formatting, and grids - r

Related

What is the purpose of using facet_grid(variable ~ .) instead of just using facet_wrap?

How to make stacked bar chart with count values on y axis>

increase distance between stack of geom_line()

compare boxplots with a single value

Adding text to facetted histogram

Categories

Resources