How to create geom_boxplot with large amount of continuous x-variables - r

I have a data frame which contains x-axis numeric bins and continuous y-axis data across multiple categories. Initially, I created a boxplot by making the x-axis bins "factors", and doing a boxplot of the melted data. Reproducible data:
x <- seq(1,10,by=1)
y1 <- rnorm(10, mean=3)
y2 <- rnorm(10, mean=10)
y3<- rnorm(10, mean=1)
y4<- rnorm(10, mean=8)
y5<- rnorm(10, mean=12)
df <- data.frame(x,y1,y2,y3,y4,y5)
df.m <- melt(df, id="x")
My code to create the x-axis data as a factor:
df.m$x <- as.factor(df.m$x)
My ggplot:
ggplot(df.m, aes(x=x, y=value))+
geom_boxplot(notch=FALSE, outlier.shape=NA, fill="red", alpha=0.1)+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
The resulting plot:
:
The problem is that I cannot use x-axis numeric spacing because the x-axis is categorized as a factor, which has equal spacing. I want to be able to use something like scale_x_continuous to manipulate the axis breaks and spacing to, say, an interval of 2, rather than a boxplot every 1, but when I try to plot the data with the x-axis "as.numeric", I just get one boxplot of all of the data:
Any suggestions for a way to get this continuous-looking boxplot curve (the first image) while still being able to control the numeric properties of the x-axis? Thanks!

Here is a way using the original data you posted on Google - which actually was much more helpful, IMO.
ggplot(df, aes(x=CH, y=value,group=CH))+
geom_boxplot(notch=FALSE, outlier.shape=NA, fill="red", alpha=0.2)+
scale_x_log10()
So, as #BenBolker said before he deleted his answer(??), you should leave the x-variable (CH) as numeric, and set group=CH in the call to aes(...).
With your real data there is another problem though. Your CH is more or less logarithmically spaced, so there are about as many points < 1 as there are between 1 - 10, etc. ggplot wants to make the boxes all the same size, so with a linear x-axis the box width is smaller than the line width, and you don't see the boxes at all. Changing the x-axis to a logarithmic scale fixes that, more or less.

Don't make x a factor. You need to aesthetically map a group that is a factor determining which box the value is associated with, luckily, after melting, this is what you variable column is:
ggplot(df.m, aes(x = x, y = value, group = variable)) +
geom_boxplot()
As x is still numeric, you can give it whatever values you want within a specific variable level and the boxplot will show up at that spot. Or you could transform the x axis, etc.

Related

How to make stacked bar chart with count values on y axis>

I'm trying to create a stacked barchart with gene sequencing data, where for each gene there is a tRF.type and Amino.Acid value. An example data set looks like this:
tRF <- c('tRF-26-OB1690PQR3E', 'tRF-27-OB1690PQR3P', 'tRF-30-MIF91SS2P46I')
tRF.type <- c('5-tRF', 'i-tRF', '3-tRF')
Amino.Acid <- c('Ser', 'Lys', 'Ser')
tRF.data <- data.frame(tRF, tRF.type, Amino.Acid)
I would like the x-axis to represent the amino acid type, the y-axis the number of counts of each tRF type and the the fill of the bars to represent each tRF type.
My code is:
ggplot(chart_data, aes(x = Amino.Acid, y = tRF.type, fill = tRF.type)) +
geom_bar(stat="identity") +
ggtitle("LAN5 - 4 days post CNTF treatment") +
xlab("Amino Acid") +
ylab("tRF type")
However, it generates this graph, where the y-axis is labelled with the categories of tRF type. How can I change my code so that the y-axis scale is numerical and represents the counts of each tRF type?
Barchart
OP and Welcome to SO. In future questions, please, be sure to provide a minimal reproducible example - meaning provide code, an image (if possible), and at least a representative dataset that can demonstrate your question or problem clearly.
TL;DR - don't use stat="identity", just use geom_bar() without providing a stat, since default is to use the counts. This should work:
ggplot(chart_data, aes(x = Amino.Acid, fill = tRF.type)) + geom_bar()
The dataset provided doesn't adequately demonstrate your issue, so here's one that can work. The example data herein consists of 100 observations and two columns: one called Capitals for randomly-selected uppercase letters and one Lowercase for randomly-selected lowercase letters.
library(ggplot2)
set.seed(1234)
df <- data.frame(
Capitals=sample(LETTERS, 100, replace=TRUE),
Lowercase=sample(letters, 100, replace=TRUE)
)
If I plot similar to your code, you can see the result:
ggplot(df, aes(x=Capitals, y=Lowercase, fill=Lowercase)) +
geom_bar(stat="identity")
You can see, the bars are stacked, but the y axis is all smooshed down. The reason is related to understanding the difference between geom_bar() and geom_col(). Checking the documentation for these functions, you can see that the main difference is that geom_col() will plot bars with heights equal to the y aesthetic, whereas geom_bar() plots by default according to stat="count". In fact, using geom_bar(stat="identity") is really just a complicated way of saying geom_col().
Since your y aesthetic is not numeric, ggplot still tries to treat the discrete levels numerically. It doesn't really work out well, and it's the reason why your axis gets smooshed down like that. What you want, is geom_bar(stat="count").... which is the same as just using geom_bar() without providing a stat=.
The one problem is that geom_bar() only accepts an x or a y aesthetic. This means you should only give it one of them. This fixes the issue and now you get the proper chart:
ggplot(df, aes(x=Capitals, fill=Lowercase)) + geom_bar()
You want your y-axis to be a count, not tRF.type. This code should give you the correct plot: I've removed the y = tRF.type from ggplot(), and stat = "identity from geom_bar() (it is using the default value of stat = "count instead).
ggplot(tRF.data, aes(x = Amino.Acid, fill = tRF.type)) +
geom_bar() +
ggtitle("LAN5 - 4 days post CNTF treatment") +
xlab("Amino Acid") +
ylab("tRF type")

Is there a way to change a continuous axis' labels to be strings in ggplot?

I am creating a line chart in ggplot2 with a continuous y-axis but I would like to change the labels to strings to represent bounds, is this possible? For example, I want all values V<10 = A, 10>V<20 = B, and V>20 = C.
Y<-c(5,15,12,8,12,13,19,24)
Day<-c(1,2,3,4,5,6,7,8)
data <- data.frame(Y, Day)
ggplot(data= data, aes(x=Day, y=Y, group=1))+geom_line()+geom_point()
The label of 10 would be replaced by A, 20 by B and so on.
Hope this is a little clearer, thank you.
You can use scale_y_continuous and relabel the scale using the labels= argument. The key here is to also use breaks= which map back to the original numeric scale and then I'm also using limits= to ensure that we can see the whole alphabet.
ggplot(data= data, aes(x=Day, y=Y, group=1))+geom_line()+geom_point() +
scale_y_continuous(labels=LETTERS, breaks=1:26, limits=c(1,26))

How to plot density of points in one dimension with different factors in ggplot2

I am attempting to place individual points on a plot using ggplot2, however as there are many points, it is difficult to gauge how densely packed the points are. Here, there are two factors being compared against a continuous variable, and I want to change the color of the points to reflect how closely packed they are with their neighbors. I am using the geom_point function in ggplot2 to plot the points, but I don't know how to feed it the right information on color.
Here is the code I am using:
s1 = rnorm(1000, 1, 10)
s2 = rnorm(1000, 1, 10)
data = data.frame(task_number = as.factor(c(replicate(100, 1),
replicate(100, 2))),
S = c(s1, s2))
ggplot(data, aes(x = task_number, y = S)) + geom_point()
Which generates this plot:
However, I want it to look more like this image, but with one dimension rather than two (which I borrowed from this website: https://slowkow.com/notes/ggplot2-color-by-density/):
How do I change the colors of the first plot so it resembles that of the second plot?
I think the tricky thing about this is you want to show the original values, and evaluate the density at those values. I borrowed ideas from here to achieve that.
library(dplyr)
data = data %>%
group_by(task_number) %>%
# Use approxfun to interpolate the density back to
# the original points
mutate(dens = approxfun(density(S))(S))
ggplot(data, aes(x = task_number, y = S, colour = dens)) +
geom_point() +
scale_colour_viridis_c()
Result:
One could, of course come up with a meausure of proximity to neighbouring values for each value... However, wouldn't adjusting the transparency basically achieve the same goal (gauging how densely packed the points are)?
geom_point(alpha=0.03)

Wrong density values in a histogram with `fill` option in `ggplot2`

I was creating histograms with ggplot2 in R whose bins are separated with colors and noticed one thing. When the bins of a histogram are separated by colors with fill option, the density value of the histogram turns funny.
Here is the data.
set.seed(42)
x <- rnorm(10000,0,1)
df <- data.frame(x=x, b=x>1)
This is a histogram without fill.
ggplot(df, aes(x = x)) +
geom_histogram(aes(y=..density..))
This is a histogram with fill.
ggplot(df, aes(x = x, fill=b)) +
geom_histogram(aes(y=..density..))
You can see the latter is pretty crazy. The left side of the bins is sticking out. The density values of the bins of each color are obviously wrong.
I thought over this issue for a while. The data can't be wrong for the first histogram was normal. It should be something in ggplot2 or geom_histogram function. I googled "geom_histogram density fill" and couldn't find much help.
I want the end product to look like:
Separated by colors as you see in the second histogram
Size and shape identical to the first histogram
The vertical axis being density
How would you deal with issue?
I think what you may want is this:
ggplot(df, aes(x = x, fill=b)) +
geom_histogram()
Rather than the density. As mentioned above the density is asking for extra calcuations.
One thing that is important (in my opinion) is that histograms are graphs of one variable. As soon as you start adding data from other variables you start to change them more into bar charts or something else like that.
You will want work on setting the axis manually if you want it to range from 0 to .4.
The solution is to hand-compute density like this (instead of using the built-in ggplot2 version):
library(ggplot2)
# Generate test data
set.seed(42)
x <- rnorm(10000,0,1)
df <- data.frame(x=x, b=x>1)
ggplot(df, aes(x = x, fill=b)) +
geom_histogram(mapping = aes(y = ..count.. / (sum(..count..) * ..width..)))
when you provide a column name for the fill parameter in ggplot it groups varaiables and plots them according to each group with a unique color.
if you want a single color for the plot just specify the color you want:
FIXED
ggplot(df, aes(x = x)) +
geom_histogram(aes(y=..density..),fill="Blue")

R - Time series data with ggplot2

I have a time series dataset in which the x-axis is a list of events in reverse chronological order such that an observation will have an x value that looks like "n-1" or "n-2" all the way down to 1.
I'd like to make a line graph using ggplot that creates a smooth, continuous line that connects all of the points, but it seems when I try to input my data, the x-axis is extremely wonky.
The code I am currently using is
library(ggplot2)
theoretical = data.frame(PA = c("n-1", "n-2", "n-3"),
predictive_value = c(100, 99, 98));
p = ggplot(data=theoretical, aes(x=PA, y=predictive_value)) + geom_line();
p = p + scale_x_discrete(labels=paste("n-", 1:3, sep=""));
The fitted line and grid partitions that would normally appear using ggplot are replaced by no line and wayyy too many partitions.
When you use geom_line() with a factor on at least one axis, you need to specify a group aesthetic, in this case a constant.
p = ggplot(data=theoretical, aes(x=PA, y=predictive_value, group = 1)) + geom_line()
p = p + scale_x_discrete(labels=paste("n-", 1:3, sep=""))
p
If you want to get rid of the minor grid lines you can add
theme(panel.grid.minor = element_blank())
to your graph.
Note that it can be a little risky, scale-wise, to use factors on one axis like this. It may work better to use a typical continuous scale, and just relabel the points 1, 2, and 3 with "n-1", "n-2", and "n-3".

Resources