How to label stacked histogram in ggplot - r

I am trying to add corresponding labels to the color in the bar in a histogram. Here is a reproducible code.
ggplot(aes(displ),data =mpg) + geom_histogram(aes(fill=class),binwidth = 1,col="black")
This code gives a histogram and give different colors for the car "class" for the histogram bars. But is there any way I can add the labels of the "class" inside corresponding colors in the graph?

The inbuilt functions geom_histogram and stat_bin are perfect for quickly building plots in ggplot. However, if you are looking to do more advanced styling it is often required to create the data before you build the plot. In your case you have overlapping labels which are visually messy.
The following codes builds a binned frequency table for the dataframe:
# Subset data
mpg_df <- data.frame(displ = mpg$displ, class = mpg$class)
melt(table(mpg_df[, c("displ", "class")]))
# Bin Data
breaks <- 1
cuts <- seq(0.5, 8, breaks)
mpg_df$bin <- .bincode(mpg_df$displ, cuts)
# Count the data
mpg_df <- ddply(mpg_df, .(mpg_df$class, mpg_df$bin), nrow)
names(mpg_df) <- c("class", "bin", "Freq")
You can use this new table to set a conditional label, so boxes are only labelled if there are more than a certain number of observations:
ggplot(mpg_df, aes(x = bin, y = Freq, fill = class)) +
geom_bar(stat = "identity", colour = "black", width = 1) +
geom_text(aes(label=ifelse(Freq >= 4, as.character(class), "")),
position=position_stack(vjust=0.5), colour="black")
I don't think it makes a lot of sense duplicating the labels, but it may be more useful showing the frequency of each group:
ggplot(mpg_df, aes(x = bin, y = Freq, fill = class)) +
geom_bar(stat = "identity", colour = "black", width = 1) +
geom_text(aes(label=ifelse(Freq >= 4, Freq, "")),
position=position_stack(vjust=0.5), colour="black")
Update
I realised you can actually selectively filter a label using the internal ggplot function ..count... No need to preformat the data!
ggplot(mpg, aes(x = displ, fill = class, label = class)) +
geom_histogram(binwidth = 1,col="black") +
stat_bin(binwidth=1, geom="text", position=position_stack(vjust=0.5), aes(label=ifelse(..count..>4, ..count.., "")))
This post is useful for explaining special variables within ggplot: Special variables in ggplot (..count.., ..density.., etc.)
This second approach will only work if you want to label the dataset with the counts. If you want to label the dataset by the class or another parameter, you will have to prebuild the data frame using the first method.

Looking at the examples from the other stackoverflow links you shared, all you need to do is change the vjust parameter.
ggplot(mpg, aes(x = displ, fill = class, label = class)) +
geom_histogram(binwidth = 1,col="black") +
stat_bin(binwidth=1, geom="text", vjust=1.5)
That said, it looks like you have other issues. Namely, the labels stack on top of each other because there aren't many observations at each point. Instead I'd just let people use the legend to read the graph.

Related

R code of scatter plot for three variables

Hi I am trying to code for a scatter plot for three variables in R:
Race= [0,1]
YOI= [90,92,94]
ASB_mean = [1.56, 1.59, 1.74]
Antisocial <- read.csv(file = 'Antisocial.csv')
Table_1 <- ddply(Antisocial, "YOI", summarise, ASB_mean = mean(ASB))
Table_1
Race <- unique(Antisocial$Race)
Race
ggplot(data = Table_1, aes(x = YOI, y = ASB_mean, group_by(Race))) +
geom_point(colour = "Black", size = 2) + geom_line(data = Table_1, aes(YOI,
ASB_mean), colour = "orange", size = 1)
Image of plot: https://drive.google.com/file/d/1E-ePt9DZJaEr49m8fguHVS0thlVIodu9/view?usp=sharing
Data file: https://drive.google.com/file/d/1UeVTJ1M_eKQDNtvyUHRB77VDpSF1ASli/view?usp=sharing
Can someone help me understand where I am making mistake? I want to plot mean ASB vs YOI grouped by Race. Thanks.
I am not sure what is your desidered output. Maybe, if I well understood your question I Think that you want somthing like this.
g_Antisocial <- Antisocial %>%
group_by(Race) %>%
summarise(ASB = mean(ASB),
YOI = mean(YOI))
Antisocial %>%
ggplot(aes(x = YOI, y = ASB, color = as_factor(Race), shape = as_factor(Race))) +
geom_point(alpha = .4) +
geom_point(data = g_Antisocial, size = 4) +
theme_bw() +
guides(color = guide_legend("Race"), shape = guide_legend("Race"))
and this is the output:
#Maninder: there are a few things you need to look at.
First of all: The grammar of graphics of ggplot() works with layers. You can add layers with different data (frames) for the different geoms you want to plot.
The reason why your code is not working is that you mix the layer call and or do not really specify (and even mix) what is the scatter and line visualisation you want.
(I) Use ggplot() + geom_point() for a scatter plot
The ultimate first layer is: ggplot(). Think of this as your drawing canvas.
You then speak about adding a scatter plot layer, but you actually do not do it.
For example:
# plotting antisocal data set
ggplot() +
geom_point(data = Antisocial, aes(x = YOI, y = ASB, colour = as.factor(Race)))
will plot your Antiscoial data set using the scatter, i.e. geom_point() layer.
Note that I put Race as a factor to have a categorical colour scheme otherwise you might end up with a continous palette.
(II) line plot
In analogy to above, you would get for the line plot the following:
# plotting Table_1
ggplot() +
geom_line(data = Table_1, aes(x = YOI, y = ASB_mean))
I save showing the plot of the line.
(III) combining different layers
# putting both together
ggplot() +
geom_point(data = Antisocial, aes(x = YOI, y = ASB, colour = as.factor(Race))) +
geom_line(data = Table_1, aes(x = YOI, y = ASB_mean)) +
## this is to set the legend title and have a nice(r) name in your colour legend
labs(colour = "Race")
This yields:
That should explain how ggplot-layering works. Keep an eye on the datasets and geoms that you want to use. Before working with inheritance in aes, I recommend to keep the data= and aes() call in the geom_xxxx. This avoids confustion.
You may want to explore with geom_jitter() instead of geom_point() to get a bit of a better presentation of your dataset. The "few" points plotted are the result of many datapoints in the same position (and overplotted).
Moving away from plotting to your question "I want to plot mean ASB vs YOI grouped by Race."
I know too little about your research to fully comprehend what you mean with that.
I take it that the mean ASB you calculated over the whole population is your reference (aka your Table_1), and you would like to see how the Race groups feature vs this population mean.
One option is to group your race data points and show them as boxplots for each YOI.
This might be what you want. The boxplot gives you the median and quartiles, and you can compare this per group against the calculated ASB mean.
For presentation purposes, I highlighted the line by increasing its size and linetype. You can play around with the colours, etc. to give you the aesthetics you aim for.
Please note, that for the grouped boxplot, you also have to treat your integer variable YOI, I coerced into a categorical factor. Boxplot works with fill for the body (colour sets only the outer line). In this setup, you also need to supply a group value to geom_line() (I just assigned it to 1, but that is arbitrary - in other contexts you can assign another variable here).
ggplot() +
geom_boxplot(data = Antisocial, aes(x = as.factor(YOI), y = ASB, fill = as.factor(Race))) +
geom_line(data = Table_1, aes(x = as.factor(YOI), y = ASB_mean, group = 1)
, size = 2, linetype = "dashed") +
labs(x = "YOI", fill = "Race")
Hope this gets you going!

Boxplot ggplot2: Show mean value and number of observations in grouped boxplot

I wish to add the number of observations to this boxplot, not by group but separated by factor. Also, I wish to display the number of observations in addition to the x-axis label that it looks something like this: ("PF (N=12)").
Furthermore, I would like to display the mean value of each box inside of the box, displayed in millions in order not to have a giant number for each box.
Here is what I have got:
give.n <- function(x){
return(c(y = median(x)*1.05, label = length(x)))
}
mean.n <- function(x){x <- x/1000000
return(c(y = median(x)*0.97, label = round(mean(x),2)))
}
ggplot(Soils_noctrl) +
geom_boxplot(aes(x=Slope,y=Events.g_Bacteria, fill = Detergent),
varwidth = TRUE) +
stat_summary(aes(x = Slope, y = Events.g_Bacteria), fun.data = give.n, geom = "text",
fun = median,
position = position_dodge(width = 0.75))+
ggtitle("Cell Abundance")+
stat_summary(aes(x = Slope, y = Events.g_Bacteria),
fun.data = mean.n, geom = "text", fun = mean, colour = "red")+
facet_wrap(~ Location, scale = "free_x")+
scale_y_continuous(name = "Cell Counts per Gram (Millions)",
breaks = round (seq(min(0),
max(100000000), by = 5000000),1),
labels = function(y) y / 1000000)+
xlab("Sample")
And so far it looks like this:
As you can see, the mean value is at the bottom of the plot and the number of observations are in the boxes but not separated
Thank you for your help! Cheers
TL;DR - you need to supply a group= aesthetic, since ggplot2 does not know on which column data it is supposed to dodge the text geom.
Unfortunately, we don't have your data, but here's an example set that can showcase the rationale here and the function/need for group=.
set.seed(1234)
df1 <- data.frame(detergent=c(rep('EDTA',15),rep('Tween',15)), cells=c(rnorm(15,10,1),rnorm(15,10,3)))
df2 <- data.frame(detergent=c(rep('EDTA',20),rep('Tween',20)), cells=c(rnorm(20,1.3,1),rnorm(20,4,2)))
df3 <- data.frame(detergent=c(rep('EDTA',30),rep('Tween',30)), cells=c(rnorm(30,5,0.8),rnorm(30,3.3,1)))
df1$smp='Sample1'
df2$smp='Sample2'
df3$smp='Sample3'
df <- rbind(df1,df2,df3)
Instead of using stat_summary(), I'm just going to create a separate data frame to hold the mean values I want to include as text on my plot:
summary_df <- df %>% group_by(smp, detergent) %>% summarize(m=mean(cells))
Now, here's the plot and use of geom_text() with dodging:
p <- ggplot(df, aes(x=smp, y=cells)) +
geom_boxplot(aes(fill=detergent))
p + geom_text(data=summary_df,
aes(y=m, label=round(m,2)),
color='blue', position=position_dodge(0.8)
)
You'll notice the numbers are all separated along y= just fine, but the "dodging" is not working. This is because we have not supplied any information on how to do the dodging. In this case, the group= aesthetic can be supplied to let ggplot2 know that this is the column by which to use for the dodging:
p + geom_text(data=summary_df,
aes(y=m, label=round(m,2), group=detergent),
color='blue', position=position_dodge(0.8)
)
You don't have to supply the group= aesthetic if you supply another aesthetic such as color= or fill=. In cases where you give both a color= and group= aesthetic, the group= aesthetic will override any of the others for dodging purposes. Here's an example of the same, but where you don't need a group= aesthetic because I've moved color= up into the aes() (changing fill to greyscale so that you can see the text):
p + geom_text(data=summary_df,
aes(y=m, label=round(m,2), color=detergent),
position=position_dodge(0.8)
) + scale_fill_grey()
FUN FACT: Dodging still works even if you supply geom_text() with a nonsensical aesthetic that would normally work for dodging, such as fill=. You get a warning message Ignoring unknown aesthetics: fill, but the dodging still works:
p + geom_text(data=summary_df,
aes(y=m, label=round(m,2), fill=detergent),
position=position_dodge(0.8)
)
# gives you the same plot as if you just supplied group=detergent, but with black text
In your case, changing your stat_summary() line to this should work:
stat_summary(aes(x = Slope, y = Events.g_Bacteria, group = Detergent),...

Best way to calculate number of facets in geom_hline/_vline

When I combine geom_vline() with facet_grid() like so:
DATA <- data.frame(x = 1:6,y = 1:6, f = rep(letters[1:2],3))
ggplot(DATA,aes(x = x,y = y)) +
geom_point() +
facet_grid(f~.) +
geom_vline(xintercept = 2:3,
colour =c("goldenrod3","dodgerblue3"))
I get an error message stating Error: Aesthetics must be either length 1 or the same as the data (4): colour because there are two lines in each facet and there are two facets. One way to get around this is to use rep(c("goldenrod3","dodgerblue3"),2), but this requires that every time I change the faceting variables, I also have to calculate the number of facets and replace the magic number (2) in the call to rep(), which makes re-using ggplot code so much less nimble.
Is there a way to get the number of facets directly from ggplot for use in this situation?
You could put the xintercept and colour info into a data.frame to pass to geom_vline and then use scale_color_identity.
ggplot(DATA, aes(x = x, y = y)) +
geom_point() +
facet_grid(f~.) +
geom_vline(data = data.frame(xintercept = 2:3,
colour = c("goldenrod3","dodgerblue3") ),
aes(xintercept = xintercept, color = colour) ) +
scale_color_identity()
This side-steps the issue of figuring out the number of facets, although that could be done by pulling out the number of unique values in the faceting variable with something like length(unique(DATA$f)).

ggplot2: geom_bar with group, position_dodge and fill

I am trying to generate a barplot such that the x-axes is by patient with each patient having multiple samples. So for instance (using the mtcars data as a template of what the data would look like):
library("ggplot2")
ggplot(mtcars, aes(x = factor(cyl), group = factor(gear))) +
geom_bar(position = position_dodge(width = 0.8), binwidth = 25) +
xlab("Patient") +
ylab("Number of Mutations per Patient Sample")
This would produce something like this:
With each barplot representing a sample in each patient.
I want to add additional information about each patient sample by using colors to fill the barplots (e.g. different types of mutations in each patient sample). I was thinking I could specify the fill parameter like this:
ggplot(mtcars, aes(x = factor(cyl), group = factor(gear), fill = factor(vs))) +
geom_bar(position = position_dodge(width = 0.8), binwidth = 25) +
xlab("Patient") +
ylab("Number of Mutations per Patient Sample")
But this doesn't produce "stacked barplots" for each patient sample barplot. I am assuming this is because the position_dodge() is set. Is there anyway to get around this? Basically, what I want is:
ggplot(mtcars, aes(x = factor(cyl), fill = factor(vs))) +
geom_bar() +
xlab("Patient") +
ylab("Number of Mutations per Patient Sample")
But with these colors available in the first plot I listed. Is this possible with ggplot2?
I think facets are the closest approximation to what you seem to be looking for:
ggplot(mtcars, aes(x = factor(gear), fill = factor(vs))) +
geom_bar(position = position_dodge(width = 0.8), binwidth = 25) +
xlab("Patient") +
ylab("Number of Mutations per Patient Sample") +
facet_wrap(~cyl)
I haven't found anything related in the issue tracker of ggplot2.
If I understand your question correctly, you want to pass in aes() into your geom_bar layer. This will allow you to pass a fill aesthetic. You can then place your bars as "dodge" or "fill" depending on how you want to display the data.
A short example is listed here:
ggplot(mtcars, aes(x = factor(cyl), fill = factor(vs))) +
geom_bar(aes(fill = factor(vs)), position = "dodge", binwidth = 25) +
xlab("Patient") +
ylab("Number of Mutations per Patient Sample")
With the resulting plot: http://imgur.com/ApUJ4p2 (sorry S/O won't let me post images yet)
Hope that helps!
I have hacked around this a few times by layering multiple geom_cols on top of each other in the order I prefer. For example, the code
ggplot(data, aes(x=cert, y=pct, fill=Party, group=Treatment, shape=Treatment)) +
geom_col(aes(x=cert, y=1), position=position_dodge(width=.9), fill="gray90") +
geom_col(position=position_dodge(width=.9)) +
scale_fill_manual(values=c("gray90", "gray60"))
Allowed me to produce the feature you're looking for without faceting. Notice how I set the background layer's y value to 1. To add more layers, you can just cumulatively sum your variables.
Image of the plot:
I guess, my answer in this post will help you to build the chart with multiple stacked vertical bars for each patient ...
Layered axes in ggplot?
One way I don't see suggested above is to use facet_wrap to group samples by patient and then stack mutations by sample. Removes the need for dodging. Also changed and modified which mtcars attributes used to match question and get more variety in the mutations attribute.
patients <-c('Tom','Harry','Sally')
samples <- c('S1','S2','S3')
mutations <- c('M1','M2','M3','M4','M5','M6','M7','M8')
ds <- data.frame(
patients=patients[mtcars$cyl/2 - 1],
samples=samples[mtcars$gear - 2],
mutations=mutations[mtcars$carb]
)
ggplot(
ds,
aes(
x = factor(samples),
group = factor(mutations),
fill = factor(mutations)
)
) +
geom_bar() +
facet_wrap(~patients,nrow=1) +
ggtitle('Patient') +
xlab('Sample') +
ylab('Number of Mutations per Patient Sample') +
labs(fill = 'Mutation')
Output now has labels that match the specific language of the request...easier to see what is going on.

What is the simplest method to fill the area under a geom_freqpoly line?

The x-axis is time broken up into time intervals. There is an interval column in the data frame that specifies the time for each row. The column is a factor, where each interval is a different factor level.
Plotting a histogram or line using geom_histogram and geom_freqpoly works great, but I'd like to have a line, like that provided by geom_freqpoly, with the area filled.
Currently I'm using geom_freqpoly like this:
ggplot(quake.data, aes(interval, fill=tweet.type)) + geom_freqpoly(aes(group = tweet.type, colour = tweet.type)) + opts(axis.text.x=theme_text(angle=-60, hjust=0, size = 6))
I would prefer to have a filled area, such as provided by geom_density, but without smoothing the line:
The geom_area has been suggested, is there any way to use a ggplot2-generated statistic, such as ..count.., for the geom_area's y-values? Or, does the count aggregation need to occur prior to using ggplot2?
As stated in the answer, geom_area(..., stat = "bin") is the solution:
ggplot(quake.data, aes(interval)) + geom_area(aes(y = ..count.., fill = tweet.type, group = tweet.type), stat = "bin") + opts(axis.text.x=theme_text(angle=-60, hjust=0, size = 6))
produces:
Perhaps you want:
geom_area(aes(y = ..count..), stat = "bin")
geom_ribbon can be used to produce a filled area between two lines without needing to explicitly construct a polygon. There is good documentation here.
ggplot(quake.data, aes(interval, fill=tweet.type, group = 1)) + geom_density()
But I don't think this is a meaningful graphic.
I'm not entirely sure what you're aiming for. Do you want a line or bars. You should check out geom_bar for filled bars. Something like:
p <- ggplot(data, aes(x = time, y = count))
p + geom_bar(stat = "identity")
If you want a line filled in underneath then you should look at geom_area which I haven't personally used but it appears the construct will be almost the same.
p <- ggplot(data, aes(x = time, y = count))
p + geom_area()
Hope that helps. Give some more info and we can probably be more helpful.
Actually i would throw on an index, just the row of the data and use that as x, and then use
p <- ggplot(data, aes(x = index, y = count))
p + geom_bar(stat = "identity") + scale_x_continuous("Intervals",
breaks = index, labels = intervals)

Resources