ggplot2: Stack barcharts with group means - r

I have tried several things to make ggplot plot bar charts with means derived from factors in a dataframe, but I wasn't successful.
If you consider:
df <- as.data.frame(matrix(rnorm(60*2, mean=3,sd=1), 60, 2))
df$factor <- c(rep(factor(1:3), each=20))
I want to achieve a stacked, relative barchart like this:
This chart was created by manually calculating group means in a separate dataframe, melting it, and using geom_bar(stat="identity", position = "fill") and scale_y_continuous(labels = percent_format()). I haven't found a way to use stat_summary with stacked bar charts.
In a second step, I would like to have error bars attached to the breaks of each column. I have six treatments and three species, so error bars should be OK.

For anything this complicated, I think it's loads easier to pre-calculate the numbers, then plot them. This is easily done with dplyr/tidyr (even the error bars):
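library(dplyr)
library(tidyr)
library(ggplot2)
gather(df, 'cat', 'value', 1:2) %>%
  group_by(factor, cat) %>%
  summarise(mean=mean(value), se=sd(value)/sqrt(n())) %>%
  group_by(cat) %>%
  # ymin/ymax sit at the cumulative top of each segment, in factor order
  mutate(perc=mean/sum(mean), ymin=cumsum(perc) -se/sum(mean), ymax=cumsum(perc) + se/sum(mean)) %>%
  ggplot(aes(x=cat, y=perc, fill=factor(factor))) +
  # recent ggplot2 versions stack the first level on top; reverse=TRUE puts it
  # at the bottom so the segments line up with the cumulative sums above
  geom_bar(stat='identity', position=position_stack(reverse=TRUE)) +
  geom_errorbar(aes(ymax=ymax, ymin=ymin))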
Of course this looks a bit strange because there are error bars around 100% in the stacked bars. I think you'd be way better off plotting the actual data points, plus means and error bars, and using faceting:
gather(df, 'cat', 'value', 1:2) %>%
group_by(cat, factor) %>%
summarise(mean=mean(value), se=sd(value)/sqrt(n())) %>%
ggplot(aes(x=cat, y=mean, colour=factor(factor))) +
geom_point(aes(y=value), position=position_jitter(width=.3, height=0), data=gather(df, 'cat', 'value', 1:2) ) +
geom_point(shape=5, size = 3) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se), width=.1) +
facet_grid(factor ~ .)
This way anyone can examine the data and see for themselves whether they are normally distributed.
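If it helps to see the share calculation on its own, here is a minimal base-R sketch of the "pre-calculate, then plot" idea. The seed and the object names (means, shares, tops) are my own assumptions, not from the question; V1 and V2 are simply the default column names that as.data.frame() gives the matrix.

```r
# Minimal sketch: group means turned into within-column shares (base R only).
set.seed(1)  # assumed seed, only for reproducibility
df <- as.data.frame(matrix(rnorm(60 * 2, mean = 3, sd = 1), 60, 2))
df$factor <- rep(factor(1:3), each = 20)

# Mean of V1 and V2 within each level of `factor`
means <- aggregate(cbind(V1, V2) ~ factor, data = df, FUN = mean)

# Divide each column of means by its column total to get shares
m <- as.matrix(means[, c("V1", "V2")])
shares <- sweep(m, 2, colSums(m), "/")
rownames(shares) <- as.character(means$factor)

# Cumulative shares give the top edge of each stacked segment,
# which is where an error bar for that segment would be centred
tops <- apply(shares, 2, cumsum)

print(round(shares, 3))
print(round(tops, 3))
```

Each column of shares sums to 1, so the last row of tops is always 1, which is why the top segment's error bar ends up straddling 100%.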

Related

Combine scale_x_upset with scale_y_break

I made an upset plot using the ggupset package and added a break to the y axis with scale_y_break from the ggbreak package.
However, when I add scale_y_break, the combination matrix under the bar plot disappears.
Is there a way to combine the combination matrix of the plot made without scale_y_break with the bar plot portion of a plot made with scale_y_break? I can't seem to be able to access the grobs of these plots or use any other workaround. If anyone could help, I would greatly appreciate it!
Example with scale_x_upset and scale_y_break:
df = tidy_movies %>% distinct(title, year, length, .keep_all=TRUE)
ggplot(df, aes(x=Genres)) + geom_bar() + scale_x_upset(n_intersections = 20)+ scale_y_break(breaks = c(750,1000))
I would like to combine the barplot portion of the plot created with:
df = tidy_movies %>% distinct(title, year, length, .keep_all=TRUE)
ggplot(df, aes(x=Genres)) + geom_bar() + scale_x_upset(n_intersections = 20)+ scale_y_break(breaks = c(750,1000))
with the combination matrix portion of the plot made with:
df = tidy_movies %>% distinct(title, year, length, .keep_all=TRUE)
ggplot(df, aes(x=Genres)) + geom_bar() + scale_x_upset(n_intersections = 20)
Thanks!

stack bars with stat_binline()

I have data similar to the example below.
I am wanting to visualise the spread of the outcome variable (value) for each group (name). The fill aesthetic is the desired interval - the example below uses the interquartile range.
I would expect the position="identity" to stack the bars on top of each other for the fill aesthetic (as it does for geom_bar). This is the behaviour that I want.
When I try position="stack", it's a mess.
I have looked at the stat_binline examples and the ggridges vignette but neither have examples where the position is modified to stack the ridges (binned or not).
library(dplyr)  # needed for %>%, group_by(), arrange() and mutate() below
library(ggplot2)
library(ggridges)
set.seed(123)
size <- 1000
data.frame(
name=sample(LETTERS[1:5], size=size, replace=T),
value=c(sample(1:20, size=size*0.8, replace=T), rep(15, size*0.2))
) %>%
group_by(name) %>%
arrange(value) %>%
mutate(percentile=row_number()/n()) %>%
ungroup() %>%
mutate(in_interval=percentile > 0.25 & percentile < 0.75)%>%
ggplot(aes(x = value, y = name, height = stat(count), fill=in_interval)) +
stat_binline(position = "identity", alpha=0.3, bins=20, scale=0.9) +
coord_flip()
The overlap that I want to avoid is shown here. I want these bars to be stacked instead.
Thank you!
I reviewed the ggridges docs - https://wilkelab.org/ggridges/reference/stat_binline.html
The ggplot position page - https://ggplot2.tidyverse.org/reference/position_stack.html
And a few of the great [ggridges] tagged answers on SO -
https://stackoverflow.com/a/58557352/10276092
Add color gradient to ridgelines according to height
And all I've produced is a non-ggridges answer (here df is the data frame built in the question, with the in_interval column already added):
df %>%
ggplot(aes(x=value, fill=in_interval)) +
geom_histogram(bins=20) +
facet_grid(cols=vars(name)) +
coord_flip()
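For completeness, here is a sketch of the same non-ggridges idea with the counts computed up front. It assumes the question's pipeline is first stored as df (the question pipes it straight into ggplot), and the names counts and p are mine. Because value only takes the integers 1 to 20, counting per value plays roughly the role of the 20 bins.

```r
library(dplyr)
library(ggplot2)

set.seed(123)
size <- 1000

# The question's data, stored as `df` instead of being piped straight into ggplot
df <- data.frame(
  name  = sample(LETTERS[1:5], size = size, replace = TRUE),
  value = c(sample(1:20, size = size * 0.8, replace = TRUE), rep(15, size * 0.2))
) %>%
  group_by(name) %>%
  arrange(value) %>%
  mutate(percentile = row_number() / n()) %>%
  ungroup() %>%
  mutate(in_interval = percentile > 0.25 & percentile < 0.75)

# One row per name / value / in_interval combination, with its count in `n`
counts <- count(df, name, value, in_interval)

# geom_col() stacks by default, so the two fill groups sit on top of each other
p <- ggplot(counts, aes(x = value, y = n, fill = in_interval)) +
  geom_col() +
  facet_grid(cols = vars(name)) +
  coord_flip()
```

Printing p should give stacked rather than overlapping bars, one panel per name. It still is not a ridgeline plot, so treat it as a workaround rather than a stat_binline solution.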

multiple line and facet_grid in Bar plot

I have a dataframe with 53 states and a sex variable. For example, the data frame below has 26 states.
set.seed(25)
test <- data.frame(
state = sample(letters[1:26], 10000, replace = TRUE),
sex = sample(c("M","F"), 10000, replace = TRUE)
)
Now I want to see which state has more female members, so I created a bar plot with one panel per state, where each panel has two bars (M, F).
test.pct = test %>% group_by(state, sex) %>%
summarise(count=n()) %>%
mutate(pct=count/sum(count))
ggplot(test.pct, aes(x=sex, y=pct, fill=sex)) +
geom_bar(stat="identity") +
facet_grid(. ~ state)
The problem is that all 26 panels appear in a single row, which makes them hard to read. I want to lay the plot out in multiple rows, e.g. 3x9 instead of 1x26.
Also, the states should be ordered by female percentage.
Thanks for your help.
Problem #1: Use facet_wrap. Problem #2: Reorder the state levels beforehand.
It could look like this:
ggplot(transform(test.pct, state=factor(state,
levels=with(subset(test.pct, sex=="F"),
state[order(pct)]))),
aes(x=sex, y=pct, fill=sex)) +
geom_bar(stat="identity") +
facet_wrap(~ state, nrow = 3)
The first part is straightforward: just use facet_wrap instead of facet_grid. The ordering is a bit trickier; you have to reorder the levels of the factor. Just to make it a bit clearer, I've split the operation up into a few steps. First, extract only female percentages, then find the order of those percentages, and finally use that order to rearrange the order of the levels of state. That's a long-winded way of doing it, but I hope it makes the principle clear.
wom.pct <- test.pct %>% filter(sex == 'F')
ix <- order(wom.pct$pct)
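# take the level names from wom.pct itself, so this does not rely on every letter having a female row
test.pct$state <- factor(test.pct$state, levels = as.character(wom.pct$state)[ix])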
ggplot(test.pct, aes(x=sex, y=pct, fill=sex)) +
geom_bar(stat="identity") +
facet_wrap( ~ state)
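The ordering step can also be checked without dplyr. Below is a minimal base-R sketch of the same principle; pct and state_order are names I made up for illustration, and the table-based approach is an alternative, not what either answer above uses.

```r
set.seed(25)
test <- data.frame(
  state = sample(letters[1:26], 10000, replace = TRUE),
  sex   = sample(c("M", "F"), 10000, replace = TRUE)
)

# Share of each sex within each state (rows: state, columns: sex)
pct <- prop.table(table(test$state, test$sex), margin = 1)

# State names sorted by increasing female share
state_order <- names(sort(pct[, "F"]))

# Long format, with `state` as a factor whose levels carry that order
test.pct <- as.data.frame(pct, responseName = "pct")
names(test.pct)[1:2] <- c("state", "sex")
test.pct$state <- factor(test.pct$state, levels = state_order)

print(head(state_order))
```

Passing this test.pct to the same ggplot call with facet_wrap(~ state, nrow = 3) should lay the panels out from the lowest to the highest female share, since facet_wrap follows the factor levels.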

Rank Stacked Bar Chart by Sum of Subset of Fill Variable

Sample data:
set.seed(145)
df <- data.frame(Age=sample(c(1:10),20,replace=TRUE),
Rank=sample(c("Extremely","Very","Slightly","Not At All"),
20,replace=TRUE),
Percent=(runif(10,0,.01)))
df.plot <- ggplot(df,aes(x=Age,y=Percent,fill=Rank))+
geom_bar(stat="identity")+
coord_flip()
df.plot
Within the ggplot, how can I reorder x=Age, by the sum of Ranks "Extremely" and "Very" only?
I tried using the below, without success.
df.plot <- ggplot(df,aes(x=reorder(Age,Rank=="Extremely",sum),y=Percent,fill=Rank))+
geom_bar(stat="identity")+
coord_flip()
df.plot
Couple of notes:
The way that you are simulating your data does not rule out the possibility that for some ages, all categories are not represented (which is fine), but also that for some ages, some categories are duplicated. I am assuming that this is not true for your real data, so have let this be. Note also that your simulation logic does not produce percentages that add up, although the category names indicate that they should.
The way I would do this is to create the ordering of age based on your desired logic, and then pass that order to the factor call. This decouples the ordering logic and allows arbitrary ordering logic.
Here is then what I think you are looking for:
library(ggplot2)
library(dplyr)
library(scales)
set.seed(145)
# simulate the data
df_foo = data.frame(Age=sample(c(1:10),20,replace=TRUE),
Rank=sample(c("Extremely","Very","Slightly","Not At All"),
20,replace=TRUE),
Percent=(runif(10,0,.01)))
# get the ordering that you are interested in
age_order = df_foo %>%
filter(Rank %in% c("Extremely", "Very")) %>%
group_by(Age) %>%
summarize(SumRank = sum(Percent)) %>%
arrange(desc(SumRank)) %>%
`[[`("Age")
# in some cases ages do not appear in the order because the
# ordering logic does not span all categories
age_order = c(age_order, setdiff(unique(df_foo$Age), age_order))
# make age a factor sorted by the ordering above
ggplot(df_foo, aes(x = factor(Age, levels = age_order), y = Percent, fill = Rank))+
geom_bar(stat = "identity") +
coord_flip() +
theme_bw() +
scale_y_continuous(labels = percent)
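This code produces a horizontal stacked bar chart with the ages ordered by their combined "Extremely" and "Very" percentage.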
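As a cross-check on the ordering logic, the same idea can be sketched in base R. The names top_two, totals and Age_f are hypothetical. The one difference from the dplyr version is that rows outside the two ranks are zeroed out instead of filtered, so ages without any "Extremely" or "Very" row stay in the ordering with a total of 0 and no setdiff step is needed.

```r
set.seed(145)
df <- data.frame(Age = sample(1:10, 20, replace = TRUE),
                 Rank = sample(c("Extremely", "Very", "Slightly", "Not At All"),
                               20, replace = TRUE),
                 Percent = runif(10, 0, 0.01))  # recycled to 20 rows, as in the question

# Keep Percent only for the two ranks of interest; everything else counts as 0
top_two <- ifelse(df$Rank %in% c("Extremely", "Very"), df$Percent, 0)

# Sum per Age; every age present in the data gets an entry
totals <- tapply(top_two, df$Age, sum)

# Levels from the largest to the smallest top-two total
age_order <- names(sort(totals, decreasing = TRUE))
df$Age_f  <- factor(df$Age, levels = age_order)

print(age_order)
```

Using Age_f on the x axis should give the same kind of ordering as the factor(Age, levels = age_order) call above. I believe reorder(Age, top_two, sum) would give a comparable (ascending) ordering directly inside aes(), but I have not checked that against the original attempt.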

Plot including one categorical variable and two numeric variables

How can I show the values of AverageTime and AverageCost for their corresponding type on a graph? The scales of the variables differ, since one is an average time and the other is an average cost. I want Type on the x axis and the values of AverageTime and AverageCost on the y axis (so I will have two line plots in one graph).
Type<-c("a","b","c","d","e","f","g","h","i","j","k")
AverageTime<-c(12,14,66,123,14,33,44,55,55,6,66)
AverageCost<-c(100,10000,400,20000,500000,5000,700,800,400000,500,120000)
df<-data.frame(Type,AverageTime,AverageCost)
This could be done using facet_wrap and scales="free_y" like so:
library(tidyr)
library(dplyr)
library(ggplot2)
df %>%
mutate(AverageCost=as.numeric(AverageCost), AverageTime=as.numeric(AverageTime)) %>%
gather(variable, value, -Type) %>%
ggplot(aes(x=Type, y=value, colour=variable, group=variable)) +
geom_line() +
facet_wrap(~variable, scales="free_y")
There you can compare the two lines even though they are on different scales.
HTH
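A side note on the reshaping step: tidyr now marks gather() as superseded by pivot_longer(). A minimal sketch of the same plot with pivot_longer() follows; the object names long and p are mine, and the columns are already numeric here, so the as.numeric() step is left out.

```r
library(tidyr)
library(ggplot2)

df <- data.frame(
  Type        = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"),
  AverageTime = c(12, 14, 66, 123, 14, 33, 44, 55, 55, 6, 66),
  AverageCost = c(100, 10000, 400, 20000, 500000, 5000, 700, 800, 400000, 500, 120000)
)

# One row per Type/variable pair: every column except Type is stacked into
# a `variable` column (the old column name) and a `value` column
long <- pivot_longer(df, cols = -Type, names_to = "variable", values_to = "value")

# One panel per variable, each with its own y scale
p <- ggplot(long, aes(x = Type, y = value, colour = variable, group = variable)) +
  geom_line() +
  facet_wrap(~ variable, scales = "free_y")
```

Printing p should give the same two panels as the gather() version, each with its own y axis.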
# install.packages("ggplot2", dependencies = TRUE)
library(ggplot2)
p <- ggplot(df, aes(AverageTime, AverageCost, colour=Type)) + geom_point()
p + geom_abline()
Showing both lines in the same plot will be hard since they are on different scales. You also need to convert AverageTime and AverageCost into numeric variables.
library(ggplot2)
library(reshape2)
library(dplyr)  # provides %>%, group_by() and summarise() used below (plyr does not)
To be able to plot both lines in one graph and take the average of the two, you need to do some reshaping.
df_ag <- melt(df, id.vars=c("Type"))
df_ag_sb <- df_ag %>% group_by(Type, variable) %>% summarise(meanx = mean(as.numeric(value), na.rm=TRUE))
ggplot(df_ag_sb, aes(x=Type, y=as.numeric(meanx), color=variable, group=variable)) + geom_line()

Resources