Summarise and create a stacked bar chart in R - r

For a population of individuals I have a regular time series of what category they fall into. I would like to summarise the composition of this population over time, by the categories, as a stacked bar chart in R. For example:
set.seed(1)
id <- 1:25
t1 <- sample(LETTERS[1:5], 25, replace=TRUE)
t2 <- sample(LETTERS[1:5], 25, replace=TRUE, prob=c(0.1,0.1,0.1,0.1,0.6))
t3 <- sample(LETTERS[1:5], 25, replace=TRUE, prob=c(0.2,0.1,0.2,0.1,0.4))
df <- data.frame(cbind(id, t1, t2, t3))
with frequencies:
> table(df$t1)
A B C D E
7 6 3 2 7
> table(df$t2)
B C D E
3 4 5 13
> table(df$t3)
A B C D E
4 2 5 4 10
So, at time period 1, 7 of the 25 are category A, 6 category B, whilst at time period 2, none are category A, 3 category B, etc. The chart will look like this (from EXCEL):
Can this be made in ggplot? Thanks.

Here is an option with data.table
library(dplyr)
library(data.table)
library(ggplot2)
melt(setDT(df), id.var = "id")[, .N, .(variable, value)][, perc := N / sum(N), variable] %>%
ggplot(aes(x = variable, y = perc, fill = value)) +
geom_bar(stat = "identity") +
scale_y_continuous(labels = scales::percent)

First reshape the data into 'long' format with pivot_longer, then get the frequency count with count. In the ggplot aes, use the summarised 'n' as 'y', the 'name' column as 'x', and the 'value' column as fill ('name' and 'value' are both created by pivot_longer).
library(dplyr)
library(tidyr)
library(ggplot2)
df %>%
pivot_longer(cols = -id) %>% # reshape only t1..t3; leave the id column out
count(name, value) %>%
ggplot(aes(x = name, y = n, fill = value)) +
geom_col()
If we need proportion instead of count,
df %>%
pivot_longer(cols = -id) %>% # reshape only t1..t3; leave the id column out
count(name, value) %>%
group_by(name) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(x = name, y = prop, fill = value)) +
geom_col() +
scale_y_continuous(labels= scales::percent)
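The count-then-proportion step can also be checked without any packages. The following is a minimal base R sketch, assuming the example data from the question; the names counts and props are introduced here for illustration only. It tabulates each period against a fixed set of levels, so a category that is absent in a period shows up as 0 rather than being dropped.

```r
# Rebuild the example data from the question.
set.seed(1)
id <- 1:25
t1 <- sample(LETTERS[1:5], 25, replace = TRUE)
t2 <- sample(LETTERS[1:5], 25, replace = TRUE, prob = c(0.1, 0.1, 0.1, 0.1, 0.6))
t3 <- sample(LETTERS[1:5], 25, replace = TRUE, prob = c(0.2, 0.1, 0.2, 0.1, 0.4))
df <- data.frame(id, t1, t2, t3)

# Count each category per period; fixing the levels keeps empty categories as 0.
counts <- sapply(df[c("t1", "t2", "t3")],
                 function(x) table(factor(x, levels = LETTERS[1:5])))

# Convert counts to within-period proportions (each column sums to 1).
props <- prop.table(counts, margin = 2)

# Stacked bars: one bar per period, one segment per category.
barplot(props, legend.text = rownames(props), ylab = "Proportion")
```

Each column of props sums to 1, which is the same normalisation the mutate(prop = n/sum(n)) step performs per time period.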

Related

Can I make a bar plot, where each bar represents a column in a data frame?

I would like to make a bar plot, where each bar is represented by one of the three columns in this data frame. The 'size' of each bar depends on the sum created by adorn_totals.
Reproducible example:
library(janitor)
test_df <- data.frame(
a = c(1:5),
b = c(1:5),
c = c(1:5)
) %>%
adorn_totals(where = 'row', tabyl = c(a, b, c))
I tried a solution that has previously been posted, but that didn't work:
Link to the post: Bar plot for each column in a data frame in R
library(janitor)
library(ggplot2)
df <- data.frame(
a = c(1:5),
b = c(1:5),
c = c(1:5)
) %>%
adorn_totals(where = 'row', tabyl = c(a, b, c))
lapply(names(df), function(col) {
ggplot(df, aes(.data[[col]], ..count..)) +
geom_bar(aes(fill = .data[[col]]), position = "dodge")
}) -> list_plots
This is one way:
library(janitor)
library(ggplot2)
test_df <- data.frame(
a = c(1:5),
b = c(1:5),
c = c(1:5)
) %>%
adorn_totals(where = 'row', tabyl = c(a, b, c))
tail(test_df,1) %>% stack() %>%
ggplot(aes(ind, values)) + geom_col()
Created on 2022-11-07 with reprex v2.0.2
Of course, you don't need to total the data frame before plotting it: geom_col() stacks the individual values within each bar, so the bar height is already the column sum. Here is another example with an explanation of stack(), some color, and no totals.
library(ggplot2)
test_df <- data.frame(
a = c(1:5),
b = c(1:5),
c = c(1:5))
test_df |> stack()
#> values ind
#> 1 1 a
#> 2 2 a
#> 3 3 a
#> 4 4 a
#> 5 5 a
#> 6 1 b
#> 7 2 b
#> 8 3 b
#> 9 4 b
#> 10 5 b
#> 11 1 c
#> 12 2 c
#> 13 3 c
#> 14 4 c
#> 15 5 c
test_df |> stack() |>
ggplot(aes(ind, values, fill=ind)) + geom_col()
Created on 2022-11-07 with reprex v2.0.2
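As a quick sanity check on the claim that the stacked bars already show the totals, here is a small base R sketch (the names totals and long are illustrative, not from the answer). It computes the column sums directly and again from the stacked form that is passed to ggplot.

```r
test_df <- data.frame(a = 1:5, b = 1:5, c = 1:5)

# Column totals computed directly; these are the heights geom_col() reaches
# when it stacks the individual values of each column.
totals <- colSums(test_df)
totals

# The same totals, obtained from the stacked (long) form used for plotting.
long <- stack(test_df)
tapply(long$values, long$ind, sum)
```

Both approaches give 15 for each of a, b and c, matching the "Total" row that adorn_totals would add.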
If you want to use ggplot, it is best to slice the totals off the bottom, pivot into long format, and plot the result:
library(janitor)
library(tidyverse)
data.frame(
a = c(1:5),
b = c(1:5),
c = c(1:5)
) %>%
adorn_totals(where = 'row', tabyl = c(a, b, c)) %>%
slice_tail(n = 1) %>%
pivot_longer(everything()) %>%
ggplot(aes(name, value, fill = name)) +
geom_col(color = "gray") +
scale_fill_brewer() +
theme_minimal(base_size = 16)
Two pivot_longer alternatives without janitor::adorn_totals():
library(tidyverse)
# geom_bar() sums the weight aesthetic, so each bar's height is the column total
# geom_bar() takes only one position aesthetic (x OR y)
data.frame(a = c(1:5), b = c(1:5), c = c(1:5)) %>%
pivot_longer(everything()) %>%
ggplot(aes(name, weight=value))+
geom_bar()
#geom_col version
#Lots of flexibility in summarise:
data.frame(a = c(1:5), b = c(1:5), c = c(1:5)) %>%
pivot_longer(everything()) %>%
group_by(name) %>%
summarise(total=sum(value)) %>%
ggplot(aes(name, total))+
geom_col()

Graph plotting first,second,and third values of variables

For example I have this dataset:
c1 c2
A 1
A 3
A 10
B 5
B 4
C 3
C 4
C 6
A 5
C 7
Is there a short way to plot, in one graph, the first third of the values of A, B and C, the second third of the values of A, B and C, and the last third of the values of A, B and C? Each variable would get 3 lines, so there would be 9 lines in total.
You could use group_split and lapply:
library(dplyr)
library(ggplot2)
df <- data.frame(c1 = rep(LETTERS[1:3], 3), c2 = sample(1:10, size = 9, replace = TRUE))
df %>%
group_by(c1) %>%
mutate(num = 1:n()) %>%
group_split(num) -> plot_list
lapply(plot_list, function(x) {
ggplot(x, aes(x = num, y = c2)) + geom_line()
})
Or you use facets:
df %>%
group_by(c1) %>%
mutate(num = 1:n()) %>%
ggplot() +
facet_grid(scales = "free", cols = vars(num)) +
geom_line(aes(x = c1, y = c2, group = num))

Add percentages above columns in geom_bar with faceting

I have a dataset with multiple columns. I am visually summarizing several columns using simple bar plots. A simple example:
set.seed(123)
df <-
data.frame(
a = sample(1:2, 20, replace = T),
b = sample(0:1, 20, replace = T)
)
ggplot(gather(df, factor_key = TRUE), aes(x = factor(value))) +
geom_bar() +
facet_wrap(~ key, scales = "free_x", as.table = TRUE) +
xlab("")
Now, I want to add percentages above each of the 4 columns, saying what percent of rows in the dataframe each column represents. I.e., here, the following numbers would appear right above the four columns, from left to right in this order: 55%, 45%, 60%, 40%.
How can I automate this---given that I have a large number of columns I have to do this for? (Note I want to keep the raw count of responses on the Y axis and just have percentages appear in the plots.)
In addition to the answer proposed by @BappaDas: in your particular case you want to keep the count on the y axis and add the percentage only as a label, whereas the proposed answer shows percentages both on the y axis and in the text labels.
A modified solution is to compute the count for each variable and then calculate the percentage from it. One way of doing this is with the tidyr package (for reshaping the data into "long" form) and the dplyr package:
library(tidyr)
library(dplyr)
df %>% pivot_longer(everything(), names_to = "var", values_to = "val") %>%
group_by(var) %>% count(val) %>%
mutate(Label = n/sum(n))
# A tibble: 4 x 4
# Groups: var [2]
var val n Label
<chr> <int> <int> <dbl>
1 a 1 11 0.55
2 a 2 9 0.45
3 b 0 12 0.6
4 b 1 8 0.4
Now at the end of this pipe sequence, you can add ggplot plotting code in order to obtain the desired output by passing the count as y argument and the percentage as label argument:
library(tidyr)
library(dplyr)
library(ggplot2)
df %>% pivot_longer(everything(), names_to = "var", values_to = "val") %>%
group_by(var) %>% count(val) %>%
mutate(Label = n/sum(n)) %>%
ggplot(aes(x = factor(val), y = n))+
geom_col()+
facet_wrap(~var, scales = "free", as.table = TRUE)+
xlab("")+
geom_text(aes(label = scales::percent(Label)), vjust = -0.5)
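Since the question asks how to automate this for many columns, the summarising step can be wrapped in a function. This is a minimal sketch: count_with_share is a hypothetical helper name, and toy is a small made-up data frame (not the question's random sample) chosen so the shares are easy to verify by eye. It assumes all columns have the same type, which pivot_longer requires when combining them.

```r
library(tidyr)
library(dplyr)

# Hypothetical helper: count each value per column and attach its share
# of that column's rows.
count_with_share <- function(data) {
  data %>%
    pivot_longer(everything(), names_to = "var", values_to = "val") %>%
    count(var, val) %>%
    group_by(var) %>%
    mutate(Label = n / sum(n)) %>%
    ungroup()
}

# Small deterministic data set: 3 of 4 rows are 1 in each column.
toy <- data.frame(a = c(1, 1, 1, 2), b = c(0, 1, 1, 1))
res <- count_with_share(toy)
res
```

The result has one row per column/value pair, so it can be piped straight into the ggplot call above regardless of how many columns the data frame has.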

How to combine ggplot and dplyr into a function?

Consider this simple example
library(dplyr)
library(ggplot2)
dataframe <- tibble(id = c(1,2,3,4), # data_frame() is deprecated; tibble() is its replacement
group = c('a','b','c','c'),
value = c(200,400,120,300))
# A tibble: 4 x 3
id group value
<dbl> <chr> <dbl>
1 1 a 200
2 2 b 400
3 3 c 120
4 4 c 300
Here I want to write a function that takes the dataframe and the grouping variable as input. Ideally, after grouping and aggregating, I would like to print a ggplot chart.
This works:
get_charts2 <- function(data, mygroup){
quo_var <- enquo(mygroup)
df_agg <- data %>%
group_by(!!quo_var) %>%
summarize(mean = mean(value, na.rm = TRUE),
count = n()) %>%
ungroup()
df_agg
}
> get_charts2(dataframe, group)
# A tibble: 3 x 3
group mean count
<chr> <dbl> <int>
1 a 200 1
2 b 400 1
3 c 210 2
Unfortunately, adding ggplot into the function above FAILS
get_charts1 <- function(data, mygroup){
quo_var <- enquo(mygroup)
df_agg <- data %>%
group_by(!!quo_var) %>%
summarize(mean = mean(value, na.rm = TRUE),
count = n()) %>%
ungroup()
ggplot(df_agg, aes(x = count, y = mean, color = !!quo_var, group = !!quo_var)) +
geom_point() +
geom_line()
}
> get_charts1(dataframe, group)
Error in !quo_var : invalid argument type
I don't understand what is wrong here. Any ideas?
Thanks!
EDIT: interesting follow-up here: how to create factor variables from quosures in functions using ggplot and dplyr?
Before version 3.0.0, ggplot2 did not support tidy eval syntax (you couldn't use !! inside aes()). On those versions you need to use more traditional standard evaluation calls. You can use aes_q in ggplot to help with this.
get_charts1 <- function(data, mygroup){
quo_var <- enquo(mygroup)
df_agg <- data %>%
group_by(!!quo_var) %>%
summarize(mean = mean(value, na.rm = TRUE),
count = n()) %>%
ungroup()
ggplot(df_agg, aes_q(x = quote(count), y = quote(mean), color = quo_var, group = quo_var)) +
geom_point() +
geom_line()
}
get_charts1(dataframe, group)
ggplot2 v3.0.0, released in July 2018, supports !! (bang bang), !!!, and :=. aes_()/aes_q() and aes_string() are soft-deprecated.
With that version or later, the OP's original code should work:
library(tidyverse)
get_charts1 <- function(data, mygroup){
quo_var <- enquo(mygroup)
df_agg <- data %>%
group_by(!!quo_var) %>%
summarize(mean = mean(value, na.rm = TRUE),
count = n()) %>%
ungroup()
ggplot(df_agg, aes(x = count, y = mean,
color = !!quo_var, group = !!quo_var)) +
geom_point() +
geom_line()
}
get_charts1(dataframe, group)
Edit: using the tidy evaluation pronoun .data[[ ]] to pull the chosen variable from the data frame also works. Note that the column name is then passed as a string:
get_charts2 <- function(data, mygroup){
df_agg <- data %>%
group_by(.data[[mygroup]]) %>%
summarize(mean = mean(value, na.rm = TRUE),
count = n()) %>%
ungroup()
ggplot(df_agg, aes(x = count, y = mean,
color = .data[[mygroup]], group = .data[[mygroup]])) +
geom_point() +
geom_line()
}
get_charts2(dataframe, "group")
Created on 2018-04-04 by the reprex package (v0.2.0).
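Another option, not shown in the answers above, is the embrace operator {{ }}, which combines enquo() and !! in one step. The following is a sketch under the assumption that rlang 0.4.0 or later and ggplot2 3.0.0 or later are installed; get_charts3 is a hypothetical name used here to avoid clashing with the functions above.

```r
library(dplyr)
library(ggplot2)

dataframe <- tibble(id = c(1, 2, 3, 4),
                    group = c("a", "b", "c", "c"),
                    value = c(200, 400, 120, 300))

# Hypothetical variant of get_charts1: {{ mygroup }} forwards the unquoted
# column name to both dplyr and ggplot2.
get_charts3 <- function(data, mygroup) {
  df_agg <- data %>%
    group_by({{ mygroup }}) %>%
    summarize(mean = mean(value, na.rm = TRUE),
              count = n()) %>%
    ungroup()
  ggplot(df_agg, aes(x = count, y = mean,
                     color = {{ mygroup }}, group = {{ mygroup }})) +
    geom_point() +
    geom_line()
}

p <- get_charts3(dataframe, group)
p
```

As with the !! version, the grouping column is passed unquoted (group, not "group"). With this sample data every group aggregates to a single row, so geom_line() has nothing to connect and only the points are drawn.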

ggplot2: comparing 2 groups through fraction of its members

Let's say we have 10000 users classified into 2 groups: level "beginner" and level "pro".
Every user has a rank, going from 1 to 20.
The df:
# beginers
n <- 7000
user.id <- 1:n
lvl <- "beginer"
rank <- sample(1:20, n, replace = TRUE,
prob = seq(.9,0.1,length.out = 20))
df.beginer <- data.frame(user.id, rank, lvl)
# pros
n <- 3000
user.id <- 1:n
lvl <- "pro"
rank <- sample(1:20, n, replace = TRUE,
prob = seq(.9,0.3,length.out = 20))
df.pro <- data.frame(user.id, rank, lvl)
library(dplyr)
df <- bind_rows(df.beginer, df.pro)
df2 <- as_tibble(df) %>% group_by(lvl, rank) %>% mutate(count = n()) # tbl_df() is deprecated
Problem 1:
I need a bar plot comparing each group side by side, but instead of giving me counts, I need percents, so the bars from each group will have the same max height (100%)
The plot I got so far:
library(ggplot2)
plot <- ggplot(df2, aes(rank))
plot + geom_bar(aes(fill=lvl), position="dodge")
Problem 2:
I need a line plot comparing each group, so we will have 2 lines, but instead of giving me counts, I need percents, so the lines from each group will have the same max height (100%)
The plot I got so far:
plot + geom_line(aes(y=count, color=lvl))
Problem 3:
Let's say that the ranks are cumulative, so a user who has rank 3 also has ranks 1 and 2. A user who has rank 20 has all ranks from 1 to 20.
So, when plotting, I want the plot to start with rank 1 having 100% of users,
rank 2 will have something less, rank 3 even less and so on.
I got all this done in Tableau, but I really dislike it and want to show myself that R can handle all this stuff.
Thank you!
Three problems, three solutions:
problem 1 - calculate percentage and use geom_col
df %>%
group_by(rank, lvl)%>%
summarise(count = n()) %>%
group_by(lvl) %>%
mutate(count_perc = count / sum(count)) %>% # calculate percentage
ggplot(., aes(x = rank, y = count_perc))+
geom_col(aes(fill = lvl), position = 'dodge')
problem 2 - pretty much the same as problem 1 except use geom_line instead of geom_col
df %>%
group_by(rank, lvl)%>%
summarise(count = n()) %>%
group_by(lvl) %>%
mutate(count_perc = count / sum(count)) %>%
ggplot(., aes(x = rank, y = count_perc))+
geom_line(aes(colour = lvl))
problem 3 - make use of arrange and cumsum
df %>%
group_by(lvl, rank) %>%
summarise(count = n()) %>% # count by level and rank
group_by(lvl) %>%
arrange(desc(rank)) %>% # sort descending
mutate(cumulative_count = cumsum(count)) %>% # use cumsum
mutate(cumulative_count_perc = cumulative_count / max(cumulative_count)) %>%
ggplot(., aes(x = rank, y = cumulative_count_perc))+
geom_line(aes(colour = lvl))
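The arrange-and-cumsum step for problem 3 is easier to follow on a tiny example. This base R sketch uses made-up ranks for a single group (the names tab and at_least are illustrative): it accumulates the per-rank counts from the highest rank downwards, so each rank's value is the share of users who hold at least that rank.

```r
# Toy ranks for one group (made-up values, not the question's sample).
rank <- c(1, 1, 2, 3, 3, 3)

# Number of users at each rank.
tab <- table(rank)

# Share of users holding at least each rank: reverse, accumulate, reverse back.
at_least <- rev(cumsum(rev(tab))) / sum(tab)
at_least
```

Rank 1 comes out as 1 (100% of users), and the share can only fall or stay level as the rank increases, which is the shape the question asks for.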

Resources