Plotting group proportions with continuous variable - r

I would like to plot the proportion of each level of a grouping variable along a continuous variable. Because the x-axis is continuous, proportions cannot be computed at every point (there are infinitely many), so the usual approach is to cut the continuous variable into bins and plot the proportions per bin. Another option is a density plot, but I want proportions (percentages) on the y-axis, and I'm pretty sure a density does not show proportions.
As an example, let's use iris and try to plot the share of each species across Sepal.Length. One can create bins using Hmisc::cut2 and then compute the proportions for each group:
library(tidyverse)
library(Hmisc)
dat <- iris %>%
  mutate(Sepal.Length = Sepal.Length + rnorm(n()),
         cut = cut2(Sepal.Length, g = 30, levels.mean = TRUE)) %>%
  group_by(cut) %>%
  summarise(set = sum(Species == "setosa") / n(),
            vir = sum(Species == "virginica") / n(),
            ver = sum(Species == "versicolor") / n()) %>%
  pivot_longer(-cut)
# A tibble: 90 x 3
   cut    name  value
   <fct>  <chr> <dbl>
 1 3.0126 set     0.6
 2 3.0126 vir     0
 3 3.0126 ver     0.4
 4 3.7616 set     0.8
 5 3.7616 vir     0
 6 3.7616 ver     0.2
 7 3.9898 set     0.8
 8 3.9898 vir     0
 9 3.9898 ver     0.2
10 4.1577 set     0.2
# ... with 80 more rows
And the plot looks like this, e.g. for name == "ver":
dat %>%
  filter(name == "ver") %>%
  ggplot(aes(x = cut, y = value)) +
  geom_col()
Now, is there any way to make this easier and more aesthetically pleasing?
In particular, I would like to make the x-axis continuous again, so that one could e.g. draw a geom_line connecting the columns of the plot (maybe using rolling means?). Or is this bad practice, and is that why I can't find any documentation about it?

Setting the variable cut to numeric did the job, but there may still be better options.
dat %>%
  filter(name == "ver") %>%
  ggplot(aes(x = as.numeric(as.character(cut)), y = value)) +
  geom_col()
Or with a line:
dat %>%
  filter(name == "ver") %>%
  ggplot(aes(x = as.numeric(as.character(cut)), y = value)) +
  geom_line()
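If the goal is a continuous x-axis, one option is to skip the factor altogether and store each bin's numeric midpoint. The following is a minimal sketch under assumptions that are not in the original post: a fixed bin width of 0.5 (bin_width and bin_mid are names made up for illustration), plain iris without the added noise, and tidyr::complete() to give species that are absent from a bin a count of zero so that the lines do not skip bins.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

bin_width <- 0.5  # assumed bin width, chosen only for illustration

props <- iris %>%
  # numeric midpoint of the bin each observation falls into
  mutate(bin_mid = floor(Sepal.Length / bin_width) * bin_width + bin_width / 2) %>%
  count(bin_mid, Species) %>%
  # add explicit zero counts for species missing from a bin
  complete(bin_mid, Species, fill = list(n = 0)) %>%
  group_by(bin_mid) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

p <- ggplot(props, aes(x = bin_mid, y = prop, colour = Species)) +
  geom_line() +
  geom_point() +
  scale_y_continuous(labels = scales::percent)
print(p)
```

Because bin_mid is numeric from the start, no factor-to-numeric conversion is needed, and within every bin the proportions add up to 1.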

Related

Add percentages above columns in geom_bar with faceting

I have a dataset with multiple columns. I am visually summarizing several columns using simple bar plots. A simple example:
library(ggplot2)
library(tidyr)
set.seed(123)
df <-
  data.frame(
    a = sample(1:2, 20, replace = TRUE),
    b = sample(0:1, 20, replace = TRUE)
  )
ggplot(gather(df, factor_key = TRUE), aes(x = factor(value))) +
  geom_bar() +
  facet_wrap(~ key, scales = "free_x", as.table = TRUE) +
  xlab("")
Now, I want to add percentages above each of the 4 columns, showing what percent of rows in the data frame each column represents. I.e., here, the following numbers would appear right above the four columns, from left to right in this order: 55%, 45%, 60%, 40%.
How can I automate this, given that I have to do it for a large number of columns? (Note that I want to keep the raw count of responses on the y-axis and only have the percentages appear as labels in the plots.)
In addition to the answer proposed by @BappaDas: in your particular case you want to keep the counts on the y-axis and add percentages as labels, whereas the proposed answer uses percentages both on the y-axis and in the text labels.
A modified solution is to compute the count for each variable and then calculate the percentage. One possible way is to use the tidyr package (to reshape the data into "long" form) and the dplyr package:
library(tidyr)
library(dplyr)
df %>%
  pivot_longer(everything(), names_to = "var", values_to = "val") %>%
  group_by(var) %>%
  count(val) %>%
  mutate(Label = n / sum(n))
# A tibble: 4 x 4
# Groups:   var [2]
  var     val     n Label
  <chr> <int> <int> <dbl>
1 a         1    11  0.55
2 a         2     9  0.45
3 b         0    12  0.6
4 b         1     8  0.4
Now at the end of this pipe sequence, you can add ggplot plotting code in order to obtain the desired output by passing the count as y argument and the percentage as label argument:
library(tidyr)
library(dplyr)
library(ggplot2)
df %>%
  pivot_longer(everything(), names_to = "var", values_to = "val") %>%
  group_by(var) %>%
  count(val) %>%
  mutate(Label = n / sum(n)) %>%
  ggplot(aes(x = factor(val), y = n)) +
  geom_col() +
  facet_wrap(~var, scales = "free", as.table = TRUE) +
  xlab("") +
  geom_text(aes(label = scales::percent(Label)), vjust = -0.5)
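As a possible alternative, the per-facet percentages can also be computed inside ggplot2, without summarising first. This is only a sketch, assuming ggplot2 3.3.0 or later (for after_stat()); it relies on the PANEL column that ggplot2 adds to the computed statistics, and it uses a tiny made-up data frame instead of the sampled one so that the labels are easy to verify by hand.

```r
library(tidyr)
library(ggplot2)

# Small made-up data: column a is 75% / 25%, column b is 50% / 50%
df <- data.frame(a = c(1, 1, 1, 2), b = c(0, 0, 1, 1))
long <- pivot_longer(df, everything(), names_to = "var", values_to = "val")

p <- ggplot(long, aes(x = factor(val))) +
  geom_bar() +  # raw counts stay on the y-axis
  geom_text(
    stat = "count",
    # share of each bar within its own facet (PANEL)
    aes(label = after_stat(scales::percent(count / ave(count, PANEL, FUN = sum)))),
    vjust = -0.5
  ) +
  facet_wrap(~var, scales = "free_x") +
  xlab("")
print(p)
```

The pre-computed approach above is easier to inspect and debug; this variant mainly saves the summarising step when many columns are plotted the same way.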

How do I create a stacked bar chart in R, where the y axis should denote the percentages for the bars?

I would like to create a stacked bar chart in R. My x-axis just contains data on sex, i.e. male or female. I need the y-axis to show the percentages of the stacked bars. The "Survived" column is just a mixture of 0s and 1s, with 1 denoting that an individual survived an experience and 0 denoting that the individual did not. I am not sure what to map to y. Can anyone help, please?
ggplot(data = df, mapping = aes(x = Sex, y = ? , fill = Survived)) + geom_bar(stat = "identity")
One possible solution is to use the dplyr package to calculate the percentage of each category outside of ggplot2, and then use those values to draw the bar chart with geom_col:
library(dplyr)
df %>%
  count(Sex, Survive) %>%
  group_by(Sex) %>%
  mutate(Percent = n / sum(n) * 100)
# A tibble: 4 x 4
# Groups:   Sex [2]
  Sex   Survive     n Percent
  <fct>   <dbl> <int>   <dbl>
1 F           0    26    55.3
2 F           1    21    44.7
3 M           0    34    64.2
4 M           1    19    35.8
And now with the plotting part:
library(dplyr)
library(ggplot2)
df %>%
  count(Sex, Survive) %>%
  group_by(Sex) %>%
  mutate(Percent = n / sum(n) * 100) %>%
  ggplot(aes(x = Sex, y = Percent, fill = as.factor(Survive))) +
  geom_col()
Reproducible example:
df <- data.frame(Sex = sample(c("M", "F"), 100, replace = TRUE),
                 Survive = sample(c(0, 1), 100, replace = TRUE))
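A shorter route, not used in the answer above, is to let ggplot2 do the normalisation with position = "fill", which rescales every stacked bar to a total height of 1. The sketch below assumes the same simulated data layout (columns Sex and Survive); the seed is arbitrary and only there for reproducibility.

```r
library(ggplot2)

set.seed(42)  # arbitrary seed, only for reproducibility
df <- data.frame(Sex = sample(c("M", "F"), 100, replace = TRUE),
                 Survive = sample(c(0, 1), 100, replace = TRUE))

p <- ggplot(df, aes(x = Sex, fill = factor(Survive))) +
  geom_bar(position = "fill") +                   # each bar is scaled to sum to 1
  scale_y_continuous(labels = scales::percent) +  # show 0-1 as 0%-100%
  labs(y = "Percent", fill = "Survived")
print(p)
```

With this variant no y aesthetic is needed at all, because geom_bar counts the rows itself.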

Swimmer plot in R (ggplot): How to order stacked bars?

I have a question concerning the ordering of stacked bars in a swimmer plot using ggplot2 in R.
I have a sample dataset of (artificial) patients, who receive treatments.
library(tidyverse)
df <- read.table(text="patient start_t_1 t_1_duration start_t_2 t_2_duration start_t_3 t_3_duration start_t_4 t_4_duration end
1 0 1.5 1.5 3 NA NA 4.5 10 10
2 0 2 4.5 2 NA NA 2 2.5 10
3 0 5 5 2 7 0.5 7.5 2 9.5
4 0 8 NA NA NA NA 8 2 10", header=TRUE)
All patients start the first treatment at time = 0. Subsequently, patients get different treatments (numbered t_2 up to t_4).
I tried to plot the swimmer plot, using the following code:
df %>%
  gather(variable, value, c(t_1_duration, t_2_duration, t_3_duration, t_4_duration)) %>%
  ggplot(aes(x = patient, y = value, fill = variable)) +
  geom_bar(stat = "identity") +
  coord_flip()
However, the treatments are not displayed in the right order.
For example, patient 3 receives all treatments in consecutive order, while patient 2 first receives treatment 1, then 4, and eventually 2.
So simply reversing the order does not work.
How do I order the stacked bars chronologically?
What about this:
df %>%
  gather(variable, value, c(t_1_duration, t_2_duration, t_3_duration, t_4_duration)) %>%
  ggplot(aes(x = patient,
             y = value,
             # here you can specify the order of the variable
             fill = factor(variable,
                           levels = c("t_4_duration", "t_3_duration", "t_2_duration", "t_1_duration")))) +
  geom_bar(stat = "identity") +
  coord_flip() +
  guides(fill = guide_legend("My title"))
EDIT:
This has been a long trip, because it involves a kind of hack. I think it's not a dupe of that question, because it also involves some data reshaping:
library(reshape2)
# divide starts and duration
starts <- df %>% select(patient, start_t_1, start_t_2, start_t_3, start_t_4)
duration <- df %>% select(patient, t_1_duration, t_2_duration, t_3_duration, t_4_duration)
# here you melt them
starts <- melt(starts, id = 'patient') %>%
  mutate(keytreat = substr(variable, nchar(as.vector(variable)) - 2, nchar(as.vector(variable)))) %>%
  `colnames<-`(c("patient", "variable", "start", "keytreat")) %>%
  select(-variable)
duration <- melt(duration, id = 'patient') %>%
  mutate(keytreat = substr(variable, 1, 3)) %>%
  `colnames<-`(c("patient", "variable", "duration", "keytreat")) %>%
  select(-variable)
# join
dats <- starts %>%
  left_join(duration) %>%
  arrange(patient, start) %>%
  filter(!is.na(start))
# here the part for the plot
bars <- map(unique(dats$patient),
            ~geom_bar(stat = "identity", position = "stack",
                      data = dats %>% filter(patient == .x)))
dats %>%
  ggplot(aes(x = patient,
             y = duration,
             fill = reorder(keytreat, -start))) +
  bars +
  guides(fill = guide_legend("ordering")) +
  coord_flip()
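Since every treatment already has a start time and a duration, another way to get chronological order is to avoid stacking and draw each treatment as a rectangle from start to start + duration. This is a different technique from the stacked-bar hack above, sketched with base R reshaping; the names long and stop are made up for illustration.

```r
library(ggplot2)

df <- read.table(text = "patient start_t_1 t_1_duration start_t_2 t_2_duration start_t_3 t_3_duration start_t_4 t_4_duration end
1 0 1.5 1.5 3 NA NA 4.5 10 10
2 0 2 4.5 2 NA NA 2 2.5 10
3 0 5 5 2 7 0.5 7.5 2 9.5
4 0 8 NA NA NA NA 8 2 10", header = TRUE)

# One row per patient and treatment, with its start and duration
long <- do.call(rbind, lapply(1:4, function(k) {
  data.frame(
    patient   = df$patient,
    treatment = paste0("t_", k),
    start     = df[[paste0("start_t_", k)]],
    duration  = df[[paste0("t_", k, "_duration")]]
  )
}))
long <- long[!is.na(long$start), ]       # drop treatments a patient never received
long$stop <- long$start + long$duration  # end time of each treatment

p <- ggplot(long) +
  geom_rect(aes(xmin = start, xmax = stop,
                ymin = patient - 0.4, ymax = patient + 0.4,
                fill = treatment)) +
  scale_y_continuous(breaks = df$patient) +
  labs(x = "time", y = "patient")
print(p)
```

Because each rectangle is positioned by its own start time, the order along the time axis is chronological by construction, and gaps or overlaps between treatments become visible instead of being hidden by stacking.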

ggplot2: comparing 2 groups through fraction of its members

Let's say we have 10000 users classified into 2 groups: lvl beginner and lvl pro.
Every user has a rank, ranging from 1 to 20.
The data frame:
# beginners
n <- 7000
user.id <- 1:n
lvl <- "beginer"
rank <- sample(1:20, n, replace = TRUE,
               prob = seq(.9, 0.1, length.out = 20))
df.beginer <- data.frame(user.id, rank, lvl)
# pros
n <- 3000
user.id <- 1:n
lvl <- "pro"
rank <- sample(1:20, n, replace = TRUE,
               prob = seq(.9, 0.3, length.out = 20))
df.pro <- data.frame(user.id, rank, lvl)
library(dplyr)
df <- bind_rows(df.beginer, df.pro)
# as_tibble() replaces the deprecated tbl_df()
df2 <- as_tibble(df) %>% group_by(lvl, rank) %>% mutate(count = n())
Problem 1:
I need a bar plot comparing the groups side by side, but instead of counts I need percentages, so that the bars of each group add up to the same total (100%).
The plot I got so far:
library(ggplot2)
plot <- ggplot(df2, aes(rank))
plot + geom_bar(aes(fill=lvl), position="dodge")
Problem 2:
I need a line plot comparing the groups, i.e. 2 lines, but again with percentages instead of counts, so that the lines of each group add up to the same total (100%).
The plot I got so far:
plot + geom_line(aes(y=count, color=lvl))
Problem 3:
Let's say that the ranks are cumulative, so a user who has rank 3 also has ranks 1 and 2, and a user who has rank 20 has all ranks from 1 to 20.
So, when plotting, I want the plot to start with rank 1 covering 100% of users,
rank 2 covering somewhat fewer, rank 3 even fewer, and so on.
I got all this done in Tableau, but I really dislike it and want to show myself that R can handle all of this.
Thank you!
Three problems, three solutions:
problem 1 - calculate percentage and use geom_col
df %>%
  group_by(rank, lvl) %>%
  summarise(count = n()) %>%
  group_by(lvl) %>%
  mutate(count_perc = count / sum(count)) %>% # calculate percentage
  ggplot(., aes(x = rank, y = count_perc)) +
  geom_col(aes(fill = lvl), position = 'dodge')
problem 2 - pretty much the same as problem 1 except use geom_line instead of geom_col
df %>%
  group_by(rank, lvl) %>%
  summarise(count = n()) %>%
  group_by(lvl) %>%
  mutate(count_perc = count / sum(count)) %>%
  ggplot(., aes(x = rank, y = count_perc)) +
  geom_line(aes(colour = lvl))
problem 3 - make use of arrange and cumsum
df %>%
  group_by(lvl, rank) %>%
  summarise(count = n()) %>% # count by level and rank
  group_by(lvl) %>%
  arrange(desc(rank)) %>% # sort descending
  mutate(cumulative_count = cumsum(count)) %>% # use cumsum
  mutate(cumulative_count_perc = cumulative_count / max(cumulative_count)) %>%
  ggplot(., aes(x = rank, y = cumulative_count_perc)) +
  geom_line(aes(colour = lvl))
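To see why sorting in descending rank order before cumsum gives "share of users with at least this rank", here is a tiny worked example in base R. The counts are made up: 5 users at rank 1, 3 at rank 2, and 2 at rank 3.

```r
# Made-up number of users whose highest rank is 1, 2 and 3
counts <- c(`1` = 5, `2` = 3, `3` = 2)

# Users with at least rank r: accumulate from the highest rank downwards
at_least <- rev(cumsum(rev(counts)))

# Divide by the total so that rank 1 covers 100% of users
share <- at_least / sum(counts)
print(share)
#   1   2   3
# 1.0 0.5 0.2
```

All 10 users hold rank 1, 5 of them (3 + 2) hold at least rank 2, and 2 hold rank 3, which is the same logic that arrange(desc(rank)) followed by cumsum(count) applies within each lvl group.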

split data into groups in R

My data frame looks like this:
plant distance
one 0
one 1
one 2
one 3
one 4
one 5
one 6
one 7
one 8
one 9
one 9.9
two 0
two 1
two 2
two 3
two 4
two 5
two 6
two 7
two 8
two 9
two 9.5
I want to split the distance values of each plant into groups by a fixed interval (for instance, interval = 3) and compute the percentage of each group. Finally, I want to plot the percentages of each group for each plant, similar to this:
My code:
library(ggplot2)
library(dplyr)
dat <- data %>%
  mutate(group = factor(cut(distance, seq(0, max(distance), 3), F))) %>%
  group_by(plant, group) %>%
  summarise(percentage = n()) %>%
  mutate(percentage = percentage / sum(percentage))
p <- ggplot(dat, aes(x = plant, y = percentage, fill = group)) +
  geom_bar(stat = "identity", position = "stack") +
  # percent() comes from the scales package, so call it with its namespace
  scale_y_continuous(labels = scales::percent)
p
But in the resulting plot, group 4 was missing.
Inspecting dat showed that group 4 had been turned into NA.
The likely reason is that the last group is shorter than the interval of 3. How can I fix this? Thank you in advance!
I have solved the problem. The cause is that the breaks in cut(distance, seq(0, max(distance), 3), F) stop at 9, so distances above 9 fall outside every interval; because the intervals are right-closed by default, the minimum value 0 is excluded as well. cut() returns NA in both cases.
Here is my solution:
dat <- my_data %>%
  mutate(group = factor(cut(distance,
                            seq(from = min(distance), by = 3, length.out = n() / 3 + 1),
                            include.lowest = TRUE))) %>%
  count(plant, group) %>%
  group_by(plant) %>%
  mutate(percentage = n / sum(n))
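The NA behaviour can be reproduced on the distances of plant "one" alone. The sketch below also shows one possible way to build the breaks, which is an assumption and not the code from the answer above: round the maximum up to the next multiple of the interval so that the last break lies at or beyond the largest value.

```r
distance <- c(0:9, 9.9)  # distances of plant "one" from the question
interval <- 3

# Original breaks: 0, 3, 6, 9. Intervals are (0,3], (3,6], (6,9].
breaks_short <- seq(0, max(distance), interval)
g1 <- cut(distance, breaks_short, labels = FALSE)
print(g1)  # 0 and 9.9 fall outside every interval and become NA

# Extended breaks: 0, 3, 6, 9, 12, with the lowest value included
breaks_full <- seq(0, ceiling(max(distance) / interval) * interval, by = interval)
g2 <- cut(distance, breaks_full, include.lowest = TRUE)
print(table(g2))  # group sizes 4, 3, 3, 1 and no NA
```

Compared with deriving length.out from n(), rounding the maximum up makes the number of breaks depend only on the data range, not on the number of rows.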
