100% percentage barplot in R

I have a dataset like so:
I want to create a 100% stacked bar plot from it, with one 100% bar for status and one 100% bar for Type, like so:
The picture shows only one bar, for status, but I want two bars side by side: one for status and one for Type.
Any help would be appreciated. I want to do this in R.

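Simple base R solution (using the sample data defined below):
# Percentage of each level, as a one-column matrix so that barplot() stacks it
p1 <- as.matrix(prop.table(table(data$status))) * 100
p2 <- as.matrix(prop.table(table(data$Type))) * 100
# Two panels side by side; keep the old graphics settings in op
op <- par(mfrow=c(1,2), las=1, mar=c(3,4,1,0))
barplot(p1, legend.text=TRUE, names.arg="status", ylab="Percent")
barplot(p2, legend.text=TRUE, names.arg="Type")
par(op) # restore the previous graphics settings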
data <- data.frame(id=1:10,
status=c("P","F","F","P","F","P","P","F","P","P"),
Type=c("full","full","full","part","part","full","full","part","part","full"))
data
id status Type
1 1 P full
2 2 F full
3 3 F full
4 4 P part
5 5 F part
6 6 P full
7 7 P full
8 8 F part
9 9 P part
10 10 P full
Maybe with ggplot2:
library(dplyr)
library(tidyr)
library(ggplot2)
data %>%
  pivot_longer(-id) %>%
  group_by(name, value) %>%
  summarise(n=n()) %>%
  ggplot(aes(fill=value, y=n, x=name)) +
  geom_bar(position="fill", stat="identity") # Needs polishing
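If both bars should sit in a single panel rather than in two, one option is to collect the percentages in one matrix with a column per variable. This is a minimal base R sketch on the sample data above; the names pct, lev and m are made up for illustration.

```r
# Sample data from the answer above
data <- data.frame(id = 1:10,
                   status = c("P","F","F","P","F","P","P","F","P","P"),
                   Type = c("full","full","full","part","part",
                            "full","full","part","part","full"))

# Percentage of each level within each variable (each vector sums to 100)
pct <- lapply(data[c("status", "Type")],
              function(x) prop.table(table(x)) * 100)

# One row per level, one column per variable; levels that do not belong
# to a variable stay at 0, so every column still sums to 100
lev <- unlist(lapply(pct, names), use.names = FALSE)
m <- matrix(0, nrow = length(lev), ncol = length(pct),
            dimnames = list(lev, names(pct)))
for (v in names(pct)) {
  m[names(pct[[v]]), v] <- as.numeric(pct[[v]])
}

# barplot() stacks the rows of a matrix within each column
barplot(m, legend.text = TRUE, ylab = "Percent", las = 1)
```

Because the two variables have different levels, the legend lists all four levels together; drawing two panels as in the base R answer avoids that.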

The information provided is a little scarce, but you seem to be looking for a stacked bar plot with percentages. You might try something like:
library(tidyverse) # tibble, tidyr, dplyr, ggplot2, and purrr for rbernoulli()
# Some sample data:
dta <- tibble(id = 1:10,
status = rbernoulli(n = 10, p = 0.3),
type = rbernoulli(n = 10, p = 0.6))
# Transformation and plotting:
dta %>%
pivot_longer(c(status, type), names_to = "variable") %>%
group_by(variable, value) %>%
summarize(percent = n()) %>%
mutate(percent = percent / sum(percent)) %>%
ungroup() %>%
ggplot() +
aes(x = variable, y = percent, fill = value) +
geom_bar(stat = "identity") +
theme_bw()
Resulting in:
Does this help?

Related

How to draw stacked barplot on the summed data

For data called df that reads:
car suv pickup
1 2 1
2 3 4
4 1 2
5 4 2
3 1 1
total = apply(df,1,sum)
barplot(total,col= rainbow(5))
So far I have drawn a bar plot of the total number of vehicles, which is the sum of each row. What I want now is a stacked bar plot in which each total is split into its components.
At the moment each bar shows only the total, with no divisions indicating that, for example, 1 car, 2 suv and 1 pickup add up to a total of 4.
Note: this is different from barplot(matrix(df)), because that just divides the data by car, suv and pickup and disregards the row totals.
You can achieve this easily using ggplot2 and reshape2.
You will need an ID column to track the rows, so I have added that in. I melt the data to long type so that the different groups can be managed and plotted accordingly.
Then plot using geom_bar, specifying the row ids as the x axis and the groupings (fill and colour) for the stack plot and legend.
library(reshape2)
library(ggplot2)
df <- data.frame("ID" = c(1,2,3,4,5), "car" = c(1,2,4,5,3), "suv" = c(2,3,1,4,1), "pickup" = c(1, 4, 2, 2, 1))
long_df <- melt(df, id.vars = "ID", value.name = "Number", variable.name = "Type")
ggplot(data = long_df, aes(x = ID, y = Number)) +
geom_bar(aes(fill = Type, colour = Type),
stat = "identity",
position = "stack")
With base R (barplot() stacks the rows of a matrix, so the vehicle columns are transposed into rows):
barplot(t(as.matrix(df[, c("car", "suv", "pickup")])),
        names.arg = df$ID, legend.text = TRUE)
Are you after something like this?
library(tidyverse)
df %>%
rowid_to_column("row") %>%
gather(k, v, -row) %>%
ggplot(aes(row, v, fill = k)) +
geom_col()
We use a stacked barplot here, so there is no need to manually calculate the sum. The key here is to transform data from wide to long and keep track of the row.
Sample data
df <- read.table(text =
"car suv pickup
1 2 1
2 3 4
4 1 2
5 4 2
3 1 1", header = T)
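To show the row totals explicitly on top of the stacked bars, the positions returned by barplot() can be reused for text labels. This is a minimal base R sketch on the sample data above; the names total, m and bp are only illustrative.

```r
df <- read.table(text = "car suv pickup
1 2 1
2 3 4
4 1 2
5 4 2
3 1 1", header = TRUE)

total <- rowSums(df)   # same values as apply(df, 1, sum)
m <- t(as.matrix(df))  # rows = vehicle types, columns = original rows

# barplot() returns the x positions of the bars, used below for the labels;
# ylim leaves some room above the tallest bar
bp <- barplot(m, names.arg = seq_len(nrow(df)), col = rainbow(nrow(m)),
              legend.text = TRUE, ylim = c(0, max(total) * 1.2))
text(bp, total, labels = total, pos = 3)  # write each total above its bar
```

Each stacked bar ends at the row total, so the height of the stack and the label agree by construction.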

Swimmer plot in R (ggplot): How to order stacked bars?

I have a question about the ordering of stacked bars in a swimmer plot made with ggplot2 in R.
I have a sample dataset of (artificial) patients, who receive treatments.
library(tidyverse)
df <- read.table(text="patient start_t_1 t_1_duration start_t_2 t_2_duration start_t_3 t_3_duration start_t_4 t_4_duration end
1 0 1.5 1.5 3 NA NA 4.5 10 10
2 0 2 4.5 2 NA NA 2 2.5 10
3 0 5 5 2 7 0.5 7.5 2 9.5
4 0 8 NA NA NA NA 8 2 10", header=TRUE)
All patients start the first treatment at time = 0. Subsequently, patients get different treatments (numbered t_2 up to t_4).
I tried to plot the swimmer plot, using the following code:
df %>%
gather(variable, value, c(t_1_duration, t_2_duration, t_3_duration, t_4_duration)) %>%
ggplot(aes(x = patient, y = value, fill = variable)) +
geom_bar(stat = "identity") +
coord_flip()
However, the treatments are not displayed in the right order.
For example: patient 3 receives all treatments in consecutive order, while patient 2 first receives treatment 1, then 4 and finally 2.
So, simply reversing the order does not work.
How do I order the stacked bars in a chronological way?
What about this:
df %>%
gather(variable, value, c(t_1_duration, t_2_duration, t_3_duration,t_4_duration)) %>%
ggplot(aes(x = patient,
y = value,
# here you can specify the order of the variable
fill = factor(variable,
levels =c("t_4_duration", "t_3_duration", "t_2_duration","t_1_duration")))) +
geom_bar(stat = "identity") +
coord_flip()+ guides(fill=guide_legend("My title"))
EDIT:
This took a while, because it involves a bit of a hack. I don't think it is a dupe of that question, because it also involves some data reshaping:
library(reshape2)
# divide starts and duration
starts <- df %>% select(patient, start_t_1, start_t_2, start_t_3, start_t_4)
duration <- df %>% select(patient, t_1_duration,t_2_duration, t_3_duration, t_4_duration)
# here you melt them
starts <- melt(starts, id = 'patient') %>%
mutate(keytreat = substr(variable,nchar(as.vector(variable))-2, nchar(as.vector(variable)))) %>%
`colnames<-`(c("patient", "variable", "start","keytreat")) %>% select(-variable)
duration <- melt(duration, id = 'patient') %>% mutate(keytreat = substr(variable,1, 3)) %>%
`colnames<-`(c("patient", "variable", "duration","keytreat")) %>% select(-variable)
# join
dats <- starts %>% left_join(duration) %>% arrange(patient, start) %>% filter(!is.na(start))
# here the part for the plot
bars <- map(unique(dats$patient)
, ~geom_bar(stat = "identity", position = "stack"
, data = dats %>% filter(patient == .x)))
dats %>%
ggplot(aes(x = patient,
y = duration,
fill = reorder(keytreat,-start))) +
bars +
guides(fill=guide_legend("ordering")) + coord_flip()
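As an alternative to stacking, each treatment can be drawn as a rectangle from its start to its start plus duration, which makes the chronological order automatic. This is a different technique from the stacked-bar answers above, shown as a base R sketch; the names treat, long and cols and the colours are arbitrary choices, and the end column of the sample data is not used.

```r
df <- read.table(text = "patient start_t_1 t_1_duration start_t_2 t_2_duration start_t_3 t_3_duration start_t_4 t_4_duration end
1 0 1.5 1.5 3 NA NA 4.5 10 10
2 0 2 4.5 2 NA NA 2 2.5 10
3 0 5 5 2 7 0.5 7.5 2 9.5
4 0 8 NA NA NA NA 8 2 10", header = TRUE)

treat <- paste0("t_", 1:4)

# One row per patient and treatment, with start and duration side by side
long <- do.call(rbind, lapply(treat, function(k) {
  data.frame(patient = df$patient,
             treatment = k,
             start = df[[paste0("start_", k)]],
             duration = df[[paste0(k, "_duration")]])
}))
long <- long[!is.na(long$start), ]               # drop treatments not given
long$end <- long$start + long$duration
long <- long[order(long$patient, long$start), ]  # chronological per patient

# Empty plot, then one rectangle per treatment period
cols <- c(t_1 = "tomato", t_2 = "steelblue", t_3 = "gold", t_4 = "seagreen")
plot(0, 0, type = "n", xlim = c(0, max(long$end)),
     ylim = c(0.5, nrow(df) + 0.5), xlab = "time", ylab = "patient",
     yaxt = "n")
axis(2, at = df$patient, las = 1)
rect(long$start, long$patient - 0.3, long$end, long$patient + 0.3,
     col = cols[as.character(long$treatment)])
legend("topright", legend = treat, fill = cols)
```

With this layout, gaps and overlaps between treatments are visible as well, which stacked bars cannot show.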

ggplot2: comparing 2 groups through fraction of its members

Let's say we have 10000 users classified into 2 groups: lvl beginner and lvl pro.
Every user has a rank, going from 1 to 20.
The df:
# beginers
n <- 7000
user.id <- 1:n
lvl <- "beginer"
rank <- sample(1:20, n, replace = TRUE,
prob = seq(.9,0.1,length.out = 20))
df.beginer <- data.frame(user.id, rank, lvl)
# pros
n <- 3000
user.id <- 1:n
lvl <- "pro"
rank <- sample(1:20, n, replace = TRUE,
prob = seq(.9,0.3,length.out = 20))
df.pro <- data.frame(user.id, rank, lvl)
library(dplyr)
df <- bind_rows(df.beginer, df.pro)
df2 <- tbl_df(df) %>% group_by(lvl, rank) %>% mutate(count = n())
Problem 1:
I need a bar plot comparing the groups side by side, but instead of counts I need percentages, so that the bars of each group have the same maximum height (100%).
The plot I got so far:
library(ggplot2)
plot <- ggplot(df2, aes(rank))
plot + geom_bar(aes(fill=lvl), position="dodge")
Problem 2:
I need a line plot comparing the groups, so there will be 2 lines, but instead of counts I need percentages, so that the lines of each group have the same maximum height (100%).
The plot I got so far:
plot + geom_line(aes(y=count, color=lvl))
Problem 3:
Let's say that the ranks are cumulative, so a user who has rank 3 also has ranks 1 and 2. A user who has rank 20 has all ranks from 1 to 20.
So, when plotting, I want the plot to start with rank 1 having 100% of users,
rank 2 having somewhat fewer, rank 3 fewer still, and so on.
I got all this done in Tableau, but I really dislike it and want to show myself that R can handle all of this.
Thank you!
Three problems, three solutions:
problem 1 - calculate percentage and use geom_col
df %>%
group_by(rank, lvl)%>%
summarise(count = n()) %>%
group_by(lvl) %>%
mutate(count_perc = count / sum(count)) %>% # calculate percentage
ggplot(., aes(x = rank, y = count_perc))+
geom_col(aes(fill = lvl), position = 'dodge')
problem 2 - pretty much the same as problem 1 except use geom_line instead of geom_col
df %>%
group_by(rank, lvl)%>%
summarise(count = n()) %>%
group_by(lvl) %>%
mutate(count_perc = count / sum(count)) %>%
ggplot(., aes(x = rank, y = count_perc))+
geom_line(aes(colour = lvl))
problem 3 - make use of arrange and cumsum
df %>%
group_by(lvl, rank) %>%
summarise(count = n()) %>% # count by level and rank
group_by(lvl) %>%
arrange(desc(rank)) %>% # sort descending
mutate(cumulative_count = cumsum(count)) %>% # use cumsum
mutate(cumulative_count_perc = cumulative_count / max(cumulative_count)) %>%
ggplot(., aes(x = rank, y = cumulative_count_perc))+
geom_line(aes(colour = lvl))
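The same quantities can be cross-checked without dplyr. This base R sketch uses simulated data of the same shape as in the question (the seed is arbitrary); the names tab, share and at_least are made up for illustration.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible
df <- data.frame(
  lvl = rep(c("beginer", "pro"), times = c(7000, 3000)),
  rank = c(sample(1:20, 7000, replace = TRUE,
                  prob = seq(.9, .1, length.out = 20)),
           sample(1:20, 3000, replace = TRUE,
                  prob = seq(.9, .3, length.out = 20)))
)

# Counts per group and rank; factor() keeps ranks that have zero users
tab <- table(df$lvl, factor(df$rank, levels = 1:20))

# Problems 1 and 2: share of each rank within its group (each row sums to 1)
share <- prop.table(tab, margin = 1)

# Problem 3: share of users with at least a given rank,
# computed as a cumulative sum taken from the highest rank downwards
at_least <- t(apply(share, 1, function(p) rev(cumsum(rev(p)))))

# One line per group; rank 1 starts at 100%
matplot(1:20, t(at_least) * 100, type = "l", lty = 1,
        col = c("red", "blue"), xlab = "rank",
        ylab = "% of users with at least this rank")
legend("topright", legend = rownames(at_least), lty = 1,
       col = c("red", "blue"))
```

Dividing by the group size, rather than by the overall number of users, is what makes the two groups comparable even though one has 7000 members and the other 3000.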

R - ggplot2 parallel categorical plot

I am working with categorical longitudinal data. My data has 3 simple variables:
id variable value
1 1 1 c
2 1 2 b
3 1 3 c
4 1 4 c
5 1 5 c
...
Where variable is basically time, and value are the 3 possible categories one id can take.
I am interested in producing a "parallel" longitudinal graph with ggplot2, similar to this:
I am struggling a bit to get it right. What I have come up with so far is this:
dt0 %>% ggplot(aes(variable, value, group = id, colour = id)) +
geom_line(colour="grey70") +
geom_point(aes(colour=value, size = nn), size=4) +
scale_colour_brewer(palette="Set1") + theme_minimal()
The issue with this graph is that we can't really see the "thickness" of the "transitions" (the id lines).
I wondered if you could help me with the following:
a) making the id lines visible, or making them thicker according to the number of ids going from one state to the other;
b) resizing the points according to the number of ids in each state. I tried to do this with geom_point(aes(colour=value, size = nn), size=4), but it doesn't seem to work.
Thanks.
# data #
library(dplyr)
library(ggplot2)
library(reshape2) # for melt()
set.seed(10)
# generate random sequences #
dt = as.data.frame( cbind(id = 1:1000, replicate(5, sample( c('a', 'b', 'c'), prob = c(0.1,0.2,0.7), 1000, replace = T)) ) )
# transform to PP file #
dt = dt %>% melt(id.vars = c('id'))
# create a vector 1-0 if the activity was performed #
dt0 = dt %>% group_by(id) %>% mutate(variable = 1:n()) %>% arrange(id)
# create the number of people in that state #
dt0 = dt0 %>% count(id, variable, value)
dt0 = dt0 %>% group_by(variable, value, n) %>% mutate(nn = n())
# to produce the first graph # 
library(vcrpart)
otsplot(dt0$variable, factor(dt0$value), dt0$id)
You were so close with geom_point(aes(colour=value, size = nn), size=4). The problem is that you set size again outside aes() after mapping it inside aes(), so ggplot replaced the mapping to nn with the constant 4. Assuming you also want nn to scale the line thickness, you could tweak your code to this:
dt0 %>% ggplot(aes(variable, value, group = id, colour = id)) +
geom_line(colour="grey70", aes(size = nn)) +
geom_point(aes(colour=value, size = nn)) +
scale_colour_brewer(palette="Set1") + theme_minimal()
If you wanted to use a lag value for the line thickness, I would suggest adding that as a new column in dt0.
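One way to read the lag suggestion is to store, for each id and time point, the state at the previous time point, and then count how many ids make each transition. This is a base R sketch on simulated data like that in the question, not the answerer's code; the names long, prev, trans and n_transition are hypothetical.

```r
set.seed(10)
n_id <- 1000
n_time <- 5

# Random sequences as in the question, built directly in long form
wide <- replicate(n_time, sample(c("a", "b", "c"), n_id, replace = TRUE,
                                 prob = c(0.1, 0.2, 0.7)))
long <- data.frame(id = rep(seq_len(n_id), times = n_time),
                   variable = rep(seq_len(n_time), each = n_id),
                   value = as.vector(wide),
                   stringsAsFactors = FALSE)
long <- long[order(long$id, long$variable), ]

# Lagged value: the state of the same id at the previous time point
long$prev <- ave(long$value, long$id,
                 FUN = function(v) c(NA, head(v, -1)))

# Number of ids making each from -> to transition at each time point;
# rows with a missing prev (the first time point) are not counted
trans <- as.data.frame(table(variable = long$variable,
                             from = long$prev,
                             to = long$value),
                       responseName = "n_transition")
head(trans[trans$n_transition > 0, ])
```

The counts in n_transition could then be joined back onto the plotting data and mapped to the line size in place of nn.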

split data into groups in R

My data frame looks like this:
plant distance
one 0
one 1
one 2
one 3
one 4
one 5
one 6
one 7
one 8
one 9
one 9.9
two 0
two 1
two 2
two 3
two 4
two 5
two 6
two 7
two 8
two 9
two 9.5
I want to split the distance values of each plant into groups by a fixed interval (for instance, interval = 3) and compute the percentage of observations in each group. Finally, I want to plot the percentages of each group for each plant, similar to this:
my code:
library(ggplot2)
library(dplyr)
library(scales) # for percent
dat <- data %>%
mutate(group = factor(cut(distance, seq(0, max(distance), 3), F))) %>%
group_by(plant, group) %>%
summarise(percentage = n()) %>%
mutate(percentage = percentage / sum(percentage))
p <- ggplot(dat, aes(x = plant, y = percentage, fill = group)) +
geom_bar(stat = "identity", position = "stack")+
scale_y_continuous(labels=percent)
p
But in the resulting plot, shown below, group 4 is missing.
I found that dat was wrong: group 4 was NA.
The likely reason is that the last group is narrower than the interval of 3, so my question is how to fix it. Thank you in advance!
I have solved the problem. The reason is that cut(distance, seq(0, max(distance), 3), F) does not include the minimum and maximum values: the breaks stop at 9, so values above 9 fall outside every interval, and the lowest break, 0, is excluded because the intervals are open on the left by default.
Here is my solution:
dat <- my_data %>%
mutate(group = factor(cut(distance, seq(from = min(distance), by = 3, length.out = n()/ 3 + 1), include.lowest = TRUE))) %>%
count(plant, group) %>%
group_by(plant) %>%
mutate(percentage = n / sum(n))
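The effect of the breaks can be checked in isolation. This is a base R sketch on the data from the question, assuming the distances start at 0; the names interval, old, breaks and pct are made up for illustration, and the last break is extended to the next multiple of the interval rather than computed from n() as in the solution above.

```r
data <- data.frame(plant = rep(c("one", "two"), each = 11),
                   distance = c(0:9, 9.9, 0:9, 9.5))
interval <- 3

# Breaks from the question: 0, 3, 6, 9, so 0 and everything above 9 become NA
old <- cut(data$distance, seq(0, max(data$distance), interval))
sum(is.na(old))

# Extend the last break past the maximum and close the first interval
# on the left so that the minimum is included
breaks <- seq(0, ceiling(max(data$distance) / interval) * interval,
              by = interval)
data$group <- cut(data$distance, breaks, include.lowest = TRUE)

# Share of each group within each plant (each row sums to 1)
pct <- prop.table(table(data$plant, data$group), margin = 1)
round(pct, 3)
```

Here the breaks are 0, 3, 6, 9 and 12, so the fourth group covers the values between 9 and the maximum and no observation is left without a group.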
