Split data into groups in R

My data frame looks like this:
plant distance
one 0
one 1
one 2
one 3
one 4
one 5
one 6
one 7
one 8
one 9
one 9.9
two 0
two 1
two 2
two 3
two 4
two 5
two 6
two 7
two 8
two 9
two 9.5
I want to split the distance values of each plant into groups by a fixed interval (for instance, interval = 3) and compute the percentage of each group. Finally, I want to plot the percentages of each group for each plant, similar to this:
my code:
library(ggplot2)
library(dplyr)
dat <- data %>%
mutate(group = factor(cut(distance, seq(0, max(distance), 3), F))) %>%
group_by(plant, group) %>%
summarise(percentage = n()) %>%
mutate(percentage = percentage / sum(percentage))
p <- ggplot(dat, aes(x = plant, y = percentage, fill = group)) +
geom_bar(stat = "identity", position = "stack")+
scale_y_continuous(labels = scales::percent) # percent() comes from the scales package
p
But in my plot, shown below, group 4 is missing.
I also found that dat was wrong: group 4 was NA.
The likely reason is that the last group is narrower than the interval of 3. How can I fix this? Thank you in advance!

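I have solved the problem. The reason is that cut(distance, seq(0, max(distance), 3), F) did not include the maximum and minimum values: the breaks stop at 9, so the values above 9 (9.9 and 9.5) fall outside every interval, and because the intervals are left-open by default, 0 is excluded as well. Both cases become NA.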
Here is my solution:
dat <- my_data %>%
mutate(group = factor(cut(distance, seq(from = min(distance), by = 3, length.out = n()/ 3 + 1), include.lowest = TRUE))) %>%
count(plant, group) %>%
group_by(plant) %>%
mutate(percentage = n / sum(n))
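The solution above sizes the breaks from the number of rows (n()), which happens to cover the range of this data. As a minimal sketch of an alternative, assuming distances are non-negative, the breaks can be derived from the data range instead. The names interval and breaks are introduced here only for illustration.

```r
library(dplyr)

# Sample data from the question
my_data <- data.frame(
  plant    = rep(c("one", "two"), each = 11),
  distance = c(0:9, 9.9, 0:9, 9.5)
)

interval <- 3
# Round the upper limit up to the next multiple of the interval,
# so the largest value still falls inside the last bin
breaks <- seq(0, ceiling(max(my_data$distance) / interval) * interval,
              by = interval)

dat <- my_data %>%
  # include.lowest = TRUE closes the first bin on the left, so 0 is kept
  mutate(group = cut(distance, breaks, include.lowest = TRUE)) %>%
  count(plant, group) %>%
  group_by(plant) %>%
  mutate(percentage = n / sum(n)) %>%
  ungroup()
```

With this data the breaks are 0, 3, 6, 9, and 12, so the short last group (9.9 and 9.5) gets its own bin instead of becoming NA.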

Related

100% percentage barplot in R

I have a dataset like so:
I want to create a 100% bar plot from this, with one 100% bar for status and one 100% bar for Type, like so:
The picture only has one bar for status, but I want two bars side by side: one for status and one for Type.
Any help would be appreciated. I want to do this in R.
Simple base R solution:
p1 <- as.matrix(prop.table(table(data$status))) * 100
p2 <- as.matrix(prop.table(table(data$Type))) * 100
op <- par(mfrow=c(1,2), las=1, mar=c(3,4,1,0))
barplot(p1, legend=TRUE, names="status", ylab="Percent")
barplot(p2, legend=TRUE, names="Type")
par(op)
data <- data.frame(id=1:10,
status=c("P","F","F","P","F","P","P","F","P","P"),
Type=c("full","full","full","part","part","full","full","part","part","full"))
data
id status Type
1 1 P full
2 2 F full
3 3 F full
4 4 P part
5 5 F part
6 6 P full
7 7 P full
8 8 F part
9 9 P part
10 10 P full
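The two-panel approach above can also be collapsed into a single plot. The following is only a sketch of one way to do it in base R: both columns are stacked into one long table (long and pct are names made up for this example), and prop.table() with margin = 2 makes every column sum to 100.

```r
data <- data.frame(id = 1:10,
                   status = c("P","F","F","P","F","P","P","F","P","P"),
                   Type = c("full","full","full","part","part",
                            "full","full","part","part","full"))

# Stack both columns into one long table: one row per (variable, value) pair
long <- data.frame(
  variable = rep(c("status", "Type"), each = nrow(data)),
  value    = c(as.character(data$status), as.character(data$Type))
)

# Column-wise percentages: each column (variable) sums to 100
pct <- prop.table(table(long$value, long$variable), margin = 2) * 100

# One stacked 100% bar per variable, side by side
barplot(pct, legend.text = TRUE, ylab = "Percent")
```

Because both variables share one table, the legend lists all four values, and the values that do not belong to a variable simply contribute segments of height zero.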
Maybe with ggplot2
library(tidyverse) # provides dplyr, tidyr, and ggplot2
data %>%
  pivot_longer(-id) %>%
  group_by(name, value) %>%
  summarise(n = n()) %>%
  ggplot(aes(fill = value, y = n, x = name)) +
  geom_bar(position = "fill", stat = "identity") # Needs polishing
Although the information provided is a little scarce, you seem to be looking for a stacked bar plot with percentages. You might try something like:
library(tidyverse)
# Some sample data (rbinom() is used because purrr::rbernoulli() is deprecated):
dta <- tibble(id = 1:10,
              status = rbinom(10, 1, 0.3) == 1,
              type = rbinom(10, 1, 0.6) == 1)
# Transformation and plotting:
dta %>%
pivot_longer(c(status, type), names_to = "variable") %>%
group_by(variable, value) %>%
summarize(percent = n()) %>%
mutate(percent = percent / sum(percent)) %>%
ungroup() %>%
ggplot() +
aes(x = variable, y = percent, fill = value) +
geom_bar(stat = "identity") +
theme_bw()
Resulting in:
Does this help?

How to draw stacked barplot on the summed data

For data called df that reads:
car suv pickup
1 2 1
2 3 4
4 1 2
5 4 2
3 1 1
total = apply(df,1,sum)
barplot(total,col= rainbow(5))
So far I am plotting a barplot of the total number of vehicles, which is the sum of each row. What I want now is to present it as a stacked barplot of that sum.
For now, it just shows the total, without any segments indicating that, for example, 1 car, 2 suv, and 1 pickup add up to a total of 4.
Note: this is different from barplot(matrix(df)), because that just divides the data by car, suv, and pickup and disregards the row totals.
You can achieve this easily using ggplot2 and reshape2.
You will need an ID column to track the rows, so I have added that in. I melt the data to long type so that the different groups can be managed and plotted accordingly.
Then plot using geom_bar, specifying the row ids as the x axis and the groupings (fill and colour) for the stack plot and legend.
library(reshape2)
library(ggplot2)
df <- data.frame("ID" = c(1,2,3,4,5), "car" = c(1,2,4,5,3), "suv" = c(2,3,1,4,1), "pickup" = c(1, 4, 2, 2, 1))
# melt() is called directly: neither reshape2 nor ggplot2 provides the %>% pipe
long_df <- melt(df, id.vars = "ID", value.name = "Number", variable.name = "Type")
ggplot(data = long_df, aes(x = ID, y = Number)) +
geom_bar(aes(fill = Type, colour = Type),
stat = "identity",
position = "stack")
With base R
# Transpose so that each column is one row of df and each row is a vehicle type
counts <- t(as.matrix(df[, c("car", "suv", "pickup")]))
barplot(counts, names.arg = df$ID, legend.text = TRUE)
Are you after something like this?
library(tidyverse)
df %>%
rowid_to_column("row") %>%
gather(k, v, -row) %>%
ggplot(aes(row, v, fill = k)) +
geom_col()
We use a stacked barplot here, so there is no need to manually calculate the sum. The key here is to transform data from wide to long and keep track of the row.
Sample data
df <- read.table(text =
"car suv pickup
1 2 1
2 3 4
4 1 2
5 4 2
3 1 1", header = T)
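The point that a stacked bar makes the manual sum unnecessary can be checked directly. The sketch below uses pivot_longer(), the current tidyr replacement for gather(), and compares the bar heights with the row sums. The names long_df and bar_heights are chosen only for this example.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

df <- data.frame(car    = c(1, 2, 4, 5, 3),
                 suv    = c(2, 3, 1, 4, 1),
                 pickup = c(1, 4, 2, 2, 1))

# Wide to long, keeping track of the original row
long_df <- df %>%
  mutate(row = row_number()) %>%
  pivot_longer(-row, names_to = "type", values_to = "n")

# The height of each stacked bar is the sum of its segments,
# which equals the row total computed by hand in the question
bar_heights <- long_df %>%
  group_by(row) %>%
  summarise(total = sum(n))

p <- ggplot(long_df, aes(x = row, y = n, fill = type)) +
  geom_col()
```

Printing p draws the stacked bars, and bar_heights$total matches apply(df, 1, sum) from the question.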

Swimmer plot in R (ggplot): How to order stacked bars?

I have a question about ordering the stacked bars in a swimmer plot using ggplot2 in R.
I have a sample dataset of (artificial) patients, who receive treatments.
library(tidyverse)
df <- read.table(text="patient start_t_1 t_1_duration start_t_2 t_2_duration start_t_3 t_3_duration start_t_4 t_4_duration end
1 0 1.5 1.5 3 NA NA 4.5 10 10
2 0 2 4.5 2 NA NA 2 2.5 10
3 0 5 5 2 7 0.5 7.5 2 9.5
4 0 8 NA NA NA NA 8 2 10", header=TRUE)
All patients start the first treatment at time = 0. Subsequently, patients get different treatments (numbered t_2 up to t_4).
I tried to plot the swimmer plot, using the following code:
df %>%
gather(variable, value, c(t_1_duration, t_2_duration, t_3_duration, t_4_duration)) %>%
ggplot(aes(x = patient, y = value, fill = variable)) +
geom_bar(stat = "identity") +
coord_flip()
However, the treatments are not displayed in the right order.
For example, patient 3 receives all treatments in consecutive order, while patient 2 receives treatment 1 first, then 4, and finally 2.
So, simply reversing the order does not work.
How do I order the stacked bars in a chronological way?
What about this:
df %>%
gather(variable, value, c(t_1_duration, t_2_duration, t_3_duration,t_4_duration)) %>%
ggplot(aes(x = patient,
y = value,
# here you can specify the order of the variable
fill = factor(variable,
levels =c("t_4_duration", "t_3_duration", "t_2_duration","t_1_duration")))) +
geom_bar(stat = "identity") +
coord_flip()+ guides(fill=guide_legend("My title"))
EDIT:
This took a while, because it involves a kind of hack. I don't think it is a duplicate of that question, because it also involves some data reshaping:
library(reshape2)
# divide starts and duration
starts <- df %>% select(patient, start_t_1, start_t_2, start_t_3, start_t_4)
duration <- df %>% select(patient, t_1_duration,t_2_duration, t_3_duration, t_4_duration)
# here you melt them
starts <- melt(starts, id = 'patient') %>%
mutate(keytreat = substr(variable,nchar(as.vector(variable))-2, nchar(as.vector(variable)))) %>%
`colnames<-`(c("patient", "variable", "start","keytreat")) %>% select(-variable)
duration <- melt(duration, id = 'patient') %>% mutate(keytreat = substr(variable,1, 3)) %>%
`colnames<-`(c("patient", "variable", "duration","keytreat")) %>% select(-variable)
# join
dats <- starts %>% left_join(duration) %>% arrange(patient, start) %>% filter(!is.na(start))
# here the part for the plot
bars <- map(unique(dats$patient)
, ~geom_bar(stat = "identity", position = "stack"
, data = dats %>% filter(patient == .x)))
dats %>%
ggplot(aes(x = patient,
y = duration,
fill = reorder(keytreat,-start))) +
bars +
guides(fill=guide_legend("ordering")) + coord_flip()
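Since the data already contains a start time for every treatment, the ordering problem can also be sidestepped by drawing each treatment as an interval at its own position instead of stacking bars. This is a different technique from the answer above, shown only as a sketch: it uses geom_rect(), and the names treatments, long, and end are introduced for the example.

```r
library(ggplot2)

df <- read.table(text = "patient start_t_1 t_1_duration start_t_2 t_2_duration start_t_3 t_3_duration start_t_4 t_4_duration end
1 0 1.5 1.5 3 NA NA 4.5 10 10
2 0 2 4.5 2 NA NA 2 2.5 10
3 0 5 5 2 7 0.5 7.5 2 9.5
4 0 8 NA NA NA NA 8 2 10", header = TRUE)

# Build one row per patient and treatment with its start and duration
treatments <- paste0("t_", 1:4)
long <- do.call(rbind, lapply(treatments, function(tr) {
  data.frame(patient   = df$patient,
             treatment = tr,
             start     = df[[paste0("start_", tr)]],
             duration  = df[[paste0(tr, "_duration")]])
}))
long <- long[!is.na(long$start), ]       # drop treatments not received
long$end <- long$start + long$duration   # interval end on the time axis

# Each treatment is a rectangle from start to end on the patient's row
p <- ggplot(long) +
  geom_rect(aes(xmin = start, xmax = end,
                ymin = patient - 0.4, ymax = patient + 0.4,
                fill = treatment)) +
  scale_y_continuous(breaks = df$patient) +
  labs(x = "time", y = "patient")
```

Because every rectangle is placed by its own start value, the chronological order follows from the data. If two treatments of one patient overlap in time, their rectangles overlap in the plot as well.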

ggplot2: comparing 2 groups through fraction of its members

Let's say we have 10,000 users classified into two groups: lvl beginner and lvl pro.
Every user has a rank, going from 1 to 20.
The df:
# beginers
n <- 7000
user.id <- 1:n
lvl <- "beginer"
rank <- sample(1:20, n, replace = TRUE,
prob = seq(.9,0.1,length.out = 20))
df.beginer <- data.frame(user.id, rank, lvl)
# pros
n <- 3000
user.id <- 1:n
lvl <- "pro"
rank <- sample(1:20, n, replace = TRUE,
prob = seq(.9,0.3,length.out = 20))
df.pro <- data.frame(user.id, rank, lvl)
library(dplyr)
df <- bind_rows(df.beginer, df.pro)
df2 <- as_tibble(df) %>% group_by(lvl, rank) %>% mutate(count = n()) # as_tibble() replaces the deprecated tbl_df()
Problem 1:
I need a bar plot comparing the groups side by side, but instead of counts I need percentages, so that the bars of each group have the same maximum height (100%).
The plot I got so far:
library(ggplot2)
plot <- ggplot(df2, aes(rank))
plot + geom_bar(aes(fill=lvl), position="dodge")
Problem 2:
I need a line plot comparing the groups, so there will be two lines, but instead of counts I need percentages, so that the lines of each group have the same maximum height (100%).
The plot I got so far:
plot + geom_line(aes(y=count, color=lvl))
Problem 3:
Let's say that the ranks are cumulative, so a user who has rank 3 also has ranks 1 and 2, and a user who has rank 20 has all ranks from 1 to 20.
So, when plotting, I want the plot to start with rank 1 having 100% of users; rank 2 will have somewhat fewer, rank 3 fewer still, and so on.
I got all of this done in Tableau, but I really dislike it and want to show myself that R can handle all of this.
Thank you!
Three problems, three solutions:
problem 1 - calculate percentage and use geom_col
df %>%
group_by(rank, lvl)%>%
summarise(count = n()) %>%
group_by(lvl) %>%
mutate(count_perc = count / sum(count)) %>% # calculate percentage
ggplot(., aes(x = rank, y = count_perc))+
geom_col(aes(fill = lvl), position = 'dodge')
problem 2 - pretty much the same as problem 1 except use geom_line instead of geom_col
df %>%
group_by(rank, lvl)%>%
summarise(count = n()) %>%
group_by(lvl) %>%
mutate(count_perc = count / sum(count)) %>%
ggplot(., aes(x = rank, y = count_perc))+
geom_line(aes(colour = lvl))
problem 3 - make use of arrange and cumsum
df %>%
group_by(lvl, rank) %>%
summarise(count = n()) %>% # count by level and rank
group_by(lvl) %>%
arrange(desc(rank)) %>% # sort descending
mutate(cumulative_count = cumsum(count)) %>% # use cumsum
mutate(cumulative_count_perc = cumulative_count / max(cumulative_count)) %>%
ggplot(., aes(x = rank, y = cumulative_count_perc))+
geom_line(aes(colour = lvl))
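To see what the arrange and cumsum step in problem 3 computes, here is a small sketch on made-up data with a single level and six users. The data frame toy and the column share_at_least are hypothetical names used only here.

```r
library(dplyr)

# Made-up data: two users at rank 1, one at rank 2, three at rank 3
toy <- data.frame(lvl = "pro", rank = c(1, 1, 2, 3, 3, 3))

cum <- toy %>%
  count(lvl, rank, name = "count") %>%        # users per level and rank
  group_by(lvl) %>%
  arrange(desc(rank), .by_group = TRUE) %>%   # highest rank first
  mutate(share_at_least = cumsum(count) / sum(count)) %>%
  ungroup()
```

Going from the highest rank downwards, the running total counts every user who has reached at least that rank: 3 of 6 users for rank 3, 4 of 6 for rank 2, and all 6 for rank 1, so rank 1 ends up at 100%.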

plot line width (size) based on counts using ggplot or any other method in R

I have a dataset in long format. Each ID 'walks' 3 steps, and each step (variable name is step) can land on a different location (variable name is milestone). I want to draw all of the paths. Because some paths are more traveled than others, I want to make the width (size) of each path proportional to its count. I imagine it would be something like geom_line(aes(size=..count..)) in ggplot, but that doesn't work.
Below is my code, which includes the URL of the example dataset. My crude attempt at adding width was to dodge the lines, but the result is not proportional and it leaves cracks.
ddnew <- read.csv("https://raw.github.com/bossaround/question/master/data9.csv" )
ggplot(ddnew, aes(x=step, y=milestone, group=user_id)) +
geom_line(position = position_dodge(width=0.05)) +
scale_x_discrete(limits=c("0","1","2","3","4","5","6","7","8","9")) +
scale_y_discrete(limits=c("0","1","2","3","4","5","6","7","8","9"))
The plot from my current code looks like this, but you can see the cracks, and it's not proportional.
I was hoping this can look like a Sankey diagram with the width indicating counts.
Does this help?
library(ggplot2)
ddnew <- read.csv("https://raw.github.com/bossaround/question/master/data9.csv" )
ggplot(ddnew, aes(x=step, y=milestone, group=user_id)) +
stat_summary(geom="line", fun = "sum", aes(size=milestone),alpha=0.2, color="grey50")+ # fun replaces the deprecated fun.y
scale_x_discrete(limits=factor(0:2)) +
scale_y_discrete(limits=factor(0:10)) +
theme(panel.background = element_blank(),
legend.position = "none")
One option is to use the riverplot package. First you'll need to summarize your data so that you can define the edges and nodes.
library(riverplot)
library(dplyr)
library(tidyr)

paths <- spread(ddnew, step, milestone) %>%
  count(`1`, `2`, `3`)
paths
Source: local data frame [9 x 4]
Groups: 1, 2 [?]
`1` `2` `3` n
<int> <int> <int> <int>
1 1 2 3 7
2 1 2 10 8
3 1 3 2 1
4 1 4 8 1
5 1 10 2 118
6 1 10 3 33
7 1 10 4 2
8 1 10 5 1
9 1 10 NA 46
Next define your nodes (i.e. each combination of step and milestone).
prefix <- function(p, n) {paste(p, n, sep = '-')}
nodes <- distinct(ddnew, step, milestone) %>%
mutate(ID = prefix(step, milestone),
y = dense_rank(milestone)) %>%
select(ID, x = step, y)
Then define your edges:
e12 <- group_by(paths, N1 = `1`, N2 = `2`) %>%
summarise(Value = sum(n)) %>%
ungroup() %>%
mutate(N1 = prefix(1, N1),
N2 = prefix(2, N2))
e23 <- group_by(paths, N1 = `2`, N2 = `3`) %>%
filter(!is.na(N2)) %>%
summarise(Value = sum(n)) %>%
ungroup() %>%
mutate(N1 = prefix(2, N1),
N2 = prefix(3, N2))
edges <- bind_rows(e12, e23) %>%
mutate(Value = Value) %>%
as.data.frame()
Finally, make the plot:
style <- default.style()
style$srt <- '0' # display node labels horizontally
makeRiver(nodes, edges) %>% plot(default_style = style)
If you are looking for user-specific counts of paths, then this might help:
library(dplyr)
library(ggplot2)
ddnew <- read.csv("https://raw.github.com/bossaround/question/master/data9.csv" )
ddnew <- ddnew %>%
group_by(user_id) %>%
mutate(step_id = paste(step, collapse = ","),
milestone_id = paste(milestone, collapse = ",")) %>%
group_by(step_id, milestone_id) %>%
mutate(width = n())
ggplot(ddnew, aes(x=step, y=milestone, group=user_id)) +
geom_line(aes(size = width)) +
scale_x_discrete(limits=c("0","1","2","3","4","5","6","7","8","9")) +
scale_y_discrete(limits=c("0","1","2","3","4","5","6","7","8","9"))
The idea is to count unique user-specific paths and assign these counts as width in the geom_line() aesthetic.
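The counting idea can be tried on a small made-up data set without downloading the CSV. The sketch below counts users per path rather than rows per path. The data frame toy and the names user_paths, path_counts, and users are hypothetical and exist only for this example.

```r
library(dplyr)

# Made-up data: 4 users, 3 steps each; users 1, 2 and 4 share a path
toy <- data.frame(
  user_id   = rep(1:4, each = 3),
  step      = rep(1:3, times = 4),
  milestone = c(1, 2, 3,  1, 2, 3,  1, 10, 2,  1, 2, 3)
)

# One row per user with the full path written as a string
user_paths <- toy %>%
  arrange(user_id, step) %>%
  group_by(user_id) %>%
  summarise(path = paste(milestone, collapse = "-"), .groups = "drop")

# Number of users that share each path
path_counts <- count(user_paths, path, name = "users")

# Attach the count to every row so it can be mapped to the line width
toy_w <- toy %>%
  left_join(user_paths, by = "user_id") %>%
  left_join(path_counts, by = "path")
```

The users column can then be mapped to size in geom_line() (or to linewidth in ggplot2 3.4.0 and later). Counting users differs slightly from mutate(width = n()) above, which counts rows and is therefore proportional to the number of users only when every user has the same number of steps.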
