I have the following dataset, with variables indicating whether a person used their phone (a dummy variable: 1 = used the phone ("Yes"), 0 = did not ("No")), their ID, and the district and sub-district they live in. Note that the same person may have been recorded twice or more under different sub-districts. However, I want to count each such person only once, that is, consider only unique IDs.
district sub_district id used_phone
A SX 1 Yes
A SX 2 Yes
A SX 3 No
A SX 4 No
A SY 4 No
A SY 5 Yes
A SZ 6 Yes
A SX 6 Yes
A SZ 7 No
B RX 8 No
B RV 9 No
B RX 9 No
B RV 10 Yes
B RV 11 Yes
B RT 12 Yes
B RT 13 Yes
B RV 13 Yes
B RT 14 No
B RX 14 No
N.B: used_phone is a factor variable
For the above dataset, I want to plot a distribution of "whether a person used a phone" for which I was using the following code:
ggplot(df, aes(x=used_phone)) +
geom_bar(color = "black", fill = "aquamarine4", position = "dodge") +
labs(x="Used phone", y = "Number of people") +
ggtitle("Whether person used phone") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
This code works fine. However, I want to do two things:
Add % labels for each group (Yes and No) above the respective bars, while the y-axis still shows the count
Plot the graph so that it only considers the unique IDs
Looking forward to solving this with your help, as I am a novice in R.
Thanks,
Rachita
The duplicated ids are people recorded in several sub_districts at once, and you don't want to double count them, so I drop the sub_district variable.
Then I remove all duplicate ids, count phone use, and calculate the percentage. The resulting data frame is shown below.
The plot uses geom_col, with percentages on the y-axis formatted via scales.
I have commented out two lines of code that let you facet by district in your ggplot. The resulting diagram is attached at the bottom.
library(tidyverse)
df <- read.table(text="district sub_district id used_phone
A SX 1 Yes
A SX 2 Yes
A SX 3 No
A SX 4 No
A SY 4 No
A SY 5 Yes
A SZ 6 Yes
A SX 6 Yes
A SZ 7 No
B RX 8 No
B RV 9 No
B RX 9 No
B RV 10 Yes
B RV 11 Yes
B RT 12 Yes
B RT 13 Yes
B RV 13 Yes
B RT 14 No
B RX 14 No", header = T)
table(df$used_phone)
#>
#> No Yes
#> 9 10
ddf <- df %>%
select(-sub_district) %>% # drop sub_district
distinct(id, .keep_all = TRUE) %>% # keep one row per id
#group_by(district) %>%
count(used_phone) %>% # count phone use
mutate(pct = n / sum(n)) # calculate the share
ddf
#> # A tibble: 2 x 3
#> used_phone n pct
#> <chr> <int> <dbl>
#> 1 No 6 0.429
#> 2 Yes 8 0.571
ggplot(ddf, aes(used_phone, pct, fill = used_phone)) +
geom_col(position = 'dodge') +
#facet_wrap(~district) +
scale_fill_manual(values = c("aquamarine4", "aquamarine3")) +
scale_y_continuous(labels = scales::percent_format())
New Addition based on comment:
wants y-axis in counts
wants percentage as labels over the bar
wants as facet for district
ddf <- df %>%
select(-sub_district) %>% # drop sub_district
distinct(id, .keep_all = TRUE) %>% # keep one row per id
group_by(district) %>%
count(used_phone) %>% # count phone use within each district
mutate(pct = n / sum(n), # share within each district
label = paste0(round(pct*100, 2), '%'))
ggplot(ddf, aes(used_phone, n, fill = used_phone)) +
geom_col(position = 'dodge') +
facet_wrap(~district) +
scale_fill_manual(values = c("aquamarine4", "aquamarine3")) +
geom_text(aes(label = label),
position = position_stack(vjust = 1.05),
size = 3) +
labs(y='count')
*new addition*
change the basis for percent
ddf <- df %>%
select(-sub_district) %>% # drop sub_district
distinct(id, .keep_all = TRUE) %>% # keep one row per id
mutate(ssum = n()) %>% # total number of unique ids, before grouping
group_by(district) %>%
count(used_phone, ssum) %>% # count phone use within each district
mutate(pct = n / ssum, # share of all unique ids
label = paste0(round(pct*100, 2), '%'))
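I have introduced a new variable, ssum, which holds the total number of unique ids and is computed before grouping, so each percentage is a share of all people rather than of one district. That gives: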
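As a rough cross-check of these overall-basis percentages, the same numbers can be reproduced in base R without any packages. This is only a sketch on the example data from the question; the object names uniq, counts and pct are made up for illustration.

```r
# Base-R cross-check of the overall-basis percentages (no packages needed).
df <- read.table(text = "district sub_district id used_phone
A SX 1 Yes
A SX 2 Yes
A SX 3 No
A SX 4 No
A SY 4 No
A SY 5 Yes
A SZ 6 Yes
A SX 6 Yes
A SZ 7 No
B RX 8 No
B RV 9 No
B RX 9 No
B RV 10 Yes
B RV 11 Yes
B RT 12 Yes
B RT 13 Yes
B RV 13 Yes
B RT 14 No
B RX 14 No", header = TRUE)

# Keep the first row seen for each id, so every person counts once
uniq <- df[!duplicated(df$id), ]

# Counts per district (rows) and answer (columns)
counts <- table(uniq$district, uniq$used_phone)

# Share of ALL unique people (not of each district), in percent
pct <- round(100 * counts / sum(counts), 2)

print(counts)
print(pct)
```

With 14 unique ids, each district has 3 "No" and 4 "Yes", so the labels come out as 21.43% and 28.57% in both facets.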
Here is one suggestion that could work:
Summarize your df by used_phone and count the total number of people who used a phone and who did not.
Based on the summarized count, you can calculate the percent share, and from that add a label column, which is just the percent share with a % sign.
You can then plot the new summarized df with ggplot. Use geom_text() to add percentage labels at the top of the bars, and use the vjust argument in position_stack() to adjust the label position.
df %>%
distinct(id, .keep_all = TRUE) %>% # one row per id, so each person is counted once
group_by(used_phone) %>%
summarize(n()) %>%
setNames(., c('used_phone', 'count')) %>%
mutate('share' = count/sum(count),
'label' = paste0(round(share*100, 2), '%')) -> df
ggplot(df, aes(y=count, x=used_phone)) +
geom_bar(stat='identity',
color = "black",
fill = "aquamarine4",
position = "dodge") +
geom_text(aes(label = label),
position = position_stack(vjust = 1.02),
size = 3) +
labs(title = 'Whether person used phone',
x = 'Used Phone',
y = 'Number of People') +
theme_bw()
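One detail worth checking when counting unique people is what distinct() compares. The sketch below uses a made-up three-row data frame (toy, n_whole_rows and n_by_id are hypothetical names) to show that distinct() without column arguments compares whole rows, while naming id keeps one row per person.

```r
library(dplyr)

# Hypothetical mini data: person 4 appears under two sub-districts
toy <- data.frame(
  sub_district = c("SX", "SX", "SY"),
  id           = c(3, 4, 4),
  used_phone   = c("No", "No", "No")
)

# distinct() with no columns compares whole rows, so both rows for id 4
# survive because their sub_district differs
n_whole_rows <- toy %>% distinct() %>% nrow()

# Naming id keeps only the first row for each person
n_by_id <- toy %>% distinct(id, .keep_all = TRUE) %>% nrow()

print(c(whole_rows = n_whole_rows, by_id = n_by_id))
```

On the question's data the same difference would show up as 19 rows versus 14 rows.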
I'm trying to represent the movements of patients between several treatment groups measured in 3 different years. However, there are dropouts: some patients from the 1st year are missing in the 2nd year, and some patients in the 2nd year weren't in the 1st. The same goes for the 3rd year. I have a label called "none" for these combinations, but I don't want it to appear in the plot.
An example plot with only 2 years:
EDIT
I have tried with geom_sankey as well (https://rdrr.io/github/davidsjoberg/ggsankey/man/geom_sankey.html).
It comes closer to what I'm looking for, but I don't know how to omit the stratum groups without labels (NA). In this case, I'm using my full data, not a dummy example. I can't share it, but I can try to create an example if needed. This is the code I've tried:
data = bind_rows(data_2015,data_2017,data_2019) %>%
select(sip, Year, Grp) %>%
mutate(Grp = factor(Grp), Year = factor(Year)) %>%
arrange(sip) %>%
pivot_wider(names_from = Year, values_from = Grp)
df_sankey = data %>% make_long(`2015`,`2017`,`2019`)
ggplot(df_sankey, aes(x = x,
next_x = next_x,
node = node,
next_node = next_node,
fill = factor(node),
label = node,
color=factor(node) )) +
geom_sankey(flow.alpha = 0.5, node.color = 1) +
geom_sankey_label(size = 3.5, color = 1, fill = "white") +
scale_fill_viridis_d() +
scale_colour_viridis_d() +
theme_sankey(base_size = 16) +
theme(legend.position = "none") + xlab('')
Figure:
Any idea on how to omit the missing groups as strata in each year (without omitting them from the alluvia) would be super helpful. Thanks!
Solved! The solution was much easier than I thought. I'll leave it here in case someone else struggles with a similar problem.
Create a wide table with one row per patient (sip) and one column per cohort year, holding the group.
# Data with 3 cohorts for years 2015, 2017 and 2019
# Grp is a factor with 6 levels: 1 to 6
# sip is a unique ID
library(tidyverse)
library(ggsankey) # provides make_long(), geom_sankey() and theme_sankey()
data_wide = data %>%
select(sip, Year, Grp) %>%
mutate(Grp = factor(Grp, levels=c(1:6)), Year = factor(Year)) %>%
arrange(sip) %>%
pivot_wider(names_from = Year, values_from = Grp)
Using the ggsankey package, we can transform the wide table into the specific format the package expects. There's already a useful function for this.
df_sankey = data_wide %>% make_long(`2015`,`2017`,`2019`)
# The tibble accounts for every change in X axis and Y categorical value (node):
> head(df_sankey)
# A tibble: 6 × 4
x node next_x next_node
<fct> <chr> <fct> <chr>
1 2015 3 2017 2
2 2017 2 2019 2
3 2019 2 NA NA
4 2015 NA 2017 1
5 2017 1 2019 1
6 2019 1 NA NA
It looks like passing the pivot_wider() output to make_long() produces a row for every patient in every year, so a patient who is missing in a given year shows up as an NA node. Drop the NA values in 'node' and create the plot.
df_sankey %>% drop_na(node) %>%
ggplot(aes(x = x,
next_x = next_x,
node = node,
next_node = next_node,
fill = factor(node),
label = node,
color=factor(node) )) +
geom_sankey(flow.alpha = 0.5, node.color = 1) +
geom_sankey_label(size = 3.5, color = 1, fill = "white") +
scale_fill_viridis_d() +
scale_colour_viridis_d() +
theme_sankey(base_size = 16) +
theme(legend.position = "none") + xlab('')
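To see where the NA nodes come from, here is a small sketch with made-up data (toy and its values are hypothetical, and only two years are used). It uses tidyr's pivot_longer() as a stand-in for make_long(), which is an assumption made to keep the example free of ggsankey; the real make_long() output has additional next_x and next_node columns.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: patient 2 has no record in 2015
toy <- tibble(
  sip  = c(1, 1, 2, 3, 3),
  Year = c(2015, 2017, 2017, 2015, 2017),
  Grp  = c("3", "2", "1", "2", "2")
)

# Widening creates one cell per patient and year; missing years become NA
toy_wide <- toy %>%
  pivot_wider(names_from = Year, values_from = Grp)

# Going back to long format keeps those NA cells as explicit rows
toy_long <- toy_wide %>%
  pivot_longer(-sip, names_to = "x", values_to = "node")

# Dropping rows with an NA node removes the unwanted stratum
toy_clean <- toy_long %>% drop_na(node)

print(toy_wide)
print(toy_clean)
```

The long table has six rows, one of them with an NA node for patient 2 in 2015; after drop_na(node) five rows remain.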
My dataframe looks like this (1322 rows in total):
I'd like to make a bar plot with the percentage of each rating of the CFS score. It should look similar to this:
With this code, I can make a single bar plot for the column cfs_triage:
ggplot(data = df) +
geom_bar(mapping = aes(x = cfs_triage, y = (..count..)/sum(..count..)))
But I can't figure out how to make one with the three variables next to one another.
Thank you in advance to everyone who helps me make this bar plot with the percentage of ratings for these three variables! (I'm not sure my explanation is very clear, but I hope it is :))
Your best bet here is to pivot your data into long format. We don't have your data, but we can reproduce a similar data set like this:
set.seed(1)
df <- data.frame(cfs_triage = sample(10, 1322, TRUE, prob = 1:10),
cfs_silver = sample(10, 1322, TRUE),
cfs_student = sample(10, 1322, TRUE, prob = 10:1))
df[] <- lapply(df, function(x) { x[sample(1322, 300)] <- NA; x})
Now the dummy data set looks a lot like yours:
head(df)
#> cfs_triage cfs_silver cfs_student
#> 1 9 NA 1
#> 2 8 4 2
#> 3 NA 8 NA
#> 4 NA 10 9
#> 5 9 5 NA
#> 6 3 1 NA
If we pivot into long format, then we will end up with two columns: one containing the values, and one containing the column name that the value belonged to in the original data frame:
library(tidyverse)
df_long <- df %>%
pivot_longer(everything())
head(df_long)
#> # A tibble: 6 x 2
#> name value
#> <chr> <int>
#> 1 cfs_triage 9
#> 2 cfs_silver NA
#> 3 cfs_student 1
#> 4 cfs_triage 8
#> 5 cfs_silver 4
#> 6 cfs_student 2
This then allows us to plot with value on the x axis, and we can use name as a grouping / fill variable:
ggplot(df_long, aes(value, fill = name)) +
geom_bar(position = 'dodge') +
scale_fill_grey(name = NULL) +
theme_bw(base_size = 16) +
scale_x_continuous(breaks = 1:10)
#> Warning: Removed 900 rows containing non-finite values (`stat_count()`).
Created on 2022-11-25 with reprex v2.0.2
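The plot above shows counts. If percentages within each variable are wanted, as the question asks, one possible sketch is to count the long data and divide by each variable's total before plotting. It reuses the same simulated data; the names cfs, cfs_pct and p are made up here, and dropping the NA ratings before computing shares is an assumption about what the percentages should be based on.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Same simulated data as above, stored under a different name
set.seed(1)
cfs <- data.frame(cfs_triage  = sample(10, 1322, TRUE, prob = 1:10),
                  cfs_silver  = sample(10, 1322, TRUE),
                  cfs_student = sample(10, 1322, TRUE, prob = 10:1))
cfs[] <- lapply(cfs, function(x) { x[sample(1322, 300)] <- NA; x })

cfs_pct <- cfs %>%
  pivot_longer(everything()) %>%
  drop_na(value) %>%              # assumption: shares exclude missing ratings
  count(name, value) %>%          # number of ratings per variable and score
  group_by(name) %>%
  mutate(pct = n / sum(n)) %>%    # share within each variable
  ungroup()

p <- ggplot(cfs_pct, aes(factor(value), pct, fill = name)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_grey(name = NULL) +
  labs(x = "CFS score", y = "Share of ratings") +
  theme_bw(base_size = 16)

print(p)
```

Because the shares are computed per variable, the bars of each fill colour sum to 100%.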
Maybe you need something like this. The formatting was taken from @Allan Cameron (many thanks!):
library(tidyverse)
library(scales)
df %>%
mutate(id = row_number()) %>%
pivot_longer(-id) %>%
group_by(id) %>%
mutate(percent = value/sum(value, na.rm = TRUE)) %>%
mutate(percent = ifelse(is.na(percent), 0, percent)) %>%
mutate(my_label = str_trim(paste0(format(100 * percent, digits = 1), "%"))) %>%
ggplot(aes(x = factor(name), y = percent, fill = factor(name), label = my_label))+
geom_col(position = position_dodge())+
geom_text(aes(label = my_label), vjust=-1) +
facet_wrap(. ~ id, nrow=1, strip.position = "bottom")+
scale_fill_grey(name = NULL) +
scale_y_continuous(labels = scales::percent)+
theme_bw(base_size = 16)+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
I want to plot the distribution of the overall number of wins of a player. I would like the last section of the x-axis to be a "more than the values before" category.
Example data:
game_data <- data.frame(player = c(1,2,3,4,5, 6), n_wins = c(1,8,2,3,6,4))
game_data
player n_wins
1 1 1
2 2 8
3 3 2
4 4 3
5 5 6
6 6 4
The following code creates a category "NA", but I want it to be 5+ (= more than 5 wins).
game_data %>% group_by(player) %>% summarise(allwins = sum(n_wins)) %>%
ggplot(aes(x = cut(allwins, breaks = seq(1,6, by = 1)), include.lowest=TRUE)) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
scale_y_continuous(labels=scales::percent) +
labs(title="Distribution of Wins", subtitle="", y="Fraction of Players", x="Number of Wins")
I do not only want to change the label, I want it to automatically create the last category.
You can do this by including +Inf as a break. Note that you have no values equal to 5, so you need to add drop=FALSE in scale_x_discrete to keep that empty category on the axis:
set.seed(100)
game_data <- data.frame(player = c(1,2,3,4,5, 6), n_wins = c(1,8,2,3,6,4))
BR = c(0:5,+Inf)
game_data %>%
group_by(player) %>% summarise(allwins = sum(n_wins)) %>%
ggplot(aes(x = cut(allwins, breaks = BR,labels=c(1:5,"5+")))) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
scale_y_continuous(labels=scales::percent) +
labs(title="Distribution of Wins", subtitle="",
y="Fraction of Players", x="Number of Wins")+
scale_x_discrete(drop=FALSE)
Maybe a small comment, why do you need to summarize the data?
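The binning step can be checked on its own in base R before plotting. A minimal sketch on the example n_wins values (bins, counts and shares are names made up here):

```r
# Wins from the example data
n_wins <- c(1, 8, 2, 3, 6, 4)

# Six intervals: (0,1], (1,2], (2,3], (3,4], (4,5], (5,Inf]
bins <- cut(n_wins, breaks = c(0:5, Inf), labels = c(1:5, "5+"))

# table() keeps empty factor levels, so the "5" bin shows up with a count of 0
counts <- table(bins)
shares <- prop.table(counts)

print(counts)
print(round(shares, 3))
```

The two players with 6 and 8 wins land in the "5+" bin, and nobody lands in "5", which is why the plot needs drop=FALSE to keep that category.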
I have a dataframe of following form:
School_type Year fund rate
1 1998 8 0.1
0 1998 7 0.2
1 1999 9 0.11
0 1999 8 0.22
1 2000 10 0.12
0 2000 15 0.23
I would like to plot "fund" and "rate" for each school type, with year on the x-axis, so there are four lines (two higher and two lower), but I don't know how to implement this with two y-axis scales. Thanks in advance.
I am not sure if this is what you are looking for, but here is my two cents on your question.
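#load the packages used below (dplyr for %>%, mutate, filter; reshape2 for melt)
library(dplyr)
library(reshape2)
library(ggplot2)
#create the dataframe (school_type alternates 1, 0 as in the question's table)
df = data.frame("school_type" = rep(c(1, 0), 3), "year" = c("1998","1998","1999","1999","2000","2000"),
"fund" = c("8","7","9","8","10","15"), "rate" = c("0.1","0.2","0.11","0.22","0.12","0.23"))
#Modify the variable type
df$fund = as.numeric(as.character(df$fund))
df$rate = as.numeric(as.character(df$rate))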
#plot the log of the variables
df %>%
mutate(log_fund = log(fund),
log_rate = log(rate)) %>%
melt(id.vars = c("school_type","year")) %>%
filter(variable %in% c("log_fund","log_rate")) %>%
ggplot(aes(x = year, y = value, group = variable, color = variable, shape = variable)) +
geom_line(size = 1) +
geom_point(size = 3) +
facet_wrap(~ school_type) +
theme_bw()
Result:
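If two separate y-axis scales are needed, as the question asks, another possible sketch is ggplot2's secondary axis: rescale rate onto the fund range for drawing and let sec_axis() undo the rescaling for the right-hand labels. The scaling factor k and the object names funding and p are assumptions made for illustration, not part of the answer above.

```r
library(ggplot2)

# Data from the question's table
funding <- data.frame(
  school_type = rep(c(1, 0), 3),
  year        = rep(c(1998, 1999, 2000), each = 2),
  fund        = c(8, 7, 9, 8, 10, 15),
  rate        = c(0.1, 0.2, 0.11, 0.22, 0.12, 0.23)
)

# Assumed scaling factor: stretch rate so that its maximum matches fund's maximum
k <- max(funding$fund) / max(funding$rate)

p <- ggplot(funding, aes(x = year, colour = factor(school_type))) +
  geom_line(aes(y = fund, linetype = "fund")) +
  geom_line(aes(y = rate * k, linetype = "rate")) +   # drawn on the fund scale
  scale_y_continuous(
    name = "fund",
    sec.axis = sec_axis(~ . / k, name = "rate")        # right axis converts back to rate
  ) +
  scale_x_continuous(breaks = 1998:2000) +
  labs(colour = "school_type", linetype = NULL) +
  theme_bw()

print(p)
```

This gives four lines (two per school type); the choice of k only affects how the two series overlap visually, not the values read from either axis.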
I have a dataset in long format. Each ID 'walks' 3 steps, and each step (variable step) can land on a different location (variable milestone). I want to draw all of the paths. Because some paths are more traveled than others, I want to make the width (size) of each path proportional to its count. I imagine something like geom_line(aes(size=..count..)) in ggplot, but that doesn't work.
Below is my code, which includes the URL of the example dataset. My silly solution for adding width was to dodge the lines, but the result is not proportional and it leaves cracks.
ddnew <- read.csv("https://raw.github.com/bossaround/question/master/data9.csv" )
ggplot(ddnew, aes(x=step, y=milestone, group=user_id)) +
geom_line(position = position_dodge(width=0.05)) +
scale_x_discrete(limits=c("0","1","2","3","4","5","6","7","8","9")) +
scale_y_discrete(limits=c("0","1","2","3","4","5","6","7","8","9"))
The plot from my current code looks like this, but you can see the cracks, and it's not proportional.
I was hoping this can look like a Sankey diagram with the width indicating counts.
Does this help?
library(ggplot2)
ddnew <- read.csv("https://raw.github.com/bossaround/question/master/data9.csv" )
ggplot(ddnew, aes(x=step, y=milestone, group=user_id)) +
stat_summary(geom="line", fun = "sum", aes(size=milestone),alpha=0.2, color="grey50")+
scale_x_discrete(limits=factor(0:2)) +
scale_y_discrete(limits=factor(0:10)) +
theme(panel.background = element_blank(),
legend.position = "none")
One option is to use the riverplot package. First you'll need to summarize your data so that you can define the edges and nodes.
> library(riverplot)
> library(tidyverse) # spread(), count(), distinct() and %>% are used below
>
> paths <- spread(ddnew, step, milestone) %>%
+ count(`1`, `2`, `3`)
> paths
Source: local data frame [9 x 4]
Groups: 1, 2 [?]
`1` `2` `3` n
<int> <int> <int> <int>
1 1 2 3 7
2 1 2 10 8
3 1 3 2 1
4 1 4 8 1
5 1 10 2 118
6 1 10 3 33
7 1 10 4 2
8 1 10 5 1
9 1 10 NA 46
Next define your nodes (i.e. each combination of step and milestone).
prefix <- function(p, n) {paste(p, n, sep = '-')}
nodes <- distinct(ddnew, step, milestone) %>%
mutate(ID = prefix(step, milestone),
y = dense_rank(milestone)) %>%
select(ID, x = step, y)
Then define your edges:
e12 <- group_by(paths, N1 = `1`, N2 = `2`) %>%
summarise(Value = sum(n)) %>%
ungroup() %>%
mutate(N1 = prefix(1, N1),
N2 = prefix(2, N2))
e23 <- group_by(paths, N1 = `2`, N2 = `3`) %>%
filter(!is.na(N2)) %>%
summarise(Value = sum(n)) %>%
ungroup() %>%
mutate(N1 = prefix(2, N1),
N2 = prefix(3, N2))
edges <- bind_rows(e12, e23) %>%
mutate(Value = Value) %>%
as.data.frame()
Finally, make the plot:
style <- default.style()
style$srt <- '0' # display node labels horizontally
makeRiver(nodes, edges) %>% plot(default_style = style)
If you are looking for user-specific counts of paths, then this might help:
library(dplyr)
library(ggplot2)
ddnew <- read.csv("https://raw.github.com/bossaround/question/master/data9.csv" )
ddnew <- ddnew %>%
group_by(user_id) %>%
mutate(step_id = paste(step, collapse = ","),
milestone_id = paste(milestone, collapse = ",")) %>%
group_by(step_id, milestone_id) %>%
mutate(width = n())
ggplot(ddnew, aes(x=step, y=milestone, group=user_id)) +
geom_line(aes(size = width)) +
scale_x_discrete(limits=c("0","1","2","3","4","5","6","7","8","9")) +
scale_y_discrete(limits=c("0","1","2","3","4","5","6","7","8","9"))
The idea is to count unique user-specific paths and assign these counts as width in the geom_line() aesthetic.
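The counting step can be illustrated on a small made-up data set, without downloading the CSV. In this sketch the toy data and the names path_counts and n_users are hypothetical; it collapses each user's milestones into one path string and counts how many users share each path.

```r
library(dplyr)

# Hypothetical data: 4 users, 3 steps each; users 1, 2 and 4 take the same path
toy <- data.frame(
  user_id   = rep(1:4, each = 3),
  step      = rep(1:3, times = 4),
  milestone = c(1, 10, 2,  1, 10, 2,  1, 2, 3,  1, 10, 2)
)

path_counts <- toy %>%
  arrange(user_id, step) %>%                        # milestones in step order
  group_by(user_id) %>%
  summarise(path = paste(milestone, collapse = "-"),
            .groups = "drop") %>%                   # one path string per user
  count(path, name = "n_users")                     # users per distinct path

print(path_counts)
```

Note that in the code above, n() inside mutate() counts rows rather than users, so width is the number of users times the number of steps on that path. The widths remain proportional as long as every user has the same number of steps.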