I'm trying to represent the movements of patients between several treatment groups measured in 3 different years. However, there're dropouts where some patients from 1st year are missing in the 2nd year or there are patients in the 2nd year who weren't in the 1st. Same for 3rd year. I have a label called "none" for these combinations, but I don't want it to be in the plot.
An example plot with only 2 years:
EDIT
I have tried with geom_sankey as well (https://rdrr.io/github/davidsjoberg/ggsankey/man/geom_sankey.html).
Although it is more accurate to what I'm looking for. I don't know how to omit the stratum groups without labels (NA). In this case, I'm using my full data, not a dummy example. I can't share it but I can try to create an example if needed. This is the code I've tried:
data = bind_rows(data_2015,data_2017,data_2019) %>%
select(sip, Year, Grp) %>%
mutate(Grp = factor(Grp), Year = factor(Year)) %>%
arrange(sip) %>%
pivot_wider(names_from = Year, values_from = Grp)
df_sankey = data %>% make_long(`2015`,`2017`,`2019`)
ggplot(df_sankey, aes(x = x,
next_x = next_x,
node = node,
next_node = next_node,
fill = factor(node),
label = node,
color=factor(node) )) +
geom_sankey(flow.alpha = 0.5, node.color = 1) +
geom_sankey_label(size = 3.5, color = 1, fill = "white") +
scale_fill_viridis_d() +
scale_colour_viridis_d() +
theme_sankey(base_size = 16) +
theme(legend.position = "none") + xlab('')
Figure:
Any idea how to omit the missing groups every year as stratum (without omitting them in the alluvium) will be super helpful. Thanks!
Solved! The solution was much easier I though. I'll leave here the solution in case someone else struggles with a similar problem.
Create a wide table of counts per every group / cohort.
# Data with 3 cohorts for years 2015, 2017 and 2019
# Grp is a factor with 3 levels: 1 to 6
# sip is a unique ID
library(tidyverse)
data_wide = data %>%
select(sip, Year, Grp) %>%
mutate(Grp = factor(Grp, levels=c(1:6)), Year = factor(Year)) %>%
arrange(sip) %>%
pivot_wider(names_from = Year, values_from = Grp)
Using ggsankey package we can transform it as the specific type the package expects. There's already an useful function for this.
df_sankey = data %>% make_long(`2015`,`2017`,`2019`)
# The tibble accounts for every change in X axis and Y categorical value (node):
> head(df_sankey)
# A tibble: 6 × 4
x node next_x next_node
<fct> <chr> <fct> <chr>
1 2015 3 2017 2
2 2017 2 2019 2
3 2019 2 NA NA
4 2015 NA 2017 1
5 2017 1 2019 1
6 2019 1 NA NA
Looks like using the pivot_wider() to pass it to make_long() created a situation where each combination for every value was completed, including missings as NA. Drop NA values in 'node' and create the plot.
df_sankey %>% drop_na(node) %>%
ggplot(aes(x = x,
next_x = next_x,
node = node,
next_node = next_node,
fill = factor(node),
label = node,
color=factor(node) )) +
geom_sankey(flow.alpha = 0.5, node.color = 1) +
geom_sankey_label(size = 3.5, color = 1, fill = "white") +
scale_fill_viridis_d() +
scale_colour_viridis_d() +
theme_sankey(base_size = 16) +
theme(legend.position = "none") + xlab('')
Solved!
Related
With my dataframe that looks like this (I have in total 1322 rows) :
I'd like to make a bar plot with the percentage of rating of the CFS score. It should look similar to this :
With this code, I can make a single bar plot for the column cfs_triage :
ggplot(data = df) +
geom_bar(mapping = aes(x = cfs_triage, y = (..count..)/sum(..count..)))
But I can't find out to make one with the three varaibles next to another.
Thank you in advance to all of you that will help me with making this barplot with the percentage of rating for this three variable !(I'm not sure that my explanations are very clear, but I hope that it's the case :))
Your best bet here is to pivot your data into long format. We don't have your data, but we can reproduce a similar data set like this:
set.seed(1)
df <- data.frame(cfs_triage = sample(10, 1322, TRUE, prob = 1:10),
cfs_silver = sample(10, 1322, TRUE),
cfs_student = sample(10, 1322, TRUE, prob = 10:1))
df[] <- lapply(df, function(x) { x[sample(1322, 300)] <- NA; x})
Now the dummy data set looks a lot like yours:
head(df)
#> cfs_triage cfs_silver cfs_student
#> 1 9 NA 1
#> 2 8 4 2
#> 3 NA 8 NA
#> 4 NA 10 9
#> 5 9 5 NA
#> 6 3 1 NA
If we pivot into long format, then we will end up with two columns: one containing the values, and one containing the column name that the value belonged to in the original data frame:
library(tidyverse)
df_long <- df %>%
pivot_longer(everything())
head(df_long)
#> # A tibble: 6 x 2
#> name value
#> <chr> <int>
#> 1 cfs_triage 9
#> 2 cfs_silver NA
#> 3 cfs_student 1
#> 4 cfs_triage 8
#> 5 cfs_silver 4
#> 6 cfs_student 2
This then allows us to plot with value on the x axis, and we can use name as a grouping / fill variable:
ggplot(df_long, aes(value, fill = name)) +
geom_bar(position = 'dodge') +
scale_fill_grey(name = NULL) +
theme_bw(base_size = 16) +
scale_x_continuous(breaks = 1:10)
#> Warning: Removed 900 rows containing non-finite values (`stat_count()`).
Created on 2022-11-25 with reprex v2.0.2
Maybe you need something like this: The formatting was taken from #Allan Cameron (many Thanks!):
library(tidyverse)
library(scales)
df %>%
mutate(id = row_number()) %>%
pivot_longer(-id) %>%
group_by(id) %>%
mutate(percent = value/sum(value, na.rm = TRUE)) %>%
mutate(percent = ifelse(is.na(percent), 0, percent)) %>%
mutate(my_label = str_trim(paste0(format(100 * percent, digits = 1), "%"))) %>%
ggplot(aes(x = factor(name), y = percent, fill = factor(name), label = my_label))+
geom_col(position = position_dodge())+
geom_text(aes(label = my_label), vjust=-1) +
facet_wrap(. ~ id, nrow=1, strip.position = "bottom")+
scale_fill_grey(name = NULL) +
scale_y_continuous(labels = scales::percent)+
theme_bw(base_size = 16)+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
I wan to plot the distribution of the overall number of wins of a player. I would like to have the last section of the x-axis as a "more than the values before" category.
Example data:
game_data <- data.frame(player = c(1,2,3,4,5, 6), n_wins = c(1,8,2,3,6,4))
game_data
player n_wins
1 1 1
2 2 8
3 3 2
4 4 3
5 5 6
6 6 4
6 6 4
The following code creates a category "NA", but I want it to be 5+ (= more than 5 wins).
game_data %>% group_by(player) %>% summarise(allwins = sum(n_wins)) %>%
ggplot(aes(x = cut(allwins, breaks = seq(1,6, by = 1)), include.lowest=TRUE)) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
scale_y_continuous(labels=scales::percent) +
labs(title="Distribution of Wins", subtitle="", y="Fraction of Players", x="Number of Wins")
I do not only want to change the label, I want it to automatically create the last category.
You can do the following by including +Inf as a break, note that you have no values that are 5, so you need to add a drop=FALSE with scale_x_discrete:
set.seed(100)
game_data <- data.frame(player = c(1,2,3,4,5, 6), n_wins = c(1,8,2,3,6,4))
BR = c(0:5,+Inf)
game_data %>%
group_by(player) %>% summarise(allwins = sum(n_wins)) %>%
ggplot(aes(x = cut(allwins, breaks = BR,labels=c(1:5,"5+")))) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
scale_y_continuous(labels=scales::percent) +
labs(title="Distribution of Wins", subtitle="",
y="Fraction of Players", x="Number of Wins")+
scale_x_discrete(drop=FALSE)
Maybe a small comment, why do you need to summarize the data?
Good evening,
this is my first question, so please be kind.
I want to analyse a dataset with more than 150 cols and 300 rows with R Studio but I'm a newbie.
My problem is here that I want to plot a line or bar chart with ggplot. Unfortunately I can't plot on the x-axis the category i with an average (with gender) of this category (regardless of whether plot or ggplot is used). Another Question is to replace "." in the title (colname) in the chart(s).
The main code for this question is attached and also a picture of a chart using Excel (as example).
In the best case my code will create for each heading catergory (the first two numbers of the colname) a chart with the sub categories (second 2 numbers). But at first I tried to plot a chart with one category but it didn't worked.
I would be pleased about a feedback or tip because it can't be that hard but I didn't found something yet.
Many thanks in advance.
P.S: The comment of Sandy from this question didn't worked for me.
Roh_daten <- data.frame(Age=c(25,22,23,21,21,18),Geschlecht=c("m","w","m","m","m","m"),Test.Kette_01_01 = c(6,5,5,4,5,5),Test.String_01_02=c(2,5,5,3,3,4),Testchar_02_01 = c(0,5,5,4,6,6))
Laufzahl_i <- 1
Farbe_m="blue"#willkürlich festgelegt
Farbe_w="red"#willkürlich festgelegt
library(ggplot2)
library(stringr)
Links = function(text, num_char) {
substr(text, 1, num_char)
}
Rechts = function(text, num_char) {
substr(text, nchar(text) - (num_char-1), nchar(text))
}
for(i in 2:ncol(Roh_daten)) #nicht 1 da dies nur die ID ist
{
#print(colnames(Roh_daten[i]))
if(i==ncol(Roh_daten)) break()
#colnames(Roh_daten[i]) <- c(String_in_string_replace(colnames(Roh_daten[i]),"\\.","\\ ","All"))
if(all.equal(Roh_daten[,i], as.integer(Roh_daten[,i]))==TRUE)
{
assign(paste(colnames(Roh_daten[i]),"test_men",sep = "_"),mean(Roh_daten[,i][Roh_daten$Geschlecht == "m"],na.rm = TRUE))#erstellt aus dem paste String eine Variable
assign(paste(colnames(Roh_daten[i]),"test_woman",sep = "_"),mean(Roh_daten[,i][Roh_daten$Geschlecht == "w"],na.rm = TRUE))
assign(paste(colnames(Roh_daten[i]),"test_m_w",sep = "_"),mean(subset(Roh_daten[,i],Roh_daten$Geschlecht == "m" | Roh_daten$Geschlecht == "w"),na.rm = TRUE))
if(Links(Rechts(colnames(Roh_daten[i]),5),2) == Links(Rechts(colnames(Roh_daten[i-1]),5),2)){#nur wenn stimmt alle -1
#print(Links(Rechts(colnames(Roh_daten[i-1]),5),2))
Laufzahl_i=Laufzahl_i+1
if(Links(Rechts(colnames(Roh_daten[i]),5),2) == Links(Rechts(colnames(Roh_daten[i+1]),5),2)){#letztes element von alle mit der bed. von oben
}else{
#print(c("Es wurde ", Laufzahl_i, " Mal der gleiche Bereich erkannt."))
Laufzahl_i <- 1
Var_name_m <- paste(colnames(Roh_daten[i]),"test_men",sep = "_")
Var_name_w <- paste(colnames(Roh_daten[i]),"test_woman",sep = "_")
plot(get(Var_name_m),t="b",col=Farbe_m,ylim = c(0,6),yaxt="n",main = Links(Var_name_m,str_locate(Var_name_m,"_")-1),ylab="Wichtigkeit")
text(x=get(Var_name_m),labels = as.character(round(get(Var_name_m),digits = 2)),pos=2,col = Farbe_m)
text(x=get(Var_name_w),labels = as.character(round(get(Var_name_w),digits = 2)),pos=4,col = Farbe_w)
axis(2, at = seq(0, 6, by = 0.5), las=2)
legend(x ="topleft", legend = c("m","w"),col=c(Farbe_m, Farbe_w), bty = "o")
points(get(Var_name_w),t="b",col=Farbe_w,ylim = c(0,6))
p <- ggplot(data=Roh_daten[i],aes(x=get(Var_name_m),y=get(Var_name_m))) + #xlab(colnames(Roh_daten[,i]))
#geom_line(linetype=2) +
geom_point(size=1,col=Farbe_m) +
geom_point(size=1,col=Farbe_w,aes(y=get(Var_name_w))) +
theme(panel.border = element_rect(colour = "black", fill=NA, size=0.5))
#geom_bar(stat="identity")
#scale_y_continuous(breaks = seq(1,6,by=1))
p
#ggplot(data=Roh_daten[i],aes(x=get(Var_name_m),y=get(Var_name_m))) + stat_summary(fun.y=mean, geom = "point")
}
}
}else {
print(paste(colnames(Roh_daten[i])," hat einen Fehler (String)"))
}
}
p
Question1: plotting the average per gender of each categories
I'm not sure that it is exactly what you are asking for but from my understanding, you are looking to get the same plot you get with excel. Breifly, the average of each gender for each category plotted as a line or a barchart and with mean values display on it.
Based on the example you provided, you can have the use of dplyr and tidyr libraries to average each column based on their gender and get them reshape for plotting in ggplot. Here how you can do it by steps:
First, get the average of each columns based on gender:
library(dplyr)
Roh_daten %>%
group_by(Geschlecht) %>%
summarise_all(.funs = mean)
# A tibble: 2 x 5
Geschlecht Age Test.Kette_01_01 Test.String_01_02 Testchar_02_01
<fct> <dbl> <dbl> <dbl> <dbl>
1 m 21.6 5 3.4 4.2
2 w 22 5 5 5
Next, we want to reshape these data in order to match the grammar of ggplot2 (briefly summarise, an unique column for x values, an unique column for y values, and columns for each categories) to be used, so you can use the function pivot_longer from tidyr:
library(dplyr)
library(tidyr)
Roh_daten %>%
group_by(Geschlecht) %>%
summarise_all(.funs = mean) %>%
pivot_longer(., -c(Geschlecht, Age), names_to = "Variable", values_to = "Value")
# A tibble: 6 x 4
Geschlecht Age Variable Value
<fct> <dbl> <chr> <dbl>
1 m 21.6 Test.Kette_01_01 5
2 m 21.6 Test.String_01_02 3.4
3 m 21.6 Testchar_02_01 4.2
4 w 22 Test.Kette_01_01 5
5 w 22 Test.String_01_02 5
6 w 22 Testchar_02_01 5
Finally, we can use ggplot2 to get a bar chart like this:
library(dplyr)
library(tidyr)
library(ggplot2)
Roh_daten %>%
group_by(Geschlecht) %>%
summarise_all(.funs = mean) %>%
pivot_longer(., -c(Geschlecht, Age), names_to = "Variable", values_to = "Value") %>%
ggplot(., aes(x = Variable, y = Value, group = Geschlecht))+
geom_bar(stat = "identity", aes(fill = Geschlecht), position = position_dodge())+
theme(legend.position = "top")+
geom_label(aes(label = Value), position = position_dodge(0.9), vjust = -0.5)+
ylim(0,5.5)
Or get lines and points like this (the library ggrepel will help to display labeling without overlapping on each other:
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggrepel)
Roh_daten %>%
group_by(Geschlecht) %>%
summarise_all(.funs = mean) %>%
pivot_longer(., -c(Geschlecht, Age), names_to = "Variable", values_to = "Value") %>%
ggplot(., aes(x = Variable, y = Value, color = Geschlecht, group = Geschlecht))+
geom_point()+
geom_line()+
theme(legend.position = "top")+
geom_label_repel(aes(label = Value), vjust = -0.5)
Is it the kind of plot you are looking ? If not, can you clarify your question because I did not understand all your code.
Question2: Replacement of dots in colnames
For your second question regarding the replacement of "." in colnames of your dataset, you can have the use of the library rebus:
library(rebus)
gsub(DOT,"-", colnames(Roh_daten))
[1] "Age" "Geschlecht" "Test-Kette_01_01" "Test-String_01_02" "Testchar_02_01"
I hope it answer your questions.
I have a question concerning ordering of stacked bars in a swimmer plot using GGplot in R.
I have a sample dataset of (artificial) patients, who receive treatments.
library(tidyverse)
df <- read.table(text="patient start_t_1 t_1_duration start_t_2 t_2_duration start_t_3 t_3_duration start_t_4 t_4_duration end
1 0 1.5 1.5 3 NA NA 4.5 10 10
2 0 2 4.5 2 NA NA 2 2.5 10
3 0 5 5 2 7 0.5 7.5 2 9.5
4 0 8 NA NA NA NA 8 2 10", header=TRUE)
All patients start the first treatment at time = 0. Subsequently, patients get different treatments (numbered t_2 up to t_4).
I tried to plot the swimmer plot, using the following code:
df %>%
gather(variable, value, c(t_1_duration, t_2_duration, t_3_duration, t_4_duration)) %>%
ggplot(aes(x = patient, y = value, fill = variable)) +
geom_bar(stat = "identity") +
coord_flip()
However, the treatments are not displayed in the right order.
For example: patient 3 receives all treatments in consecutive orde, while patient 2 receives first treatment 1, then 4 and eventually 2.
So, simply reversing the order does not work.
How do I order the stacked bars in a chronological way?
What about this:
df %>%
gather(variable, value, c(t_1_duration, t_2_duration, t_3_duration,t_4_duration)) %>%
ggplot(aes(x = patient,
y = value,
# here you can specify the order of the variable
fill = factor(variable,
levels =c("t_4_duration", "t_3_duration", "t_2_duration","t_1_duration")))) +
geom_bar(stat = "identity") +
coord_flip()+ guides(fill=guide_legend("My title"))
EDIT:
that has been a long trip, because it involves a kind of hack. I think it's not not a dupe of that question, because it involves also some data reshaping:
library(reshape2)
# divide starts and duration
starts <- df %>% select(patient, start_t_1, start_t_2, start_t_3, start_t_4)
duration <- df %>% select(patient, t_1_duration,t_2_duration, t_3_duration, t_4_duration)
# here you melt them
starts <- melt(starts, id = 'patient') %>%
mutate(keytreat = substr(variable,nchar(as.vector(variable))-2, nchar(as.vector(variable)))) %>%
`colnames<-`(c("patient", "variable", "start","keytreat")) %>% select(-variable)
duration <- melt(duration, id = 'patient') %>% mutate(keytreat = substr(variable,1, 3)) %>%
`colnames<-`(c("patient", "variable", "duration","keytreat")) %>% select(-variable)
# join
dats <- starts %>% left_join(duration) %>% arrange(patient, start) %>% filter(!is.na(start))
# here the part for the plot
bars <- map(unique(dats$patient)
, ~geom_bar(stat = "identity", position = "stack"
, data = dats %>% filter(patient == .x)))
dats %>%
ggplot(aes(x = patient,
y = duration,
fill = reorder(keytreat,-start))) +
bars +
guides(fill=guide_legend("ordering")) + coord_flip()
I have a dataframe of following form:
School_type Year fund rate
1 1998 8 0.1
0 1998 7 0.2
1 1999 9 0.11
0 1999 8 0.22
1 2000 10 0.12
0 2000 15 0.23
I am thinking about plotting the "fund" and "rate" for each school type and the x axis is year, so there are four lines--two higher lines and two lower lines, but I don't know how to implement this with two scales of y-axes. Thanks in advance.
I am not sure if this is what you are looking for, but here is my two cents on your question.
#create the dataframe
df = data.frame("school_type" = 0:1, "year" = c("1998","1998","1999","1999","2000","2000"),
"fund" = c("8","7","9","8","10","15"), "rate" = c("0.1","0.2","0.11","0.22","0.12","0.23"))
#Modify the variable typr
df$fund = as.numeric(as.character(df$fund))
df$rate = as.numeric(as.character(df$rate))
#plot the log of the variables
df %>%
mutate(log_fund = log(fund),
log_rate = log(rate)) %>%
melt(id.vars = c("school_type","year")) %>%
filter(variable %in% c("log_fund","log_rate")) %>%
ggplot(aes(x = year, y = value, group = variable, color = variable, shape = variable)) +
geom_line(size = 1) +
geom_point(size = 3) +
facet_wrap(~ school_type) +
theme_bw()
Result: