I'm trying to represent the movements of patients between several treatment groups measured in 3 different years. However, there are dropouts: some patients from the 1st year are missing in the 2nd year, and some patients in the 2nd year weren't in the 1st. The same applies to the 3rd year. I have a label called "none" for these combinations, but I don't want it to appear in the plot.
An example plot with only 2 years:
EDIT
I have tried with geom_sankey as well (https://rdrr.io/github/davidsjoberg/ggsankey/man/geom_sankey.html).
Although it is closer to what I'm looking for, I don't know how to omit the stratum groups without labels (NA). In this case, I'm using my full data, not a dummy example. I can't share it, but I can try to create an example if needed. This is the code I've tried:
data = bind_rows(data_2015,data_2017,data_2019) %>%
select(sip, Year, Grp) %>%
mutate(Grp = factor(Grp), Year = factor(Year)) %>%
arrange(sip) %>%
pivot_wider(names_from = Year, values_from = Grp)
df_sankey = data %>% make_long(`2015`,`2017`,`2019`)
ggplot(df_sankey, aes(x = x,
next_x = next_x,
node = node,
next_node = next_node,
fill = factor(node),
label = node,
color=factor(node) )) +
geom_sankey(flow.alpha = 0.5, node.color = 1) +
geom_sankey_label(size = 3.5, color = 1, fill = "white") +
scale_fill_viridis_d() +
scale_colour_viridis_d() +
theme_sankey(base_size = 16) +
theme(legend.position = "none") + xlab('')
Figure:
Any idea on how to omit the missing groups in each year as strata (without omitting them from the alluvia) would be super helpful. Thanks!
Solved! The solution was much easier than I thought. I'll leave it here in case someone else struggles with a similar problem.
Create a wide table with one row per patient (sip) and one column per year, holding the patient's group in that year.
# Data with 3 cohorts for years 2015, 2017 and 2019
# Grp is a factor with 6 levels: 1 to 6
# sip is a unique ID
library(tidyverse)
data_wide = data %>%
select(sip, Year, Grp) %>%
mutate(Grp = factor(Grp, levels=c(1:6)), Year = factor(Year)) %>%
arrange(sip) %>%
pivot_wider(names_from = Year, values_from = Grp)
Using the ggsankey package, we can transform it into the specific format the package expects. There's already a useful function for this, make_long().
df_sankey = data_wide %>% make_long(`2015`,`2017`,`2019`)
# The tibble accounts for every change in X axis and Y categorical value (node):
> head(df_sankey)
# A tibble: 6 × 4
x node next_x next_node
<fct> <chr> <fct> <chr>
1 2015 3 2017 2
2 2017 2 2019 2
3 2019 2 NA NA
4 2015 NA 2017 1
5 2017 1 2019 1
6 2019 1 NA NA
It looks like passing the output of pivot_wider() to make_long() completes every patient-by-year combination, so patients missing from a given year get NA as their node. Drop the rows with NA in 'node' and create the plot.
df_sankey %>% drop_na(node) %>%
ggplot(aes(x = x,
next_x = next_x,
node = node,
next_node = next_node,
fill = factor(node),
label = node,
color=factor(node) )) +
geom_sankey(flow.alpha = 0.5, node.color = 1) +
geom_sankey_label(size = 3.5, color = 1, fill = "white") +
scale_fill_viridis_d() +
scale_colour_viridis_d() +
theme_sankey(base_size = 16) +
theme(legend.position = "none") + xlab('')
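To see where the NA nodes come from, here is a minimal sketch on hypothetical toy data (the names toy, toy_wide, toy_long and kept are made up for illustration). It uses pivot_longer() only as a stand-in for make_long(), so it runs without ggsankey and is not the package's own implementation:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: patient 2 drops out after 2015,
# patient 3 only joins in 2017
toy <- tibble(
  sip  = c(1, 1, 2, 3),
  Year = c(2015, 2017, 2015, 2017),
  Grp  = c("A", "B", "A", "B")
)

# Widening creates one cell per patient and year;
# combinations that never occurred are filled with NA
toy_wide <- toy %>%
  pivot_wider(names_from = Year, values_from = Grp)

# Going back to long format keeps those NA cells as rows
# (stand-in for what make_long() receives as nodes)
toy_long <- toy_wide %>%
  pivot_longer(-sip, names_to = "x", values_to = "node")

# Dropping rows with a missing node removes the unlabelled stratum
kept <- toy_long %>% drop_na(node)
```

Here toy_long has six rows (three patients by two years), two of which have a missing node, and kept retains the four observed ones.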
Say the categorical variables are,
Do_you_smoke -> Yes/ No
Do_you_drink -> Yes/No
Do_you_exercise -> Yes/No
All 3 categorical variables (Do_you_smoke, Do_you_drink, Do_you_exercise) have 2 categories, Yes or No. Now I want to visualize all these categorical variables against one continuous variable, say "income", at once. How do I visualize this using R?
It's always better to include a reproducible example of your data so that we can ensure any possible solutions work with your own data structure. However, from your description we should be able to recreate an example data set like this:
set.seed(69)
df <- data.frame(income = runif(1000, 10000, 100000))
df$smoke <- c("Yes", "No")[1 + rbinom(1000, 1, df$income/200000)]
df$drink <- sample(c("Yes", "No"), 1000, TRUE)
df$exercise <- c("No", "Yes")[1 + rbinom(1000, 1, df$income/100000)]
So our data frame contains four columns: the income amount and either a "Yes" or a "No" for each of your three variables:
head(df)
#> income smoke drink exercise
#> 1 57767.86 Yes No Yes
#> 2 79192.70 Yes Yes Yes
#> 3 68132.37 No No No
#> 4 87873.44 Yes No No
#> 5 43199.45 Yes Yes No
#> 6 88188.83 No Yes Yes
To plot this, we need to reshape the data. Since the incomes are all different, we can't get a percentage at each individual income level, so we will have to cut the income into bins. Let's use bins of $10,000. We then need to get the proportion of "Yes" for each variable in each income band. Finally, we want to put our data into long format, so that each proportion in each bin has its own row, labelled according to which of the three categorical variables it represents. We can then plot using ggplot.
We need to load a few libraries to help us:
library(dplyr)
library(ggplot2)
library(scales)
library(tidyr)
And now our code looks like this:
df %>%
mutate(income_bracket = cut(income, breaks = 1:10 * 10000)) %>%
group_by(income_bracket) %>%
summarise(exercise = length(which(exercise == "Yes"))/n(),
smoke = length(which(smoke == "Yes"))/n(),
drink = length(which(drink == "Yes"))/n()) %>%
mutate(income = paste(dollar(1:9 * 10000),
dollar(2:10 * 10000), sep = " -\n")) %>%
select(-income_bracket) %>%
pivot_longer(1:3) %>%
ggplot(aes(x = income, y = value, group = name, colour = name)) +
geom_line(size = 1.3) +
geom_point(size = 3) +
scale_y_continuous(labels = percent, limits = c(0, 1)) +
labs(title = "Percentage of activities by income",
y = "Percent", x = "Income bracket", color = "Do you...")
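The proportion step can also be written more compactly, since the mean of a logical vector is the share of TRUE values. The following is a minimal sketch on hypothetical toy data (toy and props are names introduced here, and only two of the three variables are included to keep it short); it assumes dplyr 1.0 or later for across():

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with the same layout as df above
set.seed(1)
toy <- data.frame(
  income = runif(500, 10000, 100000),
  smoke  = sample(c("Yes", "No"), 500, replace = TRUE),
  drink  = sample(c("Yes", "No"), 500, replace = TRUE)
)

props <- toy %>%
  # Bin income into $10,000 brackets
  mutate(income_bracket = cut(income, breaks = 1:10 * 10000)) %>%
  group_by(income_bracket) %>%
  # mean() of a logical vector gives the proportion of "Yes"
  summarise(across(c(smoke, drink), ~ mean(.x == "Yes"))) %>%
  # One row per bracket and variable, ready for ggplot
  pivot_longer(-income_bracket, names_to = "name", values_to = "value")
```

The resulting props has the columns income_bracket, name and value, so it can be passed to the same ggplot() call with x = income_bracket.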
I want to create a chart, using ggplot, relating the variables "var_share" (in the y-axis) and "cbo" (in the x-axis), but by three time periods: 1996-2002, 2002-2008 and 2008-2012. Also, I want to calculate the "cbo" variable, by percentile. Here is my dataset:
ano cbo ocupado quant total share var_share
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1996 20 1 32 39675 0.0807 -0.343
2 1997 20 1 52 41481 0.125 0.554
3 1998 20 1 34 40819 0.0833 -0.336
4 1999 20 1 44 41792 0.105 0.264
5 2001 20 1 57 49741 0.115 0.0884
6 1996 21 1 253 39675 0.638 -0.0326
You can download the full dataset here.
The result is almost like this:
I believe this is what you are looking for. After reading your data in, a new date variable called ano2 is built, and after that a new data frame (ddf) with a column called new, which contains the time periods you defined.
The first plot then builds on this DF and uses stat_summary.
You also said something about the quantiles. I am not sure exactly what you meant, but I grouped by this new variable and used dplyr's group_modify() with a purrr-style lambda to calculate the desired quantiles.
library(tidyverse)
df <- ocupacoes
df$ano2 <- readr::parse_date(paste0('01-01-', df$ano), '%d-%m-%Y')
ddf <- df %>%
mutate(new = case_when(
lubridate::year(ano2) %in% 1996:2002 ~ '96-02',
lubridate::year(ano2) %in% 2003:2008 ~ '02-08',
lubridate::year(ano2) %in% 2009:2012 ~ '08-12'
))
ggplot(ddf, aes(x = new, y = var_share, color = new)) +
stat_summary(fun = mean, colour = "red", size = 1) +
scale_x_discrete(limits = c('96-02', '02-08', '08-12'))
# I think you were also looking for quantiles of cbo
ddf %>%
group_by(new) %>%
group_modify(~ {
quantile(.x$cbo, probs = seq(0,1, by = .2)) %>%
tibble::enframe(name = "prob", value = "quantile")
}) %>%
ggplot(aes(x = prob, quantile, color = new, group = new)) +
geom_line() +
scale_x_discrete(limits = c('0%', '20%' ,
'40%', '60%',
'80%' , '100%'))
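If "by percentile" instead means assigning each cbo value to a percentile bin, one possible reading can be sketched in base R with quantile() and cut(). The cbo values below are hypothetical, and breaks, cbo_bin and the Q1 to Q5 labels are names chosen for this illustration:

```r
# Hypothetical cbo codes, not taken from the real dataset
cbo <- c(20, 21, 22, 30, 31, 40, 41, 50, 51, 60)

# Quintile break points (0%, 20%, ..., 100%)
breaks <- quantile(cbo, probs = seq(0, 1, by = 0.2))

# Assign each value to its quintile; include.lowest keeps the minimum
cbo_bin <- cut(cbo, breaks = breaks, include.lowest = TRUE,
               labels = paste0("Q", 1:5))

table(cbo_bin)
```

With real data, cut() fails if two break points coincide, so heavily tied values may need fewer bins or dplyr::ntile() instead.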
I would like to create a stacked bar chart in R. My x axis just contains data on sex, i.e. male or female. I just need the y axis to show percentages of the stacked bars. The "Survived" column is just a mixture of 0s and 1s, i.e. 1 denoting that an individual survived an experience and 0 showing that the individual did not survive the experience. I am not sure what to put in for the y aesthetic. Can anyone help please?
ggplot(data = df, mapping = aes(x = Sex, y = ? , fill = Survived)) + geom_bar(stat = "identity")
One possible solution is to use the dplyr package to calculate the percentage of each category outside of ggplot2 and then use those values to draw your bar graph with geom_col:
library(dplyr)
df %>% count(Sex, Survive) %>%
group_by(Sex) %>%
mutate(Percent = n/sum(n)*100)
# A tibble: 4 x 4
# Groups: Sex [2]
Sex Survive n Percent
<fct> <dbl> <int> <dbl>
1 F 0 26 55.3
2 F 1 21 44.7
3 M 0 34 64.2
4 M 1 19 35.8
And now with the plotting part:
library(dplyr)
library(ggplot2)
df %>% count(Sex, Survive) %>%
group_by(Sex) %>%
mutate(Percent = n/sum(n)*100) %>%
ggplot(aes(x = Sex, y = Percent, fill = as.factor(Survive)))+
geom_col()
Reproducible example
df <- data.frame(Sex = sample(c("M","F"),100, replace = TRUE),
Survive = sample(c(0,1), 100, replace = TRUE))
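As an alternative sketch, ggplot2 can do the normalisation itself: geom_bar(position = "fill") stacks each bar to a height of 1, and scales::percent labels the axis as percentages. The base R prop.table() call is only there as a cross-check of the numbers; the object names pct and p and the seed are arbitrary choices for this example:

```r
library(ggplot2)

set.seed(42)
df <- data.frame(Sex = sample(c("M", "F"), 100, replace = TRUE),
                 Survive = sample(c(0, 1), 100, replace = TRUE))

# Row-wise percentages: the values for each Sex sum to 100
pct <- prop.table(table(df$Sex, df$Survive), margin = 1) * 100

# position = "fill" rescales every bar to 1, so no y aesthetic is needed
p <- ggplot(df, aes(x = Sex, fill = factor(Survive))) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(y = "Percent", fill = "Survive")
```

Printing p draws the chart. The pre-computed version above is still the better choice when the percentages are also needed as text labels.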
Good evening,
this is my first question, so please be kind.
I want to analyse a dataset with more than 150 columns and 300 rows in RStudio, but I'm a newbie.
My problem is that I want to plot a line or bar chart with ggplot. Unfortunately, I can't get the categories onto the x-axis together with the average per gender for each category (regardless of whether plot or ggplot is used). Another question is how to replace "." in the title (colname) of the chart(s).
The main code for this question is attached, along with a picture of a chart made in Excel (as an example).
In the best case, my code would create one chart for each heading category (the first two numbers of the colname) showing its sub-categories (the second two numbers). As a first step I tried to plot a chart for a single category, but it didn't work.
I would be pleased to get feedback or a tip, because it can't be that hard, but I haven't found anything yet.
Many thanks in advance.
P.S.: The comment by Sandy on this question didn't work for me.
Roh_daten <- data.frame(Age=c(25,22,23,21,21,18),Geschlecht=c("m","w","m","m","m","m"),Test.Kette_01_01 = c(6,5,5,4,5,5),Test.String_01_02=c(2,5,5,3,3,4),Testchar_02_01 = c(0,5,5,4,6,6))
Laufzahl_i <- 1
Farbe_m="blue"# chosen arbitrarily
Farbe_w="red"# chosen arbitrarily
library(ggplot2)
library(stringr)
Links = function(text, num_char) {
substr(text, 1, num_char)
}
Rechts = function(text, num_char) {
substr(text, nchar(text) - (num_char-1), nchar(text))
}
for(i in 2:ncol(Roh_daten)) # not 1, since that is only the ID
{
#print(colnames(Roh_daten[i]))
if(i==ncol(Roh_daten)) break()
#colnames(Roh_daten[i]) <- c(String_in_string_replace(colnames(Roh_daten[i]),"\\.","\\ ","All"))
if(all.equal(Roh_daten[,i], as.integer(Roh_daten[,i]))==TRUE)
{
assign(paste(colnames(Roh_daten[i]),"test_men",sep = "_"),mean(Roh_daten[,i][Roh_daten$Geschlecht == "m"],na.rm = TRUE))# creates a variable named after the pasted string
assign(paste(colnames(Roh_daten[i]),"test_woman",sep = "_"),mean(Roh_daten[,i][Roh_daten$Geschlecht == "w"],na.rm = TRUE))
assign(paste(colnames(Roh_daten[i]),"test_m_w",sep = "_"),mean(subset(Roh_daten[,i],Roh_daten$Geschlecht == "m" | Roh_daten$Geschlecht == "w"),na.rm = TRUE))
if(Links(Rechts(colnames(Roh_daten[i]),5),2) == Links(Rechts(colnames(Roh_daten[i-1]),5),2)){# only if it matches the previous column (i-1)
#print(Links(Rechts(colnames(Roh_daten[i-1]),5),2))
Laufzahl_i=Laufzahl_i+1
if(Links(Rechts(colnames(Roh_daten[i]),5),2) == Links(Rechts(colnames(Roh_daten[i+1]),5),2)){# last element of all those meeting the condition above
}else{
#print(c("Es wurde ", Laufzahl_i, " Mal der gleiche Bereich erkannt."))
Laufzahl_i <- 1
Var_name_m <- paste(colnames(Roh_daten[i]),"test_men",sep = "_")
Var_name_w <- paste(colnames(Roh_daten[i]),"test_woman",sep = "_")
plot(get(Var_name_m),t="b",col=Farbe_m,ylim = c(0,6),yaxt="n",main = Links(Var_name_m,str_locate(Var_name_m,"_")-1),ylab="Wichtigkeit")
text(x=get(Var_name_m),labels = as.character(round(get(Var_name_m),digits = 2)),pos=2,col = Farbe_m)
text(x=get(Var_name_w),labels = as.character(round(get(Var_name_w),digits = 2)),pos=4,col = Farbe_w)
axis(2, at = seq(0, 6, by = 0.5), las=2)
legend(x ="topleft", legend = c("m","w"),col=c(Farbe_m, Farbe_w), bty = "o")
points(get(Var_name_w),t="b",col=Farbe_w,ylim = c(0,6))
p <- ggplot(data=Roh_daten[i],aes(x=get(Var_name_m),y=get(Var_name_m))) + #xlab(colnames(Roh_daten[,i]))
#geom_line(linetype=2) +
geom_point(size=1,col=Farbe_m) +
geom_point(size=1,col=Farbe_w,aes(y=get(Var_name_w))) +
theme(panel.border = element_rect(colour = "black", fill=NA, size=0.5))
#geom_bar(stat="identity")
#scale_y_continuous(breaks = seq(1,6,by=1))
p
#ggplot(data=Roh_daten[i],aes(x=get(Var_name_m),y=get(Var_name_m))) + stat_summary(fun.y=mean, geom = "point")
}
}
}else {
print(paste(colnames(Roh_daten[i])," has an error (string)"))
}
}
p
Question 1: plotting the average per gender of each category
I'm not sure that this is exactly what you are asking for, but from my understanding you are looking to get the same plot you get with Excel: briefly, the average of each gender for each category, plotted as a line or a bar chart with the mean values displayed on it.
Based on the example you provided, you can use the dplyr and tidyr libraries to average each column by gender and reshape the result for plotting in ggplot. Here is how you can do it step by step:
First, get the average of each column by gender:
library(dplyr)
Roh_daten %>%
group_by(Geschlecht) %>%
summarise_all(.funs = mean)
# A tibble: 2 x 5
Geschlecht Age Test.Kette_01_01 Test.String_01_02 Testchar_02_01
<fct> <dbl> <dbl> <dbl> <dbl>
1 m 21.6 5 3.4 4.2
2 w 22 5 5 5
Next, we want to reshape these data to match the grammar of ggplot2 (in short: a single column for x values, a single column for y values, and one column per grouping variable). For this you can use the function pivot_longer from tidyr:
library(dplyr)
library(tidyr)
Roh_daten %>%
group_by(Geschlecht) %>%
summarise_all(.funs = mean) %>%
pivot_longer(., -c(Geschlecht, Age), names_to = "Variable", values_to = "Value")
# A tibble: 6 x 4
Geschlecht Age Variable Value
<fct> <dbl> <chr> <dbl>
1 m 21.6 Test.Kette_01_01 5
2 m 21.6 Test.String_01_02 3.4
3 m 21.6 Testchar_02_01 4.2
4 w 22 Test.Kette_01_01 5
5 w 22 Test.String_01_02 5
6 w 22 Testchar_02_01 5
Finally, we can use ggplot2 to get a bar chart like this:
library(dplyr)
library(tidyr)
library(ggplot2)
Roh_daten %>%
group_by(Geschlecht) %>%
summarise_all(.funs = mean) %>%
pivot_longer(., -c(Geschlecht, Age), names_to = "Variable", values_to = "Value") %>%
ggplot(., aes(x = Variable, y = Value, group = Geschlecht))+
geom_bar(stat = "identity", aes(fill = Geschlecht), position = position_dodge())+
theme(legend.position = "top")+
geom_label(aes(label = Value), position = position_dodge(0.9), vjust = -0.5)+
ylim(0,5.5)
Or get lines and points like this (the ggrepel library helps to display the labels without them overlapping each other):
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggrepel)
Roh_daten %>%
group_by(Geschlecht) %>%
summarise_all(.funs = mean) %>%
pivot_longer(., -c(Geschlecht, Age), names_to = "Variable", values_to = "Value") %>%
ggplot(., aes(x = Variable, y = Value, color = Geschlecht, group = Geschlecht))+
geom_point()+
geom_line()+
theme(legend.position = "top")+
geom_label_repel(aes(label = Value), vjust = -0.5)
Is this the kind of plot you are looking for? If not, can you clarify your question, because I did not understand all of your code.
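For the "one chart per heading category" part of the question, one possible approach is to extract the first of the two trailing two-digit groups from each column name and use it as a grouping key. This is only a sketch under the assumption that every column name ends in the pattern _NN_NN; vars, category and by_category are names introduced here:

```r
# Column names from the example data
vars <- c("Test.Kette_01_01", "Test.String_01_02", "Testchar_02_01")

# Capture the heading category: the first of the two trailing
# two-digit groups (assumes every name ends in _NN_NN)
category <- sub("^.*_([0-9]{2})_[0-9]{2}$", "\\1", vars)

# Group the variable names by heading category
by_category <- split(vars, category)
```

In the pipelines above, the same sub() call could be used inside mutate() on the Variable column, followed by facet_wrap() on the new column with scales = "free_x" to get one panel per heading category.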
Question 2: Replacement of dots in colnames
For your second question regarding the replacement of "." in the colnames of your dataset, you can use the rebus library:
library(rebus)
gsub(DOT,"-", colnames(Roh_daten))
[1] "Age" "Geschlecht" "Test-Kette_01_01" "Test-String_01_02" "Testchar_02_01"
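The same replacement also works in base R without rebus by passing fixed = TRUE, which makes gsub() treat "." as a literal character rather than a regular-expression wildcard. A minimal sketch (cols and clean are names chosen for this example):

```r
cols <- c("Age", "Geschlecht", "Test.Kette_01_01",
          "Test.String_01_02", "Testchar_02_01")

# fixed = TRUE: match the literal dot, not "any character"
clean <- gsub(".", "-", cols, fixed = TRUE)
```

Note that gsub() only returns the modified names. To actually rename the columns, assign the result back, for example with colnames(Roh_daten) <- gsub(".", "-", colnames(Roh_daten), fixed = TRUE).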
I hope this answers your questions.