I have a dataframe of the following form:
School_type Year fund rate
1 1998 8 0.1
0 1998 7 0.2
1 1999 9 0.11
0 1999 8 0.22
1 2000 10 0.12
0 2000 15 0.23
I would like to plot "fund" and "rate" for each school type with year on the x-axis, so there would be four lines: two higher lines and two lower lines. I don't know how to implement this with two y-axis scales. Thanks in advance.
I am not sure if this is what you are looking for, but here is my two cents on your question.
# load the packages used below
library(dplyr)
library(reshape2)
library(ggplot2)
# create the data frame (school_type alternates 1, 0 as in the question)
df = data.frame("school_type" = c(1, 0), "year" = c("1998","1998","1999","1999","2000","2000"),
                "fund" = c("8","7","9","8","10","15"), "rate" = c("0.1","0.2","0.11","0.22","0.12","0.23"))
# convert the variable types from character to numeric
df$fund = as.numeric(as.character(df$fund))
df$rate = as.numeric(as.character(df$rate))
# plot the log of the variables
df %>%
  mutate(log_fund = log(fund),
         log_rate = log(rate)) %>%
  melt(id.vars = c("school_type","year")) %>%
  filter(variable %in% c("log_fund","log_rate")) %>%
  ggplot(aes(x = year, y = value, group = variable, color = variable, shape = variable)) +
  geom_line(size = 1) +
  geom_point(size = 3) +
  facet_wrap(~ school_type) +
  theme_bw()
Result:
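If you do want two y-axis scales rather than logs, here is a minimal sketch using sec_axis() from ggplot2. The scaling factor k is an arbitrary choice of mine (not from the question): it stretches "rate" onto the "fund" axis, and the secondary axis undoes it for the labels.

```r
library(ggplot2)

# Data from the question
df <- data.frame(
  school_type = factor(c(1, 0, 1, 0, 1, 0)),
  year = c(1998, 1998, 1999, 1999, 2000, 2000),
  fund = c(8, 7, 9, 8, 10, 15),
  rate = c(0.1, 0.2, 0.11, 0.22, 0.12, 0.23)
)

# Assumption: rescale rate so its maximum lines up with the maximum of fund
k <- max(df$fund) / max(df$rate)

p <- ggplot(df, aes(x = year, colour = school_type)) +
  geom_line(aes(y = fund, linetype = "fund")) +
  geom_line(aes(y = rate * k, linetype = "rate")) +
  # the secondary axis divides by k again, so it is labelled in rate units
  scale_y_continuous(name = "fund", sec.axis = sec_axis(~ . / k, name = "rate")) +
  scale_x_continuous(breaks = unique(df$year)) +
  theme_bw()

p
```

Colour separates the school types and linetype separates the two variables, which gives the four lines the question describes.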
I'm trying to represent the movements of patients between several treatment groups measured in 3 different years. However, there are dropouts: some patients from the 1st year are missing in the 2nd year, and some patients in the 2nd year weren't in the 1st. The same applies to the 3rd year. I have a label called "none" for these combinations, but I don't want it to be in the plot.
An example plot with only 2 years:
EDIT
I have tried geom_sankey as well (https://rdrr.io/github/davidsjoberg/ggsankey/man/geom_sankey.html).
It is closer to what I'm looking for, but I don't know how to omit the stratum groups without labels (NA). In this case I'm using my full data, not a dummy example. I can't share it, but I can try to create an example if needed. This is the code I've tried:
data = bind_rows(data_2015,data_2017,data_2019) %>%
select(sip, Year, Grp) %>%
mutate(Grp = factor(Grp), Year = factor(Year)) %>%
arrange(sip) %>%
pivot_wider(names_from = Year, values_from = Grp)
df_sankey = data %>% make_long(`2015`,`2017`,`2019`)
ggplot(df_sankey, aes(x = x,
next_x = next_x,
node = node,
next_node = next_node,
fill = factor(node),
label = node,
color=factor(node) )) +
geom_sankey(flow.alpha = 0.5, node.color = 1) +
geom_sankey_label(size = 3.5, color = 1, fill = "white") +
scale_fill_viridis_d() +
scale_colour_viridis_d() +
theme_sankey(base_size = 16) +
theme(legend.position = "none") + xlab('')
Figure:
Any idea on how to omit the missing groups every year as stratum (without omitting them in the alluvium) would be very helpful. Thanks!
Solved! The solution was much easier than I thought. I'll leave it here in case someone else struggles with a similar problem.
Create a wide table with one row per patient and one column per cohort (year).
# Data with 3 cohorts for years 2015, 2017 and 2019
# Grp is a factor with 6 levels: 1 to 6
# sip is a unique ID
library(tidyverse)
library(ggsankey)
data_wide = data %>%
  select(sip, Year, Grp) %>%
  mutate(Grp = factor(Grp, levels=c(1:6)), Year = factor(Year)) %>%
  arrange(sip) %>%
  pivot_wider(names_from = Year, values_from = Grp)
Using the ggsankey package we can transform it into the long format the package expects. There is already a useful function for this, make_long().
df_sankey = data_wide %>% make_long(`2015`,`2017`,`2019`)
# The tibble accounts for every change in X axis and Y categorical value (node):
> head(df_sankey)
# A tibble: 6 × 4
x node next_x next_node
<fct> <chr> <fct> <chr>
1 2015 3 2017 2
2 2017 2 2019 2
3 2019 2 NA NA
4 2015 NA 2017 1
5 2017 1 2019 1
6 2019 1 NA NA
It looks like passing the output of pivot_wider() to make_long() completes every combination for every value, so missing groups show up as NA. Drop the NA values in 'node' and create the plot.
df_sankey %>% drop_na(node) %>%
ggplot(aes(x = x,
next_x = next_x,
node = node,
next_node = next_node,
fill = factor(node),
label = node,
color=factor(node) )) +
geom_sankey(flow.alpha = 0.5, node.color = 1) +
geom_sankey_label(size = 3.5, color = 1, fill = "white") +
scale_fill_viridis_d() +
scale_colour_viridis_d() +
theme_sankey(base_size = 16) +
theme(legend.position = "none") + xlab('')
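For anyone who wants to see what drop_na(node) keeps without installing ggsankey, here is a small sketch. The tibble is typed in by hand from the head(df_sankey) output above, so it only imitates what make_long() returns; the names long and kept are mine.

```r
library(tidyr)
library(tibble)

# Hand-typed copy of the six rows printed above (not produced by make_long())
long <- tribble(
  ~x,     ~node, ~next_x, ~next_node,
  "2015", "3",   "2017",  "2",
  "2017", "2",   "2019",  "2",
  "2019", "2",   NA,      NA,
  "2015", NA,    "2017",  "1",
  "2017", "1",   "2019",  "1",
  "2019", "1",   NA,      NA
)

# Only rows whose starting node is missing are dropped;
# rows with a missing next_node (the last year) are kept
kept <- drop_na(long, node)
kept
```

Only the row that starts at a missing node in 2015 disappears. The final-year rows, which have NA in next_x and next_node, remain.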
My dataframe looks like this (1322 rows in total):
I'd like to make a bar plot with the percentage of each rating of the CFS score. It should look similar to this:
With this code, I can make a single bar plot for the column cfs_triage:
ggplot(data = df) +
geom_bar(mapping = aes(x = cfs_triage, y = (..count..)/sum(..count..)))
But I can't figure out how to make one with the three variables next to one another.
Thank you in advance to everyone who helps me make this bar plot with the percentage of ratings for these three variables! (I'm not sure my explanation is very clear, but I hope it is :))
Your best bet here is to pivot your data into long format. We don't have your data, but we can reproduce a similar data set like this:
set.seed(1)
df <- data.frame(cfs_triage = sample(10, 1322, TRUE, prob = 1:10),
cfs_silver = sample(10, 1322, TRUE),
cfs_student = sample(10, 1322, TRUE, prob = 10:1))
df[] <- lapply(df, function(x) { x[sample(1322, 300)] <- NA; x})
Now the dummy data set looks a lot like yours:
head(df)
#> cfs_triage cfs_silver cfs_student
#> 1 9 NA 1
#> 2 8 4 2
#> 3 NA 8 NA
#> 4 NA 10 9
#> 5 9 5 NA
#> 6 3 1 NA
If we pivot into long format, then we will end up with two columns: one containing the values, and one containing the column name that the value belonged to in the original data frame:
library(tidyverse)
df_long <- df %>%
pivot_longer(everything())
head(df_long)
#> # A tibble: 6 x 2
#> name value
#> <chr> <int>
#> 1 cfs_triage 9
#> 2 cfs_silver NA
#> 3 cfs_student 1
#> 4 cfs_triage 8
#> 5 cfs_silver 4
#> 6 cfs_student 2
This then allows us to plot with value on the x axis, and we can use name as a grouping / fill variable:
ggplot(df_long, aes(value, fill = name)) +
geom_bar(position = 'dodge') +
scale_fill_grey(name = NULL) +
theme_bw(base_size = 16) +
scale_x_continuous(breaks = 1:10)
#> Warning: Removed 900 rows containing non-finite values (`stat_count()`).
Created on 2022-11-25 with reprex v2.0.2
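The plot above shows counts. Since the question asks for percentages, here is a sketch that turns the counts into proportions within each variable before plotting. It reuses the simulated data from above; df_pct and pct are names I made up, and dropping the NAs before computing the shares is my assumption about what is wanted.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Same simulated data as above
set.seed(1)
df <- data.frame(cfs_triage  = sample(10, 1322, TRUE, prob = 1:10),
                 cfs_silver  = sample(10, 1322, TRUE),
                 cfs_student = sample(10, 1322, TRUE, prob = 10:1))
df[] <- lapply(df, function(x) { x[sample(1322, 300)] <- NA; x })

df_pct <- df %>%
  pivot_longer(everything()) %>%
  drop_na(value) %>%               # assumption: missing ratings are excluded
  count(name, value) %>%           # number of ratings per variable and score
  group_by(name) %>%
  mutate(pct = n / sum(n)) %>%     # share within each variable
  ungroup()

p <- ggplot(df_pct, aes(factor(value), pct, fill = name)) +
  geom_col(position = "dodge") +
  scale_fill_grey(name = NULL) +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "CFS score", y = NULL) +
  theme_bw(base_size = 16)

p
```

Each variable's bars then sum to 100%, so the three raters can be compared even if they have different numbers of non-missing ratings.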
Maybe you need something like this. The formatting was taken from @Allan Cameron (many thanks!):
library(tidyverse)
library(scales)
df %>%
mutate(id = row_number()) %>%
pivot_longer(-id) %>%
group_by(id) %>%
mutate(percent = value/sum(value, na.rm = TRUE)) %>%
mutate(percent = ifelse(is.na(percent), 0, percent)) %>%
mutate(my_label = str_trim(paste0(format(100 * percent, digits = 1), "%"))) %>%
ggplot(aes(x = factor(name), y = percent, fill = factor(name), label = my_label))+
geom_col(position = position_dodge())+
geom_text(aes(label = my_label), vjust=-1) +
facet_wrap(. ~ id, nrow=1, strip.position = "bottom")+
scale_fill_grey(name = NULL) +
scale_y_continuous(labels = scales::percent)+
theme_bw(base_size = 16)+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
I'm still pretty new to R, I'm sorry if the question is a stupid one!
For some descriptives I created a barplot to visualize group differences in my sample. I have two groups of people - suicide attempters and non-attempters. They differ regarding their diagnoses and so far I have a plot showing me how many people per group I have with a certain diagnosis, but I'd like to have a bar representing those people per group who do not have this diagnosis.
So I'd have a bar representing the number of people with MDD in the attempters group, a bar for those without in the attempters group, a bar for those with MDD in the non-attempters and a bar for those without MDD in the non-attempters.
Regarding my data: Everything is coded as 0 or 1, except for the attempters or not.
My old data frame looks something like this:
code MDD Anxiety PTBS attempters
01 0 1 1 1
02 1 1 0 0
03 0 0 1 0
04 0 1 0 0
At first I changed my data from wide to long and recoded the grouping variable attempters to a factor:
# create data frame for attempters
data_attempters <- data_gesamt %>%
pivot_longer(cols = c(MDD, Anxiety, PTBS),
names_to = "predictors", values_to = "value") %>%
filter(value == 1) %>%
# convert "attempters" to factor
mutate(attempter = as.factor(attempters)) %>%
# rename factor levels
mutate(attempter = recode_factor(attempter, "yes" = "0", "no" = "1")) %>%
group_by(predictors, attempter) %>%
summarize(severity = n(),.groups = "drop")
which got me a data frame as follows:
predictors attempters severity
MDD 0 1
Anxiety 1 1
Anxiety 0 2
PTBS 1 1
PTBS 0 1
and then used the following to plot:
plot_attempers <- data_attempters %>%
ggplot(aes(x = attempter, y = severity,
fill = attempter, group = attempter)) +
geom_bar(stat = "identity",
# position_dodge for avoid bar stacked on each other
position = position_dodge()) +
scale_fill_manual(labels = labs, values = c("0" = "#999999", "1" = "#CC79A7")) +
facet_grid(.~ predictors) +
scale_y_continuous(limits = c(0, 12), breaks = seq(0, 12, by = 1)) +
theme(legend.position = "bottom",
axis.text.x=element_blank()) +
ylab("Frequency")
plot_attempers
Did I add something in the code where I converted the data which made me lose the data about those who do not have a certain diagnosis which is why it is not shown in my plot? Or what do I need to add to get the non-diagnosis-people in the plot as well? Because as I can see in the new data-frame, I did lose those who do not have a diagnosis ...
My plot looks like this so far (please ignore the diagnoses I did not mention in my explanation here. I did not include them in this post so it is a smaller sample as well):
I would want four bars per diagnosis (two per group, one of them representing people with the diagnosis and one representing the people without)
You have to summarise your data first. Here I create a small example that simulates your data:
library(dplyr)
library(ggplot2)
library(reshape2)
set.seed(1) # make the simulated data reproducible
df <- data.frame(code = 1:100,
                 MDD = sample(0:1, 100, replace = TRUE, prob = c(0.3, 0.7)),
                 anxiety = sample(0:1, 100, replace = TRUE, prob = c(0.4, 0.6)),
                 PTBS = sample(0:1, 100, replace = TRUE),
                 attempters = sample(0:1, 100, replace = TRUE, prob = c(0.2, 0.8)))
x <- reshape2::melt(df[, -1], id.vars = "attempters", variable.name = "diagnosis")
# count people with and without each diagnosis, per group
t <- x %>% group_by(diagnosis, attempters) %>%
  summarise(sick = sum(value == 1), healthy = sum(value == 0), .groups = "drop")
t <- reshape2::melt(as.data.frame(t), id.vars = c("diagnosis", "attempters"))
# only the grouping variable becomes a factor; value must stay numeric for the bar heights
t$attempters <- factor(t$attempters)
ggplot(t, aes(x = attempters, y = value)) +
  geom_bar(stat = "identity", aes(fill = variable), position = "dodge") +
  facet_wrap(~diagnosis) +
  scale_fill_manual(values = c("#CC79A7", "#999999"))
and this is the resulting plot
I have a question concerning the ordering of stacked bars in a swimmer plot using ggplot2 in R.
I have a sample dataset of (artificial) patients, who receive treatments.
library(tidyverse)
df <- read.table(text="patient start_t_1 t_1_duration start_t_2 t_2_duration start_t_3 t_3_duration start_t_4 t_4_duration end
1 0 1.5 1.5 3 NA NA 4.5 10 10
2 0 2 4.5 2 NA NA 2 2.5 10
3 0 5 5 2 7 0.5 7.5 2 9.5
4 0 8 NA NA NA NA 8 2 10", header=TRUE)
All patients start the first treatment at time = 0. Subsequently, patients get different treatments (numbered t_2 up to t_4).
I tried to plot the swimmer plot, using the following code:
df %>%
gather(variable, value, c(t_1_duration, t_2_duration, t_3_duration, t_4_duration)) %>%
ggplot(aes(x = patient, y = value, fill = variable)) +
geom_bar(stat = "identity") +
coord_flip()
However, the treatments are not displayed in the right order.
For example: patient 3 receives all treatments in consecutive order, while patient 2 first receives treatment 1, then 4 and eventually 2.
So, simply reversing the order does not work.
How do I order the stacked bars in a chronological way?
What about this:
df %>%
gather(variable, value, c(t_1_duration, t_2_duration, t_3_duration,t_4_duration)) %>%
ggplot(aes(x = patient,
y = value,
# here you can specify the order of the variable
fill = factor(variable,
levels =c("t_4_duration", "t_3_duration", "t_2_duration","t_1_duration")))) +
geom_bar(stat = "identity") +
coord_flip()+ guides(fill=guide_legend("My title"))
EDIT:
That has been a long trip, because it involves a kind of hack. I think it's not a dupe of that question, because it also involves some data reshaping:
library(reshape2)
# divide starts and duration
starts <- df %>% select(patient, start_t_1, start_t_2, start_t_3, start_t_4)
duration <- df %>% select(patient, t_1_duration,t_2_duration, t_3_duration, t_4_duration)
# here you melt them
starts <- melt(starts, id = 'patient') %>%
mutate(keytreat = substr(variable,nchar(as.vector(variable))-2, nchar(as.vector(variable)))) %>%
`colnames<-`(c("patient", "variable", "start","keytreat")) %>% select(-variable)
duration <- melt(duration, id = 'patient') %>% mutate(keytreat = substr(variable,1, 3)) %>%
`colnames<-`(c("patient", "variable", "duration","keytreat")) %>% select(-variable)
# join
dats <- starts %>% left_join(duration) %>% arrange(patient, start) %>% filter(!is.na(start))
# here the part for the plot
bars <- map(unique(dats$patient)
, ~geom_bar(stat = "identity", position = "stack"
, data = dats %>% filter(patient == .x)))
dats %>%
ggplot(aes(x = patient,
y = duration,
fill = reorder(keytreat,-start))) +
bars +
guides(fill=guide_legend("ordering")) + coord_flip()
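As an alternative to stacking bars, here is a sketch that draws each treatment as a rectangle from its start to start + duration, so the chronological order comes straight from the data. This swaps in geom_rect() for the stacked geom_bar() approach above; the names long and treatment are mine, and the data is the df from the question.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

df <- read.table(text = "patient start_t_1 t_1_duration start_t_2 t_2_duration start_t_3 t_3_duration start_t_4 t_4_duration end
1 0 1.5 1.5 3 NA NA 4.5 10 10
2 0 2 4.5 2 NA NA 2 2.5 10
3 0 5 5 2 7 0.5 7.5 2 9.5
4 0 8 NA NA NA NA 8 2 10", header = TRUE)

long <- df %>%
  select(-end) %>%
  # rename start_t_1 to t_1_start so both kinds of column share one pattern
  rename_with(~ sub("^start_(t_[0-9])$", "\\1_start", .x)) %>%
  pivot_longer(-patient,
               names_to = c("treatment", ".value"),
               names_pattern = "(t_[0-9])_(start|duration)") %>%
  drop_na(start)   # treatments a patient never received

p <- ggplot(long) +
  geom_rect(aes(xmin = start, xmax = start + duration,
                ymin = patient - 0.4, ymax = patient + 0.4,
                fill = treatment)) +
  scale_y_continuous(breaks = unique(long$patient)) +
  labs(x = "time", y = "patient")

p
```

Because each rectangle is positioned by its own start time, no factor reordering is needed, and any overlaps or gaps in the data become visible instead of being hidden by stacking.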
I want to plot the rolling mean of data of different time series with ggplot2. My data have the following structure:
library(dplyr)
library(ggplot2)
library(zoo)
library(tidyr)
df <- data.frame(episode=seq(1:1000),
t_0 = runif(1000),
t_1 = 1 + runif(1000),
t_2 = 2 + runif(1000))
df.tidy <- gather(df, "time", "value", -episode) %>%
separate("time", c("t", "time"), sep = "_") %>%
subset(select = -t)
> head(df.tidy)
# episode time value
#1 1 0 0.7466480
#2 2 0 0.7238865
#3 3 0 0.9024454
#4 4 0 0.7274303
#5 5 0 0.1932375
#6 6 0 0.1826925
Now, the code below creates a plot in which the rolling-mean lines for time = 1 and time = 2 do not represent the data towards the beginning of the episodes, because value is filled with NAs and the first numeric entry in value is for time = 0.
ggplot(df.tidy, aes(x = episode, y = value, col = time)) +
geom_point(alpha = 0.2) +
geom_line(aes(y = rollmean(value, 10, align = "right", fill = NA)))
How do I have to adapt my code such that the rolling-mean lines are representative of my data?
The issue is that you are applying a moving average over the whole column, which makes data "leak" from one value of time to the next.
You could group_by first to apply the rollmean to each time separately:
ggplot(df.tidy, aes(x = episode, y = value, col = time)) +
geom_point(alpha = 0.2) +
geom_line(data = df.tidy %>%
group_by(time) %>%
mutate(value = rollmean(value, 10, align = "right", fill = NA)))
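To make the "leak" concrete, here is a small sketch on toy data (20 episodes and two values of time; toy, leaky and grouped are names I made up). It compares the rolling mean at the first episode of time = "1" with and without group_by().

```r
library(dplyr)
library(zoo)

set.seed(42)
toy <- data.frame(
  episode = rep(1:20, times = 2),
  time    = rep(c("0", "1"), each = 20),
  value   = c(runif(20), 1 + runif(20))
)

# Ungrouped: one window slides down the whole column
leaky <- toy %>%
  mutate(roll = rollmean(value, 10, align = "right", fill = NA))

# Grouped: the window restarts for each value of time
grouped <- toy %>%
  group_by(time) %>%
  mutate(roll = rollmean(value, 10, align = "right", fill = NA)) %>%
  ungroup()

# Row 21 is episode 1 of time = "1"
leaky$roll[21]    # average of rows 12 to 21: nine time = "0" values and one time = "1" value
grouped$roll[21]  # NA, because time = "1" does not have 10 observations yet
```

With grouping, each series starts with nine NAs of its own, so the lines begin at episode 10 and only ever average values from the same time.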