Plotting dodged periodic time series - r

I have some data about events happening at certain hours of the day under certain conditions.
The data frame looks something like this:
> tibble(event_id = 1:1000, hour = rep_len(0:23, 1000), conditions = rep_len(c("Non", "Oui"), 1000))
# A tibble: 1,000 × 3
event_id hour conditions
<int> <int> <chr>
1 1 0 Non
2 2 1 Oui
3 3 2 Non
4 4 3 Oui
5 5 4 Non
6 6 5 Oui
7 7 6 Non
8 8 7 Oui
9 9 8 Non
10 10 9 Oui
Somehow I have managed to represent it using geom_bar this way :
mydataframe %>%
group_by(hour, conditions) %>%
count() %>%
ggplot() +
geom_bar(aes(x = hour, y = n, fill = conditions), stat = "identity", position = "dodge")
With my actual data, I get a figure looking like this :
But I would like to get something like two dodged smoothed lines (or geom_density curves), which I can't seem to produce.
Do you have any ideas that could help me?
Thank you

library(tidyverse)
set.seed(42)
mydataframe <- tibble(event_id = 1:1000, hour = rep_len(0:23, 1000), conditions = sample(c("Non", "Oui"), 1000, replace = TRUE))
mydataframe %>%
count(hour, conditions) %>%
ggplot() +
geom_smooth(aes(hour, n, color = conditions), se = FALSE, span = 0.3)
Or if you want to dodge them, you could do this and tweak the amount of width between the series:
mydataframe %>%
count(hour, conditions) %>%
ggplot() +
geom_smooth(aes(hour, n, color = conditions), se = FALSE, span = 0.3,
position = position_dodge(width = 1))
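Since the question also mentions geom_density, here is a minimal sketch of that variant, assuming the same simulated data as in the answer above (rebuilt here with base R so the block runs on its own). It plots the raw hours rather than the counts, and `adjust` is only a smoothing knob to tweak. Note that a plain kernel density does not treat hour 23 and hour 0 as neighbours, so the curves are not truly periodic.

```r
library(ggplot2)

set.seed(42)
# Simulated data in the same shape as the answer's example
mydataframe <- data.frame(
  event_id = 1:1000,
  hour = rep_len(0:23, 1000),
  conditions = sample(c("Non", "Oui"), 1000, replace = TRUE)
)

# One density curve per condition, drawn from the raw hour values
p_density <- ggplot(mydataframe, aes(x = hour, colour = conditions)) +
  geom_density(adjust = 0.5) +
  scale_x_continuous(breaks = seq(0, 23, by = 3))
p_density
```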

Related

Working with ggalluvial ggsankey library with missing combinations and dropouts

I'm trying to represent the movements of patients between several treatment groups measured in 3 different years. However, there are dropouts: some patients from the 1st year are missing in the 2nd year, and some patients in the 2nd year weren't in the 1st. The same goes for the 3rd year. I have a label called "none" for these combinations, but I don't want it to appear in the plot.
An example plot with only 2 years:
EDIT
I have tried with geom_sankey as well (https://rdrr.io/github/davidsjoberg/ggsankey/man/geom_sankey.html).
Although it is closer to what I'm looking for, I don't know how to omit the stratum groups without labels (NA). In this case I'm using my full data, not a dummy example. I can't share it, but I can try to create an example if needed. This is the code I've tried:
data = bind_rows(data_2015,data_2017,data_2019) %>%
select(sip, Year, Grp) %>%
mutate(Grp = factor(Grp), Year = factor(Year)) %>%
arrange(sip) %>%
pivot_wider(names_from = Year, values_from = Grp)
df_sankey = data %>% make_long(`2015`,`2017`,`2019`)
ggplot(df_sankey, aes(x = x,
next_x = next_x,
node = node,
next_node = next_node,
fill = factor(node),
label = node,
color=factor(node) )) +
geom_sankey(flow.alpha = 0.5, node.color = 1) +
geom_sankey_label(size = 3.5, color = 1, fill = "white") +
scale_fill_viridis_d() +
scale_colour_viridis_d() +
theme_sankey(base_size = 16) +
theme(legend.position = "none") + xlab('')
Figure:
Any idea how to omit the missing groups every year as strata (without omitting them in the alluvia) would be super helpful. Thanks!
Solved! The solution was much easier than I thought. I'll leave it here in case someone else struggles with a similar problem.
Create a wide table with one row per patient and one column per year (cohort).
# Data with 3 cohorts for years 2015, 2017 and 2019
# Grp is a factor with 6 levels: 1 to 6
# sip is a unique ID
library(tidyverse)
data_wide = data %>%
select(sip, Year, Grp) %>%
mutate(Grp = factor(Grp, levels=c(1:6)), Year = factor(Year)) %>%
arrange(sip) %>%
pivot_wider(names_from = Year, values_from = Grp)
Using the ggsankey package, we can transform it into the long format the package expects. There's already a useful function for this, make_long():
df_sankey = data_wide %>% make_long(`2015`,`2017`,`2019`)
# The tibble accounts for every change in X axis and Y categorical value (node):
> head(df_sankey)
# A tibble: 6 × 4
x node next_x next_node
<fct> <chr> <fct> <chr>
1 2015 3 2017 2
2 2017 2 2019 2
3 2019 2 NA NA
4 2015 NA 2017 1
5 2017 1 2019 1
6 2019 1 NA NA
It looks like passing the output of pivot_wider() to make_long() completed every combination of patient and year, filling the missing ones with NA. Drop the NA values in 'node' and create the plot.
df_sankey %>% drop_na(node) %>%
ggplot(aes(x = x,
next_x = next_x,
node = node,
next_node = next_node,
fill = factor(node),
label = node,
color=factor(node) )) +
geom_sankey(flow.alpha = 0.5, node.color = 1) +
geom_sankey_label(size = 3.5, color = 1, fill = "white") +
scale_fill_viridis_d() +
scale_colour_viridis_d() +
theme_sankey(base_size = 16) +
theme(legend.position = "none") + xlab('')
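To see where the NA nodes come from, here is a small sketch with hypothetical toy data (the names `toy` and `toy_wide` and the values are made up for illustration, not the original cohort data). It only demonstrates the pivot_wider() step, since that is where the missing patient-year combinations appear.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy cohort: patient 2 drops out after 2015,
# patient 3 only enters in 2017
toy <- tibble(
  sip  = c(1, 1, 1, 2, 3, 3),
  Year = c(2015, 2017, 2019, 2015, 2017, 2019),
  Grp  = c("3", "2", "2", "1", "1", "1")
)

# One row per patient, one column per year; absent combinations become NA
toy_wide <- toy %>%
  pivot_wider(names_from = Year, values_from = Grp)
toy_wide

# Number of missing patients per year, i.e. the rows that end up
# as NA nodes once the table is made long again
na_per_year <- colSums(is.na(toy_wide[, c("2015", "2017", "2019")]))
na_per_year
```

Dropping the rows whose node is NA after make_long(), as in the answer, removes these empty strata from the plot.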

Plot multiple variable in the same bar plot

With my dataframe that looks like this (I have in total 1322 rows) :
I'd like to make a bar plot with the percentage of rating of the CFS score. It should look similar to this :
With this code, I can make a single bar plot for the column cfs_triage :
ggplot(data = df) +
geom_bar(mapping = aes(x = cfs_triage, y = (..count..)/sum(..count..)))
But I can't figure out how to make one with the three variables next to one another.
Thank you in advance to everyone who helps me make this bar plot with the percentage of ratings for these three variables! (I'm not sure my explanation is very clear, but I hope it is :))
Your best bet here is to pivot your data into long format. We don't have your data, but we can reproduce a similar data set like this:
set.seed(1)
df <- data.frame(cfs_triage = sample(10, 1322, TRUE, prob = 1:10),
cfs_silver = sample(10, 1322, TRUE),
cfs_student = sample(10, 1322, TRUE, prob = 10:1))
df[] <- lapply(df, function(x) { x[sample(1322, 300)] <- NA; x})
Now the dummy data set looks a lot like yours:
head(df)
#> cfs_triage cfs_silver cfs_student
#> 1 9 NA 1
#> 2 8 4 2
#> 3 NA 8 NA
#> 4 NA 10 9
#> 5 9 5 NA
#> 6 3 1 NA
If we pivot into long format, then we will end up with two columns: one containing the values, and one containing the column name that the value belonged to in the original data frame:
library(tidyverse)
df_long <- df %>%
pivot_longer(everything())
head(df_long)
#> # A tibble: 6 x 2
#> name value
#> <chr> <int>
#> 1 cfs_triage 9
#> 2 cfs_silver NA
#> 3 cfs_student 1
#> 4 cfs_triage 8
#> 5 cfs_silver 4
#> 6 cfs_student 2
This then allows us to plot with value on the x axis, and we can use name as a grouping / fill variable:
ggplot(df_long, aes(value, fill = name)) +
geom_bar(position = 'dodge') +
scale_fill_grey(name = NULL) +
theme_bw(base_size = 16) +
scale_x_continuous(breaks = 1:10)
#> Warning: Removed 900 rows containing non-finite values (`stat_count()`).
Created on 2022-11-25 with reprex v2.0.2
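The plot above shows counts, while the question asks for percentages. A minimal sketch of one way to get there, assuming the percentages should be computed within each variable: count the ratings first, then divide by the total per variable and plot with geom_col. The data are the same kind of dummy data as above, not the asker's real data, and the names `df_pct` and `p_pct` are only illustrative.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(1)
# Dummy data in the same shape as the answer's example
df <- data.frame(cfs_triage  = sample(10, 1322, TRUE, prob = 1:10),
                 cfs_silver  = sample(10, 1322, TRUE),
                 cfs_student = sample(10, 1322, TRUE, prob = 10:1))

df_pct <- df %>%
  pivot_longer(everything()) %>%
  filter(!is.na(value)) %>%      # drop missing ratings, if any
  count(name, value) %>%         # ratings per variable and score
  group_by(name) %>%
  mutate(pct = n / sum(n)) %>%   # share within each variable
  ungroup()

p_pct <- ggplot(df_pct, aes(value, pct, fill = name)) +
  geom_col(position = "dodge") +
  scale_fill_grey(name = NULL) +
  scale_x_continuous(breaks = 1:10) +
  scale_y_continuous(labels = scales::percent) +
  theme_bw(base_size = 16)
p_pct
```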
Maybe you need something like this. The formatting was taken from @Allan Cameron (many thanks!):
library(tidyverse)
library(scales)
df %>%
mutate(id = row_number()) %>%
pivot_longer(-id) %>%
group_by(id) %>%
mutate(percent = value/sum(value, na.rm = TRUE)) %>%
mutate(percent = ifelse(is.na(percent), 0, percent)) %>%
mutate(my_label = str_trim(paste0(format(100 * percent, digits = 1), "%"))) %>%
ggplot(aes(x = factor(name), y = percent, fill = factor(name), label = my_label))+
geom_col(position = position_dodge())+
geom_text(aes(label = my_label), vjust=-1) +
facet_wrap(. ~ id, nrow=1, strip.position = "bottom")+
scale_fill_grey(name = NULL) +
scale_y_continuous(labels = scales::percent)+
theme_bw(base_size = 16)+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Adding percentage labels in barplots (ggplot2)

I have the following dataset, with variables indicating whether a person used their phone (a dummy variable: 1 = used the phone ("Yes"), 0 = did not ("No")), their ID, and the district and sub-district they live in. Note that the same person may have been recorded twice or more under different sub-districts. However, I only want to count such a person once, that is, consider only unique IDs.
district sub_district id used_phone
A SX 1 Yes
A SX 2 Yes
A SX 3 No
A SX 4 No
A SY 4 No
A SY 5 Yes
A SZ 6 Yes
A SX 6 Yes
A SZ 7 No
B RX 8 No
B RV 9 No
B RX 9 No
B RV 10 Yes
B RV 11 Yes
B RT 12 Yes
B RT 13 Yes
B RV 13 Yes
B RT 14 No
B RX 14 No
N.B: used_phone is a factor variable
For the above dataset, I want to plot a distribution of "whether a person used a phone" for which I was using the following code:
ggplot(df, aes(x=used_phone)) +
geom_bar(color = "black", fill = "aquamarine4", position = "dodge") +
labs(x="Used phone", y = "Number of people") +
ggtitle("Whether person used phone") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
This code works fine. However, I want to do two things:
Add % labels for each group (Yes and No) above the respective bars, while the y-axis still shows the count
Plot the graph so that it only considers unique IDs
Looking forward to solving this with your help, as I am a novice in R.
Thanks,
Rachita
The duplicated ids are people recorded in more than one sub_district, and you don't want to count them twice, so I drop the sub_district variable.
Then I remove all duplicate ids, count phone use, and calculate the percentages. The resulting data frame is shown below.
The plot uses geom_col, with percentages on the y-axis via the scales package.
I have commented out two lines of code that let you facet by district in your ggplot. The resulting diagram is attached at the bottom.
library(tidyverse)
df <- read.table(text="district sub_district id used_phone
A SX 1 Yes
A SX 2 Yes
A SX 3 No
A SX 4 No
A SY 4 No
A SY 5 Yes
A SZ 6 Yes
A SX 6 Yes
A SZ 7 No
B RX 8 No
B RV 9 No
B RX 9 No
B RV 10 Yes
B RV 11 Yes
B RT 12 Yes
B RT 13 Yes
B RV 13 Yes
B RT 14 No
B RX 14 No", header = T)
table(df$used_phone)
#>
#> No Yes
#> 9 10
ddf <- df %>%
select(-sub_district) %>% # delete sub_district
distinct(id, .keep_all = T) %>% # keep one row per id
#group_by(district) %>%
count(used_phone) %>% # count phone use
mutate(pct = n / sum(n)) # calculate percentage
ddf
#> # A tibble: 2 x 3
#> used_phone n pct
#> <chr> <int> <dbl>
#> 1 No 6 0.429
#> 2 Yes 8 0.571
ggplot(ddf, aes(used_phone, pct, fill = used_phone)) +
geom_col(position = 'dodge') +
#facet_wrap(~district) +
scale_fill_manual(values = c("aquamarine4", "aquamarine3")) +
scale_y_continuous(labels = scales::percent_format())
New Addition based on comment:
wants y-axis in counts
wants percentage as labels over the bar
wants as facet for district
ddf <- df %>%
select(-sub_district) %>% # delete sub_district
distinct(id, .keep_all = T) %>% # keep one row per id
group_by(district) %>%
count(used_phone) %>% # count phone use
mutate(pct = n / sum(n), # calculate percentage
label = paste0(round(pct*100, 2), '%'))
ggplot(ddf, aes(used_phone, n, fill = used_phone)) +
geom_col(position = 'dodge') +
facet_wrap(~district) +
scale_fill_manual(values = c("aquamarine4", "aquamarine3")) +
geom_text(aes(label = label),
position = position_stack(vjust = 1.05),
size = 3) +
labs(y='count')
*new addition*
change the basis for percent
ddf <- df %>%
select(-sub_district) %>% # delete sub_district
distinct(id, .keep_all = T) %>% # keep one row per id
mutate(ssum = n()) %>% # total number of people, before grouping
group_by(district) %>%
count(used_phone, ssum) %>% # count phone use
mutate(pct = n / ssum, # calculate percentage
label = paste0(round(pct*100, 2), '%'))
I have introduced a new variable, ssum, which holds the total number of rows before grouping, so the percentages are now relative to all people rather than to each district. That gives:
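The figure is not reproduced here. As a rough sketch, the re-based labels could be drawn by reusing the plotting code from the previous step. The data are the example rows from the question and the pipeline mirrors the one just above; the name `p_rebased` is only illustrative.

```r
library(dplyr)
library(ggplot2)

df <- read.table(text = "district sub_district id used_phone
A SX 1 Yes
A SX 2 Yes
A SX 3 No
A SX 4 No
A SY 4 No
A SY 5 Yes
A SZ 6 Yes
A SX 6 Yes
A SZ 7 No
B RX 8 No
B RV 9 No
B RX 9 No
B RV 10 Yes
B RV 11 Yes
B RT 12 Yes
B RT 13 Yes
B RV 13 Yes
B RT 14 No
B RX 14 No", header = TRUE)

ddf <- df %>%
  select(-sub_district) %>%
  distinct(id, .keep_all = TRUE) %>%   # keep one row per id
  mutate(ssum = n()) %>%               # total people, before grouping
  group_by(district) %>%
  count(used_phone, ssum) %>%
  mutate(pct = n / ssum,               # share of all people
         label = paste0(round(pct * 100, 2), "%")) %>%
  ungroup()

# Counts on the y-axis, overall percentages as labels, one panel per district
p_rebased <- ggplot(ddf, aes(used_phone, n, fill = used_phone)) +
  geom_col(position = "dodge") +
  facet_wrap(~district) +
  scale_fill_manual(values = c("aquamarine4", "aquamarine3")) +
  geom_text(aes(label = label),
            position = position_stack(vjust = 1.05),
            size = 3) +
  labs(y = "count")
p_rebased
```

With this basis, the four labels add up to 100% across both panels instead of within each district.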
Here is one suggestion that could work:
Summarize your df by used_phone and count the total number of people who have and have not used a phone.
Based on the summarized count, you can calculate the percent share, and from that add a label column, which is just the percent share with a % sign.
You can plot with ggplot using the new summarized df. Use geom_text() to add percentage labels at the top of the bars, and the vjust argument of position_stack() to adjust the labels' position.
df_summary <- df %>%
distinct(id, .keep_all = TRUE) %>% # keep one row per person
group_by(used_phone) %>%
summarize(count = n()) %>% # people per answer
mutate(share = count / sum(count),
label = paste0(round(share * 100, 2), '%'))
ggplot(df_summary, aes(y = count, x = used_phone)) +
geom_bar(stat='identity',
color = "black",
fill = "aquamarine4",
position = "dodge") +
geom_text(aes(label = label),
position = position_stack(vjust = 1.02),
size = 3) +
labs(title = 'Whether person used phone',
x = 'Used Phone',
y = 'Number of People') +
theme_bw()

How to Make animated ggplot for different years using gganimate

So I have a simple data frame where the first column includes roadway IDs and the next 10 columns have traffic volumes on each roadway ID over 10 years.
I have been trying to come up with a code to display roadway ID on X axis and Traffic volume on Y axis. Then animate the graph over multiple years (Traffic volumes on the Y axis change). Here is a sample of my data frame:
Could anyone suggest a piece of code to do it? Here is a code that I have written but doesn't really work. I know this may be very wrong, but I am very new to gganimate and not sure how I can get different functions to work. Any help is appreciated.
year <- c(2001,2002,2003,2004,2005,2006,2007,2008,2009,2010)
p1 <- ggplot(data = Data) +
geom_point(aes(x = Data$LinkIDs, y=Data$Year2001Traffic)) +
geom_point(aes(x = Data$LinkIDs, y=Data$Year2002Traffic)) +
geom_point(aes(x = Data$LinkIDs, y=Data$Year2003Traffic)) +
geom_point(aes(x = Data$LinkIDs, y=Data$Year2004Traffic)) +
geom_point(aes(x = Data$LinkIDs, y=Data$Year2005Traffic)) +
geom_point(aes(x = Data$LinkIDs, y=Data$Year2006Traffic)) +
geom_point(aes(x = Data$LinkIDs, y=Data$Year2007Traffic)) +
geom_point(aes(x = Data$LinkIDs, y=Data$Year2008Traffic)) +
geom_point(aes(x = Data$LinkIDs, y=Data$Year2009Traffic)) +
geom_point(aes(x = Data$LinkIDs, y=Data$Year2010Traffic)) +
labs(title = 'Year: {frame_time}', x = 'Link ID', y = 'Traffic Volume') +
transition_time(year)
animate(p1)
Most of the work lies in changing the data before you send it to ggplot and gganimate. To help you with that work, I have created some sample data based on your picture (in the future please supply sample data yourself).
library(tidyverse)
library(gganimate)
df <- tribble(
~LinkIDs, ~Year2001Traffic, ~Year2002Traffic, ~Year2003Traffic,
"A", 1, 10, 15,
"B", 3, 1, 10,
"C", 10, 5, 1)
df
# A tibble: 3 x 4
LinkIDs Year2001Traffic Year2002Traffic Year2003Traffic
<chr> <dbl> <dbl> <dbl>
1 A 1 10 15
2 B 3 1 10
3 C 10 5 1
gganimate and ggplot work best with data in long format. So the first step is to change the data from wide to long before sending it to ggplot.
df <- df %>% gather(Year, Traffic, -LinkIDs)
df
# A tibble: 9 x 3
LinkIDs Year Traffic
<chr> <chr> <dbl>
1 A Year2001Traffic 1
2 B Year2001Traffic 3
3 C Year2001Traffic 10
4 A Year2002Traffic 10
5 B Year2002Traffic 1
6 C Year2002Traffic 5
7 A Year2003Traffic 15
8 B Year2003Traffic 10
9 C Year2003Traffic 1
gganimate needs the Year column to be a number before it can use it for animation. So we need to extract the numbers that are contained in the values.
df <- df %>% mutate(
Year = parse_number(Year))
df
# A tibble: 9 x 3
LinkIDs Year Traffic
<chr> <dbl> <dbl>
1 A 2001 1
2 B 2001 3
3 C 2001 10
4 A 2002 10
5 B 2002 1
6 C 2002 5
7 A 2003 15
8 B 2003 10
9 C 2003 1
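The two reshaping steps above can also be sketched with pivot_longer(), which tidyr now recommends over gather(). This is only an alternative spelling of the same idea, using the sample data from this answer; the names `df_wide` and `df_long` are illustrative. The rows come out ordered by LinkIDs rather than by year, which does not matter for the plot.

```r
library(dplyr)
library(tidyr)
library(readr)

df_wide <- tibble::tribble(
  ~LinkIDs, ~Year2001Traffic, ~Year2002Traffic, ~Year2003Traffic,
  "A",       1,               10,               15,
  "B",       3,                1,               10,
  "C",      10,                5,                1
)

df_long <- df_wide %>%
  # wide to long: one row per roadway and year
  pivot_longer(-LinkIDs, names_to = "Year", values_to = "Traffic") %>%
  # "Year2001Traffic" -> 2001
  mutate(Year = parse_number(Year))
df_long
```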
Now the rest is straightforward. Just plot the data, and pass the Year variable to transition_time() for the animation.
p1 <- ggplot(df, aes(x = LinkIDs, y = Traffic))+
geom_point()+
labs(title = 'Year: {frame_time}', x = 'Link ID', y = 'Traffic Volume')+
transition_time(Year)
animate(p1)
EDIT AFTER UPDATED COMMENTS
Request in comments:
"I just want it to go through the timeline (from 2001 to 2003) just
once and then stop at 2003."
In case you want to stop at the year 2003, you would need to filter the data before you send it to ggplot - this is done via the filter command.
As of 23/3 2019, there is, as far as I know, no way to go through the animation just once. You can alter the end_pause argument in order to insert a pause after each iteration of the animation (I changed geom_point() to geom_col() given your description).
p2 <- df %>%
#keep only observations from the year 2003 and earlier
filter(Year <= 2003) %>%
#Send the data to plot
ggplot(aes(x = LinkIDs, y = Traffic, fill = LinkIDs))+
geom_col()+
labs(title = 'Year: {frame_time}', x = 'Link ID', y = 'Traffic Volume')+
transition_time(Year)
animate(p2, fps = 20, duration = 25, end_pause = 95)

split data into groups in R

My data frame looks like this:
plant distance
one 0
one 1
one 2
one 3
one 4
one 5
one 6
one 7
one 8
one 9
one 9.9
two 0
two 1
two 2
two 3
two 4
two 5
two 6
two 7
two 8
two 9
two 9.5
I want to split the distance of each plant into groups by interval (for instance, interval = 3) and compute the percentage of each group. Finally, I want to plot the percentages of each group for each plant, similar to this:
my code:
library(ggplot2)
library(dplyr)
dat <- data %>%
mutate(group = factor(cut(distance, seq(0, max(distance), 3), F))) %>%
group_by(plant, group) %>%
summarise(percentage = n()) %>%
mutate(percentage = percentage / sum(percentage))
p <- ggplot(dat, aes(x = plant, y = percentage, fill = group)) +
geom_bar(stat = "identity", position = "stack")+
scale_y_continuous(labels=percent)
p
But in my plot, shown below, group 4 is missing.
And I found that dat was wrong: group 4 was NA.
The likely reason is that the range of group 4 is shorter than the interval (3), so my question is how to fix it. Thank you in advance!
I have solved the problem. The reason is that cut(distance, seq(0, max(distance), 3), F) did not include the minimum and maximum values: the intervals are open on the left, so 0 is excluded, and the last break (9) lies below the maximum.
Here is my solution:
dat <- data %>%
mutate(group = factor(cut(distance, seq(from = min(distance), by = 3, length.out = n()/ 3 + 1), include.lowest = TRUE))) %>%
count(plant, group) %>%
group_by(plant) %>%
mutate(percentage = n / sum(n))
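The effect can be checked in base R with the data from the question. The sketch below first reproduces the NA values from the original breaks, then uses a variant of the fix in which the number of breaks is derived from the maximum distance (an assumption of this sketch, not the exact expression used above).

```r
# Data from the question
plants <- data.frame(
  plant = rep(c("one", "two"), each = 11),
  distance = c(0:9, 9.9, 0:9, 9.5)
)

# Original breaks are 0, 3, 6, 9, giving (0,3], (3,6], (6,9]:
# 0 and every value above 9 fall outside and become NA
old_breaks <- seq(0, max(plants$distance), 3)
old_group <- cut(plants$distance, old_breaks, labels = FALSE)
sum(is.na(old_group))

# Variant of the fix: extend the breaks past the maximum
# and close the first interval on the left
interval <- 3
new_breaks <- seq(0, by = interval,
                  length.out = ceiling(max(plants$distance) / interval) + 1)
plants$group <- cut(plants$distance, new_breaks, include.lowest = TRUE)

# Share of each group within each plant (rows sum to 1)
shares <- prop.table(table(plants$plant, plants$group), margin = 1)
round(shares, 3)
```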
