Finding multiple peak densities on facet wrapped ggplot for two datasets - r

I am currently attempting to plot densities of flies on julian dates, per year. The aim is to see when there are peak densities of flies, for two methods of data collection (group 1 and group 2). I have many rows of data, over the course of 10 years, for example, the data set looks like this:
year  julian  group
2000  214     1
2001  198     1
2001  224     1
2000  189     2
2000  214     2
2001  222     2
2001  259     2
2000  260     2
2000  212     1
Each row is a single observation.
This is my first time plotting with ggplot2, so I am unsure how to plot vertical lines at the peaks for each year.
The code currently looks like this:
data$group <- as.factor(data$group)
plots <- ggplot(data, aes(x = julian, group = group)) +
geom_density(aes(colour = group),adjust = 2) + facet_wrap(~year, ncol = 2)
I have attempted to plot peaks using this code:
geom_vline(data = vline, aes(xintercept = density(data$julian)$x[which.max(density(data$julian)$y)]))
vline <- summarise(group_by(data,year, group=group), density(data$julian, group=group)$x[which.max(density(data$julian)$y)])
vline
However, I assume it has found a single peak density across all years and all groups.
Could anyone advise me on how to plot the maximum density for each year and group in each facet? Even better, if there are multiple peaks, how would I find them, along with a quantitative value for each peak?
Thank you in advance, I am very new to ggplots.

Instead of trying to wrangle all the computations into one line of code, I would suggest splitting them into steps, like so. Rather than using your code to find the highest peak, I make use of this answer, which in principle should also find multiple peaks (see below):
library(dplyr)
library(ggplot2)
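fun_peak <- function(x, adjust = 2) {
  # kernel density estimate with the same bandwidth adjustment as geom_density()
  d <- density(x, adjust = adjust)
  # a peak is where the slope changes from non-negative to negative
  d$x[c(FALSE, diff(diff(d$y) >= 0) < 0)]
}
vline <- data %>%
  group_by(year, group) %>%
  summarise(peak = fun_peak(julian))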
#> `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
ggplot(data, aes(x = julian, group = group)) +
geom_density(aes(colour = group), adjust = 2) +
geom_vline(data = vline, aes(xintercept = peak)) +
facet_wrap(~year, ncol = 2)
And here is a small example with multiple peaks based on the example data in the linked answer:
x <- c(1,1,4,4,9)
data <- data.frame(
year = 2000,
julian = rep(c(1,1,4,4,9), 2),
group = rep(1:2, each = 5)
)
data$group <- as.factor(data$group)
vline <- data %>%
group_by(year, group) %>%
summarise(peak = fun_peak(julian, adjust = 1))
#> `summarise()` has grouped output by 'year', 'group'. You can override using the `.groups` argument.
ggplot(data, aes(x = julian, group = group)) +
geom_density(aes(colour = group), adjust = 1) +
geom_vline(data = vline, aes(xintercept = peak)) +
facet_wrap(~year, ncol = 2)
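The question also asks for a quantitative value for each peak. A minimal sketch of one way to get it, using only base R: the helper `find_peaks` is a hypothetical name (not part of any package), and the data frame `flies` is simply the example table from the question. It returns the density height alongside each peak position.

```r
# Hypothetical helper: location and height of every local maximum
# of the kernel density estimate of x.
find_peaks <- function(x, adjust = 2) {
  d <- density(x, adjust = adjust)
  # TRUE where the slope switches from non-negative to negative
  is_peak <- c(FALSE, diff(diff(d$y) >= 0) < 0, FALSE)
  data.frame(peak = d$x[is_peak], height = d$y[is_peak])
}

# Example data copied from the question
flies <- data.frame(
  year   = c(2000, 2001, 2001, 2000, 2000, 2001, 2001, 2000, 2000),
  julian = c(214, 198, 224, 189, 214, 222, 259, 260, 212),
  group  = factor(c(1, 1, 1, 2, 2, 2, 2, 2, 1))
)

# Apply the helper to every year/group combination
pieces <- split(flies, list(flies$year, flies$group), drop = TRUE)
peaks <- do.call(rbind, lapply(pieces, function(g) {
  data.frame(year = g$year[1], group = g$group[1], find_peaks(g$julian))
}))
rownames(peaks) <- NULL
peaks
```

The `peak` column can be used for `xintercept` as before, and `height` gives the estimated density at that position. If you prefer to stay in dplyr, note that from dplyr 1.1.0 onwards a helper that returns several rows per group is meant to be called inside `reframe()` rather than `summarise()`.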

Related

How to animate several heatmaps in R

I have 30 heatmaps in the form of square matrices. Each heatmap is linked to a date, and I would like to create an animation transitioning from one heatmap to the next based on the date. I am using gganimate in R to do so and am encountering some problems. Let me show an example with some random data:
# GENERATE RANDOM DATA. 30 3X3 MATRICES STORED IN list_matrices
pacman::p_load(tidyverse, ggplot2, av, gganimate)
list_matrices = list()
for(i in 1:30){list_matrices[[i]] = matrix(runif(9), nrow=3)}
# PUT ALL THE MATRICES TOGETHER INTO A TIBBLE AND
# DO A PIVOT LONGER IN ORDER TO USE GGPLOT
for(i in 1:length(list_matrices))
{
tmp_result = list_matrices[[i]] %>% as_tibble() %>%
mutate(rowname = c('a', 'b', 'c'),
frame = i) %>%
pivot_longer(-c(rowname, frame), names_to = 'colname')
if(i == 1)
{
df_result = tmp_result
} else{
df_result = rbind(df_result, tmp_result)
}
}
This is the dataframe that I will plot:
> df_result
# A tibble: 270 x 4
rowname frame colname value
<chr> <int> <chr> <dbl>
1 a 1 V1 0.456
2 a 1 V2 0.716
3 a 1 V3 0.316
4 b 1 V1 0.724
5 b 1 V2 0.766
And I create the plot:
p <- ggplot(data = df_result, aes(x = rowname, y = colname, fill = value)) +
geom_tile() +
geom_text(aes(label = round(value, 3))) +
scale_fill_gradient(low="white", high="red", limits=c(0, 1))
p + transition_time(frame) +
labs(title = "Date: {frame_time}")
I am having three problems here:
1) I am outputting this into a video, but regardless of the number of matrices in list_matrices (that is, regardless of the number of heatmaps I have), the video always lasts 10 seconds, so the transition is too fast. Is there a way to increase the video duration?
2) The video quality is poor. Is there a way to increase the quality?
3) I am including the value of the variable in the heatmap, and I have set geom_text(aes(label = round(value, 3))) to limit the number of decimals, but some of the heatmaps in the video show far more than 3 decimals. Any idea why this is happening and how to fix it?
Your three problems could be addressed as follows:
1) Change fps to 1 to slow the animation down (lower it further to slow it down even more).
2) Save as png to get better video quality.
3) Wrap your round() call in as.character() in the label.
p <- ggplot(data = df_result, aes(x = rowname, y = colname, fill = value)) +
  geom_tile() +
  geom_text(aes(label = as.character(round(value, 3)))) +
  scale_fill_gradient(low = "white", high = "red", limits = c(0, 1))
pp <- p + transition_time(frame) + labs(title = "Date: {frame_time}")
animate(pp, fps = 1, device = "png")
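As for why the extra decimals appear in the first place, one plausible explanation is that numeric labels are interpolated between frames, so the in-between frames show values that were never rounded. The sketch below only imitates that with plain linear interpolation in base R; it is an illustration under that assumption, not gganimate's actual tweening code, and the two values are arbitrary examples.

```r
# Two already-rounded values, standing in for one cell in two consecutive frames
value_frame1 <- round(0.456, 3)
value_frame2 <- round(0.716, 3)

# Linear interpolation at in-between positions (assumed stand-in for tweening)
position <- seq(0, 1, length.out = 4)
tweened <- value_frame1 + position * (value_frame2 - value_frame1)
tweened

# The in-between values are generally not multiples of 0.001 any more
abs(tweened - round(tweened, 3)) > 1e-6

# A character label is a discrete value, so there is nothing to interpolate
label_frame1 <- as.character(value_frame1)
label_frame1
```

This is consistent with the as.character() fix above: once the label is a string, each frame can only show one of the rounded labels.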

R ggplot geom_point overlay from 2 data frames, differentiated by color, subset by id

I have two data frames with identical rows and columns, DataMaster and IMPSAVG, for which I'm trying to create a series of combined overlaid 2d scatterplots (subset by country "ids" and variable columns) with observations from the two data sets differentiated by color in ggplot. The code below does not work, but gives a sense of what I'm aiming for (acctm is the variable and ARG is the country in this example).
ggplot() +
geom_point(data=DataMaster, aes(x="Year", y="acctm"), subset = .(Country %in% c("ARG")), shape=21, color= "red") +
geom_point(data=IMPSAVG, aes(x="Year", y="acctm"), subset = .(Country %in% c("ARG")), shape=21, color= "blue")
While just getting the above to work would be much appreciated, a loop to create separate plots of this variable for all unique country ids in the column Country found in both datasets (also specified by the vector CountryList$Country) would be amazing. Thanks!
Without a reproducible example of your dataset, it is hard to be sure what you are looking for.
However, using these fake datasets:
df1 <- data.frame(Country = c("A","A","A","B","B"),
Year = 2010:2014,
Value = sample(1:100,5))
df2 <- data.frame(Country = c("A","A","A","B","B"),
Year = 2010:2014,
Value = sample(1:100,5))
1) Plotting without joining datasets (not the most appropriate)
You don't absolutely have to combine your data frames to plot them; however, keeping them separate makes things a little harder (especially if you want to customize several parameters).
Here you can do:
library(ggplot2)
ggplot()+
geom_point(data = df1, aes(x = Year, y = Value, color = "blue"), shape = 21)+
geom_point(data = df2, aes(x = Year, y = Value, color = "red"), shape = 21, show.legend = TRUE)+
scale_color_manual(values = c("blue","red"), labels = c("df1","df2"), name = "")
2) Assembling both dataframes (best way to do it)
However, it will be much easier if you combine both data frames (ggplot2 is designed to work with data frames in a longer format).
So, here, you can do:
df1$Dataset = "DF1"
df2$Dataset = "DF2"
DF <- rbind(df1,df2)
Country Year Value Dataset
1 A 2010 66 DF1
2 A 2011 64 DF1
3 A 2012 40 DF1
4 B 2013 58 DF1
5 B 2014 20 DF1
6 A 2010 78 DF2
7 A 2011 25 DF2
8 A 2012 71 DF2
9 B 2013 40 DF2
10 B 2014 61 DF2
Now, you can simply plot it like this which is much more concise:
library(ggplot2)
ggplot(DF, aes(x = Year, y = Value, color = Dataset))+
geom_point(shape = 21)
3) Subsetting dataframe
To plot only a subset of your dataframes, starting with the assembled dataframe DF, you can simply do:
library(ggplot2)
ggplot(subset(DF, Country =="A"), aes(x = Year, y = Value, color = Dataset))+
geom_point(shape = 21)
Does this answer your question?
I think you need to create a new dataframe, which combines those two dataframes and subsets the countries that you are interested in. You can use rbind for combining the two, and also you should add a column for samples indicating which dataframe they are coming from, so that you can use it later in aes(..., color = new_column).
Just to add to dc37's excellent write-up, here is a trick to have one data frame drawn on top of the other:
ggplot(subset(DF, Country == "A"), aes(x = Year, y = Value, color = Dataset)) +
  geom_point(shape = 21, na.rm = TRUE) +
  # the second layer redraws only the DF1 points, so they end up on top
  geom_point(data = subset(DF, Dataset == "DF1" & Country == "A"),
             aes(x = Year, y = Value, color = Dataset), shape = 21, na.rm = TRUE)
where "DF1" is the data set you want plotted on top.
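The question also asks for a loop that produces one plot per country. A minimal sketch, assuming the two data frames have been combined into a single `DF` as in the answer above (the fake data below stands in for DataMaster and IMPSAVG, and `plots` is just an illustrative name):

```r
library(ggplot2)

# Fake data standing in for the two real data frames
set.seed(1)
df1 <- data.frame(Country = c("A", "A", "A", "B", "B"),
                  Year = 2010:2014,
                  Value = sample(1:100, 5))
df2 <- data.frame(Country = c("A", "A", "A", "B", "B"),
                  Year = 2010:2014,
                  Value = sample(1:100, 5))
df1$Dataset <- "DF1"
df2$Dataset <- "DF2"
DF <- rbind(df1, df2)

# Build one plot per country and keep them in a named list
countries <- unique(DF$Country)
plots <- lapply(countries, function(ctry) {
  ggplot(subset(DF, Country == ctry), aes(x = Year, y = Value, color = Dataset)) +
    geom_point(shape = 21) +
    ggtitle(ctry)
})
names(plots) <- countries
```

A single plot can then be shown with `plots[["A"]]`, or all of them printed in turn with `for (p in plots) print(p)`.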

R: suitable plot to display data with skewed counts

I have data like:
Name Count
Object1 110
Object2 111
Object3 95
Object4 40
...
Object2000 1
So only the first 3 objects have high counts; the remaining 1996 objects have counts below 40, with the majority below 10. I am plotting this data with a ggplot bar chart like:
ggplot(data=object_count, mapping = aes(x=object, y=count)) +
geom_bar(stat="identity") +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
My plot is as below. As you can see, because there are so many objects with low counts, the graph is very wide and the bars are tiny, so the high-count objects are almost invisible. Is there a better way to represent this data? My goal is to show a few top-count objects and to show that there are many low-count ones. Is there a way to group the low-count ones together?
My guess is that your data looks something like this:
library(tidyverse)
set.seed(1)
object_count <- tibble(
  obj_num = 1:2000,
  object = paste0("Object", obj_num),
  count = ceiling(20 * rpois(2000, 10) / obj_num)
)
head(object_count)
## A tibble: 6 x 3
# obj_num object count
# <int> <chr> <dbl>
#1 1 Object1 160
#2 2 Object2 100
#3 3 Object3 46
#4 4 Object4 55
#5 5 Object5 56
#6 6 Object6 40
Sure enough, when I plot that with ggplot(object_count, aes(object, count)) + geom_col() + [theme stuff] I get a similar figure.
Here are some strategies "to show a few top-count objects and to show there are many low-count ones."
Histogram
A vanilla histogram might not be clarifying here, since the important big values appear dramatically less often and would not be prominent enough:
ggplot(object_count, aes(count)) +
geom_histogram()
But we could change that by transforming the y axis to bring more emphasis to small values. The pseudo_log transformation is nice for that, since it works like a log transform for large values but is close to linear between -1 and 1. In this view, we can clearly see where the outliers with just one appearance are, but also see that there are many more small values. The binwidth = 1 here could be set to something wider if the specific values of the big values aren't as important as their general range.
ggplot(object_count, aes(count)) +
geom_histogram(binwidth = 1) +
scale_y_continuous(trans = "pseudo_log",
breaks = c(0:3, 100, 1000), minor_breaks = NULL)
Faceting
Another option could be to split your view into two pieces, one with detail on the big values, the other showing all the small values:
object_count %>%
mutate(biggies = if_else(count > 20, "Big", "Little")) %>%
ggplot(aes(obj_num, count)) +
geom_col() +
facet_grid(~biggies, scales = "free")
Lumping
Another option might be to lump together all the counts under 10. The version below emphasizes the object name and count, and the "Others" category has been labeled to show how many values it includes.
object_count %>%
mutate(group = if_else(count < 10, "Others", object)) %>%
group_by(group) %>%
summarize(avg = mean(count), count = n()) %>%
ungroup() %>%
mutate(group = if_else(group == "Others",
paste0("Others (n =", count, ")"),
group)) %>%
mutate(group = forcats::fct_reorder(group, avg)) %>%
ggplot() +
geom_col(aes(group, avg)) +
geom_text(aes(group, avg, label = round(avg, 0)), hjust = -0.5) +
coord_flip()
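A related way to group the low counts, sketched here in base R, is to bin the counts into ranges and tabulate how many objects fall into each range. The break points below are arbitrary assumptions for illustration and would need adjusting to the real data; the simulated counts are the same guess at the data as above.

```r
# Same simulated counts as in the guessed data above
set.seed(1)
obj_num <- 1:2000
counts <- ceiling(20 * rpois(2000, 10) / obj_num)

# Hypothetical break points: [0,1], (1,5], (5,10], (10,40], (40,Inf)
bins <- cut(counts,
            breaks = c(0, 1, 5, 10, 40, Inf),
            labels = c("0-1", "2-5", "6-10", "11-40", "> 40"),
            include.lowest = TRUE)

# Number of objects in each count range
bin_table <- table(bins)
bin_table
```

The resulting table has only five entries, so it can be shown as a small bar chart (for example with `barplot(bin_table)`) next to a separate plot of the few top-count objects.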
Cumulative count (~Pareto chart)
If you're interested in the share of total count, you might also look at the cumulative count and see how the big values make up a large share:
object_count %>%
mutate(cuml = cumsum(count)) %>%
ggplot(aes(obj_num)) +
geom_tile(aes(y = count + lag(cuml, default = 0),
height = count))

Highlight positions without data in facet_wrap ggplot

When faceting bar plots in ggplot, the x-axis includes all factor levels. However, not all levels may be present in each group. In addition, zero values may be present, so from the bar plot alone it is not possible to distinguish between x-axis values with no data and those with zero y-values. Consider the following example:
library(tidyverse)
set.seed(43)
site <- c("A","B","C","D","E") %>% sample(20, replace=T) %>% sort()
year <- c("2010","2011","2012","2013","2014","2010","2011","2012","2013","2014","2010","2012","2013","2014","2010","2011","2012","2014","2012","2014")
isZero = rbinom(n = 20, size = 1, prob = 0.40)
value <- ifelse(isZero==1, 0, rnorm(20,10,3)) %>% round(0)
df <- data.frame(site,year,value)
ggplot(df, aes(x=year, y=value)) +
geom_bar(stat="identity") +
facet_wrap(~site)
This is fish census data: not all sites were fished in all years, and sometimes a site was fished but no fish were caught. Hence the need to differentiate between the two situations. For example, there was no catch at site C in 2010 and it was not fished in 2011, and the reader cannot tell the difference. I would like to add something like "no data" to the plot for 2011. Maybe it is possible to fill the rows where data is missing, generate another column with the desired text to be added and then include this via geom_text?
So here is an example of your proposed method:
# Tabulate sites vs year, take zero entries
tab <- table(df$site, df$year)
idx <- which(tab == 0, arr.ind = TRUE)
# Build new data.frame
missing <- data.frame(site = rownames(tab)[idx[, "row"]],
year = colnames(tab)[idx[, "col"]],
value = 1,
label = "N.D.") # For 'no data'
ggplot(df, aes(year, value)) +
geom_col() +
geom_text(data = missing, aes(label = label)) +
facet_wrap(~site)
Alternatively, you could also let the facets omit unused x-axis values:
ggplot(df, aes(x=year, y=value)) +
geom_bar(stat="identity") +
facet_wrap(~site, scales = "free_x")
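The idea from the question of filling in the missing rows first can also be sketched directly. The example below uses only base R and a tiny made-up data frame (`census` is an illustrative stand-in for `df`, in which site B was not fished in 2011):

```r
# Small stand-in for the census data; A/2011 is a true zero, B/2011 is missing
census <- data.frame(
  site  = c("A", "A", "B"),
  year  = c("2010", "2011", "2010"),
  value = c(3, 0, 5)
)

# All site/year combinations, joined to the rows that were actually observed
grid <- expand.grid(site = unique(census$site),
                    year = unique(census$year),
                    stringsAsFactors = FALSE)
filled <- merge(grid, census, all.x = TRUE)

# Combinations without a measurement get a label; true zeros stay unlabeled
filled$label <- ifelse(is.na(filled$value), "no data", "")
filled
```

The rows of `filled` with a non-empty label play the same role as the `missing` data frame above and could be passed to geom_text() in the same way, with a fixed y position for the text.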

programmatically setting individual axis limits in facets

I need help setting individual x-axis limits on different facets, as described below.
A programmatic approach is preferred, since I will apply the same template to different data sets.
1) The first two facets should have the same x-axis limits (to have comparable bars).
2) The last facet's (performance) limits should be between 0 and 1, since it is calculated as a percentage.
I have seen this and some other related questions but couldn't apply them to my data.
Thanks in advance.
library(tidyverse)
df <-
  data.frame(
    call_reason = c("a","b","c","d"),
    all_records = c(100,200,300,400),
    problematic_records = c(80,60,100,80))
df <- df %>% mutate(performance = round(problematic_records/all_records, 2))
df
call_reason all_records problematic_records performance
a 100 80 0.80
b 200 60 0.30
c 300 100 0.33
d 400 80 0.20
df %>%
gather(key = facet_group, value = value, -call_reason) %>%
mutate(facet_group = factor(facet_group,
levels=c('all_records','problematic_records','performance'))) %>%
ggplot(aes(x=call_reason, y=value)) +
geom_bar(stat="identity") +
coord_flip() +
facet_grid(. ~ facet_group)
So here is one way to go about it with facet_grid(scales = "free_x"), in combination with a geom_blank(). Here, df refers to your data at the point just before it is piped into ggplot, that is, after the gather() and mutate() steps.
ggplot(df, aes(x=call_reason, y=value)) +
  # geom_col is equivalent to geom_bar(stat = "identity")
  geom_col() +
  # geom_blank includes data for position scale training, but is not rendered
  geom_blank(data = data.frame(
    # value for first two facets is max, last facet is 1
    value = c(rep(max(df$value), 2), 1),
    # dummy category; df$call_reason[1] works for character and factor columns alike
    # (levels() returns NULL for a character column, the data.frame default since R 4.0)
    call_reason = df$call_reason[1],
    # distribute over facets
    facet_group = levels(df$facet_group)
  )) +
  coord_flip() +
  # scales are set to "free_x" so that each facet can have its own limits;
  # the geom_blank data then pins the first two facets to the same range
  facet_grid(. ~ facet_group, scales = "free_x")
As long as your column names remain the same, this should work.
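Since the same template is meant to be reused on other data sets, the dummy data for geom_blank() could be built by a small helper. This is only a sketch: `make_limits` and its arguments `shared` and `fixed` are hypothetical names, and it assumes the long data has the columns call_reason, facet_group (a factor) and value, as above.

```r
# Hypothetical helper: dummy rows that fix the upper limit of each facet.
# shared: facets that should share one upper limit (their common maximum)
# fixed:  named vector of upper limits for the remaining facets
make_limits <- function(long_df, shared, fixed) {
  shared_max <- max(long_df$value[long_df$facet_group %in% shared])
  data.frame(
    value = c(rep(shared_max, length(shared)), unname(fixed)),
    call_reason = long_df$call_reason[1],
    facet_group = factor(c(shared, names(fixed)),
                         levels = levels(long_df$facet_group))
  )
}

# Long-format version of the example data from the question
facets <- c("all_records", "problematic_records", "performance")
long_df <- data.frame(
  call_reason = rep(c("a", "b", "c", "d"), 3),
  facet_group = factor(rep(facets, each = 4), levels = facets),
  value = c(100, 200, 300, 400, 80, 60, 100, 80, 0.80, 0.30, 0.33, 0.20)
)

lims <- make_limits(long_df,
                    shared = c("all_records", "problematic_records"),
                    fixed = c(performance = 1))
lims
```

The result could then be supplied as `geom_blank(data = lims)` in place of the inline data.frame() call. Only upper limits are handled, on the assumption that the bars start at zero.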
EDIT:
To reorder the call_reason variable, you could add the following in your pipe that goes into ggplot:
df %>%
  gather(key = facet_group, value = value, -call_reason) %>%
  mutate(facet_group = factor(facet_group,
                              levels=c('all_records','problematic_records','performance')),
         # In particular the following bit:
         # (unique() instead of levels(), so it also works when call_reason is a character column)
         call_reason = factor(call_reason, unique(call_reason)[order(value[facet_group == "performance"])]))
