Replicating a trending chart with ggplot - r

I recently saw a chart I want to replicate in R. The chart shows a score or other measurement for multiple records as a colored box, binned into one of, say, 4 colors. In my image it is red, light red, light green, and green. So each record gets one box for each score they have - the idea is that each record had one score for a given point in time over several points in time. In my example, I'll use student test scores over time, so say we have 4 students and 8 tests throughout the year (in chronological order) we would have 8 boxes for each student, resulting in 32 boxes. Each row (student) would have 8 boxes.
Here is how I created some example data:
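totallynotrealdata <- data.frame(Student = rep(c("A", "B", "C", "D"), each = 8), Test = rep(1:8, 4), Score = sample(1:99, 32, replace = TRUE))
# BinnedScore is added afterwards: a data frame cannot refer to its own columns while it is still being built
totallynotrealdata$BinnedScore <- cut(totallynotrealdata$Score, breaks = c(0, 25, 50, 75, 100), labels = c(1, 2, 3, 4))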
What I'm wondering is how I can recreate this chart in ggplot? Any geoms I should look at?

You could play with geom_rect(). This is very basic but I guess you can easily optimize it for your purposes:
df <- data.frame(Student = c(rep(1,8),rep(2,8),rep(3,8),rep(4,8)),
Test = rep(1:8,4),
Score = sample(1:99,32,replace = TRUE))
df$BinnedScore <- cut(df$Score,breaks = c(0,25,50,75,100),labels = c(1,2,3,4))
df$Student <- factor(df$Student, labels = LETTERS[1:length(unique(df$Student))])
library(ggplot2)
colors <- c("#f23d2e", "#e39e9c", "#bbd3a8", "#68f200")
numStuds <- length(levels(df$Student))
numTests <- max(df$Test)
ggplot() + geom_rect(data = df, aes(xmin = Test-1, xmax = Test, ymin = as.numeric(Student)-1, ymax = as.numeric(Student)), fill = colors[df$BinnedScore], col = grey(0.5)) +
xlab("Test") + ylab("Student") +
scale_y_continuous(breaks = seq(0.5, numStuds, 1), labels = levels(df$Student)) +
scale_x_continuous(breaks = seq(0.5, numTests, 1), labels = 1:numTests)
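As an alternative sketch (not part of the answer above), geom_tile() with the bin mapped to fill inside aes() would also give a legend for free. The bin labels and the seed below are assumptions made for illustration; the colors are the ones used above.

```r
library(ggplot2)

set.seed(42)  # assumption: fixed seed so the example is reproducible
df <- data.frame(
  Student = rep(LETTERS[1:4], each = 8),
  Test    = rep(1:8, times = 4),
  Score   = sample(1:99, 32, replace = TRUE)
)

# Bin the scores into four ordered classes; the labels are illustrative
df$BinnedScore <- cut(df$Score,
                      breaks = c(0, 25, 50, 75, 100),
                      labels = c("1-25", "26-50", "51-75", "76-99"))

colors <- c("#f23d2e", "#e39e9c", "#bbd3a8", "#68f200")

# One tile per student/test pair; fill is mapped, so ggplot2 builds the legend
p <- ggplot(df, aes(x = factor(Test), y = Student, fill = BinnedScore)) +
  geom_tile(colour = "grey50") +
  scale_fill_manual(values = colors, drop = FALSE) +
  labs(x = "Test", y = "Student", fill = "Score bin")

print(p)
```

Because Test is treated as a factor here, the boxes are centered on the axis labels and no manual break positions are needed.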

Related

geom_density_2d_filled and gganimate: cumulative 2D density estimate animation over time?

This is a follow-up question of sorts to ggplot2 stat_density_2d: how to fix polygon errors at the dataset bounding box edges?
I am trying to animate a 2D density estimate ggplot2::geom_density_2d_filled over time so that each frame adds data to what was presented before. So far I have the gganimate animation working for the 2D density estimate so that each point in time (the dataframe column monthly) is individual, but I have no idea how to proceed from here.
Is it possible to use gganimate to cumulatively animate geom_density_2d_filled? Or could this be achieved by manipulating the source dataframe somehow?
Please see my code below:
library(dplyr)
library(sf)
library(geofi)
library(ggplot2)
library(gganimate)
# Finland municipalities
muns <- geofi::get_municipalities(year = 2022)
# Create sample points
points <- sf::st_sample(muns, 240) %>% as.data.frame()
points[c("x", "y")] <- sf::st_coordinates(points$geometry)
monthly <- seq(as.Date("2020/1/1"), by = "month", length.out = 24) %>%
rep(., each = 10)
points$monthly <- monthly
p <- ggplot() +
geom_density_2d_filled(data = points,
aes(x = x, y = y, alpha = after_stat(level))) +
geom_sf(data = muns,
fill = NA,
color = "black") +
coord_sf(default_crs = sf::st_crs(3067)) +
geom_point(data = points,
aes(x = x, y = y),
alpha = 0.1) +
scale_alpha_manual(values = c(0, rep(0.75, 13)),
guide = "none") +
# gganimate specific
transition_states(monthly,
transition_length = 1,
state_length = 40) +
labs(title = "Month: {closest_state}") +
ease_aes("linear")
animate(p, renderer = gganimate::gifski_renderer())
gganimate::anim_save(filename = "so.gif", path = "anim")
The resulting animation is seen below. Could this be portrayed cumulatively?
To get cumulative figures the easiest way is to repeat each month's data in future months.
Using the tidyverse (map() comes from purrr and unnest() from tidyr, neither of which is loaded in the question's code), add the following statement before you define p:
points <- points %>%
mutate(monthly = map(monthly, ~seq(., max(monthly), by = "month"))) %>%
unnest(monthly)
Note that a cumulative density will not necessarily increase over time - if you want an animation that steadily increases you might want to add contour_var = "count" to your geom_density... term.
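To see what that expansion does, here is a minimal base R sketch on a made-up three-point data frame (the name toy and its values are assumptions, not the question's data). Each point is copied into its own month and every later month, so the number of points per frame grows over time.

```r
# Toy data: one point per month (hypothetical values)
toy <- data.frame(
  x = c(1, 2, 3),
  monthly = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01"))
)

last_month <- max(toy$monthly)

# Repeat every row once for its own month and once for each later month
expanded <- do.call(rbind, lapply(seq_len(nrow(toy)), function(i) {
  data.frame(
    x = toy$x[i],
    monthly = seq(toy$monthly[i], last_month, by = "month")
  )
}))

# Number of points shown in each animation state
print(table(expanded$monthly))
```

The January point ends up in all three states, the February point in two, and the March point in one, which is what makes each frame cumulative.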

How to specify unique geom assignments to facets?

Below I have simulated a dataset where an assignment was given to 5 groups of individuals on 5 different days (a new group with 200 new individuals each day). TrialStartDate denotes the date on which the assignment was given to each individual (ID), and TrialFinishDate denotes when each individual finished the assignment.
set.seed(123)
data <-
data.frame(
TrialStartDate = rep(c(sample(seq(as.Date('2019/02/01'), as.Date('2019/02/15'), by="day"), 5)), each = 200),
TrialFinishDate = sample(seq(as.Date('2019/02/01'), as.Date('2019/02/15'), by = "day"), 1000,replace = T),
ID = seq(1,1000, 1)
)
I am interested in comparing how long individuals took to complete the trial depending on when they started the trial (i.e., assuming TrialStartDate has an effect on the length of time it takes to complete the trial).
To visualize this, I want to make a barplot showing counts of IDs on each TrialFinishDate where bars are colored by TrialStartDate (since each TrialStartDate acts as a grouping variable). The best I have come up with so far is by faceting like this:
data%>%
group_by(TrialStartDate, TrialFinishDate)%>%
count()%>%
ggplot(aes(x = TrialFinishDate, y = n, col = factor(TrialStartDate), fill = factor(TrialStartDate)))+
geom_bar(stat = "identity")+
facet_wrap(~TrialStartDate, ncol = 1)
However, I also want to add a vertical line to each facet showing when the TrialStartDate was for each group (preferably colored the same as the bars). When attempting to add vertical lines with geom_vline, it adds all the lines to each facet:
data%>%
group_by(TrialStartDate, TrialFinishDate)%>%
count()%>%
ggplot(aes(x = TrialFinishDate, y = n, col = factor(TrialStartDate), fill = factor(TrialStartDate)))+
geom_bar(stat = "identity")+
geom_vline(xintercept = unique(data$TrialStartDate))+
facet_wrap(~TrialStartDate, ncol = 1)
How can we make the vertical lines unique to the respective group in each facet?
You're specifying xintercept outside of aes, so the faceting is not respected.
This should do the trick:
data %>%
group_by(TrialStartDate, TrialFinishDate)%>%
count()%>%
ggplot(aes(x = TrialFinishDate, y = n, col = factor(TrialStartDate), fill = factor(TrialStartDate)))+
geom_bar(stat = "identity")+
geom_vline(aes(xintercept = TrialStartDate))+
facet_wrap(~TrialStartDate, ncol = 1)
Note geom_vline(aes(xintercept = TrialStartDate))
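A variant sketch, under a couple of assumptions: to my understanding geom_vline() does not inherit the plot-level aesthetics, so the colour has to be mapped in its own aes() if the lines should match the bars. Passing a small data frame with one row per start date (called vlines here, a made-up name) also avoids drawing the same line once per bar row. geom_bar() counts the rows itself, so no dplyr step is needed.

```r
library(ggplot2)

set.seed(123)
dates <- seq(as.Date("2019-02-01"), as.Date("2019-02-15"), by = "day")
data <- data.frame(
  TrialStartDate  = rep(sample(dates, 5), each = 200),
  TrialFinishDate = sample(dates, 1000, replace = TRUE),
  ID = 1:1000
)

# One row per start date, so each facet gets exactly one line
vlines <- unique(data["TrialStartDate"])

p <- ggplot(data, aes(x = TrialFinishDate, fill = factor(TrialStartDate))) +
  geom_bar() +  # default stat counts the IDs per finish date
  geom_vline(data = vlines,
             aes(xintercept = TrialStartDate,
                 colour = factor(TrialStartDate)),
             linetype = 2) +
  facet_wrap(~TrialStartDate, ncol = 1) +
  labs(fill = "Start date", colour = "Start date")

print(p)
```

The faceting variable is present in vlines, which is what lets facet_wrap() route each line to its own panel.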

How to plot data that is separated into multiple categories using ggplot2?

A market survey on consumer satisfaction when using two cleaning products (Product A and Product B) was separated according to room (Kitchen, Bathroom, Bedroom), frequency of use (Daily, Weekly, Monthly) and product bottle size (Small, Medium, Large).
I thought the best way to represent these results is column plots. However, I think there are better ways to represent these results, so I come here to ask for suggestions for a better graphical representation, as I am not very satisfied with column plots.
In MWE the data is arranged ascending, but with real data, the different column sizes make the plot visually very polluted.
library(ggplot2)
Product <- c("Product A", "Product B")
Place <- c("Kitchen", "Bathroom", "Bedroom")
Bottle_size <- c("Small", "Medium", "Large")
Frequency <- c("Daily", "Weekly", "Monthly")
DF <- expand.grid(Bottle_size = Bottle_size,
Place = Place,
Frequency = Frequency,
Product = Product)
DF$Consumer_Approval <- seq(from = 10,
to = 100,
by = (100-10)/53)
ggplot(data = DF,
aes(x = Frequency,
y = Consumer_Approval)) +
geom_col(aes(fill = Bottle_size),
position = "dodge") +
facet_grid(Product ~ Place)
The real data is something like:
One option would be a lollipop chart, where the height of the lollipop shows the difference between the consumer rating for product A and product B. This allows a quick visual estimate of the preferred product in a given setting:
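# Split on the column DF$Product; a bare Product refers to the length-2 vector defined above and would be recycled across rows
ProdA <- split(DF, DF$Product)[[1]]
ProdB <- split(DF, DF$Product)[[2]]
DF <- dplyr::left_join(ProdA, ProdB, by = c("Bottle_size", "Place", "Frequency"))
DF$AvsB <- DF$Consumer_Approval.x - DF$Consumer_Approval.y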
ggplot(data = DF, aes(x = Frequency, y = AvsB, color = Bottle_size)) +
geom_point(position = position_dodge(width = 0.5)) +
geom_linerange(aes(ymin = AvsB, ymax = 0), position = position_dodge(width = 0.5)) +
geom_hline(yintercept = 0, linetype = 2) +
ylim(-100, 100) +
facet_grid(.~Place) +
labs(y = "Prefers Product B <- No preference -> Prefers Product A")

R control jitter function - avoid overplotting / non-random jitter

My problem seems simple: I am using ggplot2 with geom_jitter() to plot a variable (take my picture as an example).
Jitter now adds some random noise to the variable (the variable is just called "1" in this example) to prevent overplotting. So I have now random noise in the y-direction and clearly what otherwise would be completely overplotted is now better visible.
But here is my question:
As you can see, there are still some points that overplot each other. In my example this could easily be prevented if the offsets in the y-direction were placed more strategically instead of being random noise.
Can I somehow alter the geom_jitter() behavior or is there a similar function in ggplot2 that does exactly this?
Not really a minimal example, but also not too long:
library("imputeTS")
library("ggplot2")
data <- tsAirgap
# 2.1 Create required data
# Get all indices of the data that comes directly before and after an NA
na_indx_after <- which(is.na(data[1:(length(data) - 1)])) + 1
# starting from index 2 moves all indexes one in front, so no -1 needed for before
na_indx_before <- which(is.na(data[2:length(data)]))
# Get the actual values to the indices and put them in a data frame with a label
before <- data.frame(id = "1", type = "before", input = na_remove(data[na_indx_before]))
after <- data.frame(id = "1", type = "after", input = na_remove(data[na_indx_after]))
all <- data.frame(id = "1", type = "source", input = na_remove(data))
# Get n values for the plot labels
n_before <- length(before$input)
n_all <- length(all$input)
n_after <- length(after$input)
# 2.4 Create dataframe for ggplot2
# join the data together in one dataframe
df <- rbind(before, after, all)
# Create the plot
gg <- ggplot(data = df) +
geom_jitter(mapping = aes(x = id, y = input, color = type, alpha = type), width = 0.5 , height = 0.5)
gg <- gg + ggplot2::scale_color_manual(
values = c("before" = "skyblue1", "after" = "yellowgreen", "source" = "gray66")
)
gg <- gg + ggplot2::scale_alpha_manual(
values = c("before" = 1, "after" = 1, "source" = 0.3)
)
gg + ggplot2::theme_linedraw() + theme(aspect.ratio = 0.5) + ggplot2::coord_flip()
So many good suggestions... here is what Ben's suggestion would look like for my example.
I changed parts of my code to:
gg <- ggplot(data = df, aes(x = input, color = type, fill = type, alpha = type)) +
geom_dotplot(binwidth = 15)
This would basically also work as intended for me. ggbeeswarm, as suggested by Jon, also worked great for my purpose.
I thought of a hack I really like, using ggrepel. It's normally used for labels, but nothing prevents you from making the label into a point.
df <- data.frame(x = rnorm(200),
col = sample(LETTERS[1:3], 200, replace = TRUE),
y = 1)
ggplot(df, aes(x, y, label = "●", color = col)) + # using unicode black circle
ggrepel::geom_text_repel(segment.color = NA,
box.padding = 0.01, key_glyph = "point")
A downside of this method is that ggrepel can take a lot of time for a large number of points, and will recalculate differently each time you change the plot size. A faster alternative would be to use ggbeeswarm::geom_quasirandom, which uses a deterministic process to define jitter that looks random.
ggplot(df, aes(x,y, color = col)) +
ggbeeswarm::geom_quasirandom(groupOnX = FALSE)
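If an extra package is not wanted, the idea of "strategically placed offsets" can also be sketched by hand: bin the values and stack the points of each bin at distinct, centered offsets. The helper stack_offsets() below is hypothetical (it is not part of ggplot2 or ggbeeswarm), and the bin width is an arbitrary choice.

```r
library(ggplot2)

# Hypothetical helper: within each bin of width `binwidth`, give the
# points the offsets ..., -1, 0, 1, ... so none of them share a position
stack_offsets <- function(x, binwidth) {
  bin <- floor(x / binwidth)
  ave(as.numeric(seq_along(x)), bin,
      FUN = function(i) seq_along(i) - (length(i) + 1) / 2)
}

set.seed(1)
pts <- data.frame(x = rnorm(200),
                  col = sample(LETTERS[1:3], 200, replace = TRUE))
pts$y <- stack_offsets(pts$x, binwidth = 0.1)

# Points in the same bin are stacked vertically instead of jittered
p <- ggplot(pts, aes(x, y, color = col)) +
  geom_point()

print(p)
```

The offsets are fully deterministic, so the plot looks the same on every run, but points in neighbouring bins can still touch if the bin width is small relative to the point size.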

Heatmap plotting time against date ggplot

I would like to make a heatmap with ggplot.
The results should be something like this (the y-axis needs to be reversed though):
A subset of example data is below. For the actual application the dataframe has 1000+ users instead of only 3. The gradient filling should be based on the value of the users.
Date <- seq(
from = as.POSIXct("2016-01-01 00:00"),
to = as.POSIXct("2016-12-31 23:00"),
by = "hour"
)
user1 <- runif(length(Date), min = 0, max = 10)
user2 <- runif(length(Date), min = 0, max = 10)
user3 <- runif(length(Date), min = 0, max = 10)
example <- data.frame(Date, user1, user2, user3)
example$hour <- format(example$Date, format = "%H:%M")
example$total <- rowSums(example[,c(2:4)])
I have tried several things by using the (fill = total) argument in combination with geom_tile, geom_raster and stat_density2d (like suggested in similar posts here). An example below:
ggplot(example, aes(Date, hour, fill = total)) +
geom_tile() +
scale_fill_gradient(low = "blue", high = "red")
This only shows individual points and does not treat the y axis as a continuous variable (scale_y_continuous did not help with this either), although the variable is a continuous one.
How can I create a heatmap like the example provided above?
And how could I make a nice cut-off on the y axis (e.g. per 3 hours instead of per hour)?
The way your data is defined, you won't get the desired output because example$Date is a POSIXct object, that is, a date and an hour combined.
So, you must map your graph to the day only:
ggplot(data = example) +
geom_raster(aes(x=as.Date(Date, format='%d%b%y'), y=hour, fill=total)) +
scale_fill_gradient(low = "blue", high = "red")
For your second question, you can group hours like this:
example <- example %>%
group_by(grp = rep(row_number(), length.out = n(), each = 4)) %>%
summarise(Date = as.Date(sample(Date, 1), format='%d%b%y'),
total = sum(total),
time_slot = paste(min(hour), max(hour), sep = "-"))
ggplot(data = example) +
geom_raster(aes(x = Date, y = time_slot, fill = total)) +
scale_fill_gradientn(colours = (cet_pal(6, name = "inferno"))) # I like gradients from "cetcolor" package
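Another way to approach both parts of the question, sketched here under the assumption that a numeric hour is acceptable on the y axis: with the hour stored as an integer, scale_y_reverse() flips the axis and integer division groups the hours into 3-hour slots. The names heat, slot and binned, the fixed UTC time zone and the random totals are illustrative only.

```r
library(ggplot2)

set.seed(1)
stamp <- seq(as.POSIXct("2016-01-01 00:00", tz = "UTC"),
             as.POSIXct("2016-12-31 23:00", tz = "UTC"),
             by = "hour")

heat <- data.frame(
  day   = as.Date(stamp),                   # x axis: calendar day
  hour  = as.integer(format(stamp, "%H")),  # y axis: hour as a number, 0..23
  total = runif(length(stamp), min = 0, max = 30)
)

# Group hours into 3-hour slots labelled by their first hour (0, 3, ..., 21)
heat$slot <- (heat$hour %/% 3) * 3
binned <- aggregate(total ~ day + slot, data = heat, FUN = sum)

p <- ggplot(binned, aes(x = day, y = slot, fill = total)) +
  geom_tile() +
  scale_y_reverse(breaks = seq(0, 21, by = 3)) +  # midnight at the top
  scale_fill_gradient(low = "blue", high = "red") +
  labs(x = "Date", y = "Hour of day (start of 3-hour slot)")

print(p)
```

Keeping the hour numeric is what makes the axis continuous; the character labels produced by format(..., "%H:%M") are treated as discrete categories instead.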
