Problem reproducing loess animation, perhaps issue with inflate vs crossing? - r

I'm trying to recreate some of the very nice animations showing the behavior of loess from David Robinson, found at Variance Explained. When I try to recreate them, my groups behave differently from what Robinson shows: the points jump about in a way I don't expect. I'm wondering whether tidyr::crossing behaves differently from the deprecated inflate function from broom that he uses. Any advice on how to make the animation in the last plot below work would be appreciated.
library(lattice)
library(ggplot2)
library(broom)
theme_set(theme_bw())
mod <- loess(NOx ~ E, ethanol, degree = 1, span = .75)
fit <- broom::augment(mod)
# plot to animate with lm showing moving loess
ggplot(fit, aes(E, NOx)) +
geom_point() +
geom_line(aes(y = .fitted), color = "red")
library(dplyr)
dat <- ethanol %>%
# note use of crossing over inflate
tidyr::crossing(center = unique(ethanol$E)) %>%
mutate(dist = abs(E - center)) %>%
filter(rank(dist) / n() <= .75) %>%
mutate(weight = (1 - (dist / max(dist)) ^ 3) ^ 3)
# animate plot -- awry
p <- ggplot(dat, aes(x=E, y=NOx)) +
geom_point(aes(alpha = weight)) +
geom_smooth(aes(group = center, weight = weight), method = "lm", se = FALSE) +
geom_vline(aes(xintercept = center), lty = 2) +
geom_line(aes(y = .fitted), data = fit, color = "red")
# why so many lm fits in middle range of E that are below loess line?
# something is wrong with the groups defined by center?
p
# make the animation
library(gganimate)
p + labs(title = 'E={frame_time}') + transition_time(center)

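Answering my own question: I was missing a group_by(center). Without it, rank(dist), n() and max(dist) are computed over the whole crossed data set instead of within each center, so the 75% span and the weights are wrong for most centers.
E.g.,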
dat <- ethanol %>%
tidyr::crossing(center = unique(ethanol$E)) %>%
group_by(center) %>%
mutate(dist = abs(E - center)) %>%
filter(rank(dist) / n() <= .75) %>%
mutate(weight = (1 - (dist / max(dist)) ^ 3) ^ 3)
p <- ggplot(dat, aes(x=E, y=NOx)) +
geom_point(aes(alpha = weight)) +
geom_smooth(aes(group = center, weight = weight), method = "lm", se = FALSE) +
geom_vline(aes(xintercept = center), lty = 2) +
geom_line(aes(y = .fitted), data = fit, color = "red")
library(gganimate)
p + transition_states(center)
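The effect of the missing group_by can be checked on toy data. The following is a minimal base-R sketch, not the code from the question: the integer vector x is a hypothetical stand-in for ethanol$E, expand.grid plays the role of crossing, and ave() plays the role of group_by(center). It counts how many points survive the 75% filter for each center, with and without grouping.

```r
# Toy data: 8 x-values, each paired with every candidate center (like crossing()).
x <- 1:8
dat <- expand.grid(x = x, center = x)
dat$dist <- abs(dat$x - dat$center)

# Ungrouped: rank and n are taken over all 64 rows at once.
keep_global <- rank(dat$dist) / nrow(dat) <= 0.75

# Grouped: rank and n are taken within each center (what group_by(center) does).
rank_in_group <- ave(dat$dist, dat$center, FUN = rank)
n_in_group <- ave(dat$dist, dat$center, FUN = length)
keep_grouped <- rank_in_group / n_in_group <= 0.75

# Number of points kept per center under each rule.
kept_global <- tapply(keep_global, dat$center, sum)
kept_grouped <- tapply(keep_grouped, dat$center, sum)
print(rbind(kept_global, kept_grouped))
```

Without grouping, one global distance cutoff is applied to every center, so centers near the edges keep fewer points than centers in the middle. With grouping, each center keeps roughly 75% of the points (ties in rank() can shift the count by one).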

Related

Add a “fake” discrete or continuous legend to a plot with values that are not in the data

I have data (percentage changes) for several months for different states of a country that I want to plot as a map (each month as a separate png) and animate it as a GIF with magick.
The percentage changes (discrete values), however, do not have the same maximum and minimum in each month. If I simply plotted each month separately, the darkest red would stand for a different maximum value in each month (for example +240% - 245% in January and +260% - 265% in February). To tackle this, I gathered all occurring percentage changes of all months in one vector and assigned colors to these discrete values (from light red for "0% - 5%" to dark red for "260% - 265%"), so that e.g. "240% - 245%" shows as the same red in January and in February.
The problem is: the legends that are plotted with each map differ since not every percentage change is present in each month and of course only values that exist in each subset for each month are shown in the legend.
Is it possible to (1) show the same legend for all maps (with all discrete values from "0-5%" to "260% - 265%" even though not all the values are plotted each month) or (2) can I simply add a "fake" continuous legend ranging from light red to dark red that ranges from 0% to 265%? (I found geom_blank() might be helpful for that, however, I have not managed to make it work.)
Here is a minimal reproducible example:
install.packages("sf")
install.packages("ggplot2")
install.packages("magick")
install.packages("tidyverse")
install.packages("maps")
library(sf)
library(ggplot2)
library(magick)
library(tidyverse)
library(maps)
states <- st_as_sf(map("state",
plot = FALSE,
fill = TRUE))
labels <- function(start, end) {
vec <- seq(start, end, 5)
paste0(vec,
"%",
" – ",
vec + 5,
"%")
}
lab_jan <- labels(0, (length(states$ID) - 1) * 5)
lab_feb <- labels(20, (length(states$ID) + 3) * 5)
colfun <- colorRampPalette(c("#EE7F74", "#86372E"))
col <- colfun(length(unique(c(lab_jan, lab_feb))))
lab_col <- tibble(label = unique(c(lab_jan, lab_feb)),
color = col)
states_jan <- bind_cols(states,
lab_jan = factor(lab_jan,
levels = lab_jan))
states_feb <- bind_cols(states,
lab_feb = factor(lab_feb,
levels = lab_feb))
jan_01 <- ggplot() +
  geom_sf(data = states_jan,
          aes(fill = lab_jan)) +
  theme_void() +
  scale_fill_manual(values = lab_col %>%
                      filter(label %in% states_jan$lab_jan) %>%
                      pull(color))
  #theme(legend.position = "none")
# save explicitly; adding ggsave() to a plot with + fails in recent ggplot2 versions
ggsave("01_jan.png", plot = jan_01, width = 10)
feb_02 <- ggplot() +
  geom_sf(data = states_feb,
          aes(fill = lab_feb)) +
  theme_void() +
  scale_fill_manual(values = lab_col %>%
                      filter(label %in% states_feb$lab_feb) %>%
                      pull(color))
  #theme(legend.position = "none")
# save explicitly; adding ggsave() to a plot with + fails in recent ggplot2 versions
ggsave("02_feb.png", plot = feb_02, width = 10)
list.files(pattern = '*.png', full.names = TRUE) %>%
image_read() %>%
image_join() %>%
image_animate(fps = 1) %>%
image_write("states.gif")
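How about this approach: give both factors the same full set of levels, map the level index to a continuous fill, and fix the scale limits and breaks so that both maps share one legend.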
lab_jan <- labels(0, (length(states$ID) - 1) * 5)
lab_feb <- labels(20, (length(states$ID) + 3) * 5)
lab_all <- union(lab_jan, lab_feb)
states_jan <- bind_cols(states, lab_jan = lab_jan)
states_feb <- bind_cols(states,lab_feb = lab_feb)
states_jan <- states_jan %>%
mutate(lab_jan = factor(lab_jan, levels=lab_all))
states_feb <- states_feb %>%
mutate(lab_feb = factor(lab_feb, levels=lab_all))
jan_01 <- ggplot() +
geom_sf(data = states_jan,
aes(fill = as.numeric(lab_jan))) +
theme_void() +
scale_fill_gradient(low = "#EE7F74", high="#86372E",
limits=c(1, 53),
breaks=c(1,10,20,30,40,50),
labels=lab_all[c(1,10,20,30,40,50)]) +
labs(fill="")
#theme(legend.position = "none") +
# ggsave("01_jan.png", width = 10)
feb_02 <- ggplot() +
geom_sf(data = states_feb,
aes(fill = as.numeric(lab_feb))) +
theme_void() +
scale_fill_gradient(low = "#EE7F74", high="#86372E",
limits=c(1, 53),
breaks=c(1,10,20,30,40,50),
labels=lab_all[c(1,10,20,30,40,50)]) +
labs(fill="")
gridExtra::grid.arrange(jan_01, feb_02, nrow=1)
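For option (1), a discrete legend that lists every label, one possible sketch is to give the factor the full set of levels and pass drop = FALSE to the scale. The example below uses a hypothetical two-bar data set instead of the map, and shortened labels; the names all_labels, pal and jan are assumptions for illustration. In recent ggplot2 versions the layer may also need show.legend = TRUE for the keys of unused levels to be drawn.

```r
library(ggplot2)

# Every label that can occur in any month (hypothetical, shortened list).
all_labels <- c("0-5%", "5-10%", "10-15%", "15-20%")

# Named palette: one fixed color per label, independent of the month.
pal <- setNames(colorRampPalette(c("#EE7F74", "#86372E"))(length(all_labels)),
                all_labels)

# One month's data uses only two of the four labels,
# but the factor still carries all four levels.
jan <- data.frame(
  state = c("a", "b"),
  label = factor(c("0-5%", "10-15%"), levels = all_labels)
)

p_jan <- ggplot(jan, aes(state, 1, fill = label)) +
  geom_col(show.legend = TRUE) +
  # drop = FALSE keeps unused factor levels in the scale and legend.
  scale_fill_manual(values = pal, drop = FALSE, name = NULL)
```

Because the palette is named, each label keeps its color even when some labels are missing from a given month.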

ggplot2 - a custom histogram with a rug plot

I am trying to create a custom histogram with a rug plot showing the original values on the X axis.
I am going to use the mtcars dataset to illustrate. It's not the best dataset for this question, but hopefully the reader will understand what I am trying to achieve.
Below shows the basic histogram, without any rug plot attempt.
I want to create the histogram using geom_bar as this allows for more flexibility with custom bins.
I also want a small gap between the histogram bars (i.e. width = 0.95), which adds to this problem's complexity.
library(dplyr)
library(ggplot2)
# create custom bins
vct_seq <- c(seq(from = 10, to = 25, by = 5), 34)
mtcars$bin <- cut(mtcars$mpg, breaks = vct_seq)
# create data.frame for the ggplot graph..using bins above
df_mtcars_count <- mtcars %>% group_by(bin) %>% summarise(count = n())
# indicative labels
vct_labels <- c("bin 1", "bin 2", "bin 3", "bin 4")
# attempt 1 - basic plot -- no rug plot
p <- ggplot(data = df_mtcars_count, aes(x = bin, y = count))
p <- p + geom_bar(stat = "identity", width = 0.95)
p <- p + geom_text(aes(label = count), vjust = -0.5)
p <- p + scale_x_discrete("x title to go here", labels = df_mtcars_count$bin, breaks = df_mtcars_count$bin)
p
Next, try and add a basic rug plot on the X axis. This obviously doesn't work as the geom_bar and geom_rug have completely different scales.
# attempt 2 with no scaling.... doesn't work as x scale for ordinal (bins) and
# x scale for continuous (mpg) do not match
p <- ggplot(data = df_mtcars_count, aes(x = bin, y = count))
p <- p + geom_bar(stat = "identity", width = 0.95)
p <- p + geom_text(aes(label = count), vjust = -0.5)
p <- p + scale_x_discrete("x title to go here", labels = df_mtcars_count$bin, breaks = df_mtcars_count$bin)
p <- p + geom_rug(data = mtcars, aes(x = mpg), inherit.aes = F, alpha = 0.3)
p
Now, try and rescale the mpg column to match with the ordinal scale....
First define a linear mapping function...
fn_linear_map <- function(vct_existing_val, vct_new_range) {
# example....converts 1:20 into the range 1 to 10 like this:
# fn_linear_map(1:20, c(1, 10))
fn_r_diff <- function(x) x %>% range() %>% diff()
flt_ratio <- fn_r_diff(vct_new_range) / fn_r_diff(vct_existing_val)
vct_old_min_offset <- vct_existing_val - min(vct_existing_val)
vct_new_range_val <- (vct_old_min_offset * flt_ratio) + min(vct_new_range)
return(vct_new_range_val)
}
Now apply the function...we try and map mpg to the range 1 to 4 (which is an attempt to match
the ordinal scale)
mtcars$mpg_remap <- fn_linear_map(mtcars$mpg, c(1, 4))
Try the plot again.... getting closer ... but not really accurate...
# attempt 3: getting closer but doesn't really match the ordinal scale
p <- ggplot(data = df_mtcars_count, aes(x = bin, y = count))
p <- p + geom_bar(stat = "identity", width = 0.95)
p <- p + geom_text(aes(label = count), vjust = -0.5)
p <- p + scale_x_discrete("x title to go here", labels = df_mtcars_count$bin, breaks = df_mtcars_count$bin)
p <- p + geom_rug(data = mtcars, aes(x = mpg_remap), inherit.aes = F, alpha = 0.3)
p
The graph above is getting close to what I want, but the rug plot does not line up with the actual data. For example, the maximum observation (33.9) should be displayed almost aligned with the right-hand side of the last bar; see below:
mtcars %>% filter(bin == "(25,34]") %>% arrange(mpg) %>% dplyr::select(mpg, mpg_remap)
Your scale makes no sense to me, as you are showing a bin that is twice as wide using the same bar width. Doing that in combination with a rug strikes me as confusing at best and misleading at worst. I suggest you plot the bars with their correct widths, after which the rug is trivial.
I think the best solution is to just use geom_histogram:
ggplot(mtcars, aes(mpg)) +
geom_histogram(breaks = vct_seq, col = 'grey80') +
geom_rug(aes(mpg, y = NULL))
If you really want the gaps between the bars you'll have to do more work:
library(tidyr)
d <- mtcars %>%
count(bin) %>%
separate(bin, c('min', 'max'), sep = ',', remove = FALSE) %>%
mutate_at(vars('min', 'max'), readr::parse_number) %>%
mutate(
middle = min + (max - min) / 2,
width = 0.9 * (max - min)
)
ggplot(d, aes(middle, n)) +
geom_col(width = d$width) +
geom_rug(aes(mpg, y = NULL), mtcars)
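If the breaks vector is still at hand, the bar centers and widths can also be derived from it directly rather than parsed back out of the bin labels. A minimal base-R sketch under that assumption (the names breaks and bars are illustrative; 0.9 is the same gap factor as above):

```r
# Same custom breaks as in the question.
breaks <- c(seq(from = 10, to = 25, by = 5), 34)
bin <- cut(mtcars$mpg, breaks = breaks)

# One row per bin: count, midpoint, and a width that leaves a 10% gap.
bars <- data.frame(
  bin    = levels(bin),
  n      = as.vector(table(bin)),
  middle = (head(breaks, -1) + tail(breaks, -1)) / 2,
  width  = 0.9 * diff(breaks)
)
print(bars)
```

bars can then take the place of d in the plot above, with geom_col(width = bars$width).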

R geom_ribbon after specific value

I am trying to find a way to colour the background after a specific value.
Here in this example, I want to colour the spaces after the value 5 (here shown with a vertical line).
#
library(lme4)
library(tidyverse)
data("sleepstudy")
#
sleepstudy = sleepstudy %>% mutate(days = ifelse(Days > 5, 1, 0))
#
m1 = sleepstudy %>% group_by(Days, days) %>% summarise(m = mean(Reaction))
m1
m1 %>% ggplot(aes(Days, m)) +
geom_point() +
geom_vline(xintercept = 6) +
theme_minimal()
I want to achieve something like this
However, when I use the following line, I get an error message.
m1 %>% ggplot(aes(Days, m)) +
geom_point() +
geom_vline(xintercept = 6) +
theme_minimal() +
geom_ribbon(data = m1, aes(x = c(6,9), ymin=0, ymax = 400), fill = 'khaki', alpha = 0.2)
Maybe the following does what the question asks for.
First of all, if error bars are to be plotted, the data preparation code must change.
There is no need to compute an extra variable, days, that tells whether Days is greater than 5.
A measure of spread must be computed; the code below uses the standard deviation, sd(Reaction).
This can all be done in a single pipe.
library(lme4)
library(tidyverse)
data("sleepstudy")
m1 <- sleepstudy %>%
group_by(Days) %>%
summarise(m = mean(Reaction),
s = sd(Reaction))
Now the plot.
I have changed the order of the geoms, to have the points, error bars and vertical line over the ribbon.
I have also increased the alpha level to 0.30.
There is no need to reset the x aesthetic; it is inherited from the ggplot() call.
Resetting it is what caused the error: x = c(6, 9) has length 2, but the data has 10 rows.
Error: Aesthetics must be either length 1 or the same as the data (10): x
m1 %>% ggplot(aes(Days, m)) +
theme_minimal() +
geom_ribbon(data = m1 %>% filter(Days > 5),
aes(ymin = 0, ymax = 400),
fill = 'khaki',
alpha = 0.30) +
geom_vline(xintercept = 6) +
geom_point() +
geom_errorbar(aes(ymin = m - s, ymax = m + s))
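If the shaded band does not need to follow the data, another possible approach is annotate("rect", ...), which takes fixed coordinates and so avoids the length mismatch altogether. The sketch below assumes a hypothetical summary table m1_toy in place of m1; Inf and -Inf stretch the rectangle to the panel edges.

```r
library(ggplot2)

# Hypothetical summary data standing in for m1 (one row per day).
m1_toy <- data.frame(Days = 0:9, m = seq(250, 340, by = 10))

p <- ggplot(m1_toy, aes(Days, m)) +
  # Shade from x = 6 to the right edge of the panel, over the full height.
  annotate("rect", xmin = 6, xmax = Inf, ymin = -Inf, ymax = Inf,
           fill = "khaki", alpha = 0.3) +
  geom_vline(xintercept = 6) +
  geom_point() +
  theme_minimal()
```

Unlike the ribbon, the rectangle does not change the y-axis limits, because infinite bounds are not used when the scales are trained.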

position_dodge when using separate datasets

I am attempting to produce a graph that shows two groups of error bars, but the different error bars represent different estimates of central tendency/variability (e.g., mean with sd and median with quantiles). I'm trying to use position_dodge, but it's not working, and I suspect this is because I'm feeding it values from a different dataset. Here's a reproducible example:
#### packages used below
library(dplyr)
library(ggplot2)
#### simulate dosages
dose = factor(rep(c("small", "medium", "large"), times=10))
dose = relevel(dose, "small")
#### simulate fevers, based on dosage (but highly skewed)
fever = rnorm(length(dose), 100, 1)
betas = matrix(c(0, -3, -6), nrow=1)
fever = fever + as.numeric(betas%*%t(model.matrix(fever~dose)))
#### put into data frame
d = data.frame(dose=dose, fever=fever)
#### compute means and standard errors
means = d %>% group_by(dose) %>% summarise(mean=mean(fever), lower=mean - sd(fever), upper = mean + sd(fever))
medians = d %>% group_by(dose) %>% summarise(median=median(fever), lower=quantile(fever, .25), upper = quantile(fever, .75))
#### put all into a ggplot
ggplot(d, aes(x=dose, y=fever)) +
geom_jitter(alpha=.2, width=.2) +
geom_point(data=means, aes(x=dose, y=mean)) +
geom_point(data=medians, aes(x=dose, y=median), col="red") +
geom_errorbar(data=means, aes(y=mean, ymin=lower, ymax=upper), width=.2, position=position_dodge(width=.2)) +
geom_errorbar(data= medians, aes(y=median, ymin=lower, ymax=upper), width=.2, position=position_dodge(width=.2), col="red")
Which gives the results of the following image:
Notice dodging isn't working.
Let's assume I can't just use stat_summary (I can't...I'm actually comparing means with some robust estimates from another package). Is there any way to offset the error bars/dots so they can be better seen?
Combine your dataframes for both statistics so you can map the kind of statistic on group:
# d is the data frame from the question
means <- d %>%
  group_by(dose) %>%
  summarise(Statistic = "Mean", Value = mean(fever), lower = mean(fever) - sd(fever), upper = mean(fever) + sd(fever))
medians <- d %>%
  group_by(dose) %>%
  summarise(Statistic = "Median", Value = median(fever), lower = quantile(fever, 0.25), upper = quantile(fever, 0.75))
df2 <- bind_rows(means, medians)
#### put all into a ggplot
ggplot(d, aes(x = dose, y = fever)) +
  geom_jitter(alpha = .2, width = .2) +
  geom_point(data = df2, aes(x = dose, y = Value, color = Statistic)) +
  geom_errorbar(data = df2, aes(y = Value, ymin = lower, ymax = upper,
                                group = Statistic, color = Statistic),
                width = .2, position = position_dodge(width = .2))
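position_dodge only separates groups that share a layer, so two layers with one group per dose each have nothing to dodge against. If the two summaries have to stay in separate data frames, a possible alternative is to shift each layer by hand with position_nudge. A minimal sketch with made-up numbers (the data frames and the column name est are hypothetical):

```r
library(ggplot2)

# Hypothetical summaries, one data frame per statistic (values are made up).
means <- data.frame(dose = c("small", "medium", "large"),
                    est = c(100, 97, 94),
                    lower = c(99, 96, 93),
                    upper = c(101, 98, 95))
medians <- data.frame(dose = c("small", "medium", "large"),
                      est = c(100.2, 96.8, 94.1),
                      lower = c(99.5, 96.1, 93.4),
                      upper = c(100.9, 97.5, 94.8))

p <- ggplot() +
  # Shift the mean intervals slightly left ...
  geom_pointrange(data = means,
                  aes(x = dose, y = est, ymin = lower, ymax = upper),
                  position = position_nudge(x = -0.1)) +
  # ... and the median intervals slightly right.
  geom_pointrange(data = medians,
                  aes(x = dose, y = est, ymin = lower, ymax = upper),
                  colour = "red",
                  position = position_nudge(x = 0.1))
```

The nudge amounts are in x-axis units, where neighboring discrete categories are one unit apart.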

Does geom_abline(...) plot data multiple times?

Does using geom_abline(...) (as well as geom_vline and geom_hline) result in multiple overplotting of the same line when used "naively"?
For example, say we are interested in the following faceted scatterplot:
library(ggplot2)
library(dplyr)
k <- 4
data.frame(id = letters[1:k],
m = rnorm(k),
b = rnorm(k))[rep(1:k, 30),] %>%
mutate(x = rnorm(n()),
eps = 0.1*rnorm(n()),
y = m*x + b + eps) %>%
ggplot(aes(x, y)) +
geom_point() +
facet_wrap(~ id) ->
p
The easiest way to add an ab-line to it is as follows:
print(p + geom_abline(aes(slope = m, intercept = b), color = 'red'))
Is that the "right way" to do it? In particular, doesn't the above plot the ab-line 30 times in each facet? For example this makes it seem like it does:
print(p + geom_abline(aes(slope = m, intercept = b + eps), color = 'red'))
If so, is it better to do something like this?
# p$data is the data frame that was piped into ggplot() above
print(p + geom_abline(aes(slope = m, intercept = b),
                      data = p$data %>% group_by(id) %>% summarize(m = unique(m), b = unique(b)),
                      color = 'red'))
Note that this produces something visually indistinguishable from the first plot. My question is about the right way to use these ggplot functions.
geom_abline takes care already of multiple lines at the same spot by making the values unique. You can also verify that by running e.g. p + geom_abline(aes(slope = m, intercept = b), color = 'red', alpha = .1) - if it were 30 lines at the same spot, they would be opaque.
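The deduplication described here can be mimicked on the data itself. The base-R sketch below is an illustration under assumptions, not ggplot2's internal code: it rebuilds a data frame like the one in the question (the names params and Z are hypothetical) and counts the distinct lines with and without the per-row noise.

```r
set.seed(1)
k <- 4

# One slope/intercept pair per facet, repeated 30 times as in the question.
params <- data.frame(id = letters[1:k], m = rnorm(k), b = rnorm(k))
Z <- params[rep(1:k, 30), ]
Z$eps <- 0.1 * rnorm(nrow(Z))

# Identical (m, b) rows collapse to one line per facet.
lines_same <- unique(Z[c("id", "m", "b")])

# With b + eps every row is a different line, so nothing collapses.
Z$b_eps <- Z$b + Z$eps
lines_noisy <- unique(Z[c("id", "m", "b_eps")])

print(c(same = nrow(lines_same), noisy = nrow(lines_noisy)))
```

This matches what the two plots in the question show: one line per facet for intercept = b, and 30 distinct lines per facet for intercept = b + eps.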
