I've got a dataset similar to this:
x <- 100 - abs(rnorm(1e6, 0, 5))
y <- 50 + rnorm(1e6, 0, 3)
dist <- sqrt((x - 100)^2 + (y - 50)^2)
z <- exp(-(dist / 8)^2)
which can be visualised as follows:
data.frame(x, y, z) %>%
ggplot() + geom_point(aes(x, y, color = z))
What I would like to do is a stacked half-circle plot with averaged value of z in subsequent layers. I think it can be done with the combination of geom_col and coord_polar(), although the farthest I can get is
data.frame(x, y, z, dist) %>%
mutate(dist_fct = cut(dist, seq(0, max(dist), by = 5))) %>%
ggplot() + geom_bar(aes(x = 1, y = 1, fill = dist_fct), stat = 'identity', position = 'fill') +
coord_polar()
which is obviously far from the expectation (layers should be of equal size, plot should be clipped on the right half).
The problem is that I can't really use coord_polar() because I need annotation_custom() later on. So my questions are:
Can a plot like this be done without coord_polar()?
If not, how can it be done with coord_polar()?
The result should be similar to the graphic below, except that instead of plotting layers built from individual points, I would like to plot each layer as a whole, coloured by the average value of z inside that layer.
If you want simple radius bands, perhaps something like this would work as you pictured it in your question:
# your original sample data
x <- 100 - abs(rnorm(1e6, 0, 5))
y <- 50 + rnorm(1e6, 0, 3)
dist <- sqrt((x - 100)^2 + (y - 50)^2)
nbr_bands <- 6 # set nbr of bands to plot
# calculate width of bands
band_width <- max(dist)/(nbr_bands-1)
# dist %/% band_width yields an integer from 0 to nbr_bands - 1
# as.factor makes it categorical, which is what you want for the plot
band = as.factor(dist %/% (band_width))
library(dplyr)
library(ggplot2)
data.frame(x, y, band) %>%
ggplot() + geom_point(aes(x, y, color = band)) + coord_fixed() +
theme_dark() # dark theme
Edit to elaborate:
As you first attempted, it would be nice to use the very handy cut() function to calculate the radius color categories.
One way to get categorical (discrete) colors, rather than continuous shading, for your plot color groups is to set your aes color= to a factor column.
cut() already returns a factor; to get an ordered factor directly, you may use the option ordered_result=TRUE:
band <- cut(dist, nbr_bands, ordered_result=TRUE, labels=1:nbr_bands) # also use `labels=` to specify your own labels
data.frame(x, y, band) %>%
ggplot() + geom_point(aes(x, y, color = band)) + coord_fixed()
Or, more simply, you may call cut() with labels=FALSE to get integer band codes and convert them to a factor using as.factor():
band <- as.factor( cut(dist, nbr_bands, labels=FALSE) )
data.frame(x, y, band) %>%
ggplot() + geom_point(aes(x, y, color = band)) + coord_fixed()
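For what it's worth, here is a small base-R sketch of what the three cut() variants above return. The toy vector d and the band count are made up for illustration; they are not the question's data.

```r
# Toy distances (illustrative values only)
d <- c(0.5, 3.2, 7.8, 12.1, 19.9)
n_bands <- 4

# cut() returns a factor with interval labels by default
b1 <- cut(d, n_bands)
# ordered_result = TRUE makes it an ordered factor; labels= sets custom labels
b2 <- cut(d, n_bands, ordered_result = TRUE, labels = 1:n_bands)
# labels = FALSE returns plain integer codes, hence the as.factor() wrapper
b3 <- as.factor(cut(d, n_bands, labels = FALSE))

class(b1)
class(b2)
class(b3)
```

All three give a discrete colour scale when mapped to color; they differ only in labels and in whether the levels are ordered.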
Sounds like you may find the circle & arc plotting functions from the ggforce package useful:
# packages used below
library(dplyr)
library(ggplot2)
library(ggforce)
# data
set.seed(1234)
df <- data.frame(x = 100 - abs(rnorm(1e6, 0, 5)),
y = 50 + rnorm(1e6, 0, 3)) %>%
mutate(dist = sqrt((x - 100)^2 + (y - 50)^2)) %>%
mutate(z = exp(-(dist / 8)^2))
# define cut-off values
cutoff.values <- seq(0, ceiling(max(df$dist)), by = 5)
df %>%
# calculate the mean z for each distance band
mutate(dist_fct = cut(dist, cutoff.values)) %>%
group_by(dist_fct) %>%
summarise(z = mean(z)) %>%
ungroup() %>%
# add the cutoff values to the dataframe for inner & outer radius
arrange(dist_fct) %>%
mutate(r0 = cutoff.values[-length(cutoff.values)],
r = cutoff.values[-1]) %>%
# add coordinates for circle centre
mutate(x = 100, y = 50) %>%
# plot
ggplot(aes(x0 = x, y0 = y,
r0 = r0, r = r,
fill = z)) +
geom_arc_bar(aes(start = 0, end = 2 * pi),
color = NA) + # hide outline
# force equal aspect ratio in order to get true circle
coord_equal(xlim = c(70, 100), expand = FALSE)
Plot generation took <1s on my machine. Yours may differ.
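In case it helps to see the "mean z per distance band" step on its own, here is a base-R sketch. It is not the code above: it uses a smaller sample, and it rounds the last break up to a multiple of the band width (my assumption), because a sequence built as seq(0, ceiling(max(dist)), by = 5) can stop short of the largest distance and leave those points as NA in cut().

```r
set.seed(1)
n <- 1e4                                   # smaller than the question's 1e6
x <- 100 - abs(rnorm(n, 0, 5))
y <- 50 + rnorm(n, 0, 3)
dist <- sqrt((x - 100)^2 + (y - 50)^2)
z <- exp(-(dist / 8)^2)

# Round the upper limit up to a multiple of the band width so the last
# break is >= max(dist) and no point falls outside the bands
width  <- 5
breaks <- seq(0, width * ceiling(max(dist) / width), by = width)
band   <- cut(dist, breaks, include.lowest = TRUE)

# One row per band: inner radius, outer radius, mean z
rings <- data.frame(
  r0 = head(breaks, -1),
  r  = breaks[-1],
  z  = as.vector(tapply(z, band, mean))    # NA for bands with no points
)
rings
```

A table like rings has the r0, r and fill columns that geom_arc_bar needs.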
I'm not sure this satisfies everything, but it should be a start. To cut down on the time for plotting, I'm summarizing the data into a grid, which lets you use geom_raster. I don't entirely understand the breaks and everything you're using, so you might want to tweak some of how I divided the data for making the distinct bands. I tried out a couple ways with cut_interval and cut_width--this would be a good place to plug in different options, such as the number or width of bands.
Since you mentioned getting the average z for each band, I'm grouping by the gridded x and y and the cut dist, then using mean of z for setting bands. I threw in a step to make labels like in the example--you probably want to reverse them or adjust their positioning--but that comes from getting the number of each band's factor level.
library(tidyverse)
set.seed(555)
n <- 1e6
df <- tibble(
x = 100 - abs(rnorm(n, 0, 5)),
y = 50 + rnorm(n, 0, 3),
dist = sqrt((x - 100)^2 + (y - 50)^2),
z = exp(-(dist / 8)^2)
) %>%
mutate(brk = cut(dist, seq(0, max(dist), by = 5), include.lowest = TRUE))
summarized <- df %>%
filter(!is.na(brk)) %>%
mutate(x_grid = floor(x), y_grid = floor(y)) %>%
group_by(x_grid, y_grid, brk) %>%
summarise(avg_z = mean(z)) %>%
ungroup() %>%
# mutate(z_brk = cut_width(avg_z, width = 0.15)) %>%
mutate(z_brk = cut_interval(avg_z, n = 9)) %>%
mutate(brk_num = as.numeric(z_brk))
head(summarized)
#> # A tibble: 6 x 6
#> x_grid y_grid brk avg_z z_brk brk_num
#> <dbl> <dbl> <fct> <dbl> <fct> <dbl>
#> 1 75 46 (20,25] 0.0000697 [6.97e-05,0.11] 1
#> 2 75 47 (20,25] 0.000101 [6.97e-05,0.11] 1
#> 3 75 49 (20,25] 0.0000926 [6.97e-05,0.11] 1
#> 4 75 50 (20,25] 0.0000858 [6.97e-05,0.11] 1
#> 5 75 52 (20,25] 0.0000800 [6.97e-05,0.11] 1
#> 6 76 51 (20,25] 0.000209 [6.97e-05,0.11] 1
To make the labels, summarize that data to have a single row per band--I did this by taking the minimum of the gridded x, then using the average of y so they'll show up in the middle of the plot.
labels <- summarized %>%
group_by(brk_num) %>%
summarise(min_x = min(x_grid)) %>%
ungroup() %>%
mutate(y_grid = mean(summarized$y_grid))
head(labels)
#> # A tibble: 6 x 3
#> brk_num min_x y_grid
#> <dbl> <dbl> <dbl>
#> 1 1 75 49.7
#> 2 2 88 49.7
#> 3 3 90 49.7
#> 4 4 92 49.7
#> 5 5 93 49.7
#> 6 6 94 49.7
geom_raster is great for these situations where you have data in an evenly spaced grid that just needs uniform tiles at each position. At this point, the summarized data has 595 rows, instead of the original 1 million, so the time to plot shouldn't be an issue.
ggplot(summarized) +
geom_raster(aes(x = x_grid, y = y_grid, fill = z_brk)) +
geom_label(aes(x = min_x, y = y_grid, label = brk_num), data = labels, size = 3, hjust = 0.5) +
theme_void() +
theme(legend.position = "none", panel.background = element_rect(fill = "gray40")) +
coord_fixed() +
scale_fill_brewer(palette = "PuBu")
Created on 2018-11-04 by the reprex package (v0.2.1)
I'd like to create a graph like the one below. It's kind of a combination of using geom_area and geom_point.
Let's say my data looks like this:
library(gcookbook) # provides the uspopage data
library(ggplot2)
ggplot(uspopage, aes(x = Year, y = Thousands, fill = AgeGroup)) +
geom_area()
I obtain the following graph
Then, I'd like to add the exact number of points as the total for each category, which would be:
library(dplyr)
uspopage |>
group_by(AgeGroup) |>
summarize(total = sum(Thousands))
# A tibble: 8 × 2
AgeGroup total
<fct> <int>
1 <5 1534529
2 5-14 2993842
3 15-24 2836739
4 25-34 2635986
5 35-44 2331680
6 45-54 1883088
7 55-64 1417496
8 >64 1588163
Following some twitter comments my workaround is as follows:
1 - create the original plot with ggplot2
2 - grab the areas of the plot as a data.frame (ggplot_build)
3 - create polygons of the points given in 2, and make it a sensible sf object (downscale to a flatter earth)
4 - generate N random points inside each polygon (st_sample)
5 - grab these points and upscale back to the original scale
6 - ggplot2 once again, now with geom_point
7 - enjoy the wonders of ggplot2
library(gcookbook)
library(tidyverse)
library(sf)
set.seed(42)
# original data
d <- uspopage
# number of points for each group (I divide it by 1000)
d1 <- d |>
group_by(AgeGroup) |>
summarize(n_points = round(sum(Thousands) / 1e3)) |>
mutate(group = 1:n())
# original plot
g <- ggplot(data = d,
aes(x = Year,
y = Thousands,
fill = AgeGroup)) +
geom_area()
# get the geom data from ggplot
f <- ggplot_build(g)$data[[1]]
# Polygons are built point by point, in order. So, for each group, append the
# data frame to itself: the first part follows the ymin line, the second part
# follows the ymax line in reverse, giving one continuous outline around each area.
# list of groups
l_groups <- unique(f$group)
# function to invert and add back the data.frame
f_invert <- function(groups) {
k <- f[f$group == groups,]
k$y <- k$ymin
k1 <- k[nrow(k):1,]
k1$y <- k1$ymax
k2 <- rbind(k, k1)
return(k2)
}
# create a new data frame of the points in order
f1 <- do.call("rbind", lapply(l_groups, f_invert))
# for further use at the end of the script (to upscale back to the original ranges)
max_x <- max(f1$x)
max_y <- max(f1$y)
min_x <- min(f1$x)
min_y <- min(f1$y)
# normalizing: limiting sizes to a fairly small area on the globe (flat earth wannabe / 1 x 1 degrees)
f1$x <- scales::rescale(f1$x)
f1$y <- scales::rescale(f1$y)
# create polygons
polygons <- f1 |>
group_by(group) |>
sf::st_as_sf(coords = c("x", "y"), crs = 4326) |>
summarise(geometry = sf::st_combine(geometry)) |>
sf::st_cast("POLYGON")
# cast N number of points randomly inside each geometry (N is calculated beforehand in d1)
points <- polygons %>%
st_sample(size = d1$n_points,
type = 'random',
exact = TRUE) %>%
# Give the points an ID
sf::st_sf('ID' = seq(length(.)), 'geometry' = .) %>%
# Get underlying polygon attributes (group is the relevant attribute that we want to keep)
sf::st_intersection(., polygons)
# rescale back to the original ranges
points <- points |>
mutate(x = unlist(map(geometry,1)),
y = unlist(map(geometry,2))) |>
mutate(x = (x * (max_x - min_x) + min_x),
y = (y * (max_y - min_y) + min_y))
# bring back the legends
points <- left_join(points, d1, by = c("group"))
# final plot
g1 <- ggplot() +
geom_point(data = points,
aes(x = x,
y = y,
color = AgeGroup),
size = 0.5) +
labs(x = element_blank(),
y = element_blank()) +
theme_bw()
g1
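The downscale/upscale steps in the script boil down to a min-max rescale and its inverse. A minimal base-R sketch, with made-up year values and helper names (rescale01, unscale) that are mine, not part of any package:

```r
# Min-max rescale to [0, 1], like scales::rescale() with default arguments
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))
# Inverse: map values in [0, 1] back onto the original range
unscale <- function(u, lo, hi) u * (hi - lo) + lo

years <- c(1900, 1950, 2002)            # illustrative x values
u <- rescale01(years)
back <- unscale(u, min(years), max(years))
all.equal(back, years)
```

The minimum and maximum have to be stored before rescaling, which is what max_x, min_x, max_y and min_y are for in the script.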
Here's a version without any smoothing, just adding noise to where the dots would go naturally. One nice thing here is we can specify how many people are represented per dot.
dots_per_thou <- 1
uspopage %>%
uncount(round(dots_per_thou * Thousands / 1000)) %>%
group_by(Year) %>%
mutate(x_noise = runif(n(), 0, 1) - 0.5,
x_pos = Year + x_noise,
y_noise = runif(n(), 0, 1000*dots_per_thou),
y_pos = cumsum(row_number() + y_noise)) %>%
ungroup() %>%
ggplot(aes(x_pos, y_pos, color = AgeGroup)) +
geom_point(size = 0.1) +
ggthemes::scale_color_tableau()
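If the uncount() step is unfamiliar: it repeats each row as many times as a count column says. A base-R sketch of the same idea on a toy table (the numbers and the people_per_dot name are hypothetical):

```r
# Toy table: one row per (year, group) with a population in thousands
pop <- data.frame(year = c(1900, 1900, 1901),
                  group = c("a", "b", "a"),
                  thousands = c(3100, 1900, 2600))

people_per_dot <- 1e6                       # hypothetical: one dot per million
n_dots <- round(pop$thousands * 1000 / people_per_dot)

# Repeat each row n_dots times (what tidyr::uncount() does)
dots <- pop[rep(seq_len(nrow(pop)), n_dots), c("year", "group")]
table(dots$year, dots$group)
```

Changing people_per_dot is the knob for how many people one dot stands for.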
You could come close-ish to that look with the ggbeeswarm package. It includes a few positions which "offset points within a category based on their density using quasirandom noise" (this is the description in the vipor package which underlies those positions).
The approach is just a hack and certainly not exactly satisfying. The number of dots might not be accurate and are more like "guessed", and they are too regular with position_beeswarm - I couldn't yet get it to run with the probably more appropriate position_quasirandom.
Also, it is computationally very intense and it made my reprex crash, thus simply copied from my script.
library(gcookbook)
library(ggplot2)
library(dplyr)
## ggbeeswarm needs to be in the development version
# devtools::install_github("eclarke/ggbeeswarm")
library(ggbeeswarm)
uncount_df <- uspopage %>%
group_by(Year) %>%
## inflate every group artificially so that it adds up on top of the previous group,
## and make the numbers much, much smaller so the computations don't go crazy
mutate(cumul_sum = as.integer(cumsum(Thousands)/ 10^3)) %>%
## uncount
tidyr::uncount(cumul_sum)
## I am creating a list of layers
ls_layers <- lapply(split(uncount_df, uncount_df$AgeGroup), function(dat){
## I switched x and y aesthetic so to avoid coord_flip
## side is an argument in the dev version
## the size is a bit of a trial and error
geom_beeswarm(data = dat, aes( x = Year, y = "x", color = AgeGroup),
side = 1L,
size = .4)
})
## reversing the order, a trick to plot from small to large numbers
ls_layers <- ls_layers[length(ls_layers):1]
ggplot() +
## you can now simply add the list of layers to your ggplot object
ls_layers
I can plot geom_freqpoly without problems using the number of observations:
ggplot(data=demo) +
geom_freqpoly(mapping=aes(x = value))
But I'd like to use the precalculated observation count contained in the data.
I tried using stat = "identity", but it doesn't work:
ggplot(data=demo) +
geom_freqpoly(mapping=aes(x = value, y = cnt), stat = "identity")
This is my sample data
demo <- tribble(
~value, ~cnt,
.25, 20,
.25, 30,
.1, 40
)
TL;DR: You didn't get the graph you want, because the data of pre-calculated counts you passed to ggplot was NOTHING like what was used to produce the freqpoly graph.
Since you didn't include code for the original demo used to generate graph 1, I'll venture a guess:
demo.orig <- data.frame(value = c(0.25, 0.25, 0.1))
p <- ggplot(demo.orig, aes(x = value)) +
geom_freqpoly()
p # show plot to verify its appearance, which matches the graph in the question
layer_data(p) # look at the calculated data used by geom_freqpoly
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
y count x xmin xmax width density ncount ndensity PANEL group colour size linetype alpha
1 0 0 0.09310345 0.09051724 0.09568966 0.005172414 0.00000 0.0 0.0 1 -1 black 0.5 1 NA
2 1 1 0.09827586 0.09568966 0.10086207 0.005172414 64.44444 0.5 0.5 1 -1 black 0.5 1 NA
3 0 0 0.10344828 0.10086207 0.10603448 0.005172414 0.00000 0.0 0.0 1 -1 black 0.5 1 NA
... (omitted to conserve space)
30 0 0 0.24310345 0.24051724 0.24568966 0.005172414 0.00000 0.0 0.0 1 -1 black 0.5 1 NA
31 2 2 0.24827586 0.24568966 0.25086207 0.005172414 128.88889 1.0 1.0 1 -1 black 0.5 1 NA
32 0 0 0.25344828 0.25086207 0.25603448 0.005172414 0.00000 0.0 0.0 1 -1 black 0.5 1 NA
From a small dataframe with only two unique values, stat_bin generated a much larger dataframe with the x-axis split into 30 bins (the default number), and count / y = 0 everywhere except for the two bins containing the original values.
> geom_freqpoly
function (mapping = NULL, data = NULL, stat = "bin", position = "identity",
..., na.rm = FALSE, show.legend = NA, inherit.aes = TRUE)
{
params <- list(na.rm = na.rm, ...)
if (identical(stat, "bin")) {
params$pad <- TRUE
}
layer(data = data, mapping = mapping, stat = stat, geom = GeomPath,
position = position, show.legend = show.legend, inherit.aes = inherit.aes,
params = params)
}
A quick check by printing geom_freqpoly to console shows that its underlying geom is simply GeomPath, which plots x/y pairs in sequential order.
In other words, if you want to get the peaks from graph 1, you need to provide a similar dataset, with rows indicating where y should drop to 0. While it's certainly possible to calculate this by digging into the code for StatBin$compute_group, I'd think it's simpler to expand from the data of pre-calculated counts and let ggplot do its normal job:
demo %>%
tidyr::uncount(cnt) %>%
ggplot(aes(x = value)) +
geom_freqpoly() +
theme_minimal()
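To make the "rows where y drops to 0" point concrete, here is a rough base-R sketch that bins the pre-calculated counts by hand. The breaks are plain equal-width bins over the data range, which is an assumption for illustration; stat_bin places its breaks slightly differently and also pads the ends.

```r
demo <- data.frame(value = c(0.25, 0.25, 0.1), cnt = c(20, 30, 40))

# 30 equal-width bins across the data range (illustrative breaks only)
breaks <- seq(min(demo$value), max(demo$value), length.out = 31)
bin    <- cut(demo$value, breaks, include.lowest = TRUE)

# Sum the pre-computed counts per bin; empty bins come back as NA, set to 0
counts <- tapply(demo$cnt, bin, sum)
counts[is.na(counts)] <- 0

# One x/y pair per bin, which is the shape geom_path needs
mids <- (head(breaks, -1) + breaks[-1]) / 2
poly <- data.frame(x = mids, y = as.vector(counts))
poly[poly$y > 0, ]
```

Only two of the 30 rows are non-zero; the 28 zero rows are what pull the line back down between the peaks.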
Edit: solution without fully expanding dataframe of aggregated counts
Sample dataset with 2 groups:
demo <- data.frame(value = c(0.25, 0.5, 0.1, 0.25, 0.75, 0.1),
cnt = c(5, 2, 4, 3, 8, 7) * 10e8,
group = rep(c("a", "b"), each = 3))
Code:
library(ggplot2)
library(dplyr)
demo %>%
rename(x = value, y = cnt) %>% # rename here so approach below can be easily applied
# to other datasets with different column names
tidyr::nest(data = c(x, y)) %>% # nest to apply same approach for each group
mutate(data = purrr::map(
data,
function(d) ggplot2:::bin_vector( # cut x's range into appropriate bins
x = d$x,
bins = ggplot2:::bin_breaks_bins(
x_range = range(d$x),
bins = 30), # default bin count is 30; change if desired
pad = TRUE) %>%
select(x, xmin, xmax) %>%
# place y counts into the corresponding x bins (this is probably similar
# to interval join, but I don't have that package installed on my machine)
tidyr::crossing(d %>% rename(x2 = x)) %>%
mutate(y = ifelse(x2 >= xmin & x2 < xmax, y, 0)) %>%
select(-x2) %>%
group_by(x) %>%
filter(y == max(y)) %>%
ungroup() %>%
unique())) %>%
tidyr::unnest(cols = c(data)) %>% # unnest to get one flat dataframe back
ggplot(aes(x = x, y = y, colour = group)) + # plot as per normal
geom_path() +
theme_bw()
# package versions used: dplyr 1.0.0, ggplot2 3.3.1, tidyr 1.1.0, purrr 0.3.4
Based on the similar problem for histograms the solution seems to be as simple as to use the weight parameter in the aesthetics.
The solution using the sample data from the other answer would be
demo <- data.frame(value = c(0.25, 0.5, 0.1, 0.25, 0.75, 0.1),
cnt = c(5, 2, 4, 3, 8, 7) * 10e8,
group = rep(c("a", "b"), each = 3))
ggplot(demo, aes(value, weight = cnt, color = group)) + geom_freqpoly()
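As a sanity check on why weighting works, a weighted tally gives the same totals as expanding the data to one row per observation first. A base-R sketch with small made-up counts (not the 10e8-scale ones above, which would be too large to expand):

```r
demo <- data.frame(value = c(0.25, 0.5, 0.1, 0.25),
                   cnt   = c(5, 2, 4, 3))

# Weighted tally: add up cnt for each distinct value
weighted <- tapply(demo$cnt, demo$value, sum)

# Same tally after expanding to one row per observation
expanded <- table(rep(demo$value, demo$cnt))

weighted
all(weighted == as.vector(expanded))
```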
I would like to animate the same network using different layouts and having a smooth transition between layouts. I'd like to do this inside the gganimate framework.
library(igraph)
library(ggraph)
library(gganimate)
set.seed(1)
g <- erdos.renyi.game(10, .5, "gnp")
V(g)$name <- letters[1:vcount(g)]
l1 <- create_layout(g, "kk")
l2 <- create_layout(g, "circle")
l3 <- create_layout(g, "nicely")
long <- rbind(l1,l2,l3)
long$frame <- rep(1:3, each =10)
Following the ggplot approach, I store the node positions in the long format (long) and add a frame variable to each layout.
I tried to make it work with the following code, which is working fine and almost what I want. However, I cannot seem to find a way to include the edges:
ggplot(long, aes(x, y, label = name, color = as.factor(name), frame = frame)) +
geom_point(size = 3, show.legend = F) +
geom_text(show.legend = F) +
transition_components(frame)
I also tried to add the edges as geom_segment but ended up with them being static while the nodes kept moving. This is why I use the ggraph package and fail:
ggraph(g, layout = "manual", node.position = long) +
geom_node_point() +
geom_edge_link() +
transition_components(frame)
I'd like to have an animation of one network with changing node positions that both displays nodes and edges.
Any help is much appreciated!
Edit: I learned that one can include the layout directly into ggraph (and even manipulate the attributes). This is what I've done in the following gif. Additionally, geom_edge_link0 is used instead of geom_edge_link.
ggraph(long) +
geom_edge_link0() +
geom_node_point() +
transition_states(frame)
Note that the edges are not moving.
I'm not sure this is currently ready in gganimate as is. As of May 2019, here's what looks to be a related issue: https://github.com/thomasp85/gganimate/issues/139
EDIT I've replaced with a working solution. Fair warning, I'm a newbie with network manipulations, and I expect someone with more experience could refactor the code to be much shorter.
My general approach was to create the layouts, put the nodes into a table long2, and then create another table with all the edges. gganimate then calls the respective data source each layer needs.
1. Create the nodes table for the three layouts:
set.seed(1)
g <- erdos.renyi.game(10, .5, "gnp")
V(g)$name <- letters[1:vcount(g)]
layouts <- c("kk", "circle", "nicely")
long2 <- lapply(layouts, create_layout, graph = g) %>%
enframe(name = "frame") %>%
unnest()
> head(long2)
# A tibble: 6 x 7
frame x y name ggraph.orig_index circular ggraph.index
<int> <dbl> <dbl> <fct> <int> <lgl> <int>
1 1 -1.07 0.363 a 1 FALSE 1
2 1 1.06 0.160 b 2 FALSE 2
3 1 -1.69 -0.310 c 3 FALSE 3
4 1 -0.481 0.135 d 4 FALSE 4
5 1 -0.0603 -0.496 e 5 FALSE 5
6 1 0.0373 1.02 f 6 FALSE 6
2. Convert the edges from the original layout into a table.
Here, I extract the edges from g and reshape them into a format that geom_segment can use, with columns for x, y, xend, and yend. This is ripe for refactoring, but it works.
edges_df <- igraph::as_data_frame(g, "edges") %>%
tibble::rowid_to_column() %>%
gather(end, name, -rowid) %>%
# Here we get the coordinates for each node from `long2`.
left_join(long2 %>% select(frame, name, x, y)) %>%
gather(coord, val, x:y) %>%
# create xend and yend when at the "to" end, for geom_segment use later
mutate(col = paste0(coord, if_else(end == "to", "end", ""))) %>%
select(frame, rowid, col, val) %>%
arrange(frame, rowid) %>%
spread(col, val) %>%
# Get the node names for the coordinates we're using, so that we
# can name the edge from a to b as "a_b" and gganimate can tween
# correctly between frames.
left_join(long2 %>% select(frame, x, y, start_name = name)) %>%
left_join(long2 %>% select(frame, xend = x, yend = y, end_name = name)) %>%
unite(edge_name, c("start_name", "end_name"))
> head(edges_df)
frame rowid x xend y yend edge_name
1 1 1 -1.0709480 -1.69252646 0.3630563 -0.3095612 a_c
2 1 2 -1.0709480 -0.48086213 0.3630563 0.1353664 a_d
3 1 3 -1.6925265 -0.48086213 -0.3095612 0.1353664 c_d
4 1 4 -1.0709480 -0.06032354 0.3630563 -0.4957609 a_e
5 1 5 1.0571895 -0.06032354 0.1596417 -0.4957609 b_e
6 1 6 -0.4808621 -0.06032354 0.1353664 -0.4957609 d_e
3. Plot!
ggplot() +
geom_segment(data = edges_df,
aes(x = x, xend = xend, y = y, yend = yend, color = edge_name)) +
geom_point(data = long2, aes(x, y, color = name), size = 4) +
geom_text(data = long2, aes(x, y, label = name)) +
guides(color = "none") +
ease_aes("quadratic-in-out") +
transition_states(frame, state_length = 0.5) -> a
animate(a, nframes = 400, fps = 30, width = 700, height = 300)
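Since step 2 is the fiddly part, here is a shorter base-R sketch of the same idea: look up each edge's end points in the node table, once per frame. The node positions and edge list are hypothetical stand-ins for long2 and igraph::as_data_frame(g, "edges").

```r
# Hypothetical node positions for two layouts (frames)
nodes <- data.frame(
  frame = rep(1:2, each = 3),
  name  = rep(c("a", "b", "c"), times = 2),
  x     = c(0, 1, 0,  1, 0, 1),
  y     = c(0, 0, 1,  1, 1, 0)
)
# Edge list by node name
edges <- data.frame(from = c("a", "a"), to = c("b", "c"))

# Build one segment table per frame, then stack them
edge_rows <- lapply(split(nodes, nodes$frame), function(nd) {
  i <- match(edges$from, nd$name)   # row of each edge's start node
  j <- match(edges$to,   nd$name)   # row of each edge's end node
  data.frame(frame = nd$frame[1],
             edge_name = paste(edges$from, edges$to, sep = "_"),
             x = nd$x[i], y = nd$y[i],
             xend = nd$x[j], yend = nd$y[j])
})
edges_df <- do.call(rbind, edge_rows)
edges_df
```

The edge_name column plays the same role as in the answer: it gives each segment a stable identity across frames.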
# Create the Data Frame
library(tidyverse)
library(ggQC)
set.seed(5555)
Golden_Egg_df <- data.frame(month = 1:12,
egg_diameter = rnorm(n=12, mean=1.5, sd=0.2)) %>%
mutate(grp = c(rep("A", 3), rep("B", 9)))
Golden_Egg_df$egg_diameter[3] <- 5
# Determine the control limit values (red lines)
p <- ggplot(Golden_Egg_df, aes(x = month, y = egg_diameter)) +
geom_point() +
geom_line() +
stat_QC(method = "XmR")
pb <- ggplot_build(p)
thres <- range(pb$data[[3]]$yintercept)
# Circle anything outside the control limits (red lines)
p + geom_point(
data = subset(Golden_Egg_df,
egg_diameter > max(thres) | egg_diameter < min(thres)),
shape = 21,
size = 4,
col = "red"
)
The code chunk above determines the y-values of the control limits (red lines) from the ggplot_build() function. It then draws red circles around outliers. This works great until I facet the plot. It's because the logic of thres <- range(pb$data[[3]]$yintercept) isn't "smart" enough to wade through the different facet groupings.
# ONLY ONE 'Y-INTERCEPT' RANGE HERE TO WORRY ABOUT WITHOUT FACETING
#> $`data`[[3]]
#> yintercept y x label
#> 1 -0.2688471 -0.2688471 -Inf LCL
#> 2 3.7995203 3.7995203 -Inf UCL
#> 3 -0.2688471 -0.2688471 Inf -0.3
#> 4 3.7995203 3.7995203 Inf 3.8
# MULTIPLE 'Y-INTERCEPT' RANGES HERE TO WORRY ABOUT WITH FACETING
#> $`data`[[3]]
#> yintercept y x label
#> 1 -0.8759612 -0.8759612 -Inf LCL
#> 2 4.5303358 4.5303358 -Inf UCL
#> 3 -0.8759612 -0.8759612 Inf -0.9
#> 4 4.5303358 4.5303358 Inf 4.5
#> 5 1.2074161 1.2074161 -Inf LCL
#> 6 1.9521532 1.9521532 -Inf UCL
#> 7 1.2074161 1.2074161 Inf 1.2
#> 8 1.9521532 1.9521532 Inf 2
How do I get my code block below to work properly and circle the outliers? I obviously need a more sophisticated thres2, that can recognize there are different groupings of control limits (red lines) between the different facets.
# Determine the control limit values (red lines)
Golden_Egg_df$egg_diameter[11] <- 5
p2 <- ggplot(Golden_Egg_df, aes(x = month, y = egg_diameter)) +
geom_point() +
geom_line() +
stat_QC(method = "XmR") +
facet_grid(~ grp, scales = "free_x", space = "free_x") +
scale_x_continuous(breaks = 1:12, labels = month.abb)
pb2 <- ggplot_build(p2)
thres2 <- range(pb2$data[[3]]$yintercept)
thres2
#> [1] -2.274056 7.445141
# Circle anything outside the control limits (red lines)
p2 + geom_point(
data = subset(Golden_Egg_df,
egg_diameter > max(thres2) | egg_diameter < min(thres2)),
shape = 21,
size = 4,
col = "red"
)
I think the best way is to get the ranges into the same data.frame as your data. I'm not sure if this is the most elegant solution, but it works with your example:
library(tidyverse)
library(ggQC)
set.seed(5555)
Golden_Egg_df <- data.frame(month = 1:12,
egg_diameter = rnorm(n=12, mean=1.5, sd=0.2)) %>%
mutate(grp = c(rep("A", 3), rep("B", 9)))
Golden_Egg_df$egg_diameter[3] <- 5
Golden_Egg_df$egg_diameter[11] <- 5
# create the plot
p2 <- ggplot(Golden_Egg_df, aes(x = month,
y = egg_diameter)) +
geom_point() +
geom_line() +
stat_QC(method = "XmR") +
facet_grid(~ grp,
scales = "free_x",
space = "free_x") +
scale_x_continuous(breaks = 1:12,
labels = month.abb)
# get all the info about the plot
pb2 <- ggplot_build(p2)
# extract the UCL and LCL for each plot (facet)
Golden_Egg_df <- Golden_Egg_df %>%
mutate(min = ifelse(grp == "A",
min(pb2$data[[3]]$yintercept[1:4]), # LCL of 1st facet
min(pb2$data[[3]]$yintercept[5:8])), # LCL of 2nd facet
max = ifelse(grp == "A",
max(pb2$data[[3]]$yintercept[1:4]), # UCL of 1st facet
max(pb2$data[[3]]$yintercept[5:8]))) # UCL of 2nd facet
# add the circled outlier
p2 + geom_point(data = subset(Golden_Egg_df,
egg_diameter > max |
egg_diameter < min),
shape = 21,
size = 4,
col = "red")
Cheers, Rico
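A possible way to avoid hard-coding the row ranges 1:4 and 5:8 is to group the built layer data by its PANEL column. The sketch below uses a mock of pb2$data[[3]] filled with the limits printed in the question, plus made-up measurements; the mapping of panel 1 to group A and panel 2 to group B is an assumption.

```r
# Mock of pb2$data[[3]]; built layer data carries a PANEL column
qc <- data.frame(
  PANEL      = factor(c(1, 1, 1, 1, 2, 2, 2, 2)),
  yintercept = c(-0.8759612, 4.5303358, -0.8759612, 4.5303358,
                  1.2074161, 1.9521532,  1.2074161, 1.9521532)
)
# One row per panel with its lower and upper control limit
limits <- data.frame(
  grp = c("A", "B"),                      # assumes panel order A, B
  lcl = as.vector(tapply(qc$yintercept, qc$PANEL, min)),
  ucl = as.vector(tapply(qc$yintercept, qc$PANEL, max))
)
# Hypothetical measurements, flagged against their own facet's limits
eggs <- data.frame(grp = c("A", "A", "B", "B"),
                   egg_diameter = c(1.4, 5.0, 1.6, 5.0))
eggs <- merge(eggs, limits, by = "grp")
eggs$outlier <- eggs$egg_diameter > eggs$ucl | eggs$egg_diameter < eggs$lcl
eggs
```

The rows with outlier == TRUE are the ones to pass to the extra geom_point layer, and this scales to any number of facets.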
When generating smoothed plots with faceting in ggplot, if the range of the data changes from facet to facet, the smoothing may acquire too many degrees of freedom for the facets with less data.
For example
library(dplyr)
library(ggplot2) # ggplot2_2.2.1
set.seed(1234)
expand.grid(z = -5:2, x = seq(-5,5, len = 50)) %>%
mutate(y = dnorm(x) + 0.4*runif(n())) %>%
filter(z <= x) %>%
ggplot(aes(x,y)) +
geom_line() +
geom_smooth(method = 'loess', span = 0.3) +
facet_wrap(~ z)
generates the following:
The z=-5 facet is fine, but as one moves to subsequent facets the smoothing seems to 'overfit'; indeed z=-1 already suffers from that, and in the last facet, z=2, the smoothed line fits the data perfectly. Ideally, what I would like is a less dynamic smoothing that for example always smooths about 4 points (or kernel smoothing with a fixed kernel).
The following SO question is related but perhaps more ambitious (in that it wants more control over span); here I want a simpler form of smoothing.
I moved a few things around in your code to get this to work. I'm not sure if it's the best way to do it, but it's a simple way.
First we group by your z variable and then generate a number span that is small for large numbers of observations but large for small numbers. I guessed at 10/length(x). Perhaps there's some more statistically sound way of looking at it. Or perhaps it should be 2/diff(range(x)). Since this is for your own visual smoothing, you'll have to fine tune that parameter yourself.
expand.grid(z = -5:2, x = seq(-5,5, len = 50)) %>%
filter(z <= x) %>%
group_by(z) %>%
mutate(y = dnorm(x) + 0.4*runif(length(x)),
span = 10/length(x)) %>%
distinct(z, span)
# A tibble: 8 x 2
# Groups: z [8]
z span
<int> <dbl>
1 -5 0.2000000
2 -4 0.2222222
3 -3 0.2500000
4 -2 0.2857143
5 -1 0.3333333
6 0 0.4000000
7 1 0.5000000
8 2 0.6666667
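The reasoning behind 10/length(x) is that loess's span is the fraction of points used in each local fit, so k/n keeps the neighbourhood at roughly k points whatever the facet size. A base-R sketch; the value of k, the cap at 1, and the toy data are my assumptions:

```r
# Number of points per facet in the example (z = -5, ..., 2)
n_points <- c(50, 45, 40, 35, 30, 25, 20, 15)

k <- 10                        # target neighbourhood size (a guess, tune it)
span <- pmin(1, k / n_points)  # cap at 1 so very small facets use all points
round(span, 3)

# Fit one small facet's worth of toy data with its own span
set.seed(1)
x <- seq(2, 5, length.out = 15)
y <- dnorm(x) + 0.4 * runif(15)
fit <- loess(y ~ x, span = span[8])
head(predict(fit))
```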
Update
The method I did have here was not working correctly. The best way to do this (and the most flexible way to do model-fitting in general) is to pre-compute it.
So we take our grouped dataframe with the computed span, fit a loess model to each group with the appropriate span, and then use broom::augment to form it back into a dataframe.
library(broom)
expand.grid(z = -5:2, x = seq(-5,5, len = 50)) %>%
filter(z <= x) %>%
group_by(z) %>%
mutate(y = dnorm(x) + 0.4*runif(length(x)),
span = 10/length(x)) %>%
do(fit = list(augment(loess(y~x, data = ., span = unique(.$span)), newdata = .))) %>%
unnest()
# A tibble: 260 x 7
z z1 x y span .fitted .se.fit
<int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -5 -5 -5.000000 0.045482851 0.2 0.07700057 0.08151451
2 -5 -5 -4.795918 0.248923802 0.2 0.18835244 0.05101045
3 -5 -5 -4.591837 0.243720422 0.2 0.25458037 0.04571323
4 -5 -5 -4.387755 0.249378098 0.2 0.28132026 0.04947480
5 -5 -5 -4.183673 0.344429272 0.2 0.24619206 0.04861535
6 -5 -5 -3.979592 0.256269425 0.2 0.19213489 0.05135924
7 -5 -5 -3.775510 0.004118627 0.2 0.14574901 0.05135924
8 -5 -5 -3.571429 0.093698117 0.2 0.15185599 0.04750935
9 -5 -5 -3.367347 0.267809673 0.2 0.17593182 0.05135924
10 -5 -5 -3.163265 0.208380125 0.2 0.22919335 0.05135924
# ... with 250 more rows
This has the side effect of duplicating the grouping column z, but it intelligently renames it to avoid name-collision, so we can ignore it. You can see that there are the same number of rows as the original data, and the original x, y, and z are there, as well as our computed span.
If you want to prove to yourself that it's really fitting each group with the right span, you can do something like:
... mutate(...) %>%
do(fit = (loess(y~x, data = ., span = unique(.$span)))) %>%
pull(fit) %>% purrr::map(summary)
That will print out the model summaries with the span included.
Now it's just a matter of plotting the augmented dataframe we just made, and manually reconstructing the smoothed line and confidence interval.
... %>%
ggplot(aes(x,y)) +
geom_line() +
geom_ribbon(aes(x, ymin = .fitted - 1.96*.se.fit,
ymax = .fitted + 1.96*.se.fit),
alpha = 0.2) +
geom_line(aes(x, .fitted), color = "blue", size = 1) +
facet_wrap(~ z)
I would simply remove the span option (because 0.3 seems too granular) or use the lm method to do a polynomial fit.
library(dplyr)
library(ggplot2) # ggplot2_2.2.1
set.seed(1234)
expand.grid(z = -5:2, x = seq(-5,5, len = 50)) %>%
mutate(y = dnorm(x) + 0.4*runif(n())) %>%
filter(z <= x) %>%
ggplot(aes(x,y)) +
geom_line() +
geom_smooth(method = 'lm', formula = y ~ poly(x, 4)) +
#geom_smooth(method = 'loess') +
#geom_smooth(method = 'loess', span = 0.3) +
facet_wrap(~ z)
Since I asked how to do kernel smoothing I wanted to provide an answer for that.
I'll start by just adding it as extra data to the data frame and plotting that, much as the accepted answer does.
First here is the data and packages I'll be using (same as in my post):
library(dplyr)
library(ggplot2) # ggplot2_2.2.1
set.seed(1234)
expand.grid(z = -5:2, x = seq(-5,5, len = 50)) %>%
mutate(y = dnorm(x) + 0.4*runif(n())) %>%
filter(z <= x) ->
Z
Next here is the plot:
Z %>%
group_by(z) %>%
do(data.frame(ksmooth(.$x, .$y, 'normal', bandwidth = 2))) %>%
ggplot(aes(x,y)) +
geom_line(data = Z) +
geom_line(color = 'blue', size = 1) +
facet_wrap(~ z)
which simply uses ksmooth from base R. Note that it's quite simple to avoid the dynamic smoothing (making the bandwidth constant takes care of that). In fact, one can recover a dynamic-style smoothing (i.e., more like geom_smooth) as follows:
Z %>%
group_by(z) %>%
do(data.frame(ksmooth(.$x, .$y, 'normal', bandwidth = diff(range(.$x))/5))) %>%
ggplot(aes(x,y)) +
geom_line(data = Z) +
geom_line(color = 'blue', size = 1) +
facet_wrap(~ z)
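To show why a constant bandwidth is not "dynamic", here is a fixed-kernel (Nadaraya-Watson) smoother written out by hand in base R. This is a sketch of the idea, not what ksmooth does internally; in particular h here is the Gaussian standard deviation, which is on a different scale from ksmooth's bandwidth argument.

```r
# Nadaraya-Watson estimate with a Gaussian kernel of fixed sd h
nw_smooth <- function(x, y, h) {
  sapply(x, function(x0) {
    w <- dnorm(x - x0, sd = h)   # weights depend only on distance, not on n
    sum(w * y) / sum(w)
  })
}

set.seed(1234)
x_long  <- seq(-5, 5, length.out = 50)
x_short <- x_long[x_long >= 2]            # like the z = 2 facet
y_long  <- dnorm(x_long)  + 0.4 * runif(length(x_long))
y_short <- dnorm(x_short) + 0.4 * runif(length(x_short))

s_long  <- nw_smooth(x_long,  y_long,  h = 0.75)
s_short <- nw_smooth(x_short, y_short, h = 0.75)
range(s_short)   # a weighted average, so it stays within range(y_short)
```

The same h is applied to the 50-point and the 15-point series, so the amount of smoothing per unit of x does not change with the number of points in a facet.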
I also followed the example in https://github.com/hrbrmstr/ggalt/blob/master/R/geom_xspline.r to turn this idea into an actual stat_ and geom_ as follows:
geom_ksmooth <- function(mapping = NULL, data = NULL, stat = "ksmooth",
position = "identity", na.rm = TRUE, show.legend = NA,
inherit.aes = TRUE,
bandwidth = 0.5, ...) {
layer(
geom = GeomKsmooth,
mapping = mapping,
data = data,
stat = stat,
position = position,
show.legend = show.legend,
inherit.aes = inherit.aes,
params = list(bandwidth = bandwidth,
...)
)
}
GeomKsmooth <- ggproto("GeomKsmooth", GeomLine,
required_aes = c("x", "y"),
default_aes = aes(colour = "blue", size = 1, linetype = 1, alpha = NA)
)
stat_ksmooth <- function(mapping = NULL, data = NULL, geom = "line",
position = "identity", na.rm = TRUE, show.legend = NA, inherit.aes = TRUE,
bandwidth = 0.5, ...) {
layer(
stat = StatKsmooth,
data = data,
mapping = mapping,
geom = geom,
position = position,
show.legend = show.legend,
inherit.aes = inherit.aes,
params = list(bandwidth = bandwidth,
...
)
)
}
StatKsmooth <- ggproto("StatKsmooth", Stat,
required_aes = c("x", "y"),
compute_group = function(self, data, scales, params,
bandwidth = 0.5) {
data.frame(ksmooth(data$x, data$y, kernel = 'normal', bandwidth = bandwidth))
}
)
(Note that I have a very poor understanding of the above code.) But now we can do:
Z %>%
ggplot(aes(x,y)) +
geom_line() +
geom_ksmooth(bandwidth = 2) +
facet_wrap(~ z)
And the smoothing is not dynamic, as I originally wanted.
I do wonder if there is a simpler way, though.