Geom_freqpoly with Predefined Count - r

I can plot geom_freqpoly without problems using the number of observations:
ggplot(data=demo) +
geom_freqpoly(mapping=aes(x = value))
But I'd like to use the precalculated observation count contained in the data.
I tried using stat = "identity", but it apparently doesn't work:
ggplot(data=demo) +
geom_freqpoly(mapping=aes(x = value, y = cnt), stat = "identity")
This is my sample data
demo <- tribble(
~value, ~cnt,
.25, 20,
.25, 30,
.1, 40
)

TL;DR: You didn't get the graph you want, because the data of pre-calculated counts you passed to ggplot was NOTHING like what was used to produce the freqpoly graph.
Since you didn't include code for the original demo used to generate graph 1, I'll venture a guess:
demo.orig <- data.frame(value = c(0.25, 0.25, 0.1))
p <- ggplot(demo.orig, aes(x = value)) +
geom_freqpoly()
p # show plot to verify its appearance, which matches the graph in the question
layer_data(p) # look at the calculated data used by geom_freqpoly
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
y count x xmin xmax width density ncount ndensity PANEL group colour size linetype alpha
1 0 0 0.09310345 0.09051724 0.09568966 0.005172414 0.00000 0.0 0.0 1 -1 black 0.5 1 NA
2 1 1 0.09827586 0.09568966 0.10086207 0.005172414 64.44444 0.5 0.5 1 -1 black 0.5 1 NA
3 0 0 0.10344828 0.10086207 0.10603448 0.005172414 0.00000 0.0 0.0 1 -1 black 0.5 1 NA
... (omitted to conserve space)
30 0 0 0.24310345 0.24051724 0.24568966 0.005172414 0.00000 0.0 0.0 1 -1 black 0.5 1 NA
31 2 2 0.24827586 0.24568966 0.25086207 0.005172414 128.88889 1.0 1.0 1 -1 black 0.5 1 NA
32 0 0 0.25344828 0.25086207 0.25603448 0.005172414 0.00000 0.0 0.0 1 -1 black 0.5 1 NA
From a small dataframe with only two unique values, stat_bin generated a much larger dataframe with the x-axis split into 30 bins (the default number), and count / y = 0 everywhere except for the two bins containing the original values.
> geom_freqpoly
function (mapping = NULL, data = NULL, stat = "bin", position = "identity",
..., na.rm = FALSE, show.legend = NA, inherit.aes = TRUE)
{
params <- list(na.rm = na.rm, ...)
if (identical(stat, "bin")) {
params$pad <- TRUE
}
layer(data = data, mapping = mapping, stat = stat, geom = GeomPath,
position = position, show.legend = show.legend, inherit.aes = inherit.aes,
params = params)
}
A quick check by printing geom_freqpoly to console shows that its underlying geom is simply GeomPath, which plots x/y pairs in sequential order.
In other words, if you want to get the peaks from graph 1, you need to provide a similar dataset, with rows indicating where y should drop to 0. While it's certainly possible to calculate this by digging into the code for StatBin$compute_group, I'd think it's simpler to expand from the data of pre-calculated counts and let ggplot do its normal job:
demo %>%
tidyr::uncount(cnt) %>%
ggplot(aes(x = value)) +
geom_freqpoly() +
theme_minimal()
Edit: solution without fully expanding dataframe of aggregated counts
Sample dataset with 2 groups:
demo <- data.frame(value = c(0.25, 0.5, 0.1, 0.25, 0.75, 0.1),
cnt = c(5, 2, 4, 3, 8, 7) * 10e8,
group = rep(c("a", "b"), each = 3))
Code:
library(ggplot2)
library(dplyr)
demo %>%
rename(x = value, y = cnt) %>% # rename here so approach below can be easily applied
# to other datasets with different column names
tidyr::nest(data = c(x, y)) %>% # nest to apply same approach for each group
mutate(data = purrr::map(
data,
function(d) ggplot2:::bin_vector( # cut x's range into appropriate bins
x = d$x,
bins = ggplot2:::bin_breaks_bins(
x_range = range(d$x),
bins = 30), # default bin count is 30; change if desired
pad = TRUE) %>%
select(x, xmin, xmax) %>%
# place y counts into the corresponding x bins (this is probably similar
# to interval join, but I don't have that package installed on my machine)
tidyr::crossing(d %>% rename(x2 = x)) %>%
mutate(y = ifelse(x2 >= xmin & x2 < xmax, y, 0)) %>%
select(-x2) %>%
group_by(x) %>%
filter(y == max(y)) %>%
ungroup() %>%
unique())) %>%
tidyr::unnest(cols = c(data)) %>% # unnest to get one flat dataframe back
ggplot(aes(x = x, y = y, colour = group)) + # plot as per normal
geom_path() +
theme_bw()
# package versions used: dplyr 1.0.0, ggplot2 3.3.1, tidyr 1.1.0, purrr 0.3.4

Based on a similar problem for histograms, the solution seems to be as simple as using the weight aesthetic.
Using the sample data from the other answer, the solution would be:
demo <- data.frame(value = c(0.25, 0.5, 0.1, 0.25, 0.75, 0.1),
cnt = c(5, 2, 4, 3, 8, 7) * 10e8,
group = rep(c("a", "b"), each = 3))
ggplot(demo, aes(value, weight = cnt, color = group)) + geom_freqpoly()
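As a quick sanity check of the weight approach, here is a minimal sketch comparing the binned counts from weight = cnt with those obtained by fully expanding the rows via tidyr::uncount(). It uses the three-row sample data from the question and assumes bins = 30 for both plots; layer_data() is only there to inspect what stat_bin computed.

```r
library(ggplot2)
library(tidyr)

# three-row sample data from the question
demo <- data.frame(value = c(0.25, 0.25, 0.1), cnt = c(20, 30, 40))

# weighted version: each row contributes `cnt` to its bin
p_weight <- ggplot(demo, aes(value, weight = cnt)) +
  geom_freqpoly(bins = 30)

# expanded version: each row is repeated `cnt` times, then counted as usual
p_expand <- ggplot(uncount(demo, cnt), aes(value)) +
  geom_freqpoly(bins = 30)

counts_weight <- layer_data(p_weight)$count
counts_expand <- layer_data(p_expand)$count

all.equal(counts_weight, counts_expand)  # the two count vectors should agree
```

If the two vectors agree, the weighted plot is a drop-in replacement and the data never has to be expanded.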

Related

ggplot2 heatmap with tile height and width as aes()

I'm trying to create a heat map for an OD matrix, but I wanted to scale the rows and columns by certain weights. Since these weights are constant across each category I would expect the plot would keep the rows and columns structure.
# Tidy OD matrix
df <- data.frame (origin = c(rep("A", 3), rep("B", 3),rep("C", 3)),
destination = rep(c("A","B","C"),3),
value = c(0, 1, 10, 5, 0, 11, 15, 6, 0))
# Weights
wdf <- data.frame(region = c("A","B","C"),
w = c(1,2,3))
# Add weights to the data.
plot_df <- df %>%
merge(wdf %>% rename(w_origin = w), by.x = 'origin', by.y = 'region') %>%
merge(wdf %>% rename(w_destination = w), by.x = 'destination', by.y = 'region')
Here's what the data looks like:
> plot_df
destination origin value w_origin w_destination
1 A A 0 1 1
2 A C 15 3 1
3 A B 5 2 1
4 B A 1 1 2
5 B B 0 2 2
6 B C 6 3 2
7 C B 11 2 3
8 C A 10 1 3
9 C C 0 3 3
However, when passing the weights as width and height in the aes() I get this:
ggplot(plot_df,
aes(x = destination,
y = origin)) +
geom_tile(
aes(
width = w_destination,
height = w_origin,
fill = value),
color = 'black')
It seems to be working for the size of the columns (width), but not quite, because the proportions are not right. And the rows are all over the place and not aligned.
I'm only using geom_tile because I could pass height and width as aesthetics, but I accept other suggestions.
The issue is that your tiles are overlapping. The reason is that while you can pass the width and height as aesthetics, geom_tile will not adjust the x and y positions of the tiles for you. As you are mapping a discrete variable on x and y, your tiles are positioned on an equidistant grid. In your case the tiles are centered at 1, 2 and 3. The tiles are then drawn at these positions with the specified width and height.
This could be easily seen by adding some transparency to your plot:
library(ggplot2)
library(dplyr)
ggplot(plot_df,
aes(x = destination,
y = origin)) +
geom_tile(
aes(
width = w_destination,
height = w_origin,
fill = value), color = "black", alpha = .2)
To achieve your desired result you have to manually compute the x and y positions according to the desired widths and heights to prevent the boxes from overlapping. To this end you could switch to a continuous scale and set the desired breaks and labels via scale_x/y_continuous:
breaks <- wdf %>%
mutate(cumw = cumsum(w),
pos = .5 * (cumw + lag(cumw, default = 0))) %>%
select(region, pos)
plot_df <- plot_df %>%
left_join(breaks, by = c("origin" = "region")) %>%
rename(y = pos) %>%
left_join(breaks, by = c("destination" = "region")) %>%
rename(x = pos)
ggplot(plot_df,
aes(x = x,
y = y)) +
geom_tile(
aes(
width = w_destination,
height = w_origin,
fill = value), color = "black") +
scale_x_continuous(breaks = breaks$pos, labels = breaks$region, expand = c(0, 0.1)) +
scale_y_continuous(breaks = breaks$pos, labels = breaks$region, expand = c(0, 0.1))
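The position arithmetic behind the breaks can be checked by hand. Below is a small base R sketch using the weights from the question (the variable names are just for illustration): each tile's centre is the midpoint between the cumulative width before it and the cumulative width including it, which is the same quantity as .5 * (cumw + lag(cumw, default = 0)).

```r
# weights from the question, named by region
w <- c(A = 1, B = 2, C = 3)

upper  <- cumsum(w)           # right/top edge of each tile
lower  <- upper - w           # left/bottom edge of each tile
centre <- (lower + upper) / 2 # where geom_tile needs x (or y) to be

centre
# A = 0.5, B = 2.0, C = 4.5

# adjacent tiles touch exactly: each upper edge equals the next lower edge
all(upper[-length(upper)] == lower[-1])
```

Because each tile starts where the previous one ends, tiles drawn at these centres with width w neither overlap nor leave gaps.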
So I think I have a partial solution for you. After playing around with geom_tile, it appears that the order of your dataframe matters when you are using height and width.
Here is some example code I came up with based on yours (run your code first). I converted your data frame to a tibble (re-exported by dplyr) to make it easier to sort by a column.
# Converted your dataframe to a tibble dataframe
plot_df_tibble = as_tibble(plot_df)
# Sorted your dataframe by your w_origin column:
plot_df_tibble2 = plot_df_tibble[order(plot_df_tibble$w_origin),]
# Plotted the sorted data frame:
ggplot(plot_df_tibble2,
aes(x = destination,
y = origin)) +
geom_tile(
aes(
width = w_destination,
height = w_origin,
fill = value),
color = 'black')
And got this plot: [image: geom_tile plot of the sorted data frame]
I should note that if you run the converted tibble before you sort that you get the same plot you posted.
It seems like the height and width arguments may not be fully developed for this portion of geom_tile, as I feel that the order of the df should not matter.
Cheers

Plot grouped barplot with absolute and percent values + labels

I am quite new to R and especially to ggplot. For my next result I think I have to switch from plot() to ggplot(), and that is where I need your help:
I have a dataframe with numeric values. For each indicator, one column holds an absolute number and the other the corresponding percentage value. I have 3 such two-column indicators: a, b and c.
The row names are the 7 observations and are stored in the first column "X".
I want to plot them in a kind of grouped barplot, where the absolute and percent columns sit next to each other for each of the 3 indicators.
Sample dataframe:
df = data.frame(X = c("e 1","e 1,5","e 2","e 2,5","e 3","e 3,5","e 4"),
a_abs=c(-0.3693,-0.0735,-0.019,0.0015,0,-0.0224,-0.0135),
a_per=c(-0.4736,-0.0943,-0.0244,0.0019,0,-0.0287,-0.0173),
b_abs=c(-0.384,-0.0733,-0.0173,0.0034,0,-0.0204,-0.0179),
b_per=c(-0.546,-0.1042,-0.0246,0.0048,0,-0.029,-0.0255),
c_abs=c(-0.3876,-0.0738,-0.019,0.0015,0,-0.0225,-0.0137),
c_per=c(-0.4971,-0.0946,-0.0244,0.0019,0,-0.0289,-0.0176))
Thanks to @jonspring I got the following plot by using this code:
df3 <- df %>%
gather(column, value, -X) %>%
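mutate(group = str_sub(column, end = 2), # two-letter prefixes (CK, JA, KA) as in my real data
stat = str_sub(column, start = 4)) %>% # the sample df above has one-letter prefixes instead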
select(-column) %>%
spread(stat, value) %>%
mutate(combo_label = paste(sep="\n",
scales::comma(abs, accuracy = 0.001),
scales::percent(per, accuracy = 0.01)))
df3$group = gsub(df3$group,pattern = "CK",replacement = "Cohen's\nKappa")
df3$group = gsub(df3$group,pattern = "JA",replacement = "Jaccard")
df3$group = gsub(df3$group,pattern = "KA",replacement = "Krippen-\ndorff's Alpha")
crg = ifelse(df3$abs< 0,"red","darkgreen")
ggplot(df3, aes(group, abs, label = combo_label)) +
geom_segment(aes(xend = group,
yend = 0),
color = crg) +
geom_point() +
geom_text(vjust = 1.5,
size = 3,
lineheight = 1.2) +
scale_y_continuous(expand = c(0.2,0)) +
facet_grid(~X) +
labs(x= "Exponent", y = "Wert")
When I zoom in so that the positive values are visible, the labels are drawn over the segments. How can I place them above or below depending on whether the value is positive or negative?
Zoom with coord_cartesian(ylim = c(-0.015,0.005))
Thank you for your helping hands.
EDIT: I found the solution already. As with the color change from red to green, I used ifelse for the vjust parameter.
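For anyone landing here, a minimal sketch of that idea follows. The three-row data frame and the vjust values 1.5 and -0.5 are made up for illustration and are not the exact values from my plot. The vjust is computed with ifelse() and mapped inside aes(), so that it stays aligned with the rows even when facets are used.

```r
library(ggplot2)

# small illustrative data set (hypothetical values)
d <- data.frame(group = c("a", "b", "c"),
                abs   = c(-0.019, 0.0015, -0.0173))

# vjust > 1 pushes the label below the point, vjust < 0 pushes it above
d$label_vjust <- ifelse(d$abs < 0, 1.5, -0.5)

p <- ggplot(d, aes(group, abs, label = abs)) +
  geom_segment(aes(xend = group, yend = 0),
               color = ifelse(d$abs < 0, "red", "darkgreen")) +
  geom_point() +
  geom_text(aes(vjust = label_vjust), size = 3)

p
```

Negative values get their label below the point and positive values above it, so the text no longer sits on the segment.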
There are a lot of varieties of ways to display this sort of data with ggplot. I highly recommend you check out https://r4ds.had.co.nz/data-visualisation.html if you haven't already.
One suggestion you'll find there is that ggplot almost always works better if you first convert your data into long (aka "tidy") form. This puts each of the dimensions of the data into its own column, so that you can map the dimension to a visual aesthetic. Here's one way to do that:
library(tidyverse)
df2 <- df %>%
gather(column, value, -X) %>%
mutate(group = str_sub(column, end = 1),
stat = str_sub(column, start = 3),
value_label = if_else(stat == "per",
scales::percent(value, accuracy = 0.1),
scales::comma(value, accuracy = 0.01)))
Now, the group a/b/c is in its own column, as is the type of data abs/per, the values are all together in one column, and we also have text labels that suit the type of data.
> head(df2)
X column value group stat value_label
1 e 1 a_abs -0.3693 a abs -0.37
2 e 1,5 a_abs -0.0735 a abs -0.07
3 e 2 a_abs -0.0190 a abs -0.02
4 e 2,5 a_abs 0.0015 a abs 0.00
5 e 3 a_abs 0.0000 a abs 0.00
6 e 3,5 a_abs -0.0224 a abs -0.02
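As a side note, gather() has since been superseded in tidyr by pivot_longer(), which can split the column names in the same step. A minimal sketch, assuming tidyr 1.0.0 or later and using only a two-row, two-indicator subset of the sample data to keep it short:

```r
library(tidyr)

# subset of the sample data from the question
df_small <- data.frame(X     = c("e 1", "e 1,5"),
                       a_abs = c(-0.3693, -0.0735),
                       a_per = c(-0.4736, -0.0943),
                       b_abs = c(-0.384, -0.0733),
                       b_per = c(-0.546, -0.1042))

# split "a_abs" into group = "a" and stat = "abs" while reshaping
long <- pivot_longer(df_small,
                     cols      = -X,
                     names_to  = c("group", "stat"),
                     names_sep = "_",
                     values_to = "value")

long  # 8 rows: 2 observations x 2 groups x 2 stats
```

The result has the same group / stat / value layout as df2, minus the label column, which can be added with mutate() as before.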
With that out of the way, it's simpler to try out different combinations of ggplot options, which can help highlight different comparisons within the data.
For instance, if you want to compare the different observations within each group, you could put each group into a facet, and each observation along the x axis:
ggplot(df2, aes(X, value, label = value_label)) +
geom_segment(aes(xend = X, yend = 0), color = "blue") +
geom_point() +
geom_text(vjust = 2, size = 2) +
facet_grid(stat~group)
Or if you want to highlight how the different groups compared within each observation, you could swap them, like this:
ggplot(df2, aes(group, value, label = value_label)) +
geom_segment(aes(xend = group, yend = 0), color = "blue") +
geom_point() +
geom_text(vjust = 2, size = 2) +
facet_grid(stat~X)
You might also try combining the abs and per data, since they only vary slightly based on the different denominators applicable to each group and/or observation. To do that, it might be simpler to transform the data to keep each abs and per together:
df3 <- df %>%
gather(column, value, -X) %>%
mutate(group = str_sub(column, end = 1),
stat = str_sub(column, start = 3)) %>%
select(-column) %>%
spread(stat, value) %>%
mutate(combo_label = paste(sep="\n",
scales::comma(abs, accuracy = 0.01),
scales::percent(per, accuracy = 0.1)))
ggplot(df3, aes(group, abs, label = combo_label)) +
geom_segment(aes(xend = group, yend = 0), color = "blue") +
geom_point() +
geom_text(vjust = 1.5, size = 2, lineheight = 0.8) +
scale_y_continuous(expand = c(0.2,0)) +
facet_grid(~X)

How to make stacked circle plot without coord_polar

I've got a dataset similar to this:
x <- 100 - abs(rnorm(1e6, 0, 5))
y <- 50 + rnorm(1e6, 0, 3)
dist <- sqrt((x - 100)^2 + (y - 50)^2)
z <- exp(-(dist / 8)^2)
which can be visualised as follows:
data.frame(x, y, z) %>%
ggplot() + geom_point(aes(x, y, color = z))
What I would like to do is a stacked half-circle plot with averaged value of z in subsequent layers. I think it can be done with the combination of geom_col and coord_polar(), although the farthest I can get is
data.frame(x, y, z, dist) %>%
mutate(dist_fct = cut(dist, seq(0, max(dist), by = 5))) %>%
ggplot() + geom_bar(aes(x = 1, y = 1, fill = dist_fct), stat = 'identity', position = 'fill') +
coord_polar()
which is obviously far from the expectation (layers should be of equal size, plot should be clipped on the right half).
The problem is that I can't really use coord_polar() because I later need annotation_custom(). So my questions are:
Can a plot like this be done without coord_polar()?
If not, how can it be done with coord_polar()?
The result should be similar to the graphic below, except that instead of plotting layers built from individual points, I would like to plot each layer as a whole, colored by the average value of z inside that layer.
If you want simple radius bands, perhaps something like this would work as you pictured it in your question:
# your original sample data
x <- 100 - abs(rnorm(1e6, 0, 5))
y <- 50 + rnorm(1e6, 0, 3)
dist <- sqrt((x - 100)^2 + (y - 50)^2)
nbr_bands <- 6 # set nbr of bands to plot
# calculate width of bands
band_width <- max(dist)/(nbr_bands-1)
# dist %/% band_width yields an integer from 0 to nbr_bands - 1
# as.factor makes it categorical, which is what you want for the plot
band = as.factor(dist %/% (band_width))
library(dplyr)
library(ggplot2)
data.frame(x, y, band) %>%
ggplot() + geom_point(aes(x, y, color = band)) + coord_fixed() +
theme_dark() # dark theme
Edit to elaborate:
As you first attempted, it would be nice to use the very handy cut() function to calculate the radius color categories.
One way to get categorical (discrete) colors, rather than continuous shading, for your plot color groups is to set your aes color= to a factor column.
cut() already returns a factor; to make it an ordered factor, you may add the option ordered_result=TRUE:
band <- cut(dist, nbr_bands, ordered_result=TRUE, labels=1:nbr_bands) # also use `labels=` to specify your own labels
data.frame(x, y, band) %>%
ggplot() + geom_point(aes(x, y, color = band)) + coord_fixed()
Or you may call cut() with labels=FALSE, which returns integer bin codes, and convert those to a factor using as.factor():
band <- as.factor( cut(dist, nbr_bands, labels=FALSE) )
data.frame(x, y, band) %>%
ggplot() + geom_point(aes(x, y, color = band)) + coord_fixed()
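To make the difference between these cut() variants concrete, here is a tiny base R sketch on five made-up distances (the values and the choice of 4 bands are only for illustration):

```r
# five hypothetical distances, cut into 4 equal-width bands
dist_small <- c(0.5, 3, 7, 12, 19.9)

band_factor  <- cut(dist_small, breaks = 4)                        # unordered factor
band_ordered <- cut(dist_small, breaks = 4, ordered_result = TRUE) # ordered factor
band_codes   <- cut(dist_small, breaks = 4, labels = FALSE)        # plain integer codes

is.factor(band_factor)   # TRUE
is.ordered(band_ordered) # TRUE
is.integer(band_codes)   # TRUE
band_codes               # 1 1 2 3 4
```

Any of the factor versions gives discrete colours in ggplot; the integer codes would be treated as continuous unless wrapped in as.factor().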
Sounds like you may find the circle & arc plotting functions from the ggforce package useful:
library(dplyr)
library(ggplot2)
library(ggforce)
# data
set.seed(1234)
df <- data.frame(x = 100 - abs(rnorm(1e6, 0, 5)),
y = 50 + rnorm(1e6, 0, 3)) %>%
mutate(dist = sqrt((x - 100)^2 + (y - 50)^2)) %>%
mutate(z = exp(-(dist / 8)^2))
# define cut-off values
cutoff.values <- seq(0, ceiling(max(df$dist)), by = 5)
df %>%
# calculate the mean z for each distance band
mutate(dist_fct = cut(dist, cutoff.values)) %>%
group_by(dist_fct) %>%
summarise(z = mean(z)) %>%
ungroup() %>%
# add the cutoff values to the dataframe for inner & outer radius
arrange(dist_fct) %>%
mutate(r0 = cutoff.values[-length(cutoff.values)],
r = cutoff.values[-1]) %>%
# add coordinates for circle centre
mutate(x = 100, y = 50) %>%
# plot
ggplot(aes(x0 = x, y0 = y,
r0 = r0, r = r,
fill = z)) +
geom_arc_bar(aes(start = 0, end = 2 * pi),
color = NA) + # hide outline
# force equal aspect ratio in order to get true circle
coord_equal(xlim = c(70, 100), expand = FALSE)
Plot generation took <1s on my machine. Yours may differ.
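One detail worth watching when building the bands: a sequence like seq(0, max(dist), by = 5) stops at the last multiple of 5 that does not exceed the upper limit, so cut() may return NA for the few points beyond it. A base R sketch of one way around that, rounding the upper limit up to the next multiple of the step (n is reduced to 1e4 here only to keep the sketch fast, and the variable names are my own):

```r
set.seed(1234)
n <- 1e4  # smaller than the 1e6 in the question, just for speed
x <- 100 - abs(rnorm(n, 0, 5))
y <- 50 + rnorm(n, 0, 3)
dist <- sqrt((x - 100)^2 + (y - 50)^2)
z <- exp(-(dist / 8)^2)

step <- 5
# round the upper limit up to the next multiple of `step` so every point gets a band
cutoffs <- seq(0, step * ceiling(max(dist) / step), by = step)
band <- cut(dist, cutoffs, include.lowest = TRUE)

anyNA(band)                      # FALSE: no point falls outside the bands
mean_z <- tapply(z, band, mean)  # one average z per band
mean_z
```

The inner and outer radii for each band are then cutoffs[-length(cutoffs)] and cutoffs[-1], and they line up one-to-one with mean_z.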
I'm not sure this satisfies everything, but it should be a start. To cut down on the time for plotting, I'm summarizing the data into a grid, which lets you use geom_raster. I don't entirely understand the breaks and everything you're using, so you might want to tweak some of how I divided the data for making the distinct bands. I tried out a couple ways with cut_interval and cut_width--this would be a good place to plug in different options, such as the number or width of bands.
Since you mentioned getting the average z for each band, I'm grouping by the gridded x and y and the cut dist, then using mean of z for setting bands. I threw in a step to make labels like in the example--you probably want to reverse them or adjust their positioning--but that comes from getting the number of each band's factor level.
library(tidyverse)
set.seed(555)
n <- 1e6
df <- tibble(
x = 100 - abs(rnorm(n, 0, 5)),
y = 50 + rnorm(n, 0, 3),
dist = sqrt((x - 100)^2 + (y - 50)^2),
z = exp(-(dist / 8)^2)
) %>%
mutate(brk = cut(dist, seq(0, max(dist), by = 5), include.lowest = TRUE))
summarized <- df %>%
filter(!is.na(brk)) %>%
mutate(x_grid = floor(x), y_grid = floor(y)) %>%
group_by(x_grid, y_grid, brk) %>%
summarise(avg_z = mean(z)) %>%
ungroup() %>%
# mutate(z_brk = cut_width(avg_z, width = 0.15)) %>%
mutate(z_brk = cut_interval(avg_z, n = 9)) %>%
mutate(brk_num = as.numeric(z_brk))
head(summarized)
#> # A tibble: 6 x 6
#> x_grid y_grid brk avg_z z_brk brk_num
#> <dbl> <dbl> <fct> <dbl> <fct> <dbl>
#> 1 75 46 (20,25] 0.0000697 [6.97e-05,0.11] 1
#> 2 75 47 (20,25] 0.000101 [6.97e-05,0.11] 1
#> 3 75 49 (20,25] 0.0000926 [6.97e-05,0.11] 1
#> 4 75 50 (20,25] 0.0000858 [6.97e-05,0.11] 1
#> 5 75 52 (20,25] 0.0000800 [6.97e-05,0.11] 1
#> 6 76 51 (20,25] 0.000209 [6.97e-05,0.11] 1
To make the labels, summarize that data to have a single row per band--I did this by taking the minimum of the gridded x, then using the average of y so they'll show up in the middle of the plot.
labels <- summarized %>%
group_by(brk_num) %>%
summarise(min_x = min(x_grid)) %>%
ungroup() %>%
mutate(y_grid = mean(summarized$y_grid))
head(labels)
#> # A tibble: 6 x 3
#> brk_num min_x y_grid
#> <dbl> <dbl> <dbl>
#> 1 1 75 49.7
#> 2 2 88 49.7
#> 3 3 90 49.7
#> 4 4 92 49.7
#> 5 5 93 49.7
#> 6 6 94 49.7
geom_raster is great for these situations where you have data in an evenly spaced grid that just needs uniform tiles at each position. At this point, the summarized data has 595 rows, instead of the original 1 million, so the time to plot shouldn't be an issue.
ggplot(summarized) +
geom_raster(aes(x = x_grid, y = y_grid, fill = z_brk)) +
geom_label(aes(x = min_x, y = y_grid, label = brk_num), data = labels, size = 3, hjust = 0.5) +
theme_void() +
theme(legend.position = "none", panel.background = element_rect(fill = "gray40")) +
coord_fixed() +
scale_fill_brewer(palette = "PuBu")
Created on 2018-11-04 by the reprex package (v0.2.1)

How to make geom_smooth less dynamic

When generating smoothed plots with faceting in ggplot, if the range of the data changes from facet to facet the smoothing may acquire too many degrees of freedom for the facets with less data.
For example
library(dplyr)
library(ggplot2) # ggplot2_2.2.1
set.seed(1234)
expand.grid(z = -5:2, x = seq(-5,5, len = 50)) %>%
mutate(y = dnorm(x) + 0.4*runif(n())) %>%
filter(z <= x) %>%
ggplot(aes(x,y)) +
geom_line() +
geom_smooth(method = 'loess', span = 0.3) +
facet_wrap(~ z)
generates the following:
The z=-5 facet is fine, but as one moves to subsequent facets the smoothing seems to 'overfit'; indeed z=-1 already suffers from that, and in the last facet, z=2, the smoothed line fits the data perfectly. Ideally, what I would like is a less dynamic smoothing that for example always smooths about 4 points (or kernel smoothing with a fixed kernel).
The following SO question is related but perhaps more ambitious (in that it wants more control over span); here I want a simpler form of smoothing.
I moved a few things around in your code to get this to work. I'm not sure if it's the best way to do it, but it's a simple way.
First we group by your z variable and then generate a number span that is small for large numbers of observations but large for small numbers. I guessed at 10/length(x). Perhaps there's some more statistically sound way of looking at it. Or perhaps it should be 2/diff(range(x)). Since this is for your own visual smoothing, you'll have to fine tune that parameter yourself.
expand.grid(z = -5:2, x = seq(-5,5, len = 50)) %>%
filter(z <= x) %>%
group_by(z) %>%
mutate(y = dnorm(x) + 0.4*runif(length(x)),
span = 10/length(x)) %>%
distinct(z, span)
# A tibble: 8 x 2
# Groups: z [8]
z span
<int> <dbl>
1 -5 0.2000000
2 -4 0.2222222
3 -3 0.2500000
4 -2 0.2857143
5 -1 0.3333333
6 0 0.4000000
7 1 0.5000000
8 2 0.6666667
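The effect of span = 10/length(x) can be verified with a few lines of base R: since loess uses roughly span * n points in each local fit, this choice keeps about 10 points in every neighbourhood regardless of the facet. A small sketch of that arithmetic (the variable names are only for illustration):

```r
x <- seq(-5, 5, length.out = 50)
z <- -5:2

# number of points left in each facet after filter(z <= x)
n_points <- sapply(z, function(zi) sum(x >= zi))
span <- 10 / n_points

data.frame(z, n_points, span, points_per_fit = span * n_points)
# n_points runs from 50 (z = -5) down to 15 (z = 2);
# points_per_fit is 10 in every row
```

Changing the 10 to another constant changes how many points each local fit sees, which is the knob to tune for a smoother or wigglier line.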
Update
The method I did have here was not working correctly. The best way to do this (and the most flexible way to do model-fitting in general) is to pre-compute it.
So we take our grouped dataframe with the computed span, fit a loess model to each group with the appropriate span, and then use broom::augment to form it back into a dataframe.
library(broom)
expand.grid(z = -5:2, x = seq(-5,5, len = 50)) %>%
filter(z <= x) %>%
group_by(z) %>%
mutate(y = dnorm(x) + 0.4*runif(length(x)),
span = 10/length(x)) %>%
do(fit = list(augment(loess(y~x, data = ., span = unique(.$span)), newdata = .))) %>%
unnest()
# A tibble: 260 x 7
z z1 x y span .fitted .se.fit
<int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -5 -5 -5.000000 0.045482851 0.2 0.07700057 0.08151451
2 -5 -5 -4.795918 0.248923802 0.2 0.18835244 0.05101045
3 -5 -5 -4.591837 0.243720422 0.2 0.25458037 0.04571323
4 -5 -5 -4.387755 0.249378098 0.2 0.28132026 0.04947480
5 -5 -5 -4.183673 0.344429272 0.2 0.24619206 0.04861535
6 -5 -5 -3.979592 0.256269425 0.2 0.19213489 0.05135924
7 -5 -5 -3.775510 0.004118627 0.2 0.14574901 0.05135924
8 -5 -5 -3.571429 0.093698117 0.2 0.15185599 0.04750935
9 -5 -5 -3.367347 0.267809673 0.2 0.17593182 0.05135924
10 -5 -5 -3.163265 0.208380125 0.2 0.22919335 0.05135924
# ... with 250 more rows
This has the side effect of duplicating the grouping column z, but it intelligently renames it to avoid name-collision, so we can ignore it. You can see that there are the same number of rows as the original data, and the original x, y, and z are there, as well as our computed span.
If you want to prove to yourself that it's really fitting each group with the right span, you can do something like:
... mutate(...) %>%
do(fit = (loess(y~x, data = ., span = unique(.$span)))) %>%
pull(fit) %>% purrr::map(summary)
That will print out the model summaries with the span included.
Now it's just a matter of plotting the augmented dataframe we just made, and manually reconstructing the smoothed line and confidence interval.
... %>%
ggplot(aes(x,y)) +
geom_line() +
geom_ribbon(aes(x, ymin = .fitted - 1.96*.se.fit,
ymax = .fitted + 1.96*.se.fit),
alpha = 0.2) +
geom_line(aes(x, .fitted), color = "blue", size = 1) +
facet_wrap(~ z)
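If you would rather not depend on broom, the same fitted values and standard errors can be obtained with predict() from base R. A minimal sketch for a single group, assuming the z = -5 facet (all 50 points) and its span of 0.2; the ymin/ymax column names are my own:

```r
set.seed(1234)
d <- data.frame(x = seq(-5, 5, length.out = 50))
d$y <- dnorm(d$x) + 0.4 * runif(nrow(d))

# fit loess with an explicit span, then ask predict() for standard errors
fit  <- loess(y ~ x, data = d, span = 0.2)
pred <- predict(fit, newdata = d, se = TRUE)

d$.fitted <- pred$fit
d$ymin    <- pred$fit - 1.96 * pred$se.fit  # lower edge of the ribbon
d$ymax    <- pred$fit + 1.96 * pred$se.fit  # upper edge of the ribbon

head(d)
```

The resulting columns can be passed to geom_ribbon() and geom_line() exactly like the augmented data frame above; for several groups, the same steps would be applied per group.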
I would simply remove the span option (because 0.3 seems too granular) or use lm method to do polynomial fit.
library(dplyr)
library(ggplot2) # ggplot2_2.2.1
set.seed(1234)
expand.grid(z = -5:2, x = seq(-5,5, len = 50)) %>%
mutate(y = dnorm(x) + 0.4*runif(n())) %>%
filter(z <= x) %>%
ggplot(aes(x,y)) +
geom_line() +
geom_smooth(method = 'lm', formula = y ~ poly(x, 4)) +
#geom_smooth(method = 'loess') +
#geom_smooth(method = 'loess', span = 0.3) +
facet_wrap(~ z)
Since I asked how to do kernel smoothing I wanted to provide an answer for that.
I'll start by just adding it as extra data to the data frame and plotting that, much as the accepted answer does.
First here is the data and packages I'll be using (same as in my post):
library(dplyr)
library(ggplot2) # ggplot2_2.2.1
set.seed(1234)
expand.grid(z = -5:2, x = seq(-5,5, len = 50)) %>%
mutate(y = dnorm(x) + 0.4*runif(n())) %>%
filter(z <= x) ->
Z
Next here is the plot:
Z %>%
group_by(z) %>%
do(data.frame(ksmooth(.$x, .$y, 'normal', bandwidth = 2))) %>%
ggplot(aes(x,y)) +
geom_line(data = Z) +
geom_line(color = 'blue', size = 1) +
facet_wrap(~ z)
which simply uses ksmooth from base R. Note that it's quite simple to avoid the dynamic smoothing (making the bandwidth constant takes care of that). In fact, one can recover a dynamic style of smoothing (i.e., more like geom_smooth) as follows:
Z %>%
group_by(z) %>%
do(data.frame(ksmooth(.$x, .$y, 'normal', bandwidth = diff(range(.$x))/5))) %>%
ggplot(aes(x,y)) +
geom_line(data = Z) +
geom_line(color = 'blue', size = 1) +
facet_wrap(~ z)
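To see the two bandwidth choices side by side without any plotting, here is a small base R sketch on a single simulated series. The subset x >= 2 stands in for the z = 2 facet, and x.points is set explicitly so the fit is evaluated at the observed x values:

```r
set.seed(1234)
x <- seq(-5, 5, length.out = 50)
y <- dnorm(x) + 0.4 * runif(50)
keep <- x >= 2  # stands in for the last facet, z = 2

# fixed bandwidth: the kernel width is 2 x-units no matter how short the series is
fit_fixed <- ksmooth(x[keep], y[keep], kernel = "normal",
                     bandwidth = 2, x.points = x[keep])

# range-dependent bandwidth: shrinks with the data range, like the dynamic version
bw_dynamic  <- diff(range(x[keep])) / 5
fit_dynamic <- ksmooth(x[keep], y[keep], kernel = "normal",
                       bandwidth = bw_dynamic, x.points = x[keep])

bw_dynamic                          # 0.6, much narrower than the fixed 2
sd(fit_fixed$y); sd(fit_dynamic$y)  # compare how much each smoothed curve varies
```

ksmooth() returns a list with components x and y, which is why wrapping it in data.frame() works directly inside do().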
I also followed the example in https://github.com/hrbrmstr/ggalt/blob/master/R/geom_xspline.r to turn this idea into an actual stat_ and geom_ as follows:
geom_ksmooth <- function(mapping = NULL, data = NULL, stat = "ksmooth",
position = "identity", na.rm = TRUE, show.legend = NA,
inherit.aes = TRUE,
bandwidth = 0.5, ...) {
layer(
geom = GeomKsmooth,
mapping = mapping,
data = data,
stat = stat,
position = position,
show.legend = show.legend,
inherit.aes = inherit.aes,
params = list(bandwidth = bandwidth,
...)
)
}
GeomKsmooth <- ggproto("GeomKsmooth", GeomLine,
required_aes = c("x", "y"),
default_aes = aes(colour = "blue", size = 1, linetype = 1, alpha = NA)
)
stat_ksmooth <- function(mapping = NULL, data = NULL, geom = "line",
position = "identity", na.rm = TRUE, show.legend = NA, inherit.aes = TRUE,
bandwidth = 0.5, ...) {
layer(
stat = StatKsmooth,
data = data,
mapping = mapping,
geom = geom,
position = position,
show.legend = show.legend,
inherit.aes = inherit.aes,
params = list(bandwidth = bandwidth,
...
)
)
}
StatKsmooth <- ggproto("StatKsmooth", Stat,
required_aes = c("x", "y"),
compute_group = function(self, data, scales, params,
bandwidth = 0.5) {
data.frame(ksmooth(data$x, data$y, kernel = 'normal', bandwidth = bandwidth))
}
)
(Note that I have a very poor understanding of the above code.) But now we can do:
Z %>%
ggplot(aes(x,y)) +
geom_line() +
geom_ksmooth(bandwidth = 2) +
facet_wrap(~ z)
And the smoothing is not dynamic, as I originally wanted.
I do wonder if there is a simpler way, though.

In ggplot2, how do you combine small valued bars in a stacked histogram together?

Example data:
tmp_df <-
data.frame(a = rnorm(100, 0, 1),
b = rnorm(100, 0.5, 1),
c = rnorm(100, -0.5, 1),
d = rnorm(100, 1, 1),
e = rnorm(100, -1, 1)) %>%
tidyr::gather()
and producing a stacked histogram:
tmp_df %>%
ggplot(aes(x = value, fill = key)) +
geom_histogram(binwidth = 0.1, position = 'stack')
All well and good, in each bin, we have 5 different coloured bars showing the counts in each bin for each group.
What do I do if I only want to show the counts for the top N (say N = 2) groups for each bin, and classify and aggregate the other counts into an 'other' group?
For example, for N = 2 and the bin centred on zero, I would like to show the count of a and c falling into this bin as separate bars, but combine the bar lengths of b, d, and e into one. For the bin centred on approx -1.4, I want to show the counts for group e and c, but aggregate the other two.
You can do this by creating a new grouping variable (which we'll call group) that, for each bin, takes the value of key for the top two levels of key, or "other" for the other three levels of key. To make this work, you bin and tally the data before plotting it, and then create the new group column and use it as the fill aesthetic in ggplot.
library(dplyr)
library(ggplot2)
# Set a seed for reproducibility
set.seed(59)
tmp_df <-
data.frame(a = rnorm(100, 0, 1),
b = rnorm(100, 0.5, 1),
c = rnorm(100, -0.5, 1),
d = rnorm(100, 1, 1),
e = rnorm(100, -1, 1)) %>%
tidyr::gather()
In the code below, we bin the data and create the new grouping variable. I used bins 0.2 units wide with labels equal to the mid-points of the bins. To create the group column, we use rank to find the two most common values of key in each bin and set the rest to "other".
tmp_df = tmp_df %>%
group_by(key,
bins=cut(value, seq(-10,10,0.2), labels=seq(-9.9,9.9,0.2))) %>%
tally %>%
group_by(bins) %>%
mutate(group = ifelse(key %in% key[rank(-n, ties="first") %in% 1:2], key, "other")) %>%
arrange(bins, key)
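The rank() step is the heart of this, so here it is in isolation on a single made-up bin (the counts are hypothetical, chosen to include a tie):

```r
# counts for one hypothetical bin; a and c are tied for first place
key <- c("a", "b", "c", "d", "e")
n   <- c(7, 3, 7, 1, 5)

# rank(-n) ranks the largest count first; ties.method = "first" breaks ties by position
ranks <- rank(-n, ties.method = "first")
ranks  # 1 4 2 5 3

top2  <- key[ranks %in% 1:2]
group <- ifelse(key %in% top2, key, "other")
group  # "a" "other" "c" "other" "other"
```

Inside the grouped mutate() this runs once per bin, so each bin gets its own top two keys. Changing 1:2 to 1:N generalises it to the top N.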
Now, for the plot we use geom_bar and we fill with the new group column we created above. Also, we convert bins (the bin labels) from factor to numeric, so that the x-axis will be continuous, rather than discrete.
tmp_df %>%
ungroup %>%
mutate(bins = as.numeric(as.character(bins))) %>%
ggplot(aes(x=bins, y=n, fill = group)) +
geom_bar(stat='identity') +
scale_fill_manual(values=c(hcl(seq(15,375,length.out=6)[1:5],100,65),"black"))
Let me know if this is what you had in mind.
I used a post-hoc approach. ggplot creates its own data frame for drawing a figure. That data frame has all the details, and you can make good use of them.
# Let's create a data set with set.seed().
library(dplyr)
library(tidyr)
library(ggplot2)
library(gridExtra)
set.seed(111)
tmp_df <- data.frame(a = rnorm(100, 0, 1),
b = rnorm(100, 0.5, 1),
c = rnorm(100, -0.5, 1),
d = rnorm(100, 1, 1),
e = rnorm(100, -1, 1)) %>%
tidyr::gather()
# Save the original data
tmp_df %>%
ggplot(aes(x = value, fill = key)) +
geom_histogram(binwidth = 0.1, position = 'stack') -> g
Now you create a new data frame using g. You can see what this data frame looks like below.
# Create a data frame
ggplot_build(g)$data[[1]] %>%
data.frame -> temp
# fill y count x xmin xmax density ncount ndensity PANEL group ymin ymax colour size linetype
#1 #E76BF3 1 1 -4.2 -4.25 -4.15 0.1 0.125 1.25 1 5 0 1 NA 0.5 1
#2 #00B0F6 1 0 -4.2 -4.25 -4.15 0.0 0.000 0.00 1 4 1 1 NA 0.5 1
#3 #00BF7D 1 0 -4.2 -4.25 -4.15 0.0 0.000 0.00 1 3 1 1 NA 0.5 1
#4 #A3A500 1 0 -4.2 -4.25 -4.15 0.0 0.000 0.00 1 2 1 1 NA 0.5 1
#5 #F8766D 1 0 -4.2 -4.25 -4.15 0.0 0.000 0.00 1 1 1 1 NA 0.5 1
#6 #E76BF3 0 0 -4.1 -4.15 -4.05 0.0 0.000 0.00 1 5 0 0 NA 0.5 1
I wanted to check how colors were assigned to each group, so I took the part of the data where x is 0. This information will be used later.
# Check how colors are assigned to each group
filter(temp, x == 0) %>%
select(fill) %>%
unlist %>%
rev
# fill5 fill4 fill3 fill2 fill1
# "#F8766D" "#A3A500" "#00BF7D" "#00B0F6" "#E76BF3"
Then, I wanted to manipulate the data frame a bit. In order to find the top 2 groups for each bin (each x value), I subtracted ymin from ymax and created a new column called y2. The values in this column tell which groups occupy the top two positions. So, for each bin, I arranged the data in descending order of y2. Then, I reassigned the groups in the 3rd to 5th positions to a sixth group, which becomes "others". If there were ties within a bin, the first one was chosen.
temp %>%
mutate(y2 = ymax - ymin) %>%
arrange(x, desc(y2)) %>%
group_by(x) %>%
mutate(group = as.character(c(group[1:2], rep(6, times = 3)))) %>%
ungroup -> temp2
The final step was to draw a figure again. As eipi10 used geom_bar, I used the same function.
ggplot(data = temp2, aes(x = x, y = y2, fill = group)) +
geom_bar(width = 0.1, stat = "identity") +
scale_fill_manual(name = "key", labels = c("a", "b", "c", "d", "e", "others"),
values = c("#F8766D", "#A3A500", "#00BF7D", "#00B0F6", "#E76BF3", "#000000")) +
labs(x = "value", y = "count") -> g2
For the comparison graphic below
g3 <- arrangeGrob(g, g2, ncol = 2)
ggsave("whatever.png", plot = g3, width = 12, height = 9)
Comparison with the original figure (left)
