I have measurements of a quantity (value) at specific points (lon and lat), like the example data below:
library(ggplot2)
set.seed(1)
dat <- data.frame(lon = runif(1000, 1, 15),
lat = runif(1000, 40, 60),
value = rnorm(1000))
I want to make a 2D summary (e.g. mean) of the measured values with color in space and on top of that I want to show the counts as labels.
I can plot the summary and the labels separately:
## Left plot
ggplot(dat) +
aes(x = lon, y = lat, z = value) +
stat_summary_hex(bins = 5, fun = "mean", geom = "hex")
## Right plot
ggplot(dat) +
aes(x = lon, y = lat, z = value) +
stat_binhex(aes(label = ..count..), bins = 5, geom = "text")
But when I combine both, I lose the summary:
ggplot(dat) +
aes(x = lon, y = lat, z = value) +
stat_summary_hex(bins = 5, fun = "mean", geom = "hex") +
stat_binhex(aes(label = ..count..), bins = 5, geom = "text")
I can achieve the opposite, counts as color and summary as labels:
ggplot(dat, aes(lon, lat, z = value)) +
geom_hex(bins = 5) +
stat_summary_hex(aes(label=..value..), bins = 5,
fun = function(x) round(mean(x), 3),
geom = "text")
While writing the question, which took some hours of testing, I found a solution: adding fill = NULL or fill = mean(value) to the aesthetics of the text layer gives me what I want. The code and the resulting plots are below; the only difference between them is the legend title.
But it feels very hacky, so I would appreciate a better solution.
ggplot(dat) +
aes(x = lon, y = lat, z = value) +
stat_summary_hex(bins = 5, fun = "mean", geom = "hex") +
stat_binhex(aes(label = ..count.., fill = NULL), bins = 5, geom = "text") +
theme_bw()
ggplot(dat) +
aes(x = lon, y = lat, z = value) +
stat_summary_hex(bins = 5, fun = "mean", geom = "hex") +
stat_binhex(aes(label = ..count.., fill = mean(value)), bins = 5, geom = "text") +
theme_bw()
I propose a completely different approach to this problem. However, it needs to be clarified a bit first. You write "I have measurements of a quantity (value) at specific points (lon and lat)" but you do not specify these points exactly. Your data (generated) contains 1000 lon points and the same number of lat points.
Anyway, see for yourself.
library(tidyverse)
set.seed(1)
dat <-
tibble(
lon = runif(1000, 1, 15),
lat = runif(1000, 40, 60),
value = rnorm(1000)
)
dat %>% distinct(lon) %>% nrow() #1000
dat %>% distinct(lat) %>% nrow() #1000
My guess is that for real data you have a much smaller set of values for lon and lat.
Let me round the coordinates to a grid with a spacing of 2.
grid = 2
# store the summary so that the plot below can use the mean and label columns
datg = dat %>% mutate(
lon = round(lon/grid)*grid,
lat = round(lat/grid)*grid,
) %>%
group_by(lon, lat) %>%
summarise(
mean = mean(value),
label = n()
)
datg
As you can see, after rounding, the data was grouped by these two variables and then I calculated the statistics you are interested in (mean and number of observations).
Also note that these statistics are generated at the intersections of lon and lat, so we have a square grid. In your solution, this is not the case at all: you are not getting the number of observations at these points and your grid is not square.
So let's make a graph.
datg %>% ggplot(aes(lon,lat,z=mean)) +
geom_contour_filled(binwidth = 0.25) +
geom_text(aes(label = label)) +
theme_bw()
Nothing stands in the way of increasing your grid a bit, let's say 4.
grid = 4
datg = dat %>% mutate(
lon = round(lon/grid)*grid,
lat = round(lat/grid)*grid,
) %>%
group_by(lon, lat) %>%
summarise(
mean = mean(value),
label = n()
)
datg %>% ggplot(aes(lon,lat,z=mean)) +
geom_contour_filled(binwidth = 0.25) +
geom_text(aes(label = label)) +
theme_bw()
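The rounding step used above, round(x / grid) * grid, snaps each coordinate to the nearest multiple of grid. Below is a minimal base R sketch of that step on the same simulated data. The helper names snap_to_grid and count_cells are made up for illustration and are not part of any package.

```r
set.seed(1)
dat <- data.frame(lon = runif(1000, 1, 15),
                  lat = runif(1000, 40, 60),
                  value = rnorm(1000))

# Hypothetical helper: snap values to the nearest multiple of `grid`
snap_to_grid <- function(x, grid) round(x / grid) * grid

# A few hand-picked values snapped to a spacing of 2
snapped <- snap_to_grid(c(1.2, 2.9, 3.1, 4.99, 13.7), grid = 2)
print(snapped)

# Hypothetical helper: number of distinct (lon, lat) cells for a given spacing
count_cells <- function(data, grid) {
  cells <- unique(data.frame(lon = snap_to_grid(data$lon, grid),
                             lat = snap_to_grid(data$lat, grid)))
  nrow(cells)
}

cells_2 <- count_cells(dat, 2)
cells_4 <- count_cells(dat, 4)
print(c(grid_2 = cells_2, grid_4 = cells_4))
```

A larger grid value merges more points into each cell, so there are fewer cells and each label is based on more observations.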
Using such a solution, we can easily supplement the labels in the points of interest to us, e.g. with the average value. This time we will use grid = 1.5.
grid = 1.5
datg = dat %>% mutate(
lon = round(lon/grid)*grid,
lat = round(lat/grid)*grid,
) %>%
group_by(lon, lat) %>%
summarise(
mean = mean(value),
label = n(),
lab2 = paste0("(", round(mean, 2), ")")
)
datg %>% ggplot(aes(lon,lat,z=mean)) +
geom_contour_filled(binwidth = 0.25) +
geom_text(aes(label = label)) +
geom_text(aes(label = lab2), nudge_y = -.5, size = 3) +
theme_bw()
Hope this solution fits your needs much better than the stat_binhex based solution.
The problem here is that both layers share the same fill scale, and therefore the same legend.
Because the ranges differ (roughly 0 to 40 for the counts vs. -1.5 to 0.5 for the means), the larger range makes the values of the smaller range appear in (almost) the same color.
This is why displaying the count as color works, but the opposite doesn't seem to.
As an illustration, if you rescale the mean calculation, the color variations become visible:
rescaled_mean <- function(x) mean(x)*40
ggplot(dat) +
aes(x = lon, y = lat, z = value) +
stat_summary_hex(bins = 5, fun = "rescaled_mean", geom = "hex")+
stat_binhex(aes(label = ..count..), bins = 5, geom = "text") +
theme_bw()
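To see the two ranges side by side, here is a rough base R sketch. It assumes a 5 x 5 rectangular binning via cut() as a stand-in for the hexagonal bins, so the exact numbers will differ from those in the plot, but the orders of magnitude should be comparable.

```r
set.seed(1)
dat <- data.frame(lon = runif(1000, 1, 15),
                  lat = runif(1000, 40, 60),
                  value = rnorm(1000))

# Assumption: rectangular bins instead of hexagons, 5 per axis
lon_bin <- cut(dat$lon, breaks = 5)
lat_bin <- cut(dat$lat, breaks = 5)

# Count and mean of `value` in every bin
bin_count <- tapply(dat$value, list(lon_bin, lat_bin), length)
bin_mean  <- tapply(dat$value, list(lon_bin, lat_bin), mean)

count_range <- range(bin_count, na.rm = TRUE)
mean_range  <- range(bin_mean, na.rm = TRUE)

print(count_range)  # spread of the counts
print(mean_range)   # spread of the means, much narrower
```

When both variables are mapped to one fill scale, the limits are set by the counts, and the means occupy only a thin slice of the color gradient.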
To be fair, I find this a very strange behaviour. I like your solution though - I really don't find it very hacky to add fill = NULL. On the contrary, I find it very elegant. Here is a more hacky approach that gives basically the same result, but with one more line. It uses ggnewscale.
library(ggplot2)
set.seed(1)
dat <- data.frame(lon = runif(1000, 1, 15),
lat = runif(1000, 40, 60),
value = rnorm(1000))
ggplot(dat) +
aes(x = lon, y = lat,z = value) +
stat_summary_hex(bins = 5, fun = "mean", geom = "hex") +
ggnewscale::new_scale_fill() +
stat_binhex(aes(label = ..count..), bins = 5, geom = "text")
Created on 2022-02-17 by the reprex package (v2.0.1)
I have the following data:
library(ggplot2)
library(gganimate)
library(tidyverse)
createData<- function(vintage, id){#create data
# Generate a sequence of dates from 2010-01-01 to 2025-12-31 with a quarterly frequency
Dates <- seq(from = as.Date("2010-01-01"), to = as.Date("2025-12-31"), by = "quarter")
RLG<- cumsum(sample(c(-1, 1), 64, TRUE))
df<- data.frame( Dates,RLG, vintage,id)
return(df)
}
#createData
df<- createData("2018-01-01",1) %>%
rbind(createData("2019-01-01",2))%>%
rbind(createData("2020-01-01",3)) %>%
rbind(createData("2021-01-01",4))%>%
rbind(createData("2022-01-01",5))%>%
rbind(createData("2023-01-01",6))%>%
rbind(createData("2024-01-01",7))%>%
rbind(createData("2025-01-01",8))
Which I use to make the following chart:
options(gganimate.nframes = 8*length(unique(df$vintage)), gganimate.res = 30)
p<- ggplot(df) +
aes(x = Dates, y = RLG, group = as.Date(vintage), colour = "RLG") +
geom_line()+
scale_y_continuous(labels = \(x) paste0(x, "%"))+
theme(axis.title = element_blank(),legend.position="none")+
transition_time(id)+
exit_fade(alpha = 0.5)+
shadow_mark(alpha = 0.2)
animate(p, end_pause = 30)
I would like to add a geom_rect which goes from vintage to max(Dates). At each frame, vintage will increase, so the geom_rect will shrink slightly. How can I do this without interfering with the shadow_mark and exit_fades which I am applying to the lines?
If you mean something like a progress bar, you could do it like so:
Create a data frame for the geom that is a subset of the original (the last row for each id):
df_geom <- df |>
mutate(vintage = as.Date(vintage)) |>
group_by(id) |>
slice(n())
Use geom_segment with the data frame from above.
If you want to keep shadow_mark, you can use shadow_mark(exclude_layer = 2) so that the shadow is not applied to the segment layer.
p <- ggplot(df) +
aes(x = Dates, y = RLG, group = as.Date(vintage), colour = RLG) +
geom_line()+
scale_y_continuous(labels = \(x) paste0(x, "%"))+
theme(axis.title = element_blank(),legend.position="none") +
geom_segment(
data = df_geom,
mapping = aes(x=vintage, xend=Dates,
y = 18, yend = 18),
size = 10, alpha =.4, color ='lightblue'
) +
transition_time(id)+
exit_fade(alpha = 0.5)
# shadow_mark(alpha = 0.2)
animate(p)
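As a quick check of what the one-row-per-id data frame holds, here is a base R sketch that rebuilds it by hand and measures the length of each segment. It assumes the quarterly dates and the eight vintages from the question; the column span_days is added only for illustration.

```r
# Same date sequence as in createData() from the question
dates <- seq(from = as.Date("2010-01-01"), to = as.Date("2025-12-31"), by = "quarter")
max_date <- max(dates)

# One row per id: the segment runs from the vintage to the last date
vintages <- as.Date(paste0(2018:2025, "-01-01"))
df_geom <- data.frame(id = seq_along(vintages),
                      vintage = vintages,
                      Dates = max_date)

# Illustrative column: length of each segment in days
df_geom$span_days <- as.numeric(df_geom$Dates - df_geom$vintage)
print(df_geom)
```

Because the vintage moves forward with each id while the end date stays fixed, the segment gets shorter from frame to frame.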
I would like to make a simple flow graph.
Here is my code:
library(tidyverse)
library(ggalluvial)
## Data
x = tibble(qms = c("FLOW", "FLOW"),
move1 = c("Birth", "Birth"),
move2 = c("Direct", NA),
freq = c(100, 50))
## Graph
x %>%
mutate(id = qms) %>%
to_lodes_form(axis = 2:3, id = id) %>%
na.omit() %>%
ggplot(aes(x = x, stratum = stratum, alluvium = id,
y = freq, label = stratum)) +
scale_x_discrete(expand = c(.1, .1)) +
geom_flow(aes(fill = qms),stat = "alluvium") +
geom_stratum(aes(fill = stratum), show.legend=FALSE) +
geom_text(stat = "stratum", size = 3)
This is the outcome:
My desired outcome is that:
How can I express the decreasing pattern with the missing value?
By slightly reshaping your data you can get what you want. I think the key is to map the alluvium to something fixed like 1 so that it will be a single flow, and mapping stratum to the same variable as x.
library(tidyverse)
library(ggalluvial)
x <- tibble(x = c("Birth", "Direct"),
y = c(100, 50))
x %>%
ggplot(aes(x, y, alluvium = 1, stratum = x)) +
geom_alluvium() +
geom_stratum()
Created on 2022-11-15 with reprex v2.0.2
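If you would rather derive the reshaped table from the original wide data than type it by hand, a base R sketch could look like the following. It assumes that the height of each stratum should be the summed freq of the rows that have a non-missing value at that step. Under that assumption the totals come out as 150 and 100 rather than the hand-typed 100 and 50, but the decreasing pattern is the same.

```r
# Wide data from the question
wide <- data.frame(qms   = c("FLOW", "FLOW"),
                   move1 = c("Birth", "Birth"),
                   move2 = c("Direct", NA),
                   freq  = c(100, 50))

# Stack the steps, dropping rows whose value is missing at that step
steps <- c("move1", "move2")
long <- do.call(rbind, lapply(steps, function(s) {
  keep <- !is.na(wide[[s]])
  data.frame(x = wide[[s]][keep], y = wide$freq[keep])
}))

# Assumption: stratum height = total freq that reaches that step
totals <- aggregate(y ~ x, data = long, FUN = sum)
print(totals)
```

The resulting totals table has the same x and y columns as the hand-made tibble, so it can be passed to the same ggplot call.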
I would like to visualize Vargha & Delaney's A in ggplot for educational purposes.
A is an effect size used to compare ordinal data from two groups; it depends on how each data point compares (upward, downward, or sideways) to every data point of the other group.
For this, I would like to be able to show all upward, downward, and equal comparisons of data points in different colors. For an example of what I'm looking for, check out this rough scribble
For reproducibility's sake here is some data to try it with:
library(tidyverse)
data_VD <- tibble(
A = c(1, 2, 3, 6),
B = c(1, 3, 7, 9)
)
For reference to how A is calculated, see https://journals.sagepub.com/doi/10.3102/10769986025002101, though it shouldn't be necessary for creating the plot.
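For readers who want the number behind the picture, here is a small base R sketch of the pairwise comparisons for the example data. It assumes A is taken in the direction "B exceeds A", with ties counting one half; the variable names are chosen only for illustration.

```r
# Data from the question
group_a <- c(1, 2, 3, 6)
group_b <- c(1, 3, 7, 9)

# Compare every value of A with every value of B:
# +1 = upward (B > A), 0 = tie, -1 = downward (B < A)
comparisons <- sign(outer(group_a, group_b, function(a, b) b - a))

n_up   <- sum(comparisons == 1)
n_tie  <- sum(comparisons == 0)
n_down <- sum(comparisons == -1)

# Assumed direction: share of pairs where B is larger, ties counting half
a_hat <- (n_up + 0.5 * n_tie) / length(comparisons)
print(c(up = n_up, tie = n_tie, down = n_down, A = a_hat))
```

Each of the 16 entries in the comparison matrix corresponds to one line segment in the desired plot, and its sign determines the color.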
You could do:
library(tidyverse)
long_dat <- data_VD %>%
{expand.grid(A = .$A, B = .$B)} %>%
mutate(change = factor(sign(B - A)))
ggplot(pivot_longer(data_VD, everything()), aes(x = name, y = value)) +
geom_segment(data = long_dat, size = 1.5,
aes(x = 'A', xend = 'B', y = A, yend = B, color = change)) +
geom_point(size = 4) +
scale_color_manual(values = c('#ed1e26', '#fff205', '#26b24f')) +
theme_classic(base_size = 20) +
scale_y_continuous(breaks = 1:10) +
labs(x = '', y = '') +
theme(legend.position = 'none')
I would like to overlay two ggplots from different data sources. I don't think a left_join will work, because the data frames have different lengths and the join would potentially change the underlying plots. [Maybe?]
library(tidyverse)
set.seed(123)
player_df <- tibble(name = rep(c("A","B","C","D"), each = 10, times = 1),
pos = rep(c("DEF","DEF","MID","MID"), each = 10, times = 1),
load = c(rnorm(10, mean = 200, sd = 100),
rnorm(10, mean = 300, sd = 50),
rnorm(10, mean = 400, sd = 100),
rnorm(10, mean = 500, sd = 50)))
p1 <- player_df %>%
ggplot(aes(x = load, y = name)) +
geom_point()
pos_df <- tibble(pos = rep(c("DEF","MID"), each = 30, times = 1),
load = (c(rnorm(30, mean = 250, sd = 100),
rnorm(30, mean = 350, sd = 100))))
p2 <- pos_df %>%
ggplot(aes(x = load, y = pos)) +
geom_boxplot()
p1
p2
# add p2 to every p1 player plot by pos
I would like p1 to have the corresponding p2 - by pos - appear behind it. So... add the matching p2 boxplot to each p1 scatterplot.
It's not really advisable to attempt to superimpose two plots on each other. A ggplot is made of layers already, so usually it's just a case of superimposing one geom on another. This can be difficult if (as in your case) one of the axes has different labels. However, with a little work it is possible to wrangle your data so that it all sits on a single plot. In your case, you could do something like:
levs <- c("A", "DEF", "B", "C", "MID", "D")
ggplot(within(pos_df, pos <- factor(pos, levs)), aes(x = load, y = pos)) +
geom_boxplot(width = 2.3) +
geom_point(data = within(player_df, pos <- factor(name, levs))) +
scale_y_discrete(limits = c("A", "DEF", "B", " ", "C", "MID", "D"))
I dug into ggplot a bit and rebuilt a boxplot piece by piece.
# manually calculate stats that are used in boxplots
pos_df_summary <- pos_df %>%
group_by(pos, .drop = FALSE) %>%
summarise(min = fivenum(load)[1],
Q1 = fivenum(load)[2],
median = fivenum(load)[3],
Q3 = fivenum(load)[4],
max = fivenum(load)[5]
)
# add the boxplot data to each player
joined_df <- player_df %>%
left_join(., pos_df_summary, by = "pos") %>%
distinct(name, .keep_all = TRUE)
# plot
ggplot(data = NULL, aes(group = name)) +
# create the line from min to max
geom_segment(data = joined_df, aes(y = name, yend = name, x=min, xend=max), color="black") +
#create the box with median line
geom_crossbar(data = joined_df,
aes(y = name, xmin = Q1, xmax = Q3, x = median, fill = "NA"),
color = "black",
fatten = 1) +
scale_fill_manual(values = "white") +
# add the points from the player_df
geom_point(data = player_df,
aes(x = load, y = name, group=name),
color = "red",
show.legend=FALSE) +
theme(legend.position = "none")
There may be some extraneous code in here as I cobbled it from some other resources. Specifically, I'm not sure what the aes(group = name) in the ggplot() call does exactly.
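One thing to be aware of with the hand-built version: the whisker line runs from fivenum()'s minimum to its maximum, whereas a Tukey-style boxplot stops the whiskers at the most extreme points within 1.5 times the box length. A minimal base R sketch of the difference, using a made-up vector with one outlier:

```r
# Made-up values with one obvious outlier
x <- c(1, 2, 3, 4, 5, 100)

# Five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(x)

# Base R boxplot statistics: whiskers are limited to 1.5 * box length
bp <- boxplot.stats(x)

print(five)      # whisker ends would be the raw min and max
print(bp$stats)  # upper whisker stops at the largest non-outlying value
print(bp$out)    # points a boxplot would draw separately
```

If outliers matter for your data, the values in bp$stats could replace the min and max columns in the summary step.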
I have some trouble with my code. I'm a complete beginner in R, so I would like some help. I have a data frame and I need to make a histogram and then highlight some points, but I cannot work out how to find those points in my dataset. Here is an example of what I have.
x <- c("a","b","c","d","f","g","h","i","j","k")
y <- c(197421,77506,130474,18365,30470,22518,70183,15378,29747,11148)
z <- data.frame(x,y)
hist(z$y)
For example, how can I find where the "a" and "h" values are placed in the histogram? And in a barplot? I tried the function points, but I cannot find the coordinates. Please let me know how I could do that. Thanks in advance.
Here is a way with dplyr and ggplot2. The approach is to cut the y variable into bins and then use summarise to create the counts and the labels.
library(dplyr)
library(ggplot2)
z %>%
mutate(bins = cut(y, seq(0, 200000, 50000))) %>%
group_by(bins) %>%
summarise(xes = paste0(x, collapse = ", "),
count = n()) %>%
ggplot() +
geom_bar(aes(x = bins, y = count), stat = "identity", color = "black", fill = "grey") +
geom_text(aes(x = bins, y = count + 0.5, label = xes)) +
xlab("y")
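The binning step can also be checked without any plotting. The base R sketch below uses the same breaks as above and looks up which bar "a" and "h" fall into; the variable names are chosen only for illustration.

```r
x <- c("a","b","c","d","f","g","h","i","j","k")
y <- c(197421,77506,130474,18365,30470,22518,70183,15378,29747,11148)
z <- data.frame(x, y)

breaks <- seq(0, 200000, 50000)

# Bin index per row, using the same (lower, upper] convention as cut()
z$bin <- findInterval(z$y, breaks, left.open = TRUE)

# Number of observations in each of the four bins
bin_counts <- tabulate(z$bin, nbins = length(breaks) - 1)
print(bin_counts)

# Which bar holds "a" and "h"?
highlight <- z[z$x %in% c("a", "h"), c("x", "y", "bin")]
print(highlight)
```

The bin index gives the bar position from the left, and the bin midpoint (for example, mean of breaks[bin] and breaks[bin + 1]) can serve as the x coordinate when adding text or points to a base hist() plot.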
Here is a more complicated way that makes a plot that looks more like what hist() produces.
library(tidyr)   # separate()
library(stringr) # str_remove()
z2 <- z %>%
mutate(bins = cut(y, seq(0, 200000, 50000))) %>%
group_by(bins) %>%
summarise(xes = paste0(x, collapse = ", "),
count = n()) %>%
separate(bins, into = c("start", "end"), sep = ",") %>%
mutate(across(start:end, ~as.numeric(str_remove(., "\\(|\\]"))))
ggplot() +
geom_histogram(data = z, aes(x = y), breaks = seq(0, 200000, 50000),
color = "black", fill = "grey") +
geom_text(data = z2, aes(x = (start + end) / 2, y = count + 0.5, label = xes))