Normalize/proportionalize count of individual bins 2d histogram - r

I am trying to normalize the count of individual bins in a 2d histogram. Group 3 has a substantially higher number of inputs than the other groups, but I want to compare bins across groups. So I am trying to show proportional values per bin, so that the bin counts within each group add up to e.g. 100.
I reckon that this has to be done on the dataframe beforehand. I have managed to normalize the values per group, but I haven't managed to rescale the counts so that they can be visualized this way with the 2d histogram function.
perClassNormalized <- Variables %>%
group_by(Class) %>%
mutate(Nor = procntStad/(max(procntStad)))
Variables <- dataframe with about 10 variables (columns), each with x entries per one of 5 classes. The current total counts per class are: 1 = 639, 2 = 247, 3 = 9881, 4 = 1084, 5 = 823. So the number of inputs for 3 is substantially higher than the others.
Class variable1 variable2
1     3         7
1     2         3
2     2         6
2     5         8
3     3         9
3     2         1
3     2         3
3     8         4
4     9         5
5     10        2
Example of the image I currently have:
my_breaks = c(2, 10, 50, 100, 5000)
##
procentStadVSKlasse <- ggplot(perClassNormalized , aes(x = Class, y = (Nor))) + geom_bin2d(bins = 10) +
ylab("Percentage bebouwd oppervlak") + xlab("Norm klasse regionale kering") +
labs(title = "Bebouwd oppervlak") +
scale_fill_gradient(name = "count", trans = "log", breaks = my_breaks, labels = my_breaks,
low = '#55C667FF', high = '#FDE725FF') +
theme_bw() +
scale_x_discrete(limits = c(1,2,3,4,5)) +
theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
axis.title.x = element_text(size=14),
axis.text.x = element_text(size=12),
axis.title.y = element_text(size=14))
The new image should look similar, but the normalization should hopefully make the differences between classes easier to spot.
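A minimal sketch of one way the per-class normalization described above could be approached. It reads "proportional" as "the bins within each class sum to 100", which is an assumption. The data frame `toy`, its class sizes, and the ten equal-width bin edges are all made up for illustration, and `geom_tile()` is used here on pre-computed percentages instead of `geom_bin2d()`.

```r
# Synthetic stand-in for the real data: class sizes and values are invented.
set.seed(1)
sizes <- c(`1` = 60, `2` = 25, `3` = 900, `4` = 100, `5` = 80)
toy <- data.frame(
  Class = rep(names(sizes), times = sizes),
  Nor   = runif(sum(sizes))            # assumed to lie in [0, 1]
)

# Cut y into 10 equal-width bins, then count observations per Class x bin.
toy$bin <- cut(toy$Nor, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
counts  <- table(Class = toy$Class, bin = toy$bin)

# Rescale each row (one row per class) so that its bins sum to 100.
pct <- 100 * prop.table(counts, margin = 1)

# Long format for plotting: one row per Class x bin combination.
pct_df <- as.data.frame(pct, responseName = "percent")

# Plot only if ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(pct_df, aes(x = Class, y = bin, fill = percent)) +
    geom_tile() +
    labs(fill = "% of class")
  print(p)
}
```

Because every class is rescaled to the same total, the large class no longer dominates the fill scale, so a log transformation of the fill may not be needed.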

Related

Why does gganimate fail to order lines correctly by date with transition_reveal?

I'm aiming to reproduce an animated figure by Ed Hawkins on climate change in R with gganimate. The figure is called climate spiral. While a static ggplot figure shows the correct order of lines by year (the most recent data on top), the animated plot with transition_reveal() results in a wrong order of the lines.
Here is a reproducible example code with synthetic data:
library(tidyverse)
library(lubridate)
library(gganimate)
library(RColorBrewer)
# Create monthly data from 1950 to 2020 (and a component for rising values with time)
df <- tibble(year = rep(1950:2020, each = 12),
month = rep(month.abb, 2020-1950+1)) %>%
mutate(date = dmy(paste("01",month,year)),
value = rnorm(n(), 0, 2) + row_number()*0.005) %>%
with_groups(year, mutate, value_yr = mean(value))
temp <- df %>%
ggplot(aes(x = month(date, label=T), y = value, color = value_yr)) +
geom_line(size = 0.6, aes(group = year)) +
geom_hline(yintercept = 0, color = "white") +
geom_hline(yintercept = c(-4,4), color = c("skyblue3","red1"), size = 0.2) +
geom_vline(xintercept = 1:12, color = "white", size = 0.2) +
annotate("label", x = 12.5, y = c(-4,0,4), label = c("-4°C","0°C","+4°C"),
color = c("skyblue3","white","red1"), size = 2.5, fill = "#464950",
label.size = NA, label.padding = unit(0.1, "lines")) +
geom_point(x = 1, y = -11, size = 15, color = "#464950") +
geom_label(aes(x = 1, y = -11, label = year),
color = "white", size = 4,
fill = "#464950", label.size = NA) +
coord_polar(start = 0) +
scale_color_gradientn(colors = rev(brewer.pal(n=11, name = "RdBu")),
limits = range(df$value_yr)) +
labs(x = "", y = "") +
theme_bw() +
theme(panel.background = element_blank(),
panel.border = element_blank(),
panel.grid.major = element_blank(),
plot.background=element_rect(fill="#464950", color="#464950"),
axis.text.x = element_text(margin = margin(t = -20, unit = "pt"),
color = "white"),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
legend.position = "none")
Now, we can either save the plot as PNG or animate and save as GIF:
ggsave(temp, filename = "test.png", width = 5, height = 5, dpi = 320)
# Animate by date:
anim <- temp +
transition_reveal(date) +
ease_aes('linear')
output <- animate(anim, nframes = 100, end_pause = 30,
height = 5, width = 5, units = "in", res = 300)
anim_save("test.gif", output)
Let's see the results!
Static PNG:
Animated GIF:
At first glance, the results look equal, however, the detail shows differences (for instance, the marked blue line).
In this example code with synthetic data, the differences are minor. But with real data, the figures look pretty different as many red lines (recent data points with high temperatures) disappear in the background. So, how can you retain the order in transition_reveal() by date? Any help appreciated, thanks a lot!
This isn't the answer per se. This is the why. You'll have to tell me what you prefer given this information for me to give you a solution.
I tried a few things—each of which I was just sure would work but did not. So, I wanted to see what was happening in ggplot. My hunch proved correct. Your data is in order of value_yr in the png, not year.
I repeat this question at the end:
Either you can put the animation in order of value_yr or you can put the color in ggplot in order by year. Which would you prefer?
How do I know? I extracted the assigned colors in the object.
tellMe <- ggplot_build(temp)$data[[1]]
head(tellMe)
# colour x y group PANEL flipped_aes size linetype alpha
# 1 #1E60A4 1 -1.75990067 1 1 FALSE 0.6 1 NA
# 2 #1E60A4 2 -0.08968196 1 1 FALSE 0.6 1 NA
# 3 #1E60A4 3 -0.69657130 1 1 FALSE 0.6 1 NA
# 4 #1E60A4 4 -0.10777727 1 1 FALSE 0.6 1 NA
# 5 #1E60A4 5 1.57710505 1 1 FALSE 0.6 1 NA
# 6 #1E60A4 6 1.63277369 1 1 FALSE 0.6 1 NA
gimme <- tellMe %>% group_by(group) %>%
summarise(color = unique(colour)) %>%
print(n = 100) # there are less than 100, I just want them all
head(gimme)
# # A tibble: 6 × 2
# group color
# <int> <chr>
# 1 1 #1E60A4
# 2 2 #114781
# 3 3 #175290
# 4 4 #053061
# 5 5 #1C5C9E
# 6 6 #3E8BBF
To me, this indicated that the colors weren't in group order, so I wanted to see the colors to visualize the order.
I used this function. I know it came from a demo, but I don't remember which one. I looked just so I could include that here, but I didn't find it.
# this is from a demo (not sure which one anymore!)
showCols <- function(cl=colors(), bg = "lightgrey",
cex = .75, rot = 20) {
m <- ceiling(sqrt(n <-length(cl)))
length(cl) <- m*m; cm <- matrix(cl, m)
require("grid")
grid.newpage(); vp <- viewport(w = .92, h = .92)
grid.rect(gp=gpar(fill=bg))
grid.text(cm, x = col(cm)/m, y = rev(row(cm))/m, rot = rot,
vp=vp, gp=gpar(cex = cex, col = cm))
}
showCols(gimme$color)
The top left color is the oldest year, the value below it is the following year, and so on. The most recent year is the bottom value in the right-most column.
# `yr` appears to be the question's `year` column, renamed in this answer
df %>% group_by(yr) %>% summarise(value_yr = unique(value_yr))
# they are in 'value_yr' order in ggplot, not year
# # A tibble: 71 × 2
# yr value_yr
# <int> <dbl>
# 1 1950 0.0380
# 2 1951 -0.215
# 3 1952 -0.101
# 4 1953 -0.459
# 5 1954 -0.00130
# 6 1955 0.559
# 7 1956 -0.457
# 8 1957 -0.251
# 9 1958 1.10
# 10 1959 0.282
# # … with 61 more rows
Either you can put the animation in order of value_yr or you can put the color in ggplot in order by year. Which would you prefer?
Update
You won't use transition_reveal to group and transition by the same element. Unfortunately, I can't tell you why, but it seems to get stuck at 1958!
To make this gif on the left match that ggplot png on the right:
First, I modified the calls to ggplot and geom_line
ggplot(aes(x = month(date, label = T), y = value,
group = yr, color = yr)) +
geom_line(size = .6)
Then I tried to use transition_reveal but noticed that subsequent years were layered underneath other years. I can't account for that odd behavior. When I ran showCols after changing temp, the colors were in order. That ruled out what I had thought the problem was initially.
I modified the object anim, using transition_manual to force the order of the plot layers.
anim <- temp +
transition_manual(yr, cumulative = T) +
ease_aes('linear')
That's it. Now the layers match.
As to whether this would have worked before you changed the color assignment: original plot with manual transitions of the year on the left, ggplot png on the right:
It looks like that would have worked as well. So, my original drawn-out explanation wasn't nearly as useful as I thought, but at least you have a working solution now. (Sigh.)

How to make this graph - that compares ranks - in R?

I'm trying to make a graph like the one in the picture in R. I tried the code below, but the result doesn't look the same. I want it to be symmetrical, just like the one in the picture.
My data.frame looks like this:
Group Ranking1 Ranking2 Pop
a 1 1 12345
b 2 4 127868
c 3 2 123477
d 4 3 9485
e 5 7 132588
f 6 5 38741
g 7 9 8372
h 8 11 53423
i 9 6 238419
j 10 16 31314
And the code I used was:
ggparcoord(data,
columns = 2:3, groupColumn = 1,
scale="globalminmax",
showPoints = TRUE,
title = "Ranking",
alphaLines = 0.3
) + scale_color_viridis(discrete=TRUE) + theme_ipsum()+ theme_void()
But I can't make it look like this one:
If I understand correctly what you mean with "symmetrical": You won't be able to reproduce a graph like this if the Rankings in the two columns don't match. In Ranking1 you have c(1:10), in Ranking2 you have c(1:7, 9, 11, 16).
Here's a minimal example to get closer to your goal:
Data
# Data with corrected rankings (1:10)
data <- read.table(text="
Group Ranking1 Ranking2 Pop
a 1 1 12345
b 2 4 127868
c 3 2 123477
d 4 3 9485
e 5 7 132588
f 6 5 38741
g 7 9 8372
h 8 8 53423
i 9 6 238419
j 10 10 31314
", header = TRUE)
Code
# Build plot
GGally::ggparcoord(data,
columns = 2:3, groupColumn = 1,
scale="globalminmax",
showPoints = TRUE,
title = "Ranking"
) +
# Reversed y axis with custom breaks to recreate 1:10 rankings
scale_y_reverse(breaks = 1:10)
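As a side note, the gap in the original rankings can be checked and repaired programmatically rather than by hand. The sketch below is only one possible approach: `is_dense` is a hypothetical helper name, and `rank()` produces a gap-free ranking that keeps the original order, so it differs slightly from the hand-corrected table above (g and h end up as 8 and 9 rather than 9 and 8).

```r
# Rankings as given in the question.
data <- data.frame(
  Group    = letters[1:10],
  Ranking1 = 1:10,
  Ranking2 = c(1, 4, 2, 3, 7, 5, 9, 11, 6, 16)
)

# Hypothetical helper: TRUE when a column is a permutation of 1..n (no gaps).
is_dense <- function(r) setequal(r, seq_along(r))

is_dense(data$Ranking1)  # TRUE
is_dense(data$Ranking2)  # FALSE: 8 and 10 are missing, 11 and 16 are extra

# Re-rank so the values become 1..n while preserving their relative order.
data$Ranking2_dense <- rank(data$Ranking2, ties.method = "first")
data$Ranking2_dense
# 1 4 2 3 7 5 8 9 6 10
```

With a dense column, both axes of the parallel-coordinates plot cover the same 1 to 10 range, which is what makes the plot look symmetrical.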
Edit: Making it pretty
If you want to add some pizzazz (as you were trying to do) you can do the following (no need to use theme_void()):
GGally::ggparcoord(data,
columns = 2:3, groupColumn = 1,
scale="globalminmax",
showPoints = TRUE,
title = "Ranking"
) +
# Reverses scale, adds pretty breaks
scale_y_reverse(breaks = 1:10) +
# Prettifies typography etc.
hrbrthemes::theme_ipsum() +
# Removes gridlines
theme(
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()
) +
# Removes axis labels
labs(
y = NULL,
x = NULL
)

Imposing normal distribution to column bars by factor

I have a dataframe with 3 columns and several rows, with this structure
Label Year Frequency
1 a 1 86.45
2 b 1 35.32
3 c 1 10.94
4 a 2 13.55
5 b 2 46.30
6 c 2 12.70
up until 20 years. I plot it like this:
ggplot(data=df, aes(x=df$Year, y=df$Frequency, fill=df$Label))+
geom_col(position=position_dodge2(width = 0.1, preserve = "single"))+
scale_fill_manual(name=NULL,
labels=c("A", "B", "C"),
values=c("red", "cyan", "green")) +
scale_x_continuous(breaks = seq(0, 20, by = 1),
limits = c(0, 20)) +
scale_y_continuous(expand = c(0, 0),
limits = c(0, 90),
breaks = seq(0, 90, by = 10)) +
theme_bw()
What I want to do is add three normal distributions to the plot, so that each group of data (A, B, C) can be visually compared with the normal distribution that most closely resembles it, using the same colors (the normal distribution for label A would be red, and so on).
From the example data shown here, I would expect a red distribution that is taller and narrower than the green one, which would be shorter and wider. How can I add them to the plot?

Selectively colored geom_hline

I am using hline from ggplot to construct an axis for a data set I am looking at. Essentially I want to selectively color this axis based on a dataframe. This dataframe consists of an array of positions (7684, 7685, ..., 7853), each of which corresponds to a letter "a", "b", "c", or "d". I would like to map each letter to a color and use it to color that interval on the axis.
For example, row 1 of this data frame is (7684, "c"), so I would want to color the interval on the axis from 7684 to 7685 with the color for "c", which could be red, for instance. I have yet to think of a straightforward solution, and I am not sure whether hline is the way to go.
> df
p nucleotide
1 c 7684
2 c 7685
3 t 7686
4 t 7687
5 a 7688
6 c 7689
7 a 7690
8 t 7691
9 a 7692
10 c 7693
A small snippet of what I am talking about. Basically, I want to associate df$p with colors and color the interval of the corresponding df$nucleotide.
You never use a for loop in ggplot and you should never use df$.. in an aesthetic.
library(dplyr)
library(ggplot2)
ggplot(df) +
geom_segment(aes(x = nucleotide, xend = lead(nucleotide), y = 1, yend = 1, color = p), size = 4)
#> Warning: Removed 1 rows containing missing values (geom_segment).
This takes us halfway. What it does is draw a segment from x to xend. x is mapped to the nucleotide value, and xend is mapped to lead(nucleotide), meaning the next value. This of course leaves out the last row, as it does not have a next value.
The following code takes care of that, admittedly in a hackish way, by adding a row to the df and then limiting scale_x. It may not be generalizable.
It also adds some graphical embellishment.
df %>%
add_row(p = '', nucleotide = max(.$nucleotide) + 1) %>%
ggplot() +
geom_segment(aes(x = nucleotide, xend = lead(nucleotide), y = 1, yend = 1, color = p), size = 4) +
geom_text(aes(x = nucleotide, y = 1, label = nucleotide), nudge_x = .5, size = 3) +
scale_x_continuous(breaks = NULL, limits = c(min(df$nucleotide), max(df$nucleotide) + 1)) +
scale_color_brewer(palette = 'Dark2', limits = c('a', 'c', 't'), direction = 1) +
theme(aspect.ratio = .2,
panel.background = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.title.y = element_blank())
#> Warning: Removed 1 rows containing missing values (geom_segment).
#> Warning: Removed 1 rows containing missing values (geom_text).
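To see why one segment is dropped and why the extra row helps, it can be useful to look at the shifted vector on its own. The sketch below uses a hypothetical base R helper, `lead1`, as a stand-in for `dplyr::lead()` with its default offset of 1, so it runs without any packages.

```r
nucleotide <- 7684:7693

# Hypothetical helper mimicking dplyr::lead(x) with n = 1:
# shift everything left by one position and pad the end with NA.
lead1 <- function(x) c(x[-1], NA)

# Each row is one segment from x to xend; the last row has no end point,
# which is the row ggplot removes with a warning.
segments <- data.frame(x = nucleotide, xend = lead1(nucleotide))
tail(segments, 3)

# Appending one extra position gives the last real row an end point.
# Only the appended sentinel row is now left without one.
padded <- c(nucleotide, max(nucleotide) + 1L)
segments_padded <- data.frame(x = padded, xend = lead1(padded))
tail(segments_padded, 3)
```

This is also why both warnings still appear in the second version: the sentinel row itself has an NA end point and an empty label.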
Data
df <- read.table(text = ' p nucleotide
1 c 7684
2 c 7685
3 t 7686
4 t 7687
5 a 7688
6 c 7689
7 a 7690
8 t 7691
9 a 7692
10 c 7693', header = T)

ggplot2 - create stacked histogram of proportions for individuals, and separate them by population

Essentially, I have a dataset in which I have 4 columns containing the following information: individuals ("Ind"), the geographic population to which those individuals belong ("Pop"), the proportion of their genome that belongs to cluster1 and the proportion of their genome that belongs to cluster2 (these last two add up to 1).
Example:
Ind <- c(1:20)
Pop <- rep(1:2, each = 10)
set.seed(234)
Cluster1 <- runif(20, 0.0, 1.0)
Cluster2 <- 1-Cluster1
df <- data.frame(Ind, Pop, Cluster1, Cluster2)
Data:
Ind Pop Cluster1 Cluster2
1 1 1 0.745619998 0.25438000
2 2 1 0.781712425 0.21828758
3 3 1 0.020037114 0.97996289
4 4 1 0.776085387 0.22391461
5 5 1 0.066910093 0.93308991
6 6 1 0.644795124 0.35520488
7 7 1 0.929385959 0.07061404
8 8 1 0.717642189 0.28235781
9 9 1 0.927736510 0.07226349
10 10 1 0.284230120 0.71576988
11 11 2 0.555724930 0.44427507
12 12 2 0.547701653 0.45229835
13 13 2 0.582847855 0.41715215
14 14 2 0.582989913 0.41701009
15 15 2 0.001198341 0.99880166
16 16 2 0.441117854 0.55888215
17 17 2 0.313152501 0.68684750
18 18 2 0.740014466 0.25998553
19 19 2 0.138326844 0.86167316
20 20 2 0.871777777 0.12822222
I want to produce a plot using ggplot2 that resembles the "A" panel in this figure. In this figure, each individual is a bar showing the proportion of each cluster, but the x ticks are the populations and the vertical grid lines separate these populations. I know that I can easily produce a stacked histogram if I ignore Pop and use melt(). But I would like to know how to incorporate Pop to produce an elegant plot such as the one in the link above.
Thanks!
How about melting with both Ind and Pop as id variables and graphing it with a facet_grid? It's not 100% like the plot you were looking for but gets pretty close with a few theme adjustments:
library(ggplot2)
library(reshape2)  # provides melt()
dfm <- melt(df, id = c("Ind", "Pop"))
ggplot(dfm, aes(Ind, value, fill = variable)) +
geom_bar(stat="identity", width = 1) +
facet_grid(~Pop, scales = "free_x") +
scale_y_continuous(name = "", expand = c(0, 0)) +
scale_x_continuous(name = "", expand = c(0, 0), breaks = dfm$Ind) +
theme(
panel.border = element_rect(colour = "black", size = 1, fill = NA),
strip.background = element_rect(colour = "black", size = 1),
panel.spacing = unit(0, "cm"),  # panel.margin was renamed to panel.spacing in ggplot2 2.2.0
axis.text.x = element_blank()
)
UPDATE: my example fails to cover the more complex case of multiple populations with uneven numbers of individuals. Here is a quick amendment that handles this case using the space = "free_x" argument of facet_grid. Complete code for the example:
require(ggplot2)
require(reshape2)
require(grid)
Ind <- c(1:30)
Pop <- rep(paste("Pop", 1:3), times = c(5, 15, 10))
set.seed(234)
Cluster1 <- runif(30, 0.0, 1.0)
Cluster2 <- 1-Cluster1
df <- data.frame(Ind, Pop, Cluster1, Cluster2)
dfm <- melt(df, id = c("Ind", "Pop"))
ggplot(dfm, aes(Ind, value, fill = variable)) +
geom_bar(stat="identity", width = 1) +
facet_grid(~Pop, scales = "free_x", space = "free_x") +
scale_y_continuous(name = "", expand = c(0, 0)) +
scale_x_continuous(name = "", expand = c(0, 0), breaks = dfm$Ind) +
theme(
panel.border = element_rect(colour = "black", size = 1, fill = NA),
strip.background = element_rect(colour = "black", size = 1),
panel.spacing = unit(0, "cm"),  # panel.margin was renamed to panel.spacing in ggplot2 2.2.0
axis.text.x = element_blank()
)
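Since reshape2 is no longer under active development, the melting step could also be done with tidyr. This is only a sketch and assumes tidyr is installed. The names "variable" and "value" are chosen to match melt()'s default column names, so the plotting code above should work unchanged on the result.

```r
library(tidyr)

# Same example data as above.
set.seed(234)
df <- data.frame(
  Ind      = 1:30,
  Pop      = rep(paste("Pop", 1:3), times = c(5, 15, 10)),
  Cluster1 = runif(30)
)
df$Cluster2 <- 1 - df$Cluster1

# Long format: one row per individual and cluster, keeping Ind and Pop as ids.
dfm <- pivot_longer(df,
                    cols      = c(Cluster1, Cluster2),
                    names_to  = "variable",
                    values_to = "value")

head(dfm)
```

The row order differs from melt() (rows are interleaved per individual rather than stacked per cluster), which does not matter for the stacked bars.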
