NA value breaks ggplot2 heatmap? - r

I'm using ggplot2 to generate a heatmap, but NA values cause the heatmap to be all one color.
Example dataframe:
id<-as.factor(c(1:5))
year<-as.factor(c("Y13", "Y14", "Y15"))
freq<-c(26, 137, 166, 194, 126, 8, 4, 76, 20, 92, 4, NA, 6, 6, 17)
test<-data.frame(id, year, freq)
test
id year freq
1 Y13 26
2 Y14 137
3 Y15 166
4 Y13 194
5 Y14 126
1 Y15 8
2 Y13 4
3 Y14 76
4 Y15 20
5 Y13 92
1 Y14 4
2 Y15 NA
3 Y13 6
4 Y14 6
5 Y15 17
I used the following for the heatmap:
library(ggplot2)
library(RColorBrewer)
# set color palette
jBuPuFun <- colorRampPalette(brewer.pal(n = 9, "RdBu"))
paletteSize <- 256
jBuPuPalette <- jBuPuFun(paletteSize)
# heatmap
ggplot(test, aes(x = year, y = id, fill = freq)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
geom_tile() +
scale_fill_gradient2(low = jBuPuPalette[1],
mid = jBuPuPalette[paletteSize/2],
high = jBuPuPalette[paletteSize],
midpoint = (max(test$freq) + min(test$freq)) / 2,
name = "Number of Violations")
The result is a gray color over the entire heatmap.
When I removed the "NA" from the dataframe, the heatmap renders correctly.
I've experimented with this by explicitly assigning a color to the NA values, for example:
scale_fill_gradient2(low = jBuPuPalette[1],
mid = jBuPuPalette[paletteSize/2],
high = jBuPuPalette[paletteSize],
na.value="yellow",
midpoint = (max(test$freq) + min(test$freq)) / 2,
name = "Number of Violations")
However, that just made the entire heatmap yellow.
Am I missing something obvious? Any suggestions are appreciated.
Thanks.

Comment to answer:
ggplot deals with NAs just fine, but the defaults for min and max are to return NA if the vector contains any NA. You just need to set na.rm = TRUE for these when you define the midpoint of your scale:
midpoint = (max(test$freq, na.rm = TRUE) + min(test$freq, na.rm = TRUE)) / 2,
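As a quick check of the point above, here is a minimal sketch using only base R and the freq vector from the question. It shows that the midpoint comes out as NA by default, which is presumably why every tile ends up in the NA colour, and that na.rm = TRUE gives a usable number. The variable names (mid_default, mid_fixed, mid_range) are made up for illustration.

```r
# Same freq vector as in the question, including the single NA
freq <- c(26, 137, 166, 194, 126, 8, 4, 76, 20, 92, 4, NA, 6, 6, 17)

# Default behaviour: any NA in the input makes max() and min() return NA
mid_default <- (max(freq) + min(freq)) / 2

# With na.rm = TRUE the NA is dropped before taking the extremes
mid_fixed <- (max(freq, na.rm = TRUE) + min(freq, na.rm = TRUE)) / 2

# Equivalent shorthand: range() returns c(min, max)
mid_range <- mean(range(freq, na.rm = TRUE))

print(mid_default)  # NA
print(mid_fixed)    # 99
```

Either mid_fixed or mid_range could then be passed as the midpoint argument of scale_fill_gradient2.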

Related

Using ggplot to plot number of TRUE statements from a df

I'm trying to plot the number of rows in a df column for which a condition is TRUE.
I have a df that looks like this
Speed Month_1
12 67
12 114
12 155
12 44
13 77
13 165
13 114
13 177
...
And I would like to plot a bargraph where we have x = Speed and y = Number of rows that are above 100 in Month_1 column.
So for X = 12 I would have a bargraph with a Y-value of 2 and for X = 13 I would have a Y-value of 3.
Can I do this directly in ggplot, or do I have to create a new DF first?
Sure, just filter out the values that are not above 100 in the data you pass to ggplot and do a normal geom_bar:
ggplot(df[df$Month_1 > 100, ], aes(factor(Speed))) +
geom_bar(width = 0.5, fill = 'deepskyblue4') +
theme_bw(base_size = 16) +
labs(x = 'Speed')
If, for some reason, you really need to pass the full data frame without filtering it, you can fill the < 100 values with a fully transparent colour:
ggplot(df, aes(factor(Speed), fill = Month_1 > 100)) +
geom_bar(width = 0.5) +
theme_bw(base_size = 16) +
scale_fill_manual(values = c('#00000000', 'deepskyblue4')) +
labs(x = 'Speed') +
theme(legend.position = 'none')
You can use dplyr to filter your data frame and then plot it with ggplot.
library(tidyverse)
df <- tibble(Speed = c(12, 12, 12, 12, 13, 13, 13, 13),
Month_1 = c(67, 114, 155, 44, 77, 165, 114, 177))
df %>% filter(Month_1 > 100) %>%
ggplot(aes(x = Speed)) + geom_bar()
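To confirm what the bars should show before plotting, the counts can be computed directly. This is a small base R sketch on the example rows from the question; the name counts is only illustrative.

```r
# Example rows from the question
df <- data.frame(Speed = c(12, 12, 12, 12, 13, 13, 13, 13),
                 Month_1 = c(67, 114, 155, 44, 77, 165, 114, 177))

# Month_1 > 100 is a logical vector; summing it per Speed counts the TRUEs
counts <- tapply(df$Month_1 > 100, df$Speed, sum)

print(counts)  # Speed 12 has 2 rows above 100, Speed 13 has 3
```

These are the bar heights that geom_bar produces from the filtered data, since geom_bar counts rows per x value.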

Define each label in heatmap clearly in ggplot in R

I have the following data frame:
ID position hum_chr_pos CHROM a1 a2 a3 a4 ID_rn
rs1 197_V 897738 1 0.343442666 0.074361225 1 0.028854932 1
rs3 1582_N 2114271 2 0.015863115 1 0.003432604 0.840242328 2
rs6 2266_I 79522907 3 0.177445544 0.090282782 1 0.038199399 3
rs8 521_D 86959173 4 0.542804846 0.088721027 1 0.047758851 4
rs98 1368_G 92252015 5 0.02861059 0.979995611 0.007545923 1 5
rs23 540_A 96162102 5 0.343781806 0.062643599 1 0.024992095 6
rs43 2358_S 147351955 6 0.042592955 0.862087128 0.013001476 1 7
rs65 577_E 168572720 6 0.517111734 0.080471431 1 0.034521778 8
rs602 1932_T 169483561 6 0.043270585 1 0.009731403 0.988762282 9
rs601 1932_T 169511878 6 0.042963813 0.911392392 0.010562154 1 10
rs603 1932_T 169513583 6 0.04096538 0.956129216 0.010983517 1 11
rs606 1936_T 169513573 7 0.04838 0.0126129216 0.090983517 1 12
rs609 1935_T 169513574 7 0.056 0.045 0.086 1 13
I created a heatmap of the values a1, a2, a3, a4 using this code:
df_melt <- melt(dummy, id.vars=c("ID", "position","hum_chr_pos","CHROM","ID_rn"))
pos <- df_melt %>%
group_by(CHROM) %>%
summarize(avg = round(mean(ID_rn))) %>%
pull(avg)
ggplot(df_melt, aes(x=variable, y=ID_rn)) + geom_tile(aes(fill=value))+theme_bw()+
scale_fill_gradient2(low="lightblue", mid="white", high="darkblue", midpoint=0.5, limits=range(df_melt$value))+
theme_classic()+ labs(title="graph", x= "a", fill = "value")+
ylab("CHROM") +
scale_y_discrete(limits = pos,labels = unique(limits = pos,df_melt$CHROM))
I would like to see the separation between the factors on the y axis more clearly. At the moment it is not clear which rows belong to which label, so I would like something that visibly marks the boundaries between the groups.
It is also odd that the numbers are sometimes not centered on their group. For example, the 5 and 7 on the y axis are not centered.
I have searched for how to do this but couldn't find anything.
You could use geom_hline
ggplot(df_melt, aes(x = variable, y = ID_rn)) +
geom_tile(aes(fill = value)) +
theme_bw() +
scale_fill_gradient2(low = "lightblue", mid = "white", high = "darkblue",
midpoint = 0.5, limits = range(df_melt$value)) +
theme_classic() +
labs(title="graph", x= "a", fill = "value", y = "CHROM") +
scale_y_discrete(limits = c(1, 2, 3, 4, 5.5, 9, 12.5),
labels = unique(df_melt$CHROM)) +
geom_hline(yintercept = c(1, 2, 3, 4, 6, 11, 13) + 0.5, color = 'red')
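The hard-coded positions in this answer can also be derived from the data. The sketch below uses base R and assumes, as in the example, that ID_rn numbers the rows 1 to 13 in the order they are listed. The vectors chrom and id_rn stand in for the CHROM and ID_rn columns, and label_pos and boundaries are illustrative names.

```r
# CHROM and ID_rn columns from the example data frame
chrom <- c(1, 2, 3, 4, 5, 5, 6, 6, 6, 6, 6, 7, 7)
id_rn <- seq_along(chrom)

# Label position: the mean row index of each chromosome group (not rounded)
label_pos <- tapply(id_rn, chrom, mean)

# Separator position: half a row above the last row of each group
boundaries <- tapply(id_rn, chrom, max) + 0.5

print(label_pos)   # 1 2 3 4 5.5 9 12.5
print(boundaries)  # 1.5 2.5 3.5 4.5 6.5 11.5 13.5
```

The unrounded means 5.5 and 12.5 differ from what round(mean(ID_rn)) gives in the question's code, which may explain why the labels for 5 and 7 looked off-center there.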

Imposing normal distribution to column bars by factor

I have a dataframe with 3 columns and several rows, with this structure
Label Year Frequency
1 a 1 86.45
2 b 1 35.32
3 c 1 10.94
4 a 2 13.55
5 b 2 46.30
6 c 2 12.70
up until 20 years. I plot it like this:
ggplot(data = df, aes(x = Year, y = Frequency, fill = Label)) +
geom_col(position=position_dodge2(width = 0.1, preserve = "single"))+
scale_fill_manual(name=NULL,
labels=c("A", "B", "C"),
values=c("red", "cyan", "green")) +
scale_x_continuous(breaks = seq(0, 20, by = 1),
limits = c(0, 20)) +
scale_y_continuous(expand = c(0, 0),
limits = c(0, 90),
breaks = seq(0, 90, by = 10)) +
theme_bw()
What I want to do is add three normal distributions to the plot, so that each group of data (A, B, C) can be visually compared with the normal distribution most similar to its own distribution, using the same colors (the normal distribution for label A will be red, and so on).
From the data used here as an example, I would expect to see a red distribution that is taller and narrower than the green distribution, which will be shorter and wider. How can I add them to the plot?

How can I keep 0 values in ggplot's legend after log transformation?

How can I add an NA or "0" label to the graph legend if I used trans="log" in scale_fill_viridis or another continuous scale?
> d1
persona num.de.puntos puntos
1 p1 1 3
2 p1 2 4
3 p1 3 0
4 p1 4 4
5 p1 5 2
6 p2 1 2
7 p2 2 3
8 p2 3 0
9 p2 4 0
10 p2 5 4
11 p3 1 0
12 p3 2 1
13 p3 3 0
14 p3 4 5
15 p3 5 8
p <- ggplot(d1, aes(persona, num.de.puntos, fill = puntos)) +
scale_fill_viridis(trans="log", option ="D", direction=1, na.value = "gray50",
breaks=c(0,1,5,8),
name="Number of people",
guide=guide_legend(label.position = "bottom",
title.position = 'top',
nrow=1,
label.theme = element_text(size = 6,
face = "bold",
color = "black"),
title.theme = element_text(size = 6,
face = "bold",
color = "black"))) +
geom_tile(colour="grey", show.legend = TRUE)
p
I want the legend to include an entry for the 0 values as well.
Note: Code below is run in R 3.5.1 and ggplot2 3.1.0. Your code uses scale_fill_viridis (provided by the viridis package) instead of ggplot2's scale_fill_viridis_c, so you might be using an older version of the ggplot2 package.
TL;DR solution:
I'm pretty sure this is not orthodox, but assuming your plot is named p, setting:
p$scales$scales[[1]]$is_discrete <- function() TRUE
will get the NA value in your legend without changing any of the existing fill aesthetic mappings.
Demonstration:
p <- ggplot(d1, aes(persona, num.de.puntos, fill = puntos)) +
scale_fill_viridis_c(trans="log", option ="D", direction=1,
na.value = "gray50", # optional, change NA colour here
breaks = c(0, 1, 5, 8),
labels = c("NA label", "1", "5", "8"), # optional, change NA label here
name="Number of people",
guide=guide_legend(label.position = "bottom",
title.position = 'top',
nrow=1,
label.theme = element_text(size = 6,
face = "bold",
color = "black"),
title.theme = element_text(size = 6,
face = "bold",
color = "black"))) +
geom_tile(colour = "grey", show.legend = TRUE)
p # no NA mapping
p$scales$scales[[1]]$is_discrete <- function() TRUE
p # has NA mapping
Explanation:
I dug deep into the plotting mechanics by converting p into a grob object (via ggplotGrob), and running debug() on functions that affect the legend generation part of the plotting process.
After debugging through ggplot2:::ggplot_gtable.ggplot_built, ggplot2:::build_guides, and ggplot2:::guides_train, I got to ggplot2:::guide_train.legend, an un-exported function from the ggplot2 package:
> ggplot2:::guide_train.legend
function (guide, scale, aesthetic = NULL)
{
breaks <- scale$get_breaks()
if (length(breaks) == 0 || all(is.na(breaks))) {
return()
}
key <- as.data.frame(setNames(list(scale$map(breaks)), aesthetic %||%
scale$aesthetics[1]), stringsAsFactors = FALSE)
key$.label <- scale$get_labels(breaks)
if (!scale$is_discrete()) {
limits <- scale$get_limits()
noob <- !is.na(breaks) & limits[1] <= breaks & breaks <=
limits[2]
key <- key[noob, , drop = FALSE]
}
if (guide$reverse)
key <- key[nrow(key):1, ]
guide$key <- key
guide$hash <- with(guide, digest::digest(list(title, key$.label,
direction, name)))
guide
}
I saw that up until the key$.label <- scale$get_labels(breaks) step, key looks like this:
> key
fill .label
1 gray50 NA label
2 #440154 1
3 #72CC59 5
4 #FDE725 8
But scale$is_discrete() is FALSE, so !scale$is_discrete() is TRUE, and the row in key corresponding to the NA value gets dropped in the next step, since there is an NA value in the fill scale's breaks after log transformation:
> scale$get_breaks()
[1] NA 0.000000 1.609438 2.079442
Thus, if we can have scale$is_discrete() evaluate to TRUE instead of FALSE, this step would be skipped, and we end up with the full legend including the NA value.
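The filtering step can be reproduced outside ggplot2 with a few lines of base R. This is only a sketch of the logic quoted above, not the library's own code path. It assumes that the non-finite break log(0) is reported as NA, as in the get_breaks() output, and that the limits run from log(1) to log(8), the finite range of the transformed data.

```r
# Breaks requested in the scale, on the original data scale
breaks_raw <- c(0, 1, 5, 8)

# After the log transformation, log(0) is -Inf
breaks_log <- log(breaks_raw)

# Assumption: the scale reports the non-finite break as NA,
# matching the get_breaks() output shown above
breaks <- ifelse(is.finite(breaks_log), breaks_log, NA)

# Assumption: limits span the finite transformed data, log(1) to log(8)
limits <- c(log(1), log(8))

# Same condition as in guide_train.legend: keep non-NA breaks inside the limits
noob <- !is.na(breaks) & limits[1] <= breaks & breaks <= limits[2]

print(noob)  # FALSE TRUE TRUE TRUE, so the first (NA) key row is dropped
```

Skipping this filter, which is what forcing is_discrete() to return TRUE does, keeps all four rows of the key.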

How to define day from 6am to 6am on x axis in ggplot?

I am trying to do a bar chart of an aggregate, by the hour.
hourly <- data.frame(
hour = 0:23,
N = 7+0:23,
hour.mod = c(18:23, 0:17))
The day is from 6am to 6am, so I added an offset, hour.mod, and then:
ggplot(hourly, aes(x = hour.mod, y = N)) +
geom_col() +
labs(x = "6am to 6am", y = "Count")
Except that the x-axis tick at 0 contradicts the label. While tinkering with scales, I found that scale_x_discrete(breaks = c(6, 10, 14, 18, 22)) made the axis labels disappear altogether, which works for now but is sub-optimal.
How do I specify that the x axis should start at an hour other than 0 or 23? Is there a way to do so without creating an offset column? I am a novice, so please assume you are explaining to the village idiot.
You don't say what you want to see, but it's fairly clear that you should be using scale_x_continuous and shifting your labels somehow, either "by hand" or with some simple math:
ggplot(hourly, aes(x = hour.mod, y = N)) +
geom_col() +
labs(x = "6am to 6am", y = "Count") +
scale_x_continuous(breaks= c(0,4,8,12,16), labels = c(6, 10, 14, 18, 22) )
Or perhaps:
ggplot(hourly, aes(x = hour.mod, y = N)) +
geom_col() +
labs(x = "6am to 6am", y = "Count") +
scale_x_continuous(breaks= c(6, 10, 14, 18, 22)-6, # shifts all values lower
labels = c(6, 10, 14, 18, 22) )
It's possible you need to use modulo arithmetic, which in R involves the use of %% and %/%:
1:24 %% 12
[1] 1 2 3 4 5 6 7 8 9 10 11 0 1 2 3 4 5 6 7 8 9 10 11 0
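Applied to this question, the modulo operator can generate both the offset column and the shifted break positions. The sketch below is base R only and assumes the day starts at 6am; hour_mod, brk_labels, and brk_pos are illustrative names.

```r
hour <- 0:23

# Shift so that 6am becomes 0; %% wraps negative values back into 0..23
hour_mod <- (hour - 6) %% 24

# Clock hours to show on the axis, and where they fall on the shifted scale
brk_labels <- c(6, 10, 14, 18, 22)
brk_pos <- (brk_labels - 6) %% 24

print(hour_mod)  # 18 19 20 21 22 23 0 1 ... 17
print(brk_pos)   # 0 4 8 12 16
```

Here hour_mod reproduces the hand-written c(18:23, 0:17) column from the question, and brk_pos and brk_labels match the breaks and labels passed to scale_x_continuous above.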
