ggplot2 automatically removes missing values but does not rescale axes - r

The code below shows that ggplot2 automatically removes the 2nd observation, yet still keeps the y-axis range from 1 to 1000. How can I make ggplot2 scale the axis appropriately without hard-coding the range myself?
df <- data.frame(x = c(1, NA),
y = c(1, 1000))
ggplot(df) + geom_point(aes(x, y))

How about removing rows with missing values in x before plotting?
library(dplyr)
df %>%
filter(!is.na(x)) %>%
ggplot() +
geom_point(aes(x, y))
Or use na.omit
df %>%
na.omit() %>%
ggplot() +
geom_point(aes(x, y))
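One caveat on the second variant: na.omit() drops a row when any column is missing, not only the plotted ones. A minimal base-R sketch, assuming a hypothetical extra column z that is not part of the plot, restricts the check to x and y:

```r
# Hypothetical data: z is an extra column that is not plotted.
df <- data.frame(x = c(1, NA, 3),
                 y = c(1, 1000, 2),
                 z = c(NA, 5, 6))

# Keep rows that are complete in the plotted columns only.
plot_df <- df[complete.cases(df[, c("x", "y")]), ]

# na.omit() is stricter: it also drops row 1 because z is NA there.
strict_df <- na.omit(df)
```

plot_df can then be passed to ggplot() exactly as in the answers above, and the y axis is scaled to the remaining values.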

Labeling faceted plot with ggrepel and jittered dots

I am trying to plot three boxplots and always label the two highest values in each.
I tried creating a position object and using it to get the same position for points and labels, but somehow this does not work.
set.seed(1)
df <-
data.frame(a=rep(letters, 3),
b=LETTERS[1:3],
int=runif(78, 15, 30))
jitter_pos <- position_jitter(width=.4, seed = 1)
ggplot(df,
aes(1, int,
color=b)) +
geom_point(position=jitter_pos) +
geom_boxplot(alpha=.3, outlier.shape=NA, fill=NA, color='#993404') +
facet_wrap(~ b) +
guides(color=FALSE) +
geom_label_repel(data=df %>%
group_by(b) %>%
arrange(desc(int)) %>%
slice(1:2),
aes(label=a),
size=2.5, color='black',
fill='#FFFFFF33',
box.padding=1,
position=jitter_pos)
I am pretty sure it is just a small mistake, but I can't find it: the labels do not match the dot positions.
Maybe a better solution would be to put b on the x axis and use jitterdodge somehow, but that didn't work either, so I tried to get it running with facets. Nothing has worked for me yet.
The problem lies in how position_jitter behaves; it is discussed at length in this GitHub issue.
The solution suggested by the ggrepel author is to add an explicit label column to the data frame, with empty strings for the rows you want to omit:
library(ggplot2)
library(ggrepel)
library(tidyverse)
set.seed(1)
df <-
data.frame(a=rep(letters, 3),
b=LETTERS[1:3],
int=runif(78, 15, 30)) %>%
group_by(b) %>%
mutate(label = if_else(rank(-int) %in% 1:2, as.character(a), ""))
jitter_pos <- position_jitter(width=.4, height = 0, seed = 1)
ggplot(df,
aes(1, int,
color=b)) +
geom_jitter(position=jitter_pos) +
geom_boxplot(alpha=.3, outlier.shape=NA, fill=NA, color='#993404') +
guides(color="none") +
facet_wrap(~ b) +
geom_label_repel(aes(x=1, y=int, label=label),
size=2.5, color='black',
fill='#FFFFFF33',
box.padding=1,
position=jitter_pos)
Created on 2020-09-01 by the reprex package (v0.3.0)
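As far as I understand it, the label column helps because the label layer then has exactly the same rows as the point layer, so the seeded jitter produces the same offsets in both. The label column itself does not require dplyr; here is a minimal base-R sketch of the same idea, using ave() and rank() (my substitution, not the code from the answer):

```r
set.seed(1)
df <- data.frame(a = rep(letters, 3),
                 b = LETTERS[1:3],
                 int = runif(78, 15, 30))

# Rank int within each group of b, largest value first.
rnk <- ave(-df$int, df$b, FUN = rank)

# Keep the letter for the two largest values per group, "" elsewhere.
df$label <- ifelse(rnk <= 2, as.character(df$a), "")
```

Every row is kept, so df can be used for both geom_point() and geom_label_repel(aes(label = label)).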

Within a function, how to create a discrete axis with _repeated and ordered_ labels

I want to create a function that makes a heatmap where the y axis has unique breaks, but repeated and ordered labels. I know that this might not be great practice. I am also aware that similar questions have been asked before, for example: ggplot in R, reordering the bars. But I want to achieve these repeated and ordered labels through sorting within a function, not by typing them manually. I am aware of solutions for reordering axes based on the values of a factor (e.g., Order Bars in ggplot2 bar graph), but I don't think they apply, or I can't see how to apply them to my case, where the breaks are unique but the labels repeat.
Here is some code to reproduce the problem and some of my attempts:
Libraries and data
library(ggplot2)
library(dplyr)
library(tidyr)
set.seed(4)
id <- LETTERS[1:10]
lab <- paste(c("AB", "CD"), 1:5, sep = "_") %>%
sample(., size = 10, replace = TRUE)
val <- sample.int(n = 6, size = 10, replace = TRUE)
tes <- ifelse(val >= 4, 1, 0)
dat <- data.frame(id, lab, val, tes)
A heatmap with unique breaks on the y axis
dat2 <- dat %>% gather(kind, value, val:tes)
ggplot(dat2) +
geom_tile(aes(x = kind, y = id, fill = value), color="white", size=1)
A heatmap where the y axis is labeled with repeated labels instead of the unique breaks
This works, to the point that labels are used instead of unique ids, but the y axis is not ordered by the labels. Also, I am not sure about setting breaks and labels from the data frame in wide format (dat), rather than the data frame in long format used by ggplot (dat2).
dat2 <- dat %>% gather(kind, value, val:tes)
ggplot(dat2) +
geom_tile(aes(x = kind, y = id, fill = value), color="white", size=1) +
scale_y_discrete(breaks=dat$id, labels=dat$lab)
Mapping the vector with repeated values to the y axis obviously doesn't work
dat2 <- dat %>% gather(kind, value, val:tes)
ggplot(dat2) +
geom_tile(aes(x = kind, y = lab, fill = value), color="white", size=1)
Repeated and ordered labels, try 1
As expected, merely sorting the input data by the non-unique lab variable does not work.
dat2 <- dat %>% gather(kind, value, val:tes) %>%
arrange(lab)
ggplot(dat2) +
geom_tile(aes(x = kind, y = id, fill = value), color="white", size=1) +
scale_y_discrete(breaks=id, label=lab)
Repeated and ordered labels, try 2
Try to create a named breaks vector ordered by the (repeating) labels. This gets me nowhere. Half the labels are missing and they are still not sorted.
dat2 <- dat %>% gather(kind, value, val:tes)
brks <- setNames(dat$id, dat$lab)[sort(dat$lab)]
ggplot(dat2) +
geom_tile(aes(x = kind, y = id, fill = value), color="white", size=1) +
scale_y_discrete(breaks = brks, labels = names(brks))
Repeated and ordered labels, try 3
Starting with the data frame sorted by label, try to create an ordered factor for lab. Then sort the table by this ordered factor. No luck.
dat2 <- dat %>% gather(kind, value, val:tes) %>% arrange(lab)
dat2 <- mutate(dat2, lab_f=factor(lab, levels=sort(unique(lab)), ordered = TRUE))
dat2 <- arrange(dat2, lab_f)
# check
dat2$lab_f
ggplot(dat2) +
geom_tile(aes(x = kind, y = id, fill = value), color="white", size=1) +
scale_y_discrete(breaks = dat2$id, labels = dat2$lab_f)
A workaround, which I can use if I have to, but I am trying to avoid
We can create a combination of id and lab which will be unique and use it for the y axis
dat2 <- dat %>% gather(kind, value, val:tes) %>%
mutate(id_lab=paste(lab, id, sep="_"))
ggplot(dat2) +
geom_tile(aes(x = kind, y = id_lab, fill = value), color="white", size=1)
I must be missing something. Any help is much appreciated.
The goal is to have a function that will take an arbitrarily long table and plot a y axis with unique breaks but (possibly) repeated and ordered labels.
heat <- function(dat) {
dat2 <- dat %>% gather(kind, value, val:tes)
# any other manipulation here
ggplot(dat2) +
geom_tile(aes(x = kind, y = id, fill = value), color="white", size=1)
# scale_y_discrete() (if needed)
}
The plot I am looking for is something like this (created in inkscape)
Using limits instead of breaks sets the order:
ggplot(dat2) +
geom_tile(aes(x = kind, y = id, fill = value), color="white", size=1) +
geom_text(aes(x = 1, y = id, label = id), col = 'white') +
scale_y_discrete(limits = dat$id[order(dat$lab)], labels = sort(dat$lab))
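Since the question asks for this to happen inside a function, the ordering can be computed by a small helper. This is only a sketch: axis_order is a hypothetical name, and breaking ties by id is my assumption so that the result is deterministic.

```r
# Hypothetical helper: returns the y-axis limits (unique ids) and the
# matching, possibly repeated, labels, both ordered by label.
axis_order <- function(dat) {
  ord <- order(dat$lab, dat$id)   # sort by label, ties broken by id
  list(limits = as.character(dat$id[ord]),
       labels = as.character(dat$lab[ord]))
}

# Small made-up example with repeated labels.
dat <- data.frame(id  = c("A", "B", "C", "D"),
                  lab = c("CD_2", "AB_1", "CD_2", "AB_1"))
ax <- axis_order(dat)
```

Inside heat(), the result could be used as scale_y_discrete(limits = ax$limits, labels = ax$labels).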

How to normalize data series to start value = 0?

I have a dataset similar to this:
library(ggplot2)
data(economics_long)
economics_long$date2 <- as.numeric(economics_long$date) + 915
ggplot(economics_long, aes(date2, value01, colour = variable)) +
geom_line()
Which gives the following plot:
Now I would like to normalize it to the start value of the green line (or the mean), so that all variables start at the same point on the y axis. Similar to this:
Thanks for any help.
You could subtract the starting value of each vector depending on variable-value using by().
library(ggplot2)
l <- by(economics_long, economics_long$variable, function(x)
within(x, varnorm <- value01 - value01[1]))
dat <- do.call(rbind, l)
ggplot(dat, aes(date2, varnorm, colour = variable)) +
geom_line()
Use group_by() and mutate() to shift each variable by its initial y-value.
library(tidyverse)
data(economics_long)
economics_long %>%
# data() reloads the original data set, so date2 has to be recreated
mutate(date2 = as.numeric(date) + 915) %>%
group_by(variable) %>%
mutate(value_shifted = value01 - value01[1]) %>%
ungroup() %>%
ggplot(aes(date2, value_shifted, colour = variable)) +
geom_line()
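Both answers rely on the first row of each group being the earliest date. A minimal base-R sketch that sorts explicitly before shifting, using a small made-up data frame (long is a hypothetical stand-in for economics_long):

```r
# Hypothetical long-format data, deliberately not sorted by date.
long <- data.frame(
  variable = rep(c("a", "b"), each = 3),
  date     = c(2, 1, 3, 1, 2, 3),
  value01  = c(0.5, 0.2, 0.9, 0.6, 0.4, 0.7)
)

# Sort so that the first row of each variable is its earliest date.
long <- long[order(long$variable, long$date), ]

# Subtract each variable's starting value from the whole series.
long$value_shifted <- ave(long$value01, long$variable,
                          FUN = function(v) v - v[1])
```

After this, every variable starts at 0, and value_shifted can be mapped to y as in the answers above.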

R Highlight point on ecdf line graph

I'm creating a frequency plot using ggplot and the stat_ecdf function. I would like to add the Y-value to the graph for specific X-values, but just can't figure out how. geom_point or geom_text seems likely options, but as stat_ecdf automatically calculates Y, I don't know how to call that value in the geom_point/text mappings.
Sample code for my initial plot is:
x = as.data.frame(rnorm(100))
ggplot(x, aes(x)) +
stat_ecdf()
Now how would I add specific x-y points here, e.g. the y-value at x = -1?
The easiest way is to create the ecdf function beforehand using ecdf() from the stats package, then plot it using geom_label().
library(ggplot2)
# create a data.frame with column name
x = data.frame(col1 = rnorm(100))
# create ecdf function
e = ecdf(x$col1)
# plot the result
ggplot(x, aes(col1)) +
stat_ecdf() +
geom_label(aes(x = -1, y = e(-1)),
label = e(-1))
You can try
library(tidyverse)
# data
set.seed(123)
df = data.frame(x=rnorm(100))
# Plot
Values <- c(-1,0.5,2)
df %>%
mutate(gr=FALSE) %>%
bind_rows(data.frame(x=Values,gr=TRUE)) %>%
mutate(y=ecdf(x)(x)) %>%
mutate(xmin=min(x)) %>%
ggplot(aes(x, y)) +
stat_ecdf() +
geom_point(data=. %>% filter(gr), aes(x, y)) +
geom_segment(data=. %>% filter(gr),aes(y=y,x=xmin, xend=x,yend=y), color="red")+
geom_segment(data=. %>% filter(gr),aes(y=0,x=x, xend=x,yend=y), color="red") +
ggrepel::geom_label_repel(data=. %>% filter(gr),
aes(x, y, label=paste("x=",round(x,2),"\ny=",round(y,2))))
The idea is to compute the y values up front, together with the indicator gr specifying which Values you want to show.
Edit:
This code adds the highlighted points to the actual data, which can distort the curve. Consider excluding them at least from the ecdf layer: stat_ecdf(data=. %>% filter(!gr))
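One way to avoid touching the data at all is to combine both answers: evaluate the function returned by ecdf() on the values of interest and keep the result in a separate data frame. A minimal sketch, assuming the same three values as above (marks is a hypothetical name):

```r
set.seed(123)
x <- data.frame(col1 = rnorm(100))

# Empirical CDF of the observed data only.
e <- ecdf(x$col1)

# Points to highlight, with their y-values taken from the ECDF.
marks <- data.frame(col1 = c(-1, 0.5, 2))
marks$y <- e(marks$col1)
marks$label <- sprintf("x = %.1f\ny = %.2f", marks$col1, marks$y)
```

marks could then be supplied as data = marks to geom_point(aes(col1, y)) or geom_label(aes(col1, y, label = label)) on top of ggplot(x, aes(col1)) + stat_ecdf().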

gradient fill violin plots using ggplot2

I want to gradient fill a violin plot based on the density of points in the bins (blue for highest density and red for lowest).
I have generated a plot using the following commands but failed to color it based on density (in this case, the width of the violin). I would also like to generate box plots with similar coloring.
library("ggplot2")
data(diamonds)
ggplot(diamonds, aes(x=cut,y=carat)) + geom_violin()
To change the colour of the violin plot, map a variable to fill, like this:
ggplot(diamonds, aes(x=cut,y=carat)) + geom_violin(aes(fill=cut))
The same goes for a boxplot:
ggplot(diamonds, aes(x=cut,y=carat)) + geom_boxplot(aes(fill=cut))
However, the fill variable must have a single value for each cut. If you want to use, for example, the mean price per cut as the color variable, you have to compute it first.
With dplyr, group diamonds by cut and use summarize() to get the mean price (or any other variable):
library(dplyr)
diamonds_group <- group_by(diamonds, cut)
diamonds_group <- summarize(diamonds_group, Mean_Price = mean(price))
Then I make diamonds2 as a copy of diamonds, so that the original dataset stays untouched:
diamonds2 <- diamonds
I merge both data frames to get Mean_Price as a variable in diamonds2:
diamonds2 <- merge(diamonds2, diamonds_group)
And now I can plot it with the mean price as the color variable:
ggplot(diamonds2, aes(x=cut,y=carat)) + geom_boxplot(aes(fill=Mean_Price)) + scale_fill_gradient2(midpoint = mean(diamonds2$price))
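The summarize-then-merge steps can probably be collapsed into one. A minimal base-R sketch, using a small made-up data frame in place of diamonds (d is hypothetical), adds the group mean as a column directly with ave():

```r
# Hypothetical stand-in for diamonds with only the columns used here.
d <- data.frame(
  cut   = c("Fair", "Fair", "Ideal", "Ideal", "Ideal"),
  price = c(100, 300, 200, 400, 600)
)

# ave() returns the group mean repeated for every row of that group.
d$Mean_Price <- ave(d$price, d$cut)
```

With dplyr, the equivalent would be group_by(cut) followed by mutate(Mean_Price = mean(price)) instead of summarize() and merge().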
I just answered this in another thread, but I believe it is more appropriate here. You can create a pseudo-fill by drawing many segments. You can get those directly from the underlying data in the object returned by ggplot_build().
If you want an additional polygon outline ("border"), you'd need to create this from the x/y coordinates. Below one option.
library(tidyverse)
p <- ggplot(diamonds, aes(x=cut,y=carat)) + geom_violin()
mywidth <- .35 # bit of trial and error
# all you need for the gradient fill
vl_fill <- data.frame(ggplot_build(p)$data) %>%
mutate(xnew = x- mywidth*violinwidth, xend = x+ mywidth*violinwidth)
# the outline is a bit more convoluted, as the order matters
vl_poly <- vl_fill %>%
select(xnew, xend, y, group) %>%
pivot_longer(-c(y, group), names_to = "oldx", values_to = "x") %>%
arrange(y) %>%
split(., .$oldx) %>%
map(., function(x) {
if(all(x$oldx == "xnew")) x <- arrange(x, desc(y))
x
}) %>%
bind_rows()
ggplot() +
geom_polygon(data = vl_poly, aes(x, y, group = group),
color= "black", size = 1, fill = NA) +
geom_segment(data = vl_fill, aes(x = xnew, xend = xend, y = y, yend = y,
color = violinwidth))
Created on 2021-04-14 by the reprex package (v1.0.0)