Speed up rendering of large heatmap from ggplot in R

I am trying to plot a large heatmap, generated with ggplot, in R. Ultimately, I would like to 'polish' this heat map using Illustrator.
Sample code:
# Load packages (tidyverse)
library(tidyverse)
# Create dataframe
df <- expand.grid(x = seq(1,100000), y = seq(1,100000))
# add variable: performance
set.seed(123)
df$z <- rnorm(nrow(df))
ggplot(data = df, aes(x = x, y = y)) +
  geom_raster(aes(fill = z))
Although I save the plot as a vector image (.pdf; the file itself is not that large), the PDF loads very slowly when opened. I suspect that every individual point in the data frame is rendered when the file is opened.
I have read other posts (e.g. Data exploration in R: display heatmap of large matrix, quickly?) that use image() to visualize matrices; however, I would like to use ggplot so that I can modify the image.
Question: How do I speed up the rendering of this plot? Is there a way (besides lowering the resolution of the plot), while keeping the image vectorized, to speed this process up? Is it possible to downsample a vectorized ggplot?

The first thing I tried was stat_summary_2d to get average binning, but it seemed slow and also created some artifacts on the right and top edges:
library(tidyverse)
df <- expand.grid(x = seq(1,1000), y = seq(1,1000))
set.seed(123)
df$z <- rnorm(nrow(df))
print(object.size(df), units = "Mb")
#15.4 Mb
ggplot(data = df, aes(x = x, y = y, z = z)) +
  stat_summary_2d(bins = c(100, 100)) + # 10x downsample, in this case
  scale_x_continuous(breaks = 100*0:10) +
  labs(title = "stat_summary_2d, 1000x1000 downsampled to 100x100")
Even though this is much smaller than your suggested data, it still took about 3 seconds to plot on my machine, and it had artifacts on the top and right edges, presumably because the bins at the edges are smaller and therefore show more variation.
It got slower from there when I tried a larger grid like the one you describe.
(As an aside, it may be worth clarifying that a vector graphic file like a PDF, unlike a raster graphic, can be resized without loss of resolution. In this use case, however, the output is a 10,000-megapixel raster image, far beyond the limits of human perception, that is being exported into a vector format, where each "pixel" becomes a very tiny rectangle in the PDF. That use of a vector format could be useful in certain unusual cases, for example if you ever need to blow up your heatmap without loss of resolution onto a gigantic surface, like a football field. But it sounds like in this case it might be the wrong tool for the job, since you're putting heaps of data into the vector file that won't be perceptible.)
What worked more efficiently was to do the averaging with dplyr before calling ggplot. With that, I could take a 10k x 10k array and downsample it by a factor of 100 in each dimension before sending it to ggplot. This necessarily reduces the resolution, but I don't see the value, in this use case, of preserving resolution beyond human ability to perceive it.
Here's some code to do the bucketing ourselves and then plot the downsampled version:
# Using 10k x 10k array, 1527.1 Mb when initialized
downsample <- 100
df2 <- df %>%
  group_by(x = downsample * round(x / downsample),
           y = downsample * round(y / downsample)) %>%
  summarise(z = mean(z))

ggplot(df2, aes(x = x, y = y)) +
  geom_raster(aes(fill = z)) +
  scale_x_continuous(breaks = 1000*0:10) +
  labs(title = "10,000x10,000 downsampled to 100x100")

Your reproducible example just shows noise, so it's hard to know what kind of output you would like.
One way would be to follow @dww's suggestion and use geom_hex to show aggregated data.
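For example, a minimal sketch of that route (assuming df is the x/y/z data frame from the question; stat_summary_hex() averages z within each hexagon, whereas geom_hex() on its own would only count points, and hex binning requires the hexbin package to be installed):
library(tidyverse)
ggplot(df, aes(x = x, y = y, z = z)) +
  stat_summary_hex(fun = mean, bins = 100)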
Another way, as you ask "Is it possible to downsample a vectorized ggplot?", is to use dplyr::sample_frac or dplyr::sample_n in the data argument of your geom_raster. I have to start from a smaller data frame than in your example, though, or I can't build it.
library(tidyverse)
# Create dataframe
df <- expand.grid(x = seq(1,1000), y = seq(1,1000))
# add variable: performance
set.seed(123)
df$z <- rnorm(nrow(df))
ggplot(data = df, aes(x = x, y = y)) +
  geom_raster(aes(fill = z), data = . %>% sample_frac(0.1))
If you want to start from your high-resolution ggplot object, you can do the following for the same effect:
gg <- ggplot(data = df, aes(x = x, y = y)) +
  geom_raster(aes(fill = z))
gg$data <- sample_frac(gg$data, 0.1)
gg

Related

R crashed when using Geom_Point for large data frame

Background: I have a large data frame data_2014, containing ~ 1,000,000 rows like this
library(tidyverse)
tibble(
  date_time = "4/1/2014 0:11:00",
  Lat = 40.7690,
  Lon = -73.9549,
  Base = "B02512"
)
Problem: I want to create a plot like this
This is what I've attempted to do:
library(tidyverse)
library(ggthemes)
library(scales)
min_lat <- 40.5774
max_lat <- 40.9176
min_long <- -74.15
max_long <- -73.7004
ggplot(data_2014, aes(Lon, Lat)) +
  geom_point(size = 1, color = "chocolate") +
  scale_x_continuous(limits = c(min_long, max_long)) +
  scale_y_continuous(limits = c(min_lat, max_lat)) +
  theme_map() +
  ggtitle("NYC Map Based on Uber Rides Data (April-September 2014)")
However, when I ran this code, RStudio crashed. I'm not sure how to fix or improve this. Are there any suggestions?
A million points is a lot for ggplot2, but doable if your computer is good enough. Yours may or may not be. Short of getting a bigger computer, here's what you can do.
This is spatial data, so use the sf package.
library(sf)
data_2014 <- st_as_sf(data_2014, coords = c('Lon', 'Lat')) %>%
st_set_crs(4326)
If you're only plotting the points, get rid of the columns of data you don't need. I'm guessing they might include trip distance, time, borough, etc. Use dplyr's select, or whatever other method you're familiar with.
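A minimal sketch of that step (nothing here is specific to your columns; after st_as_sf() the coordinates live in the sticky geometry column, so keeping only that column is enough for plotting):
data_2014 <- data_2014 %>%
  select(geometry) # drops the attribute columns; the sf geometry column is kept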
Try plotting some of the data, then a little more. See where your computer slows down and stop there. You can plot rows 1:n, or sample some number of rows.
# Try starting with 100,000 rows and go up from there.
n <- 100000
ggplot(data_2014[1:n, ]) +
  geom_sf()

# Alternatively, sample a fraction of the data.
# Start with ~10% and go up until R crashes again.
data_2014 %>%
  sample_frac(0.1) %>%
  ggplot() +
  geom_sf()

Can I set free scales for aesthetics other than x and y (e.g. size) when using facet_grid?

facet_grid and facet_wrap have the scales parameter, which as far as I know allows each panel to adjust the scales of the x and/or y axis to the data being plotted. Since, according to the grammar of ggplot, x and y are just two among many aesthetics, and there is a scale for each aesthetic, I figured it would be reasonable to have the option of letting each aesthetic be free, but so far I haven't found a way to do it.
I was trying to set it in particular for size, since sometimes a variable lives in a different order of magnitude depending on the group I'm using for the facet, and having the same scale for every group blocks the possibility of seeing within-group variation.
A reproducible example:
set.seed(1)
x <- runif(20,0,1)
y <- runif(20,0,1)
groups <- c(rep('small', 10), rep('big', 10))
size_small <- runif(10,0,1)
size_big <- runif(10,0,1) * 1000
df <- data.frame(x, y, groups, sizes = c(size_small, size_big))
And an auxiliary function for plotting:
basic_plot <- function(df) ggplot(df) +
  geom_point(aes(x, y, size = sizes, color = groups)) +
  scale_color_manual(values = c('big' = 'red', 'small' = 'blue')) +
  coord_cartesian(xlim = c(0, 1), ylim = c(0, 1))
If we plot the data as is, we get the following:
basic_plot(df)
Non faceted plot
The blue dots are relatively small, but there is nothing we can do.
If we add the facet:
basic_plot(df) +
facet_grid(~groups, scales = 'free')
Faceted plot
The blue dots are still small. But I would like to take advantage of the fact that I'm dividing the data in two, and allow the size scale to adjust to the data of each facet. I would like to have something like the following:
library(gridExtra)
plot_big <- basic_plot(df[df$groups == 'big', ])
plot_small <- basic_plot(df[df$groups == 'small', ])
grid.arrange(plot_big, plot_small, ncol = 2)
What I want
Can it be done without resorting to this kind of micromanaging, or a manual rescaling of the sizes like the following?
df %>%
  group_by(groups) %>%
  mutate(maximo = max(sizes),
         sizes = scale(sizes, center = FALSE)) %>%
  basic_plot() +
  facet_grid(~groups)
I can manage to do those things; I'm just trying to see whether I'm missing another option, or misunderstanding the grammar of graphics.
Thank you for your time!
As mentioned, the original plot aesthetics are maintained when calling facet_wrap. Since you need grouped graphs, consider base::by (the data frame subsetting function) wrapped in do.call:
do.call(grid.arrange,
        args = list(grobs = by(df, df$groups, basic_plot),
                    ncol = 2,
                    top = "Grouped Point Plots"))
Should you need to share a legend, I always use this wrapper from @Steven Lockton's answer:
do.call(grid_arrange_shared_legend, by(df, df$groups, basic_plot))

Scale huge axis in R for plotting

I have a huge file that I load into one vector:
y <- scan("my_file")
My x axis is also really huge; let's say it is in the range x = 1:5000000.
My question now is: how can I scale my plot so that I can actually see something? So far I am doing the following.
UPDATE:
plot(x, y, log="x", pch=".")
However, the logarithm alone is not enough. Can I somehow scale the x axis more, for example by taking a square root, and if yes, how? Sorry if this is a simple question, but I am really new to R.
I am not sure how to attach a file, but the file I am using to fill vector y contains, as I said, just 5 million values that are 0, 1 or 2:
y=c(1,0,1,....................)
The x axis is as I mentioned above.
The second thing I tried was:
zerotwo <- data.frame(x, y)
ggplot(aes(x, y, fill = as.factor(y)), data = zerotwo) +
  geom_tile() +
  scale_x_continuous(trans = 'log2')
But here, too, fill = as.factor() doesn't do its job.
Another possibility is to rely on colour-coding to encode the value and use the y axis to put the data into different rows. 5 million tiles on the x axis is a bit much for ggplot, but 50 rows of 100k tiles work OK if the plot size is large enough. Read from left to right, then top to bottom.
# create test data
zerototwo <- data.frame(position = 1:5000000, value = sample(0:2, 5000000, replace = TRUE))
# for your data: zerototwo <- data.frame(position = 1:length(y), value = y)
zerototwo$row <- floor((zerototwo$position - 1) / 100000)
zerototwo$rowpos <- (zerototwo$position - 1) %% 100000

ggplot(aes(x = rowpos, y = row, fill = as.factor(value)), data = zerototwo) +
  geom_tile(height = 0.9) +
  scale_y_reverse()
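(As for the "taking a sqrt" part of the question, a square-root scaling needs nothing special. A rough sketch, reusing the x and y vectors from the question; in base graphics you can transform x directly, and ggplot ships a ready-made scale:)
plot(sqrt(x), y, pch = ".", xlab = "sqrt(x)") # base graphics: transform x before plotting
# in ggplot: add scale_x_sqrt(), or scale_x_continuous(trans = "sqrt"), to the plot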

Getting counts on bins in a heat map using R

This question follows from these two topics:
How to use stat_bin2d() to compute counts labels in ggplot2?
How to show the numeric cell values in heat map cells in r
In the first topic, a user wants to use stat_bin2d to generate a heatmap and then have the count of each bin written on top of the heat map. The method the user initially wants to use doesn't work; the best answer states that stat_bin2d is designed to work with geom = "rect" rather than "text". No satisfactory response is given.
The second question is almost identical to the first, with one crucial difference: the variables in the second question are text, not numeric. The answer produces the desired result, placing the count value for a bin over the bin in a stat_bin2d heat map.
To compare the two methods I've prepared the following code:
library(ggplot2)
data <- data.frame(x = rnorm(1000), y = rnorm(1000))
ggplot(data, aes(x = x, y = y)) +
  geom_bin2d() +
  stat_bin2d(geom = "text", aes(label = ..count..))
We know this first gives you the error:
"Error: geom_text requires the following missing aesthetics: x, y".
Same issue as in the first question. Interestingly, changing from stat_bin2d to stat_binhex works fine:
library(ggplot2)
data <- data.frame(x = rnorm(1000), y = rnorm(1000))
ggplot(data, aes(x = x, y = y)) +
  geom_hex() +
  stat_binhex(geom = "text", aes(label = ..count..))
Which is great and all, but generally I don't think hex binning is very clear, and for my purposes it won't work for the data I'm trying to describe. I really want to use stat_bin2d.
To get this to work, I've prepared the following workaround based on the second answer:
library(ggplot2)
data <- data.frame(x = rnorm(1000), y = rnorm(1000))
# round to whole numbers (the bin width) and convert to character,
# so the values line up with the discrete axis limits below
data$x_t <- as.character(round(data$x, 0))
data$y_t <- as.character(round(data$y, 0))
x_x <- as.character(seq(-3, 3))
y_y <- as.character(seq(-3, 3))
ggplot(data, aes(x = x_t, y = y_t)) +
  geom_bin2d() +
  stat_bin2d(geom = "text", aes(label = ..count..)) +
  scale_x_discrete(limits = x_x) +
  scale_y_discrete(limits = y_y)
This workaround allows one to bin numerical data, but to do so you have to determine the bin width (I did it via rounding) before bringing the data into ggplot. I actually figured it out while writing this question, so I may as well finish.
This is the result: (turns out I can't post images)
So my real question here is: does anyone have a better way to do this? I'm happy I at least got it to work, but so far I haven't seen an answer for putting labels on stat_bin2d bins when using a numeric variable.
Does anyone have a method for passing the x and y arguments to geom_text from stat_bin2d without having to use a workaround? Can anyone explain why it works with text variables but not with numbers?
Another workaround (but perhaps less work): similar to the ..count.. method, you can extract the counts from the plot object in two steps.
library(ggplot2)
set.seed(1)
dat <- data.frame(x = rnorm(1000), y = rnorm(1000))

# plot
p <- ggplot(dat, aes(x = x, y = y)) + geom_bin2d()

# get the layer data - this includes counts and x, y coordinates
newdat <- ggplot_build(p)$data[[1]]

# add in text labels at the centre of each bin
p + geom_text(data = newdat,
              aes((xmin + xmax) / 2, (ymin + ymax) / 2, label = count),
              col = "white")

Add multiple barchart or piechart at coordinate location in ggplot2

I cannot figure out how to add multiple barcharts (or, even better, piecharts) to one plot.
The simplest case would be to add two barcharts at different x,y locations onto a plane.
An application example would be to illustrate both the number of people living in a certain area, and the number of migrants (for lack of better example) living there as well.
By packaging this population information with spatial information, I hope to convey the corresponding information efficiently.
Solutions involving ggmaps are fine, however, I do not require them (displaying the data without a map layer in the background is acceptable).
To be more precise, here is some code that is not working as I would like it to. In particular, the bar charts are drawn as rectangles that are not stacked but overlap each other, which leads to wrongly displayed information.
Furthermore, at each location, the total height of each bar in the bar chart (or size of the pie, for that matter) should correspond to the sum of both parts.
require(ggplot2)
x <- c(1, 2, 3)
y <- c(3, 2, 4)
pop <- c(1, 7, 8)
mig <- c(1, 5, 2)
df <- rbind(x, y, pop, mig)
df <- t(df)
df <- data.frame(df)

# bring data into long format
require(reshape2)
tmp <- melt(df, id.vars = c("x", "y"))

p <- ggplot(tmp, aes(x = x, y = y, fill = variable))
p <- p + geom_rect(aes(xmin = x, xmax = x + 0.1,
                       ymin = y, ymax = y + value))
print(p)
Eventually, this should serve as input to a larger animation that visualizes the temporal development of the variables.
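As a rough illustration of the stacking requirement only (not a polished answer): one way is to accumulate the segment heights per location with dplyr, so the rectangles sit on top of each other instead of overlapping. Here tmp is the long-format data frame built above, and tmp_stacked is just an illustrative name.
library(dplyr)
tmp_stacked <- tmp %>%
  group_by(x, y) %>%
  arrange(variable, .by_group = TRUE) %>%
  mutate(ymax = y + cumsum(value),
         ymin = ymax - value) %>%
  ungroup()
ggplot(tmp_stacked) +
  geom_rect(aes(xmin = x, xmax = x + 0.1,
                ymin = ymin, ymax = ymax, fill = variable))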
