I'm trying to use R to make some maps but I am having problems getting anything to generate. I took the following code from a post here:
library(ggplot2)
library(sf)
eng_reg_map <-
st_read("data/Regions_(December_2017)_Boundaries/Regions_(December_2017)_Boundaries.shp")
eng_reg_map |>
ggplot() +
geom_sf(fill = "white",
colour = "black") +
theme_void()
I have the relevant files in the right place, but when I run this code it just runs and never stops. I've waited for an hour.
Any help would be much appreciated.
As discussed in the comments, I usually run into this problem for one of two reasons.
The shapefile you want to plot is really large. It might not be surprising that R doesn't want to plot a 30 gigabyte shapefile, but it might be surprising to you that your file is that large. You can usually get around this by reducing the number of vertices, combining like shapes, filtering out some unnecessary features, etc.
You are trying to print the plot to the console. For some reason, building the map is relatively fast, but displaying it takes a really long time. I'm sure this varies from computer to computer, but this has been my experience. In this case, it works best to save the plot as a PDF (or another file format) and then view it outside of R.
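Both suggestions can be sketched in a few lines. This is only an illustration under assumptions: the nc.shp sample file that ships with sf stands in for the real shapefile, the 5 km tolerance and the output path are arbitrary choices, and the right tolerance for your own data will differ.

```r
library(sf)
library(ggplot2)

# Assumption: the nc.shp sample shipped with sf stands in for the real shapefile.
shp <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Project to a CRS measured in metres so that dTolerance is in metres too.
shp <- st_transform(shp, 32119)

# Reduce the number of vertices; 5000 m is an arbitrary illustrative tolerance.
shp_simple <- st_simplify(shp, preserveTopology = TRUE, dTolerance = 5000)

# Compare the in-memory size before and after simplification.
size_before <- object.size(shp)
size_after  <- object.size(shp_simple)

p <- ggplot(shp_simple) +
  geom_sf(fill = "white", colour = "black") +
  theme_void()

# Write straight to a file instead of printing to the console or plot pane.
out_file <- file.path(tempdir(), "map.pdf")
ggsave(out_file, p, width = 8, height = 5)
```

Opening the resulting PDF in an external viewer avoids the slow on-screen rendering step entirely.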
Related
I've created a large gganim of a lorenz curve using packages
ggplot2, gglorenz, gganimate, transformr and gifski.
I've created the gganim plot using 'wealth_lorenz', a df of 5 variables and ~2.5 million rows using the below code,
lorenz_chart <- ggplot(wealth_lorenz, aes(x = value, color = Limits)) + stat_lorenz() + transition_states(Time) + facet_wrap(~Limits)
The gganim object created is 103.4MB in size.
Understandably, it takes too long to render in RStudio using animate(lorenz_chart).
Is there an alternative that would be faster to run? I understand it's a very large dataset with faceting so it may not be possible. Ideally I'd like to include the animation in a bookdown PDF_2 using the animate package (see here) if possible.
Thanks for any help!
The problem here really is the length of the data and the need to capture all of it. On top of that, stat_lorenz() is a very resource-intensive calculation (which needs to be repeated many times), so I decided to take another route: calculating the formula of each curve first and then plotting as normal using geom_line(). I recommend that anyone else using this function for large datasets do the same.
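One way to read that advice is to precompute a few hundred points per Lorenz curve and hand only those to geom_line(). The sketch below is not the answerer's code: lorenz_points() is a hypothetical helper, the wealth data frame is made up, and it uses the empirical cumulative shares rather than a fitted formula.

```r
# Hypothetical helper: empirical Lorenz curve, thinned to at most n_out points.
lorenz_points <- function(x, n_out = 200) {
  x <- sort(x[!is.na(x)])
  p <- seq_along(x) / length(x)   # cumulative share of the population
  L <- cumsum(x) / sum(x)         # cumulative share of the total value
  idx <- unique(round(seq(1, length(x), length.out = min(n_out, length(x)))))
  data.frame(p = c(0, p[idx]), L = c(0, L[idx]))
}

# Made-up data standing in for 'wealth_lorenz'.
set.seed(1)
wealth <- data.frame(
  value  = rlnorm(1e5, meanlog = 10, sdlog = 1),
  Limits = sample(c("A", "B"), 1e5, replace = TRUE)
)

# One small data frame of curve points per group.
curves <- do.call(rbind, lapply(split(wealth, wealth$Limits), function(d) {
  cbind(Limits = d$Limits[1], lorenz_points(d$value))
}))

# Plot the precomputed points; only a few hundred rows reach ggplot2.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  lorenz_chart <- ggplot(curves, aes(x = p, y = L, colour = Limits)) +
    geom_line() +
    facet_wrap(~Limits)
}
```

The same thinning could be done per animation state before calling transition_states(), so that each frame draws a short line instead of recomputing the statistic on millions of rows.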
Thanks.
Matplotlib allows rasterizing individual elements of a plot and saving the result as a mixed pixel/vector graphic (.pdf) (see e.g. this answer). How can the same be achieved in R with ggplot2?
The following is a toy problem in which I would like to rasterize only the geom_point layer.
library(ggplot2)
set.seed(1)
x <- rlnorm(10000,4)
y <- 1+rpois(length(x),lambda=x/10+1/x)
z <- sample(letters[1:2],length(x), replace=TRUE)
p <- ggplot(data.frame(x,y,z),aes(x=x,y=y)) +
facet_wrap("z") +
geom_point(size=0.1,alpha=0.1) +
scale_x_log10()+scale_y_log10() +
geom_smooth(method="gam",formula = y ~ s(x, bs = "cs"))
print(p)
ggsave("out.pdf", p)
When saved as .pdf as is, Adobe Reader DC needs ~1s to render the figure. Below you can see a .png version:
Of course, it is often possible to avoid the problem by not plotting raw data.
Thanks to the ggrastr package by Viktor Petukhov & Evan Biederstedt, it is possible to rasterize individual layers. As of 2018-08-13, only geom_point and geom_tile were supported, but thanks to work by Teun van den Brand it is now possible to rasterize any individual ggplot layer by wrapping it in ggrastr::rasterise():
# install.packages('remotes')
# remotes::install_github('VPetukhov/ggrastr')
library(ggplot2)
# df is any data frame with numeric columns x and y, e.g. the data from the question
ggplot(df, aes(x=x, y=y)) +
# this layer will be rasterized:
ggrastr::rasterise(geom_point(size=0.1, alpha=0.1)) +
# this one will still be "vector":
geom_smooth()
Previously, only a few geoms were supported:
To use it, you had to replace geom_point by ggrastr::geom_point_rast.
For example:
# install.packages('devtools')
# devtools::install_github('VPetukhov/ggrastr')
library(ggplot2)
set.seed(1)
x <- rlnorm(10000, 4)
y <- 1+rpois(length(x), lambda = x/10+1/x)
z <- sample(letters[1:2], length(x), replace = TRUE)
ggplot(data.frame(x, y, z), aes(x=x, y=y)) +
facet_wrap("z") +
ggrastr::geom_point_rast(size=0.1, alpha=0.1) +
scale_x_log10() + scale_y_log10() +
geom_smooth(method="gam", formula = y ~ s(x, bs = "cs"))
ggsave("out.pdf")
This yields a PDF in which only the geom_point layer is rasterized and everything else remains a vector graphic. Overall the figure looks like the one in the question, but zooming in reveals the difference:
Compare this to an all-raster graphic:
I think you've set yourself up to not have this question answered. You write:
I expect an answer to provide an extension to ggplot2 that allows exporting plots with rasterized layers with minimal changes to existing plotting commands, i.e. as wrapper for geom_... commands or as an additional parameter to these or a ggsave command that expects a list of unevaluated parts of a plot command (every second to be rasterized), not a hacky workaround as provided in the linked question.
This is a major development effort that could easily require several weeks or more of effort by a highly skilled developer. It's unlikely anybody will do this just because of a Stack Overflow question. In lieu of a functioning implementation, I'll describe here how one could implement what you're asking for and why it's rather challenging.
The players
Let's start with the key players we'll be dealing with. At the highest level sits the ggplot2 library. It takes data frames and turns them into figures. ggplot2 itself doesn't know anything about low-level drawing, though. It only deals with lines, polygons, text, etc., which it hands off to the grid library in the form of graphics objects (grobs).
The grid library itself is a fairly high-level library. It also doesn't know much about low-level drawing. It primarily deals with lines, polygons, text, etc., which it hands off to an R graphics device. The device does the actual drawing.
There are many different R graphics devices. Enter ?Devices in an R command line to see an incomplete list. There are vector-graphics devices, such as pdf, postscript, or svg, raster devices such as png, jpeg, or tiff, and interactive devices such as X11 or quartz. Obviously, rasterization as a concept only makes sense for vector-graphics devices, since raster devices rasterize everything anyway. Importantly, neither ggplot2 nor grid knows or cares which graphics device you're currently drawing on. They deal with graphical objects that can be drawn on any device.
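A small illustration of that last point, using only the grid package that ships with R: the same grob is built once and then drawn on two different devices. The grob contents and file names are made up for the example, and the png step is skipped if that device is not available on the installation.

```r
library(grid)

# Build a grob tree once; nothing here refers to any particular device.
g <- gTree(children = gList(
  rectGrob(gp = gpar(fill = "grey95", col = NA)),
  pointsGrob(x = unit(c(0.2, 0.5, 0.8), "npc"),
             y = unit(c(0.3, 0.7, 0.4), "npc"), pch = 16),
  textGrob("same grob, different devices", y = unit(0.9, "npc"))
))

# Draw it on a vector device ...
pdf_file <- tempfile(fileext = ".pdf")
pdf(pdf_file, width = 4, height = 3)
grid.draw(g)
dev.off()

# ... and on a raster device, if this R installation provides one.
png_file <- tempfile(fileext = ".png")
if (capabilities("png")) {
  png(png_file, width = 400, height = 300)
  grid.draw(g)
  dev.off()
}
```

Only the device decides whether the points end up as vector shapes or pixels, which is why a per-layer rasterization switch has no natural place in this pipeline.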
Ideal high-level interface
The high-level interface should consist of an option rasterize in the layer() function of ggplot2. In this way, one could simply write, e.g., geom_point(rasterize = TRUE) to rasterize the points layer. This would work transparently for all geoms and stats, since they all call layer().
Possible implementations
I see four possible routes of implementation, ordered from most impossible to least.
1. Ideally, the layer() function would simply hand off the rasterize option to the grid library, which would hand it off to the graphics device to tell it which parts of the plot to rasterize. This approach would require major changes in the graphics device API. I don't see this happening. Not in my lifetime, at least.
2. Alternatively, one could write a new grob type that can take any arbitrary grob and rasterize it on demand when the grob is drawn on a graphics device. This approach would not require changes in the graphics device API, but it would require detailed knowledge of the low-level implementation of the grid library. It would also possibly make interactive viewing of such figures very slow.
3. A slightly simpler alternative to 2. would be to rasterize the arbitrary grob only once, on grob construction, and then reuse whenever that grob is drawn. This would be quite a bit faster on interactive graphics devices but the drawing would get distorted if the aspect ratio is changed interactively. Nevertheless, since the primary use of this functionality would be to generate pdf output (I assume), this option might be sufficient.
4. Finally, rasterization could also happen in the layer() function, and that function could simply place a regular raster grob into the grob tree. That solution is similar to the technique described here. Technically, it's not much different from 3. Either way, one needs to write code to rasterize a grob tree and then replace it by a raster grob.
Technical hurdles
To rasterize parts of the grob tree, we'd have to send them to an R raster graphics device to render. However, there isn't one that renders to memory. So, one would have to render to a temporary file (e.g., using png()), and then read the file back in. That's possible but ugly. It also depends on functionality (such as png()) that isn't guaranteed to be available on every R installation.
Second, to render parts of the grob tree separately from the overall rendering, we'll have to open a new graphics device in addition to the one currently open. That's possible but can lead to unexpected bugs. I'm dealing with such bugs all the time, see e.g. here or here for issues related to code using this technique. Whoever implements the rasterization functionality would have to deal with such issues.
Finally, we'll have to get the rasterization code accepted into the ggplot2 library, since we need to replace the layer() function and I don't think there's a way to do that from a separate package. Given how hackish the rasterization solutions are going to be (see previous two paragraphs), that may be a tall order.
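The temporary-file route described under "Technical hurdles" can be sketched roughly as follows. This is an illustration under assumptions, not an existing ggplot2 or grid feature: rasterize_grob() is a hypothetical helper, it relies on the png() device and on the png package for reading the file back, and the fixed width, height and dpi mean the result will be distorted if the target region has a different aspect ratio.

```r
library(grid)

# Hypothetical helper: render a grob to a temporary PNG and return a raster grob.
rasterize_grob <- function(grob, width = 4, height = 3, dpi = 150) {
  stopifnot(requireNamespace("png", quietly = TRUE))
  tmp <- tempfile(fileext = ".png")
  on.exit(unlink(tmp), add = TRUE)

  prev <- dev.cur()                # remember the device that was active
  png(tmp, width = width, height = height, units = "in", res = dpi,
      bg = "transparent")          # open a second, temporary device
  grid.draw(grob)
  dev.off()
  if (prev > 1) dev.set(prev)      # switch back to the original device

  # Read the pixels back in and wrap them in an ordinary raster grob.
  rasterGrob(png::readPNG(tmp), width = unit(1, "npc"), height = unit(1, "npc"))
}

# Usage: many points become one image, the line stays a vector object.
if (requireNamespace("png", quietly = TRUE) && capabilities("png")) {
  pts <- pointsGrob(x = unit(runif(5000), "npc"), y = unit(runif(5000), "npc"),
                    pch = 16, size = unit(1, "mm"), gp = gpar(alpha = 0.1))
  mixed_file <- tempfile(fileext = ".pdf")
  pdf(mixed_file, width = 4, height = 3)
  grid.draw(rasterize_grob(pts))
  grid.draw(linesGrob(gp = gpar(col = "red")))
  dev.off()
}
```

The explicit bookkeeping around dev.cur() and dev.set() is exactly the kind of second-device handling that tends to cause the bugs mentioned above.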
I am working on making some rather involved plots that combine several data sets in R. ggplot2 is working great for this endeavor, but man is it slow. I realize that I am working with a large number of data points, but I think I have an arbitrary bottleneck somewhere. Let me explain...
I have 10 different vectors, each 150,000 entries long. I want to use ggplot2 to create a figure with these on the command line, and have the resulting png saved to disk. Each of the 10 vectors will be different colors and some will be lines and some will be bars. The code looks like this:
bulk = data.frame(vector1=c(1,5,3,5,...), ... vector10=c(5,3,77,5,3, ...))
png(filename="figure.png", width=4000, height=800)
ggplot(bulk, aes(x=vector1), aes(alpha=0.2)) +
geom_bar(aes(y=vector2), color="red", stat="identity") +
geom_bar(aes(y=vector3), color="black", stat="identity") +
..................
geom_line(aes(y=vector10), color="black", size=1) +
scale_y_log10()
Please keep in mind I have 10 vectors, each 150,000 entries long, so I have 1.5M data points to plot. However, I am on an 8-core, 4 GHz/core machine with 32 GB RAM, but R is using almost no RAM and only 1 core. This is expected, since as far as I know this process can't be multithreaded, but should the rendering really take ~1 hour per figure?
It feels like something about my code is arbitrarily inflating this processing time. Especially since the same problem with 20,000 entries per 10 vectors only takes about 20 seconds. Scaling it up takes way more than linearly scaled time.
Does anyone have a solution or a suspicion about what is going on? Thanks for any help!
If you want or need to plot that many points, you have to use base R. ggplot is very slow with medium to large data sets. This issue is known; I don't know whether performance has changed since then. Using a faster machine won't make much of a difference either. Try base R. In my experience it's much faster, even for very large infographics and visualizations.
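A rough base R equivalent of the plot in the question might look like the sketch below. The data are made up (the question's vectors are not available), only one bar series and one line are drawn, and type = "h" (one vertical segment per value) is used as a cheap stand-in for bars; plot_bulk_base() is a hypothetical helper name.

```r
# Made-up data standing in for the 'bulk' data frame from the question.
set.seed(42)
n <- 150000
bulk <- data.frame(
  index = seq_len(n),
  bars  = rlnorm(n, meanlog = 2),
  line  = rlnorm(n, meanlog = 3)
)

# Hypothetical helper: draw the data with base graphics straight into a PNG.
plot_bulk_base <- function(df, file) {
  png(filename = file, width = 4000, height = 800)
  on.exit(dev.off(), add = TRUE)   # always close the device again
  plot(df$index, df$bars, type = "h", col = "red", log = "y",
       ylim = range(df$bars, df$line), xlab = "index", ylab = "value")
  lines(df$index, df$line, col = "black")
  invisible(file)
}

out_png <- tempfile(fileext = ".png")
if (capabilities("png")) {
  timing <- system.time(plot_bulk_base(bulk, out_png))
}
```

Wrapping the call in system.time() makes it easy to compare against the ggplot2 version on the same machine.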
Something to consider is that different geoms take more or less time, and for some reason that I can't really work out, geom_bar is one of the slowest (along with geom_area). Try using a different geom, at least when prototyping the plot. You can switch to bar for the final production plot.
In my experience, it seems like adding the alpha argument slows down the plot generation substantially.
For instance, in a project I'm currently working on, I'm plotting a map of 31 000 data points. On top of this, I add a layer of another 6000 data points. If plotted normally, this takes 1.2 seconds. If the 6000 data points are plotted with alpha=0.7, it takes 12.6 seconds. Experimenting with different shape and size settings does not affect the computation time nearly as drastically.
I don't know if you have seen unwanted bold-face text like in the picture below:
As you can see, the third line is bold-faced while the others are not. This happens to me when I try to use ggplot() with lapply() or especially mclapply() to make the same chart template based on different data and put all the results as different charts in a single PDF file.
One solution is to avoid using lapply(x, f) when f() is a function that returns a ggplot() plot, but I have to do so for combining charts (i.e. as input for grid.arrange()) in some situations.
Sorry that I'm not able to provide a reproducible example. I tried really hard but was not successful: the code and data are too big, with several nested functions, and when I reduced the complexity to make a reproducible example, the problem did not happen.
I asked the question because I guessed that someone might have faced the same problem and know how to solve it.
My intuition is that it's not actually being printed in bold, but rather double-printed for some reason, which then looks bold. This would explain why it doesn't come up with a simpler example. Especially given your mention of nested functions and probably other complicated structures where it's easy to get an off-by-one or similar error, I would try doing something where you can see exactly what's being plotted -- perhaps by examining the length() of the return value from apply().
Changing the order of elements of the vector, so that the order of the elements in the key is different, may also help. If you consistently get the bold-face on the last element, that also tells you a little bit more about where something is going wrong.
As @Dinre also mentioned, it could also be related to your plotting device. You could try changing your plotting device. I have my doubts about this though, seeing as it's not a consistent problem. You could also try changing the position of the key, which, depending on your plotting device and settings, may move you in or out of a compression block, thus changing which artifacts crop up.
A reproducible example and a solution might look as follows:
library(ggplot2)
d <- data.frame(x=1:10, y=1:10)
ggplot(data = d, aes(x=x, y=y)) +
geom_point() +
geom_text(aes(3,7,label = 'some text 10 times')) +
geom_text(data = data.frame(x=1,y=1),
aes(7,3, label = 'some text one time'))
When a label is added with geom_text() and x and y are set manually inside aes(), the layer still inherits the full data frame. The same label is therefore printed once per row of the data (here, 10 times), and the overplotting looks like bold face. Supplying a one-row data frame through the data argument of geom_text() forces the label to be drawn only once.
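An alternative that avoids the dummy data frame is ggplot2's annotate(), which does not inherit the plot data. The sketch below reuses the toy data from above and counts the rows each text layer ends up drawing with layer_data(); the object names are only illustrative.

```r
library(ggplot2)

d <- data.frame(x = 1:10, y = 1:10)

# The text layer inherits all 10 rows of d, so the label is drawn 10 times.
p_overplotted <- ggplot(d, aes(x = x, y = y)) +
  geom_point() +
  geom_text(aes(3, 7, label = "some text 10 times"))

# annotate() builds its own one-row data, so the label is drawn once.
p_single <- ggplot(d, aes(x = x, y = y)) +
  geom_point() +
  annotate("text", x = 7, y = 3, label = "some text one time")

# Number of labels each text layer (layer 2) actually draws.
n_overplotted <- nrow(layer_data(p_overplotted, 2))
n_single      <- nrow(layer_data(p_single, 2))
```

Checking layer_data() like this is also a quick way to confirm the double-printing hypothesis on a real plot before changing any code.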
I recently found this web page, Crime in Downtown Houston, that I'm interested in reproducing. This is my first learning experience with mapping in R, and I thus lack the vocabulary and understanding necessary to make appropriate decisions.
At the end of the page David Kahle states:
One last point might be helpful. In making these kinds of plots, one
might be tempted to use the map raster file itself as a background. This
method can be used to make map plots much more quickly than the
methods described above. However, the method has one very significant
disadvantage which, if not handled properly, can destroy the entire
purpose of using the map.
In very plain English what is the difference between the raster file
approach and his approach?
Does the RgoogleMaps package have the ability to produce these types
of high quality maps as seen on the page I referenced above that
calls a google map into R?
I ask not because I lack information but the opposite. There's too much and I want to make a good decision(s) about the approach to pursue so I'm not wasting my time on outdated or inefficient techniques.
Feel free to pass along any readings you think would benefit me.
Thank you in advance for your direction.
Basically, you had two options at the time this plot was made:
draw the map as a layer using geom_tile, where each pixel of the image is mapped onto the x,y axes (slow but accurate)
add a background image to the plot, as a purely "cosmetic" annotation. This method is faster, because you can use grid.raster which draws images more efficiently, but the image is not constrained by the axes of the plotting region. In other words, you have to manually adjust the x and y axes limits to make sure that the image corresponds to the actual positions on the plot.
Now, I would suggest you look at the new annotation_raster in ggplot2 v. 0.9.0. It should have the advantage of speed and leaner output files, and still conform to the data space of the plot. I believe that this function, as well as geom_raster and annotation_map, did not exist when David made those plots.
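A minimal sketch of annotation_raster, assuming a small synthetic colour matrix in place of a real map tile and made-up point coordinates: the image is pinned to a bounding box given in data coordinates, so it stays aligned with the points when the axes change.

```r
library(ggplot2)

# Assumption: a synthetic 10 x 10 colour image stands in for a downloaded map tile.
img <- as.raster(matrix(hcl.colors(100, "Terrain"), nrow = 10))

# Made-up point locations (longitude / latitude).
pts <- data.frame(
  lon = c(-95.40, -95.36, -95.33),
  lat = c(29.74, 29.76, 29.78)
)

# The raster is placed by its bounding box in data coordinates.
p <- ggplot(pts, aes(x = lon, y = lat)) +
  annotation_raster(img, xmin = -95.42, xmax = -95.30,
                    ymin = 29.72, ymax = 29.80) +
  geom_point(colour = "red", size = 3) +
  coord_cartesian(xlim = c(-95.42, -95.30), ylim = c(29.72, 29.80))
```

For a real map, the bounding box would come from the tile's known extent rather than being typed in by hand.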