I've created a large gganim object of a Lorenz curve using the packages ggplot2, gglorenz, gganimate, transformr and gifski.
I built the gganim plot from 'wealth_lorenz', a data frame of 5 variables and ~2.5 million rows, using the code below:
lorenz_chart <- ggplot(wealth_lorenz, aes(x = value, color = Limits)) +
  stat_lorenz() +
  transition_states(Time) +
  facet_wrap(~Limits)
The gganim object created is 103.4MB in size.
Understandably, it takes too long to render in RStudio using animate(lorenz_chart).
Is there an alternative approach that would be faster to render? I understand it's a very large dataset with faceting, so it may not be possible. Ideally I'd like to include the animation in a bookdown PDF_2 using the animate package (see here) if possible.
Thanks for any help!
The problem here really is the length of the data and the need to capture all of it. The stat_lorenz() function is a very resource-intensive calculation (which needs to be repeated many times), so I took another route: calculating the coordinates of each curve up front and then plotting them as normal using geom_line(). I recommend anyone else using this function on large datasets do the same.
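For anyone hitting the same wall, here is a minimal sketch of that idea. It assumes the value, Limits and Time columns from the question; the number of points kept per curve is an illustrative choice, not from the original code.

library(dplyr)
library(ggplot2)
library(gganimate)

lorenz_points <- wealth_lorenz %>%
  group_by(Time, Limits) %>%
  arrange(value, .by_group = TRUE) %>%
  mutate(p = row_number() / n(),               # cumulative population share
         L = cumsum(value) / sum(value)) %>%   # cumulative share of total value
  slice(unique(round(seq(1, n(), length.out = 500)))) %>%  # thin each curve to ~500 points
  ungroup()

lorenz_chart <- ggplot(lorenz_points, aes(x = p, y = L, color = Limits)) +
  geom_line() +
  geom_abline(slope = 1, linetype = "dashed") +  # line of equality
  facet_wrap(~Limits) +
  transition_states(Time)

Because the curves are pre-computed, gganimate only has to tween a few hundred points per facet rather than re-running stat_lorenz() over millions of rows for every frame.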
Thanks.
I'm trying to use R to make some maps, but I'm having problems getting anything to generate. I took the following code from a post here:
library(ggplot2)
library(sf)

eng_reg_map <-
  st_read("data/Regions_(December_2017)_Boundaries/Regions_(December_2017)_Boundaries.shp")

eng_reg_map |>
  ggplot() +
  geom_sf(fill = "white",
          colour = "black") +
  theme_void()
I have the relevant files in the right place, but when I run this code it just runs and never stops. I've waited for an hour.
Any help would be much appreciated.
As discussed in the comments, I've personally found that this problem arises for one of two reasons.
1. The shapefile you want to plot is really large. It might not be surprising that R doesn't want to plot a 30-gigabyte shapefile, but it might be surprising to you that your file is that large. You can usually get around this by reducing the number of vertices, combining like shapes, filtering out unnecessary features, etc.
2. You are trying to print the plot to the console. For some reason, actually making the map is relatively fast, but displaying it takes a really long time. I'm sure this varies computer by computer, but that has been my experience. In this case, it works best to save the plot as a PDF or similar and then view it outside of R.
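A minimal sketch of both fixes, reusing the shapefile path from the question (the dTolerance value and output file name are just illustrative assumptions):

library(ggplot2)
library(sf)

eng_reg_map <-
  st_read("data/Regions_(December_2017)_Boundaries/Regions_(December_2017)_Boundaries.shp")

# 1. Reduce the number of vertices (dTolerance is in the units of the CRS; tune to taste)
eng_reg_simple <- st_simplify(eng_reg_map, dTolerance = 1000)

# 2. Save the plot to a file instead of printing it to the console
p <- ggplot(eng_reg_simple) +
  geom_sf(fill = "white", colour = "black") +
  theme_void()

ggsave("eng_reg_map.pdf", p, width = 8, height = 10)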
I am trying to visualize several hours of neuronal recordings sampled at 500 Hz, using R on Ubuntu 16.04. Simply put, I want a 2D plot that shows a value (voltage) over time. It's important for me to have the plot be interactive: I need an overall view, to compare different times, and to zoom in and out, so I don't want to split my data into parts and visualize them separately. (I also can't use the normal R plot, since zooming there is a pain and sometimes impossible.) What I came up with so far is to use plot_ly with the scattergl type, and I could successfully plot 300,000 data points. But that is the limit I can reach: above that amount of data the whole R session freezes and exits. The frustrating part is that this can be done easily in MATLAB, while with R it seems impossible. Is there any alternative to plot_ly for plotting large data in R?
You might try the dygraphs package, which is working fine here with 500k points:
library(dygraphs)
my_data = data.frame(x = 1:500000, y = rnorm(500000))
dygraph(my_data) %>% dyRangeSelector()
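As a rough sketch of how that might look for the recording itself (the object names, sampling-rate handling and data size below are assumptions for illustration, not from the question):

library(dygraphs)

fs <- 500                                    # sampling rate in Hz
voltage <- rnorm(1e6)                        # stand-in for the recorded signal
rec <- data.frame(time = seq_along(voltage) / fs, voltage = voltage)

dygraph(rec, xlab = "time (s)", ylab = "voltage") %>%
  dyRangeSelector()                          # drag the selector to zoom, double-click to reset

For multi-hour recordings at 500 Hz you may still want to decimate the signal for the full overview and re-plot at full resolution once you have zoomed in to a shorter window.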
I have a plotting problem with the curves when using mixtools.
Using the following R code:
require(mixtools)
x <- c(rnorm(10000,8,2),rnorm(10000,18,5))
xMix <- normalmixEM(x, lambda=NULL, mu=NULL, sigma=NULL)
plot(xMix, which = 2, nclass=25)
I get a nice histogram, with the 2 normal curves estimated from the model superimposed.
The problem is with the default colours (i.e. red and green), which I need to change for a publication to be black and grey.
One way I thought of doing this was first to produce the histogram
hist(xMix$x, freq=FALSE, nclass=25)
and then add the lines using the "curve" function ... but I lost my way and couldn't solve it.
I would be grateful for any pointers or the actual solution.
Thanks.
PS. Note that there is an alternative work-around to this problem using ggplot:
Any suggestions for how I can plot mixEM type data using ggplot2
but for various reasons I need to keep using the base graphics
You can also edit the colours directly using the col2 argument in the mixtools plotting function.
For example:
plot(xMix, which = 2, nclass=25, col2=c("dimgrey","black"))
Giving the problem a bit more thought, I managed to rephrase it and ask the question in a much more direct way:
Using user-defined functions within "curve" function in R graphics
This delivered two nice solutions for using the "curve" function to draw the normal distributions produced by the mixture modelling.
The overall answer, therefore, is to use the "hist" function to draw a histogram of the raw data, then the "curve" function (incorporating the sdnorm function) to draw each normal distribution. This gives total control of the colours (and potentially any other graphical parameter).
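For reference, a minimal sketch of that approach, assuming the sdnorm() scaled-density helper from the linked question and the xMix fit from above:

# scaled normal density: one component of the mixture
sdnorm <- function(x, mean = 0, sd = 1, lambda = 1) lambda * dnorm(x, mean = mean, sd = sd)

hist(xMix$x, freq = FALSE, nclass = 25)

# one curve per fitted component, using the estimated parameters
curve(sdnorm(x, mean = xMix$mu[1], sd = xMix$sigma[1], lambda = xMix$lambda[1]),
      col = "dimgrey", lwd = 2, add = TRUE)
curve(sdnorm(x, mean = xMix$mu[2], sd = xMix$sigma[2], lambda = xMix$lambda[2]),
      col = "black", lwd = 2, add = TRUE)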
And not to forget - this is where I got the code for the sdnorm function - and other useful insights
Any suggestions for how I can plot mixEM type data using ggplot2
Thanks as always to StackOverflow and the contributors who provide such helpful advice.
I am working on making some rather involved plots that combine several data sets in R. ggplot2 is working great for this endeavor, but man is it slow. I realize that I am working with a large number of data points, but I think I have an arbitrary bottleneck somewhere. Let me explain...
I have 10 different vectors, each 150,000 entries long. I want to use ggplot2 to create a figure with these on the command line, and have the resulting png saved to disk. Each of the 10 vectors will be different colors and some will be lines and some will be bars. The code looks like this:
bulk = data.frame(vector1=c(1,5,3,5,...), ... vector10=c(5,3,77,5,3, ...))
png(filename="figure.png", width=4000, height=800)
ggplot(bulk, aes(x=vector1), aes(alpha=0.2)) +
geom_bar(aes(y=vector2), color="red", stat="identity") +
geom_bar(aes(y=vector3), color="black", stat="identity") +
..................
geom_line(aes(y=vector10), color="black", size=1) +
scale_y_log10()
Please keep in mind I have 10 vectors, each 150,000 entries long, so I have 1.5M data points to plot. I am on an 8-core, 4 GHz/core machine with 32 GB of RAM, but R is using almost no RAM and only one core. This is expected, since as far as I know this process can't be multithreaded, but should the rendering really take ~1 hour per figure?
It feels like something about my code is arbitrarily inflating the processing time, especially since the same problem with 20,000 entries per vector takes only about 20 seconds. Scaling it up takes far more than linearly scaled time.
Does anyone have a solution or a suspicion about what's going on? Thanks for any help!
If you want or need to plot that many points, you may have to use base R. ggplot2 is very slow with medium to large data sets. This issue is well known; I don't know if things have changed performance-wise since then. Using a faster machine won't make much of a difference either. Try base R: in my experience it's much, much faster, even for very large infographics and visualizations.
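A rough base-R sketch of the same kind of figure (the vectors here are random stand-ins for the question's data, not its actual columns):

n   <- 150000
x   <- seq_len(n)
y2  <- abs(rnorm(n)) + 1   # stand-in for vector2 (bar-like layer)
y10 <- abs(rnorm(n)) + 1   # stand-in for vector10 (line layer)

png(filename = "figure_base.png", width = 4000, height = 800)
plot(x, y2, type = "h", col = "red", log = "y", xlab = "", ylab = "")  # vertical bars
lines(x, y10, col = "black", lwd = 1)                                  # overlaid line
dev.off()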
Something to consider is that different geoms take more or less time, and for some reason that I can't really work out, geom_bar is one of the slowest (along with geom_area). Try using a different geom, at least when prototyping the plot; you can switch back to bars for the final production plot.
In my experience, adding the alpha argument slows down plot generation substantially.
For instance, in a project I'm currently working on, I'm plotting a map of 31,000 data points. On top of this, I add a layer of another 6,000 data points. Plotted normally, this takes 1.2 seconds. If the 6,000 data points are plotted with alpha = 0.7, it takes 12.6 seconds. Experimenting with different shape and size settings does not affect the computation time nearly as drastically.
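A quick, hedged way to see this effect on your own machine (the point counts and timings are illustrative, not the project data above):

library(ggplot2)
d <- data.frame(x = rnorm(30000), y = rnorm(30000))

system.time(print(ggplot(d, aes(x, y)) + geom_point()))             # opaque points
system.time(print(ggplot(d, aes(x, y)) + geom_point(alpha = 0.7)))  # semi-transparent points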
Background:
I'm running a Monte Carlo simulation to show that a particular process (a cumulative mean) does not converge over time and often diverges wildly in simulation (the expectation of the random variable is infinite). I want to plot about 10 of these simulations on a line chart, where the x-axis has the iteration number and the y-axis has the cumulative mean up to that point.
Here's my problem:
I'll run the first simulation (each simulation having 10,000 iterations) and build the main plot based on its current range. But often one of the later simulations will have a range a few orders of magnitude larger than the first one, so the plot flies outside the original range. So, is there any way to dynamically update the ylim or xlim of a plot upon adding a new set of points or lines?
I can think of two workarounds for this:
1. Store each simulation, then pick the one with the largest range, and build the base graph off of that (not elegant, and I'd have to store a lot of data in memory, but it would probably be laptop-friendly [[EDIT: as Marek points out, this is not a memory-intense example, but if you know of a nice solution that'd support far more iterations such that it becomes an issue (think high-dimensional walks that require much, much larger MC samples for convergence), then jump right in]]).
2. Find a seed that appears to build a nice-looking version of it, and set the ylim manually, which would make the demonstration reproducible.
Naturally I'm holding out for something more elegant than my workarounds. Hoping this isn't too pedestrian a problem, since I imagine it's not uncommon with simulations in R. Any ideas?
I'm not sure if this is possible using base graphics; if someone has a solution, I'd love to see it. However, graphics systems based on grid (lattice and ggplot2) allow the graphics object to be saved and updated. It's insanely easy in ggplot2.
library(ggplot2)
Make some data:
foo <- data.frame(data = rnorm(100), numb = seq_len(100))
Make an initial ggplot object and plot it:
p <- ggplot(foo, aes(numb, data)) + geom_line()
p
Make some more data and add it to the plot:
foo <- data.frame(data = rnorm(200), numb = seq_len(200))
p <- p + geom_line(data = foo, aes(numb, data), colour = "red")
Plot the new object:
p
I think (1) is the best option. I actually don't think it's inelegant; I think it would be more computationally intensive to redraw every time you hit a point greater than xlim or ylim.
Also, I saw in Peter Hoff's book about Bayesian statistics a cool use of ts() instead of lines() for cumulative sums/means. It looks pretty spiffy:
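A rough sketch of that idea (not the book's exact code; the Cauchy draws are just a convenient example of a diverging cumulative mean). plot() on a multivariate ts with plot.type = "single" computes a common y-range over all the simulations, which sidesteps the ylim problem above:

set.seed(1)
n_sims <- 10
n_iter <- 10000

# one column of cumulative means per simulation; Cauchy draws have no finite mean
sims <- replicate(n_sims, cumsum(rcauchy(n_iter)) / seq_len(n_iter))

plot(ts(sims), plot.type = "single", col = seq_len(n_sims),
     xlab = "iteration", ylab = "cumulative mean")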