Interactive plotting of large data (a few million points) with R

I am trying to visualize several hours of neuronal recordings sampled at 500 Hz using R on Ubuntu 16.04. Simply put, I want a 2D plot that shows a value (voltage) over time. It is important for me that the plot is interactive: I need an overall look, to compare different times, and to zoom in and out, so I don't want to split my data into different parts and visualize them separately. (I also can't use the normal R plot, since zooming there is a pain and sometimes impossible.) What I came up with so far is to use "plot_ly" with the scattergl type to get started, and I could successfully plot 300,000 data points. But that is the limit I can reach so far: above this amount of data the whole R session freezes and exits. The frustrating part is that this can be done easily in MATLAB, yet with R it seems impossible. Is there any alternative to plot_ly for plotting large data in R?

You might try the dygraphs package; it works fine here with 500k points:
library(dygraphs)
my_data = data.frame(x = 1:500000, y = rnorm(500000))
dygraph(my_data) %>% dyRangeSelector()
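If the full recording (a few million samples at 500 Hz) is still too heavy for the browser, one option - not part of the original answer, and with made-up column names and decimation factor - is to thin the series before handing it to dygraphs:
library(dygraphs)
# hypothetical full recording: ~2 hours at 500 Hz (~3.6 million samples)
n <- 2 * 60 * 60 * 500
full <- data.frame(time_s = (1:n) / 500, voltage = rnorm(n))
# keep every 10th sample before plotting; adjust the factor to taste
decimate <- 10
thinned <- full[seq(1, n, by = decimate), ]
dygraph(thinned) %>% dyRangeSelector()
The range selector still gives the overall picture, and zoomed-in views could be re-rendered from the full data at a finer decimation if needed.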

Related

Cannot use plot() function in RStudio for large objects

I am trying to use the default plot() function in R to try and plot a shapefile that is about 100MB, using RStudio. When I try and plot the shapefile, the command doesn't finish executing for around 5 minutes, and when it finally does, the plotting window remains blank. When I execute the same process exactly in VS Code, the plot appears almost instantly, as expected.
I have tried uninstalling and reinstalling RStudio with no success.
I can't speak for what VS Code does, but I can guarantee that plotting 100MB worth of data points is useless (unless the final plot is going to be maybe 6 by 10 meters in size).
First thing: can you load the source file into R at all? One would hope so since that's not a grossly huge data blob. Then use your choice of reduction algorithms to get a reasonable number of points to plot, e.g. 800 by 1600, which is all a monitor can display anyway.
Next try plotting a small subset to verify the data are in a valid form, etc.
Then consider reducing the data by collapsing, say, each 10x10 region to a single average value, or by using ggplot2::geom_hex.
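As a rough sketch of the hex-binning route (the point cloud below is simulated just for illustration; geom_hex() also needs the hexbin package installed):
library(ggplot2)
# hypothetical large point cloud
pts <- data.frame(x = rnorm(2e6), y = rnorm(2e6))
# geom_hex() bins the points into hexagons and colours them by count,
# so only a few thousand polygons are actually drawn
ggplot(pts, aes(x, y)) +
  geom_hex(bins = 100)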

gganimate object too large to render

I've created a large gganim of a Lorenz curve using the packages ggplot2, gglorenz, gganimate, transformr and gifski.
I've created the gganim plot from 'wealth_lorenz', a data frame of 5 variables and ~2.5 million rows, using the code below:
lorenz_chart <- ggplot(wealth_lorenz, aes(x = value, color = Limits)) + stat_lorenz() + transition_states(Time) + facet_wrap(~Limits)
The gganim object created is 103.4MB in size.
Understandably, it takes too long to render in RStudio using animate(lorenz_chart).
Is there an alternative that could be faster to render? I understand it's a very large dataset with faceting, so it may not be possible. Ideally I'd like to include the animation in a bookdown PDF_2 using the animate package, if possible.
Thanks for any help!
The problem here is really the length of the data and the need to capture all of it. On top of that, stat_lorenz() is a very resource-intensive calculation which needs to be repeated many times, so I took another route: calculating the values of each curve up front and then plotting them as normal with geom_line(). I recommend anyone else using this function on large datasets do the same.
Thanks.
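For anyone following the same route, here is a minimal sketch of that idea - computing each Lorenz curve once and then animating plain geom_line(); the column names value, Limits and Time come from the question, everything else is an assumption:
library(ggplot2)
library(gganimate)
# compute the Lorenz curve for each Limits x Time combination once
lorenz_points <- do.call(rbind, lapply(
  split(wealth_lorenz, wealth_lorenz[c("Limits", "Time")], drop = TRUE),
  function(d) {
    v <- sort(d$value)
    data.frame(Limits = d$Limits[1], Time = d$Time[1],
               p = seq_along(v) / length(v),   # cumulative population share
               L = cumsum(v) / sum(v))         # cumulative wealth share
  }))
# then animate an ordinary line plot instead of stat_lorenz()
lorenz_chart <- ggplot(lorenz_points, aes(p, L, color = Limits)) +
  geom_line() +
  facet_wrap(~Limits) +
  transition_states(Time)
animate(lorenz_chart)
The per-group curves could also be thinned to a few hundred points each before plotting, which would shrink the gganim object further.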

R plotting strangeness with large dataset

I have a data frame with several million points in it - each having two values.
When I plot this like this:
plot(myData)
All the points are plotted, but the plot is quite busy, so I thought I'd plot it as a line:
plot(myData, type="l")
But while the x axis doesn't change (i.e. goes from 0 to 7e+07), the actual plotting stops at about 3e+07 and I don't actually get a proper line plot either.
Is there a limitation on line plotting?
Update
If I use
plot(myData, type="h")
I get correct and useable output, but I still wonder why the type="l" option fails so badly.
Further update
I am plotting a time series - the output using type="h" is perfectly usable, but having a line would allow me to compare several outputs.
Graphical representation of high-dimensional data is a growing issue in data analysis. The problem, actually, is not creating the graph; the problem is making the graph capable of communicating information that we can turn into useful knowledge. Let me illustrate this point with a dataset of a million observations, which is not that big.
x <- rnorm(10^6, 0, 1)
y <- rnorm(10^6, 0, 1)
Let's plot it. R can indeed easily handle such a task. But can we? Probably not.
After all, what kind of information can we deduce from a hard stain of ink? Probably no more than a tasseographer trying to divine the future from patterns of tea leaves, coffee grounds, or wine sediments.
plot(x, y)
A different approach is the smoothScatter function, which creates a density plot of bivariate data. Here are two examples.
First, with defaults.
smoothScatter(x, y)
Second, with the bandwidth specified to be a little larger than the default, and five points shown using a different symbol (pch = 3).
smoothScatter(x, y, bandwidth=c(5,1)/(1/3), nrpoints=5, pch=3)
As you can see, the problem is not completely solved, but we get a much better grasp of the distribution of our data. This kind of approach is still evolving, and several aspects are under active discussion. If this seems a more suitable way to represent your big dataset, I suggest you visit this blog, which discusses the issue thoroughly.
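For completeness, the colour ramp and the number of highlighted points can be tuned as well; a small variation on the example above (my own tweak, not part of the original answer):
# same data as above, with a custom colour ramp and no highlighted points
smoothScatter(x, y,
              colramp = colorRampPalette(c("white", "steelblue", "darkblue")),
              nrpoints = 0)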
For what it's worth, all the evidence I have is that this computer - even though it was a lump of big iron - ran out of memory.

Plotting graphs using R in Jupyter is slow

I am plotting heavy graphs in Jupyter using the language R. It is extremely slow, as I suspect it is first exporting the plot to EPS and then converting it to a PNG.
If you try to plot on a native R setup (R for Windows, for example), the plotting is nearly instantaneous.
Is there a way to get R in Jupyter to plot more quickly?
I came here looking for a solution to a potentially related issue - the browser window became relatively unresponsive, with lots of lag, when drawing plots with many data points, likely because everything was being rendered as vector graphics.
The fix for my problem also sped up the initial drawing of graphs by an appreciable amount. The solution was to change the Jupyter plot output type to PNG using the command:
options(jupyter.plot_mimetypes = 'image/png')
Now when I plot graphs with tens of thousands of points, the window remains crisply responsive. The downside is that the plots are now bitmaps, but you can always remove the option if you want vector graphics.
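If you are running R through IRkernel, the repr package also exposes size and resolution options that can be set alongside the mimetype; a small hedged example (the particular values are just placeholders):
# render plots as PNG and control their size and resolution
options(jupyter.plot_mimetypes = 'image/png')
options(repr.plot.width = 8,   # width in inches
        repr.plot.height = 5,  # height in inches
        repr.plot.res = 150)   # resolution of the rasterised plot in DPI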

R: update plot [xy]lims with new points() or lines() additions?

Background:
I'm running a Monte Carlo simulation to show that a particular process (a cumulative mean) does not converge over time, and often diverges wildly in simulation (the expectation of the random variable = infinity). I want to plot about 10 of these simulations on a line chart, where the x axis has the iteration number, and the y axis has the cumulative mean up to that point.
Here's my problem:
I'll run the first simulation (each sim having 10,000 iterations) and build the main plot based on its current range. But often one of the simulations will have a range a few orders of magnitude larger than the first one, so the plot flies outside of the original range. So, is there any way to dynamically update the ylim or xlim of a plot upon adding a new set of points or lines?
I can think of two workarounds for this:
1. Store each simulation, then pick the one with the largest range and build the base graph off of that (not elegant, and I'd have to store a lot of data in memory, but it would probably be laptop-friendly [[EDIT: as Marek points out, this is not a memory-intense example, but if you know of a nice solution that would support far more iterations such that it becomes an issue (think high-dimensional walks that require much, much larger MC samples for convergence), then jump right in]]).
2. Find a seed that appears to build a nice-looking version of it, and set the ylim manually, which would make the demonstration reproducible.
Naturally I'm holding out for something more elegant than my workarounds. Hoping this isn't too pedestrian a problem, since I imagine it's not uncommon with simulations in R. Any ideas?
I'm not sure if this is possible using base graphics; if someone has a solution I'd love to see it. However, graphics systems based on grid (lattice and ggplot2) allow the graphics object to be saved and updated. It's insanely easy in ggplot2.
require(ggplot2)
make some data and get the range:
foo <- as.data.frame(cbind(data=rnorm(100), numb=seq_len(100)))
make an initial ggplot object and plot it:
p <- ggplot(foo, aes(numb, data)) + geom_line()
p
make some more data and add it to the plot
foo <- as.data.frame(cbind(data=rnorm(200), numb=seq_len(200)))
p <- p + geom_line(aes(numb, data), colour = "red", data = foo)
plot the new object
p
I think (1) is the best option. I actually don't think it's inelegant; it would be more computationally intensive to redraw every time you hit a point outside the current xlim or ylim.
Also, in Peter Hoff's book on Bayesian statistics I saw a cool use of ts() instead of lines() for cumulative sums/means. It looks pretty spiffy.
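For what it's worth, a minimal sketch of option (1) combined with that ts() trick - the heavy-tailed variable and the simulation sizes are just placeholders - would be to run everything first and let plot.ts() pick a common range:
set.seed(1)
n_sims <- 10
n_iter <- 10000
# each column is the running mean of one simulation of a variable with infinite mean
sims <- replicate(n_sims, cumsum(abs(rcauchy(n_iter))) / seq_len(n_iter))
# ts() turns the matrix into a multivariate series; plot.type = "single"
# draws every column on one set of axes with a shared ylim
plot(ts(sims), plot.type = "single", col = seq_len(n_sims),
     xlab = "iteration", ylab = "cumulative mean")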
