I am considering switching from R, Python, and SAS to Julia and am wondering whether Julia has convenient tools for out-of-core operations. I intend to use Julia on datasets of 10-20 GB, so I would like to be able to manipulate them without loading them into RAM. Is there any package that allows Julia to 'just work' with larger-than-RAM data the way SAS functions do?
JuliaDB is a dataframe package in Julia which allows for out-of-core computations, online (streaming) statistics, and parallelism.
Related
I'm working with more than 500 GB of raster data in RStudio.
My code works fine, but the problem is that R writes all the raster data into a temporary folder, which means the computation takes more than 4 days (even on an SSD). Is there a way to make the processing faster?
I'm working on a computer with 64 GB of RAM and a 1.5 GB SSD.
I don't know Sentinel-2, so it's hard to give specific advice on performance. Basically, you have to try to (a) use parallel computation with the foreach and doParallel packages, (b) find better-suited packages to work with, or (c) reduce the complexity of the problem, rather than settling for unhelpful answers like 'R is not suited for large datasets'.
A) One solution is parallel computing, if you can divide your calculations (e.g., your problem consists of many independent calculations and you simply write out the results). For example, with the foreach and doParallel packages, processing many temporal networks is much faster than with a 'normal' serial for-loop (e.g., foreach/doParallel are very useful for computing basic statistics for each member of a network and for the global network, whenever you need to repeat these computations over many sub-networks or many 'networks at time T' and .combine the results into one large dataset). That .combine argument is useless for a single 500 GB network, so you would have to write the results out one by one, and it will still take a long time (4 days might become several hours of parallel computation, assuming the parallel version is 6 or 7 times faster than your current one). A minimal sketch follows after point B.
B) Sometimes it is simply a matter of identifying a more suitable package, as with text-mining computations and the performance offered by the quanteda package. I prefer to do text mining in tidyverse style, but for large datasets, and before migrating to another language than R, quanteda is very powerful and fast, even on large collections of texts. In this example, if quanteda is too slow for a basic text-mining task on your dataset, you have to migrate to another technology, or stop attempting 'death computing' and reduce the complexity of your problem / solution / dataset size (e.g., quanteda is not - yet - fast at fitting a GloVe model on a very large 500 GB corpus; you are reaching the limits of the methods the package offers, so you have to try a language other than R: libraries in Python or Java such as spaCy will handle a GloVe model on a very large dataset better than R, and it's not a very big step from R).
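A minimal sketch of point A, assuming the work can be split into independent chunks (the chunks vector and the process_chunk() function below are hypothetical placeholders for your real per-tile computation):
library(foreach)
library(doParallel)
# Hypothetical placeholders: a set of raster tiles / sub-problems and a
# function that processes one of them independently of the others.
chunks <- 1:8
process_chunk <- function(i) sqrt(i)   # replace with the real per-chunk work
cl <- makeCluster(4)                   # as many workers as your machine allows
registerDoParallel(cl)
results <- foreach(i = chunks, .combine = c) %dopar% {
  process_chunk(i)
}
stopCluster(cl)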
I would suggest trying the terra package; it has pretty much the same functions as raster, but it can be much faster.
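A minimal terra sketch, assuming a hypothetical GeoTIFF path; terra reads the raster lazily and writes the result to disk in chunks rather than holding everything in RAM:
library(terra)
# Hypothetical input file; rast() does not load the cell values into memory
r <- rast("sentinel_band.tif")
# Cell-wise computation, streamed to disk chunk by chunk
scaled <- app(r, fun = function(x) x * 0.0001,
              filename = "scaled.tif", overwrite = TRUE)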
Is there a convenient way to call R functions from Julia?
If so, what mechanisms exist for doing so? (Potentially ranging from simply calling an R script from the shell and hand-coding the I/O to/from Julia, to interacting with an R environment over multiple Julia calls, with Julia DataFrames seamlessly converted to/from R data frames.)
Calling R scripts and hand-coding the I/O is the best way to work with R for the moment. We have functions for reading the RDA binary format that R likes, and we should add tools for working with it more easily, as well as for writing data in that format, which would speed up I/O considerably relative to passing CSV files around -- which is what I've done in the past.
Converting between R and Julia DataFrames could be done, but it would be quite costly, since Julia isn't using a binary representation of data (e.g. NA) that's anywhere near equivalent to R's. So you'd need to do some non-trivial work to make this substantially more efficient than using the RDA binary format.
One thing that would be really nice is to build solid Thrift bindings for both R and Julia and then call back and forth using those bindings.
For calling out to R from within Julia, the RCall package is currently your best bet. For calling out to Julia from within R, try the RJulia package. Both are still works in progress.
Do you know if there are any plans to introduce parallel programming in R for all packages?
I'm aware of some developments, such as Revolution R and parallel-programming packages, but they seem to offer specialised functions that replace the most popular ones (linear programming, etc.). However, one of the great things about R is the huge number of specialised packages that pop up every day and make complex, time-consuming analyses very easy to run. Many of these use very popular functions such as the generalised linear model, but also use the results for additional calculations and comparisons before finally formatting the output. As far as I understand, you need to define which parts of a function can run in parallel, which is probably why most specialised R packages don't have this functionality and cannot have it unless their code is edited.
Are there any plans (or any packages) to enable the most popular R functions to run in parallel, so that all the less popular functions built on them can run in parallel as well? For example, the package difR uses the glm function for most of its functions; if glm were enabled to run in parallel (or re-written and released in a new R version) on all multi-processor machines, there would be no need to re-write the difR package, which could then run some of its most burdensome procedures with the aid of parallel processing on a Windows PC.
I completely agree with Paul's answer.
In addition, a general system for parallelization needs some very non-trivial calibration, even for those functions that can be easily parallelized: What if you have a call stack of several functions that offer parallel computation (e.g. you are bootstrapping some model fitting, the model fitting may already offer parallelization and low level linear algebra can be implicitly parallel)? You need to estimate (or choose manually) at which level explicit parallelization should be done. In addition, you possibly have implicit parallelization, so you need to trade off between these.
However, there is one particularly easy and general way to parallelize computations implicitly in R: linear algebra can be parallelized and sped up considerably by using an optimized BLAS. Using this can (depending on your system) be as easy as telling your package manager to install the optimized BLAS and R will use it. Once it is linked to R, all packages that use the base linear algebra functions like %*%, crossprod, solve etc. will profit.
See e.g. Dirk Eddelbüttel's gcbd package and its vignette, and also discussions how to use GotoBLAS2 / OpenBLAS.
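A quick way to check which BLAS/LAPACK your R is linked against, and to see the kind of operation that benefits (timings will of course depend on your system):
# Which BLAS/LAPACK is this R session using?
sessionInfo()     # recent R versions print the BLAS/LAPACK paths
La_version()      # LAPACK version in use
# Linear algebra that profits automatically from an optimized BLAS
n <- 2000
m <- matrix(rnorm(n * n), n, n)
system.time(m %*% m)
system.time(crossprod(m))
system.time(solve(m))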
How to parallelize a certain problem is often non-trivial. Therefore, a specific implementation has to be made in each and every case, in this case for each R package. So, I do not think a general implementation of parallel processing in R will be made, or is even possible.
I have noticed that R only uses one core while executing one of my programs which requires lots of calculations. I would like to take advantage of my multi-core processor to make my program run faster.
I have not yet investigated the question in depth, but I would appreciate your comments, because I do not have a strong background in computer science and it is difficult for me to find easily understandable information on the subject.
Is there a package that allows R to automatically use several cores when needed?
I guess it is not that simple.
R can only make use of multiple cores with the help of add-on packages, and only for some types of operation. The options are discussed in detail in the High Performance Computing task view on CRAN.
Update: From R version 2.14.0, add-on packages are not necessarily required, thanks to the inclusion of the parallel package as a recommended package shipped with R. parallel incorporates functionality from the multicore and snow packages, largely unchanged.
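A minimal sketch using only the bundled parallel package; a socket cluster also works on Windows, unlike forking:
library(parallel)
cl <- makeCluster(detectCores() - 1)             # leave one core for the OS
res <- parLapply(cl, 1:100, function(i) sqrt(i))
stopCluster(cl)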
The easiest way to take advantage of multiple processors is the multicore package, which includes the function mclapply(). mclapply() is a multicore version of lapply(), so any process that can use lapply() can easily be converted to an mclapply() process. However, multicore does not work on Windows. I wrote a blog post about this last year which might be helpful. The doSMP package that Revolution Analytics created is NOT a multi-threaded version of R; it's effectively a Windows version of multicore.
If your work is embarrassingly parallel, it's a good idea to get comfortable with the lapply() style of structuring your code. That gives you an easy segue into mclapply() and even into distributed computing using the same abstraction.
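For example, converting a lapply() call is usually a one-line change (mclapply() lives in the parallel package nowadays, and on Windows it silently falls back to a single core):
library(parallel)
slow_fit <- function(i) { Sys.sleep(0.1); i^2 }   # stand-in for real work
y1 <- lapply(1:20, slow_fit)                      # serial
y2 <- mclapply(1:20, slow_fit, mc.cores = 4)      # forked, Unix/Mac only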
Things get much more difficult for operations that are not "embarrassingly parallel".
[EDIT]
As a side note, RStudio is getting increasingly popular as a front end for R. I love RStudio and use it daily. However, it needs to be noted that RStudio does not play nicely with multicore (at least as of October 2011; I understand that the RStudio team is going to fix this). This is because RStudio does some forking behind the scenes, and these forks conflict with multicore's attempts to fork. So if you need multicore, you can write your code in RStudio, but run it in a plain-vanilla R session.
On this question you always get very short answers. The easiest solution, in my opinion, is the snowfall package, which is based on snow, for example on a single Windows computer with multiple cores. See also the article by Knaus et al. for a simple example. snowfall is a wrapper around the snow package and lets you set up a multicore cluster with a few commands. It's definitely less hassle than most of the other packages (I didn't try all of them).
On a side note, there are indeed only a few tasks that can be parallelized, for the simple reason that you have to be able to split up the work before multicore computation makes sense. The apply family is obviously a logical choice for this: multiple, independent computations, which is crucial for multicore use. Anything else is not always that easily parallelized.
Read also this discussion on sfApply and custom functions.
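A minimal snowfall sketch on a single multi-core machine (socket workers, so objects used by the workers have to be exported explicitly):
library(snowfall)
sfInit(parallel = TRUE, cpus = 4)       # start 4 local workers
f <- function(i) mean(rnorm(1e5, mean = i))
sfExport("f")                           # make f available on the workers
res <- sfLapply(1:8, f)
sfStop()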
Microsoft R Open includes multi-threaded math libraries to improve the performance of R. It works on Windows, Unix, and Mac, it's open source, and it can be installed in a separate directory alongside an existing R installation (from CRAN). You can also use the popular RStudio IDE with it. From its inception, R was designed to use only a single thread (processor) at a time, and even today R works that way unless it is linked with multi-threaded BLAS/LAPACK libraries.
The multi-core machines of today offer parallel processing power. To take advantage of this, Microsoft R Open includes multi-threaded math libraries.
These libraries make it possible for many common R operations, such as matrix multiplication/inversion, matrix decomposition, and some higher-level matrix operations, to compute in parallel, using all of the available processing power to reduce computation times.
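A simple way to see whether the multi-threaded libraries are actually being used is to time a large matrix operation and watch CPU usage across cores; the speed-up depends on the machine:
# With a multi-threaded BLAS these calls should use several cores at once
n <- 4000
a <- matrix(rnorm(n * n), n, n)
system.time(a %*% a)               # matrix multiplication
system.time(chol(crossprod(a)))    # decomposition of a positive-definite matrix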
Please check the links below:
https://mran.revolutionanalytics.com/rro/#about-rro
http://www.r-bloggers.com/using-microsoft-r-open-with-rstudio/
As David Heffernan said, take a look at the Revolution Analytics blog. But you should know that most packages are for Linux, so if you use Windows it will be much harder.
Anyway, take a look at these sites:
Revolution: here you will find a lecture about parallelization in R. The lecture is actually very good, but, as I said, most tips are for Linux.
And this thread on Stack Overflow discusses some implementations on Windows.
The package future makes it extremely simple to work in R using parallel and distributed processing. More info here. If you want to apply a function to elements in parallel, the future.apply package provides a quick way to use the "apply" family functions (e.g. apply(), lapply(), and vapply()) in parallel.
Example:
library("future.apply")
library("stats")
x <- 1:10
# Single core
y <- lapply(x, FUN = quantile, probs = 1:3/4)
# Multicore in parallel (recent versions of future deprecate plan(multiprocess);
# plan(multisession) is the portable equivalent)
plan(multisession)
y <- future_lapply(x, FUN = quantile, probs = 1:3/4)
I need a relatively efficient way to share data between Matlab and R.
I have checked SaveR and MATLAB R-link, but SaveR formats Matlab's binary data as text strings first and then prints them to an ASCII file, which is not efficient for large datasets, and MATLAB R-link only works on Windows (it uses a COM-based interface).
Update:
Dirk has posted a list of what seem to be better solutions to this problem than SaveR and Matlab R-link. I also learned recently about RAM disks (see here and here for some implementation examples), and thought that they might further facilitate sharing large datasets between Matlab and R (or similar computational environments). This leads me to the following questions:
Assuming that the data fits in the machine's memory in Matlab's or R's native data containers:
Are any of the solutions listed so far a better fit for RAM disks?
Are there any additional considerations to take into account when dealing with RAM disks instead of secondary-storage solutions?
A couple of ideas, with the caveat that I know more about the R side of things:
The R.matlab package on CRAN can help: it provides methods to read and write MAT files, and it also makes it possible to communicate (evaluate code, send and retrieve objects, etc.) with Matlab v6 or higher running locally or on a remote host (a minimal sketch follows after this list).
HDF5, as you suggested, is a possibility, but I have heard that the R support in the CRAN package hdf5 is somewhat basic.
NetCDF may be an alternative; CRAN has packages RNetCDF, ncdf and ncdf4
Use a database, especially a light and file-based one like SQLite or H4 both of which have R support
Use a common serialization / de-serialization format; R has support for Google Protocol Buffers via RProtoBuf and Google points to protobuf-matlab for Matlab
Write your own! Especially if you only need something basic like large rectangular matrices, nothing will beat a direct binary write; I did this once years ago for Octave (which is close to Matlab). You can extend Matlab via MEX files; R has its own API and helpers like Rcpp. The larger your data sets, the more attractive this looks, as you save the conversions.
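As a minimal sketch of the first idea above, assuming a hypothetical data.mat file (readMat() and writeMat() are the relevant R.matlab functions):
library(R.matlab)
# Write an R matrix to a MAT file that Matlab can load directly
x <- matrix(rnorm(1e4), 100, 100)
writeMat("data.mat", x = x)
# Read a MAT file produced by Matlab's save(); returns a named list
contents <- readMat("data.mat")
str(contents$x)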
Matlab uses HDF5 natively in recent versions ("save" and "load"), and there is an HDF5 package for R, so HDF5 might be a good solution.