In R, which packages can load large data quickly?

In R, data is usually loaded into RAM.
Are there any packages that keep the data on disk rather than loading it all into RAM?

Check out the bigmemory package, along with related packages like bigtabulate, bigalgebra, biganalytics, and more. There's also ff, though I don't find it as user-friendly as the bigmemory suite. The bigmemory suite was reportedly partially motivated by the difficulty of using ff. I like it because it required very few changes to my code to be able to access a big.matrix object: it can be manipulated in almost exactly the same ways as a standard matrix, so my code is very reusable.
There's also support for HDF5 via NetCDF4, in packages like RNetCDF and ncdf. This is a popular, multi-platform, multi-language method for efficient storage and access of large data sets.
If you want basic memory mapping functionality, look at the mmap package.
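For illustration, here is a minimal sketch of the file-backed approach with bigmemory (the sizes and file names are arbitrary examples):

library(bigmemory)

# Create a matrix whose data live in files on disk rather than in RAM
x <- filebacked.big.matrix(nrow = 1e4, ncol = 50, type = "double", init = 0,
                           backingfile = "big.bin", descriptorfile = "big.desc")

# It can be indexed much like an ordinary matrix
x[1, ] <- rnorm(50)
mean(x[, 1])

# Re-attach the same on-disk matrix later (or from another R session)
y <- attach.big.matrix("big.desc")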

Yes, the ff package can do that.
You may want to look at the Task View on High-Performance Computing for more details.
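A rough sketch of the same idea with ff (the sizes below are just examples):

library(ff)

# An ff object keeps its data in a file on disk and pages it in as needed
x <- ff(initdata = 0, dim = c(1e4, 50), vmode = "double")
x[1, 1] <- 3.14
x[1, 1]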

Related

Eliminating the need for packages in base R?

I know one of the reasons R is so popular is because of its amazing packages. But for data security reasons, I can't install packages on my work computer. So it got me thinking: could I still make R do what I would typically make it do with packages, using just base R, since packages are, after all, a compiled list of functions? I am wondering if it is possible to run regression models and make charts in base R (without using, say, ggplot2, caret, etc.). Is it possible to copy the functions in these packages into base R to get the same functionality out of base R as one would if they were using the packages? Is the list of functions that are published as part of these packages available somewhere publicly by chance?
I am wondering if it is possible to run regression models and make charts in base R (without using, say, ggplot2, caret, etc.).
Yes. Before ggplot2 was invented, R was generally praised for publication-ready graphics. R comes with great plotting capabilities even without ggplot2, even though the latter is definitely an improvement.
Obviously, people used R for regression decades before caret was invented. A base R installation comes with a solid set of linear and nonlinear regression methods, but obviously all those packages (well, most of them) have a reason to exist. It will mainly depend on what you plan to do. Many things are implemented in a base installation, many are not.
You can find lists of packages included with all binary distributions of R here: https://cran.r-project.org/doc/manuals/r-release/R-FAQ.html#Add_002don-packages-in-R
You will find that this list not only includes the stats package but also lots of useful modelling packages like MASS, splines, boot, mgcv, nlme, cluster, rpart, spatial and survival, so a large number of even quite specialized models is at hand without downloading additional packages.
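As a minimal illustration of the point above, using the built-in mtcars data, both regression and plotting work out of the box in base R:

# Linear regression with the stats package (part of every base installation)
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)

# Base graphics: scatter plot with a fitted regression line
plot(mpg ~ wt, data = mtcars,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
abline(lm(mpg ~ wt, data = mtcars))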
Is it possible to copy the functions in these packages into base R to get the same functionality out of base R as one would if they were using the packages?
Many packages contain just plain R code, others will contain code in other languages, mostly C and C++, which will need a compiler to be translated on your system. However, where the use of foreign code / packages is considered a security breach, you should refrain from that and talk to your employer.
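A quick way to check this for an installed package is the NeedsCompilation field of its DESCRIPTION (the package names below are just examples):

# "yes" means the package contains C/C++/Fortran code that needed a compiler to build;
# "no" means it is plain R code
packageDescription("MASS")$NeedsCompilation
packageDescription("boot")$NeedsCompilation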
If it is not considered a problem but they do not want to make exceptions for you and your installation -- I was in the same place for quite some time and I just ran R from a USB stick. If that is allowed and feasible on your system, you can download packages to that USB stick installation.

Stealing methods and data from other R packages

I am currently developing an R package that makes use of different datasets from other R packages. As a result, my package has a large number of dependencies, and the user is required to install various unrelated packages in order for my package to work.
I would prefer to copy these datasets to my own package and give proper credit in the documentation, but is there a problem with that?
And what about simple functions from other packages? For example, I need the Matern function from the fields package, and it seems much simpler to just copy that function to my own package instead of having a dependency on a whole package full of unrelated functionality.
Why not just ask the authors/maintainers of those packages for their permission or thoughts on the matter? They may know something that the rest of us don't about how the functions are implemented and how easy they are to copy.
Two different people asked me if they could include a function from my package in theirs; they explained why they wanted to and what they were doing, and I agreed that having the user install my whole package for just the one function would be overkill, so I gave them my blessing (and the original source code) to include the functions in their packages (technically, due to the license, they did not need my permission). Now when I update either of the functions, I also send the updated source code to those two authors so that they can keep their copy up to date if they want to.
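If you do go the copying route (license permitting), a simple starting point is to inspect the source of the one function you need, for example the Matern function mentioned above (assuming fields is installed; the dataset and package names in the comments are placeholders):

library(fields)
print(Matern)            # plain R functions print their full source
# getAnywhere("Matern")  # also reports where the definition was found

# A dataset can be copied into your own package's data/ directory in the same spirit
# data(somedata, package = "otherpackage")
# save(somedata, file = "data/somedata.rda", compress = "xz")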

R package, size of dataset vis-a-vis code

I am designing an R package (http://github.com/bquast/decompr) to run the Wang-Wei-Zhu export decomposition (http://www.nber.org/papers/w19677).
The complete package is only about 79 kilobytes.
I want to supply an example dataset, especially because the input objects are somewhat complex. A relevant real-world dataset is available from http://www.wiod.org; however, the total size of the .Rdata object would come to about 1 megabyte.
My question therefore is, would it be a good idea to include the relevant dataset that is so much larger than the package itself?
It is not unusual for code to be significantly smaller than data. However, I will not be the only one to suggest the following (especially if you want to submit to CRAN):
Consult the Writing R Extensions manual. In particular, make sure that the data file is in a compressed format and use LazyData when applicable.
The CRAN Repository Policies also have a thing or two to say about data files; as a general rule, neither data nor documentation should exceed 5MB. If the code is likely to change and the data are not, consider creating a separate data package.
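For example (the object and file names here are placeholders), the data can be stored compressed and loaded lazily:

# Save the example dataset into the package's data/ directory with strong compression
save(wwz_example, file = "data/wwz_example.rda", compress = "xz")

# Or re-compress existing .rda files in place
tools::resaveRdaFiles("data", compress = "xz")

# In DESCRIPTION, add the following line so the data only occupy memory when first used:
#   LazyData: true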
PDF documentation can also be distributed, so it is possible to write a "vignette" that is not built by running code when the package is bundled, but instead illustrates usage with static code snippets that show how to download the data. Downloading in the vignette itself is prohibited, as the manual states that all files necessary to build it must be available on the local file system.
I would also ask whether including a subset of the data would not be sufficient to illustrate the use of the package.
Finally, if you don't intend to submit to a package repository, I can't imagine a megabyte download being a breach of etiquette.

R package that automatically uses several cores?

I have noticed that R only uses one core while executing one of my programs which requires lots of calculations. I would like to take advantage of my multi-core processor to make my program run faster.
I have not yet investigated the question in depth, but I would appreciate your comments because I do not have a strong background in computer science and it is difficult for me to find easily understandable information on the subject.
Is there a package that allows R to automatically use several cores when needed?
I guess it is not that simple.
R can only make use of multiple cores with the help of add-on packages, and only for some types of operation. The options are discussed in detail in the High Performance Computing Task View on CRAN.
Update: Since R version 2.14.0, add-on packages are not strictly required, because the parallel package now ships with R. parallel includes functionality from the multicore and snow packages, largely unchanged.
The easiest way to take advantage of multiprocessors is the multicore package which includes the function mclapply(). mclapply() is a multicore version of lapply(). So any process that can use lapply() can be easily converted to an mclapply() process. However, multicore does not work on Windows. I wrote a blog post about this last year which might be helpful. The package Revolution Analytics created, doSMP, is NOT a multi-threaded version of R. It's effectively a Windows version of multicore.
If your work is embarrassingly parallel, it's a good idea to get comfortable with the lapply() type of structuring. That will give you easy segue into mclapply() and even distributed computing using the same abstraction.
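A minimal sketch with the parallel package that now ships with R (the toy function is just a stand-in for a slow computation; note that mclapply() relies on forking and so does not run in parallel on Windows, where parLapply() is the usual substitute):

library(parallel)

x <- 1:100
slow_square <- function(i) { Sys.sleep(0.01); i^2 }

# Serial
res1 <- lapply(x, slow_square)

# Forked workers (Linux/macOS)
res2 <- mclapply(x, slow_square, mc.cores = max(1, detectCores() - 1))

# Socket cluster, which also works on Windows
cl <- makeCluster(2)
res3 <- parLapply(cl, x, slow_square)
stopCluster(cl)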
Things get much more difficult for operations that are not "embarrassingly parallel".
[EDIT]
As a side note, RStudio is getting increasingly popular as a front end for R. I love RStudio and use it daily. However, it needs to be noted that RStudio does not play nicely with multicore (at least as of Oct 2011... I understand that the RStudio team is going to fix this). This is because RStudio does some forking behind the scenes, and these forks conflict with multicore's attempts to fork. So if you need multicore, you can write your code in RStudio, but run it in a plain-Jane R session.
On this question you always get very short answers. The easiest solution, in my view, is the package snowfall, which is based on snow and works on a single Windows computer with multiple cores. See also the article by Knaus et al. for a simple example. Snowfall is a wrapper around the snow package and allows you to set up a multicore cluster with a few commands. It's definitely less hassle than most of the other packages (I didn't try all of them).
On a side note, there are indeed only a few tasks that can be parallelized, for the very simple reason that you have to be able to split up the work before multicore computation makes sense. The apply family is the obvious candidate: multiple, independent computations, which is crucial for multicore use. Anything else is not always that easily parallelized.
Read also this discussion on sfApply and custom functions.
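A rough sketch of the snowfall workflow (the cluster size and the toy function are arbitrary):

library(snowfall)

# Start a local cluster with 4 CPUs (this also works on Windows)
sfInit(parallel = TRUE, cpus = 4)

myfun <- function(i) sum(rnorm(1e5)) + i
sfExport("myfun")              # make objects available on the workers
res <- sfLapply(1:100, myfun)  # parallel drop-in for lapply()

sfStop()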
Microsoft R Open includes multi-threaded math libraries to improve the performance of R. It works on Windows, Unix and Mac. It's open source and can be installed in a separate directory alongside any existing R (from CRAN) installation, and you can also use the popular RStudio IDE with it. From its inception, R was designed to use only a single thread (processor) at a time. Even today, R works that way unless linked with multi-threaded BLAS/LAPACK libraries.
The multi-core machines of today offer parallel processing power. To take advantage of this, Microsoft R Open includes multi-threaded math libraries.
These libraries make it possible for many common R operations, such as matrix multiplication, matrix inversion, matrix decomposition, and some higher-level matrix operations, to compute in parallel and use all of the processing power available to reduce computation times.
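A quick, informal way to see whether your R build benefits from a multi-threaded BLAS is to time a large matrix multiplication and check which BLAS/LAPACK R reports (the matrix size below is arbitrary):

n <- 2000
A <- matrix(rnorm(n * n), n, n)
system.time(B <- A %*% A)   # compare elapsed time and CPU usage across builds
sessionInfo()               # reports the BLAS/LAPACK libraries R is linked against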
Please check the links below:
https://mran.revolutionanalytics.com/rro/#about-rro
http://www.r-bloggers.com/using-microsoft-r-open-with-rstudio/
As David Heffernan said, take a look at the blog of Revolution Analytics. But you should know that most packages are for Linux, so if you use Windows it will be much harder.
Anyway, take a look at these sites:
Revolution: here you will find a lecture about parallelization in R. The lecture is actually very good, but, as I said, most tips are for Linux.
And this thread at Stack Overflow discusses some implementations on Windows.
The package future makes it extremely simple to work in R using parallel and distributed processing. More info here. If you want to apply a function to elements in parallel, the future.apply package provides a quick way to use the "apply" family functions (e.g. apply(), lapply(), and vapply()) in parallel.
Example:
library("future.apply")
library("stats")
x <- 1:10
# Single core
y <- lapply(x, FUN = quantile, probs = 1:3/4)
# Multicore in parallel
plan(multiprocess)
y <- future_lapply(x, FUN = quantile, probs = 1:3/4)

Sharing large datasets between Matlab and R

I need a relatively efficient way to share data between Matlab and R.
I have checked SaveR and MATLAB R-link, but SaveR formats Matlab's binary data as text strings first and then prints them to an ASCII file, which is not efficient for large datasets, and MATLAB R-link only works on Windows (it uses a COM-based interface).
Update:
Dirk has posted a list of what seem to be better solutions to this problem than SaveR and Matlab R-link. I also learned recently about RAM disks (see here and here for some implementation examples), and thought that they might facilitate the task of sharing large datasets between Matlab and R (or similar computational environments) further. This leads me to the following questions:
Assuming that the data fits in the machines' memory in Matlab's or R's native data containers:
Are any of the solutions listed so far a better fit for RAM disks?
Are there any additional considerations to be taken into account when dealing with RAM disks instead of with secondary-storage solutions?
Thanks!
A couple of ideas, with the caveat that I know more about the R side of things:
The R.matlab package on CRAN can help: it provides methods to read and write MAT files, and also makes it possible to communicate (evaluate code, send and retrieve objects, etc.) with Matlab v6 or higher running locally or on a remote host. (A short sketch follows this list.)
HDF5, as you suggested, is a possibility, but I heard that the R support in the CRAN package hdf5 is somewhat basic
NetCDF may be an alternative; CRAN has packages RNetCDF, ncdf and ncdf4
Use a database, especially a light and file-based one like SQLite or H2, both of which have R support
Use a common serialization / de-serialization format; R has support for Google Protocol Buffers via RProtoBuf and Google points to protobuf-matlab for Matlab
Write your own! Especially when you only need something basic like large rectangular matrices, nothing will beat a direct binary write; I did this once years ago for Octave (which is close to Matlab). You can extend Matlab via MEX files; R has its API and helpers like Rcpp. The larger your data sets, the more attractive this may look, as you save the conversions.
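As a short sketch of the first idea, R.matlab can both write MAT files for Matlab and read MAT files Matlab produced (the file names are arbitrary; as far as I know, readMat() handles MAT formats up to v7 but not the HDF5-based v7.3):

library(R.matlab)

# Write R objects into a MAT file that Matlab can read with load()
writeMat("fromR.mat", A = matrix(rnorm(6), 2, 3), label = "demo")

# Read a MAT file written by Matlab's save()
m <- readMat("fromMatlab.mat")
str(m)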
Matlab uses HDF5 natively in recent versions (for save and load). There is an HDF5 package for R, so HDF5 might be a good solution.
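A minimal sketch of that route, assuming the Bioconductor package rhdf5 on the R side (the file and variable names are examples):

# On the Matlab side:  save('exchange.mat', 'x', '-v7.3')   (-v7.3 writes an HDF5 file)

library(rhdf5)
h5ls("exchange.mat")               # list the datasets stored in the file
x <- h5read("exchange.mat", "/x")  # read the Matlab variable 'x'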
