Handling large objects in R - r

I'm trying to do a project on object recognition using the CIFAR image dataset using R, however there is one problem with handling the dataset, when I read the data into some data frame or matrix, the object size for the objects gets large around 98MB, and this causes my program to run very slow, my code is vectorized but still I face performance issues. Please can someone suggest how can I resolve this problem.
Thanks.

It looks like the binary form of those data would fit nicely into a flat-file type format, like that handled by bigmemory, ff, or rhdf5 packages.

Related

Subset of features on external memory

I have a large file that I'm not able to load so I'm using a local file with xgb.DMatrix. But I'd like to use only a subset of the features. The documentation on xgboost says that the colset argument on slice is "currently not used" and there is no metion of this feature in the github page. And I haven't found any other clue of how to do column subsetting with external memory.
I wish to compare models generated with different features subsettings. The only thing I could think of is to create a new file with the features that I want to use but it's taking a long time and will take a lot of memory... I can't help wondering if there is a better way.
ps.: I tried using h2o package too but h2o.importFile froze.

Using R with tidyquant and massiv data

While working with R I encountered a strange problem:
I am processing date in the follwing manner:
Reading data from a database into a dataframe, filling missing values, grouping and nesting the data to a combined primary key, creating a timeseries and forecastting it for every group, ungroup and clean the data, write it back into the DB.
Somehting like this:
https://cran.rstudio.com/web/packages/sweep/vignettes/SW01_Forecasting_Time_Series_Groups.html
For small data sets this works like a charm, but with lager ones (over about 100000 entries) I do get the "R Session Aborted" screen from R-Studio and the nativ R GUI just stops execution and implodes.
There is no information in every log file that I've look into. I suspect that it is some kind of (leaking) memory issue.
As a work around I'm processing the data in chunks with a for-loop. But no matter how small the chunk size is, I do get the "R Session Aborted" screen, which looks a lot like leaking memory.
The whole date consist of about 5 million rows.
I've looked a lot into packages like ff, the big-Family and matter basically everything from https://cran.r-project.org/web/views/HighPerformanceComputing.html
but this dose not seem to work well with tibbles and the tidyverse way of data processing.
So, how can I improve my scrip to work with massiv amounts of data?
How can I gather clues about why the R Session is Aborted?
Check out the article at:
datascience.la/dplyr-and-a-very-basic-benchmark
There is a table that shows runtime comparisons for some of the data wrangling tasks you are performing. From the table, it looks as though dplyr with data.table behind it is likely going to do much better than dplyr with a dataframe behind it.
There’s a link to the benchmarking code used to make the table, too.
In short, try adding a key, and try using data.table over dataframe.
To make x your key, and say your data.table is named dt, use setkey(dt,x).
While Pakes answer deals with the described problem I found a solution to the underlying problem. For Compatibility reason I used R in the 3.4.3 version. Now I'm using the newer 3.5.1 version which works quite fine.

How to use zoo or xts with large data?

How can I use the R packages zoo or xts with very large data sets? (100GB)
I know there are some packages such as bigrf, ff, bigmemory that can deal with this problem but you have to use their limited set of commands, they don't have the functions of zoo or xts and I don't know how to make zoo or xts to use them.
How can I use it?
I've seen that there are also some other things, related with databases, such as sqldf and hadoopstreaming, RHadoop, or some other used by Revolution R. What do you advise?, any other?
I just want to aggreagate series, cleanse, and perform some cointegrations and plots.
I wouldn't like to need to code and implement new functions for every command I need, using small pieces of data every time.
Added: I'm on Windows
I have had a similar problem (albeit I was only playing with 9-10 GBs). My experience is that there is no way R can handle so much data on its own, especially since your dataset appears to contain time series data.
If your dataset contains a lot of zeros, you may be able to handle it using sparse matrices - see Matrix package ( http://cran.r-project.org/web/packages/Matrix/index.html ); this manual may also come handy ( http://www.johnmyleswhite.com/notebook/2011/10/31/using-sparse-matrices-in-r/ )
I used PostgreSQL - the relevant R package is RPostgreSQL ( http://cran.r-project.org/web/packages/RPostgreSQL/index.html ). It allows you to query your PostgreSQL database; it uses SQL syntax. Data is downloaded into R as a dataframe. It may be slow (depending on the complexity of your query), but it is robust and can be handy for data aggregation.
Drawback: you would need to upload data into the database first. Your raw data needs to be clean and saved in some readable format (txt/csv). This is likely to be the biggest issue if your data is not already in a sensible format. Yet uploading "well-behaved" data into the DB is easy ( see http://www.postgresql.org/docs/8.2/static/sql-copy.html and How to import CSV file data into a PostgreSQL table? )
I would recommend using PostgreSQL or any other relational database for your task. I did not try Hadoop, but using CouchDB nearly drove me round the bend. Stick with good old SQL

R: Help reading a particular .mat file into R

So I've been trying to read this particular .mat file into R. I don't know too much about matlab, but I know enough that the R.matlab package can only read uncompressed data into R, and to save it as uncompressed I need to save it as such in matlab by using
save new.mat -v6.
Okay, so I did that, but when I used readMat("new.mat") in R, it just got stuck loading that forever. I also tried using package hdf5 via:
> hdf5load("new.mat", load=FALSE)->g
Error in hdf5load("new.mat", load = FALSE) :
can't handle hdf type 201331051
I'm not sure what this problem could be, but if anyone wants to try to figure this out the file is located at http://dibernardo.tigem.it/MANTRA/MANTRA_online/Matlab_Code%26Data.html and is called inventory.mat (the first file).
Thanks for your help!
This particular file has one object, inventory, which is a struct object, with a lot of different things inside of it. Some are cell arrays, others are vectors of doubles or logicals, and a couple are matrices of doubles. It looks like R.matlab does not like cells arrays within structs, but I'm not sure what's causing issues for R to load this. For reasons like this, I'd generally recommend avoiding mapping structs in Matlab to objects in R. It is similar to a list, and this one can be transformed to a list, but it's not always a good idea.
I recommend creating a new file, one for each object, e.g. ids = inventory.instance_ids and save each object to either a separate .mat file, or save all of them, except for the inventory object, into 1 file. Even better is to go to text, e.g via csvwrite, so that you can see what's being created.
I realize that's going around use of a Matlab to R reader, but having things in a common, universal format is much more useful for reproducibility than to acquire a bunch of different readers for a proprietary format.
Alternatively, you can pass objects in memory via R.matlab, or this set of functions + the R/DCOM interface (on Windows).
Although this doesn't address how to use R.matlab, I've done a lot of transferring of data between R and Matlab, in both directions, and I find that it's best to avoid .mat files (and, similarly, .rdat files). I like to pass objects in memory, so that I can inspect them on each side, or via standard text files. Dealing with application specific file formats, especially those that change quite a bit and are inefficient (I'm looking at you MathWorks), is not a good use of time. I appreciate the folks who work on readers, but having a lot more control over the data structures used in the target language is very much worth the space overhead of using a simple output file format. In-memory data transfer is very nice because you can interface programs, but that may be a distraction if your only goal is to move data.
Have you run the examples in http://cran.r-project.org/web/packages/R.matlab/R.matlab.pdf on pages 22 to 24? That will test your ability to read from versions 4 and 5. I'm not sure that R cannot read compressed files. There is an Rcompresssion package in Omegahat.

XTS size limitation

I have been working on large datasets lately (more than 400 thousands lines). So far, I have been using XTS format, which worked fine for "small" datasets of a few tenth of thousands elements.
Now that the project grows, R simply crashes when retrieving the data for the database and putting it into the XTS.
It is my understanding that R should be able to have vectors with size up to 2^32-1 elements (or 2^64-1 according the the version). Hence, I came to the conclusion that XTS might have some limitations but I could not find the answer in the doc. (maybe I was a bit overconfident about my understanding of theoretical possible vector size).
To sum up, I would like to know if:
XTS has indeed a size limitation
What do you think is the smartest way to handle large time series? (I was thinking about splitting the analysis into several smaller datasets).
I don't get an error message, R simply shuts down automatically. Is this a known behavior?
SOLUTION
The same as R and it depends on the kind of memory being used (64bits, 32 bits). It is anyway extremely large.
Chuncking data is indeed a good idea, but it is not needed.
This problem came from a bug in R 2.11.0 which has been solved in R 2.11.1. There was a problem with long dates vector (here the indexes of the XTS).
Regarding your two questions, my $0.02:
Yes, there is a limit of 2^32-1 elements for R vectors. This comes from the indexing logic, and that reportedly sits 'deep down' enough in R that it is unlikely to be replaced soon (as it would affect so much existing code). Google the r-devel list for details; this has come up before. The xts package does not impose an additional restriction.
Yes, splitting things into chunks that are manageable is the smartest approach. I used to do that on large data sets when I was working exclusively with 32-bit versions of R. I now use 64-bit R and no longer have this issue (and/or keep my data sets sane),
There are some 'out-of-memory' approaches, but I'd first try to rethink the problem and affirm that you really need all 400k rows at once.

Resources