Work in R with very large data set - r

I am working with a very large data set which I am downloading from an Oracle data base. The Data frame has about 21 millions rows and 15 columns.
My OS is windows xp (32-bit), I have 2GB RAM. Short-term I cannot upgrade my RAM or my OS (it is at work, it will take months before I get a decent pc).
library(RODBC)
sqlQuery(Channel1,"Select * from table1",stringsAsFactor=FALSE)
I get here already stuck with the usual "Cannot allocate xMb to vector".
I found some suggestion about using the ff package. I would appreciate to know if anybody familiar with the ff package can tell me if it would help in my case.
Do you know another way to get around the memory problem?
Would a 64-bit solution help?
Thanks for your suggestions.

If you are working with package ff and have your data in SQL, you can easily get them in ff using package ETLUtils, see the documentation for an example when using ROracle.
In my experience, ff is perfectly suited for the type of dataset you are working with (21 Mio rows and 15 columns) - in fact your setup is kind of small to ff unless your columns contain a lot of character data which will be converted to factors (meaning all your factor levels should be able to fit in your RAM).
Packages ETLUtils, ff and the package ffbase allow you to get your data in R using ff and do some basic statistics on it. Depending on what you will do with your data, your hardware, you might have to consider sampling when you build models. I prefer having my data in R, building a model based on a sample and score using the tools in ff (like chunking) or from package ffbase.
The drawback is that you have to get used to the fact that your data are ffdf objects and that might take some time - especially if you are new to R.

In my experience, processing your data in chunks can almost always help greatly in processing big data. For example, if you calculate a temporal mean only one timestep needs to be in memory at any given time. You already have your data in a database, so obtaining the subset is easy. Alternatively, if you cannot easily process in chunks, you could always try and take a subset of your data. Repeat the analysis a few times to see if your results are sensitive to which subset you take. The bottomline is that some smart thinking can get you a long way with 2 Gb of RAM. If you need more specific advice, you need to ask more specific questions.

Sorry I can't help with ff, but on the topic of the RAM: I'm not familiar with the memory usage of R data frames, but for sake of argument let's say each cell takes 8 bytes (e.g. a double-precision float or long integer).
21 million * 15 * 8 bytes = about 2.5 GB.
Update and see the comments below; this figure is probably an underestimate!
So you could really do with more RAM, and a 64-bit machine would help a lot as 32-bit machines are limited to 4GB (and can't use that fully).
Might be worth trying a subset of the dataset so you know how much you can load with your existing RAM, then extrapolate to estimate how much you actually need. If you can subdivide the data and process it in chunks, that would be great, but lots of problems don't submit to that approach easily.
Also, I have been assuming that you need all the columns! Obviously, if you can filter the data in any way to reduce the size (e.g. removing any irrelevant columns) than that may help greatly!

There's another very similar question. In particular, one way to to handle your data is to write it to the file and then map memory region to it (see, for example, mmap package).

Related

R - How to work with large dataset that does not (but could) fit within memory?

I am trying to work with some data, and it is brutally large (genetic data). With only using the bare minimum amount of columns I need, I am still looking at ~30GB of data.
Assuming I upgrade my PC to have 64GB of RAM, would I even be able to work with that data in R, or will I run into issues somewhere else? I.e., CPU not being beefy enough (AMD Ryzen 3600X), RStudio not being able to handle it, etc.
If RStudio will not be able to handle it or will be extremely slow, is there another way I can work with this data? I just want to do dimension reduction (which may make it a lot easier for me to use R) and run logistic regression on the data, maybe with some varied train/test splits.
Any help here is appreciated.
Thank you!

Efficient way to handle big data in R

I have a huge csv-file, 1.37 GB, and when running my glm in R, it crashes because I do not have enough memory allocated. You know, the regular error..
Are there no alternative to packages ff and bigmemory, because they do not seem to work well for me, because my columns are a mix of integer and characters, and it seems with the two packages I have to specify what type my columns are, either char or integer.
We are soon in 2018 and about to put people on Mars; are there no simple "read.csv.xxl" function we can use?
I would first address your question by recognizing that just because your sample data takes 1.37 GB does not at all mean that 1.37 GB would be satisfactory to do all your calculations using the glm package. Most likely, one of your calculations could spike at at least a multiple of 1.37 GB.
For the second part, a practical workaround here would be to just take a reasonable sub sample of your 1.37 GB data set. Do you really need to build your model using all the data points in the original data set? Or, would say a 10% sub sample also give you a statistically significant model? If you lower the size of the data set, then you solve the memory problem with R.
Keep in mind here that R runs completely in-memory, meaning that once you have exceeded available memory, you may be out of luck.

R, RAM amounts, and specific limitations to avoid memory errors

I have read about various big data packages with R. Many seem workable except that, at least as I understand the issue, many of the packages I like to use for common models would not be available in conjunction with the recommended big data packages (for instance, I use lme4, VGAM, and other fairly common varieties of regression analysis packages that don't seem to play well with the various big data packages like ff, etc.).
I recently attempted to use VGAM to do polytomous models using data from the General Social Survey. When I tossed some models on to run that accounted for the clustering of respondents in years as well as a list of other controls I started hitting the whole "cannot allocate vector of size yadda yadda..." I've tried various recommended items such as clearing memory out and using matrices where possible to no good effect. I am inclined to increase the RAM on my machine (actually just buy a new machine with more RAM), but I want to get a good idea as to whether that will solve my woes before letting go of $1500 on a new machine, particularly since this is for my personal use and will be solely funded by me on my grad student budget.
Currently I am running a Windows 8 machine with 16GB RAM, R 3.0.2, and all packages I use have been updated to the most recent versions. The data sets I typically work with max out at under 100,000 individual cases/respondents. As far as analyses go, I may need matrices and/or data frames that have many rows if for example I use 15 variables with interactions between factors that have several levels or if I need to have multiple rows in a matrix for each of my 100,000 cases based on shaping to a row per each category of some DV per each respondent. That may be a touch large for some social science work, but I feel like in the grand scheme of things my requirements are actually not all that hefty as far as data analysis goes. I'm sure many R users do far more intense analyses on much bigger data.
So, I guess my question is this - given the data size and types of analyses I'm typically working with, what would be a comfortable amount of RAM to avoid memory errors and/or having to use special packages to handle the size of the data/processes I'm running? For instance, I'm eye-balling a machine that sports 32GB RAM. Will that cut it? Should I go for 64GB RAM? Or do I really need to bite the bullet, so to speak, and start learning to use R with big data packages or maybe just find a different stats package or learn a more intense programming language (not even sure what that would be, Python, C++ ??). The latter option would be nice in the long run of course, but would be rather prohibitive for me at the moment. I'm mid-stream on a couple of projects where I am hitting similar issues and don't have time to build new language skills all together under deadlines.
To be as specific as possible - What is the max capability of 64 bit R on a good machine with 16GB, 32GB, and 64GB RAM? I searched around and didn't find clear answers that I could use to gauge my personal needs at this time.
A general rule of thumb is that R roughly needs three times the dataset size in RAM to be able to work comfortably. This is caused by the copying of objects in R. So, divide your RAM size by three to get a rough estimate of your maximum dataset size. Then you can look at the type of data you use, and choose how much RAM you need.
Of course, R can also process data out-of-memory, see the HPC task view. This earlier answer of mine might also be of interest.

Large matrix: solve(crossprod(X)) when dim(X) = 100,000:5000

I need to do run a simple two-stage least squares regression on large data matrices. This just requires some crossprod() and solve() commands, but the matrices have dimensions 100,000 by 5000 matrix. My understanding is that holding a matrix like this in memory would take up a bit less than 4GB of memory. Unfortunately, my 64-bit Win7 machine only has 8GB of RAM. When I try to manipulate the matrices in question, I get the usual 'can't allocate vector of size' message.
I have considered a number of options such as the ff and bigmemory packages. However, the base R functions for the matrix operations I need only support the usual matrix object type, not the bigmatrix type.
It seems like it may be possible to extend the code from biglm(), but I'm on a tight schedule for this project, so I wanted to check-in with you all to see if there existed a ready-made solution for problems like this. Apologies if this was addressed before (I couldn't find it) or if the question is too generic.
Yes, a ready-made solution exist in biglm, the package you already identified. Linear regression can work with an updating scheme; that basic property is implemented in the package.
Dump your data to disk, say to SQLite and study the package documentation and proceed in, say, 10 chunks on 10,000 each.

XTS size limitation

I have been working on large datasets lately (more than 400 thousands lines). So far, I have been using XTS format, which worked fine for "small" datasets of a few tenth of thousands elements.
Now that the project grows, R simply crashes when retrieving the data for the database and putting it into the XTS.
It is my understanding that R should be able to have vectors with size up to 2^32-1 elements (or 2^64-1 according the the version). Hence, I came to the conclusion that XTS might have some limitations but I could not find the answer in the doc. (maybe I was a bit overconfident about my understanding of theoretical possible vector size).
To sum up, I would like to know if:
XTS has indeed a size limitation
What do you think is the smartest way to handle large time series? (I was thinking about splitting the analysis into several smaller datasets).
I don't get an error message, R simply shuts down automatically. Is this a known behavior?
SOLUTION
The same as R and it depends on the kind of memory being used (64bits, 32 bits). It is anyway extremely large.
Chuncking data is indeed a good idea, but it is not needed.
This problem came from a bug in R 2.11.0 which has been solved in R 2.11.1. There was a problem with long dates vector (here the indexes of the XTS).
Regarding your two questions, my $0.02:
Yes, there is a limit of 2^32-1 elements for R vectors. This comes from the indexing logic, and that reportedly sits 'deep down' enough in R that it is unlikely to be replaced soon (as it would affect so much existing code). Google the r-devel list for details; this has come up before. The xts package does not impose an additional restriction.
Yes, splitting things into chunks that are manageable is the smartest approach. I used to do that on large data sets when I was working exclusively with 32-bit versions of R. I now use 64-bit R and no longer have this issue (and/or keep my data sets sane),
There are some 'out-of-memory' approaches, but I'd first try to rethink the problem and affirm that you really need all 400k rows at once.

Resources