Efficient way to handle big data in R

I have a huge CSV file, 1.37 GB, and when I run my glm in R it crashes because I do not have enough memory allocated. You know, the usual error.
Is there no alternative to the packages ff and bigmemory? They do not seem to work well for me, because my columns are a mix of integers and characters, and with both packages it seems I have to specify the type of each column, either character or integer.
It is almost 2018 and we are about to put people on Mars; is there really no simple "read.csv.xxl" function we can use?

I would first address your question by pointing out that just because your sample data takes 1.37 GB does not mean that 1.37 GB would be enough to do all your calculations with glm. Most likely, at least one of the intermediate calculations will spike to a multiple of 1.37 GB.
For the second part, a practical workaround is to take a reasonable subsample of your 1.37 GB data set. Do you really need to build your model using all the data points in the original data set, or would, say, a 10% subsample also give you a statistically sound model? If you lower the size of the data set, you solve the memory problem in R.
Keep in mind that R runs completely in memory, so once you have exceeded the available memory you may be out of luck.
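A minimal, untested sketch of this subsampling idea (the file name, the 10% fraction and the model formula are placeholders, not details from the question):
set.seed(42)
dat <- read.csv("huge_file.csv", stringsAsFactors = FALSE)
## keep a 10% random subsample and free the full copy
keep <- sample(nrow(dat), size = round(0.1 * nrow(dat)))
sub <- dat[keep, ]
rm(dat); gc()
fit <- glm(y ~ x1 + x2, data = sub)   # substitute your actual formula and family
summary(fit)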

Related

What are the minimum system requirements for analysing large datasets (30 GB) in R?

I tried running the Apriori algorithm on a 30 GB CSV file in which each row is a basket with up to 34 items (columns). RStudio died right after execution. I want to know the minimum system requirements, i.e. how much RAM and what kind of CPU I need, to run algorithms on large data sets.
This question cannot be answered as such; it depends heavily on what you want to do with the data.
Example
If you are able to process the lines one by one, you only need a tiny bit of RAM (for example if you want to count them, as sketched below; I believe this also holds for the most trivial use of Apriori).
If you want to calculate the distance between all points efficiently, you will need a ton of RAM, plus another few GB to store the output (I believe this is still less intense than the most extreme use of Apriori).
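A minimal, untested sketch of the first case, counting the rows of a huge CSV without ever holding it in memory (the file name and chunk size are placeholders):
count_rows <- function(path, chunksize = 1e6) {
  con <- file(path, open = "r")
  on.exit(close(con))
  n <- 0L
  repeat {
    lines <- readLines(con, n = chunksize)   # read one chunk of lines
    if (length(lines) == 0) break
    n <- n + length(lines)
  }
  n - 1L                                     # subtract the header line
}
count_rows("baskets.csv")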
Conclusion
As such, I would recommend:
Use whatever hardware you have to process a subset of the data. Check your memory and CPU usage as you increase the data size (or other parameters), and extrapolate the results to estimate what you will probably need.

Time complexity and memory limit of glm in R

I want to use glm(..., family = "binomial") to run a logistic regression on my big dataset, which has 80,000,000 rows and 125 columns as a data.frame. When I run it in RStudio, it just crashes.
So I wonder what the time complexity of glm() is, and whether there are any solutions for handling data of this size. Someone suggested running the code from the command line: does that make any difference? (I tried, but it does not seem to work either.)
Memory requirement: R has to load the entire dataset into memory (RAM). Assuming 32-bit entries, your dataset is roughly 37 GB, far more than the RAM you have on your computer; that is why it crashes. You cannot use R for this dataset unless you turn to special big-data packages, and I am not sure it is feasible even then.
There are other languages and tools that do not need to load the full dataset into memory to work on it, so it might be wise to look at those.
Time complexity of GLMs: with N the number of observations (usually rows) and p the number of variables (usually columns), most standard GLM algorithms cost O(Np^2 + p^3) per iteration of the underlying iteratively reweighted least squares.
For your situation that works out to roughly 10^12 operations, which is still barely in the realm of possibility, but you would probably need more than one modern PC running for at least a few days.
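If you do want to stay in R, one route is the biglm package, whose bigglm() fits a GLM by streaming over chunks of data rather than loading everything at once. A minimal, untested sketch, where the file name, column names and chunk size are placeholders (bigglm accepts a data function that returns the next chunk, or NULL when the data are exhausted, and rewinds when called with reset = TRUE):
library(biglm)
make_chunk_reader <- function(path, chunksize = 1e6) {
  con <- NULL
  header <- names(read.csv(path, nrows = 1))
  function(reset = FALSE) {
    if (reset) {                          # rewind between passes over the data
      if (!is.null(con)) close(con)
      con <<- file(path, open = "r")
      readLines(con, n = 1)               # skip the header line
      return(NULL)
    }
    chunk <- tryCatch(
      read.csv(con, header = FALSE, nrows = chunksize, col.names = header),
      error = function(e) NULL)           # end of file: signal "no more data"
    if (is.null(chunk) || nrow(chunk) == 0L) NULL else chunk
  }
}
fit <- bigglm(y ~ x1 + x2, data = make_chunk_reader("big.csv"), family = binomial())
summary(fit)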

Why is an R object so much larger than the same data in Stata/SPSS?

I have survey data in SPSS and Stata which is ~730 MB in size. Each of these programs occupies approximately the amount of space you would expect (~800 MB) in memory when I work with that data.
I have been trying to pick up R, and so attempted to load this data into R. No matter what method I try (read.dta on the Stata file, fread on a CSV file, read.spss on the SPSS file), the R object (measured using object.size()) is between 2.6 and 3.1 GB in size. If I save the object to an R file, that file is less than 100 MB, but on loading the object is the same size as before.
Any attempt to analyse the data using the survey package, particularly if I try to subset the data, takes significantly longer than the equivalent command in Stata.
For example, I have a household size variable 'hhpers' in my data 'hh', weighted by variable 'hhwt' and subset by 'htype'.
R code:
require(survey)
sv.design <- svydesign(ids = ~0, data = hh, weights = hh$hhwt)
rm(hh)
system.time(svymean(~hhpers, sv.design[which(sv.design$variables$htype == "rural"), ]))
pushes the memory used by R up to 6 GB and takes a very long time:
user system elapsed
3.70 1.75 144.11
The equivalent operation in Stata
svy: mean hhpers if htype == 1
completes almost instantaneously, giving me the same result.
Why is there such a massive difference, in both memory usage (by the object as well as by the function) and time taken, between R and Stata?
Is there anything I can do to optimise the data and how R is working with it?
ETA: My machine is running 64-bit Windows 8.1, and I'm running R with no other programs loaded. At the very least, the environment is no different for R than it is for Stata.
After some digging, I believe the reason for this is R's limited set of data types. All my data is stored as int, which takes 4 bytes per element. In survey data, each response is categorically coded and typically needs only one byte to store; Stata stores this using its 'byte' data type, whereas R stores it as 'int', leading to significant inefficiency in large surveys.
Regarding the difference in memory usage: you are on the right track, and (mostly) it is because of object types. Storing everything as integers does take up a lot of memory, so setting variable types properly will improve R's memory usage. as.factor() helps here; see ?as.factor for details on converting columns after the data have been read. To fix this while reading the data from file, refer to the colClasses parameter of read.table() (and of the similar functions specific to the Stata and SPSS formats). This helps R store the data more efficiently, since its on-the-fly guessing of types is not top-notch.
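A minimal, untested sketch of both suggestions, assuming the data were exported to CSV; the file name and the particular type choices are placeholders:
hh <- read.csv("survey.csv",
               colClasses = c(hhpers = "integer", hhwt = "numeric", htype = "factor"))
## or convert a column after reading
hh$htype <- as.factor(hh$htype)
object.size(hh)   # check the effect on memory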
Regarding the second part, calculation speed: parsing and summarising large datasets is not base R's strong point, and this is where the data.table package comes in handy. It is fast and behaves quite similarly to the original data.frame, and summary calculations are really quick. You would use it via hh <- as.data.table(read.table(...)), and you can calculate something similar to your example with
hh <- as.data.table(hh)
hh[htype == "rural", weighted.mean(hhpers, hhwt)]
## or
hh[, weighted.mean(hhpers, hhwt), by = htype]  # note the 'empty' first argument
Sorry, I'm not familiar with survey data studies, so I can't be more specific.
Another detail on memory usage by the function: most likely R made a copy of your entire dataset to calculate the summaries you were looking for. Again, data.table would help here, both by preventing R from making excessive copies and by improving memory usage.
Of interest may also be the memisc package which, for me, resulted in much smaller eventual files than read.spss (I was, however, working at a smaller scale than you).
From the memisc vignette
... Thus this package provides facilities to load such subsets of variables, without the need to load a complete data set. Further, the loading of data from SPSS files is organized in such a way that all information about variable labels, value labels, and user-defined missing values is retained. This is made possible by the definition of importer objects, for which a subset method exists. importer objects contain only the information about the variables in the external data set but not the data. The data itself is loaded into memory when the functions subset or as.data.set are used.
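A minimal, untested sketch of that importer workflow with memisc (the file name is a placeholder; the importer reads only the metadata up front):
library(memisc)
imp <- spss.system.file("survey.sav")                            # metadata only, no data yet
hh <- as.data.set(subset(imp, select = c(hhpers, hhwt, htype)))  # load just these variables
hh <- as.data.frame(hh)                                          # if a plain data frame is needed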

Work in R with very large data set

I am working with a very large data set which I am downloading from an Oracle database. The data frame has about 21 million rows and 15 columns.
My OS is Windows XP (32-bit) and I have 2 GB of RAM. Short-term, I cannot upgrade my RAM or my OS (it is at work and it will take months before I get a decent PC).
library(RODBC)
dat <- sqlQuery(Channel1, "Select * from table1", stringsAsFactors = FALSE)
Here I already get stuck with the usual "cannot allocate vector of size x Mb" error.
I found some suggestions about using the ff package. I would appreciate it if anybody familiar with ff could tell me whether it would help in my case.
Do you know another way to get around the memory problem?
Would a 64-bit solution help?
Thanks for your suggestions.
If you are working with package ff and have your data in SQL, you can easily get it into ff using package ETLUtils; see its documentation for an example using ROracle.
In my experience, ff is perfectly suited to the type of dataset you are working with (21 million rows and 15 columns); in fact your setup is on the small side for ff, unless your columns contain a lot of character data, which will be converted to factors (meaning all your factor levels should fit in RAM).
The packages ETLUtils, ff and ffbase allow you to get your data into R as ff objects and do some basic statistics on them. Depending on what you will do with your data and on your hardware, you might have to consider sampling when you build models. I prefer having my data in R, building a model on a sample, and scoring the full data using the tools in ff (like chunking) or from package ffbase, as sketched below.
The drawback is that you have to get used to the fact that your data are ffdf objects, and that might take some time, especially if you are new to R.
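A minimal, untested sketch of that sample-then-score workflow, assuming an ffdf object dat has already been created (e.g. via ETLUtils or read.csv.ffdf); the column names y, x1 and x2 are placeholders:
library(ff)
library(ffbase)
## fit on a manageable random sample held in RAM
idx <- sort(sample(nrow(dat), size = 1e5))
train <- as.data.frame(dat[idx, ])          # pulls only the sampled rows into RAM
fit <- glm(y ~ x1 + x2, data = train)
## score the full ffdf chunk by chunk
scores <- ff(vmode = "double", length = nrow(dat))
for (i in chunk(dat)) {
  scores[i] <- predict(fit, newdata = dat[i, ], type = "response")
}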
In my experience, processing your data in chunks almost always helps greatly when working with big data. For example, if you calculate a temporal mean, only one timestep needs to be in memory at any given time. You already have your data in a database, so obtaining such subsets is easy. Alternatively, if you cannot easily process the data in chunks, you could always take a subset of it; repeat the analysis a few times to see whether your results are sensitive to which subset you take. The bottom line is that some smart thinking can get you a long way with 2 GB of RAM. If you need more specific advice, ask more specific questions. A sketch of the chunk-wise idea follows.
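A minimal, untested sketch of chunk-wise processing straight from the database, reusing the Channel1 connection from the question; the column names timestep and value are placeholders for whatever you would chunk on and summarise:
library(RODBC)
timesteps <- sqlQuery(Channel1, "SELECT DISTINCT timestep FROM table1")[[1]]
chunk_means <- sapply(timesteps, function(ts) {
  qry <- sprintf("SELECT value FROM table1 WHERE timestep = %s", ts)
  chunk <- sqlQuery(Channel1, qry)   # only one timestep in RAM at a time
  mean(chunk[[1]])
})
result <- data.frame(timestep = timesteps, mean = chunk_means)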
Sorry, I can't help with ff, but on the topic of RAM: I'm not familiar with the exact memory usage of R data frames, but for the sake of argument let's say each cell takes 8 bytes (e.g. a double-precision float or long integer).
21 million * 15 * 8 bytes = about 2.5 GB.
Update: see the comments below; this figure is probably an underestimate!
So you could really do with more RAM, and a 64-bit machine would help a lot, since 32-bit machines are limited to a 4 GB address space (and cannot even use all of that).
It might be worth loading a subset of the dataset so you know how much fits in your existing RAM, and then extrapolating to estimate how much you actually need; a sketch of that follows below. If you can subdivide the data and process it in chunks, that would be great, but lots of problems do not submit to that approach easily.
Also, I have been assuming that you need all the columns! Obviously, if you can filter the data in any way to reduce its size (e.g. by removing irrelevant columns), that may help greatly!
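A minimal, untested sketch of that measure-and-extrapolate idea, pulling a sample through the existing Channel1 connection (the ROWNUM filter assumes Oracle syntax, and the row counts are placeholders):
library(RODBC)
smp <- sqlQuery(Channel1, "Select * from table1 WHERE ROWNUM <= 100000", stringsAsFactors = FALSE)
bytes_per_row <- as.numeric(object.size(smp)) / nrow(smp)
bytes_per_row * 21e6 / 2^30   # rough estimate of the full data frame in GB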
There is another very similar question. In particular, one way to handle your data is to write it to a file and then map a memory region to it (see, for example, the mmap package).

SVM modeling with BIG DATA

For modeling with SVM in R, I have used the kernlab package (the ksvm method) on Windows XP with 2 GB of RAM. But with as many as 201,497 data rows, I cannot provide enough memory for the modeling (getting the issue: cannot allocate vector of size greater than 2.7 GB).
Therefore, I have used Amazon micro and large instances for the SVM modeling. But they show the same issue as the local machine (cannot allocate vector of size greater than 2.7 GB).
Can anyone suggest a solution to this problem of modeling with big data, or tell me whether I am doing something wrong?
Without a reproducible example it is hard to say whether the dataset is just too big or whether some parts of your script are suboptimal. A few general pointers:
Take a look at the High Performance Computing task view, which lists the main R packages relevant for working with big data.
You use your entire dataset for training your model. You could instead take a subset (say 10%) and fit your model on that. Repeating this procedure a few times will yield insight into whether the model fit is sensitive to the subset of data you use; see the sketch at the end of this answer.
Some analysis techniques, e.g. PCA, can be performed by processing the data iteratively, i.e. in chunks, which makes analyses of very big datasets (>> 100 GB) possible. I'm not sure whether this is possible with kernlab.
Check that the R version you are using is 64-bit.
This earlier question might be of interest.
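A minimal, untested sketch of the repeated-subsample suggestion with kernlab, assuming a data frame dat with a factor label column y (all names, the kernel and the 10% fraction are placeholders):
library(kernlab)
set.seed(1)
fits <- lapply(1:5, function(i) {
  idx <- sample(nrow(dat), size = round(0.1 * nrow(dat)))   # ~10% of the rows
  ksvm(y ~ ., data = dat[idx, ], kernel = "rbfdot", C = 1, cross = 5)
})
sapply(fits, cross)   # compare cross-validation error across the subsample fits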

Resources