Read from SAS to R for only a subset of rows - r

I have a very large dataset in SAS (more than 6 million rows), and I'm trying to read it into R. For this purpose, I'm using read_sas() from the haven package in R.
However, because of the data's extremely large size, I'd like to split it into subsets (e.g., 12 subsets of 500,000 rows each) and then read each subset into R. I was wondering if there is any way to do this. Any input is highly appreciated!

Is there any way you can split the data with SAS beforehand ... ?
read_sas has skip and n_max arguments, so if your increment size is N=5e5 you should be able to set an index i to read in the ith chunk of data using read_sas(..., skip=(i-1)*N, n_max=N). (There will presumably be some performance penalty to skipping rows, but I don't know how bad it will be.)
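A chunked read along those lines might look like the following sketch; the file name and chunk count are placeholders, not from the question:

```r
library(haven)  # read_sas() supports the skip and n_max arguments

N <- 5e5  # rows per chunk

# Read chunk i (1-based) of n rows from a .sas7bdat file:
# chunk i covers rows (i - 1) * n + 1 through i * n.
read_chunk <- function(path, i, n = N) {
  read_sas(path, skip = (i - 1) * n, n_max = n)
}

# For ~6 million rows, loop over 12 chunks and process each in turn:
# for (i in 1:12) {
#   chunk <- read_chunk("bigdata.sas7bdat", i)  # placeholder file name
#   ...process chunk, then drop it to free memory...
# }
```

Each call still has to scan past the skipped rows, which is why later chunks get slower; if that matters, exporting from SAS to chunked files beforehand avoids the repeated scans.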

Related

R: convert data frame columns to least memory demanding data type without loss of information

My data is massive, and I was wondering if there is a way to tell R to convert each column to the least memory-demanding data type without any loss of information.
In Stata, there is a function called compress that does that. I was wondering if there is something similar in R.
I would also be grateful for other simple advice on how to handle large datasets in R (in addition to using data.table instead of dplyr).
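R has no built-in equivalent of compress, but a rough analogue for the most common case (doubles that only ever hold whole numbers) takes a few lines of base R. compress_df below is a hypothetical helper written for this answer, not a standard function:

```r
# compress_df(): downcast double columns that hold only whole numbers
# (and fit in 32 bits) to integer, halving storage from 8 to 4 bytes
# per element. Columns with NA, infinities, or fractions are left alone.
compress_df <- function(df) {
  for (nm in names(df)) {
    x <- df[[nm]]
    if (is.double(x) && all(is.finite(x)) &&
        all(x == trunc(x)) && all(abs(x) <= .Machine$integer.max)) {
      df[[nm]] <- as.integer(x)
    }
  }
  df
}

d  <- data.frame(id = c(1, 2, 3), score = c(0.5, 1.2, 2.4))
d2 <- compress_df(d)
# d2$id is now integer; d2$score keeps its fractional values as double
```

For categorical text columns, converting to factor plays a similar role; object.size() before and after shows the saving.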

Operating on Spark data frames with SparkR and Sparklyr - unrealistic settings?

I am currently working with the SparkR and sparklyr packages, and I think they are not well suited to high-dimensional sparse data sets.
Both packages follow the paradigm that you select/filter rows and columns of a data frame by simple logical conditions on a few columns. But on such large data sets that is often not what you need: there, you select rows and columns based on the values of hundreds of entries, which usually means first calculating statistics on each row/column and then using those values for the selection, or addressing particular values in the data frame directly.
For example,
1. How can I select all rows or columns that have less than 75% missing values?
2. How can I impute missing values with column- or row-specific values that were derived from each column or row?
To solve (#2), I need to execute functions on each row or column of a data frame separately. However, even functions like dapplyCollect of SparkR do not really help, as they are far too slow.
Maybe I am missing something, but I would say that SparkR and sparklyr do not really help in these situations. Am I wrong?
As a side note, I do not understand how libraries like MLlib or H2O could be integrated with Sparklyr if there are such severe limitations, e.g. in handling missing values.
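For what it's worth, both selections are short on an ordinary data frame; the hard part is only pushing this logic through Spark's API. As a reference point, here is the intended behaviour in plain base R:

```r
# Question 1: keep columns whose share of missing values is below 75%.
df <- data.frame(a = c(1, NA, NA, NA),   # 75% missing -> dropped
                 b = c(1, 2, NA, 4),     # 25% missing -> kept
                 c = c(1, 2, 3, 4))      # 0% missing  -> kept

col_na_share <- colMeans(is.na(df))
kept <- df[, col_na_share < 0.75, drop = FALSE]

# Question 2: impute each remaining column with its own column mean.
imputed <- as.data.frame(lapply(kept, function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}))
```

In sparklyr the per-column statistics would have to be computed with summarise-style verbs and then fed back into a select, which is exactly the two-pass pattern the question complains about.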

How can I efficiently best fit large data with large numbers of variables

I have a data set with 10 million rows and 1,000 variables, and I want to find the best fit of those variables so I can estimate a new row's value. I am using JAMA's QR decomposition to do it (better suggestions welcome, but I think this question applies to any implementation). Unfortunately that takes too long.
It appears I have two choices. Either I can split the data into, say, 1,000 chunks of 10,000 rows each and then average the results, or I can add up every, say, 100 rows and feed those combined rows into the QR decomposition.
One or both ways may be mathematical disasters, and I'm hoping someone can point me in the right direction.
For datasets this big I'd say you need HDF5, the Hierarchical Data Format, version 5. It has C/C++ implementation APIs plus bindings for other languages, and it uses B-trees to index datasets.
HDF5 is supported by Java, MATLAB, Scilab, Octave, Mathematica, IDL, Python, R, and Julia.
Unfortunately I don't know much more about it than this, but I'd suggest you begin your research with a simple exploratory internet search!
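On the chunking question itself: averaging per-chunk fits is statistically shaky, but accumulating the normal equations X'X and X'y chunk by chunk reproduces the full-data least-squares solution exactly (at some cost in numerical stability versus a full QR on ill-conditioned data). A base-R sketch on simulated data, with all sizes chosen purely for illustration:

```r
set.seed(1)
n <- 1000; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n))   # design matrix with intercept
y <- X %*% c(2, -1, 0.5) + rnorm(n, sd = 0.1)  # simulated response

# Accumulate X'X and X'y over row chunks, as if each chunk were read from disk.
XtX <- matrix(0, p, p)
Xty <- matrix(0, p, 1)
for (idx in split(seq_len(n), ceiling(seq_len(n) / 100))) {  # 100-row chunks
  Xc <- X[idx, , drop = FALSE]
  XtX <- XtX + crossprod(Xc)          # this chunk's contribution to X'X
  Xty <- Xty + crossprod(Xc, y[idx])  # this chunk's contribution to X'y
}
beta_hat <- solve(XtX, Xty)  # same solution as a QR fit on all rows at once
```

The accumulation is order-independent, so each chunk can be read, used, and discarded, which keeps memory proportional to the chunk size rather than to the 10-million-row dataset.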

Why is an R object so much larger than the same data in Stata/SPSS?

I have survey data in SPSS and Stata which is ~730 MB in size. Each of these programs occupies roughly the amount of memory you would expect (~800 MB) when working with that data.
I've been trying to pick up R, so I attempted to load this data into R. No matter what method I try (read.dta on the Stata file, fread on a CSV file, read.spss on the SPSS file), the R object (measured using object.size()) is between 2.6 and 3.1 GB in size. If I save the object to an R file, that file is less than 100 MB, but on loading the object is the same size as before.
Any attempt to analyse the data using the survey package, particularly if I try to subset the data, takes significantly longer than the equivalent command in Stata.
E.g. I have a household-size variable 'hhpers' in my data 'hh', weighted by variable 'hhwt' and subset by 'htype'.
R code :
require(survey)
sv.design <- svydesign(ids = ~0,data = hh, weights = hh$hhwt)
rm(hh)
system.time(svymean(~hhpers,
                    sv.design[which(sv.design$variables$htype == "rural"), ]))
pushes the memory used by R upto 6 GB and takes a very long time -
user system elapsed
3.70 1.75 144.11
The equivalent operation in Stata
svy: mean hhpers if htype == 1
completes almost instantaneously, giving me the same result.
Why is there such a massive difference between both memory usage(by object as well as the function), and time taken between R and Stata?
Is there anything I can do to optimise the data and how R is working with it?
ETA: My machine is running 64 bit Windows 8.1, and I'm running R with no other programs loaded. At the very least, the environment is no different for R than it is for Stata.
After some digging, I expect the reason for this is R's limited set of data types. All my data is stored as int, which takes 4 bytes per element. In survey data, each response is categorically coded and typically needs only one byte to store; Stata keeps it in its 'byte' data type, while R keeps it in a 4-byte 'int', which is a significant inefficiency in large surveys.
Regarding the difference in memory usage: you're on the right track, and (mostly) it's because of object types. Indeed, storing every value as a 4-byte integer will take up a lot of your memory, so setting variable types properly improves R's memory usage. as.factor() can help (see ?as.factor for details) when converting columns after the data is read; to fix the types while reading from file, refer to the colClasses parameter of read.table() (and of the similar functions specific to the Stata and SPSS formats). This helps R store the data more efficiently, since its on-the-fly guessing of types is not top-notch.
Regarding the second part, calculation speed: parsing large datasets is not base R's strong point, which is where the data.table package comes in handy. It is fast, behaves much like the original data.frame, and summary calculations are really quick. You would use it via hh <- as.data.table(read.table(...)) and can compute something similar to your example with
hh <- as.data.table(hh)
hh[htype == "rural", weighted.mean(hhpers, hhwt)]
## or
hh[, weighted.mean(hhpers, hhwt), by = htype]  # note the 'empty' first argument
Sorry, I'm not familiar with survey data studies, so I can't be more specific.
Another detail into memory usage by function - most likely R made a copy of your entire dataset to calculate the summaries you were looking for. Again, in this case data.table would help and prevent R from making excessive copies and improve memory usage.
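To make the colClasses point concrete, here is a self-contained sketch using a small throwaway CSV (column names chosen to mirror the question):

```r
# Write a small CSV, then read it back twice: once letting R guess the
# column types, once forcing compact types via colClasses.
tf <- tempfile(fileext = ".csv")
write.csv(data.frame(htype  = rep(c("rural", "urban"), 5000),
                     hhpers = rep(1:5, 2000)),
          tf, row.names = FALSE)

guessed <- read.table(tf, header = TRUE, sep = ",")
typed   <- read.table(tf, header = TRUE, sep = ",",
                      colClasses = c(htype = "factor", hhpers = "integer"))

class(guessed$htype)  # "character" on R >= 4.0: guessed, not compact
class(typed$htype)    # "factor": one integer code per row plus a level table
unlink(tf)
```

The named colClasses vector is matched to columns by name, so only the columns you list are forced and the rest are still guessed.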
Also of interest may be the memisc package which, for me, resulted in much smaller eventual files than read.spss (I was, however, working at a smaller scale than you).
From the memisc vignette
... Thus this package provides facilities to load such subsets of variables, without the need to load a complete data set. Further, the loading of data from SPSS files is organized in such a way that all informations about variable labels, value labels, and user-defined missing values are retained. This is made possible by the definition of importer objects, for which a subset method exists. importer objects contain only the information about the variables in the external data set but not the data. The data itself is loaded into memory when the functions subset or as.data.set are used.

How to Handle Creating Large Data Sets in R

There's a fair amount of support, through things like the various Revolution R modules, for bringing a large dataset into R when it's too large to be stored in RAM. But is there any way to deal with data sets being created within R that are too big to store in RAM, beyond simply (and by hand) breaking the creation step into a series of RAM-sized chunks, writing each chunk to disk, clearing it, and continuing?
For example, doing a large simulation, or using something like survSplit() to take a single observation with a survival time from 1 to N and break it into N separate observations?
If you're creating the data in R and you can do your analysis on a small chunk of the total data, then only create as large of a chunk as you need for any given analysis.
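A minimal version of the by-hand chunking the question describes, using a temporary CSV so nothing here touches real data: each simulated chunk is appended to disk and released before the next is created.

```r
# Memory holds at most one chunk at a time; the full dataset lives on disk.
out <- tempfile(fileext = ".csv")
n_chunks <- 5
chunk_rows <- 1000

for (i in seq_len(n_chunks)) {
  chunk <- data.frame(id = ((i - 1) * chunk_rows + 1):(i * chunk_rows),
                      x  = rnorm(chunk_rows))
  write.table(chunk, out, sep = ",", row.names = FALSE,
              append = i > 1,        # append all chunks after the first
              col.names = i == 1)    # write the header only once
  rm(chunk)
}

# Later analyses can re-read the file chunkwise (read.csv with skip/nrows)
# instead of loading everything at once.
nrow(read.csv(out))  # 5000
```

The same pattern works with saveRDS per chunk, or with on-disk formats built for this (e.g. HDF5), if a single growing CSV becomes unwieldy.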
