How to Handle Creating Large Data Sets in R

There's a fair amount of support, through things like the various Revolution R modules, for what to do if you're bringing a large dataset into R and it's too large to be stored in RAM. But is there any way to deal with data sets being created within R that are too big to store in RAM, beyond simply (and by hand) breaking the creation step into a series of RAM-sized chunks, writing each chunk to disk, clearing it, and continuing on?
For example, running a large simulation, or using something like survSplit() to take a single observation with a survival time from 1 to N and break it into N separate observations?

If you're creating the data in R and you can do your analysis on a small chunk of the total data, then only create as large a chunk as you need for any given analysis.
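A minimal sketch of that idea, assuming the data can be generated independently in pieces (the chunk size, column names, and file layout here are all made up): simulate one RAM-sized chunk at a time, write it to disk, and drop it before generating the next.
## Hypothetical example: simulate 10 million rows in 10 chunks of 1 million,
## writing each chunk to its own .rds file instead of holding it all in RAM.
set.seed(1)
chunk_rows <- 1e6
n_chunks   <- 10
for (i in seq_len(n_chunks)) {
  chunk <- data.frame(
    id    = ((i - 1) * chunk_rows + 1):(i * chunk_rows),
    time  = rexp(chunk_rows, rate = 0.1),    # e.g. simulated survival times
    event = rbinom(chunk_rows, 1, 0.7)
  )
  saveRDS(chunk, sprintf("sim_chunk_%02d.rds", i))
  rm(chunk); gc()                            # free the chunk before the next one
}
## Later, analyse one chunk at a time:
res <- lapply(list.files(pattern = "^sim_chunk_.*\\.rds$"),
              function(f) mean(readRDS(f)$time))
Packages such as ff and bigmemory take a similar disk-backed approach, keeping a single on-disk object and paging pieces of it into RAM as needed.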

Related

Parallel processing a function applied with mapply over a large dataset

My problem is the following: I have a large dataset in R (I run it in VS Code), which I'll call full, of about 1.2 GB (28 million rows and 10 columns), and a subset of this dataset, which I'll call main (4.3 million rows and 10 columns). I'm on Windows with an i7-10700K CPU (3.8 GHz, 8 cores) and 16 GB of RAM.
These datasets contain unique identifiers for products, which then span multiple time periods and stores. For each (product, store) combination, I need to calculate summary statistics over similar products, excluding that store and product. For this reason, I essentially need the full dataset to be loaded, and I cannot split it.
I have created a function that takes a given product-store, filters the dataset to exclude that product-store, and then computes the summary statistics.
There are over 1 million product-stores, so an apply would take 1 million runs. Each run of the function takes about 0.5 seconds, which adds up to far too long.
I then decided to use furrr's future_map2 along with plan(cluster, workers = 8) to try to parallelize the process.
One piece of advice that normally argues against parallelization is that, if a lot of data needs to be moved to each worker, the transfer can take a long time. My understanding is that the parallelization would move the large dataset to each worker once and then run the apply in parallel, which seems to imply that my process should still be more efficient under parallelization, even with a large dataset.
I wanted to know whether, overall, I am doing the most advisable thing to speed up the function. I already switched fully to data.table functions to improve speed, so I don't believe there's a lot left to be done within the function itself.
I tried parallelizing, but I'm not sure it's the smartest approach.
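A hedged sketch of the setup described above (the column names product, store, and price, and the summary computed, are made up; the question's actual similarity filter is not shown): each worker receives the data.table referenced inside the function once, and the per-pair calls then run in parallel.
library(data.table)
library(future)
library(furrr)
## 'full' is assumed to be a data.table with (hypothetical) columns product, store, price.
plan(multisession, workers = 8)    # the question uses plan(cluster, workers = 8)
pairs <- unique(full[, .(product, store)])
summarise_pair <- function(p, s) {
  ## exclude this product-store, then summarise the remaining rows
  other <- full[!(product == p & store == s)]
  other[, .(mean_price = mean(price), n = .N)]
}
res <- future_map2(pairs$product, pairs$store, summarise_pair,
                   .options = furrr_options(seed = TRUE))
One caveat on the design: multisession workers are separate R processes, so each one ends up holding its own copy of full; with 8 workers and a 1.2 GB table, memory rather than CPU may become the binding constraint on a 16 GB machine.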

Is there an R function / package to sort data on disk (bigger-than-RAM datasets), similar to PROC SORT in SAS?

I find myself working with distributed datasets (Parquet) taking up more than 100 GB on disk.
Together they sum to approximately 2.4 billion rows and 24 columns.
I manage to work with them using R/Arrow, and simple operations are quite fast, but when it comes to sorting by an ID spread across different files, Arrow requires pulling the data into memory first (collect()), and no amount of RAM seems to be enough.
From work experience I know that SAS PROC SORT is mostly performed on disk rather than in RAM, and I was wondering whether there's an R package with a similar approach.
Any idea how to approach the problem in R, rather than buying a server with 256 GB of RAM?
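No answer is recorded here, but one possible route (my suggestion, not something from the thread; file paths, the id column name, and the limits are made up) is to hand the Parquet files to DuckDB, whose sort operator can spill to disk, and have it write the sorted result back to Parquet without ever collect()-ing into R:
library(DBI)
library(duckdb)
con <- dbConnect(duckdb::duckdb(), dbdir = "sort_scratch.duckdb")
dbExecute(con, "SET memory_limit = '12GB'")        # cap RAM usage
dbExecute(con, "SET temp_directory = 'duck_tmp'")  # spill files go here
## Read all Parquet parts, sort by id, and stream the result to a new Parquet
## file; the sort runs out-of-core once it exceeds the memory limit.
dbExecute(con, "
  COPY (SELECT * FROM read_parquet('data/*.parquet') ORDER BY id)
  TO 'sorted.parquet' (FORMAT PARQUET)
")
dbDisconnect(con, shutdown = TRUE)
If you prefer to stay in the arrange()/collect() style, arrow::to_duckdb() hands an opened dataset to the same engine behind a dplyr interface.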

Subsampling a very long time series in R with the goal of displaying it

Background: I have a very long vector (think many millions of rows) that I cannot display easily, as the data is simply too large. The data is a time series, so it exhibits temporal dependence.
My goal is to somehow visualize a part (or parts) of it that is representative enough (i.e. not just the first 10k rows or so).
Normally, if the data were iid and I wanted to display a part of it, I would just do resampling with replacement.
Question: Since the data is a time series, I was thinking of using "block resampling" (I don't know if this is a real term; I was thinking of something like a block bootstrap, but without actually computing any statistics). Does anybody have a good idea (or even packages) for how I can achieve what I am looking for in a clever way?
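One simple version of that idea (just a sketch; the block length, number of blocks, and simulated series are arbitrary) is to draw a few random contiguous windows and plot them as stacked panels, so each panel preserves the local temporal dependence:
set.seed(42)
x <- cumsum(rnorm(5e6))            # stand-in for the long series
block_len <- 5000                   # length of each displayed window
n_blocks  <- 6                      # number of windows to show
starts <- sort(sample(length(x) - block_len + 1, n_blocks))
op <- par(mfrow = c(n_blocks, 1), mar = c(2, 4, 1, 1))
for (s in starts) {
  idx <- s:(s + block_len - 1)
  plot(idx, x[idx], type = "l", xlab = "", ylab = "x")  # one random block per panel
}
par(op)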

What are the minimum system requirements for analysing large datasets (30 GB) in R?

I tried running the Apriori algorithm on a 30 GB CSV file in which each row is a basket with up to 34 items (columns). RStudio died just after execution started. I want to know the minimum system requirements, such as how much RAM and what CPU configuration I need, to run algorithms on large data sets.
This question cannot be answered as such; it highly depends on what you want to do with the data.
Example
If you are able to process the lines one by one, you just need a tiny bit of RAM (for example if you want to count them; I believe this also holds for the most trivial use of Apriori).
If you want to calculate the distance between all points efficiently, you will want a ton of RAM, and another few GB to store the output (and I believe even this is less intense than the most extreme use of Apriori).
Conclusion
As such I would recommend:
Use whatever hardware you have to process a subset of the data. Check your memory and CPU usage as you increase the data size (or other parameters), and extrapolate the results to see what you will probably need.
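To make the "one line at a time" point concrete, here is a small sketch (the file name and delimiter are hypothetical) that scans the CSV through a connection in fixed-size chunks, so memory use is bounded by the chunk size regardless of how big the file is:
## Count baskets and the maximum basket size without loading the whole file.
con <- file("transactions.csv", open = "r")
chunk_lines <- 1e5
n_baskets <- 0L
max_items <- 0L
repeat {
  lines <- readLines(con, n = chunk_lines)
  if (length(lines) == 0) break
  n_baskets <- n_baskets + length(lines)
  max_items <- max(max_items, lengths(strsplit(lines, ",")))
  ## memory use is bounded by 'chunk_lines', not by the file size
}
close(con)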

Why is an R object so much larger than the same data in Stata/SPSS?

I have survey data in SPSS and Stata which is ~730 MB in size. Each of these programs also occupies approximately the amount of memory you would expect (~800 MB) when I'm working with that data.
I've been trying to pick up R, and so attempted to load this data into R. No matter what method I try (read.dta from the Stata file, fread from a CSV file, read.spss from the SPSS file), the R object (measured using object.size()) is between 2.6 and 3.1 GB in size. If I save the object to an R file, it is less than 100 MB, but on loading it is the same size as before.
Any attempt to analyse the data using the survey package, particularly if I try to subset the data, takes significantly longer than the equivalent command in Stata.
e.g. I have a household size variable 'hhpers' in my data 'hh', weighted by variable 'hhwt', subset by 'htype'.
R code:
require(survey)
sv.design <- svydesign(ids = ~0, data = hh, weights = hh$hhwt)
rm(hh)
system.time(svymean(~hhpers, sv.design[which(sv.design$variables$htype == "rural"), ]))
pushes the memory used by R up to 6 GB and takes a very long time:
user system elapsed
3.70 1.75 144.11
The equivalent operation in Stata
svy: mean hhpers if htype == 1
completes almost instantaneously, giving me the same result.
Why is there such a massive difference between R and Stata, both in memory usage (by the object as well as by the function) and in time taken?
Is there anything I can do to optimise the data and how R is working with it?
ETA: My machine is running 64-bit Windows 8.1, and I'm running R with no other programs loaded. At the very least, the environment is no different for R than it is for Stata.
After some digging, I expect the reason for this is R's limited number of data types. All my data is stored as int, which takes 4 bytes per element. In survey data, each response is categorically coded and typically requires only one byte to store; Stata stores this using its 'byte' data type, while R uses its 'int' data type, leading to significant inefficiency in large surveys.
Regarding the difference in memory usage - you're on the right track and (mostly) it's because of object types. Indeed, storing everything as integers will take up a lot of your memory, so properly setting variable types would improve memory usage in R. as.factor() would help; see ?as.factor for details on converting columns after reading the data. To fix this while reading data from the file, refer to the colClasses parameter of read.table() (and the similar parameters of the functions specific to Stata and SPSS formats). This will help R store the data more efficiently (its on-the-fly guessing of types is not top-notch).
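A minimal sketch of the colClasses idea, using the column names from the question but an otherwise hypothetical CSV: declare compact types up front instead of letting read.table guess, and turn coded character variables into factors.
## Hypothetical file; only the three columns from the question are typed here.
hh <- read.table("hh.csv", header = TRUE, sep = ",",
                 colClasses = c(hhpers = "integer",   # stored as 4-byte integers
                                hhwt   = "numeric",
                                htype  = "factor"))   # repeated labels stored once as levels
## Or convert after reading (e.g. after read.dta / read.spss):
hh$htype <- as.factor(hh$htype)
print(object.size(hh), units = "MB")   # check the effect on object size
The foreign readers have analogous controls (e.g. convert.factors in read.dta() and use.value.labels in read.spss()) for how coded variables are brought in.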
Regarding the second part - calculation speed - parsing large datasets is not a strength of base R, and that's where the data.table package comes in handy: it's fast, behaves much like the original data.frame, and summary calculations are really quick. You would use it via hh <- as.data.table(read.table(...)), and you can calculate something similar to your example with
hh <- as.data.table(hh)
hh[htype == "rural", mean(hhpers * hhwt)]
## or
hh[, mean(hhpers * hhwt), by = htype]  # note the 'empty' first argument
Sorry, I'm not familiar with survey data studies, so I can't be more specific.
Another detail into memory usage by function - most likely R made a copy of your entire dataset to calculate the summaries you were looking for. Again, in this case data.table would help and prevent R from making excessive copies and improve memory usage.
Of interest may also be the memisc package, which, for me, resulted in much smaller eventual files than read.spss (I was, however, working at a smaller scale than you).
From the memisc vignette
... Thus this package provides facilities to load such subsets of variables, without the need to load a complete data set. Further, the loading of data from SPSS files is organized in such a way that all informations about variable labels, value labels, and user-defined missing values are retained. This is made possible by the definition of importer objects, for which a subset method exists. importer objects contain only the information about the variables in the external data set but not the data. The data itself is loaded into memory when the functions subset or as.data.set are used.
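A short sketch of what the vignette describes, with a hypothetical .sav file and the question's variable names: the importer holds only variable metadata, and data are read into memory only for the selected subset.
library(memisc)
## The file name is made up for illustration.
imp <- spss.system.file("survey.sav")          # importer object: metadata only
description(imp)                               # variable labels, no data loaded yet
## Load just the variables you need into memory:
hh_small <- subset(imp, select = c(hhpers, hhwt, htype))
## Or load the full data set explicitly:
hh_full  <- as.data.frame(as.data.set(imp))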
