I am working with a very large dataset (~30 million rows). I typically work with this dataset in SAS, but I would like to use some machine learning applications that do not exist in SAS but do in R. Unfortunately, my PC can't handle a dataset of this size in R, because R stores the entire dataset in memory.
Will calling the R functions from a SAS program solve this? At the very least I can run SAS on a server (I cannot do this with R).
Related
I find myself working with distributed datasets (parquet files) taking up more than 100 GB of disk space.
Together they sum to approximately 2.4B rows and 24 columns.
I manage to work with them using R/arrow, and simple operations perform quite well, but when it comes to sorting by an ID that is spread across different files, arrow requires pulling the data into memory first (collect()), and no amount of RAM seems to be enough.
From work experience I know that SAS PROC SORT is mostly performed on disk rather than in RAM, so I was wondering if there is an R package with a similar approach.
Any idea how to approach this problem in R, rather than buying a server with 256 GB of RAM?
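Roughly the pattern I am running (a simplified sketch; the dataset path and the ID column name are placeholders):

library(arrow)
library(dplyr)

ds <- open_dataset("path/to/parquet_dir")   # folder of parquet files
sorted <- ds %>%
  arrange(ID) %>%                           # sort by the ID spread across files
  collect()                                 # this is the step that exhausts RAM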
Thanks,
R
I am running an exploratory data analysis with the SmartEDA package (https://cran.r-project.org/web/packages/SmartEDA/SmartEDA.pdf), and one of its functions, "ExpReport", allows you to create an exploratory data analysis report in HTML format automatically.
I have a dataset with 172 variables and 16487 rows, and this is taking so much time to run! Is there a way to speed up R in every task we do?
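The call is essentially just the following (a sketch; the data frame name and output file name are placeholders):

library(SmartEDA)
ExpReport(my_data, op_file = "eda_report.html")   # writes the HTML report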
I will also have to run some models with this data (and more data eventually), like random forests, logistic regression, etc., and I would like to have a method to do this quickly.
I heard about parallel processing but can't really understand how it works, and whether it works only with specific packages or functions...
Thank you all!
This picture shows how memory and CPU are affected just by running "ExpReport":
[screenshot: memory and CPU consumption]
The problem with large datasets in R is that R reads the entire dataset into RAM all at once, and R objects live entirely in memory.
The doMC package provides a parallel backend for the %dopar% operator, using the multicore functionality of the parallel package.
Secondly, packages like bigmemory, ff, and data.table come in really handy.
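A minimal sketch of that pattern (the loop body is just a placeholder; note that doMC relies on forking, so it is not available on Windows):

library(doMC)
library(foreach)

registerDoMC(cores = 4)                    # register 4 worker processes
res <- foreach(i = 1:8, .combine = c) %dopar% {
  sqrt(i)                                  # placeholder for the real per-iteration work
}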
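For example, for the initial read data.table::fread() is typically much faster than read.table() (the file name here is a placeholder):

library(data.table)
dt <- fread("big_file.csv")                # multi-threaded CSV reader, returns a data.table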
Here is a vignette that will help you handle large datasets:
https://rpubs.com/msundar/large_data_analysis
Hope you find this helpful.
For modeling with SVM in R, I have used the kernlab package (ksvm method) on a Windows XP machine with 2 GB of RAM. But with 201,497 data rows, I am not able to provide enough memory for the modeling (the issue: cannot allocate a vector of size greater than 2.7 GB).
Therefore, I have used Amazon micro and large instances for the SVM modeling. But they have the same issue as my local machine (cannot allocate a vector of size greater than 2.7 GB).
Can anyone suggest a solution to this problem of modeling with big data, or is there something wrong with what I am doing?
Without a reproducible example it is hard to say if the dataset is just too big, or if some parts of your script are suboptimal. A few general pointers:
Take a look at the High Performance Computing Task View; it lists the main R packages relevant for working with big data.
You use your entire dataset for training your model. You could instead take a subset (say 10%) and fit your model on that. Repeating this procedure a few times will show whether the model fit is sensitive to which subset of the data you use (see the sketch after this list).
Some analysis techniques, e.g. PCA, can be done by processing the data iteratively, i.e. in chunks. This makes analyses on very big datasets (>> 100 GB) possible. I'm not sure if this is possible with kernlab.
Check if the R version you are using is 64 bit.
This earlier question might be of interest.
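A minimal sketch of the subsampling idea, assuming your data sit in a data frame dat with the response in column y (both names are placeholders):

library(kernlab)

set.seed(1)
fits <- lapply(1:5, function(i) {
  idx <- sample(nrow(dat), size = floor(0.1 * nrow(dat)))   # draw a 10% subsample
  ksvm(y ~ ., data = dat[idx, ], kernel = "rbfdot")         # fit the SVM on the subset only
})
sapply(fits, error)                                         # compare training errors across subsets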
Situation: a 1 GB CSV file, 100000 rows, 4000 independent numeric variables, 1 dependent variable.
R on a Windows Citrix server, with 16 GB memory.
Problem: it took me 2 hours (!) to do:
read.table("full_data.csv", header = TRUE, sep = ",")
and the glm step crashes: the program stops responding and I have to shut it down in Task Manager.
I often resort to the sqldf package to load large .csv files into memory. A good pointer is here.
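A minimal sketch of that approach, reusing the file name from the question: read.csv.sql() stages the file in a temporary SQLite database, so only the rows and columns selected by the SQL query end up in R's memory.

library(sqldf)

full_data <- read.csv.sql("full_data.csv",
                          sql = "select * from file",
                          header = TRUE, sep = ",")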
Being an R user, I'm now trying to learn the SPSS syntax.
I used to add the command rm(list=ls()) at the beginning of my R scripts to ensure that R's workspace is empty before I go on with my work.
Is there a similar command for SPSS? Thanks.
Close to the functional equivalent in SPSS would be
dataset close all.
This simply closes all open datasets except for the active dataset (and strips it of its name). If you open another dataset, the previous one will close automatically.
Since the way SPSS uses memory is fundamentally different from how R uses it, there really isn't a close equivalent between rm and SPSS memory-management mechanisms. SPSS does not keep datasets in memory in most cases, which is why it can process files of unlimited size. When you close an SPSS dataset, all its associated metadata (which is in memory) is removed.
DATASET CLOSE ALL
closes all open datasets, but there can still be an unnamed dataset remaining. To really remove everything, you would write
dataset close all.
new file.
because a dataset cannot remain open if another one is opened unless it has a dataset name.
You might also be interested to know that you can run R code from within SPSS via
BEGIN PROGRAM R.
END PROGRAM.
SPSS provides APIs for reading the active SPSS data, creating SPSS pivot tables, creating new SPSS datasets, etc. You can even use the SPSS Custom Dialog Builder to create a dialog box interface for your R program. In addition, there is a mechanism for building SPSS extension commands that are actually implemented in R or Python. All this apparatus is free once you have the basic SPSS Statistics, so it is easy to use SPSS to provide a nice user interface and nice output for an R program.
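For example, a minimal sketch of such a block (the data-reading function used here, spssdata.GetDataFromSPSS(), comes from the R integration plug-in installed with the R Essentials):

BEGIN PROGRAM R.
# Pull the active SPSS dataset into an R data frame and print a summary
# to the SPSS output Viewer.
dat <- spssdata.GetDataFromSPSS()
print(summary(dat))
END PROGRAM.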
You can download the R Essentials and a good number of R extensions for SPSS from the SPSS Community website at www.ibm.com/developerworks/spssdevcentral. All free, but registration is required.
p.s. rm(list=ls()) is useful in some situations - it is often used with R code within SPSS, because the state of the R workspace is retained between R programs within the same SPSS session.
Regards,
Jon Peck