Situation: 1GB CSV file, 100000 rows, 4000 independent numeric variable, 1 dependent variable.
R on Windows Citrix Server, with 16GB memory.
Problem: It took me 2 hours! to do:
read.table("full_data.csv", header=T, sep",")
and the glm process crashes, the program is not responding, and I have to shut it down in Task Manager.
I often resort to the package sqldf to load large .csv in memory. A good pointer is here.
Related
I am working with a very large dataset (~30 million rows). I have typically work with this dataset in SAS but would like to use some machine learning applications that do not exist in SAS but do in R. Unfortunately my PC can't handle a dataset of this size in R due to how R stores the entire dataset in memory.
Will calling the R functions from a SAS program solve this? At the very least I can run SAS on a server (I cannot do this with R).
I apologize in advance since this post will not have any reproducible example.
I am using R x64 3.4.2 to run some cross-validated analyses on quite big matrices (number of columns ~ 80000, number of rows between 40 and 180). The analyses involve several features selection steps (performed with in-house functions or with functions from the CORElearnpackage, which is written in C++), as well as some clustering of the features and the fitting of a SVM model (by means of the package RWeka, that is written in Java).
I am working on a DELL Precision T7910 machine, with 2 processors Intel Xeon E5-2695 v3 2.30 GHz, 192 Gb RAM and Windows 7 x64 operating system.
To speed up the running time of my analysis I thought to use the doParallel package in combination with foreach. I would set up the cluster as follow
cl <- makeCluster(number_of_cores, type='PSOCK')
registerDoParallel(cl)
with number_of_clusterset to various numbers between 2 and 10 (detectCore() tells me that I have 56 cores in total).
My problem is that even if only setting number_of_cluster to 2, I got a protection from stack overflowerror message. The thing is that I monitor the RAM usage while the script is running and not even 20 Gb of my 192 Gb RAM are being used.
If I run the script in a sequential way it takes its sweet time (~ 3 hours with 42 rows and ~ 80000 columns), but it does run until the end.
I have tried (almost) every trick in the book for good memory management in R:
I am loading and removing big variables as needed in order to reduce memory usage
I am breaking down the steps with functions rather than scripting them directly, to take advantage of scoping
I am calling gc()every time I delete a big object in order to prompt R to return memory to the operating system
But I am still unable to run the script in parallel.
Do someone have any suggestion about this ? Should I just give up and wait > 3 hours every time I run the analyses ? And more generally: how is it possible to have a stack overflow problem when having a lot of free RAM ?
UPDATE
I have now tried to "pseudo-parallelize" the work using the same machine: since I am running a 10-fold cross-validation scheme, I am opening 5 different instances of Rgui and running 2 folds in each instances. Proceeding in this way, everything run smoothly, and the process indeed take 10 times less than running it in a single instance of R. What makes me wonder is that if 10 instances of Rgui can run at the same time and get the job done, this means that the machine has the computational resources needed. Hence I can not really get my head around the fact that %dopar% with 10 clusters does not work.
The "protection stack overflow" means that you have run out of the "protection stack", that is too many pointers have been PROTECTed but not (yet) UNPROTECTed. This could be because of a bug or inefficiency in the code you are running (in native code of a package or in native code of R, but not a bug in R source code).
This problem has nothing to do with the amount of available memory on the heap, so calling gc() will have no impact, and it is not important how much physical memory the machine has. Please do not call gc() explicitly at all, even if there was a problem with the heap usage, it just makes the program run slower but does not help: if there is not enough heap space but it could be obtained by garbage collection, the garbage collector will run automatically. As the problem is the protection stack, neither restructuring the R code nor removing dead variables explicitly will help. In principle, structuring the code into (relatively small) functions is a good thing for maintainability/readability and it also indirectly reduces scope of variables, so removing variables explicitly should become unnecessary.
It might help to increase the pointer protection stack size, which can be done at R startup from the command line using --max-ppsize.
I have a large data set, one of the files is 5GB. Can someone suggest me how to quickly read it into R (RStudio)? Thanks
If you only have 4 GBs of RAM you cannot put 5 GBs of data 'into R'. You can alternatively look at the 'Large memory and out-of-memory data' section of the High Perfomance Computing task view in R. Packages designed for out-of-memory processes such as ff may help you. Otherwise you can use Amazon AWS services to buy computing time on a larger computer.
My package filematrix is made for working with matrices while storing them in files in binary format. The function fm.create.from.text.file reads a matrix from a text file and stores it in a binary file without loading the whole matrix in memory. It can then be accessed by parts using usual subscripting fm[1:4,1:3] or loaded quickly in memory as a whole fm[].
I am running my code in a PC and I don't think I have problem with the RAM.
When I run this step:
dataset <- rbind(dataset_1, dataset_2,dataset_3,dataset_4,dataset_5)
I got the
Error: cannot allocate vector of size 261.0 Mb
The dataset_1 until dataset_5 have around 5 million observation each.
Could anyone please advise how to solve this problem?
Thank you very much!
There are several packages available that may solve your problem under the High Performance Computing CRAN taskview. See "Large memory and out-of-memory data", the ff package, for example.
R, as matlab, load all the data into the memory which means you can quickly run out of RAM (especially for big datasets). The only alternative I can see is to partition your data (i.e. load only part of the data), do the analysis on that part and write the results to files before loading the next chunk.
In your case you might want to use Linux tools to merge the datasets.
Say you have two files dataset1.txt and dataset2.txt, you can merge them using the shell command join, cat or awk.
More generally, using Linux shell tools for parsing big datasets is usually much faster and requires much less memory.
I would like to read-in a number of CSV files (~50), run a number of operations, and then use write.csv() to output a master file. Since the CSV files are on the larger side (~80 Mb), I was wondering if it might be more efficient to open two instances of R, reading-in half the CSVs on one instance, and half on the other. Then I would write each to a large CSV, read-in both CSVs, and combine them into a master CSV. Does anyone know if running two instances of R will improve the time it takes to read-in all the csv's?
I'm using a Macbook Pro OSX 10.6 with 4Gb RAM.
If the majority of your code execution time is spent reading the files, then it will likely be slower because the two R processes will be competing for disk I/O. But it would be faster if the majority of the time is spent "running a number of operations".
read.table() and related can be quite slow.
The best way to tell if you can benefit from parallelization is to time your R script, and the basic reading of your files. For instance, in a terminal:
time cat *.csv > /dev/null
If the "cat" time is significantly lower, your problem is not I/O bound and you may
parallelize. In which case you should probably use the parallel package, e.g
library(parallel)
csv_files <- c(.....)
my_tables <- mclapply(csv_files, read.csv)