R running out of memory for a large data set

I am running my code on a PC and I don't think I have a problem with RAM.
When I run this step:
dataset <- rbind(dataset_1, dataset_2, dataset_3, dataset_4, dataset_5)
I get the error:
Error: cannot allocate vector of size 261.0 Mb
dataset_1 through dataset_5 have around 5 million observations each.
Could anyone please advise how to solve this problem?
Thank you very much!

There are several packages listed under the High Performance Computing CRAN task view that may solve your problem. See the "Large memory and out-of-memory data" section; the ff package, for example.
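As a rough illustration of the ff route, a minimal sketch, assuming the five datasets are also available as CSV files on disk (dataset_1.csv ... dataset_5.csv are illustrative names) and that, as I recall the ff API, read.csv.ffdf() can append to an existing ffdf via its x argument:

library(ff)

dataset <- read.csv.ffdf(file = "dataset_1.csv")                  # combined data are stored on disk
for (f in c("dataset_2.csv", "dataset_3.csv", "dataset_4.csv", "dataset_5.csv")) {
  dataset <- read.csv.ffdf(x = dataset, file = f)                 # append each file to the ffdf
}
dim(dataset)    # only small chunks are held in RAM at any time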

R, like MATLAB, loads all the data into memory, which means you can quickly run out of RAM (especially with big datasets). The only alternative I can see is to partition your data (i.e. load only part of it), do the analysis on that part, and write the results to a file before loading the next chunk.
In your case you might want to use Linux tools to merge the datasets.
Say you have two files, dataset1.txt and dataset2.txt; you can merge them using shell commands such as join, cat or awk.
More generally, using Linux shell tools to parse big datasets is usually much faster and requires much less memory.
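If you prefer to stay in R rather than the shell, one way to implement the chunk-by-chunk idea is readr's read_csv_chunked() (my choice of package, not the answer's; big_file.csv and summarise_chunk() are hypothetical placeholders for your data and your per-chunk analysis):

library(readr)

process_chunk <- function(chunk, pos) {
  res <- summarise_chunk(chunk)                      # hypothetical per-chunk analysis
  write_csv(res, "results.csv", append = pos > 1)    # write results before the next chunk is loaded
}

read_csv_chunked("big_file.csv",
                 SideEffectChunkCallback$new(process_chunk),
                 chunk_size = 1e6)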

Related

Is there an R function / package to sort data on disk (bigger-than-RAM datasets), similar to PROC SORT in SAS?

I find myself working with distributed (parquet) datasets taking up more than 100 GB of disk space.
Together they sum to approximately 2.4B rows and 24 columns.
I manage to work with them in R/Arrow, and simple operations are quite fast, but when it comes to sorting by an ID that is sparse across the different files, Arrow requires pulling the data in first (collect()), and no amount of RAM seems to be enough.
From working experience I know that SAS PROC SORT is mostly performed on disk rather than in RAM, so I was wondering whether there is an R package with a similar approach.
Any idea how to approach the problem in R, rather than buying a server with 256 GB of RAM?
Thanks,
R
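For reference, the pattern the question describes looks roughly like this in R/Arrow (the path and the id column name are illustrative, not from the question):

library(arrow)
library(dplyr)

ds <- open_dataset("path/to/parquet_dir")   # lazily scans the distributed parquet files

# Most verbs are evaluated lazily, but sorting by an ID that is sparse across
# files forces the data to be pulled into memory when collect() runs:
sorted <- ds %>%
  arrange(id) %>%
  collect()        # this is the step where RAM runs out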

R running very slowly after loading large datasets > 8GB

I have been unable to work in R given how slowly it operates once my datasets are loaded. These datasets total around 8 GB. I am running on 8 GB of RAM and have adjusted memory.limit() to exceed my RAM, but nothing seems to be working. I have used fread from the data.table package to read these files, simply because read.table would not run.
After seeing a similar post on the forum addressing the same issue, I attempted to run gctorture(), but to no avail.
R is running so slowly that I cannot even check the length of the list of datasets I have loaded, nor View() them or do any basic operation once they are loaded.
I have tried loading the datasets in 'pieces' (a third of the files at a time), which seemed to make the importing run more smoothly, but it has not changed how slowly R runs afterwards.
Is there any way to get around this issue? Any help would be much appreciated.
Thank you all for your time.
The problem arises because R loads the full dataset into RAM, which mostly brings the system to a halt when you try to View your data.
If it is a really huge dataset, first make sure the data contain only the most important columns and rows. Valid columns can be identified through your domain knowledge of the problem. You can also try to eliminate rows with missing values.
Once this is done, depending on the size of your data, you can try different approaches. One is to use packages like bigmemory and ff. bigmemory, for example, creates a pointer object through which you can read the data from disk without loading it into memory (a sketch follows below).
Another approach is parallelism (implicit or explicit); MapReduce-style frameworks are also very useful for handling big datasets.
For more information on these, check out this blog post on RPubs and this old-but-gold post from SO.
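As an illustration of the bigmemory idea, a minimal sketch (file names are placeholders; note that a big.matrix holds a single numeric type, so it suits all-numeric data):

library(bigmemory)

x <- read.big.matrix("big_file.csv", header = TRUE, type = "double",
                     backingfile = "big_file.bin",
                     descriptorfile = "big_file.desc")   # file-backed: the data stay on disk

dim(x)       # dimensions are available without loading the data into RAM
x[1:5, ]     # only the requested rows are brought into memory

# A later session can re-attach the on-disk matrix without re-reading the CSV:
x <- attach.big.matrix("big_file.desc")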

R - Creating new file takes up too much memory

I'm relatively new to R and not very good at it, and I am trying to do something that is giving me trouble.
I have several large SpatialPolygonsDataFrames that I am trying to combine into one SpatialPolygonsDataFrame. There are 7, and together they come to about 5 GB. My Mac only has 8 GB of RAM.
When I try to create the aggregate SpatialPolygonsDataFrame, R takes an incredibly long time to run and I have to quit out. I presume this is because I do not have sufficient RAM.
My code is simple: aggregate <- rbind(file1, file2, ....). Is there a smarter/better way to do this?
Thank you.
I would disagree: a major factor in reading large datasets isn't RAM capacity (although I would suggest upgrading if you can) but rather read/write speed. On the hardware side, an HDD at 7200 RPM is substantially slower than an SSD. If you are able to install an SSD and use it as your working directory, I would recommend it.

Parallel CPU processing: tm DCorpus polarity

I am trying to examine a database containing roughly 80,000 txt documents by scoring the polarity of each sentence in the text with R.
My problem is that my computer isn't able to transform the txt files into a corpus (12 GB RAM, 8 CPUs, Windows 10) - it takes more than two days.
I found out that there is a way to use all CPUs in parallel with the DCorpus function. However, starting from the DCorpus, I don't know how to run the splitSentence function, the transformation to a data frame, and the scoring via the polarity function using all CPUs in parallel again.
Moreover, I am not sure whether parallelizing the code would help with RAM usage.
Thanks for your help in advance!
All your problems arise from using the tm package, which is incredibly inefficient.
Try, for example, the text2vec package. I believe you will be able to perform your analysis in minutes and with very moderate RAM usage.
Disclosure - I'm the author of this package.
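As a rough sketch of the text2vec route (assuming the 80,000 documents have already been read into a character vector texts; the polarity scoring itself would still require applying a sentiment lexicon or model to the resulting matrix):

library(text2vec)

it  <- itoken(texts, preprocessor = tolower, tokenizer = word_tokenizer,
              progressbar = FALSE)
vocab      <- create_vocabulary(it)        # vocabulary built in one streaming pass
vectorizer <- vocab_vectorizer(vocab)
dtm        <- create_dtm(it, vectorizer)   # sparse document-term matrix, modest RAM footprint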

Running two instances of R in order to improve large data reading performance

I would like to read in a number of CSV files (~50), run a number of operations, and then use write.csv() to output a master file. Since the CSV files are on the larger side (~80 MB), I was wondering if it might be more efficient to open two instances of R, reading in half the CSVs in one instance and half in the other. Then I would write each half out to a large CSV, read both CSVs back in, and combine them into a master CSV. Does anyone know whether running two instances of R will improve the time it takes to read in all the CSVs?
I'm using a MacBook Pro running OS X 10.6 with 4 GB of RAM.
If the majority of your code's execution time is spent reading the files, then running two instances will likely be slower, because the two R processes will be competing for disk I/O. But it would be faster if the majority of the time is spent "running a number of operations".
read.table() and related functions can be quite slow.
The best way to tell whether you can benefit from parallelization is to time your R script and the basic reading of your files. For instance, in a terminal:
time cat *.csv > /dev/null
If the "cat" time is significantly lower, your problem is not I/O bound and you may
parallelize. In which case you should probably use the parallel package, e.g
library(parallel)
csv_files <- c(.....)                          # vector of CSV file paths
my_tables <- mclapply(csv_files, read.csv)     # read the files in parallel across cores
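The resulting list can then be combined and written out as the master file, e.g.:

master <- do.call(rbind, my_tables)
write.csv(master, "master.csv", row.names = FALSE)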
