I'm relatively new to R and not very good at it, and I'm having trouble with something.
I have several large SpatialPolygonsDataFrame objects that I am trying to combine into one. There are 7 of them, and together they come to about 5 GB. My Mac only has 8 GB of RAM.
When I try to create the combined SpatialPolygonsDataFrame, R takes an incredibly long time to run and I have to quit. I presume this is because I do not have enough RAM.
My code is simple: aggregate <- rbind(file1, file2, ....). Is there a smarter/better way to do this?
Thank you.
I would disagree: a major factor when reading large datasets is not RAM capacity (although I would suggest upgrading if you can) but read/write speed. In hardware terms, a 7200 RPM HDD is substantially slower than an SSD. If you are able to install an SSD and use it as your working directory, I would recommend it.
I find myself working with distributed datasets (Parquet) that take up more than 100 GB on disk.
Together they add up to approximately 2.4B rows and 24 columns.
I manage to work on them with R/Arrow. Simple operations perform quite well, but sorting by an ID that is scattered across different files requires Arrow to pull the data into memory first (collect()), and no amount of RAM seems to be enough.
From work experience I know that SAS PROC SORT works mostly on disk rather than in RAM, so I was wondering whether there is an R package with a similar approach.
Any idea how to approach the problem in R, rather than buying a server with 256 GB of RAM?
Thanks,
R
This is more of a hardware-related question. In my company we work with both R (for data analysis) and Power BI (for data visualization). We have an asset built in R to perform domain-specific calculations, and we display the output in Power BI with some more or less complex graphs and calculated fields.
We recently had to deal with a dataset of 2 GB / 7 million records on our PCs (HP 830 G5, i5-8250U, 8 GB RAM). R timed out during our calculations on this dataset, while Power BI was able to handle it relatively easily.
I know R actively uses RAM as temporary storage for objects, and that might have been the problem. But why was it not a problem for Power BI? I am asking in order to figure out whether simply buying and installing more RAM in all our laptops is a good solution. We need both R and Power BI for our asset to work.
Thank you very much!
No amount of RAM will make up for inefficient code. Throwing hardware at a software problem might help initially, but it won't scale as well as good code. Without knowing more about your particular use case, the best advice I can give is to use more efficient libraries such as data.table, or one of the other efficient solutions highlighted in this piece.
You should also consider working in parallel, using more than one core for your workload. Power BI might do this automatically, but in R you have to request it explicitly.
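As a minimal sketch of what requesting parallelism explicitly can look like, the following uses the parallel package that ships with R. The function slow_square and the worker count of 2 are placeholders chosen for illustration, not part of the original question.

```r
library(parallel)

# Hypothetical stand-in for an expensive per-element calculation.
slow_square <- function(x) {
  x^2
}

# Run `fun` over `inputs` on `n_workers` background R processes.
run_in_parallel <- function(inputs, fun, n_workers = 2) {
  cl <- makeCluster(n_workers)   # start the worker processes
  on.exit(stopCluster(cl))       # always shut them down again
  parLapply(cl, inputs, fun)     # results come back in input order
}

results <- unlist(run_in_parallel(1:8, slow_square))
print(results)
```

Starting workers has a fixed cost, so this only pays off when each call to the function does a meaningful amount of work.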
With files smaller than 10 GB, R should have no problem. Above 10 GB, however, some workaround is required, as highlighted in this article.
I tried what turned out to be a quite memory-intensive operation in R: writing an xlsx file from a dataset with 500k observations and 2000 variables.
I tried the method explained here (first comment).
I set the max VSIZE to 10 GB. I did not want to go higher because I was afraid of damaging my computer (I saved money for a long time :)), and it still did not work.
I then looked into cloud computing with R, which I found quite difficult as well.
So finally, I wanted to ask here whether anyone can tell me how high I can set VSIZE without damaging my computer, or whether there is another way to solve my problem. (The goal is to convert a SAS file to an xlsx or xls file. The files are between 1.4 GB and 1.6 GB. I have about 8 GB of RAM.) I am open to downloading programs if that's not too complicated.
Cheers.
I have written an R program that generates a random vector of length 1 million. I need to run it 1 million times. Out of the 1 million simulations, I will use 50K of the generated vectors (chosen in some random manner) as samples, so the sample is 50K by 1M in size. Is there a way to deal with this in R?
There are a few problems and some not-so-good solutions.
First, R cannot store such a huge matrix on my machine; it exceeds the available RAM. I looked into packages like bigmemory, ffbase, etc. that use hard disk space, but data this large could run to terabytes, and I have 200 GB of hard disk available.
Even if storage were possible, there is the problem of running time. The code may take more than 100 hours to run!
Can anyone please suggest a way out? Thanks
This answer really stands as something in between a comment and an answer. The easy way out of your dilemma is to not work with such massive data sets. You can most likely take a reasonably-sized representative subset of that data (say requiring no more than a few hundred MB) and train your model this way.
If you have to use the model in production on actual data sets with millions of observations, then the problem would no longer be related to R.
If possible, use sparse matrix techniques.
If possible, try leveraging disk storage and chunking the object into parts.
If possible, try Big Data tools such as H2O.
Leverage multicore and HPC computing with pbdR, parallel, etc.
Consider using a spot instance of a Big Data / HPC cloud VPS on AWS, Azure, DigitalOcean, etc. Most offer distributions with R preinstalled, and a high-RAM multicore instance can be "spun up" (started) and "spun down" (stopped) quickly and cheaply.
Use sampling and statistical solutions when possible.
Consider doing some of your simulations or pre-simulation steps in a relational database, or in something like Spark + Scala; some of these have R integration nowadays.
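One way to read the chunking and sampling suggestions together: if only a summary of each simulated vector is needed, each vector can be generated, reduced, and discarded, so the full matrix never exists in memory. The sketch below assumes the summary of interest is the per-simulation mean and uses rnorm as a placeholder for the real generator; the sizes are deliberately tiny.

```r
# Simulate `n_sims` vectors of length `vec_len` one at a time and keep
# only a small summary of each, so memory use stays at about one vector.
simulate_summaries <- function(n_sims, vec_len, seed = 1) {
  set.seed(seed)
  means <- numeric(n_sims)       # preallocate the (small) result
  for (i in seq_len(n_sims)) {
    v <- rnorm(vec_len)          # placeholder for the real simulation
    means[i] <- mean(v)          # reduce; `v` is overwritten next round
  }
  means
}

sim_means <- simulate_summaries(n_sims = 100, vec_len = 1000)
length(sim_means)
```

Whether this applies depends on the analysis: it works when the quantity of interest can be computed per vector or accumulated across vectors, and not when every raw value must be kept.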
I am running my code on a PC and I don't think I have a problem with RAM.
When I run this step:
dataset <- rbind(dataset_1, dataset_2, dataset_3, dataset_4, dataset_5)
I get:
Error: cannot allocate vector of size 261.0 Mb
dataset_1 through dataset_5 have around 5 million observations each.
Could anyone please advise how to solve this problem?
Thank you very much!
There are several packages available that may solve your problem, listed in the High Performance Computing CRAN task view. See the "Large memory and out-of-memory data" section; the ff package is one example.
R, like MATLAB, loads all the data into memory, which means you can quickly run out of RAM (especially with big datasets). The only alternative I can see is to partition your data (i.e. load only part of it), do the analysis on that part, and write the results to files before loading the next chunk.
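A minimal base-R sketch of this partitioning idea, assuming the data sits in a CSV file with a header row and that the analysis is a simple column sum. The function name, the file, and the chunk size are made up for illustration; a real analysis would replace the sum and could write each chunk's result to disk.

```r
# Sum one numeric column of a CSV file while holding at most
# `chunk_rows` data rows in memory at a time.
chunked_column_sum <- function(path, column, chunk_rows = 1000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  # Read the header line once; scan() strips any quotes around names.
  header <- scan(text = readLines(con, n = 1), what = "", sep = ",",
                 quiet = TRUE)
  total <- 0
  repeat {
    lines <- readLines(con, n = chunk_rows)   # next chunk of raw lines
    if (length(lines) == 0) break             # end of file reached
    chunk <- read.csv(text = lines, header = FALSE, col.names = header)
    total <- total + sum(chunk[[column]])     # per-chunk analysis step
  }
  total
}

# Tiny demonstration file with made-up data.
demo_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:10, value = 1:10), demo_path, row.names = FALSE)
chunked_column_sum(demo_path, "value", chunk_rows = 3)
```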
In your case you might want to use Linux tools to merge the datasets.
Say you have two files, dataset1.txt and dataset2.txt; you can merge them using shell commands such as join, cat, or awk.
More generally, using Linux shell tools for parsing big datasets is usually much faster and requires much less memory.
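For readers who want to stay inside R, the same on-disk idea as cat can be sketched with connections: lines are streamed from each input to one output file, so no dataset is ever fully loaded. This is a substitute for the shell tools named above, not the same thing, and it assumes every input is a delimited text file with an identical header row. The function name and demonstration files are hypothetical.

```r
# Append several text files with identical headers into one output file,
# streaming `chunk_lines` lines at a time (similar in spirit to `cat`).
concat_csv_files <- function(inputs, output, chunk_lines = 10000) {
  out <- file(output, open = "w")
  on.exit(close(out))
  for (i in seq_along(inputs)) {
    con <- file(inputs[i], open = "r")
    header <- readLines(con, n = 1)
    if (i == 1) writeLines(header, out)   # keep the header only once
    repeat {
      lines <- readLines(con, n = chunk_lines)
      if (length(lines) == 0) break       # this input is exhausted
      writeLines(lines, out)
    }
    close(con)
  }
  invisible(output)
}

# Demonstration with two small made-up files.
f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, x = c(1, 2, 3)), f1, row.names = FALSE)
write.csv(data.frame(id = 4:5, x = c(4, 5)), f2, row.names = FALSE)
merged_path <- concat_csv_files(c(f1, f2), tempfile(fileext = ".csv"))
merged <- read.csv(merged_path)
nrow(merged)
```

The merged file still has to be read somehow afterwards, so this only avoids the memory peak of rbind itself; it pairs naturally with chunked processing of the result.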