vector memory exhausted (R) workaround? - r

I tried what turned out to be a quite memory-intensive operation in R: writing an xlsx file from a dataset with 500k observations and 2000 variables.
I tried the method explained here. (First comment)
I set the maximum VSIZE to 10 GB. I did not want to go higher because I was afraid of damaging my computer (I saved up for it for a long time :)), but it still did not work.
I then looked up Cloud Computing with R, which I found to be quite difficult as well.
So finally, I wanted to ask here whether anyone could tell me how high I can set VSIZE without damaging my computer, or whether there is another way to solve my problem. (The goal is to convert a SAS file to an xlsx or xls file. The files are between 1.4 GB and 1.6 GB. I have about 8 GB of RAM.) I am open to downloading programs if that's not too complicated.
Cheers.
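A rough back-of-the-envelope estimate may help explain why this operation exhausts memory. The sketch below assumes (this is not stated in the question) that every value ends up stored as an 8-byte double once the data is loaded in R; the variable names are made up for illustration.

```r
# Dimensions taken from the question
n_obs  <- 500e3   # 500k observations
n_vars <- 2000    # 2000 variables

# Assumption: every column is numeric (double), i.e. 8 bytes per value
bytes_per_value <- 8

total_bytes <- n_obs * n_vars * bytes_per_value
total_gib   <- total_bytes / 2^30
total_gib   # about 7.45
```

Under that assumption the data alone would take roughly 7.45 GiB, which is already close to the 8 GB of RAM in the machine before counting any extra copies a file writer might make.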

Related

Understanding Memory Constraints In R [duplicate]

As the title suggests, I am trying to fully understand memory constraints with R because I have a project that is quickly growing in scale, and I am worried that memory constraints will soon become a major issue.
I am aware of object.size, and I get the following output when run on the largest item in my environment:
> object.size(raw.pbp.data)
457552240 bytes
...so the largest item is ~457 MB. I have also checked my MacBook Pro's memory in About This Mac --> Storage, and it shows my Memory as 8 GB 1600 MHz DDR3, so I assume I have 8 GB to work with.
Obviously the 457MB dataframe is not the only object in my R environment, but I do not want to manually run object.size for every single object and add up the bytes to find the total size of memory used. Is there a better way to do this? A function that tells me the memory used in total by all objects in my RStudio Environment would be great. Does such a function exist?
Also, what happens when I get closer to 8GB - is my R script going to stop working? I'm anticipating my data is going to increase by a factor of 5 - 10x in the near future, which will probably bring the total memory used in the environment close-to, or even greater than, 8GB.
Lastly, if hitting 8 GB of memory is going to halt my R script, what are my options? If I convert my data frame into a data.table, could that reduce the size of the object overall?
Any help with this is greatly appreciated, thanks!!
Edit: saved as a .rda file, raw.pbp.data is only 32MB, so that makes me optimistic that there is a way to potentially reduce its size when loaded into R.
I am not aware of any functions, but this works. You could make a function out of this:
# Size (in bytes) of every object in the current environment
sizes <- vapply(eapply(environment(), object.size, USE.NAMES = FALSE), as.numeric, numeric(1))
# Total memory used by those objects, in bytes
sum(sizes)
Besides the obvious (running this on a server or buying more RAM), I've heard data.table is more efficient than data.frame. Try using it. The syntax is more concise too! I cannot recommend data.table enough.
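The snippet above could be wrapped into a function along these lines. This is only a minimal sketch: the name env_size and its default argument are made up for illustration, and object.size gives an estimate per object, so the total is approximate.

```r
# Hypothetical helper: total size, in bytes, of all objects in an environment.
# Defaults to the global environment (what RStudio shows in its Environment pane).
env_size <- function(env = globalenv()) {
  sizes <- vapply(
    ls(env, all.names = TRUE),
    function(nm) as.numeric(object.size(get(nm, envir = env))),
    numeric(1)
  )
  sum(sizes)
}

# Usage: total in bytes, or formatted in megabytes
# env_size()
# format(structure(env_size(), class = "object_size"), units = "MB")
```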

R - Creating new file takes up too much memory

I'm relatively new and poor at R, and am trying to do something that appears to be giving me trouble.
I have several large SpatialPolygonsDataFrame objects that I am trying to combine into one SpatialPolygonsDataFrame. There are 7 of them and they add up to about 5 GB in total. My Mac only has 8 GB of RAM.
When I try to create the aggregate SpatialPolygonsDataFrame, R takes an incredibly long time to run and I have to quit. I presume it is because I do not have sufficient RAM.
My code is simple: aggregate <- rbind(file1,file2,....). Is there a smarter/better way to do this?
Thank you.
I would disagree: a major factor when reading large datasets isn't RAM capacity (although I would suggest upgrading if you can) but read/write speed. In terms of hardware, a 7200 RPM HDD is substantially slower than an SSD. If you are able to install an SSD and use it as your working directory, I would recommend it.

R raster timeseries: what's the most efficient read and write?

I have the following problem/question:
I've written an R function that smooths values from a time series. The time series is defined by a large number of single global raster files, so each pixel is a series with n time steps (generally more than 500). Even though I have plenty of RAM, I have to rely on blockwise processing because loading the entire dataset is just too much. So far so good.
I've written (IMHO) a fairly decent code, which leverages parallel processing when possible. I have a processing machine which should be more than well equipped to handle this amount of data and computation. This leads me to believe that most of the time will be spent by reading lots of values from the disk and then, after smoothing, writing lots of values to the disk.
So I've tried running the code with the files being on either a normal HDD or a normal SSD.
Against my expectations, it didn't really matter much.
Then I tried running a test function which reads a file, gets the values and writes them back to disk with the raster being on either the HDD, the SSD or a blazing fast SSD. Again, no significant difference.
I've already done a fair share of profiling to find bottlenecks, and spent a good amount of time googling for efficient solutions. There are bits of info here and there, but I decided to post this question here to get a definitive answer and maybe some pointers for me and others on how to manage things efficiently.
So without further ado (and for people who skipped the above), here's my question:
In a setting as described above (high data volume, blockwise processing, reading and writing from/to disk), what's the most efficient (and/or fastest) way to do computation on a long raster time series which involves reading and writing values from/to disk? (especially regarding the read write aspect)
Assuming I have a fast SSD, how can I leverage the speed? Is it done automatically?
What are the influencing factors (filesize, filetype, caching) and the most efficient setting of these factors?
I know that in terms of raster, R works the fastest with .grd, but I would like to avoid this format for flexibility, compatibility and diskspace reasons.
Maybe I also have a misconception about how the raster package interacts with the files on disk. In that case, should I use functions other than getValues and writeValues?
-- Some system info and example code: --
Os: Win7 x64
CPU: Xeon E5-1650 @ 3.5 GHz
RAM: 128 GB
R-version: 3.2
Raster file format: .rst
Read/write benchmark function:
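library(raster)

benchfun <- function(x) {
  # x ... path to a raster file
  xr <- raster(x)            # open the raster file
  x2 <- raster(xr)           # new raster with the same geometry, no values
  xval <- getValues(xr)      # read all cell values into memory
  x2 <- setValues(x2, xval)  # assign the values to the new raster
  writeRaster(x2, 'testras.tif', overwrite = TRUE)  # write back to disk
}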
If needed I can also provide a little example code for the time series processing, but for now I don't think it's needed.
Appreciate all tips!
Thanks,
Val
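One way the comparison between drives described above could be timed is sketched below. The helper name time_per_path and the example paths are hypothetical; the helper itself only uses base R and works with any function that takes a file path, such as benchfun.

```r
# Hypothetical helper: call `fun` on each path `reps` times and
# return the median elapsed time (in seconds) per path.
time_per_path <- function(fun, paths, reps = 3L) {
  vapply(paths, function(p) {
    elapsed <- vapply(
      seq_len(reps),
      function(i) system.time(fun(p))[["elapsed"]],
      numeric(1)
    )
    median(elapsed)
  }, numeric(1))
}

# Usage (paths are placeholders, not from the original post):
# time_per_path(benchfun, c("D:/hdd/test.rst", "E:/ssd/test.rst"))
```

Note that benchfun writes testras.tif to the current working directory, so changing only the input path varies the read side; the write side depends on where the working directory lives.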

Forcing R (and Rstudio) to use the virtual memory on Windows

I'm working with large datasets, and quite often R produces an error saying it can't allocate a vector of that size or doesn't have enough memory.
My computer has 16 GB of RAM (Windows 10) and I'm working with datasets of around 4 GB, but some operations need a lot of memory, for example converting a dataset from wide format to long.
In some situations I can use gc() to release some memory, but often it's not enough.
Sometimes I can break the dataset into smaller chunks, but sometimes I need to work with the whole table at once.
I've read that Linux users don't have this problem, but what about Windows?
I've tried setting a large pagefile on a SSD (200GB) but I've found that R doesn't use it at all.
I can see in Task Manager that when memory consumption reaches 16 GB, R stops working. The size of the pagefile doesn't seem to make any difference.
How can I force R to use the pagefile?
Do I need to compile it myself with some special flags?
PS: My experience is that deleting an object with rm() and then calling gc() doesn't recover all the memory. As I perform operations with large datasets my computer has less and less free memory at every step, whether or not I use gc().
PS2: I'd rather not hear trivial solutions like "you need more RAM".
PS3: I've been testing and the problem only happens in RStudio. If I use R directly it works well. Does anybody know how to do it in RStudio?
To get this working automatically every time you start RStudio: the R_MAX_MEM_SIZE solution is ignored, whether it is created as an environment variable or set inside .Rprofile.
Writing memory.limit(64000) is ignored too.
The proper way is adding the following line in the file .Rprofile
invisible(utils::memory.limit(64000))
or whatever number you want.
Of course you need to have a pagefile big enough. That number includes free RAM and free pagefile space.
Using the pagefile is slower but it's going to be used only when needed.
Something strange I've found is that it only lets you increase the maximum memory; it doesn't allow you to decrease it.

R running out memory for large data set

I am running my code on a PC and I don't think I have a problem with the RAM.
When I run this step:
dataset <- rbind(dataset_1, dataset_2,dataset_3,dataset_4,dataset_5)
I got the
Error: cannot allocate vector of size 261.0 Mb
dataset_1 through dataset_5 have around 5 million observations each.
Could anyone please advise how to solve this problem?
Thank you very much!
There are several packages available that may solve your problem under the High Performance Computing CRAN taskview. See "Large memory and out-of-memory data", the ff package, for example.
R, like MATLAB, loads all the data into memory, which means you can quickly run out of RAM (especially with big datasets). The only alternative I can see is to partition your data (i.e. load only part of the data), do the analysis on that part and write the results to files before loading the next chunk.
In your case you might want to use Linux tools to merge the datasets.
Say you have two files dataset1.txt and dataset2.txt, you can merge them using the shell command join, cat or awk.
More generally, using Linux shell tools for parsing big datasets is usually much faster and requires much less memory.
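If shell tools are not available, the same chunk-by-chunk idea can be sketched in base R. The helper below is hypothetical (the name append_csv_files and its arguments are made up for illustration) and assumes the datasets are stored as CSV files that all share the same header line. It stacks the files on disk while holding only a limited number of lines in memory at a time.

```r
# Hypothetical helper: stack several CSV files into one output file,
# reading at most `chunk_rows` lines at a time so memory use stays bounded.
# Assumption: every input file starts with the same header line.
append_csv_files <- function(inputs, output, chunk_rows = 100000L) {
  out <- file(output, open = "w")
  on.exit(close(out), add = TRUE)
  for (i in seq_along(inputs)) {
    con <- file(inputs[i], open = "r")
    header <- readLines(con, n = 1L)
    if (i == 1L) writeLines(header, out)  # keep the header only once
    repeat {
      lines <- readLines(con, n = chunk_rows)
      if (length(lines) == 0L) break      # end of this input file
      writeLines(lines, out)
    }
    close(con)
  }
  invisible(output)
}

# Usage (file names are placeholders):
# append_csv_files(c("dataset1.csv", "dataset2.csv"), "combined.csv")
```

The result is a single file on disk; it still has to be processed in chunks or with an out-of-memory package if it does not fit in RAM as a whole.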

Resources