As the title suggests, I am trying to fully understand memory constraints with R because I have a project that is quickly growing in scale, and I am worried that memory constraints will soon become a major issue.
I am aware of object.size, and I get the following output when run on the largest item in my environment:
> object.size(raw.pbp.data)
457552240 bytes
...so the largest item is ~457 MB. I have also checked my MacBook Pro's memory under About This Mac --> Storage, and it shows my Memory as 8 GB 1600 MHz DDR3, so I assume I have 8 GB of RAM to work with.
Obviously the 457 MB data frame is not the only object in my R environment, but I do not want to manually run object.size on every single object and add up the bytes to find the total memory used. Is there a better way to do this? A function that reports the total memory used by all objects in my RStudio environment would be great. Does such a function exist?
Also, what happens as I get closer to 8 GB - will my R script stop working? I anticipate my data will grow by a factor of 5-10x in the near future, which will probably bring the total memory used in the environment close to, or even greater than, 8 GB.
Lastly, if hitting 8 GB of memory is going to halt my R script from running, what are my options? If I convert my data frame into a data.table, could that reduce the overall size of the object?
Any help with this is greatly appreciated, thanks!!
Edit: saved as a .rda file, raw.pbp.data is only 32MB, so that makes me optimistic that there is a way to potentially reduce its size when loaded into R.
I am not aware of any functions, but this works. You could make a function out of this:
env <- eapply(environment(), object.size, USE.NAMES = FALSE)  # size of every object in the environment
sizes <- numeric(length(env))                                  # pre-allocate instead of growing in the loop
for (i in seq_along(env)) {
  sizes[i] <- env[[i]][1]
}
sum(sizes)  # total bytes used by all objects
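The same idea also fits in a one-liner (globalenv() below is just an assumption about which environment you want to measure):
sum(unlist(eapply(globalenv(), object.size)))   # total bytes across all objects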
Besides the obvious (running this on a server or buying more RAM), I've heard data.table is more efficient than data.frame. Try using it. The syntax is more concise too! I cannot recommend data.table enough.
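As a rough sketch of what that switch could look like (the CSV filename is purely hypothetical; raw.pbp.data is the object from the question):
library(data.table)
setDT(raw.pbp.data)                           # convert an existing data.frame by reference, no copy made
# raw.pbp.data <- fread("play_by_play.csv")   # or read the raw file straight into a data.table (hypothetical file)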
Related
I apologize in advance since this post will not have any reproducible example.
I am using R x64 3.4.2 to run some cross-validated analyses on quite big matrices (number of columns ~ 80000, number of rows between 40 and 180). The analyses involve several feature selection steps (performed with in-house functions or with functions from the CORElearn package, which is written in C++), as well as some clustering of the features and the fitting of an SVM model (by means of the RWeka package, which is written in Java).
I am working on a DELL Precision T7910 machine with two Intel Xeon E5-2695 v3 2.30 GHz processors, 192 GB of RAM and a Windows 7 x64 operating system.
To speed up the running time of my analysis I thought I would use the doParallel package in combination with foreach. I set up the cluster as follows:
cl <- makeCluster(number_of_cores, type='PSOCK')
registerDoParallel(cl)
with number_of_cores set to various numbers between 2 and 10 (detectCores() tells me that I have 56 cores in total).
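To illustrate the structure of the parallel part (a simplified placeholder rather than my actual code; fit_fold stands in for the in-house feature selection, clustering and SVM fitting of one fold):
library(doParallel)   # also attaches foreach
cl <- makeCluster(number_of_cores, type = 'PSOCK')
registerDoParallel(cl)
# one cross-validation fold per iteration; .packages loads the required libraries on each worker
results <- foreach(fold = 1:10, .packages = c('CORElearn', 'RWeka')) %dopar% {
  fit_fold(fold)   # placeholder for the real per-fold work
}
stopCluster(cl)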
My problem is that even when setting number_of_cores to 2, I get a "protection stack overflow" error message. The thing is that I monitor the RAM usage while the script is running, and not even 20 GB of my 192 GB of RAM are being used.
If I run the script in a sequential way it takes its sweet time (~ 3 hours with 42 rows and ~ 80000 columns), but it does run until the end.
I have tried (almost) every trick in the book for good memory management in R:
I am loading and removing big variables as needed in order to reduce memory usage
I am breaking down the steps with functions rather than scripting them directly, to take advantage of scoping
I am calling gc() every time I delete a big object, in order to prompt R to return memory to the operating system
But I am still unable to run the script in parallel.
Does anyone have any suggestions about this? Should I just give up and wait > 3 hours every time I run the analyses? And more generally: how is it possible to have a stack overflow problem when there is plenty of free RAM?
UPDATE
I have now tried to "pseudo-parallelize" the work using the same machine: since I am running a 10-fold cross-validation scheme, I am opening 5 different instances of Rgui and running 2 folds in each instance. Proceeding this way, everything runs smoothly, and the process indeed takes about 10 times less time than running it in a single instance of R. What makes me wonder is that if 10 instances of Rgui can run at the same time and get the job done, the machine evidently has the computational resources needed. Hence I cannot really get my head around the fact that %dopar% with 10 workers does not work.
The "protection stack overflow" means that you have run out of the "protection stack", that is too many pointers have been PROTECTed but not (yet) UNPROTECTed. This could be because of a bug or inefficiency in the code you are running (in native code of a package or in native code of R, but not a bug in R source code).
This problem has nothing to do with the amount of available memory on the heap, so calling gc() will have no impact, and it is not important how much physical memory the machine has. Please do not call gc() explicitly at all, even if there was a problem with the heap usage, it just makes the program run slower but does not help: if there is not enough heap space but it could be obtained by garbage collection, the garbage collector will run automatically. As the problem is the protection stack, neither restructuring the R code nor removing dead variables explicitly will help. In principle, structuring the code into (relatively small) functions is a good thing for maintainability/readability and it also indirectly reduces scope of variables, so removing variables explicitly should become unnecessary.
It might help to increase the pointer protection stack size, which can be done at R startup from the command line using --max-ppsize.
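For example, something along these lines (the script name is a placeholder and 500000 is just an illustrative value):
R --max-ppsize=500000 -f my_analysis.R
Note that if the error is raised inside the PSOCK workers rather than in the master session, the option would have to reach those worker processes as well; I believe the rscript_args argument of makeCluster can pass extra startup options to the workers, but check ?makeCluster.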
I'm relatively new and poor at R, and am trying to do something that appears to be giving me trouble.
I have several large SpatialPolygonsDataFrames that I am trying to combine into one SpatialPolygonsDataFrame. There are 7 of them, and together they come to about 5 GB. My Mac only has 8 GB of RAM.
When I try to create the aggregate SpatialPolygonsDataFrame, R takes an incredibly long time to run and I have to quit out. I presume it is because I do not have sufficient RAM.
My code is simple: aggregate <- rbind(file1, file2, ...). Is there a smarter/better way to do this?
Thank you.
I would disagree: a major component of reading large datasets isn't RAM capacity (although I would suggest that you upgrade if you can) but rather read/write speed. On the hardware side, an HDD at 7200 RPM is substantially slower than an SSD. If you are able to install an SSD and have that as your working directory, I would recommend it.
I have the following problem/question:
I've written an R function which smooths values from a time series. The time series is defined by a large number of single global raster files, so each pixel is a series with n timesteps (generally more than 500). Even though I have plenty of RAM, I have to rely on blockwise processing because loading the entire dataset is just too much. So far so good.
I've written (IMHO) fairly decent code which leverages parallel processing where possible. I have a processing machine which should be more than well equipped to handle this amount of data and computation. This leads me to believe that most of the time is spent reading lots of values from disk and then, after smoothing, writing lots of values back to disk.
So I've tried running the code with the files being on either a normal HDD or a normal SSD.
Against my expectations, it didn't really matter much.
Then I tried running a test function which reads a file, gets the values and writes them back to disk with the raster being on either the HDD, the SSD or a blazing fast SSD. Again, no significant difference.
I've already done a fair share of profiling to find bottlenecks, as well as spent a good amount of time googling for efficient solutions. There are bits of info here and there, but I decided to post this question here to get a definitive answer and maybe some pointers for me and others on how to manage things efficiently.
So without further ado (and for people who skipped the above), here's my question:
In a setting as described above (high data volume, blockwise processing, reading and writing from/to disk), what's the most efficient (and/or fastest) way to do computation on a long raster time series which involves reading and writing values from/to disk? (especially regarding the read/write aspect)
Assuming I have a fast SSD, how can I leverage the speed? Is it done automatically?
What are the influencing factors (filesize, filetype, caching) and the most efficient setting of these factors?
I know that in terms of raster, R works the fastest with .grd, but I would like to avoid this format for flexibility, compatibility and diskspace reasons.
Maybe I'm also having a misconception of how the raster package interacts with the files on disk. In that case, should I use different functions than getValues and writeValues?
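For reference, the blockwise pattern I have in mind looks roughly like this (a simplified sketch: r stands for one input raster layer and the doubling step is a placeholder for the actual smoothing):
library(raster)
blockwise_process <- function(r, outfile) {
  out <- raster(r)                      # output raster with the same geometry
  bs  <- blockSize(r)                   # suggested row blocks for this raster
  out <- writeStart(out, outfile, overwrite = TRUE)
  for (i in seq_len(bs$n)) {
    v <- getValues(r, row = bs$row[i], nrows = bs$nrows[i])   # read one block of rows
    v <- v * 2                                                # placeholder for the smoothing
    out <- writeValues(out, v, bs$row[i])                     # write the processed block
  }
  writeStop(out)
}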
-- Some system info and example code: --
Os: Win7 x64
CPU: Xeon E5-1650 @ 3.5 GHz
RAM: 128 GB
R-version: 3.2
Raster file format: .rst
Read/write benchmark function:
benchfun <- function(x){
  # x ... path to a raster file
  xr <- raster(x)             # open the input raster
  x2 <- raster(xr)            # empty raster with the same geometry
  xval <- getValues(xr)       # read all values into memory
  x2 <- setValues(x2, xval)   # put them into the new raster
  writeRaster(x2, 'testras.tif', overwrite = TRUE)
}
If needed I can also provide a little example code for the time series processing, but for now I don't think it's needed.
Appreciate all tips!
Thanks,
Val
I'm working with large datasets, and quite often R produces an error telling me it can't allocate a vector of that size or that it doesn't have enough memory.
My computer has 16 GB of RAM (Windows 10) and I'm working with datasets of around 4 GB, but some operations need a lot of memory, for example converting the dataset from wide format to long.
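To give an idea of the reshaping step, it is essentially something like this (a simplified placeholder with made-up column names, not my actual data):
library(data.table)
dt <- data.table(id = 1:5, x2010 = rnorm(5), x2011 = rnorm(5), x2012 = rnorm(5))  # toy stand-in for the real table
long <- melt(dt, id.vars = "id", variable.name = "year", value.name = "value")    # wide to long with data.table's melt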
In some situations I can use gc() to release some memory, but many times it's not enough.
Sometimes I can break the dataset into smaller chunks, but sometimes I need to work with the whole table at once.
I've read that Linux users don't have this problem, but what about Windows?
I've tried setting a large pagefile on a SSD (200GB) but I've found that R doesn't use it at all.
I can see in the task manager that when the memory consumption reaches 16 GB, R stops working. The size of the pagefile doesn't seem to make any difference.
How can I force R to use the pagefile?
Do I need to compile it myself with some special flags?
PS: My experience is that deleting an object with rm() and later calling gc() doesn't recover all the memory. As I perform operations with large datasets, my computer has less and less free memory at every step, no matter whether I use gc().
PS2: I expect not to hear trivial solutions like "you need more RAM".
PS3: I've been testing, and the problem only happens in RStudio. If I use R directly it works well. Does anybody know how to do this in RStudio?
To get this working automatically every time you start RStudio: the solution with R_MAX_MEM_SIZE is ignored, whether it is created as an environment variable or set inside the .Rprofile.
Writing memory.limit(64000) is ignored too.
The proper way is to add the following line to the .Rprofile file:
invisible(utils::memory.limit(64000))
or whatever number you want.
Of course you need to have a pagefile big enough. That number includes free RAM and free pagefile space.
Using the pagefile is slower but it's going to be used only when needed.
Something strange I've found is that it only lets you increase the maximum memory to use; it doesn't allow you to decrease it.
I am working with a very large data set which I am downloading from an Oracle database. The data frame has about 21 million rows and 15 columns.
My OS is Windows XP (32-bit) and I have 2 GB of RAM. Short-term, I cannot upgrade my RAM or my OS (it is at work, and it will take months before I get a decent PC).
library(RODBC)
sqlQuery(Channel1, "Select * from table1", stringsAsFactors = FALSE)  # note: the argument is stringsAsFactors
Here I already get stuck with the usual "cannot allocate vector of size x Mb" error.
I found some suggestions about using the ff package. I would appreciate it if anybody familiar with the ff package could tell me whether it would help in my case.
Do you know another way to get around the memory problem?
Would a 64-bit solution help?
Thanks for your suggestions.
If you are working with package ff and have your data in SQL, you can easily get them into ff using package ETLUtils; see the documentation for an example using ROracle.
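A rough sketch of what that looks like via ODBC (argument names quoted from memory, so double-check them against the ETLUtils documentation; the DSN, credentials and query are placeholders):
library(ETLUtils)
# read the query result straight into an ffdf, chunk by chunk,
# so the full table never has to fit in RAM at once
dat <- read.odbc.ffdf(
  query = "SELECT * FROM table1",
  odbcConnect.args = list(dsn = "my_dsn", uid = "user", pwd = "pwd"),
  first.rows = 100000,   # rows fetched in the first chunk
  next.rows  = 100000)   # rows fetched in each subsequent chunk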
In my experience, ff is perfectly suited to the type of dataset you are working with (21 million rows and 15 columns) - in fact your setup is kind of small for ff, unless your columns contain a lot of character data which will be converted to factors (meaning all your factor levels should be able to fit in your RAM).
Packages ETLUtils, ff and the package ffbase allow you to get your data into R using ff and do some basic statistics on it. Depending on what you will do with your data and on your hardware, you might have to consider sampling when you build models. I prefer having my data in R, building a model based on a sample and scoring using the tools in ff (like chunking) or from package ffbase.
The drawback is that you have to get used to the fact that your data are ffdf objects and that might take some time - especially if you are new to R.
In my experience, processing your data in chunks can almost always help greatly in processing big data. For example, if you calculate a temporal mean, only one timestep needs to be in memory at any given time. You already have your data in a database, so obtaining a subset is easy. Alternatively, if you cannot easily process in chunks, you could always try taking a subset of your data. Repeat the analysis a few times to see if your results are sensitive to which subset you take. The bottom line is that some smart thinking can get you a long way with 2 GB of RAM. If you need more specific advice, you need to ask more specific questions.
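As a sketch of the chunked approach with RODBC and Oracle-style ROWNUM paging (the table and connection names come from the question; the chunk size and the processing step are placeholders, and for strictly non-overlapping chunks the inner query should carry a stable ORDER BY):
library(RODBC)
chunk_size <- 1000000L
offset <- 0L
repeat {
  qry <- sprintf(
    "SELECT * FROM (SELECT t.*, ROWNUM rn FROM table1 t WHERE ROWNUM <= %d) WHERE rn > %d",
    offset + chunk_size, offset)
  chunk <- sqlQuery(Channel1, qry, stringsAsFactors = FALSE)
  if (!is.data.frame(chunk) || nrow(chunk) == 0) break    # no more rows (or an error string)
  # ... update running statistics (sums, counts, a model fit on a sample, ...) here ...
  offset <- offset + chunk_size
}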
Sorry I can't help with ff, but on the topic of the RAM: I'm not familiar with the memory usage of R data frames, but for sake of argument let's say each cell takes 8 bytes (e.g. a double-precision float or long integer).
21 million * 15 * 8 bytes = about 2.5 GB.
Update: this figure is probably an underestimate!
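A quick back-of-the-envelope check of that figure in R (8 bytes per numeric cell is the assumption here; character columns and object overhead push the real number higher):
21e6 * 15 * 8 / 1e9   # ~2.5 GB for the raw numeric cells alone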
So you could really do with more RAM, and a 64-bit machine would help a lot as 32-bit machines are limited to 4GB (and can't use that fully).
Might be worth trying a subset of the dataset so you know how much you can load with your existing RAM, then extrapolating to estimate how much you actually need. If you can subdivide the data and process it in chunks, that would be great, but lots of problems don't lend themselves to that approach easily.
Also, I have been assuming that you need all the columns! Obviously, if you can filter the data in any way to reduce the size (e.g. removing any irrelevant columns) then that may help greatly!
There's another very similar question. In particular, one way to handle your data is to write it to a file and then map a memory region to it (see, for example, the mmap package).