I am trying to use dcast in R to generate a matrix, as in another question I asked.
However, I am getting an error:
Error: cannot allocate vector of size 2.8Gb.
My desktop has 8GB of RAM and I am running Ubuntu 11.10, 64-bit. Am I perhaps using the wrong version of R? How would I know? Is there a way to determine this from within a running R session? I surely must have enough memory to allocate this vector.
The error message means that R needs to allocate a further 2.8Gb of memory to complete whatever operation you were trying to perform; it is not the total amount of memory required. Run top in a shell whilst you run that R code and watch how R uses up memory until it hits a point where the extra 2.8Gb of address space is not available.
Do you have a large swap space on the box? I can easily see how what you are doing could use all 8Gb of RAM plus all your swap space, at which point R has nowhere left to get memory from and throws the error.
Perhaps you could try doing the dcast in chunks (a rough sketch follows), or try an alternative approach to dcast. Post another Q if you want help with that.
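One way to chunk it, as a minimal sketch with data.table, assuming a long-format table long with columns id, variable and value (all of these names are hypothetical):

library(data.table)

long   <- as.data.table(long)
ids    <- unique(long$id)
groups <- split(ids, cut(seq_along(ids), breaks = 10))   # 10 batches of ids

pieces <- lapply(groups, function(g) {
  dcast(long[id %in% g], id ~ variable, value.var = "value")
})
wide <- rbindlist(pieces, fill = TRUE)   # columns may differ between batches

Each cast then only needs memory for one batch of ids at a time.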
As the title suggests, I am trying to fully understand memory constraints with R because I have a project that is quickly growing in scale, and I am worried that memory constraints will soon become a major issue.
I am aware of object.size, and I get the following output when run on the largest item in my environment:
> object.size(raw.pbp.data)
457552240 bytes
...so the largest item is ~457MB. I have also checked my MacBook Pro's memory under About This Mac --> Storage, and it shows my Memory as 8 GB 1600 MHz DDR3, so I assume I have 8 GB to work with.
Obviously the 457MB dataframe is not the only object in my R environment, but I do not want to manually run object.size on every single object and add up the bytes to find the total memory used. Is there a better way to do this? A function that reports the total memory used by all objects in my RStudio environment would be great. Does such a function exist?
Also, what happens when I get closer to 8GB - is my R script going to stop working? I'm anticipating my data is going to increase by a factor of 5 - 10x in the near future, which will probably bring the total memory used in the environment close-to, or even greater than, 8GB.
Lastly, if hitting 8GB of memory is going to halt my R script, what are my options? If I convert my dataframe into a data.table, could that reduce the overall size of the object?
Any help with this is greatly appreciated, thanks!!
Edit: saved as a .rda file, raw.pbp.data is only 32MB, so that makes me optimistic that there is a way to potentially reduce its size when loaded into R.
I am not aware of any built-in function, but this works. You could make a function out of it:
# sizes of every object in the current environment
env <- eapply(environment(), object.size, USE.NAMES = FALSE)

# extract the numeric byte counts and add them up
sizes <- vapply(env, as.numeric, numeric(1))
sum(sizes)
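Wrapped up as a small helper (the name env_size is just illustrative):

env_size <- function(env = globalenv()) {
  # total size, in bytes, of all objects in the given environment
  sum(vapply(eapply(env, object.size), as.numeric, numeric(1)))
}

env_size()   # total bytes used by the objects in your global environment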
Besides the obvious (running this on a server or buying more RAM), I've heard data.table is more efficient than data.frame. Try using it. The syntax is more concise too! I cannot recommend data.table enough.
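A minimal sketch of switching over, using the raw.pbp.data object from the question (the file name below is only a placeholder):

library(data.table)

# convert the existing data.frame in place, without making a second copy
setDT(raw.pbp.data)

# for new files, fread() reads straight into a data.table and is typically
# faster and lighter on memory than read.csv
# raw.pbp.data <- fread("plays.csv")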
I just installed R version 3.5.0 and according to this article on Revolution Analytics there is a new internal representation of vectors.
When I do the following I either get no result at all (see the following example) or the whole computer freezes for good:
> x <- 1:1e9
> c(x, "a")
>
So it seems that a check is missing that would catch the failed allocation in such cases (or at least give a warning).
My question
Is this a reproducible bug?
The same sequence of statements causes R to (apparently) hang in 3.4.x as well. You are creating a character object that requires at least 8Gb of RAM, which may take a while if it completes at all.
On R 3.4.3 I get the message "Error: cannot allocate a vector of size 7.5Gb", which is what I expect. On R 3.5.0 the message is "cannot allocate a vector of size 128.0Mb". That size is misleading: R 3.5.0 is still trying to create an 8Gb object here. Either way, the long wait and the ultimate failure are not surprising.
Your statement does work as expected for smaller object sizes.
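For instance, the same statement on a small vector completes instantly:

x <- 1:10
c(x, "a")   # the integers are coerced to character: "1" "2" ... "10" "a"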
I'm working with large datasets, and quite often R produces an error saying it can't allocate a vector of that size or that it doesn't have enough memory.
My computer has 16GB of RAM (Windows 10) and I'm working with datasets of around 4GB, but some operations need a lot of memory, for example converting a dataset from wide format to long.
In some situations I can use gc() to release some memory, but many times it's not enough.
Sometimes I can break the dataset into smaller chunks, but sometimes I need to work with the whole table at once.
I've read that Linux users don't have this problem, but what about Windows?
I've tried setting a large pagefile on an SSD (200GB), but I've found that R doesn't use it at all.
I can see in Task Manager that when memory consumption reaches 16GB, R stops working. The size of the pagefile doesn't seem to make any difference.
How can I force R to use the pagefile?
Do I need to compile it myself with some special flags?
PS: My experience is that deleting an object with rm() and later calling gc() doesn't recover all the memory. As I perform operations with large datasets, my computer has less and less free memory at every step, no matter whether I use gc().
PS2: I'd rather not hear trivial solutions like "you need more RAM".
PS3: I've been testing, and the problem only happens in RStudio. If I use R directly it works fine. Does anybody know how to make it work in RStudio?
To get this working automatically every time you start RStudio: the R_MAX_MEM_SIZE approach is ignored, whether you set it as an environment variable or inside .Rprofile.
A plain memory.limit(64000) call is ignored too.
The proper way is to add the following line to your .Rprofile file:
invisible(utils::memory.limit(64000))
or whatever number you want.
Of course you need a pagefile big enough; the number covers both free RAM and free pagefile space.
Using the pagefile is slower, but it is only used when needed.
Something strange I've found is that it only lets you increase the maximum memory; it doesn't allow you to decrease it.
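A quick way to check that the setting was picked up after restarting RStudio:

memory.limit()   # with no argument it simply reports the current limit, in Mb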
I am running my code on a PC and I don't think I have a problem with RAM.
When I run this step:
dataset <- rbind(dataset_1, dataset_2, dataset_3, dataset_4, dataset_5)
I get the error:
Error: cannot allocate vector of size 261.0 Mb
dataset_1 through dataset_5 have around 5 million observations each.
Could anyone please advise how to solve this problem?
Thank you very much!
There are several packages that may solve your problem; see the High Performance Computing CRAN task view, in particular the "Large memory and out-of-memory data" section (the ff package, for example).
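A rough sketch of the ff approach (the file name is only a placeholder):

library(ff)

big <- read.csv.ffdf(file = "dataset_1.csv")   # data is kept on disk, not in RAM
dim(big)                                       # but it can be inspected like a data.frame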
R, like MATLAB, loads all of the data into memory, which means you can quickly run out of RAM (especially with big datasets). The only alternative I can see is to partition your data (i.e. load only part of it at a time), do the analysis on that part, and write the results to a file before loading the next chunk.
In your case you might want to use Linux tools to merge the datasets.
Say you have two files dataset1.txt and dataset2.txt; you can merge them using shell commands such as join, cat or awk.
More generally, using Linux shell tools for parsing big datasets is usually much faster and requires much less memory.
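If you prefer to stay inside R, here is a rough sketch of the chunked approach, assuming a large CSV file big_data.csv (the file name and chunk size are placeholders):

con    <- file("big_data.csv", open = "r")
header <- strsplit(readLines(con, n = 1), ",")[[1]]   # keep the column names

repeat {
  chunk <- tryCatch(
    read.csv(con, header = FALSE, col.names = header, nrows = 100000),
    error = function(e) NULL)           # read.csv errors once the file is exhausted
  if (is.null(chunk) || nrow(chunk) == 0) break

  # ... analyse this chunk and append its results to an output file ...

  if (nrow(chunk) < 100000) break       # that was the last (partial) chunk
}
close(con)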
I have run a rather large bootstrap in R with the boot package.
When I first ran boot() I got this:
Error: cannot allocate vector of size 2.8 Gb
So, to get the boot object I had to use simple=TRUE, which tells boot() not to allocate all the memory at the beginning (according to ?boot). This worked fine, though it took a few minutes.
Now I need to get confidence intervals:
> boot.ci(vpe.bt, type="bca", simple=TRUE)
Error: cannot allocate vector of size 2.8 Gb
Same problem! But according to ?boot.ci, there is no simple=TRUE option for this function (I've tried it anyway).
So, is there any way around this using boot.ci()?
And, if not, what can I do to increase the amount of memory it can use?
Calculating BCa (adjusted bootstrap percentile) confidence intervals in R requires the creation of an "importance array" whose dimensions are (number of observations) x (number of replicates). If you don't have enough memory to hold at least two copies of such a matrix, the function will not work.
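As a rough sense of the scale involved (illustrative numbers, not taken from the question), an array of 375,000 observations by 1,000 replicates stored as doubles is already about the 2.8Gb the error message reports:

375000 * 1000 * 8 / 1024^3   # bytes of doubles, in Gb -> roughly 2.8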
However, normal-based (type = "norm") and percentile-based (type = "perc") confidence intervals should work.
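For example, with the vpe.bt object from the question:

boot.ci(vpe.bt, type = "perc")   # percentile intervals, without the big importance array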
I don't know about boot.ci, but I've had similar problems with large vectors on my 32-bit Ubuntu system. 32-bit systems have a limited address space; this is resolved on 64-bit systems.
There are some downsides to 64-bit, the main one being that it still isn't standard and not every software provider ships a 64-bit build of their software; the last I've heard, Flash Player only has a beta version for 64-bit. This can usually be worked around by installing a compatibility library that lets you run 32-bit software on a 64-bit system (although with a performance penalty).
These resources might shed some more light on the issue:
http://stat.ethz.ch/R-manual/R-devel/library/base/html/Memory-limits.html
https://help.ubuntu.com/community/32bit_and_64bit