Bug in coercion of large vectors?

I just installed R version 3.5.0 and according to this article on Revolution Analytics there is a new internal representation of vectors.
When I do the following I either get no result at all (see the following example) or the whole computer freezes for good:
> x <- 1:1e9
> c(x, "a")
>
So it seems that there is some routine missing which catches an overflow error in such cases (or at least gives a warning).
My question
Is this a reproducible bug?

The same sequence of statements causes R to (apparently) hang in 3.4.x as well. You are creating a character object that requires at least 8Gb of RAM, which may take a while if it completes at all.
On R 3.4.3 I get the message "Error: cannot allocate a vector of size 7.5Gb", which I expect. On R 3.5.0 the message is "cannot allocate a vector of size 128.0Mb". The size is incorrect: R 3.5.0 is still trying to create an 8Gb object here. But the wait and ultimate failure is not surprising.
Your statement does work as expected for smaller object sizes.
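For comparison, here is a minimal sketch of the same coercion at a small size; the point is that c() promotes the entire integer vector to character, and a character vector of 1e9 elements needs one pointer per element (8 bytes each on a 64-bit build) before counting the string data itself, which is where the roughly 8Gb figure above comes from:
x <- 1:10
c(x, "a")
# [1] "1"  "2"  "3" ... "10" "a"   (every element is now a string)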

Related

How to solve "Error: cannot allocate vector of size XX" without using memory.size and memory.limit

I'm running a very simple code in R (using RStudio) that uses an already coded function.
When using the function, I get the classic error:
"Error: cannot allocate vector of size XX",
because one of the inputs is a "large" vector for the purposes of the function.
I have looked for solutions, but they all point towards using memory.size() and memory.limit(). The problem is that I'm working on a server, so those functions are not available (they are Windows-only). Since I'm working on a server, I should in principle have no problem with memory (the available memory is far larger than what R says it cannot allocate).
Any suggestions would be extremely useful, thanks!!
EDIT: this is the code:
rm(list = ls())
library(readstata13)
library(devtools)
library(csranks)
library(dplyr)
k1 <- read.dta13("k1.dta")
gc()
CS_simul <- cstauworst(k1$K1, k1$se1, tau=10, R=5, seed=101, na.rm=TRUE)
cstauworst is a function that is contained in the library csranks. The data k1 is "small" (less than a MB, around 60k obs) but large for the purposes of the function. The algorithm requires using the whole data simultaneously, so I cannot run it piecewise or parallelize it.
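(Not part of the original question, but a rough diagnostic sketch for the server case, assuming a Linux box with the usual free and ulimit tools: the first thing to check is whether the R process is being capped by a shell or scheduler limit rather than by physical RAM.)
gc()                 # how much memory this R session has actually used so far
system("free -h")    # physical RAM and swap available on the server
system("ulimit -v")  # per-process virtual memory cap; "unlimited" if none is set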

Use tm's Corpus function with big data in R

I'm trying to do text mining on big data in R with tm.
I run into memory issues frequently (such as "cannot allocate vector of size ...") and use the established methods of troubleshooting those issues, such as:
using 64-bit R
trying different OS's (Windows, Linux, Solaris, etc)
setting memory.limit() to its maximum
making sure that sufficient RAM and compute power are available on the server (which they are)
making liberal use of gc()
profiling the code for bottlenecks
breaking up big operations into multiple smaller operations
However, when trying to run Corpus on a vector of a million or so text fields, I encounter a slightly different memory error than usual, and I'm not sure how to work around the problem. The error is:
> ds <- Corpus(DataframeSource(dfs))
Error: memory exhausted (limit reached?)
Can (and should) I run Corpus incrementally on blocks of rows from that source dataframe then combine the results? Is there a more efficient way to run this?
The size of the data that will produce this error depends on the computer running it, but if you take the built-in crude dataset and replicate the documents until it's large enough, then you can replicate the error.
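A rough sketch of that replication idea, assuming a recent tm in which DataframeSource expects doc_id and text columns (scale the replication factor to what your machine can hold):
library(tm)
data("crude")                                    # 20 Reuters articles shipped with tm
txt <- unlist(lapply(crude, as.character))       # plain character vector of the text
big <- rep(txt, length.out = 1e6)                # replicate to ~1e6 "documents"
dfs <- data.frame(doc_id = seq_along(big), text = big, stringsAsFactors = FALSE)
ds  <- Corpus(DataframeSource(dfs))              # this call is what eventually exhausts memory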
UPDATE
I've been experimenting with trying to combine smaller corpora, i.e.
test1 <- dfs[1:10000,]
test2 <- dfs[10001:20000,]
ds.1 <- Corpus(DataframeSource(test1))
ds.2 <- Corpus(DataframeSource(test2))
and while I haven't been successful, I did discover tm_combine, which is supposed to solve this exact problem. The only catch is that my 64-bit build of R 3.1.1 with the newest version of tm can't find the function tm_combine. Perhaps it was removed from the package for some reason? I'm investigating...
> require(tm)
> ds.12 <- tm_combine(ds.1,ds.2)
Error: could not find function "tm_combine"
I don't know if tm_combine became deprecated or why it's not found in the tm namespace, but I did find a solution by using Corpus on smaller chunks of the dataframe and then combining them.
This StackOverflow post had a simple way to do that without tm_combine:
test1 <- dfs[1:100000,]
test2 <- dfs[100001:200000,]
ds.1 <- Corpus(DataframeSource(test1))
ds.2 <- Corpus(DataframeSource(test2))
#ds.12 <- tm_combine(ds.1,ds.2) ##Error: could not find function "tm_combine"
ds.12 <- c(ds.1,ds.2)
which gives you:
ds.12
<<VCorpus (documents: 200000, metadata (corpus/indexed): 0/0)>>
Sorry not to figure this out on my own before asking. I tried and failed with other ways of combining objects.

Why is R slowing down as time goes on, when the computations are the same?

So I think I don't quite understand how memory works in R. I've been running into problems where the same piece of code gets slower later in the week (using the same R session - sometimes even when I clear the workspace). I've tried to develop a toy problem that I think reproduces the "slowing down" effect I have been observing when working with large objects. Note that the code below is somewhat memory intensive (don't blindly run it without adjusting n and N to match what your setup can handle). Also note that it will likely take about 5-10 minutes before you start to see the slowdown (possibly even longer).
N=4e7 #number of simulation runs
n=2e5 #number of simulation runs between calculating time elapsed
meanStorer = rep(0, N)   # pre-allocated storage for every result
toc = rep(0, N/n)        # elapsed time for each block of n iterations
x = rep(0, 50)           # pre-allocated scratch vector
for (i in 1:N) {
  if (i %% n == 1) { tic = proc.time()[3] }   # start timing a new block
  x[] = runif(50)                             # overwrite x in place
  meanStorer[i] = mean(x)
  if (i %% n == 0) { toc[i/n] = proc.time()[3] - tic; print(toc[i/n]) }   # record block time
}
plot(toc)
meanStorer is certainly large, but it is pre-allocated, so I am not sure why the loop slows down as time goes on. If I clear my workspace and run this code again it will start just as slow as the last few calculations! I am using Rstudio (in case that matters). Also here is some of my system information
OS: Windows 7
System Type: 64-bit
RAM: 8gb
R version: 2.15.1 ($platform yields "x86_64-pc-mingw32")
Here is a plot of toc, prior to using pre-allocation for x (i.e. using x=runif(50) in the loop)
Here is a plot of toc, after using pre-allocation for x (i.e. using x[]=runif(50) in the loop)
Is ?rm not doing what I think it's doing? What's going on under the hood when I clear the workspace?
Update: with the newest version of R (3.1.0), the problem no longer persists even when increasing N to N=3e8 (note R doesn't allow vectors too much larger than this)
It is quite unsatisfying that the fix is just updating R to the newest version, because I can't figure out why there were problems in version 2.15. It would still be nice to know what caused them, so I am going to leave this question open.
As you state in your updated question, the high-level answer is that you are using an old version of R with a bug; with the newest version of R (3.1.0), the problem no longer persists.
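On the side question about rm(): it only removes the binding from the workspace; the memory itself is reclaimed later by the garbage collector, which gc() forces. A minimal illustration:
big <- numeric(5e7)   # roughly 400Mb of doubles
rm(big)               # removes the name from the workspace, nothing more
gc()                  # runs the garbage collector; the 400Mb is now reusable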

R memory limit warning vs "unable to allocate..."

Does a memory warning affect my R analysis?
When running a large data analysis script in R I get a warning something like:
In '...': reached total allocation of ___Mb: see help...
But my script continues without error, just the warning. With other data sets I get an error something like:
Error: cannot allocate vector of size ___Mb:
I know the error breaks my data analysis, but is there anything wrong with just getting the warning? I have not noticed anything missing in my data set but it is very large and I have no good means to check everything. I am at 18000Mb allocated to memory and cannot reasonably allocate more.
Way back in the R 2.5.1 news I found this reference to memory allocation warnings:
malloc.c has been updated to version 2.8.3. This version has a slightly different allocation strategy, and is likely to work a little better close to address space limits but may give more warnings about reaching the total allocation before successfully allocating.
Based on this note, I hypothesize (without any advanced knowledge of the inner implementation) that the warning is given when the memory allocation call in R (malloc.c) failed an attempt to allocate memory. Multiple attempts are made to allocate memory, possibly using different methods, and possibly with calls to the garbage collector. Only when malloc is fairly certain that the allocation cannot be made will it return an error.
Warnings do not compromise existing R objects. They just inform the user that R is nearing the limits of computer memory.
(I hope a more knowledgeable user can confirm this...)
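If the uncertainty is a concern, one option is to make R stop as soon as such a warning appears, so nothing keeps running in a memory-starved session; a small sketch using base options():
options(warn = 2)   # promote every warning, including the allocation one, to an error
# ... run the analysis ...
options(warn = 0)   # restore the default warning behaviour afterwards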

Cannot allocate vector in R despite being in 64-bit version

I am trying to do a dcast in R to generate a matrix as seen in another question I asked
However, I am getting an error:
Error: cannot allocate vector of size 2.8Gb.
My desktop has 8GB of RAM and I am running Ubuntu 11.10, 64-bit. Am I perhaps using the wrong version of R? How would I know? Is there a way to determine it while running R? I surely must have the necessary space to allocate this vector.
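On the side question of how to tell from a running session whether this is a 64-bit build of R, a quick check using standard base R introspection is:
.Machine$sizeof.pointer   # 8 on a 64-bit build, 4 on a 32-bit build
R.version$arch            # e.g. "x86_64"
sessionInfo()             # the platform string also shows the architecture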
The error message means that R needs to allocate another 2.8Gb of memory to complete whatever operation you were trying to perform; it doesn't mean 2.8Gb is the total it needed. Run top in a shell whilst you run that R code and watch how R uses up memory until it hits the point where the extra 2.8Gb of address space is not available.
Do you have a large swap space on the box? I can easily see how what you are doing could use all 8Gb of RAM plus all your swap space, leaving R nowhere else to get memory from, which is why it throws the error.
Perhaps you could try doing the dcast in chunks, or try an alternative approach to dcast. Post another Q if you want help with that.
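For the chunking idea, a rough sketch, assuming reshape2's dcast on a long data frame (here called longdat, a made-up name) with columns id, variable and value, where every id carries the same set of variables so the chunks share identical columns; adjust the names and formula to your actual data:
library(reshape2)
ids    <- unique(longdat$id)                                   # cast one group of ids at a time
groups <- split(ids, cut(seq_along(ids), breaks = 10, labels = FALSE))
pieces <- lapply(groups, function(g)
  dcast(longdat[longdat$id %in% g, ], id ~ variable, value.var = "value"))
wide   <- do.call(rbind, pieces)                               # each id is in exactly one chunk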

Resources