I have the following R code:
data <- read.csv('testfile.data', header = T)
mat = as.matrix(data)
Some more statistics of my testfile.data:
> ncol(data)
[1] 75713
> nrow(data)
[1] 44771
Since this is a large dataset, I am using an Amazon EC2 instance with 64 GB of RAM, so hopefully memory isn't an issue. I am able to load the data (the first line works).
But the as.matrix() transformation (the second line) throws the following error:
resulting vector exceeds vector length limit in 'AnswerType'
Any clue what might be the issue?
As noted, the development version of R supports vectors larger than 2^31-1. This is more-or-less transparent, for instance
> m = matrix(0L, .Machine$integer.max / 4, 5)
> length(m)
[1] 2684354555
This is with
> R.version.string
[1] "R Under development (unstable) (2012-08-07 r60193)"
Large objects consume a lot of memory (62.5% of my 16G, for my example) and to do anything useful requires several times that memory. Further, even simple operations on large data can take appreciable time. And many operations on long vectors are not yet supported
> sum(m)
Error: long vectors not supported yet:
/home/mtmorgan/src/R-devel/src/include/Rinlinedfuns.h:100
So it often makes sense to process data in smaller chunks by iterating through a larger file. This gives full access to R's routines, and allows parallel evaluation (via the parallel package). Another strategy is to down-sample the data, which should not be too intimidating to a statistical audience.
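As a rough sketch of that chunked approach, under the file layout from the question ('testfile.data', comma-separated with an unquoted header row), and with the per-chunk work here (running column sums) standing in for whatever statistic you actually need:
con <- file("testfile.data", open = "r")
header <- strsplit(readLines(con, n = 1), ",")[[1]]
totals <- numeric(length(header))
repeat {
  chunk <- tryCatch(
    read.csv(con, header = FALSE, col.names = header, nrows = 10000),
    error = function(e) NULL)                    # read.csv errors once the connection is exhausted
  if (is.null(chunk) || nrow(chunk) == 0) break
  totals <- totals + colSums(as.matrix(chunk))   # example per-chunk work
}
close(con)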
Your matrix has more elements than the maximum vector length of 2^31-1. This is a problem because a matrix is just a vector with a dim attribute. read.csv works because it returns a data.frame, which is a list of vectors.
R> 75713*44771 > 2^31-1
[1] TRUE
See ?"Memory-limits" for more details.
When a relatively large matrix is created, RStudio marks it as a "Large Matrix" in its Environment pane:
x <- matrix(rnorm(10000 * 5000), ncol=5000)
# Large matrix (50000000 elements, 381.5 Mb)
The mode() function as expected returns "numeric" for this object:
mode(x)
## [1] "numeric"
If however I run the following command:
mode(x) <- "numeric"
RStudio changes "Large Matrix" into a regular numeric matrix:
# x: num [1:10000, 1:5000]
So what is the difference between these two objects? Does this difference exist only in RStudio, or are the two objects different in R as well?
In my understanding, a "Large Matrix" and a matrix are the same matrix object. What differs is how the objects are displayed in RStudio's global environment pane.
RStudio also distinguishes between vectors and large vectors. Consider the following vector:
n <- 256
v1 <- rnorm(n*n-5)
This vector is listed as a large vector. Now, let's decrease its size by one:
v2 <- rnorm(n*n-6)
Suddenly, it becomes a normal vector. The structure of both objects is the same (which can be verified by running str). So is their class and storage mode. What is different then? Notice that the size of v2 in memory is exactly 512 kB.
lobstr::obj_size(v2)
>524,288 B # or exactly 512 kB
The size of v1 is slightly greater:
lobstr::obj_size(v1)
>524,296 B # or 512.0078125 KB
As far as I understand (correct me if I am wrong), RStudio simply displays objects larger than 512 kB differently, as a convenience.
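A quick way to convince yourself that the two vectors differ only by one element, and not in type, class, or structure (lobstr is used only to compare sizes, as above):
n <- 256
v1 <- rnorm(n*n - 5)                 # shown as "Large vector" in RStudio
v2 <- rnorm(n*n - 6)                 # shown as a plain numeric vector
identical(class(v1), class(v2))      # TRUE
identical(typeof(v1), typeof(v2))    # TRUE, both are plain doubles
as.numeric(lobstr::obj_size(v1)) - as.numeric(lobstr::obj_size(v2))   # 8 bytes: one double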
I have a script that makes an ffdf object:
library(ff)
library(ffbase)
setwd("D:/My_package/Personal/R/reading")
x<-cbind(rnorm(1:100000000),rnorm(1:100000000),1:100000000)
system.time(write.csv2(x,"test.csv",row.names=FALSE))
system.time(x <- read.csv2.ffdf(file="test.csv", header=TRUE, first.rows=1000, next.rows=10000,levels=NULL))
Now I want to increase column #1 of x by 5.
To do this I use the add() method from the ff package:
add(x[,1],5)
The output is OK (column #1 is increased by 5), but the extra RAM allocation is disastrous: it looks as if I am operating on the entire data frame in RAM rather than on an ffdf object.
So my question is about the correct way to work with the elements of an ffdf object without drastic extra RAM allocation.
You can just do the following:
require(ffbase)   # also loads ff
x <- ff(1:10)     # an ff vector, stored on disk
y <- x + 5        # "+" is dispatched for ff vectors; y is again an ff vector
x
y
ffbase implements the standard arithmetic operations for ff vectors; see help("+.ff_vector").
I have used a chunked approach to do the arithmetic without extra RAM overhead (see the initial script in the question above):
chunk_size <- 100
m <- numeric(chunk_size)                      # (not actually used below)
chunks <- chunk(x, length.out = chunk_size)   # row ranges covering the whole ffdf
system.time(
  for (i in seq_along(chunks)) {
    # read one chunk into RAM, update column 1, write it back to the ffdf
    x[chunks[[i]], ][[1]] <- x[chunks[[i]], ][[1]] + 5
  }
)
x
Now I have increased each element of column #1 of x by 5 without significant RAM allocation.
The chunk_size value also controls the number of chunks: the more chunks are used, the smaller the RAM overhead, at the cost of longer processing time.
A brief example and explanation of chunks in ffdf can be found here:
https://github.com/demydd/R-for-Big-Data/blob/master/09-ff.Rmd
Anyway, it would be nice to hear about alternative approaches.
I am trying to run this simple code over data, which is a data frame of 800 features and 200,000 observations.
It is simple code that I have always used:
C <- ncol(data)
for (i in 1:C){
print(i)
data[is.na(data[,i]),i] <- mean(data[,i], na.rm=T)
}
returns:
[1] 1
Error: cannot allocate vector of size 1.6 Mb
I don't really understand why, because I can compute the mean of a feature on its own without any error. Any ideas?
That error means you are running out of memory to compute the means.
Sometimes, depending on the number of references to an object, R will copy the object, make the change to the copy, and then rebind the original name to the copy. That is likely what is happening in your case.
I recommend the data.table package, which lets you modify columns by reference, without copying.
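For example, here is a minimal sketch using data.table's set() to impute column means in place (it assumes, as in your loop, that every column of data is numeric):
library(data.table)
DT <- as.data.table(data)   # one conversion; subsequent updates happen by reference
for (j in names(DT)) {
  miss <- which(is.na(DT[[j]]))
  if (length(miss) > 0)
    set(DT, i = miss, j = j, value = mean(DT[[j]], na.rm = TRUE))
}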
I've recently started using R for data analysis. Now I have a problem ranking a big query dataset (~1 GB in ASCII form, which exceeds my laptop's 4 GB of RAM in binary form). Using bigmemory::big.matrix for this dataset is a nice solution, but passing such a matrix m to the gbm() or randomForest() algorithms causes the error:
cannot coerce class 'structure("big.matrix", package = "bigmemory")' into a data.frame
class(m) outputs the following:
[1] "big.matrix"
attr(,"package")
[1] "bigmemory"
Is there a way to correctly pass a big.matrix instance into these algorithms?
I obviously can't test this using data of your scale, but I can reproduce your errors by using the formula interface of each function:
require(bigmemory)
m <- matrix(sample(0:1,5000,replace = TRUE),1000,5)
colnames(m) <- paste("V",1:5,sep = "")
bm <- as.big.matrix(m,type = "integer")
require(gbm)
require(randomForest)
#Throws error you describe
rs <- randomForest(V1~.,data = bm)
#Runs without error (with a warning about the response only having two values)
rs <- randomForest(x = bm[,-1],y = bm[,1])
#Throws error you describe
rs <- gbm(V1~.,data = bm)
#Runs without error
rs <- gbm.fit(x = bm[,-1],y = bm[,1])
Not using the formula interface for randomForest is fairly common advice for large data sets, since the formula interface can be quite inefficient. If you read ?gbm, you'll see a similar recommendation steering you towards gbm.fit for large data as well.
Numeric objects often occupy more memory than they do disk space: each "double" element in a vector or matrix takes 8 bytes, and when you coerce an object to a data.frame it may need to be copied in RAM. You should avoid functions and data structures outside those supported by the bigmemory/big*** suite of packages. "biglm" is available, but I doubt you can expect gbm() or randomForest() to recognize and use the facilities of the "big" family.
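As a quick illustration of the 8-bytes-per-double arithmetic (a toy vector, nothing from your data):
x <- numeric(1e6)
print(object.size(x), units = "MB")   # about 7.6 Mb: one million doubles at 8 bytes each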
I have created a DocumentTermMatrix that contains 1859 documents (rows) and 25722 terms (columns). In order to perform further calculations on this matrix I need to convert it to a regular matrix, which I want to do with the as.matrix() command. However, it returns the following error: cannot allocate vector of size 364.8 MB.
> corp
A corpus with 1859 text documents
> mat<-DocumentTermMatrix(corp)
> dim(mat)
[1] 1859 25722
> is(mat)
[1] "DocumentTermMatrix"
> mat2<-as.matrix(mat)
Error: cannot allocate vector of size 364.8 MB   # message originally in German
> object.size(mat)
5502000 bytes
For some reason the size of the object seems to increase dramatically whenever it is transformed to a regular matrix. How can I avoid this?
Or is there an alternative way to perform regular matrix operations on a DocumentTermMatrix?
The quick and dirty way is to export your data into a sparse matrix object from an external package like Matrix.
> attributes(dtm)
$names
[1] "i" "j" "v" "nrow" "ncol" "dimnames"
$class
[1] "DocumentTermMatrix" "simple_triplet_matrix"
$Weighting
[1] "term frequency" "tf"
The dtm object has i, j and v components, which are the internal (triplet) representation of your DocumentTermMatrix. Use:
library("Matrix")
mat <- sparseMatrix(
i=dtm$i,
j=dtm$j,
x=dtm$v,
dims=c(dtm$nrow, dtm$ncol)
)
and you're done.
A naive comparison between your objects:
> mat[1,1:100]
> head(as.vector(dtm[1,]), 100)
will each give you the exact same output.
DocumentTermMatrix uses a sparse matrix representation, so it doesn't take up all that memory storing all those zeros. Depending on what you want to do, you might have some luck with the SparseM package, which provides some linear algebra routines using sparse matrices.
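Incidentally, the 364.8 MB in the error is exactly what a dense double matrix of your DTM's dimensions requires, which you can check directly:
1859 * 25722 * 8 / 2^20   # ~364.8 MB once every cell is an 8-byte double,
                          # versus the ~5.2 MB object.size() reports for the sparse DTM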
Are you able to increase the amount of RAM available to R? See this post: Increasing (or decreasing) the memory available to R processes
Also, when working with big objects in R, I occasionally call gc() to free up wasted memory.
The number of documents should not be a problem, but you may want to try removing sparse terms; this could very well reduce the dimension of the document-term matrix.
inspect(removeSparseTerms(dtm, 0.7))
This removes terms whose sparsity exceeds 0.7, i.e. terms absent from more than 70% of the documents.
Another option is to specify a minimum word length and minimum document frequency when you create the document-term matrix:
a.dtm <- DocumentTermMatrix(a.corpus, control = list(weighting = weightTfIdf, minWordLength = 2, minDocFreq = 5))
Use inspect(dtm) before and after your changes and you will see a huge difference; more importantly, you won't ruin significant relations hidden in your documents and terms.
Since you only have 1859 documents, the distance matrix you need to compute is fairly small. Using the slam package (and in particular, its crossapply_simple_triplet_matrix function), you might be able to compute the distance matrix directly, instead of converting the DTM into a dense matrix first. This means that you will have to compute the Jaccard similarity yourself. I have successfully tried something similar for the cosine distance matrix on a large number of documents.
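For instance, here is a hedged sketch of that computation using slam's tcrossprod_simple_triplet_matrix and row_sums (close relatives of crossapply_simple_triplet_matrix); it assumes the DocumentTermMatrix is called dtm, as in the earlier answer, and treats each document as the set of terms it contains:
library(slam)
bin <- dtm                                        # keep the sparse triplet form
bin$v <- rep(1, length(bin$v))                    # binarise: 1 if a term occurs in a document
common <- tcrossprod_simple_triplet_matrix(bin)   # dense 1859 x 1859: shared terms per document pair
sizes <- row_sums(bin)                            # number of distinct terms per document
jaccard <- common / (outer(sizes, sizes, "+") - common)   # intersection over union (Jaccard similarity)
jaccard_dist <- 1 - jaccard                       # turn similarity into a distance, if needed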