When a relatively large matrix is created, RStudio marks it as a "Large Matrix" in its Environment pane:
x <- matrix(rnorm(10000 * 5000), ncol=5000)
# Large matrix (50000000 elements, 381.5 Mb)
The mode() function as expected returns "numeric" for this object:
mode(x)
## [1] "numeric"
If, however, I run the following command:
mode(x) <- "numeric"
RStudio changes the "Large Matrix" into a regular numeric matrix:
# x: num [1:10000, 1:5000]
So what is the difference between these two objects? Does this difference exist only in RStudio, or are the two objects also different in R itself?
In my understanding, the "Large Matrix" and the regular matrix are the same object; what differs is only how they are displayed in RStudio's global environment pane.
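A quick sketch to convince yourself of this: mode<- preserves attributes, so for a matrix that is already numeric the assignment changes nothing about the object itself.
y <- x                 # a copy of the "Large Matrix"
mode(y) <- "numeric"   # the assignment that makes the "Large" label disappear
identical(x, y)        # TRUE: same values, same attributes, the same object as far as R is concerned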
RStudio also distinguishes between vectors and large vectors. Consider the following vector:
n <- 256
v1 <- rnorm(n*n-5)
This vector is listed as a large vector. Now, let's decrease its size by one:
v2 <- rnorm(n*n-6)
Suddenly, it becomes a normal vector. The structure of both objects is the same (which can be verified by running str), and so are their class and storage mode. What is different, then? Notice that the size of v2 in memory is exactly 512 kB.
lobstr::obj_size(v2)
## 524,288 B  # exactly 512 kB
The size of v1 is slightly greater:
lobstr::obj_size(v1)
## 524,296 B  # or 512.0078125 kB
As far as I understand (correct me if I am wrong), RStudio simply labels objects larger than 512 kB as "Large" for display convenience; the objects themselves are no different.
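The arithmetic behind that cut-off (a sketch; it assumes the usual 48-byte vector header and 8 bytes per double on a 64-bit build):
(524288 - 48) / 8   # the largest double vector that still fits in 512 kB
## [1] 65530
length(v2)          # 65530 -> exactly 512 kB, listed as a normal vector
length(v1)          # 65531 -> just over 512 kB, listed as a "Large" vector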
I know this question has been asked in the past (here and here, for example), but those questions are years old and unresolved. I am wondering if any solutions have been created since then. The issue is that the Matrix package in R cannot handle long vectors (length greater than 2^31 - 1). In my case, a sparse matrix is necessary for running an XGBoost model because of memory and time constraints. The XGBoost xgb.DMatrix supports using a dgCMatrix object. However, due to the size of my data, trying to create a sparse matrix results in an error. Here's an example of the issue. (Warning: this uses 50-60 GB RAM.)
library(Matrix)

i <- rep(1, 2^31)             # already a long vector (length 2^31)
j <- i
j[(2^30):length(j)] <- 2      # second half of the entries goes into column 2
x <- i                        # the non-zero values
s <- sparseMatrix(i = i, j = j, x = x)
Error in validityMethod(as(object, superClass)) : long vectors not supported yet: ../../src/include/Rinlinedfuns.h:137
As of 2019, are there any solutions to this issue?
I am using the latest version of the Matrix package, 1.2-15.
The sparse matrix algebra R package spam with its spam64 extension supports sparse matrices with more than 2^31-1 non-zero elements.
A simple example (requires ~50 GB of memory and takes ~5 minutes to run):
## -- a regular 32-bit spam matrix
library(spam) # version 2.2-2
s <- spam(1:2^30)
summary(s)
## Matrix object of class 'spam' of dimension 1073741824x1,
## with 1073741824 (row-wise) nonzero elements.
## Density of the matrix is 100%.
## Class 'spam'
## -- a 64-bit spam matrix with 2^31 non-zero entries
library(spam64)
s <- cbind(s, s)
summary(s)
## Matrix object of class 'spam' of dimension 1073741824x2,
## with 2147483648 (row-wise) nonzero elements.
## Density of the matrix is 100%.
## Class 'spam'
## -- add zeros to make the dimension 2^31 x 2^31
pad(s) <- c(2^31, 2^31)
summary(s)
## Matrix object of class 'spam' of dimension 2147483648x2147483648,
## with 2147483648 (row-wise) nonzero elements.
## Density of the matrix is 4.66e-08%.
## Class 'spam'
Some links:
https://cran.r-project.org/package=spam
https://cran.r-project.org/package=spam64
https://cran.r-project.org/package=dotCall64
https://doi.org/10.1016/j.cageo.2016.11.015
https://doi.org/10.1016/j.softx.2018.06.002
I am one of the authors of dotCall64 and spam.
I am trying to decrease the memory footprint of some of my datasets, where each column has a small set of distinct values repeated a large number of times. Are there better ways to minimize it? For comparison, this is what I get from just using factors:
library(pryr)
N <- 10 * 8
M <- 10
Initial data:
test <- data.frame(A = c(rep(strrep("A", M), N), rep(strrep("B", N), N)))
object_size(test)
# 1.95 kB
Using Factors:
test2 <- as.factor(test$A)
object_size(test2)
# 1.33 kB
Aside: I naively assumed that factors just replace the strings with a number, and was pleasantly surprised to see test2 come out smaller than test3. Can anyone point me to some material on how the factor representation is optimized?
test3 <- data.frame(A = c(rep("1", N), rep("2", N)))
object_size(test3)
# 1.82 kB
I'm afraid the difference is minimal.
The principle would be easy enough: instead of (in your example) 160 strings, you would just be storing 2, along with 160 integers (which are only 4 bytes each).
Except that R already stores character vectors internally in much the same way.
Every modern language supports strings of (virtually) unlimited length, which means you can't store a vector (or array) of strings as one contiguous block: any element can be reset to an arbitrary length, so assigning a somewhat longer value to one element would force the rest of the array to be shifted, or the OS/language would have to reserve a large amount of space for every string.
Therefore, strings are stored at whatever place in memory happens to be convenient, and arrays (or vectors in R) are stored as blocks of pointers to the places where the values actually live.
In the early days of R, each pointer pointed to its own place in memory, even if the actual value was the same, so in your example 160 pointers to 160 memory locations. But that has changed: nowadays it is implemented as 160 pointers to 2 memory locations (R's global string pool).
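A quick way to see this shared storage in action (a sketch; lobstr::obj_size() counts each distinct string only once):
library(lobstr)
x_rep  <- rep(strrep("A", 100), 1e4)       # 10,000 pointers to ONE cached string
x_uniq <- sprintf("%0100d", seq_len(1e4))  # 10,000 pointers to 10,000 distinct strings
obj_size(x_rep)   # roughly 80 kB: essentially just the block of pointers
obj_size(x_uniq)  # far larger: every distinct string is stored separately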
There may be some small differences, mainly because a factor can normally support only 2^31-1 levels, meaning 32-bit integers are enough to store the codes, while a character vector mostly uses 64-bit pointers. Then again, there is more overhead in factors.
Generally, there may be some advantage in using a factor if you really have a large percentage of duplicates, but if that's not the case it may even hurt your memory usage.
And the example you provided isn't a fair comparison, as you're comparing a data.frame with a bare factor, instead of with a bare character vector.
Even stronger: when I reproduce your example, I only get your numbers if I set stringsAsFactors to FALSE; with the default (stringsAsFactors = TRUE) you would actually be comparing a factor inside a data.frame to a bare factor.
Comparing like with like gives a much smaller difference: 1568 bytes for the bare character vector versus 1328 bytes for the bare factor.
And even that advantage only holds if you have a lot of duplicated values; as you can see below, the factor can also be larger:
> object.size(factor(sample(letters)))
2224 bytes
> object.size(sample(letters))
1712 bytes
So generally, there is no real way to compress your data while still keeping it easy to work with, except for using common sense in what you actually want to store.
I don't have a direct answer to your question, but here is some information from the book "Advanced R" by Hadley Wickham:
Factors
One important use of attributes is to define factors. A factor is a vector that can contain only predefined values, and is used to store categorical data. Factors are built on top of integer vectors using two attributes: the class, “factor”, which makes them behave differently from regular integer vectors, and the levels, which defines the set of allowed values.
Also:
"While factors look (and often behave) like character vectors, they are actually integers. Be careful when treating them like strings. Some string methods (like gsub() and grepl()) will coerce factors to strings, while others (like nchar()) will throw an error, and still others (like c()) will use the underlying integer values. For this reason, it’s usually best to explicitly convert factors to character vectors if you need string-like behaviour. In early versions of R, there was a memory advantage to using factors instead of character vectors, but this is no longer the case."
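A tiny illustration of what the quote describes (factors really are integer vectors plus attributes):
f <- factor(c("low", "high", "high", "low"))
typeof(f)      # "integer": the underlying storage
attributes(f)  # $levels ("high", "low") and $class ("factor")
unclass(f)     # the raw integer codes, with the levels kept as an attribute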
There is a package in R called fst (Lightning Fast Serialization of Data Frames for R), with which you can create compressed fst objects for your data frame. A detailed explanation can be found in the fst package manual, but I'll briefly explain how to use it and how much space an fst object takes. First, let's make your test data frame a bit larger:
library(pryr)
N <- 1000 * 8
M <- 100
test <- data.frame(A = c(rep(strrep("A", M), N), rep(strrep("B", N), N)))
object_size(test)
# 73.3 kB
Now, let's convert this dataframe into an fst object, as follows:
install.packages("fst") #install the package
library(fst) #load the package
path <- paste0(tempfile(), ".fst") #create a temporary '.fst' file
write_fst(test, path) #write the dataframe into the '.fst' file
test2 <- fst(path) #load the data as an fst object
object_size(test2)
# 2.14 kB
The created .fst file takes only 434 bytes on disk. You can work with test2 like a normal data frame (as far as I have tried).
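A nice side effect (a sketch; see ?read_fst for the exact arguments) is that you can read back just a part of the file without loading everything into memory:
head_rows <- read_fst(path, columns = "A", from = 1, to = 10)  # only column A, rows 1-10
head_rows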
Hope this helps.
When I save an object from R using save(), what determines the size of the saved file? Clearly it is not the same as (or even close to) the size of the object reported by object.size().
Example:
I read a data frame and saved it using
snpmat <- read.table("Heart.txt.gz", header = TRUE)
save(snpmat, file = "datamat.RData")
The size of the file datamat.RData is 360 MB.
> object.size(snpmat)
4998850664 bytes #Much larger
Then I performed some regression analysis and obtained another data frame, adj.snpmat, of the same dimensions (6820000 rows and 80 columns).
> object.size(adj.snpmat)
4971567760 bytes
I save it using
> save(adj.snpmat,file="adj.datamat.RData")
Now the size of the file adj.datamat.RData is 3.3 GB. I'm confused about why the two files differ so much in size when object.size() reports similar sizes for the two objects. Any insight into what determines the size of a saved object is welcome.
Some more information:
> typeof(snpmat)
[1] "list"
> class(snpmat)
[1] "data.frame"
> typeof(snpmat[,1])
[1] "integer"
> typeof(snpmat[,2])
[1] "double" #This is true for all columns except column 1
> typeof(adj.snpmat)
[1] "list"
> class(adj.snpmat)
[1] "data.frame"
> typeof(adj.snpmat[,1])
[1] "character"
> typeof(adj.snpmat[,2])
[1] "double" #This is true for all columns except column 1
Your two objects are very different in content and therefore compress very differently.
SNP data contains only a few distinct values (e.g., 0 or 1) and is also very sparse, which makes it very easy to compress. As an extreme example, a matrix of all zeros could be compressed down to a single value (0) plus the dimensions.
Your regression output contains many distinct values, and they are real numbers (I'm assuming p-values, coefficients, etc.), which makes it much less compressible.
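A small sketch of that effect with made-up data: two objects of comparable object.size() can produce wildly different .RData files, because save() compresses the serialized data by default.
sparse_like <- data.frame(matrix(0L, nrow = 1e5, ncol = 10))         # one repeated value
dense_real  <- data.frame(matrix(rnorm(1e6), nrow = 1e5, ncol = 10)) # random doubles
object.size(sparse_like)  # ~4 MB in memory (integers)
object.size(dense_real)   # ~8 MB in memory (doubles)
f1 <- tempfile(fileext = ".RData"); save(sparse_like, file = f1)
f2 <- tempfile(fileext = ".RData"); save(dense_real,  file = f2)
file.size(f1)  # a few kB: constant data compresses extremely well
file.size(f2)  # several MB: random doubles barely compress at all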
I have the following R code:
data <- read.csv('testfile.data', header = TRUE)
mat <- as.matrix(data)
Some more statistics of my testfile.data:
> ncol(data)
[1] 75713
> nrow(data)
[1] 44771
Since this is a large dataset, I am using an Amazon EC2 instance with 64 GB of RAM, so hopefully memory isn't the issue. I am able to load the data (the 1st line works).
But the as.matrix() transformation (the 2nd line) throws the following error:
resulting vector exceeds vector length limit in 'AnswerType'
Any clue what might be the issue?
As noted, the development version of R supports vectors longer than 2^31-1. This is more or less transparent; for instance:
> m = matrix(0L, .Machine$integer.max / 4, 5)
> length(m)
[1] 2684354555
This is with
> R.version.string
[1] "R Under development (unstable) (2012-08-07 r60193)"
Large objects consume a lot of memory (62.5% of my 16 GB, for my example) and doing anything useful with them requires several times that amount. Further, even simple operations on large data can take appreciable time. And many operations on long vectors are not yet supported:
> sum(m)
Error: long vectors not supported yet:
/home/mtmorgan/src/R-devel/src/include/Rinlinedfuns.h:100
So it often makes sense to process data in smaller chunks by iterating through a larger file. This gives full access to R's routines, and allows parallel evaluation (via the parallel package). Another strategy is to down-sample the data, which should not be too intimidating to a statistical audience.
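A rough sketch of the chunking idea, using the file from the question (it assumes all columns are numeric; the chunk size and the column-sum computation are only placeholders for whatever per-chunk work you need):
con <- file("testfile.data", open = "r")
first    <- read.csv(con, nrows = 10000)   # first chunk, also consumes the header
nms      <- names(first)
col_sums <- colSums(first)
repeat {
  chunk <- tryCatch(
    read.csv(con, header = FALSE, nrows = 10000, col.names = nms),
    error = function(e) NULL)              # "no lines available" -> we are done
  if (is.null(chunk)) break
  col_sums <- col_sums + colSums(chunk)
}
close(con)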
Your matrix has more elements than the maximum vector length of 2^31-1. This is a problem because a matrix is just a vector with a dim attribute. read.csv works because it returns a data.frame, which is a list of vectors.
R> 75713*44771 > 2^31-1
[1] TRUE
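To see concretely that a matrix is just a vector with a dim attribute, a minimal sketch:
m <- matrix(1:6, nrow = 2)
attributes(m)   # only $dim: 2 3
dim(m) <- NULL  # drop the dim attribute ...
m               # ... and it is an ordinary vector again: 1 2 3 4 5 6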
See ?"Memory-limits" for more details.
I have created a DocumentTermMatrix that contains 1859 documents (rows) and 25722 terms (columns). In order to perform further calculations on this matrix I need to convert it to a regular matrix. I want to use the as.matrix() command. However, it returns the following error: cannot allocate vector of size 364.8 MB.
> corp
A corpus with 1859 text documents
> mat<-DocumentTermMatrix(corp)
> dim(mat)
[1] 1859 25722
> is(mat)
[1] "DocumentTermMatrix"
> mat2<-as.matrix(mat)
Fehler: kann Vektor der Größe 364.8 MB nicht allozieren # cannot allocate vector of size 364.8 MB
> object.size(mat)
5502000 bytes
For some reason the size of the object seems to increase dramatically whenever it is transformed to a regular matrix. How can I avoid this?
Or is there an alternative way to perform regular matrix operations on a DocumentTermMatrix?
The quick and dirty way is to export your data into a sparse matrix object from an external package like Matrix.
> attributes(dtm)
$names
[1] "i" "j" "v" "nrow" "ncol" "dimnames"
$class
[1] "DocumentTermMatrix" "simple_triplet_matrix"
$Weighting
[1] "term frequency" "tf"
The dtm object has i, j and v attributes, which are the internal (triplet) representation of your DocumentTermMatrix. Use:
library("Matrix")
mat <- sparseMatrix(
  i = dtm$i,
  j = dtm$j,
  x = dtm$v,
  dims = c(dtm$nrow, dtm$ncol)
)
and you're done.
A naive comparison between your objects:
> mat[1,1:100]
> head(as.vector(dtm[1,]), 100)
will each give you the exact same output.
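If you also want to keep the document and term labels, sparseMatrix() accepts a dimnames argument, and the tm object already carries them in dtm$dimnames (a small variation on the call above):
mat <- sparseMatrix(
  i = dtm$i,
  j = dtm$j,
  x = dtm$v,
  dims = c(dtm$nrow, dtm$ncol),
  dimnames = dtm$dimnames   # carries over the Docs and Terms names
)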
A DocumentTermMatrix uses a sparse matrix representation, so it doesn't waste memory storing all those zeros. Depending on what you want to do, you might have some luck with the SparseM package, which provides linear algebra routines for sparse matrices.
Are you able to increase the amount of RAM available to R? See this post: Increasing (or decreasing) the memory available to R processes
Also, when working with big objects in R, I occasionally call gc() to free up wasted memory.
The number of documents should not be a problem, but you may want to try removing sparse terms; this could very well reduce the dimension of the document-term matrix.
inspect(removeSparseTerms(dtm, 0.7))
This removes terms that have a sparsity of at least 0.7.
Another option is to specify a minimum word length and a minimum document frequency when you create the document-term matrix:
a.dtm <- DocumentTermMatrix(a.corpus, control = list(weighting = weightTfIdf, minWordLength = 2, minDocFreq = 5))
Use inspect(dtm) before and after your changes; you will see a huge difference, and, more importantly, you won't ruin significant relations hidden in your docs and terms.
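To see how much this shrinks the matrix, a quick before/after check (a sketch, using the 0.7 threshold from above):
dim(dtm)                                  # documents x terms before
dtm_small <- removeSparseTerms(dtm, 0.7)
dim(dtm_small)                            # same documents, far fewer terms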
Since you only have 1859 documents, the distance matrix you need to compute is fairly small. Using the slam package (and in particular, its crossapply_simple_triplet_matrix function), you might be able to compute the distance matrix directly, instead of converting the DTM into a dense matrix first. This means that you will have to compute the Jaccard similarity yourself. I have successfully tried something similar for the cosine distance matrix on a large number of documents.
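As a rough sketch of that idea (cosine similarity shown here; Jaccard would follow the same pattern with your own formula), slam's crossprod_simple_triplet_matrix() plus the triplet slots let you stay sparse until the final 1859 x 1859 result:
library(slam)
dots <- crossprod_simple_triplet_matrix(t(dtm))  # dtm %*% t(dtm): document dot products (dense, 1859 x 1859)
sq   <- rep(0, dtm$nrow)                         # squared Euclidean norm of each document,
agg  <- tapply(dtm$v^2, dtm$i, sum)              # computed from the triplet slots i (row) and v (value)
sq[as.integer(names(agg))] <- agg
cos_sim  <- dots / outer(sqrt(sq), sqrt(sq))     # NaN for any empty documents
cos_dist <- 1 - cos_sim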