Convert large Term Document Matrix into matrix - r

I've got a large Term Document Matrix (6 elements, 44.3 Mb).
I need to convert it into a matrix, but when trying to do it I get the magical error message: "cannot allocate 100 GBs".
Is there any package/library that allows doing this transformation in chunks?
I've tried ff and bigmemory, but they do not seem to allow conversions from DTMs to matrix.

Before converting to a matrix, remove sparse terms from the Term Document Matrix. This will reduce your matrix size significantly. To remove sparse terms, you can do as below:
library(tm)
## tdm - Term Document Matrix
## keep only terms with a sparsity below 0.2, i.e. terms present in more than 80% of the documents
tdm2 <- removeSparseTerms(tdm, sparse = 0.2)
tdm_Matrix <- as.matrix(tdm2)
Note: I used 0.2 for sparse just as an example. You should decide that value based on your tdm.
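One quick way to pick that value (just a rough sketch) is to scan a few thresholds and see how many terms each one keeps, for example:
for (s in c(0.80, 0.90, 0.95, 0.99)) {
  cat("sparse =", s, "->", nrow(removeSparseTerms(tdm, sparse = s)), "terms kept\n")
}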
Here are some links that shed light on the removeSparseTerms function and the sparse value:
How does the removeSparseTerms in R work?
https://www.rdocumentation.org/packages/tm/versions/0.7-1/topics/removeSparseTerms


Error: cannot allocate vector of size 38.3 Gb while creating a document term matrix

This is the first time I have posted on Stack Overflow; I'm a student and I hope someone will be able to help me. I am trying to do sentiment analysis in RStudio and am facing a vector size error.
When I try to convert my Document Term Matrix into a regular matrix using this code:
dtm2 <- as.matrix(dtm)
I get the error "Error: cannot allocate vector of size 38.3 Gb".
The dtm is the DocumentTermMatrix of a corpus that has 178884 elements (26.7 Mb); the texts consist of reviews.
I have read the other responses on Stack Overflow, but I do not understand them and they probably do not apply to my issue. How can I increase the size of a vector in RStudio? I am using RStudio version 1.2.5001 on a 64-bit Windows machine.
Is there any other information I should provide?
A document-term matrix (the generic object, not the class in the tm package) is typically a count of the number of times a word (in a column) occurs in a document (in a row). The columns are the vocabulary for the entire corpus. In the OP's case, there are 178,884 unique words in the vocabulary. What this means is that for each row there are lots of zeros -- the matrix is very sparse.
In most text analysis packages, the DTM is represented using a special kind of matrix that does not allocate memory for those zeros. For example, in the tm package the DTM is actually a simple_triplet_matrix from the slam package and in quanteda it is a sparseMatrix from the Matrix package.
Now, if we take either of these special matrix classes and convert it to a base R matrix using as.matrix(), R has to allocate memory for every cell in the matrix, including all those zeros. This is why a DTM that is only a few dozen Mb as a simple_triplet_matrix or a sparseMatrix becomes tens of Gb as a base R "dense" matrix.
The solution, then, is to use tools from the slam or the Matrix package on the DTM instead of calling as.matrix(). These packages provide the typical matrix operations, but implemented for their sparse matrix classes. For example, the base R rowSums() becomes row_sums() in the slam package, while the Matrix package keeps the rowSums() name for its sparse classes.
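As a concrete, hedged sketch of that approach (assuming dtm is the tm DocumentTermMatrix from the question):
library(slam)     # sparse tools for simple_triplet_matrix objects (tm's DTM class)
library(Matrix)   # sparse tools for sparseMatrix / dgCMatrix objects
## work on the sparse object directly -- no dense conversion, no huge allocation
doc_lengths <- row_sums(dtm)   # total term count per document
term_freqs  <- col_sums(dtm)   # corpus-wide frequency of each term
## or convert once to a Matrix sparse matrix and use its sparse-aware functions
m <- sparseMatrix(i = dtm$i, j = dtm$j, x = dtm$v,
                  dims = c(dtm$nrow, dtm$ncol), dimnames = dtm$dimnames)
doc_lengths_2 <- rowSums(m)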

Rebuild a large sparse matrix in R from .npz file

I have generated a large sparse matrix in Python in the COO format and it needs to be processed in R. The COO sparse matrix contains more than 2^31-1 non-zero entries. I tried to save the COO sparse matrix in .npz and rebuild it in R.
The COO sparse matrix has a shape of (1119534, 239415) with 2 230 643 376 non-zero entries.
Code in R
library(Matrix)
library(Rcpp)
library(reticulate)
np <- import("numpy")
npz <- np$load("LARGE_SPARSE_COO.npz")
i = as.numeric(npz$f[["row"]])     # row indices (0-based)
j = as.numeric(npz$f[["col"]])     # column indices (0-based)
v = as.numeric(npz$f[["data"]])    # non-zero values
dims = as.numeric(npz$f[["shape"]])
X <- sparseMatrix(i, j, x=v, index1=FALSE, dims=dims)
When the number of non-zero entries is less than 2^31-1 the above code works, but when it is greater than 2^31-1 the following error occurs:
Error in py_ref_to_r(x):
negative length vectors are not allowed
Calls: as.vector ... py_to_r.numpy.ndarray -> NextMethod -> py_to_r.default -> py_ref_to_r
I think this is due to the vector size exceeding the 32-bit integer limit. However, R does support vectors beyond that size as long vectors. How could I save the row, col and data from the .npz in a long-vector format and pass them to sparseMatrix? Or is there any other way to rebuild such a large sparse matrix in R?
I cannot reduce the size of the COO sparse matrix, and some of my matrices have even more non-zero entries. Any help/insight is appreciated.
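As a quick sanity check on the 32-bit hypothesis, the reported non-zero count does indeed exceed R's default integer indexing limit:
nnz <- 2230643376            # non-zero entries reported above
nnz > .Machine$integer.max   # TRUE: 2^31 - 1 = 2147483647, so a single 32-bit-indexed conversion overflows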
Edit 1
I am aware of the spam/spam64 packages in R, but have no idea how to use them in my case. I am also not sure whether the sparse matrix format from spam will be accepted by glmnet, which is where the sparse matrix will finally be passed.

Adjacency matrix from igraph package to be used for an autologistic model in the ngspatial package in R

I am interested in running an autologistic model with the ngspatial package in R. My data objects are polygons. Usually, adjacency matrices for polygons are built from the coordinates of the polygon centroids. However, I defined my adjacency (0/1) based on a minimum distance criterion between polygons, measured from border to border. I did this in ArcMap, and then generated the adjacency matrix with the igraph package:
g<-graph_from_data_frame(My data)
A<-as_adjacency_matrix(g, attr="Dist")
A
42 x 42 sparse Matrix of class "dgCMatrix"
[[ suppressing 42 column names ‘1’, ‘2’, ‘3’ ... ]]
My matrix contains only 0 and 1 values and is completely symmetric (42 x 42).
However, when I try to use it in an autologistic model in ngspatial I get an error message:
ms_autolog<-autologistic(Occupancy~Area, A=A )
'You must supply a numeric and symmetric adjacency matrix'.
I suppose that dgCMatrix is just not compatible with ngspatial, but I haven't found how to convert it. I have also tried shaping my data.csv file directly as a matrix and reading it in as a matrix, but it still cannot be read by the autologistic model.
Does anybody have any idea how I can solve this?
Many thanks in advance!
Ana María.
It's difficult to check this without a minimal working example but you could try this:
A <- as_adjacency_matrix(g, attr = "Dist", sparse = F)
This way you get an ordinary dense matrix of 0s and 1s instead of a sparse dgCMatrix.
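Alternatively (a hedged sketch, assuming A is the dgCMatrix already built above), you can densify the existing object and check that it meets what the error message asks for before refitting the model:
A_dense <- as.matrix(A)   # dgCMatrix -> ordinary base R numeric matrix
isSymmetric(A_dense)      # should be TRUE
ms_autolog <- autologistic(Occupancy ~ Area, A = A_dense)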

Document similarity using LSA in R

I am working on LSA (using R) for Document Similarity Analysis.
Here are my steps
Imported the text data and created a corpus. Did basic corpus operations like stemming, whitespace removal, etc.
Created LSA space as below
tdm <- TermDocumentMatrix(chat_corpus)
tdm_matrix <- as.matrix(tdm)
tdm.lsa <- lw_bintf(tdm_matrix)*gw_idf(tdm_matrix)
lsaSpace <- lsa(tdm.lsa)
Multidimensional Scaling (MDS) on the LSA space
dist.mat.lsa <- dist(t(as.textmatrix(lsaSpace)))
fit <- cmdscale(dist.mat.lsa, eig = TRUE)
points <- data.frame(fit$points, row.names = chat$text)
I want to create a matrix/data frame showing how similar the texts are (as shown in the attachment Result). Rows and columns will be the texts to match, while the cell values will be their similarity. Ideally the diagonal values will be 1 (a perfect match) while the rest of the cell values will be less than 1.
Please throw some insights into how to do this. Thanks in advance.
Note: I got the Python code for this but need the same in R:
similarity = np.asarray(numpy.asmatrix(dtm_lsa) * numpy.asmatrix(dtm_lsa).T)
pd.DataFrame(similarity,index=example, columns=example).head(10)
Expected Result
In order to do this you first need to take the S_k and D_k matrices from the lsa space you've created and multiply S_k by the transpose of D_k to get a k by n matrix, where k is the number of dimensions and n is the number of documents. This code would be as follows:
lsaMatrix <- diag(myLSAspace$sk) %*% t(myLSAspace$dk)
Then it's as simple as putting the resulting matrix through the cosine function from the lsa package:
simMatrix <- cosine(lsaMatrix)
This will result in an n x n similarity matrix, which can then be used for clustering, etc.
You can read more about the S_k and D_k matrices in the lsa package documentation; they are outputs of the SVD that is applied.
https://cran.r-project.org/web/packages/lsa/lsa.pdf
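To get the labelled data frame the question asks for, here is a hedged sketch mirroring the pandas snippet (assuming lsaSpace and chat from the question, with chat$text supplying the document labels):
library(lsa)
lsaMatrix <- diag(lsaSpace$sk) %*% t(lsaSpace$dk)   # k x n: documents in LSA space
simMatrix <- cosine(lsaMatrix)                      # n x n cosine similarities, diagonal = 1
sim_df <- as.data.frame(simMatrix, row.names = chat$text)
colnames(sim_df) <- chat$text
head(sim_df, 10)                                    # same idea as the pandas .head(10)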

tm package error "Cannot convert DocumentTermMatrix into normal matrix since vector is too large"

I have created a DocumentTermMatrix that contains 1859 documents (rows) and 25722 terms (columns). In order to perform further calculations on this matrix I need to convert it to a regular matrix. I want to use the as.matrix() command. However, it returns the following error: cannot allocate vector of size 364.8 MB.
> corp
A corpus with 1859 text documents
> mat<-DocumentTermMatrix(corp)
> dim(mat)
[1] 1859 25722
> is(mat)
[1] "DocumentTermMatrix"
> mat2<-as.matrix(mat)
Fehler: kann Vektor der Größe 364.8 MB nicht allozieren # cannot allocate vector of size 364.8 MB
> object.size(mat)
5502000 bytes
For some reason the size of the object seems to increase dramatically whenever it is transformed to a regular matrix. How can I avoid this?
Or is there an alternative way to perform regular matrix operations on a DocumentTermMatrix?
The quick and dirty way is to export your data into a sparse matrix object from an external package like Matrix.
> attributes(dtm)
$names
[1] "i" "j" "v" "nrow" "ncol" "dimnames"
$class
[1] "DocumentTermMatrix" "simple_triplet_matrix"
$Weighting
[1] "term frequency" "tf"
The dtm object has i, j and v attributes, which are the internal triplet representation of your DocumentTermMatrix. Use:
library("Matrix")
mat <- sparseMatrix(
  i = dtm$i,                    # row (document) indices of the non-zero entries
  j = dtm$j,                    # column (term) indices
  x = dtm$v,                    # the counts themselves
  dims = c(dtm$nrow, dtm$ncol)
)
and you're done.
A naive comparison between your objects:
> mat[1,1:100]
> head(as.vector(dtm[1,]), 100)
will each give you the exact same output.
DocumentTermMatrix uses a sparse matrix representation, so it doesn't take up all that memory storing all those zeros. Depending on what you want to do, you might have some luck with the SparseM package, which provides some linear algebra routines for sparse matrices.
Are you able to increase the amount of RAM available to R? See this post: Increasing (or decreasing) the memory available to R processes
Also, when working with big objects in R, I occasionally call gc() to free up wasted memory.
The number of documents should not be a problem, but you may want to try removing sparse terms; this could very well reduce the dimensions of the document term matrix.
inspect(removeSparseTerms(dtm, 0.7))
This removes terms that have a sparsity of at least 0.7, i.e. terms that are absent from at least 70% of the documents.
Another option is to specify a minimum word length and a minimum document frequency when you create the document term matrix:
a.dtm <- DocumentTermMatrix(a.corpus, control = list(weighting = weightTfIdf, minWordLength = 2, minDocFreq = 5))
Use inspect(dtm) before and after your changes and you will see a huge difference; more importantly, you won't ruin significant relations hidden in your docs and terms.
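Note that in more recent versions of tm the control options are named differently; if minWordLength / minDocFreq are not recognised, the equivalent (to the best of my knowledge) is:
a.dtm <- DocumentTermMatrix(a.corpus, control = list(
  weighting   = weightTfIdf,
  wordLengths = c(2, Inf),                # keep words of at least 2 characters
  bounds      = list(global = c(5, Inf))  # keep terms appearing in at least 5 documents
))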
Since you only have 1859 documents, the distance matrix you need to compute is fairly small. Using the slam package (and in particular, its crossapply_simple_triplet_matrix function), you might be able to compute the distance matrix directly, instead of converting the DTM into a dense matrix first. This means that you will have to compute the Jaccard similarity yourself. I have successfully tried something similar for the cosine distance matrix on a large number of documents.
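As a hedged sketch of that idea, here is one way to get a document-by-document cosine similarity matrix without ever densifying the DTM, using the sparse mat built with sparseMatrix() above (cosine rather than Jaccard, but the pattern is the same):
library(Matrix)
## rows of mat are documents; normalise each row to unit length, then the
## cross-product of the normalised rows is the cosine similarity matrix
row_norms <- sqrt(rowSums(mat^2))
mat_norm  <- Diagonal(x = 1 / row_norms) %*% mat
cos_sim   <- as.matrix(tcrossprod(mat_norm))   # 1859 x 1859, small enough to hold densely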
