findAssocs in a single document - r

I only have one document (a survey compilation). I want to do word association within a single document with findAssocs. So far, all the examples I have seen use a combination of several documents.
inspect(myDtm)
A term-document matrix (864 terms, 1 documents)
Non-/sparse entries: 864/0 (what is this for?)
Sparsity : 0% (what is this for? what does it mean if it's 0%)
Maximal term length: 20
Weighting : term frequency (tf)
My data looks like this:
unwanted 1
upgrade 3
valid 1
This is my code, and I end up with the result numeric(0):
findAssocs(myDtm, "salary", 0.5)
numeric(0)
Please help.

Sparsity measures the percentage of elements (i.e. cells) in the matrix that are equal to zero. When sparsity is high, you have a lot of terms that only occur in one or a few documents. You only have one document in your example, so every term must occur in that document, which is why 0 of your 864 entries are empty and the sparsity is 0%. Very generally speaking, a lower degree of sparsity is more useful for investigating document similarity (if that's what you're doing... it's not clear from your question).
The short answer is that your question has already been asked and answered: you need to have more than one doc in your dtm to calculate term associations using findAssocs.
You'll have to include a reproducible example if you want any more specific help with findAssocs. Try using the 'crude' dataset that comes with the tm package and experiment with findAssocs to see what happens when you alter the parameters. Check out the tm [documentation](http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf) to see more about how to use the built-in data.
Here's an example using the built-in data, try it for yourself:
require(tm)
data(crude)
dtm <- DocumentTermMatrix(crude)
# one doc in dtm, doesn't work...
dtm1 <- dtm[1,]
findAssocs(dtm1, "oil", 0.01)
# ten docs, does work
dtm10 <- dtm[1:10,]
findAssocs(dtm10, "oil", 0.01)

You can use findAssocs by adding your data in the following manner
data <- data.frame(text=txt, stringsAsFactors=FALSE)
tdm <- TermDocumentMatrix(Corpus(DataframeSource(data)))
Basically: import your data into a Source, turn your Source into a Corpus, and then make a TDM out of your Corpus.
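For example, a minimal end-to-end sketch (the survey text here is made up, and note that recent versions of tm expect the data frame to have doc_id and text columns for DataframeSource):
library(tm)
# hypothetical survey responses, one row per respondent
txt <- c("salary is too low for the workload",
         "happy with the salary and benefits",
         "the equipment needs an upgrade")
data <- data.frame(doc_id = seq_along(txt), text = txt, stringsAsFactors = FALSE)
tdm <- TermDocumentMatrix(Corpus(DataframeSource(data)))
findAssocs(tdm, "salary", 0.5)  # multiple documents, so term correlations can now be computed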

A couple of years late, but I ran into the same problem recently. It is because your term-document matrix (TDM) consists of only one document; instead, your TDM should consist of multiple documents. If you use paste() to retrieve text from a data frame, you should not use paste(data$text, collapse = " "), but paste(data$text), before turning it into a TDM.
But if you present a reproducible example, maybe we can help.
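To illustrate the difference (the text below is made up, and VectorSource is just one convenient way to build the corpus):
library(tm)
txt <- c("salary too low", "great benefits", "old equipment needs an upgrade")  # hypothetical responses
one_doc   <- paste(txt, collapse = " ")   # a single string -> a TDM with only one document
many_docs <- txt                          # keep the vector -> one document per element
tdm1 <- TermDocumentMatrix(Corpus(VectorSource(one_doc)))    # findAssocs() on this returns numeric(0)
tdmN <- TermDocumentMatrix(Corpus(VectorSource(many_docs)))  # findAssocs() can now compute correlations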

Related

Fast Implementation of TF-IDF

I am trying to calculate Term Frequency Inverse Document Frequency to get normalized weights using the function below. When the number of rows is in the hundreds, the results are pretty quick, but when the number of rows is in the thousands (just 20 thousand), it takes almost 3 to 4 minutes to get the result. Can someone point me in the right direction to decrease the computation time?
tfidf <- function(mat) {
  mat <- mat[, names]                         # assumes 'names' is a predefined vector of term columns
  tf  <- mat / rowSums(mat)                   # term frequency: counts scaled by document length
  id  <- function(col) sum(!col == 0)         # document frequency of a term
  idf <- log10(nrow(mat) / apply(mat, 2, id))
  tfidf <- mat
  for (word in names(idf)) {                  # this column-by-column loop is the slow part
    tfidf[, word] <- tf[, word] * idf[word]
  }
  return(tfidf)
}
I would recommend using the TfIdf() class from text2vec -- it is super fast when used in combination with a matrix created by text2vec::create_dtm(). The interface is similar to sklearn's if you've ever used that.
Check out this part of the vectorization vignette for an example of tf-idf weighting in action.
(More generally, I've found most of text2vec's core functionality to be shockingly fast, so if you're using R for NLP this is a good option :p)
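A minimal sketch using the movie_review data that ships with text2vec (nothing here is tuned; it just shows the TfIdf workflow):
library(text2vec)
data("movie_review")                         # example data shipped with text2vec
it  <- itoken(movie_review$review, preprocessor = tolower, tokenizer = word_tokenizer)
v   <- create_vocabulary(it)
dtm <- create_dtm(it, vocab_vectorizer(v))   # sparse document-term matrix
tfidf <- TfIdf$new()                         # model object with an sklearn-style interface
dtm_tfidf <- fit_transform(dtm, tfidf)       # tf-idf weighted matrix, still sparse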

Convert Large Document Term Document Matrix into Matrix

I've got a large Term Document Matrix. (6 elements, 44.3 Mb)
I need to convert it into a matrix, but when trying to do so I get the magical error message: "cannot allocate 100 GBs".
Is there any package/library that allows to do this transformation in chunks?
I've tried ff and bigmemory but they do not seem to allow conversions from DTMs to Matrix.
Before converting to a matrix, remove sparse terms from the Term Document Matrix. This will reduce your matrix size significantly. To remove sparse terms, you can do as below:
library(tm)
## tdm - Term Document Matrix
tdm2 <- removeSparseTerms(tdm, sparse = 0.2)
tdm_Matrix <- as.matrix(tdm2)
Note: I used 0.2 for sparse just as an example; you should decide that value based on your tdm.
Here are some links that shed light on the removeSparseTerms function and the sparse value:
How does the removeSparseTerms in R work?
https://www.rdocumentation.org/packages/tm/versions/0.7-1/topics/removeSparseTerms
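If you want to see the effect of the threshold before committing to one, the crude corpus that ships with tm makes for a quick check (the values below are arbitrary):
library(tm)
data(crude)                                # 20-document example corpus
tdm <- TermDocumentMatrix(crude)
dim(tdm)                                   # all terms
dim(removeSparseTerms(tdm, sparse = 0.2))  # keeps only terms present in at least 80% of documents
dim(removeSparseTerms(tdm, sparse = 0.8))  # looser threshold keeps many more terms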

Apply text2vec embeddings to new data

I used text2vec to generate custom word embeddings from a corpus of proprietary text data that contains a lot of industry-specific jargon (thus stock embeddings like those available from Google won't work). The analogies work great, but I'm having difficulty applying the embeddings to assess new data. I want to use the embeddings that I've already trained to understand relationships in new data. The approach I'm using (described below) seems convoluted, and it's painfully slow. Is there a better approach? Perhaps something already built into the package that I've simply missed?
Here's my approach (offered with the closest thing to reproducible code I can generate given that I'm using a proprietary data source):
d = a list containing the new data; each element is of class character
vecs = the word vectors obtained from text2vec's implementation of GloVe
new_vecs <- sapply(d, function(y) {
  it  <- itoken(word_tokenizer(y), progressbar = FALSE)      # for each statement, create an iterator
  voc <- create_vocabulary(it, stopwords = tm::stopwords())  # for each document, create a vocabulary
  vecs[rownames(vecs) %in% voc$vocab$terms, , drop = FALSE] %>%  # subset vecs for the words in the new document, then
    colMeans                                                 # find the average vector for the document
}) %>% t  # close the function and sapply, then transpose to return a matrix with one row per statement
For my use case, I need to keep the results separate for each document, so anything that involves pasting-together the elements of d won't work, but surely there must be a better way than what I've cobbled together. I feel like I must be missing something rather obvious.
Any help will be greatly appreciated.
You need to do it in "batch" mode using efficient linear algebra matrix operations. The idea is to have a document-term matrix for the documents d. This matrix will contain information about how many times each word appears in each document. Then you just need to multiply the dtm by the matrix of embeddings:
library(text2vec)
# we are only interested in words that are in the word embeddings
voc = create_vocabulary(rownames(vecs))
# now we will create the document-term matrix
vectorizer = vocab_vectorizer(voc)
dtm = itoken(d, tokenizer = word_tokenizer) %>%
  create_dtm(vectorizer)
# normalize - calculate term frequency, i.e. divide the count of each word
# in a document by the total number of words in the document.
# So at the end we will receive the average of the word vectors (not the sum!)
dtm = normalize(dtm, "l1")
# and now we can calculate a vector for each document (the average of its word vectors)
# using the dot product of dtm and the embeddings matrix
document_vecs = dtm %*% vecs
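If the next step is to compare documents, text2vec's sim2() can be applied directly to the result; a small usage sketch (assuming the objects above):
# cosine similarity between every pair of the new documents
doc_sim = sim2(document_vecs, method = "cosine", norm = "l2")
# or against another matrix of document vectors with the same embedding dimension,
# e.g. a hypothetical `old_document_vecs` built the same way:
# doc_sim_old = sim2(document_vecs, old_document_vecs, method = "cosine", norm = "l2")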

Document similarity using LSA in R

I am working on LSA (using R) for Document Similarity Analysis.
Here are my steps
Imported the text data and created a corpus. Did basic corpus operations like stemming, whitespace removal, etc.
Created LSA space as below
tdm <- TermDocumentMatrix(chat_corpus)
tdm_matrix <- as.matrix(tdm)
tdm.lsa <- lw_bintf(tdm_matrix)*gw_idf(tdm_matrix)
lsaSpace <- lsa(tdm.lsa)
Multidimensional scaling (MDS) on the LSA space
dist.mat.lsa <- dist(t(as.textmatrix(lsaSpace)))
fit <- cmdscale(dist.mat.lsa, eig = TRUE)
points <- data.frame(fit$points, row.names = chat$text)
I want to create a matrix/data frame showing how similar the texts are (as shown in the attached expected result). Rows and columns will be the texts to match, while the cell values will be their similarity. Ideally the diagonal values will be 1 (perfect match) while the rest of the cell values will be less than 1.
Please share some insights into how to do this. Thanks in advance.
Note: I got the Python code for this but need the same in R
similarity = np.asarray(np.asmatrix(dtm_lsa) * np.asmatrix(dtm_lsa).T)
pd.DataFrame(similarity, index=example, columns=example).head(10)
Expected Result
In order to do this you first need to take the S_k and D_k matrices from the lsa space you've created and multiply S_k by the transpose of D_k to get a k by n matrix, where k is the number of dimensions and n is the number of documents. This code would be as follows:
lsaMatrix <- diag(myLSAspace$sk) %*% t(myLSAspace$dk)
Then it's as simple as putting the resulting matrix through the cosine function from the lsa package:
simMatrix <- cosine(lsaMatrix)
This will result in an n x n similarity matrix which can then be used for clustering etc.
You can read more about the S_k and D_k matrices in the lsa package documentation, they're outputs of the SVD applied.
https://cran.r-project.org/web/packages/lsa/lsa.pdf
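Put together with the question's objects, a sketch of the full path from the weighted matrix to a similarity table might look like this (dimcalc_share() is simply the lsa package's default way of choosing the number of dimensions):
library(lsa)
lsaSpace  <- lsa(tdm.lsa, dims = dimcalc_share())   # SVD of the weighted term-document matrix
lsaMatrix <- diag(lsaSpace$sk) %*% t(lsaSpace$dk)   # k x n: documents in the reduced LSA space
simMatrix <- cosine(lsaMatrix)                      # n x n cosine similarities, diagonal = 1
simDf <- as.data.frame(simMatrix)                   # rows/columns correspond to documents
# the same matrix can feed clustering, e.g. hclust(as.dist(1 - simMatrix))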

tm package error "Cannot convert DocumentTermMatrix into normal matrix since vector is too large"

I have created a DocumentTermMatrix that contains 1859 documents (rows) and 25722 terms (columns). In order to perform further calculations on this matrix I need to convert it to a regular matrix. I want to use the as.matrix() command. However, it returns the following error: cannot allocate vector of size 364.8 MB.
> corp
A corpus with 1859 text documents
> mat<-DocumentTermMatrix(corp)
> dim(mat)
[1] 1859 25722
> is(mat)
[1] "DocumentTermMatrix"
> mat2<-as.matrix(mat)
Fehler: kann Vektor der Größe 364.8 MB nicht allozieren # cannot allocate vector of size 364.8 MB
> object.size(mat)
5502000 bytes
For some reason the size of the object seems to increase dramatically whenever it is transformed to a regular matrix. How can I avoid this?
Or is there an alternative way to perform regular matrix operations on a DocumentTermMatrix?
The quick and dirty way is to export your data into a sparse matrix object from an external package like Matrix.
> attributes(dtm)
$names
[1] "i" "j" "v" "nrow" "ncol" "dimnames"
$class
[1] "DocumentTermMatrix" "simple_triplet_matrix"
$Weighting
[1] "term frequency" "tf"
The dtm object has the i, j and v attributes, which are the internal representation of your DocumentTermMatrix. Use:
library("Matrix")
mat <- sparseMatrix(
  i = dtm$i,
  j = dtm$j,
  x = dtm$v,
  dims = c(dtm$nrow, dtm$ncol)
)
and you're done.
A naive comparison between your objects:
> mat[1,1:100]
> head(as.vector(dtm[1,]), 100)
will each give you the exact same output.
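Once it is a Matrix object, the usual operations work without ever densifying it; for instance (carrying the names across is optional):
dimnames(mat) <- dtm$dimnames                # copy document and term names across
rowSums(mat)[1:5]                            # token counts of the first five documents
head(sort(colSums(mat), decreasing = TRUE))  # most frequent terms in the corpus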
DocumentTermMatrix uses a sparse matrix representation, so it doesn't take up all that memory storing all those zeros. Depending on what it is you want to do, you might have some luck with the SparseM package, which provides some linear algebra routines using sparse matrices.
Are you able to increase the amount of RAM available to R? See this post: Increasing (or decreasing) the memory available to R processes
Also, sometimes when working with big objects in R, I occasionally call gc() to free up wasted memory.
The number of documents should not be a problem, but you may want to try removing sparse terms; this could very well reduce the dimension of the document-term matrix.
inspect(removeSparseTerms(dtm, 0.7))
It removes terms whose sparsity is greater than 0.7, i.e. it keeps only terms that appear in at least 30% of the documents.
Another option is to specify a minimum word length and a minimum document frequency when you create the document-term matrix:
a.dtm <- DocumentTermMatrix(a.corpus, control = list(weighting = weightTfIdf, minWordLength = 2, minDocFreq = 5))
Use inspect(dtm) before and after your changes and you will see a huge difference; more importantly, you won't ruin significant relations hidden in your docs and terms.
Since you only have 1859 documents, the distance matrix you need to compute is fairly small. Using the slam package (and in particular, its crossapply_simple_triplet_matrix function), you might be able to compute the distance matrix directly, instead of converting the DTM into a dense matrix first. This means that you will have to compute the Jaccard similarity yourself. I have successfully tried something similar for the cosine distance matrix on a large number of documents.
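For the cosine case, a sketch that stays in slam's sparse representation (it uses tcrossprod_simple_triplet_matrix rather than crossapply, assumes dtm is the DocumentTermMatrix from the question, and the Jaccard version would need the analogous binary computation):
library(slam)
cross <- tcrossprod_simple_triplet_matrix(dtm)  # 1859 x 1859 document dot products (dense, but small)
norms <- sqrt(diag(cross))                      # Euclidean length of each document vector
cos_sim  <- cross / outer(norms, norms)         # cosine similarity matrix
cos_dist <- 1 - cos_sim                         # cosine distance, if a distance matrix is needed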
