Document similarity using LSA in R

I am working on LSA (using R) for Document Similarity Analysis.
Here are my steps:
Imported the text data and created a corpus. Did basic corpus operations like stemming, whitespace removal, etc.
Created the LSA space as below:
library(tm)
library(lsa)
tdm <- TermDocumentMatrix(chat_corpus)
tdm_matrix <- as.matrix(tdm)
tdm.lsa <- lw_bintf(tdm_matrix) * gw_idf(tdm_matrix)
lsaSpace <- lsa(tdm.lsa)
Multi-Dimensional Scaling (MDS) on the LSA space:
dist.mat.lsa <- dist(t(as.textmatrix(lsaSpace)))
fit <- cmdscale(dist.mat.lsa, eig = TRUE)
points <- data.frame(fit$points, row.names = chat$text)
I want to create a matrix/data frame showing how similar the texts are (as shown in the attachment Result). Rows and columns will be the texts to match, while the cell values will be their similarity values. Ideally the diagonal values will be 1 (perfect match) while the rest of the cell values will be less than 1.
Please share some insights into how to do this. Thanks in advance.
Note: I got the Python code for this but need the same in R:
similarity = np.asarray(np.asmatrix(dtm_lsa) * np.asmatrix(dtm_lsa).T)
pd.DataFrame(similarity, index=example, columns=example).head(10)
Expected Result

In order to do this, you first need to take the S_k and D_k matrices from the LSA space you've created and multiply S_k by the transpose of D_k to get a k by n matrix, where k is the number of dimensions and n is the number of documents. The code is as follows:
lsaMatrix <- diag(myLSAspace$sk) %*% t(myLSAspace$dk)
Then it's as simple as putting the resulting matrix through the cosine function from the lsa package:
simMatrix <- cosine(lsaMatrix)
This results in an n x n similarity matrix which can then be used for clustering etc.
You can read more about the S_k and D_k matrices in the lsa package documentation; they are outputs of the SVD that is applied.
https://cran.r-project.org/web/packages/lsa/lsa.pdf
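Putting the pieces together with the object names from your question, a minimal sketch (assuming lsaSpace and chat$text exist as in your code above) would be:
library(lsa)
# k x n matrix: singular values times the transposed document-space matrix
lsaMatrix <- diag(lsaSpace$sk) %*% t(lsaSpace$dk)
# n x n cosine similarity matrix; the diagonal is 1
simMatrix <- cosine(lsaMatrix)
# label rows and columns with the original texts, as in the expected result
simDF <- data.frame(simMatrix, row.names = chat$text)
colnames(simDF) <- chat$text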

Related

Calculate the probability of obtaining Spades, Hearts and Clubs in R in that order

I'm struggling with this; can anyone please help me?
My current code is:
N <- probspace(urnsamples(pcartas, 3, replace=FALSE, ordered=TRUE))
CDE <- intersect(subset(N, (N$outcomes[[]][[2]][1] == "Spade")),
                 subset(N, (N$outcomes[[]][[2]][2] == "Heart")),
                 subset(N, (N$outcomes[[]][[2]][3] == "Club")))
pcorazones_diamantes_espadas <- Prob(CDE)
But it is not working.
Thank you very much
I think your pcartas object should be a simple vector, not a data frame. Otherwise you'll have difficulties with extracting elements of a list.
Clear the R workspace, install the prob package and load it.
rm(list=ls())
install.packages("prob")
library(prob)
Create the sample space.
pcartas <- cards()$suit
You'll see that this is a factor vector. I'm going to convert this to a character vector because the urnsamples function converts the labels into their numeric codes, which makes it difficult when you want to calculate the probabilities. You'll see the problem if you don't do this next step.
pcartas <- unclass(as.character(pcartas))
Sample without replacement from this sample space and convert to a probability space (your first line of code above):
L <- urnsamples(pcartas, size = 3, replace = FALSE, ordered = TRUE)
N <- probspace(L)
Calculate the probability of any event. For example, the probability of obtaining a "Spade" on the first card is:
Prob(N, X1=="Spade") # Answer=0.25 or 13/52
And the probability of obtaining a spade on the first card and a heart on the second is:
Prob(N, X1=="Spade" & X2=="Heart") # Answer=0.0637 or 13/52 * 13/51
And finally, the probability of obtaining a spade on the first card and a heart on the second card and a club on the third is:
Prob(N, X1=="Spade" & X2=="Heart" & X3=="Club") # Answer=0.016568 or 13/52 * 13/51 * 13/50
If you didn't convert the sample space object to a character vector, then you would have to use the numeric codes to get the probabilities:
Prob(N, X1==1 & X2==2 & X3==3)
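As a quick numeric sanity check, the analytic products quoted in the comments above can be reproduced directly:
13/52                         # 0.25
(13/52) * (13/51)             # 0.06372549
(13/52) * (13/51) * (13/50)   # 0.01656863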

Adjacent matrix from igraph package to be used for autologistic model in ngspatial package in R

I am interested in running an autologistic model in the ngspatial package in R. My data objects are polygons. Usually, adjacency matrices for polygons are built based on the coordinates of the polygon centroids. However, I have defined my adjacency (0/1) based on a minimum distance criterion between polygons, measured from and to the border of each polygon. I did this in ArcMap, and then generated the adjacency matrix with the igraph package:
g<-graph_from_data_frame(My data)
A<-as_adjacency_matrix(g, attr="Dist")
A
42 x 42 sparse Matrix of class "dgCMatrix"
[[ suppressing 42 column names ‘1’, ‘2’, ‘3’ ... ]]
My matrix contains just 0 and 1 values and is fully symmetric (42 x 42).
However, when I try to use it in an autologistic model in ngspatial I get an error message:
ms_autolog<-autologistic(Occupancy~Area, A=A )
'You must supply a numeric and symmetric adjacency matrix'.
I supposed that dgCMatrix is just not compatible with ngspatial, but I haven't found how to convert it. I have also tried to shape my data .csv file directly as a matrix and read it in as a matrix, but it still cannot be read by the autologistic model.
Does anybody have an idea how I can solve this?
Many thanks in advance!
Ana María.
It's difficult to check this without a minimal working example, but you could try this:
A <- as_adjacency_matrix(g, attr = "Dist", sparse = FALSE)
This way you get a plain numeric matrix of 0s and 1s instead of a sparse dgCMatrix.
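If you already have the sparse matrix A, a minimal sketch for converting it and re-fitting (assuming the Occupancy and Area variables from your question) would be:
A_dense <- as.matrix(A)    # coerce the dgCMatrix to a base numeric matrix
isSymmetric(A_dense)       # should return TRUE
ms_autolog <- autologistic(Occupancy ~ Area, A = A_dense)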

Convert Large Document Term Document Matrix into Matrix

I've got a large term-document matrix (6 elements, 44.3 MB).
I need to convert it into a matrix, but when trying to do so I get the magical error message: "cannot allocate 100 GBs".
Is there any package/library that allows doing this transformation in chunks?
I've tried ff and bigmemory, but they do not seem to allow conversions from DTMs to matrix.
Before converting to a matrix, remove sparse terms from the term-document matrix. This will reduce your matrix size significantly. To remove sparse terms, you can do as below:
library(tm)
## tdm - Term Document Matrix
tdm2 <- removeSparseTerms(tdm, sparse = 0.2)
tdm_Matrix <- as.matrix(tdm2)
Note: I used 0.2 for sparse just as an example. You should decide that value based on your tdm.
Here are some links that shed light on the removeSparseTerms function and the sparse value:
How does the removeSparseTerms in R work?
https://www.rdocumentation.org/packages/tm/versions/0.7-1/topics/removeSparseTerms
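If you need to keep all terms, another option is to convert to a sparse Matrix object instead of a dense matrix, which avoids the huge allocation entirely. A sketch, under the assumption that tdm is a tm TermDocumentMatrix (stored internally as a slam simple triplet matrix with slots i, j, v):
library(Matrix)
# build a sparse dgCMatrix from the triplet representation
tdm_sparse <- sparseMatrix(i = tdm$i, j = tdm$j, x = tdm$v,
                           dims = c(tdm$nrow, tdm$ncol),
                           dimnames = dimnames(tdm))
Many downstream functions that accept a matrix also accept this sparse form.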

Calculate similarity matrix for the 1st column

I have started working on a few ML projects and use R as my preferred language. I am trying to build a basic recommendation system following
http://www.dataperspective.info/2014/05/basic-recommendation-engine-using-r.html
I need to find the similarity matrix (as described on the website), using the cosine function (in the 'lsa' package) to find user_similarity.
library(lsa)
data_rating <- read.csv("recommendation_basic1.csv", header = TRUE)
x = data_rating[,2:7]
x[is.na(x)] = 0
print(x)
similarity_users <- cosine(as.matrix(x))
similarity_users
But I need to find the similarity matrix among users, and this code is giving me a similarity matrix among the movies. Do I need to modify the line below?
x = data_rating[,2:7]
PS: The recommendation_basic1.csv is the same as in the link.
Putting this in so the question is not unanswered.
You can just use similarity_users <- cosine(as.matrix(t(x)))
Here, t is the matrix transpose, so it just switches the rows and columns, which is equivalent to switching the users and the movies.
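This works because cosine() from the lsa package compares the columns of its input. A minimal sketch with made-up ratings (3 users in rows, 4 movies in columns; all values are hypothetical):
library(lsa)
ratings <- matrix(c(5, 3, 0, 1,
                    4, 0, 2, 1,
                    1, 1, 0, 5),
                  nrow = 3, byrow = TRUE)  # rows = users, columns = movies
cosine(ratings)     # 4 x 4 movie-movie similarity
cosine(t(ratings))  # 3 x 3 user-user similarity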

Apply text2vec embeddings to new data

I used text2vec to generate custom word embeddings from a corpus of proprietary text data that contains a lot of industry-specific jargon (thus stock embeddings like those available from Google won't work). The analogies work great, but I'm having difficulty applying the embeddings to assess new data. I want to use the embeddings that I've already trained to understand relationships in new data. The approach I'm using (described below) seems convoluted, and it's painfully slow. Is there a better approach? Perhaps something already built into the package that I've simply missed?
Here's my approach (offered with the closest thing to reproducible code I can generate given that I'm using a proprietary data source):
d = list containing new data; each element is of class character
vecs = the word vectorizations obtained from text2vec's implementation of GloVe
new_vecs <- sapply(d, function(y){
  it <- itoken(word_tokenizer(y), progressbar=FALSE) # for each statement, create an iterator
  voc <- create_vocabulary(it, stopwords= tm::stopwords()) # for each document, create a vocabulary
  vecs[rownames(vecs) %in% voc$vocab$terms, , drop=FALSE] %>% # subset vecs for the words in the new document, then
    colMeans # find the average vector for each document
}) %>% t # close the function and sapply, then transpose to return a matrix with one row per statement
For my use case, I need to keep the results separate for each document, so anything that involves pasting together the elements of d won't work, but surely there must be a better way than what I've cobbled together. I feel like I must be missing something rather obvious.
Any help will be greatly appreciated.
You need to do this in "batch" mode using efficient linear algebra matrix operations. The idea is to build a document-term matrix for the documents d. This matrix will contain information about how many times each word appears in each document. Then you just need to multiply the dtm by the matrix of embeddings:
library(text2vec)
# we are interested only in words which are in the word embeddings
voc = create_vocabulary(rownames(vecs))
# now we will create the document-term matrix
vectorizer = vocab_vectorizer(voc)
dtm = itoken(d, tokenizer = word_tokenizer) %>%
  create_dtm(vectorizer)
# normalize - calculate term frequency, i.e. divide the count of each word
# in a document by the total number of words in the document.
# So at the end we will receive the average of the word vectors (not the sum!)
dtm = normalize(dtm, norm = "l1")
# and now we can calculate vectors for documents (average of word vectors)
# using the dot product of dtm and the embeddings matrix
document_vecs = dtm %*% vecs
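As a hypothetical follow-up, the resulting document vectors can then be compared directly, for example with the sim2 helper from text2vec:
# cosine similarity between every pair of new documents
doc_sim = sim2(as.matrix(document_vecs), method = "cosine", norm = "l2")
You can also compare them against vectors built the same way from your original corpus.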
