Finding the cosine similarity of a sentence with many others in R

I would like to use R to find the cosine similarity of one sentence with many others. For example:
s1 <- "The book is on the table"
s2 <- "The pen is on the table"
s3 <- "Put the pen on the book"
s4 <- "Take the book and pen"
sn <- "Take the book and pen from the table"
I want to find the cosine similarity of s1, s2, s3 and s4 with sn. I understand that I have to use vectors (convert the sentences into vectors and use TF-IDF and/or dot product) but since I'm relatively new to R, I'm having a problem implementing it.
Would appreciate all help.

The cosine dissimilarity used by stringdist isn't based on words, or terms, but on q-grams: sequences of q characters that may or may not form words. We can intuitively see that there's something wrong with the output given in Rui's answer. The only difference between the first two sentences is pen versus book, and the last sentence contains both of these words exactly once, so we'd expect the s1–sn and s2–sn dissimilarities to be identical, which they aren't.
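To see what stringdist is actually comparing, its qgrams() helper returns the character q-gram counts it works with (shown here with q = 2; a small sketch for illustration only):
library(stringdist)
# character 2-grams (not words) for the first and last sentence
qgrams(s1 = s1, sn = sn, q = 2)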
There are probably other R packages that can compute more conventional cosine similarities, but it's also not too hard to do it ourselves from first principles, and it might end up being more educational.
sv <- c(s1=s1, s2=s2, s3=s3, s4=s4, sn=sn)
# Split sentences into words
svs <- strsplit(tolower(sv), "\\s+")
# Calculate term frequency tables (tf)
termf <- table(stack(svs))
# Calculate inverse document frequencies (idf)
idf <- log(1/rowMeans(termf != 0))
# Multiply to get tf-idf
tfidf <- termf*idf
# Calculate dot products between the last tf-idf and all the previous
dp <- t(tfidf[,5]) %*% tfidf[,-5]
# Divide by the product of the Euclidean norms to get the cosine similarity
cosim <- dp/(sqrt(colSums(tfidf[,-5]^2))*sqrt(sum(tfidf[,5]^2)))
cosim
#           [,1]      [,2]       [,3]      [,4]
# [1,] 0.1215616 0.1215616 0.02694245 0.6198245
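Note that s1 and s2 now get identical similarities to sn (0.1215616), as argued above they should.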

The best way to do what the question asks for is to use package stringdist.
library(stringdist)
stringdist(sn, c(s1, s2, s3, s4), method = "cosine")
#[1] 0.06479426 0.08015590 0.09776951 0.04376841
In the case where the strings' names follow an obvious pattern, such as the ones in the question, mget can be of use: there is no need to hard-code the string names one by one in the call to stringdist.
s_vec <- unlist(mget(ls(pattern = "^s\\d+")))
stringdist(sn, s_vec, method = "cosine")
#[1] 0.06479426 0.08015590 0.09776951 0.04376841
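Keep in mind that stringdist returns a distance (dissimilarity), not a similarity; for the cosine method a similarity in [0, 1] can be obtained by subtracting from 1:
1 - stringdist(sn, s_vec, method = "cosine")
#[1] 0.9352057 0.9198441 0.9022305 0.9562316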

Related

Calculate similarity matrix for the 1st column

I have started working on a few ML projects and use R as the preferred language. I am trying to build a basic recommendation system following this tutorial:
http://www.dataperspective.info/2014/05/basic-recommendation-engine-using-r.html
I need to find the similarity matrix (according to the website), using the cosine function (from the 'lsa' package) to find user_similarity.
library(lsa)
data_rating <- read.csv("recommendation_basic1.csv", header = TRUE)
x = data_rating[,2:7]
x[is.na(x)] = 0
print(x)
similarity_users <- cosine(as.matrix(x))
similarity_users
But I need to find the similarity matrix among users, and this code is giving me a similarity matrix among the movies. Do I need to modify the line below?
x = data_rating[,2:7]
PS. The recommendation_basic1.csv is the same as in the link.
Putting this in so the question is not unanswered.
You can just use similarity_users <- cosine(as.matrix(t(x)))
Here, t is the matrix transpose; it just switches the rows and columns, which is equivalent to switching the users and the movies.
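As a minimal sketch of the difference, with a hypothetical toy rating matrix x (rows are users, columns are movies; the data and names are made up for illustration):
library(lsa)
# toy ratings: 3 users (rows) x 4 movies (columns), 0 = not rated
x <- matrix(c(5, 3, 0, 1,
              4, 0, 0, 1,
              1, 1, 5, 4),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("user", 1:3), paste0("movie", 1:4)))
cosine(as.matrix(x))    # 4 x 4 matrix: similarity between movies (columns)
cosine(as.matrix(t(x))) # 3 x 3 matrix: similarity between users (rows)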

Apply text2vec embeddings to new data

I used text2vec to generate custom word embeddings from a corpus of proprietary text data that contains a lot of industry-specific jargon (thus stock embeddings like those available from Google won't work). The analogies work great, but I'm having difficulty applying the embeddings to assess new data. I want to use the embeddings that I've already trained to understand relationships in new data. The approach I'm using (described below) seems convoluted, and it's painfully slow. Is there a better approach? Perhaps something already built into the package that I've simply missed?
Here's my approach (offered with the closest thing to reproducible code I can generate given that I'm using a proprietary data source):
d = a list containing the new data; each element is of class character
vecs = the word vectors obtained from text2vec's implementation of GloVe
new_vecs <- sapply(d, function(y){
  # for each statement, create an iterator
  it <- itoken(word_tokenizer(y), progressbar = FALSE)
  # for each document, create a vocabulary
  voc <- create_vocabulary(it, stopwords = tm::stopwords())
  # subset vecs for the words in the new document, then
  # find the average vector for the document
  vecs[rownames(vecs) %in% voc$vocab$terms, , drop = FALSE] %>%
    colMeans
}) %>% t # transpose to return a matrix with one row per statement
For my use case, I need to keep the results separate for each document, so anything that involves pasting-together the elements of d won't work, but surely there must be a better way than what I've cobbled together. I feel like I must be missing something rather obvious.
Any help will be greatly appreciated.
You need to do it in "batch" mode using efficient linear algebra matrix operations. The idea is to build a document-term matrix (dtm) for the documents d; this matrix contains the count of each word in each document. Then you just need to multiply the dtm by the matrix of embeddings:
library(text2vec)
# we are interested only in words that are in the word embeddings
voc = create_vocabulary(rownames(vecs))
# now we will create the document-term matrix
vectorizer = vocab_vectorizer(voc)
dtm = itoken(d, tokenizer = word_tokenizer) %>%
  create_dtm(vectorizer)
# normalize - calculate term frequency, i.e. divide the count of each word
# in a document by the total number of words in that document.
# So at the end we will receive the average of the word vectors (not the sum!)
dtm = normalize(dtm, "l1")
# and now we can calculate vectors for the documents (the average of their word vectors)
# as the dot product of the dtm and the embeddings matrix
document_vecs = dtm %*% vecs
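If pairwise cosine similarities between the new documents are also needed afterwards, text2vec's sim2() can be applied to these document vectors (a short sketch, reusing document_vecs from above):
# cosine similarity between every pair of new documents
doc_sim = sim2(document_vecs, method = "cosine", norm = "l2")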

Can I prune a parser's vocabulary in spaCy?

The following code uses spaCy word vectors to find the 20 most similar words to a given word by first computing the cosine similarity with every word in the vocabulary (more than a million of them), then sorting the list by similarity.
from numpy import dot
from numpy.linalg import norm
from spacy.en import English  # spaCy 1.x-style import assumed by this snippet

parser = English()
# access known words from the parser's vocabulary
current_word = parser.vocab[word]
# cosine similarity
cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))
# gather all known words, take only the lowercased versions
allWords = list({w for w in parser.vocab if w.has_vector and w.orth_.islower() and w.lower_ != word})
# sort by similarity
allWords.sort(key=lambda w: cosine(w.vector, current_word.vector))
allWords.reverse()
print("Top 20 most similar words to %s:") % word
for word in allWords[:20]:
print(word.orth_)
What I would like to know is whether there is a way to restrict spaCy's vocabulary only to the words that occur in a given list, which I hope would hugely reduce the cost of the sort operation.
To be clear, I would like to pass in a list of just a few words, or just the words in a given text, and be able to rapidly look up which of these words are nearest each other in spaCy's vector space.
Any help on this front appreciated.
The spaCy documentation says that:
"The default English model installs vectors for one million vocabulary entries, using the 300-dimensional vectors trained on the Common Crawl corpus using the GloVe algorithm. The GloVe common crawl vectors have become a de facto standard for practical NLP."
So you could just load the GloVe vectors using Gensim. I'm not sure if you can load them directly, or if you have to use this script.
If you have loaded the word vectors in Gensim as model, you can simply use model.similarity('woman', 'man') to get the similarity between two words. If you have a list of words, you could do something like:
def most_similar(word, candidates, model, n=20):
    """Get the n most similar words from a list of candidates."""
    similarities = [(model.similarity(word, candidate), candidate)
                    for candidate in candidates]
    most_similar_words = sorted(similarities, reverse=True)[:n]
    only_words = [w for sim, w in most_similar_words]
    return only_words
spaCy has a Vectors class with a most_similar method. You could then define a wrapper function like this to avoid writing your own implementation:
import spacy
import numpy as np

def most_similar(word, model, n=20):
    nlp = spacy.load(model)
    doc = nlp(word)
    vecs = [token.vector for token in doc]
    queries = np.array(vecs)
    keys_arr, best_rows_arr, scores_arr = nlp.vocab.vectors.most_similar(queries, n=n)
    keys = keys_arr[0]  # the array of keys is nested in another array from the previous step
    similar_words_list = [nlp.vocab[key].text for key in keys]
    return similar_words_list
And call it like this: most_similar('apple', 'en_core_web_md', n=20). This finds the 20 words most similar (by cosine similarity) to "apple" in the spaCy model package "en_core_web_md".
This is the result: ['BLACKBERRY', 'APPLE', 'apples', 'PRUNES', 'iPHone', '3g/3gs', 'fruit', 'FIG', 'CREAMSICLE', 'iPad', 'ipad4', 'LONGAN', 'CALVADOS', 'iPOD', 'iPod', 'SORBET', 'PERSICA', 'peach', 'juice', 'JUICE']

Document similarity using LSA in R

I am working on LSA (using R) for Document Similarity Analysis.
Here are my steps:
Imported the text data and created a corpus. Did basic corpus operations like stemming, whitespace removal, etc.
Created the LSA space as below:
tdm <- TermDocumentMatrix(chat_corpus)
tdm_matrix <- as.matrix(tdm)
tdm.lsa <- lw_bintf(tdm_matrix)*gw_idf(tdm_matrix)
lsaSpace <- lsa(tdm.lsa)
Multidimensional scaling (MDS) on the LSA space:
dist.mat.lsa <- dist(t(as.textmatrix(lsaSpace)))
fit <- cmdscale(dist.mat.lsa,eig = T)
points <- data.frame(fit$points, row.names = chat$text)
I want to create a matrix/data frame showing how similar the texts are (as shown in the attachment Result). The rows and columns will be the texts to match, while the cell values will be their similarity values. Ideally the diagonal values will be 1 (a perfect match) while the rest of the cell values will be less than 1.
Please share some insights into how to do this. Thanks in advance.
Note: I got the Python code for this but need the same in R:
similarity = np.asarray(np.asmatrix(dtm_lsa) * np.asmatrix(dtm_lsa).T)
pd.DataFrame(similarity,index=example, columns=example).head(10)
Expected Result
In order to do this, you first need to take the S_k and D_k matrices from the LSA space you've created and multiply S_k by the transpose of D_k to get a k by n matrix, where k is the number of dimensions and n is the number of documents. The code is as follows:
lsaMatrix <- diag(myLSAspace$sk) %*% t(myLSAspace$dk)
Then it's as simple as putting the resulting matrix through the cosine function from the lsa package:
simMatrix <- cosine(lsaMatrix)
This will result in an n x n similarity matrix, which can then be used for clustering, etc.
You can read more about the S_k and D_k matrices in the lsa package documentation; they are outputs of the SVD that is applied.
https://cran.r-project.org/web/packages/lsa/lsa.pdf
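Tying this back to the lsaSpace object created in the question, a minimal sketch (assuming the lsa package is loaded, as in the question):
# rebuild the k x n document representation from the question's lsaSpace
lsaMatrix <- diag(lsaSpace$sk) %*% t(lsaSpace$dk)
# n x n document-document cosine similarity: 1 on the diagonal, smaller values elsewhere
simMatrix <- cosine(lsaMatrix)
round(simMatrix, 2)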

String distance metric that favors substrings and is word-order independent?

For my data analytics problems I often need to normalize names: given two names A and B, I'd consider them the same or very similar if A and B share a substantial number of common substrings, regardless of the order of those substrings.
For example, for "COLD" and c("FLOOD", "COLD/WIND CHILL"), I'd like "COLD/WIND CHILL" to come out as much more similar to "COLD" than "FLOOD" is.
My current assignment is in R. So my concrete questions are the following:
Is there such a metric already defined in R?
Is it possible to provide my own implementation and somehow integrate it with R's stringdist package?
For my requirement I could simply use regular-expression search: as long as I can find A in B or B in A, I may just consider their distance to be 0.
Thanks a lot!
Edit:
In the context of the following:
> vv <- c("FLOOD", "COLD/WIND CHILL")
> sapply(vv, adist, y = "COLD")
FLOOD COLD/WIND CHILL
    3              11
I wish the distance from "COLD" to "COLD/WIND CHILL" were smaller than the distance from "COLD" to "FLOOD".
It seems that the metric has to ignore the remaining part that would be deleted after finding the matched substring.
Edit1:
My original problem has been solved. Here is a follow-up with a related problem using amatch from the stringdist package in R:
It seems that I am not able to reproduce with amatch the results that are equivalent to those from adist, or even from stringdist in the same package.
Below is the illustration:
vv <- c("FLOOD", "COLD/WIND CHILL")
sapply(vv, adist, y = "COLD", costs = list(deletions = 0))
FLOOD COLD/WIND CHILL
    2               0
stringdist("COLD", c("FLOOD", " COLD/WIND CHILL"), method = 'lv', weight=c(0.001, 0.99, 0.99, 0.99))
[1] 1.981 1.002
amatch("COLD", c("FLOOD", " COLD/WIND CHILL"), method = 'lv', weight=c(0.0001, 0.999, 0.999, 0.999), maxDist = 100)
[1] 1
In the above context, by using the computation of stringdist, amatch should return 2, instead of 1.
Based on the documentation of stringdist:
"weight: For method='osa' or 'dl', the penalty for deletion, insertion, substitution and transposition, in that order. When method='lv', the penalty for transposition is ignored."
I have chosen the weights accordingly to remove the penalty for deletion while maximizing the penalty for the other operations. It's encouraging that stringdist shows the expected behavior with this weight setting.
I'd assume that amatch uses stringdist to do the calculation, but it seems strange that the behavior of amatch contradicts that of stringdist!
I wish to get amatch working so that I don't have to re-implement it using adist or stringdist.
Thanks again for the help.
You can use adist for fuzzy distance. The distance is a generalized Levenshtein distance.
vv <- c("COLD","FLOOD")
sapply(vv, adist, y = "COLD/WIND CHILL")
## COLD FLOOD
##   11    13   ## the distance to COLD < the distance to FLOOD
Edit after OP update:
You can play with the costs parameter to set how you want the distance to be computed in terms of deletions, substitutions and insertions. For example, with the vv from the question's edit:
sapply(vv, adist, y = "COLD", costs = list(deletions = 0))
FLOOD COLD/WIND CHILL
    2               0
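If the goal is then to pick the single closest candidate under these costs, a small follow-up sketch (illustrative only):
vv <- c("FLOOD", "COLD/WIND CHILL")
d <- sapply(vv, adist, y = "COLD", costs = list(deletions = 0))
names(which.min(d))
# [1] "COLD/WIND CHILL"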
Here is one direction to pursue. Basically, it intends to break your text up into trigrams (sequences of three letters) and return the associations between each trigram and all the others whenever they reach the level you set (here, 0.8). The glitch is that this code works only at the word level, not at the trigram level as intended. Perhaps if the text file were larger there would be a difference?
library(tm)
library("RWeka")
text <- c("FLOOD", "COLD/WIND CHILL", "OLD", "FRIGID", "FLOW")
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
corpus <- Corpus(VectorSource(text))
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer))
lapply(tdm$dimnames$Terms, function(x) findAssocs(tdm, x, 0.8))
