Can I prune a parser's vocabulary in spaCy?

The following code uses spaCy word vectors to find the 20 words most similar to a given word, by first computing the cosine similarity with every word in the vocabulary (more than a million entries) and then sorting that list by similarity.
from numpy import dot
from numpy.linalg import norm
from spacy.lang.en import English

parser = English()  # with current spaCy, a model with vectors (e.g. en_core_web_md) is needed here
word = "dog"        # example query word
# access known words from the parser's vocabulary
current_word = parser.vocab[word]
# cosine similarity
cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))
# gather all known words, take only the lowercased versions
allWords = list({w for w in parser.vocab if w.has_vector and w.orth_.islower() and w.lower_ != word})
# sort by similarity, most similar first
allWords.sort(key=lambda w: cosine(w.vector, current_word.vector))
allWords.reverse()
print("Top 20 most similar words to %s:" % word)
for w in allWords[:20]:
    print(w.orth_)
What I would like to know is whether there is a way to restrict spaCy's vocabulary to only the words that occur in a given list, which I hope would hugely reduce the cost of the sort operation.
To be clear, I would like to pass in a list of just a few words, or just the words in a given text, and be able to rapidly look up which of these words are nearest each other in spaCy's vector space.
Any help on this front is appreciated.
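For concreteness, a naive version of what I'm after looks roughly like the sketch below (the function name and candidate list are placeholders, not code from my project); what I'm hoping for is something built into spaCy that does this faster than looping in Python:
import numpy as np

# Rough sketch only: score a short candidate list instead of the whole vocabulary.
def nearest_in_list(parser, word, candidates, n=20):
    target = parser.vocab[word].vector
    scored = []
    for cand in candidates:
        vec = parser.vocab[cand].vector
        if not vec.any():  # skip candidates without a vector
            continue
        sim = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        scored.append((sim, cand))
    return [cand for sim, cand in sorted(scored, reverse=True)[:n]]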

The spaCy documentation says that:
The default English model installs vectors for one million vocabulary entries, using the 300-dimensional vectors trained on the Common Crawl corpus using the GloVe algorithm. The GloVe common crawl vectors have become a de facto standard for practical NLP.
So you could just load the GloVe vectors using Gensim. I'm not sure if you can load them directly, or if you have to use this script.
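If it helps, here is a hedged sketch of the conversion route (the file paths are placeholders; gensim's glove2word2vec helper is one way to add the header that load_word2vec_format expects, and recent gensim versions can also read GloVe text files directly with no_header=True):
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

# placeholder paths; point these at wherever your GloVe vectors live
glove2word2vec("glove.840B.300d.txt", "glove.840B.300d.w2v.txt")  # prepends the word2vec header
model = KeyedVectors.load_word2vec_format("glove.840B.300d.w2v.txt", binary=False)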
If you have loaded the word vectors in Gensim as model, you can simply use model.similarity('woman', 'man') to get the similarity between two words. If you have a list of candidate words, you could do something like:
def most_similar(word, candidates, model, n=20):
    """Get the n most similar words to `word` from a list of candidates."""
    similarities = [(model.similarity(word, candidate), candidate)
                    for candidate in candidates]
    most_similar_words = sorted(similarities, reverse=True)[:n]
    only_words = [w for sim, w in most_similar_words]
    return only_words
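For example, assuming model is the KeyedVectors object loaded above and the candidate list comes from your own text (both hypothetical here, and every candidate needs to be in the model's vocabulary):
candidates = ["queen", "king", "prince", "table", "apple"]  # placeholder word list
print(most_similar("woman", candidates, model, n=3))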

spaCy has a Vectors class with a most_similar method. You could define a wrapper function like the following to avoid writing your own implementation:
import spacy
import numpy as np

def most_similar(word, model, n=20):
    nlp = spacy.load(model)
    doc = nlp(word)
    vecs = [token.vector for token in doc]
    queries = np.array(vecs)
    # most_similar returns (keys, best_rows, scores), each with one row per query vector
    keys_arr, best_rows_arr, scores_arr = nlp.vocab.vectors.most_similar(queries, n=n)
    keys = keys_arr[0]  # the array of keys is nested in another array, so take the first row
    similar_words_list = [nlp.vocab[key].text for key in keys]
    return similar_words_list
And call it like this: most_similar('apple', 'en_core_web_md', n=20). This finds the 20 words most similar to "apple" by cosine similarity, based on the spaCy model package "en_core_web_md".
This is the result: ['BLACKBERRY', 'APPLE', 'apples', 'PRUNES', 'iPHone', '3g/3gs', 'fruit', 'FIG', 'CREAMSICLE', 'iPad', 'ipad4', 'LONGAN', 'CALVADOS', 'iPOD', 'iPod', 'SORBET', 'PERSICA', 'peach', 'juice', 'JUICE']
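If, as in the original question, you only want distinct lowercase results, you could over-fetch and filter the returned list yourself (a small post-processing sketch, not part of the spaCy API):
words = most_similar("apple", "en_core_web_md", n=50)
filtered = []
for w in words:
    # keep only distinct lowercase words, excluding the query itself
    if w.islower() and w != "apple" and w not in filtered:
        filtered.append(w)
print(filtered[:20])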

Related

Finding the cosine similarity of a sentence with many others in r

I would like to use R to find the cosine similarity of one sentence with many others. For example:
s1 <- "The book is on the table"
s2 <- "The pen is on the table"
s3 <- "Put the pen on the book"
s4 <- "Take the book and pen"
sn <- "Take the book and pen from the table"
I want to find the cosine similarity of s1, s2, s3 and s4 with sn. I understand that I have to use vectors (convert the sentences into vectors and use TF-IDF and/or dot product) but since I'm relatively new to R, I'm having a problem implementing it.
Would appreciate all help.
The cosine dissimilarity used by stringdist isn't based on words, or terms, but qgrams, which are sequences of q characters, which might or might not form words. We can intuitively see that there's something wrong with the output given in Rui's answer. The only difference between the two first sentences is pen and book, while the last sentence contains both of these words once, so we'd expect the s1–sn and s2–sn dissimilarities to be identical, which they aren't.
There are probably other R libraries that can compute more conventional cosine similarities, but it's also not too hard to do it ourselves, from first principles. And it might end up being more educational.
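Concretely, the quantities computed below are the standard ones, with N the number of sentences and n_t the number of sentences containing term t:

$$\mathrm{idf}(t) = \log\frac{N}{n_t}, \qquad w_{t,d} = \mathrm{tf}(t,d)\,\mathrm{idf}(t), \qquad \cos(\mathbf{a},\mathbf{b}) = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}$$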
sv <- c(s1=s1, s2=s2, s3=s3, s4=s4, sn=sn)
# Split sentences into words
svs <- strsplit(tolower(sv), "\\s+")
# Calculate term frequency tables (tf)
termf <- table(stack(svs))
# Calculate inverse document frequencies (idf)
idf <- log(1/rowMeans(termf != 0))
# Multiply to get tf-idf
tfidf <- termf*idf
# Calculate dot products between the last tf-idf and all the previous
dp <- t(tfidf[,5]) %*% tfidf[,-5]
# Divide by the product of the Euclidean norms to get the cosine similarity
cosim <- dp/(sqrt(colSums(tfidf[,-5]^2))*sqrt(sum(tfidf[,5]^2)))
cosim
#           [,1]      [,2]       [,3]      [,4]
# [1,] 0.1215616 0.1215616 0.02694245 0.6198245
The best way to do what the question asks for is to use package stringdist.
library(stringdist)
stringdist(sn, c(s1, s2, s3, s4), method = "cosine")
#[1] 0.06479426 0.08015590 0.09776951 0.04376841
In the case where the strings' names follow an obvious pattern, such as the ones in the question, mget can be of use; there is then no need to hard-code the string names one by one in the call to stringdist.
s_vec <- unlist(mget(ls(pattern = "^s\\d+")))
stringdist(sn, s_vec, method = "cosine")
#[1] 0.06479426 0.08015590 0.09776951 0.04376841

Expected input to torch Embedding layer with pre_trained vectors from gensim

I would like to use pre-trained embeddings in my neural network architecture. The pre-trained embeddings are trained with gensim. I found this informative answer which indicates that we can load pre-trained models like so:
import gensim
import torch
from torch import nn

model = gensim.models.KeyedVectors.load_word2vec_format('path/to/file')
weights = torch.FloatTensor(model.vectors)
emb = nn.Embedding.from_pretrained(weights)
This seems to work correctly, also on PyTorch 1.0.1. My question is that I don't quite understand what I have to feed into such a layer to make use of it. Can I just feed the tokens of a segmented sentence? Do I need a mapping, for instance token-to-index?
I found that you can access a token's vector simply with something like
print(model['the'])
# [-1.1206588e+00 1.1578362e+00 2.8765252e-01 -1.1759659e+00 ... ]
What does that mean for an RNN architecture? Can we simply load in the tokens of the batch sequences? For instance:
for seq_batch, y in batch_loader():
    # seq_batch is a batch of sequences (tokenized sentences)
    # e.g. [['i', 'like', 'cookies'], ['it', 'is', 'raining'], ['who', 'are', 'you']]
    output, hidden = model(seq_batch, hidden)
This does not seem to work, so I am assuming you need to convert the tokens to their indices in the final word2vec model. Is that true? I found that you can get the indices of words by using the word2vec model's vocab:
model.vocab['world'].index
# 147
So, as input to an Embedding layer, should I provide a tensor of ints for a batch of sentences, each consisting of a sequence of word indices? An example with a dummy dataloader (cf. the example above) and a dummy RNN would be welcome.
The documentation says the following:
This module is often used to store word embeddings and retrieve them using indices. The input to the module is a list of indices, and the output is the corresponding word embeddings.
So if you want to feed in a sentence, you give a LongTensor of indices, each corresponding to a word in the vocabulary, which the nn.Embedding layer will map into word vectors going forward.
Here's an illustration:
import torch
from torch import nn

test_voc = ["ok", "great", "test"]

# The word vectors for "ok", "great" and "test"
# are at indices 0, 1 and 2, respectively.
my_embedding = torch.rand(3, 50)
e = nn.Embedding.from_pretrained(my_embedding)

# A LongTensor of indices corresponds to a sentence,
# reshaped to (1, 3) because the batch size is 1.
my_sentence = torch.tensor([0, 2, 1]).view(1, -1)

res = e(my_sentence)
print(res.shape)
# => torch.Size([1, 3, 50])
# 1 is the batch dimension, and there are three vectors of length 50 each
In terms of RNNs, you can then feed that tensor into your RNN module, e.g.:
lstm = nn.LSTM(input_size=50, hidden_size=5, batch_first=True)
output, h = lstm(res)
print(output.shape)
# => torch.Size([1, 3, 5])
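To go from the raw token batches in the question to index tensors like my_sentence, you can look the tokens up in the gensim vocabulary first. A rough sketch (the helper name and padding scheme are mine, and it assumes the gensim KeyedVectors object and the emb layer from the question):
import torch

def batch_to_indices(seq_batch, w2v, pad_idx=0):
    # Map tokenized sentences to a LongTensor of vocabulary indices.
    # Uses the gensim < 4.0 API from the question; newer gensim uses w2v.key_to_index[tok].
    max_len = max(len(seq) for seq in seq_batch)
    rows = []
    for seq in seq_batch:
        idxs = [w2v.vocab[tok].index for tok in seq if tok in w2v.vocab]
        idxs += [pad_idx] * (max_len - len(idxs))  # crude padding; a dedicated padding row is better in practice
        rows.append(idxs)
    return torch.tensor(rows, dtype=torch.long)

batch = [['i', 'like', 'cookies'], ['it', 'is', 'raining'], ['who', 'are', 'you']]
indices = batch_to_indices(batch, model)  # `model` is the gensim KeyedVectors from the question
embedded = emb(indices)                   # `emb` is the pretrained nn.Embedding from the question
# embedded has shape (3, 3, embedding_dim) and can go straight into an LSTM with batch_first=True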
I also recommend you look into torchtext. It can automate some of the things you would otherwise have to do manually.

R: how to find the most optimal string matches while combining different distance metrics criteria?

I have two data files to merge, and both of them have the key column fund_name, but the fund_name values in the two files may differ and it's possible that some of the rows have no match at all. Therefore, I want to do fuzzy matching, returning the best match for each row.
I've read the relevant thread agrep: only return best matches and I've tried the amatch(string, stringVector, maxDist = Inf) function in the stringdist package, and it worked well.
I saw there are many different methods (i.e. string distance metrics) in amatch(), like "osa", "lv", "dl"... I wonder whether I can combine them and return a value only when all of them find the same match. If so, how should I write the algorithm?
In this fuzzy matching work I care more about the accuracy of a match than about finding a match for every row. Many thanks for your help!
A possible solution is to use base R's adist function to calculate the Levenshtein distance:
names<-"key, fund_name, keyword"
names_split<-strsplit(names, ", ")[[1]]
names2<-"fund_name2, other_keyword"
names_split2<-strsplit(names2, ", ")[[1]]
# It creates a matrix with the Standard Levenshtein distance between the name fields of both sources
dist.name<-adist(names_split, names_split2, partial = TRUE, ignore.case = TRUE)
# We now take the pairs with the minimum distance
min.name<-apply(dist.name, 1, min)
match.s1.s2<-NULL
for(i in 1:nrow(dist.name))
{
  s2.i <- match(min.name[i], dist.name[i,])
  s1.i <- i
  match.s1.s2 <- rbind(data.frame(s2.i = s2.i, s1.i = s1.i,
                                  s2name = names_split2[s2.i],
                                  s1name = names_split[s1.i],
                                  adist = min.name[i]),
                       match.s1.s2)
}
# and we then can have a look at the results
View(match.s1.s2)

Fast Implementation of TF-IDF

I am trying to calculate Term Frequency-Inverse Document Frequency to get normalized weights using the function below. When the number of rows is in the hundreds, the results come back quickly, but when the number of rows is in the thousands (just 20 thousand), it takes almost 3 to 4 minutes to get the result. Can someone point me in the right direction to reduce the computation time?
tfidf = function(mat){
  # `names` is a character vector of the term columns to keep, defined elsewhere
  mat = mat[, names]
  tf = mat / rowSums(mat)
  id = function(col){ sum(!col == 0) }
  idf = log10(nrow(mat) / apply(mat, 2, id))
  tfidf = mat
  for(word in names(idf)){ tfidf[, word] <- tf[, word] * idf[word] }
  return(tfidf)
}
I would recommend using text2vec's TfIdf() class -- it is super fast when used together with a matrix created by text2vec::create_dtm(). The interface is similar to sklearn's, if you've ever used that.
Check out this part of the vectorization vignette for an example of tfidf weighting in action.
(More generally, I've found most of text2vec's core functionality to be shockingly fast, so if you're using R for NLP it's a good option :p)
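For a sense of what that sklearn-style interface looks like, here is a minimal Python sketch using scikit-learn's TfidfVectorizer (purely for comparison; the text2vec route above is what is being recommended for R):
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the book is on the table", "the pen is on the table"]  # toy documents
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)  # sparse document-term matrix of tf-idf weights
print(tfidf_matrix.shape)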

Apply text2vec embeddings to new data

I used text2vec to generate custom word embeddings from a corpus of proprietary text data that contains a lot of industry-specific jargon (thus stock embeddings like those available from Google won't work). The analogies work great, but I'm having difficulty applying the embeddings to assess new data. I want to use the embeddings that I've already trained to understand relationships in new data. The approach I'm using (described below) seems convoluted, and it's painfully slow. Is there a better approach? Perhaps something already built into the package that I've simply missed?
Here's my approach (offered with the closest thing to reproducible code I can generate given that I'm using a proprietary data source):
d = a list containing the new data; each element is of class character
vecs = the word vectors obtained from text2vec's implementation of GloVe
new_vecs <- sapply(d, function(y){
  it  <- itoken(word_tokenizer(y), progressbar = FALSE)         # for each statement, create an iterator
  voc <- create_vocabulary(it, stopwords = tm::stopwords())     # for each document, create a vocabulary
  vecs[rownames(vecs) %in% voc$vocab$terms, , drop = FALSE] %>% # subset vecs for the words in the new document, then
    colMeans                                                    # take the average vector for the document
}) %>% t  # close the function and sapply, then transpose to return a matrix with one row per statement
For my use case, I need to keep the results separate for each document, so anything that involves pasting together the elements of d won't work, but surely there must be a better way than what I've cobbled together. I feel like I must be missing something rather obvious.
Any help will be greatly appreciated.
You need to do it in "batch" mode using efficient linear-algebra matrix operations. The idea is to build a document-term matrix for the documents d. This matrix contains information about how many times each word appears in each document. Then you just need to multiply the dtm by the matrix of embeddings:
library(text2vec)
# we are interested only in words which are in the word embeddings
voc = create_vocabulary(rownames(vecs))
# now we will create the document-term matrix
vectorizer = vocab_vectorizer(voc)
dtm = itoken(d, tokenizer = word_tokenizer) %>%
  create_dtm(vectorizer)
# normalize - calculate term frequency, i.e. divide the count of each word
# in a document by the total number of words in that document.
# So at the end we will receive the average of the word vectors (not the sum!)
dtm = normalize(dtm, "l1")
# and now we can calculate vectors for each document (average of the word vectors)
# using the dot product of dtm and the embeddings matrix
document_vecs = dtm %*% vecs
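In matrix form, with C the raw count matrix and e_t the embedding of term t, row i of the result is the count-weighted average of the word vectors:

$$\mathbf{d}_i = \sum_{t} \frac{C_{i,t}}{\sum_{t'} C_{i,t'}}\,\mathbf{e}_t$$

computed in a single sparse matrix product instead of one colMeans call per document.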
