Apply text2vec embeddings to new data - r

I used text2vec to generate custom word embeddings from a corpus of proprietary text data that contains a lot of industry-specific jargon (thus stock embeddings like those available from Google won't work). The analogies work great, but I'm having difficulty applying the embeddings to assess new data. I want to use the embeddings that I've already trained to understand relationships in new data. The approach I'm using (described below) seems convoluted, and it's painfully slow. Is there a better approach? Perhaps something already built into the package that I've simply missed?
Here's my approach (offered with the closest thing to reproducible code I can generate given that I'm using a proprietary data source):
d = list containing new data. Each element is of class character.
vecs = the word vectorizations obtained from text2vec's implementation of GloVe.
new_vecs <- sapply(d, function(y){
  it <- itoken(word_tokenizer(y), progressbar = FALSE)          # for each statement, create an iterator
  voc <- create_vocabulary(it, stopwords = tm::stopwords())     # for each document, create a vocab
  vecs[rownames(vecs) %in% voc$vocab$terms, , drop = FALSE] %>% # subset vecs for the words in the new document, then
    colMeans                                                    # find the average vector for each document
}) %>% t  # close y function and sapply, then transpose to return matrix w/ one row for each statement
For my use case, I need to keep the results separate for each document, so anything that involves pasting-together the elements of d won't work, but surely there must be a better way than what I've cobbled together. I feel like I must be missing something rather obvious.
Any help will be greatly appreciated.

You need to do it in "batch" mode using efficient linear algebra matrix operations. The idea is to build a document-term matrix for the documents d. This matrix contains the count of how many times each word appears in each document. Then you just need to multiply the dtm by the matrix of embeddings:
library(text2vec)
# we are interested only in words which are in the word embeddings
voc = create_vocabulary(rownames(vecs))
# now we will create the document-term matrix
vectorizer = vocab_vectorizer(voc)
dtm = itoken(d, tokenizer = word_tokenizer) %>%
  create_dtm(vectorizer)
# normalize - calculate term frequency - i.e. divide the count of each word
# in a document by the total number of words in that document.
# So at the end we will receive the average of the word vectors (not the sum of word vectors!)
dtm = normalize(dtm, "l1")
# and now we can calculate a vector for each document (average of the vectors of its words)
# using the dot product of dtm and the embeddings matrix
document_vecs = dtm %*% vecs
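If you then need to compare the new documents to one another, text2vec's sim2() can be applied directly to these averaged vectors. A minimal sketch, assuming document_vecs from above:
# pairwise cosine similarity between the new documents
doc_sim = sim2(as.matrix(document_vecs), method = "cosine", norm = "l2")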

Related

Expected input to torch Embedding layer with pre_trained vectors from gensim

I would like to use pre-trained embeddings in my neural network architecture. The pre-trained embeddings are trained by gensim. I found this informative answer which indicates that we can load pre_trained models like so:
import gensim
import torch
from torch import nn
model = gensim.models.KeyedVectors.load_word2vec_format('path/to/file')
weights = torch.FloatTensor(model.vectors)
emb = nn.Embedding.from_pretrained(weights)
This seems to work correctly, also on PyTorch 1.0.1. My question is that I don't quite understand what I have to feed into such a layer to utilise it. Can I just feed the tokens (segmented sentence)? Do I need a mapping, for instance token-to-index?
I found that you can access a token's vector simply by something like
print(model['the'])
# [-1.1206588e+00 1.1578362e+00 2.8765252e-01 -1.1759659e+00 ... ]
What does that mean for an RNN architecture? Can we simply load in the tokens of the batch sequences? For instance:
for seq_batch, y in batch_loader():
    # seq_batch is a batch of sequences (tokenized sentences)
    # e.g. [['i', 'like', 'cookies'], ['it', 'is', 'raining'], ['who', 'are', 'you']]
    output, hidden = model(seq_batch, hidden)
This does not seem to work, so I am assuming you need to convert the tokens to their indices in the final word2vec model. Is that true? I found that you can get the indices of words by using the word2vec model's vocab:
model.vocab['world'].index
# 147
So as an input to an Embedding layer, should I provide a tensor of ints for a sequence of sentences that consist of a sequence of words? An example using a dummy dataloader (cf. the example above) and a dummy RNN would be welcome.
The documentation says the following
This module is often used to store word embeddings and retrieve them using indices. The input to the module is a list of indices, and the output is the corresponding word embeddings.
So if you want to feed in a sentence, you give a LongTensor of indices, each corresponding to a word in the vocabulary, which the nn.Embedding layer will map into word vectors going forward.
Here's an illustration
test_voc = ["ok", "great", "test"]
# The word vectors for "ok", "great" and "test"
# are at indices, 0, 1 and 2, respectively.
my_embedding = torch.rand(3, 50)
e = nn.Embedding.from_pretrained(my_embedding)
# LongTensor of indicies corresponds to a sentence,
# reshaped to (1, 3) because batch size is 1
my_sentence = torch.tensor([0, 2, 1]).view(1, -1)
res = e(my_sentence)
print(res.shape)
# => torch.Size([1, 3, 50])
# 1 is the batch dimension, and there's three vectors of length 50 each
In terms of RNNs, next you can feed that tensor into your RNN module, e.g.:
lstm = nn.LSTM(input_size=50, hidden_size=5, batch_first=True)
output, h = lstm(res)
print(output.shape)
# => torch.Size([1, 3, 5])
I also recommend you look into torchtext. It can automate some of the stuff you would otherwise have to do manually.

Can I prune a parser's vocabulary in spaCy?

The following code uses spaCy word vectors to find the 20 most similar words to a given word by first computing cosine similarity for all words in the vocabulary (more than a million), then sorting this list of the most similar words.
from numpy import dot
from numpy.linalg import norm

parser = English()
# access known words from the parser's vocabulary
current_word = parser.vocab[word]
# cosine similarity
cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))
# gather all known words, take only the lowercased versions
allWords = list({w for w in parser.vocab if w.has_vector and w.orth_.islower() and w.lower_ != word})
# sort by similarity
allWords.sort(key=lambda w: cosine(w.vector, current_word.vector))
allWords.reverse()
print("Top 20 most similar words to %s:" % word)
for word in allWords[:20]:
    print(word.orth_)
What I would like to know is whether there is a way to restrict spaCy's vocabulary only to the words that occur in a given list, which I hope would hugely reduce the cost of the sort operation.
To be clear, I would like to pass in a list of just a few words, or just the words in a given text, and be able to rapidly look up which of these words are nearest each other in spaCy's vector space.
Any help on this front appreciated.
The spaCy documentation says that:
The default English model installs vectors for one million vocabulary entries, using the 300-dimensional vectors trained on the Common Crawl corpus using the GloVe algorithm. The GloVe common crawl vectors have become a de facto standard for practical NLP.
So you could just load the GloVe vectors using Gensim. I'm not sure if you can load them directly, or if you have to use this script.
If you have loaded the word vectors in Gensim as model, you can simply use model.similarity('woman', 'man') to get the similarity between two words. If you have a list of candidate words, you could do something like:
def most_similar(word, candidates, model, n=20):
    "Get N most similar words from a list of candidates"
    similarities = [(model.similarity(word, candidate), candidate)
                    for candidate in candidates]
    most_similar_words = sorted(similarities, reverse=True)[:n]
    only_words = [w for sim, w in most_similar_words]
    return only_words
Spacy has a Vectors class that has a most_similar method. You could then define a wrapper function like so to avoid writing your own implementation:
import spacy
import numpy as np

def most_similar(word, model, n=20):
    nlp = spacy.load(model)
    doc = nlp(word)
    vecs = [token.vector for token in doc]
    queries = np.array(vecs)
    keys_arr, best_rows_arr, scores_arr = nlp.vocab.vectors.most_similar(queries, n=n)
    keys = keys_arr[0]  # the array of keys is nested in another array from the previous step
    similar_words_list = [nlp.vocab[key].text for key in keys]
    return similar_words_list
And call it like this: most_similar('apple', 'en_core_web_md', n=20). This would find the 20 most similar words, using cosine similarity, to the word "apple" based on the spaCy model package "en_core_web_md".
This is the result: ['BLACKBERRY', 'APPLE', 'apples', 'PRUNES', 'iPHone', '3g/3gs', 'fruit', 'FIG', 'CREAMSICLE', 'iPad', 'ipad4', 'LONGAN', 'CALVADOS', 'iPOD', 'iPod', 'SORBET', 'PERSICA', 'peach', 'juice', 'JUICE']

Document similarity using LSA in R

I am working on LSA (using R) for Document Similarity Analysis.
Here are my steps
Imported the text data and created a Corpus. Did basic Corpus operations like stemming, whitespace removal, etc.
Created LSA space as below
tdm <- TermDocumentMatrix(chat_corpus)
tdm_matrix <- as.matrix(tdm)
tdm.lsa <- lw_bintf(tdm_matrix)*gw_idf(tdm_matrix)
lsaSpace <- lsa(tdm.lsa)
Multidimensional scaling (MDS) on the LSA space
dist.mat.lsa <- dist(t(as.textmatrix(lsaSpace)))
fit <- cmdscale(dist.mat.lsa, eig = TRUE)
points <- data.frame(fit$points, row.names = chat$text)
I want to create a matrix/data frame showing how similar the texts are (as shown in the attached Result). Rows and columns will be the texts to match, while the cell values will be their similarity values. Ideally the diagonal values will be 1 (a perfect match) while the rest of the cell values will be less than 1.
Please share some insights into how to do this. Thanks in advance.
Note: I got the Python code for this but need the same in R:
similarity = np.asarray(np.asmatrix(dtm_lsa) * np.asmatrix(dtm_lsa).T)
pd.DataFrame(similarity, index=example, columns=example).head(10)
Expected Result
In order to do this you first need to take the S_k and D_k matrices from the lsa space you've created and multiply S_k by the transpose of D_k to get a k by n matrix, where k is the number of dimensions and n is the number of documents. This code would be as follows:
lsaMatrix <- diag(myLSAspace$sk) %*% t(myLSAspace$dk)
Then it's as simple as putting the resulting matrix through the cosine function from the lsa package:
simMatrix <- cosine(lsaMatrix)
Which will result in an n x n similarity matrix that can then be used for clustering, etc.
You can read more about the S_k and D_k matrices in the lsa package documentation, they're outputs of the SVD applied.
https://cran.r-project.org/web/packages/lsa/lsa.pdf
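If you also want the labelled table that the Python snippet produces (rows and columns named by document), here is a minimal sketch assuming lsaSpace and chat$text from the question:
lsaMatrix <- diag(lsaSpace$sk) %*% t(lsaSpace$dk)
simMatrix <- cosine(lsaMatrix)  # cosine() from the lsa package
# name rows and columns by document, analogous to pd.DataFrame(similarity, index=example, columns=example)
dimnames(simMatrix) <- list(chat$text, chat$text)
head(as.data.frame(simMatrix), 10)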

findAssocs in a single document

I only have one document (a survey compilation). I want to do word association within a single document with findAssocs. So far, all the examples I have seen involve a combination of several documents.
inspect(myDtm)
A term-document matrix (864 terms, 1 documents)
Non-/sparse entries: 864/0 (what is this for?)
Sparsity : 0% (what is this for? what does it mean if it's 0%)
Maximal term length: 20
Weighting : term frequency (tf)
My data looks like this:
unwanted 1
upgrade 3
valid 1
This is my code, and I end up with the result numeric(0):
findAssocs(myDtm, "salary", 0.5)
numeric(0)
Please help.
Sparsity measures the percentage of elements (cf. cells) in the matrix that are equal to zero. When sparsity is high, you have a lot of terms that only occur in one or a few documents. You only have one document in your example, so all terms must occur in that doc. Very generally speaking a lower degree of sparsity is more useful for investigating document similarity (if that's what you're doing... it's not clear from your question).
The short answer is that your question has already been asked and answered: you need to have more than one doc in your dtm to calculate term associations using findAssocs.
You'll have to include a reproducible example if you want any more specific help with findAssocs. Try using the 'crude' dataset that comes with the tm package and experiment with findAssocs to see what happens when you alter the parameters. Check out the tm [documentation](http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf) to see more about how to use the built-in data.
Here's an example using the built-in data, try it for yourself:
require(tm)
data(crude)
dtm <- DocumentTermMatrix(crude)
# one doc in dtm, doesn't work...
dtm1 <- dtm[1,]
findAssocs(dtm1, "oil", 0.01)
# ten docs, does work
dtm10 <- dtm[1:10,]
findAssocs(dtm10, "oil", 0.01)
You can use findAssocs by adding your data in the following manner
data <- data.frame(text=txt, stringsAsFactors=FALSE)
tdm <- TermDocumentMatrix(Corpus(DataframeSource(data)))
Basically: import your data into a "Source", your "Source" into a "Corpus", and then make a TDM out of your "Corpus".
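As a self-contained sketch of that pipeline (the survey responses below are invented for illustration; note that newer versions of tm expect the data frame to have doc_id and text columns for DataframeSource):
library(tm)
txt <- c("salary is too low for the work",
         "happy with salary and benefits",
         "please upgrade the old software")
data <- data.frame(doc_id = seq_along(txt), text = txt, stringsAsFactors = FALSE)
tdm <- TermDocumentMatrix(Corpus(DataframeSource(data)))
findAssocs(tdm, "salary", 0.1)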
A couple of years late, but I ran into the same problem recently. It is because your term-document matrix (TDM) consists of only one document; instead, your TDM should consist of multiple documents. If you use paste() to retrieve text from a data frame, you should not use paste(data$text, collapse = " "), but paste(data$text), before turning it into a TDM.
But if you present a reproducible example, maybe we can help.

Maximum reasonable size for stemCompletion in tm?

I have a corpus of 26 plain text files, each between 12 and 148 KB, 1.2 MB in total. I'm using R on a Windows 7 laptop.
I did all the normal cleanup (stopwords, custom stopwords, lower case, numbers) and want to do stem completion. I am using the original corpus as a dictionary, as shown in the examples. I tried a couple of simple vectors (with about 5 terms) to make sure it would work at all, and it did, very quickly.
exchanger <- function(x) stemCompletion(x, budget.orig)
budget <- tm_map(budget, exchanger)
It's been working since yesterday at 4pm! In RStudio, under diagnostics, the request log shows new requests with different request numbers. Task Manager shows it using some memory, but not a crazy amount. I don't want to stop it, because what if it's almost there? Any other ideas on how to check progress? It's a volatile corpus, unfortunately. Any ideas on how long it should take? I thought about using the dtm names vector as a dictionary, cut off at the most frequent terms (or those with high tf-idf), but I'm reluctant to kill this process.
This is a regular windows 7 laptop with lots of other things running.
Is this corpus too big for stemCompletion? Short of moving to Python, is there a better way to do stemCompletion, or to lemmatize instead of stemming? My web searching has not yielded any answers.
I can't give you a definite answer without data that reproduces your problem, but I would guess the bottleneck comes from the following line of the stemCompletion source code:
possibleCompletions <- lapply(x, function(w) grep(sprintf("^%s", w), dictionary, value = TRUE))
After which, given you've kept the completion heuristic on the default of "prevalent", this happens:
possibleCompletions <- lapply(possibleCompletions, function(x) sort(table(x), decreasing = TRUE))
structure(names(sapply(possibleCompletions, "[", 1)), names = x)
That first line loops through each word in your corpus and checks it against your dictionary for possible completions. I'm guessing you have many words that appear many times in your corpus. That means the function is being called many times only to give the same response. A possibly faster version (depending on how many words were repeats and how often they were repeated) would look something like this:
y <- unique(x)
possibleCompletions <- lapply(y, function(w) grep(sprintf("^%s", w), dictionary, value = TRUE))
possibleCompletions <- lapply(possibleCompletions, function(x) sort(table(x), decreasing = TRUE))
z <- structure(names(sapply(possibleCompletions, "[", 1)), names = y)
z[match(x, names(z))]
So it only loops through the unique values of x rather than every value of x. To create this revised version of the code, you would need to download the source from CRAN and modify the function (I found it in completion.R in the R folder).
Or you may just want to use Python for this one.
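If you'd rather not patch the package, another option is to wrap the same unique-word idea in your own helper and do the de-duplication inside the function you pass to tm_map. A sketch, assuming budget and budget.orig as in the question:
library(tm)
fast_stem_completion <- function(doc, dictionary) {
  words <- unlist(strsplit(as.character(doc), "\\s+"))
  comps <- stemCompletion(unique(words), dictionary = dictionary)  # complete each distinct stem once
  paste(comps[words], collapse = " ")                              # map every occurrence back by name
}
budget <- tm_map(budget, content_transformer(function(x) fast_stem_completion(x, budget.orig)))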
Cristina, following Schaun, I recommend you apply stemCompletion to unique words only. It is much easier for your PC to do the completion on your unique words than to do the completion on your whole corpus (with all its repetitions).
First of all, take the unique words from your corpus. For example:
unique <- data.frame(text = unique(budget))
Then, you can get the unique words from your original text:
unique_budget.orig <- unique(budget.orig)
Now, you can apply stemCompletion to your unique words:
unique$completion <- unique$text %>% stemCompletion(dictionary = unique_budget.orig)
Now you have an object with all the words from your corpus and their completions. You just have to apply a join between your corpus and the object unique. Be sure both objects have the same variable name for the words without the completion: that will be the key.
This reduces the number of operations your PC has to do.
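That join can be done with a simple name-based lookup, since stemCompletion() returns a named vector. A minimal, self-contained sketch with toy stems and a toy dictionary (all values invented for illustration):
library(tm)
stems <- c("complet", "oper", "complet", "complet")   # stems as they appear in the corpus, with repeats
comps <- stemCompletion(unique(stems), dictionary = c("completion", "operation"))
completed <- comps[stems]                             # vectorised join by name
completed
# e.g. "completion" "operation" "completion" "completion"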
