I am using Keras for an NLP problem. A question about word embeddings came up when I tried to predict the next word from the previous words. I have already turned the one-hot word into a word vector via the Keras Embedding layer, like this:
word_vector = Embedding(input_dim=2000,output_dim=100)(word_one_hot)
I then use this word_vector further in the model, and the model finally outputs another word_vector. But I need to see which word the prediction actually is. How can I turn the word_vector back into word_one_hot?
This question is old but seems to be linked to a common point of confusion about what embeddings are and what purpose they serve.
First off, you should never convert to one-hot if you're going to embed afterwards. This is just a wasted step.
Starting with your raw data, you need to tokenize it. This is simply the process of assigning a unique integer to each element in your vocabulary (the set of all possible words/characters [your choice] in your data). Keras has convenience functions for this:
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
max_words = 100 # just a random example,
# it is the number of most frequently occurring words in your data set that you want to use in your model.
tokenizer = Tokenizer(num_words=max_words)
# This builds the word index
tokenizer.fit_on_texts(df['column'])
# This turns strings into lists of integer indices.
train_sequences = tokenizer.texts_to_sequences(df['column'])
# This is how you can recover the word index that was computed
print(tokenizer.word_index)
Embeddings generate a representation. Later layers in your model use earlier representations to generate more abstract representations. The final representation is used to generate a probability distribution over the possible classes (assuming classification).
When your model makes a prediction, it provides a probability estimate for each of the integers in the word_index. So, if 'cat' is the most likely next word and your word_index has something like {cat:666}, ideally the model would assign a high likelihood to 666 (not 'cat'). Does this make sense? The model never predicts an embedding vector; the embedding vectors are intermediate representations of the input data that are (hopefully) useful for predicting an integer associated with a word/character/class.
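To make the last step concrete, here is a minimal sketch of turning a predicted probability vector back into a word. The word_index and the probabilities are made up for illustration, and decode_prediction is a hypothetical helper, not part of Keras. (The Keras Tokenizer also exposes an index_word attribute that holds the inverted mapping.)

```python
# Hypothetical word index, shaped like tokenizer.word_index
# (indices start at 1; index 0 is reserved for padding).
word_index = {"the": 1, "cat": 2, "sat": 3}

# Invert the mapping once so indices can be turned back into words.
index_word = {index: word for word, index in word_index.items()}


def decode_prediction(probabilities, index_word):
    """Return the word whose index has the highest predicted probability."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return index_word.get(best_index, "<unknown>")


# Made-up softmax output: one probability per index 0..3.
probabilities = [0.05, 0.10, 0.70, 0.15]
predicted_word = decode_prediction(probabilities, index_word)
print(predicted_word)
```

With a real model, probabilities would be one row of the output of model.predict, and the lookup stays the same.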
I am new to R and fastText.
I read on the fastText website that you should be able to retrieve word vectors for names like "New York" by typing "New_York", but it's not working for me. There are also other vectors that I'm not able to retrieve properly.
I thought that it could be because of the OS (I work on windows).
require(plyr)
require(proxy)
require(ggpubr)
require (jtools)
require(tidyverse)
require(reshape2)
require(fastTextR)
london_agg <- read.csv2("londra_latlong2.csv",header=T,sep=",",dec=".",fill = T)
model <- ft_load("fastText/cc.en.300.bin")
london_agg$Name <- as.character(london_agg$Name)
ccc <- ft_word_vectors(model,london_agg$Name)
A word-vector model will only have full-word vectors for a string like New_York if the training data had preprocessed the text to create such tokens. I'm not sure if the cc FastText models, specifically, have done that – their distribution page doesn't mention it. (Google's original GoogleNews vectors in plain word2vec had used a phrase-combining algorithm to create vectors for a lot of multiword tokens like New_York.)
Failing that, a FastText model will also synthesize guess-vectors for other tokens that weren't in the training-data, using substrings of your requested token.
The cc.en.300.bin vectors are reported (on the same page as above) as having learned only 5-character n-grams, so an unknown token (out-of-vocabulary with respect to the training tokens) with fewer than 5 characters couldn't get any vector from FastText.
But those with more characters should get at least junk vectors. (The method for matching n-grams is based on a collision-oblivious hashtable, so even if 5-grams weren't in the training data, there should be some random junk data returned.)
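As a rough illustration of why longer unknown tokens always get something back, here is a simplified sketch of n-gram bucketing. It assumes the token is wrapped in "<" and ">" boundary markers before slicing (so the markers count toward the length) and uses a plain 32-bit FNV-1a hash. The function names and the bucket count of 2,000,000 are assumptions for the example, and the real FastText implementation differs in details.

```python
def char_ngrams(token, n=5):
    """Return all character n-grams of the token wrapped in boundary markers."""
    wrapped = "<" + token + ">"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]


def fnv1a_32(text):
    """Plain 32-bit FNV-1a hash of the UTF-8 bytes of text."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h


def ngram_buckets(token, n=5, num_buckets=2_000_000):
    """Map each n-gram to a bucket id; collisions are simply ignored."""
    return [fnv1a_32(gram) % num_buckets for gram in char_ngrams(token, n)]


print(char_ngrams("London"))   # four 5-grams
print(char_ngrams("ab"))       # too short: no n-grams at all
print(ngram_buckets("New_York"))
```

Every n-gram lands in some bucket whether or not it was seen in training, which is why a long enough token yields a (possibly meaningless) vector, while a token with no n-grams yields nothing.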
Perhaps there's a bug in the R FastText implementation you're using. Separate from looking up your specific geo-data tokens, could you expand your question with some examples of individual tokens, of different lengths, that return either credible vectors (every dimension non-zero) or absolutely nothing (all dimensions zero)? The pattern of lookup words that return all zeros might give a further hint as to what, if anything, the problem is.
I am doing my research with a fastText pre-trained model and I need word frequencies for further analysis. Do the .vec or .bin files provided on the fastText website contain word frequency information? If yes, how do I get it?
I am using load_word2vec_format to load the model and tried model.wv.vocab[word].count, which only gives the word's frequency rank, not the original word frequency.
I don't believe those formats include any word frequency information.
To the extent any pre-trained word-vectors declare what they were trained on – like, say, Wikipedia text – you could go back to the training corpus (or some reasonable approximation) to perform your own frequency-count. Even if you've only got a "similar" corpus, the frequencies might be "close enough" for your analytical need.
Similarly, you could potentially use the frequency rank to synthesize a dummy frequency table, using Zipf's Law, which roughly holds for normal natural-language corpora. Again, the relative proportions between words might be close enough to the real proportions for your need, even without the real/precise frequencies that were used during word-vector training.
Combining the version of the Zipf's law formula on the Wikipedia page that uses the Harmonic number (H) in the denominator with the efficient approximation of H given in this answer, we can create a function that, given a word's rank (starting at 1) and the total number of unique words, gives the proportionate frequency predicted by Zipf's law:
from numpy import euler_gamma
from scipy.special import digamma

def digamma_H(s):
    """ If s is complex the result becomes complex. """
    return digamma(s + 1) + euler_gamma

def zipf_at(k_rank, N_total):
    return 1.0 / (k_rank * digamma_H(N_total))
Then, if you had a pretrained set of 1 million word-vectors, you could estimate the first word's frequency as:
>>> zipf_at(1, 1000000)
0.06947953777315177
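For a whole vocabulary, the same idea can be sketched without SciPy by summing the harmonic series directly, which is fine for a few million words. The function name zipf_table and the four-word vocabulary are made up for the example. The input is assumed to be the word list in rank order, which is how it typically comes out of a pre-trained vector file.

```python
def zipf_table(ranked_words):
    """Map each word to its Zipf-predicted share of all tokens.

    ranked_words must be ordered from most to least frequent.
    """
    n = len(ranked_words)
    harmonic = sum(1.0 / k for k in range(1, n + 1))  # exact H_n
    return {
        word: 1.0 / (rank * harmonic)
        for rank, word in enumerate(ranked_words, start=1)
    }


# Toy vocabulary in rank order (purely illustrative).
ranked_words = ["the", "of", "and", "to"]
table = zipf_table(ranked_words)
for word, share in table.items():
    print(word, round(share, 4))
```

The shares sum to 1, so multiplying them by an assumed corpus size gives pseudo-counts if your analysis needs absolute numbers.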
I'd like to use the GloVe word embedding implemented in text2vec to perform supervised regression/classification. I read the helpful tutorial on the text2vec homepage on how to generate the word vectors. However, I'm having trouble grasping how to proceed further, namely apply or transform these word vectors and attach them to each document in such a way that each document is represented by a vector (derived from its component words' vectors I assume), to be used as input in a classifier. I've run into some quick fixes online for short documents, but my documents are rather lengthy (movie subtitles) and there doesn't seem to be any guidance on how to proceed with such documents - or at least guidance matching my comprehension level; I have experience working with n-grams, dictionaries, and topic models, but word embeddings puzzle me.
Thank you!
If your goal is to classify documents, I doubt any doc2vec approach will beat bag-of-words/n-grams. If you still want to try, a common simple strategy for short documents (< 20 words) is to represent each document as a weighted sum/average of its word vectors.
You can obtain it by something like:
common_terms = intersect(colnames(dtm), rownames(word_vectors) )
dtm_averaged = normalize(dtm[, common_terms], "l1")
# you can re-weight dtm above with tf-idf instead of "l1" norm
sentence_vectors = dtm_averaged %*% word_vectors[common_terms, ]
I'm not aware of any universal established methods to obtain good document vectors for long documents.
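For readers who think in Python rather than R, the same count-weighted average can be sketched in plain Python. This is only an illustration of the idea, not text2vec's implementation. The two-dimensional word_vectors and the helper name document_vector are invented for the example.

```python
from collections import Counter


def document_vector(tokens, word_vectors):
    """Average the vectors of known tokens, weighted by L1-normalised counts."""
    counts = Counter(t for t in tokens if t in word_vectors)
    total = sum(counts.values())
    if total == 0:
        return None  # no token of the document has a vector

    dim = len(next(iter(word_vectors.values())))
    doc_vec = [0.0] * dim
    for word, count in counts.items():
        weight = count / total  # the "l1" normalisation step
        for i, value in enumerate(word_vectors[word]):
            doc_vec[i] += weight * value
    return doc_vec


# Toy 2-dimensional vectors (illustrative only).
word_vectors = {"good": [1.0, 0.0], "film": [0.0, 1.0]}
vec = document_vector(["good", "good", "film", "unseen"], word_vectors)
print(vec)
```

Tokens without a vector are dropped, mirroring the intersect() step above. Swapping the count-based weight for a tf-idf weight gives the re-weighted variant.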
We have models for converting words to vectors (for example the word2vec model). Do similar models exist which convert sentences/documents into vectors, using perhaps the vectors learnt for the individual words?
1) Skip gram method: paper here and the tool that uses it, google word2vec
2) Using LSTM-RNN to form semantic representations of sentences.
3) Representations of sentences and documents. The Paragraph vector is introduced in this paper. It is basically an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents.
4) Though this paper does not form sentence/paragraph vectors, it is simple enough to do that. One can just plug in the individual word vectors (GloVe word vectors are found to give the best performance) and then form a vector representation of the whole sentence/paragraph.
5) Using a CNN to summarize documents.
It all depends on:
which vector model you're using
what is the purpose of the model
your creativity in combining word vectors into a document vector
If you've generated the model using Word2Vec, you can either try:
Doc2Vec: https://radimrehurek.com/gensim/models/doc2vec.html
Wiki2Vec: https://github.com/idio/wiki2vec
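Or you can do what some people do, i.e. sum the vectors of all content words in the document and normalise the result (dividing by the number of content words gives the average; the snippet below divides by the vector's L2 norm instead), e.g. https://github.com/alvations/oque/blob/master/o.py#L13 (note: line 17-18 is a hack to reduce noise):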
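import numpy as np

def sent_vectorizer(sent, model):
    sent_vec = np.zeros(400)  # must match the model's vector dimensionality
    numw = 0
    for w in sent:
        try:
            sent_vec = np.add(sent_vec, model[w])
            numw += 1
        except KeyError:  # skip out-of-vocabulary words
            pass
    # L2-normalise the summed vector
    return sent_vec / np.sqrt(sent_vec.dot(sent_vec))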
A solution that is slightly less off the shelf, but probably hard to beat in terms of accuracy if you have a specific thing you're trying to do:
Build an RNN (with LSTM or GRU memory cells, comparison here) and optimize the error function of the actual task you're trying to accomplish. You feed it your sentence, and train it to produce the output you want. The activations of the network after it has been fed your sentence are a representation of the sentence (although you might only care about the network's output).
You can represent the sentence as a sequence of one-hot encoded characters, as a sequence of one-hot encoded words, or as a sequence of word vectors (e.g. GloVe or word2vec). If you use word vectors, you can keep backpropagating into the word vectors, updating their weights, so you also get custom word vectors tweaked specifically for the task you're doing.
There are a lot of ways to answer this question. The answer depends on your interpretation of phrases and sentences.
Distributional models such as word2vec, which provide a vector representation for each word, can only show how a word is usually used in a window-based context in relation to other words. Based on this interpretation of context-word relations, you can take the average vector of all words in a sentence as the vector representation of the sentence. For example, in this sentence:
vegetarians eat vegetables .
We can take the normalised sum of its word vectors as the vector representation.
The problem is the compositional nature of sentences. If you average the word vectors as above, the sentence above and the following one have the same vector representation:
vegetables eat vegetarians .
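A quick way to convince yourself of this is to average some toy vectors for both word orders. The two-dimensional vectors below are invented for the example and average_vector is a hypothetical helper. Any real word vectors would behave the same way, up to floating-point rounding.

```python
def average_vector(tokens, word_vectors):
    """Component-wise mean of the vectors of all tokens."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for token in tokens:
        for i, value in enumerate(word_vectors[token]):
            total[i] += value
    return [component / len(tokens) for component in total]


# Toy vectors (illustrative only).
word_vectors = {
    "vegetarians": [1.0, 0.0],
    "eat": [0.5, 0.5],
    "vegetables": [0.0, 1.0],
    ".": [0.25, 0.25],
}

first = average_vector("vegetarians eat vegetables .".split(), word_vectors)
second = average_vector("vegetables eat vegetarians .".split(), word_vectors)
print(first, second, first == second)
```

Averaging is a sum followed by a division, and a sum does not care about order, so any permutation of the same words collapses to the same point.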
There is a lot of research in the distributional tradition on learning tree structures through corpus processing. For example: Parsing With Compositional Vector Grammars. This video also explains this method.
Again, I want to emphasise interpretation. These sentence vectors probably have their own meanings in your application. For instance, in sentiment analysis in this project at Stanford, the meaning that they are seeking is the positive/negative sentiment of a sentence. Even if you find a perfect vector representation for a sentence, there are philosophical debates about whether these are the actual meanings of sentences if you cannot judge the truth condition (David Lewis "General Semantics" 1970). That's why there are lines of work focusing on computer vision (this paper or this paper). My point is that it can depend completely on your application and your interpretation of the vectors.
Hope you welcome an implementation. I faced a similar problem when converting movie plots for analysis; after trying many other solutions, I stuck with an implementation that made my job easier. The code snippet is attached below.
Install 'spaCy' from the following link.
import spacy
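# The 'en' shortcut was removed in spaCy 3; load a named model instead
# (en_core_web_md, which ships with word vectors, is assumed to be installed).
nlp = spacy.load('en_core_web_md')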
doc = nlp(YOUR_DOC_HERE)
vec = doc.vector
Hope this helps.
I am trying my hand at learning Latent Class Analysis, while also learning R. I'm using the poLCA package, and am having a bit of trouble accessing the attributes. I can run the sample code just fine:
ds = read.csv("http://www.math.smith.edu/r/data/help.csv")
ds = within(ds, (cesdcut = ifelse(cesd>20, 1, 0)))
library(poLCA)
res2 = poLCA(cbind(homeless=homeless+1,
cesdcut=cesdcut+1, satreat=satreat+1,
linkstatus=linkstatus+1) ~ 1,
maxiter=50000, nclass=3,
nrep=10, data=ds)
but in order to make this more useful, I'd like to access the attributes within the objects created by the poLCA class as such:
attr(res2, 'Nobs')
attr(res2, 'maxiter')
but they both come up as NULL. I expect Nobs to be 453 (determined by the function) and maxiter to be 50000 (dictated by my input value).
I'm sure I'm just being naive, but I could use any help available. Thanks a lot!
Welcome to R. You've got the model-fitting syntax right, in that you can get a model out (I don't know how latent class analysis works, so I can't speak to the statistical validity of your result). However, you've mixed up the different ways in which R can store information pertaining to a model.
poLCA returns an object of class poLCA, which is
a list containing the following elements:
(. . .)
Nobs number of fully observed cases (less than or equal to N).
maxiter maximum number of iterations through which the estimation algorithm was set to run.
Since it's a list, you can extract individual elements from your model object using the $ operator:
res2$Nobs # number of observations
res2$maxiter # maximum iterations
In some cases, there might be extractor functions to get this information without having to do low-level indexing. For example, many model-fitting functions will have a fitted method, which pulls out the vector of fitted values on the training data; and similarly residuals pulls out the vector of residuals. You should check whether there are such extractor functions provided by the poLCA package and use them if possible; that way, you're not making assumptions about the structure of the model object that might be broken in the future.
This is distinct from getting the attributes of an object, which is what you use attr for. Attributes in R are what you might call metadata: they contain R-specific information about an object itself, rather than information about whatever it is the object relates to. Examples of common attributes include class (the class of an object), dim (the dimensions of an array or matrix), names (names of individual elements of a vector/list/array) and so on.