My question concerns the proper training of a Word2Vec model for a rather unique and specific use case. See the Word2Vec details here.
I am working on identifying noun-adjective (or adjective-noun) relationships within the word embeddings.
(E.g. we have 'nice car' in a sentence of the data-set. Given the word embeddings of the corpus and the nouns and adjectives all labeled, I am trying to design a technique to find the proper vector that connects 'nice' with 'car'.)
Of course I am not trying to connect only that pair of words; the technique should work for all such relationships. I am taking a supervised approach at the moment, and will then try to work towards designing an unsupervised method.
Now that you understand what I am trying to do, I will explain the problem. I know, obviously, that word2vec needs to be trained on large amounts of data to learn the embeddings as accurately as possible, but I am hesitant to feed it anything beyond my data-set of labelled sentences (500-700 sentences).
I am afraid that if I give it more data to train on (e.g. the latest Wikipedia dump), it will learn better vectors, but the extra data will influence the positioning of my words, so the word relationships would be biased by the extra training data. (E.g. what if 'nice apple' also appears in the extra training data? Then the positioning of the word 'nice' could be compromised.)
Hopefully this makes sense and I am not making bad assumptions; I am just facing the dilemma of either having bad vectors because of insufficient training data, or having good vectors whose positioning in the embedding space is compromised.
What would be the proper way to train the model? On as much training data as possible (billions of words), or just on the labelled data-set (500-700 sentences)?
Thank you kindly for your time, and let me know if anything that I explained does not make sense.
As always in similar situations it is best to check...
I wonder if you have tested the difference between training on the labelled dataset and training on the Wikipedia dataset. Do the results really show the issues you are afraid of?
I would just run an experiment and check if the vectors in both cases are indeed different (statistically speaking).
I suspect that you may introduce some noise with a larger corpus, but more data may be beneficial with respect to vocabulary coverage (larger corpus, more universal vectors). It all depends on your expected use case. It is likely to be a trade-off between high precision with very low recall vs. so-so precision with relatively good recall.
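One quick way to run such a check (a rough sketch, assuming gensim 4.x, with labelled_sentences and wiki_sentences as placeholder lists of tokenized sentences): train one model per corpus and compare the nearest neighbours of the words you care about, since raw coordinates from separately trained models live in different spaces and cannot be compared directly.

    # Sketch: compare neighbourhoods from a small vs. an augmented corpus.
    # `labelled_sentences` and `wiki_sentences` are hypothetical lists of token lists.
    from gensim.models import Word2Vec

    small_model = Word2Vec(labelled_sentences, vector_size=100, min_count=1, epochs=40)
    big_model = Word2Vec(labelled_sentences + wiki_sentences, vector_size=100, min_count=5, epochs=5)

    for word in ["nice", "car"]:
        print(word)
        print("  small corpus:", [w for w, _ in small_model.wv.most_similar(word, topn=10)])
        print("  big corpus:  ", [w for w, _ in big_model.wv.most_similar(word, topn=10)])
    # If the neighbour sets overlap heavily, the extra data is not distorting
    # your words' positions as much as you fear.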
Both Doc2Vec and BERT are NLP models used to create vectors for text. The original BERT model produces vectors of size 768, while the original Doc2Vec model used vectors of size 300. Would it be reasonable to assume that all the information captured by D2V is a subset of the information captured by BERT?
I ask, because I want to think about how to compare differences in representations for a set of sentences between models. I am thinking I could project the BERT vectors into a D2V subspace and compare those vectors to the D2V vectors for the same sentence, but this relies on the assumption that the subspace I'm projecting the BERT vectors into is actually comparable (i.e., the same type of information) to the D2V space.
The objective functions, while different, are quite similar. The Cloze task for BERT and the next-word prediction for D2V both try to create associations between a word and its surrounding words. BERT can look bidirectionally, while D2V only looks at a fixed window as it moves from left to right across a sentence. A similar objective function doesn't necessarily mean that the two models capture the same information, but it seems that the way D2V does it (the covariates it uses) is a subset of the covariates used by BERT.
Interested to hear other people's thoughts.
I'll assume by Doc2Vec you mean the "Paragraph Vector" algorithm, which is often called Doc2Vec (including in libraries like Python Gensim).
That Doc2Vec is closely related to word2vec: it's essentially word2vec with a synthetic floating pseudoword vector over the entire text. It models texts via a shallow network that can't really consider word-order, or the composite-meaning of word runs, except in a very general 'nearness' sense.
So, a Doc2Vec model will not generate realistic/grammatical completions/summaries from vectors (except perhaps in very-limited single-word tests).
What info Doc2Vec most captures can be somewhat influenced by parameter choices, especially choice-of-mode and window (in modes where that matters, like when co-training word-vectors).
BERT is a far deeper model with more internal layers and a larger default dimensionality of text-representations. Its training mechanisms give it the potential to differentiate between significant word-orderings – and thus be sensitive to grammar and composite phrases beyond what Doc2Vec can learn. It can generate plausible multi-word completions/summarizations.
You could certainly train a 768-dimension Doc2Vec model on the same texts as a BERT model & compare the results. The resulting summary text-vectors, from the 2 models, would likely perform quite differently on key tasks. If you need to detect subtle shifts in meaning in short texts – things like the reversal of meaning from the insertion of a single 'not' – I'd expect the BERT model to dominate (if sufficiently trained). On broader tasks less sensitive to grammar, like topic-classification, the Doc2Vec model might be competitive, or (given its simplicity) attractive in its ability to achieve certain targets with far less data or quicker training.
So, it'd be improper to assume that what Doc2Vec captures is a proper subset of what BERT does.
You could try learning a mapping from one model to the other (possibly including dimensionality-reduction), as there are surely many consistent correlations between the trained coordinate-spaces. But the act of creating such a mapping requires starting assumptions that certain vectors "should" line-up, or be in similar configurations.
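As a rough sketch of what such a mapping could look like (assuming you already have row-aligned arrays bert_vecs of shape (n, 768) and d2v_vecs of shape (n, 300) for the same n texts; both names are placeholders), an ordinary least-squares projection:

    import numpy as np

    # bert_vecs: (n, 768), d2v_vecs: (n, 300) -- hypothetical aligned embeddings of the same texts.
    W, _, _, _ = np.linalg.lstsq(bert_vecs, d2v_vecs, rcond=None)

    # Project BERT vectors into the Doc2Vec space and see how well they line up.
    projected = bert_vecs @ W
    cos = np.sum(projected * d2v_vecs, axis=1) / (
        np.linalg.norm(projected, axis=1) * np.linalg.norm(d2v_vecs, axis=1)
    )
    print("mean cosine between projected-BERT and Doc2Vec vectors:", cos.mean())
    # Fitting W on one split and measuring on a held-out split gives a fairer picture.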
If trying to understand what's unique/valuable across the two options, it's likely better to compare how the models rank a text's neighbors – do certain kinds of similarities dominate in one or the other? Or, try both as inputs to downstream classification/info-retrieval tasks, and see where they each shine.
(With sufficient data & training time, I'd expect BERT, as the more sophisticated model, to usually provide better results – especially if it's also allotted a larger representation. But for some tasks, and with limited data/compute/time resources, Doc2Vec might shine.)
I know that in Word2Vec the length of word vectors can encode properties like term frequency. In that case, we can see two word vectors, say synonyms, with a similar meaning but with different lengths, given their different usage in our corpus.
However, if we normalize the word vectors, we keep their "directions of meaning" and could cluster them according to exactly that: meaning.
Following that train of thought, the same would be applicable to document vectors in Doc2Vec.
But my question is, is there a reason to NOT normalize document vectors if we want to cluster them? In Word2Vec we can say we want to keep the frequency property of the words, is there a similar thing for documents?
I'm not familiar with any reasoning or research precedent which implies that either unit-normalized or non-normalized document-vectors are better for clustering.
So, I'd try both to see which seems to work better for your purposes.
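A minimal sketch of trying both (assuming scikit-learn, gensim 4.x, and a trained model whose doc-vectors sit in model.dv.vectors; the cluster count is just an example):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score
    from sklearn.preprocessing import normalize

    raw = np.asarray(model.dv.vectors)      # doc-vectors from a trained gensim 4.x Doc2Vec model
    unit = normalize(raw)                   # L2-normalize each doc-vector

    for name, X in [("raw", raw), ("unit-normalized", unit)]:
        labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)
        print(name, "silhouette:", silhouette_score(X, labels, metric="cosine"))
    # An internal score like silhouette is only a rough guide; if you have any
    # downstream task or labels, judge the two variants on that instead.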
Other thoughts:
In Word2Vec, my general impression is that larger-magnitude word-vectors are associated with words that, in the training data, have more unambiguous meaning. (That is, they reliably tend to imply the same smaller set of neighboring words.) Meanwhile, words with multiple meanings (polysemy) and usage amongst many other diverse words tend to have lower-magnitude vectors.
Still, the common way of comparing such vectors, cosine-similarity, is oblivious to magnitudes. That's likely because most comparisons just need the best sense of a word, without any more subtle indicator of "unity of meaning".
A similar effect might be present in Doc2Vec vectors: lower-magnitude doc-vectors could be a hint that the document has more broad word-usage/subject-matter, while higher-magnitude doc-vectors suggest more focused documents. (I'd similarly have the hunch that longer documents may tend to have lower-magnitude doc-vectors, because they use a greater diversity of words, whereas small documents with a narrow set of words/topics may have higher-magnitude doc-vectors. But I have not specifically observed/tested this hunch, and any effect here could be heavily influenced by other training choices, like the number of training iterations.)
Thus, it's possible that the non-normalized vectors would be interesting for some clustering goals, like separating focused documents from more general documents. So again, after this longer analysis: I'd suggest trying it both ways to see if one or the other seems to work better for your specific needs.
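And if you want to check the magnitude hunch above on your own model, a quick sketch (again assuming gensim 4.x, with doc_lengths as a hypothetical list of token counts in the same order as model.dv):

    import numpy as np
    from scipy.stats import spearmanr

    norms = np.linalg.norm(model.dv.vectors, axis=1)   # one norm per trained doc-vector
    # doc_lengths: hypothetical list of document lengths (token counts), same order as model.dv
    rho, p = spearmanr(doc_lengths, norms)
    print(f"Spearman correlation between doc length and vector norm: {rho:.3f} (p={p:.3g})")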
So, I have to compare the vector of an article with the vector of a single word, and I don't have any idea how to do it. It looks like BERT and Doc2Vec work well with long text, while Word2Vec works with single words. But how do I compare a long text with just a word?
Some modes of the "Paragraph Vector" algorithm (aka Doc2Vec in libraries like Python gensim) will train both doc-vectors and word-vectors into a shared coordinate space. (Specifically, any of the PV-DM dm=1 modes, or the PV-DBOW mode dm=0 if you enable the non-default interleaved word-vector training using dbow_words=1.)
In such a case, you can compare Doc2Vec doc-vectors with the co-trained word-vectors, with some utility. You can see some examples in the followup paper from the originators of the "Paragraph Vector" algorithm, "Document Embedding with Paragraph Vectors".
However, beware that single words, having been trained in contexts of use, may not have vectors that match what we'd expect of those same words when they are intended as overarching categories. For example, education as used in many sentences wouldn't necessarily assume all the facets/breadth that you might expect from Education as a category-header.
Such single word-vectors might work better than nothing, and perhaps help serve as a bootstrapping tool. But it'd be better if you had expert-labelled examples of documents belonging to categories of interest. Then you could also use more advanced classification algorithms, sensitive to categories that wouldn't necessarily be summarized by (and confined to a tight sphere around) any single vector point. In real domains of interest, that'd likely do better than using single word-vectors as category-anchors.
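A minimal gensim sketch of that setup (gensim 4.x naming, where the doc-vectors live under model.dv; tagged_docs is a hypothetical list of TaggedDocument objects):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # tagged_docs: hypothetical list of TaggedDocument(words=[...], tags=["DOC_i"]) objects
    model = Doc2Vec(
        tagged_docs,
        dm=0,           # PV-DBOW
        dbow_words=1,   # interleave skip-gram word training -> shared coordinate space
        vector_size=300,
        epochs=20,
        min_count=5,
    )

    # Find the documents whose vectors are closest to a word's vector.
    word_vec = model.wv["education"]             # any in-vocabulary word
    print(model.dv.most_similar([word_vec], topn=5))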
For any other non-Doc2Vec method of vectorizing a text, you could conceivably get a comparable vector for a single word by supplying a single-word text to the method. (Even in a Doc2Vec mode that doesn't create word-vectors, like pure PV-DBOW, you could use that model's out-of-training-text inference capability to infer a doc-vector for a single-word doc, for known words.)
But again, such simplified/degenerate single-word outputs might not well match the more general/textured categories you're seeking. The models are more typically used for larger contexts, and narrowing their output to a single word might reflect the peculiarities of that unnatural input case moreso than the usual import of the word in real context.
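For that inference route, a quick sketch (reusing the hypothetical trained model above; infer_vector works whether or not word-vectors were co-trained):

    # Treat a single known word as a tiny "document", infer a doc-vector for it,
    # and compare it against the full-article doc-vectors.
    word_as_doc = model.infer_vector(["education"], epochs=100)
    print(model.dv.most_similar([word_as_doc], topn=5))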
You can use BERT as-is for words too: a single word is just a really short sentence. So, in theory, you should be able to use any sentence embedding you like.
But if you don't have any supervised data, BERT is not the best option for you and there are better options out there!
I think it's best to first try Doc2Vec, and if it doesn't work, then switch to something else like Skip-Thoughts or USE.
Sorry that I can't help you much; it's completely task- and data-dependent, and you should test different things.
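To illustrate the "a single word is just a short sentence" point, a minimal sketch with the sentence-transformers package (the model name is just an example choice, not a recommendation):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")   # example model; any sentence encoder works

    article = "Long article text about schools, teachers and curricula ..."
    word = "education"

    article_vec, word_vec = model.encode([article, word])
    print("cosine similarity:", util.cos_sim(article_vec, word_vec).item())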
Based on your further comments that explain your problem a bit more, it sounds like you're actually trying to do Topic Modelling (categorizing documents by a given word is equivalent to labeling them with that topic). If this is what you're doing, I would recommend looking into LDA and variants of it (e.g. guidedLDA).
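A minimal gensim LDA sketch, assuming tokenized_docs is a hypothetical list of token lists (guidedLDA follows a similar flow but lets you seed topics with anchor words):

    from gensim import corpora
    from gensim.models import LdaModel

    # tokenized_docs: hypothetical list of token lists, one per document
    dictionary = corpora.Dictionary(tokenized_docs)
    bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

    lda = LdaModel(bow_corpus, num_topics=10, id2word=dictionary, passes=10)
    for topic_id, words in lda.print_topics(num_topics=10, num_words=8):
        print(topic_id, words)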
I am trying to understand the relation between word2vec and doc2vec vectors in Gensim's implementation. In my application, I am tagging multiple documents with the same label (topic), and I am training a doc2vec model on my corpus using dbow_words=1 in order to train word vectors as well. I have been able to obtain similarities between word and document vectors in this fashion, which does make a lot of sense.
For example, getting document labels similar to a word:
doc2vec_model.docvecs.most_similar(positive=[doc2vec_model["management"]], topn=50)
My question, however, is about the theoretical interpretation of computing similarity between word2vec and doc2vec vectors. Would it be safe to assume that, when trained on the same corpus with the same dimensionality (d = 200), word vectors and document vectors can always be compared to find similar words for a document label, or similar document labels for a word? Any suggestions/ideas are most welcome.
Question 2: My other question is about the impact of high/low frequency of a word on the final word2vec model. If wordA and wordB have similar contexts in a particular doc-label (set of documents), but wordA has a much higher frequency than wordB, would wordB have a higher similarity score with the corresponding doc label or not? I am trying to train multiple word2vec models by sampling the corpus in a temporal fashion, and want to test the hypothesis that as words get more and more frequent (assuming the context stays relatively similar), their similarity score with a document label would also increase. Am I wrong to make this assumption? Any suggestions/ideas are very welcome.
Thanks,
Manish
In a training mode where word-vectors and doctag-vectors are interchangeably used during training, for the same surrounding-words prediction-task, they tend to be meaningfully comparable. (Your mode, DBOW with interleaved skip-gram word-training, fits this and is the mode used by the paper 'Document Embedding with Paragraph Vectors'.)
Your second question is abstract and speculative; I think you'd have to test those ideas yourself. The Word2Vec/Doc2Vec processes train the vectors to be good at certain mechanistic word-prediction tasks, subject to the constraints of the model and tradeoffs with other vectors' quality. That the resulting spatial arrangement happens to be then useful for other purposes – ranked/absolute similarity, similarity along certain conceptual lines, classification, etc. – is then just an observed, pragmatic benefit. It's a 'trick that works', and might yield insights, but many of the ways models change in response to different parameter choices or corpus characteristics haven't been theoretically or experimentally worked-out.
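If you do want to test the frequency hypothesis empirically, one rough sketch (gensim 4.x naming; temporal_slices, the doc tag and the two words are placeholders) is to train one model per time slice and track the word-to-doctag cosine alongside the word's count. Keep in mind that absolute cosines from separately trained models aren't strictly comparable, so look at the relative wordA-vs-wordB gap within each slice:

    import numpy as np
    from gensim.models.doc2vec import Doc2Vec

    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # temporal_slices: hypothetical list of TaggedDocument lists, one per time period
    for i, docs in enumerate(temporal_slices):
        m = Doc2Vec(docs, dm=0, dbow_words=1, vector_size=200, epochs=20, min_count=2)
        label_vec = m.dv["DOC_LABEL"]               # placeholder doc tag
        for word in ("wordA", "wordB"):             # placeholder words
            freq = m.wv.get_vecattr(word, "count")
            print(f"slice {i} {word}: count={freq}, sim_to_label={cos(m.wv[word], label_vec):.3f}")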
I have a very large self-collected dataset of size [2000000 x 12672], where the rows are the instances and the columns the features. This dataset occupies ~60 GB on the local hard disk. I want to train a linear SVM on this dataset. The problem is that I have only 8 GB of RAM, so I cannot load all the data at once. Is there any solution for training an SVM on such a large dataset? I generated the dataset myself, and it is currently in HDF5 format.
Thanks
Welcome to machine learning! One of the hard things about working in this space is the compute requirements. There are two main kinds of algorithms: online and offline.
Online: supports feeding in examples one at a time, each one improving the model slightly
Offline: trains on the entire dataset at once, typically achieving higher accuracy than an online model
Many typical algorithms have both online and offline implementations, but the SVM is not one of them. To the best of my knowledge, SVMs are traditionally an offline-only algorithm. The reason for this lies in a lot of the fine details around "shattering" the dataset. I won't go too far into the math here, but if you read into it, it should become apparent.
It's also worth noting that the complexity of an SVM is somewhere between n^2 and n^3, meaning that even if you could load everything into memory it would take ages to actually train the model. It's very typical to test with a much smaller portion of your dataset before moving to the full dataset.
When moving to the full dataset you would have to run this on a much larger machine than your own, but AWS should have something large enough for you, though at your size of data I highly advise using something other than an SVM. At large data sizes, neural net approaches really shine, and can be trained in a more realistic amount of time.
As alluded to in the comments, there's also the concept of an out-of-core algorithm that can operate directly on objects stored on disk. The only group I know with a good offering of out-of-core algorithms is dato. It's a commercial product, but might be your best solution here.
A stochastic gradient descent approach to SVM could help, as it scales well and avoids the n^2 problem. An implementation available in R is RSofia, which was created by a team at Google and is discussed in Large Scale Learning to Rank. In the paper, they show that, compared to a traditional SVM, the SGD approach significantly decreases the training time (this is due to (1) the pairwise learning method and (2) the fact that only a subset of the observations ends up being used to train the model).
Note that RSofia is a little more bare bones than some of the other SVM packages available in R; for example, you need to do your own centering and scaling of features.
As to your memory problem, it'd be a little surprising if you needed the entire dataset - I would expect that you'd be fine reading in a sample of your data and then training your model on that. To confirm this, you could train multiple models on different samples and then estimate performance on the same holdout set - the performance should be similar across the different models.
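If you'd rather stay in Python and read the HDF5 file in chunks, here is a rough out-of-core sketch of the same idea (hinge-loss SGD, i.e. an incrementally trained linear SVM) using scikit-learn's SGDClassifier and h5py; the file name, dataset names and chunk size are placeholders, and as with RSofia the centering/scaling is on you:

    import h5py
    import numpy as np
    from sklearn.linear_model import SGDClassifier
    from sklearn.preprocessing import StandardScaler

    clf = SGDClassifier(loss="hinge", alpha=1e-5)   # hinge loss ~ linear SVM
    scaler = StandardScaler()
    chunk = 5000
    classes = np.array([0, 1])                      # adjust to your label set

    # 'data.h5' with datasets 'X' (2000000 x 12672) and 'y' (2000000,) is a placeholder layout.
    with h5py.File("data.h5", "r") as f:
        X, y = f["X"], f["y"]
        n = X.shape[0]
        for i in range(0, n, chunk):                # first pass: fit the scaler incrementally
            scaler.partial_fit(X[i:i + chunk])
        for i in range(0, n, chunk):                # second pass: incremental SVM training
            Xb = scaler.transform(X[i:i + chunk])
            clf.partial_fit(Xb, y[i:i + chunk], classes=classes)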
You don't say why you want a linear SVM, but if you can consider another model that often gives superior results, then check out the hpelm Python package. It can read an HDF5 file directly. You can find it here: https://pypi.python.org/pypi/hpelm It trains on segmented data, which can even be pre-loaded (called async) to speed up reading from slow hard disks.