Computing a matrix of cosine similarities versus a matrix of transformer embeddings - bert-language-model

It seems like the authors of CLIP opted in for a DSSM-like approach of embedding the image and the caption in parallel pathways and then taking dot products. At the same time, BERT's approach of concatenating two sequences of embeddings and training to predict another embedding concatenated to those two has been shown to achieve superior quality of ranking.
Hence two questions:
Are there any specific reasons why one would opt in for the DSSM-like architecture except reduced compute, like CLIP authors did?
Are there any subsequent works that setup experiments with training CLIP-like models with the BERT-like transformer approach of concatenating embeddings for the image and the caption?

Related

Is the information captured by Doc2Vec a subset of the information captured by BERT?

Both Doc2Vec and BERT are NLP models used to create vectors for text. The original BERT model maintained a vector of 768, while the original Doc2Vec model maintained a vector of size 300. Would it be reasonable to assume that all the information captured by D2V is a subset of information captured by BERT?
I ask, because I want to think about how to compare differences in representations for a set of sentences between models. I am thinking I could project the BERT vectors into a D2V subspace and compare those vectors to the D2V vectors for the same sentence, but this relies on the assumption that the subspace I'm projecting the BERT vectors into is actually comparable (i.e., the same type of information) to the D2V space.
The objective functions, while different, are quite similar. The Cloze task for BERT and the next word prediction for D2V are both trying to create associations between a word and its surrounding words. BERT can look bidirectionally, while D2V can only look at a window and moves from the left to the right of a sentence. The same objective function doesn't necessarily mean that they're capturing the same information, but it seems in which the way D2V does it (the covariates it uses) are a subset of the covariates used by BERT.
Interested to hear other people's thoughts.
I'll assume by Doc2Vec you mean the "Paragraph Vector" algorithm, which is often called Doc2Vec (including in libraries like Python Gensim).
That Doc2Vec is closely related to word2vec: it's essentially word2vec with a synthetic floating pseudoword vector over the entire text. It models texts via a shallow network that can't really consider word-order, or the composite-meaning of word runs, except in a very general 'nearness' sense.
So, a Doc2Vec model will not generate realistic/grammatical completions/summaries from vectors (except perhaps in very-limited single-word tests).
What info Doc2Vec most captures can be somewhat influenced by parameter choices, especially choice-of-mode and window (in modes where that matters, like when co-training word-vectors).
BERT is a far deeper model with more internal layers and a larger default dimensionality of text-representations. Its training mechanisms give it the potential to differentiate between significant word-orderings – and thus be sensitive to grammar and composite phrases beyond what Doc2Vec can learn. It can generate plausible multi-word completions/summarizations.
You could certainly train a 768-dimension Doc2Vec model on the same texts as a BERT model & compare the results. The resulting summary text-vectors, from the 2 models, would likely perform quite differently on key tasks. If you need to detect subtle shifts in meaning in short texts – things like the reversal of menaing from the insert of a single 'not' – I'd expect the BERT model to dominate (if sufficiently trained). On broader tasks less-sensitive to grammar like topic-classification, the Doc2Vec model might be competitive, or (given its simplicity) attractive in its ability to achieve certain targets with far less data or quicker training.
So, it'd be improper to assume that what Doc2Vec captures is a proper subset of what BERT does.
You could try learning a mapping from one model to the other (possibly including dimensionality-reduction), as there are surely many consistent correlations between the trained coordinate-spaces. But the act of creating such a mapping requires starting assumptions that certain vectors "should" line-up, or be in similar configurations.
If trying to understand what's unique/valuable across the two options, it's likely better to compare how the models rank a text's neighbors – do certain kinds of similarities dominate in one or the other? Or, try both as inputs to downstream classification/info-retrieval tasks, and see where they each shine.
(With sufficient data & training time, I'd expect BERT as the more-sophisticated model to usually provide better results – especially if it's also allotted a larger representation. But for some tasks, and limited data/compute/time resources, Doc2Vec might shine.

how to train Word2Vec model properly for a special purpose

My question concerns the proper training of the model for unique and really specific use of the Word2Vec model. See Word2Vec details here
I am working on identifying noun-adjective (or ) relationships within the word embeddings.
(E.g. we have 'nice car' in a sentence of the data-set. Given the word embeddings of the corpus and the nouns and adjectives all labeled, I am trying to design a technique to find the proper vector that connects 'nice' with 'car'.)
Of course I am not trying to connect only that pair of words, but the technique should would for all relationships. A supervised approach is taken at this moment, then try to work towards designing an unsupervised method.
Now that you understand what I am trying to do, I will explain the problem. I obviously know that word2vec needs to be trained on large amounts of data, to learn the proper embeddings as accurately as possible, but I am afraid to give it more data than the data-set with labelled sentences (500-700).
I am afraid that if I give it more data to train on (e.g. Latest Wikipedia dump data-set), it will learn better vectors, but the extra data will influence the positioning of my words, then this word relationship is biased by the extra training data. (e.g. what if there is also 'nice Apple' in the extra training data, then the positioning of the word 'nice' could be compromised).
Hopefully this makes sense and I am not making bad assumptions, but I am just in the dilemma of having bad vectors because of not enough training data, or having good vectors, but compromised vector positioning in the word embeddings.
What would be the proper way to train on ? As much training data as possible (billions of words) or just the labelled data-set (500-700 sentences) ?
Thank you kindly for your time, and let me know if anything that I explained does not make sense.
As always in similar situations it is best to check...
I wonder if you tested the difference in training on the labelled dataset results vs. the wikipedia dataset. Are there really the issues you are afraid of seeing?
I would just run an experiment and check if the vectors in both cases are indeed different (statistically speaking).
I suspect that you may introduce some noise with larger corpus but more data may be beneficial wrt. to vocabulary coverage (larger corpus - more universal). It all depends on your expected use case. It is likely to be a trade off between high precision with very low recall vs. so-so precision with relatively good recall.

word vector and paragraph vector query

I am trying to understand relation between word2vec and doc2vec vectors in Gensim's implementation. In my application, I am tagging multiple documents with same label (topic), I am training a doc2vec model on my corpus using dbow_words=1 in order to train word vectors as well. I have been able to obtain similarities between word and document vectors in this fashion which does make a lot of sense
For ex. getting documents labels similar to a word-
doc2vec_model.docvecs.most_similar(positive = [doc2vec_model["management"]], topn = 50))
My question however is about theoretical interpretation of computing similarity between word2vec and doc2vec vectors. Would it be safe to assume that when trained on the same corpus with same dimensionality (d = 200), word vectors and document vectors can always be compared to find similar words for a document label or similar document labels for a word. Any suggestion/ideas are most welcome.
Question 2: My other questions is about impact of high/low frequency of a word in final word2vec model. If wordA and wordB have similar contexts in a particular doc label(set) of documents but wordA has much higher frequency than wordB, would wordB have higher similarity score with the corresponding doc label or not. I am trying to train multiple word2vec models by sampling corpus in a temporal fashion and want to know if the hypothesis that as words get more and more frequent, assuming context relatively stays similar, similarity score with a document label would also increase. Am I wrong to make this assumption? Any suggestions/ideas are very welcome.
Thanks,
Manish
In a training mode where word-vectors and doctag-vectors are interchangeably used during training, for the same surrounding-words prediction-task, they tend to be meaningfully comparable. (Your mode, DBOW with interleaved skip-gram word-training, fits this and is the mode used by the paper 'Document Embedding with Paragraph Vectors'.)
Your second question is abstract and speculative; I think you'd have to test those ideas yourself. The Word2Vec/Doc2Vec processes train the vectors to be good at certain mechanistic word-prediction tasks, subject to the constraints of the model and tradeoffs with other vectors' quality. That the resulting spatial arrangement happens to be then useful for other purposes – ranked/absolute similarity, similarity along certain conceptual lines, classification, etc. – is then just an observed, pragmatic benefit. It's a 'trick that works', and might yield insights, but many of the ways models change in response to different parameter choices or corpus characteristics haven't been theoretically or experimentally worked-out.

Text classification with R and SVM. Matrix features

I am playing a bit with text classification and SVM.
My understanding is that typically the way to pick up the features for the training matrix is essentially to use a "bag of words" where we essentially end up with a matrix with as many columns as different words are in our document and the values of such columns is the number of occurrences per word per document (of course each document is represented by a single row).
So that all works fine, I can train my algorithm and so on, but sometimes i get an error like
Error during wrapup: test data does not match model !
By digging it a bit, I found the answer in this question Error in predict.svm: test data does not match model which essentially says that if your model has features A, B and C, then your new data to be classified should contain columns A, B and C. Of course with text this is a bit tricky, my new documents to classify might contain words that have never been seen by the classifier with the training set.
More specifically I am using the RTextTools library whith uses SparseM and tm libraries internally, the object used to train the svm is of type "matrix.csr".
Regardless of the specifics of the library my question is, is there any technique in document classification to ensure that the fact that training documents and new documents have different words will not prevent new data from being classified?
UPDATE The solution suggested by #lejlot is very simple to achieve in RTextTools by simply making use of the originalMatrix optional parameter when using the create_matrix function. Essentially, originalMatrix should be the SAME matrix that one creates when one uses the create_matrix function for TRAINING the data. So after you have trained your data and have your models, keep also the original document matrix, when using new examples, make sure of using such object when creating the new matrix for your prediction set.
Regardless of the specifics of the library my question is, is there any technique in document classification to ensure that the fact that training documents and new documents have different words will not prevent new data from being classified?
Yes, and it is very trivial one. Before applying any training or classification you create a preprocessing object, which is supposed to map text to your vector representation. In particular - it stores whole vocabulary used for training. Later on you reuse the same preprocessing object on test documents, and you simply ignore words from outside of vocabulary stored before (OOV words, as they are often refered in the literature).
Obviously there are plenty other more "heuristic" approaches, where instead of discarding you try to map them to existing words (although it is less theoreticalyy justified). Rather - you should create intermediate representation, which will be your new "preprocessing" object which can handle OOV words (through some levenstein distance mapping etc.).

explain the algorithm for document summarization

I want to do a project on document summarization.
Can anyone please explain the algorithm for document summarization using graph based approach?
Also if someone can provide me links to few good research papers???
Take a look at TextRank and LexRank.
LexRank is an algorithm essentially identical to TextRank, and both use this approach for document summarization. The two methods were developed by different groups at the same time, and LexRank simply focused on summarization, but could just as easily be used for keyphrase extraction or any other NLP ranking task.
In both algorithms, the sentences are ranked by applying PageRank to the resulting graph. A summary is formed by combining the top ranking sentences, using a threshold or length cutoff to limit the size of the summary.
https://en.wikipedia.org/wiki/Automatic_summarization#Unsupervised_approaches:_TextRank_and_LexRank

Resources