Fine-tuning BERT sentence transformer model

I am using a pre-trained BERT sentence transformer model, as described here https://www.sbert.net/docs/training/overview.html , to get embeddings for sentences.
I want to fine-tune these pre-trained embeddings, and I am following the instructions in the tutorial I have linked above. According to the tutorial, you fine-tune the pre-trained model by feeding it sentence pairs along with a label that indicates the similarity score between the two sentences in a pair. I understand this fine-tuning happens using the architecture shown in the image below:
Each sentence in a pair is encoded first using the BERT model, and then the "pooling" layer aggregates (usually by taking the average) the word embeddings produced by the BERT layer to produce a single embedding for each sentence. The cosine similarity of the two sentence embeddings is computed in the final step and compared against the label score.
My question here is: which parameters are being optimized when fine-tuning the model with this architecture? Is it fine-tuning only the parameters of the last layer of the BERT model? This is not clear to me from the code example shown in the tutorial for fine-tuning the model.
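For reference, here is a minimal sketch of the setup from the linked tutorial (the model name and the sentence pairs/scores are just placeholders), plus one way to inspect which parameters have gradients enabled before calling fit:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Pre-trained BERT encoder plus a mean-pooling layer (placeholder model name)
model = SentenceTransformer("bert-base-nli-mean-tokens")

# Toy sentence pairs with similarity labels in [0, 1]
train_examples = [
    InputExample(texts=["A man is eating food.", "A man is eating a meal."], label=0.9),
    InputExample(texts=["A man is eating food.", "A plane is taking off."], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Cosine similarity of the two pooled sentence embeddings, compared to the label
train_loss = losses.CosineSimilarityLoss(model)

# One way to see what would be optimized: count parameters with gradients enabled
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("trainable parameters:", trainable)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)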

Related

Is the information captured by Doc2Vec a subset of the information captured by BERT?

Both Doc2Vec and BERT are NLP models used to create vectors for text. The original BERT model produces a vector of size 768, while the original Doc2Vec model produces a vector of size 300. Would it be reasonable to assume that all the information captured by D2V is a subset of the information captured by BERT?
I ask, because I want to think about how to compare differences in representations for a set of sentences between models. I am thinking I could project the BERT vectors into a D2V subspace and compare those vectors to the D2V vectors for the same sentence, but this relies on the assumption that the subspace I'm projecting the BERT vectors into is actually comparable (i.e., the same type of information) to the D2V space.
The objective functions, while different, are quite similar. The Cloze task for BERT and the next-word prediction for D2V are both trying to create associations between a word and its surrounding words. BERT can look bidirectionally, while D2V can only look at a window and moves from the left to the right of a sentence. The same objective function doesn't necessarily mean that they're capturing the same information, but it seems that the way D2V does it (the covariates it uses) is a subset of the covariates used by BERT.
Interested to hear other people's thoughts.
I'll assume by Doc2Vec you mean the "Paragraph Vector" algorithm, which is often called Doc2Vec (including in libraries like Python Gensim).
That Doc2Vec is closely related to word2vec: it's essentially word2vec with a synthetic floating pseudoword vector over the entire text. It models texts via a shallow network that can't really consider word-order, or the composite-meaning of word runs, except in a very general 'nearness' sense.
So, a Doc2Vec model will not generate realistic/grammatical completions/summaries from vectors (except perhaps in very-limited single-word tests).
What info Doc2Vec most captures can be somewhat influenced by parameter choices, especially choice-of-mode and window (in modes where that matters, like when co-training word-vectors).
BERT is a far deeper model with more internal layers and a larger default dimensionality of text-representations. Its training mechanisms give it the potential to differentiate between significant word-orderings – and thus be sensitive to grammar and composite phrases beyond what Doc2Vec can learn. It can generate plausible multi-word completions/summarizations.
You could certainly train a 768-dimension Doc2Vec model on the same texts as a BERT model & compare the results. The resulting summary text-vectors, from the 2 models, would likely perform quite differently on key tasks. If you need to detect subtle shifts in meaning in short texts – things like the reversal of meaning from the insertion of a single 'not' – I'd expect the BERT model to dominate (if sufficiently trained). On broader tasks less sensitive to grammar, like topic classification, the Doc2Vec model might be competitive, or (given its simplicity) attractive in its ability to achieve certain targets with far less data or quicker training.
So, it'd be improper to assume that what Doc2Vec captures is a proper subset of what BERT does.
You could try learning a mapping from one model to the other (possibly including dimensionality-reduction), as there are surely many consistent correlations between the trained coordinate-spaces. But the act of creating such a mapping requires starting assumptions that certain vectors "should" line-up, or be in similar configurations.
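For example, a minimal numpy sketch of a least-squares linear map, assuming you already have paired vectors for the same texts from both models (random arrays stand in for them here):

import numpy as np

# Placeholder paired vectors for the same 1000 texts: BERT (768-d) and Doc2Vec (300-d)
rng = np.random.default_rng(0)
bert_vecs = rng.normal(size=(1000, 768))
d2v_vecs = rng.normal(size=(1000, 300))

# Least-squares linear map W so that bert_vecs @ W approximates d2v_vecs
W, *_ = np.linalg.lstsq(bert_vecs, d2v_vecs, rcond=None)
projected = bert_vecs @ W  # BERT vectors expressed in the Doc2Vec space

# Per-text cosine similarity between the projected vectors and the true Doc2Vec vectors
cos = np.sum(projected * d2v_vecs, axis=1) / (
    np.linalg.norm(projected, axis=1) * np.linalg.norm(d2v_vecs, axis=1))
print(cos.mean())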
If trying to understand what's unique/valuable across the two options, it's likely better to compare how the models rank a text's neighbors – do certain kinds of similarities dominate in one or the other? Or, try both as inputs to downstream classification/info-retrieval tasks, and see where they each shine.
(With sufficient data & training time, I'd expect BERT, as the more sophisticated model, to usually provide better results – especially if it's also allotted a larger representation. But for some tasks, and limited data/compute/time resources, Doc2Vec might shine.)

Computing a matrix of cosine similarities versus a matrix of transformer embeddings

It seems like the authors of CLIP opted for a DSSM-like approach of embedding the image and the caption in parallel pathways and then taking dot products. At the same time, BERT's approach of concatenating two sequences of embeddings and training to predict another embedding concatenated to those two has been shown to achieve superior ranking quality.
Hence two questions:
Are there any specific reasons, besides reduced compute, why one would opt for the DSSM-like architecture, as the CLIP authors did?
Are there any subsequent works that set up experiments training CLIP-like models with the BERT-like transformer approach of concatenating embeddings for the image and the caption?
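For concreteness, here is a toy sketch of the DSSM-like (dual-encoder) scoring the question refers to, with placeholder linear layers standing in for the image and text towers; all pairwise scores come from a single matrix product:

import torch
import torch.nn.functional as F

# Placeholder encoders standing in for the image and text towers
image_encoder = torch.nn.Linear(2048, 512)
text_encoder = torch.nn.Linear(768, 512)

image_feats = torch.randn(8, 2048)  # batch of image features
text_feats = torch.randn(8, 768)    # batch of caption features

img_emb = F.normalize(image_encoder(image_feats), dim=-1)
txt_emb = F.normalize(text_encoder(text_feats), dim=-1)

logits = img_emb @ txt_emb.T        # (8, 8) matrix of cosine similarities
labels = torch.arange(8)            # matching image/caption pairs sit on the diagonal
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

A cross-encoder that concatenates each image/caption pair and runs a joint transformer over it would instead need one forward pass per pair, so scoring N images against M captions costs N*M passes rather than N+M.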

word vector and paragraph vector query

I am trying to understand the relation between word2vec and doc2vec vectors in Gensim's implementation. In my application, I am tagging multiple documents with the same label (topic), and I am training a doc2vec model on my corpus using dbow_words=1 in order to train word vectors as well. I have been able to obtain similarities between word and document vectors in this fashion, which does make a lot of sense.
For example, getting document labels similar to a word:
doc2vec_model.docvecs.most_similar(positive = [doc2vec_model["management"]], topn = 50)
My question, however, is about the theoretical interpretation of computing similarity between word2vec and doc2vec vectors. Would it be safe to assume that, when trained on the same corpus with the same dimensionality (d = 200), word vectors and document vectors can always be compared to find similar words for a document label or similar document labels for a word? Any suggestions/ideas are most welcome.
Question 2: My other question is about the impact of high/low frequency of a word on the final word2vec model. If wordA and wordB have similar contexts in a particular doc label (set) of documents but wordA has a much higher frequency than wordB, would wordB have a higher similarity score with the corresponding doc label or not? I am trying to train multiple word2vec models by sampling the corpus in a temporal fashion, and want to test the hypothesis that as words get more and more frequent, assuming the context stays relatively similar, the similarity score with a document label would also increase. Am I wrong to make this assumption? Any suggestions/ideas are very welcome.
Thanks,
Manish
In a training mode where word-vectors and doctag-vectors are interchangeably used during training, for the same surrounding-words prediction-task, they tend to be meaningfully comparable. (Your mode, DBOW with interleaved skip-gram word-training, fits this and is the mode used by the paper 'Document Embedding with Paragraph Vectors'.)
Your second question is abstract and speculative; I think you'd have to test those ideas yourself. The Word2Vec/Doc2Vec processes train the vectors to be good at certain mechanistic word-prediction tasks, subject to the constraints of the model and tradeoffs with other vectors' quality. That the resulting spatial arrangement happens to be then useful for other purposes – ranked/absolute similarity, similarity along certain conceptual lines, classification, etc. – is then just an observed, pragmatic benefit. It's a 'trick that works', and might yield insights, but many of the ways models change in response to different parameter choices or corpus characteristics haven't been theoretically or experimentally worked-out.
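To make the first point concrete, here is a minimal gensim sketch of that mode on a toy corpus (gensim 4 naming; older versions use model.docvecs and model[word] as in the question):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus; dm=0 selects DBOW, dbow_words=1 interleaves skip-gram word-training,
# so word vectors and doctag vectors share one space and can be compared directly.
corpus = [
    TaggedDocument(words=["budget", "planning", "management"], tags=["finance"]),
    TaggedDocument(words=["players", "score", "goal"], tags=["sports"]),
]
model = Doc2Vec(corpus, vector_size=50, dm=0, dbow_words=1, window=5,
                min_count=1, epochs=100)

# Doc-tags nearest to a word vector, analogous to the query in the question
print(model.dv.most_similar([model.wv["management"]], topn=2))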

Predicting next word with text2vec in R

I am building a language model in R to predict the next word in a sentence based on the previous words. Currently my model is a simple ngram model with Kneser-Ney smoothing. It predicts the next word by finding the ngram with maximum probability (frequency) in the training set, where smoothing offers a way to interpolate lower-order ngrams, which can be advantageous in cases where higher-order ngrams have low frequency and may not offer a reliable prediction. While this method works reasonably well, it fails in cases where the n-gram cannot capture the context. For example, "It is warm and sunny outside, let's go to the..." and "It is cold and raining outside, let's go to the..." will suggest the same prediction, because the context of weather is not captured in the last n-gram (assuming n<5).
I am looking into more advanced methods and I found the text2vec package, which allows mapping words into a vector space where words with similar meaning are represented by similar (close) vectors. I have a feeling that this representation can be helpful for next-word prediction, but I cannot figure out how exactly to define the training task. My question is whether text2vec is the right tool to use for next-word prediction and, if yes, what is the suitable prediction algorithm for this task?
You can try char-rnn or word-rnn (google a little bit).
For a character-level model R/mxnet implementation, take a look at the mxnet examples. It is probably possible to extend this code to a word-level model using text2vec GloVe embeddings.
If you have any success, let us know (I mean the text2vec and/or mxnet developers). It would be a very interesting case for the R community. I wanted to perform such a model/experiment, but still haven't had time for that.
There is one implemented solution as a complete example using word embeddings. In fact, the paper by Makarenkov et al. (2017), "Language Models with Pre-Trained (GloVe) Word Embeddings", presents a step-by-step implementation of training a language model using a recurrent neural network (RNN) and pre-trained GloVe word embeddings.
In the paper the authors provide the instructions to run the code:
1. Download pre-trained GloVe vectors.
2. Obtain a text to train the model on.
3. Open and adjust the LM_RNN_GloVe.py file parameters inside the main function.
4. Run the following methods:
(a) tokenize_file_to_vectors(glove_vectors_file_name, file_2_tokenize_name, tokenized_file_name)
(b) run_experiment(tokenized_file_name)
The code in Python is here https://github.com/vicmak/ProofSeer.
I also found that @Dmitriy Selivanov recently published a nice and friendly tutorial using his text2vec package, which can be useful to address the problem from the R perspective. (It would be great if he could comment further.)
Your intuition is right that word embedding vectors can be used to improve language models by incorporating long distance dependencies. The algorithm you are looking for is called RNNLM (recurrent neural network language model). http://www.rnnlm.org/
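A minimal sketch of that structure, in PyTorch rather than R purely for brevity (the vocabulary, data and sizes are toy placeholders, and the embedding layer could be initialised from GloVe/text2vec vectors instead of being learned from scratch):

import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 100, 128

class RNNLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):              # tokens: (batch, seq_len) word ids
        hidden, _ = self.rnn(self.embed(tokens))
        return self.out(hidden)             # next-word logits at every position

model = RNNLM()
tokens = torch.randint(0, vocab_size, (4, 12))   # toy batch of word-id sequences
logits = model(tokens[:, :-1])                   # predict each following word
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))

Because the whole preceding sequence flows through the hidden state, prefixes like "warm and sunny" vs. "cold and raining" can lead to different predictions, which is exactly what a fixed-length n-gram misses.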

Smarter than an Eighth grader? Kaggle AI Challenge. R

I am working on the Allen AI Science Challenge currently up on Kaggle.
The idea behind the challenge is to train a model using the training data provided (a set of eighth-grade-level science questions, each with four answer options, one of which is correct) along with any additional knowledge sources (Wikipedia, science textbooks, etc.) so that it can answer science questions as well as an (average?) eighth grader can.
I'm thinking of taking a first crack at the problem in R (I am proficient only in R and C++, and I don't think C++ will be a very useful language to solve this problem in). After exploring the Kaggle forums, I decided to use the topicmodels (tm) and RWeka packages along with Latent Dirichlet Allocation (LDA).
My current approach is to build a text predictor of some sort which, on reading the question posed to it, outputs a string of text. I then compute the cosine similarity between this output text and the four options given in the test set and predict the option with the highest cosine similarity to be the correct one.
I will train the model using the training data, a Wikipedia corpus along with a few Science textbooks so that the model does not overfit.
I have two questions here:
Does the overall approach make sense?
What would be a good starting point to build this text predictor? Will converting the corpus (training data, Wikipedia and textbooks) to a Term-Document/Document-Term matrix help? I think forming n-grams for all the sources would help, but I don't know what the next step would be, i.e. how exactly the model will predict and belt out a string of text (of, say, size n) on reading a question.
I have tried implementing a part of the approach: finding the optimum number of topics and performing LDA over the training set. Here's the code:
library(topicmodels)
library(RTextTools)

# Read the cleaned question/answer set and make sure all text columns are character
data <- read.delim("cleanset.txt", header = TRUE)
data$question <- as.character(data$question)
data$answerA <- as.character(data$answerA)
data$answerB <- as.character(data$answerB)
data$answerC <- as.character(data$answerC)
data$answerD <- as.character(data$answerD)

# Build a document-term matrix from the questions and the four answer options
# (renamed from 'matrix' to avoid shadowing base::matrix)
dtm <- create_matrix(cbind(as.vector(data$question), as.vector(data$answerA), as.vector(data$answerB), as.vector(data$answerC), as.vector(data$answerD)), language = "english", removeNumbers = FALSE, stemWords = TRUE, weighting = tm::weightTf)

# Fit LDA models for k = 2..25 topics and pick the k with the highest log-likelihood
best.model <- lapply(seq(2, 25, by = 1), function(k) { LDA(dtm, k) })
best.model.logLik <- as.data.frame(as.matrix(lapply(best.model, logLik)))
best.model.logLik.df <- data.frame(topics = c(2:25), LL = as.numeric(as.matrix(best.model.logLik)))
best.model.logLik.df[which.max(best.model.logLik.df$LL), ]

# Final model (k hard-coded to 25 here; ideally use the k selected above)
best.model.lda <- LDA(dtm, 25)
Any help will be appreciated!
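As a small illustration of the cosine-similarity scoring step described above (in Python for brevity; the same idea works on a tm document-term matrix in R, and all texts are made-up placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder generated text and answer options
generated = "plants use sunlight to make food through photosynthesis"
options = [
    "photosynthesis converts sunlight into chemical energy",
    "igneous rocks form from cooled magma",
    "sound travels faster in water than in air",
    "the mitochondria is the powerhouse of the cell",
]

# Bag-of-words (tf-idf) vectors for the generated text and the four options
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([generated] + options)

# Pick the option with the highest cosine similarity to the generated text
scores = cosine_similarity(vectors[0], vectors[1:]).ravel()
print("predicted option:", int(scores.argmax()))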
