Must the vocab size match the vocab_size in bert_config.json exactly? - bert-language-model

I am looking at someone else's BERT model, in which vocab.txt has 22110 entries, but the vocab_size parameter in bert_config.json is 21128.
I understand that these two numbers must be exactly the same. Is that right?

If it is really BERT with a WordPiece tokenizer, then yes. A mismatch between the vocabulary length and vocab_size in the config would mean that there are either embeddings that can never be used or vocabulary items without any embeddings.
In this case, you will see no error message when loading, because the model and the tokenizer are loaded separately. Here vocab.txt is 982 items larger than the embedding table (22110 vs. 21128), so those extra vocabulary items have no embeddings at all (they would only trigger an error if the tokenizer ever produced one of their indices).
Note, however, that the model may use some very non-standard tokenizer that saves the vocabulary in such a way that the file ends up 982 items larger than the vocabulary actually used (although that is quite unlikely).
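A quick way to see the mismatch yourself is shown below. This is only a sketch: the checkpoint path is a placeholder, and it assumes the model has been saved in a format that transformers can load directly.
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("path/to/that/model")   # placeholder path
model = BertModel.from_pretrained("path/to/that/model")

print(len(tokenizer))                               # lines in vocab.txt, e.g. 22110
print(model.config.vocab_size)                      # vocab_size from bert_config.json, e.g. 21128
print(model.get_input_embeddings().num_embeddings)  # rows actually present in the embedding table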

Related

Is AllenNLP biased towards BERT?

In my university's research group we have been pre-training a RoBERTa model for Portuguese, as well as a domain-specific one, also based on RoBERTa. We have been running a series of benchmarks using Hugging Face's transformers library, and the RoBERTa models are performing better than the existing Portuguese BERT model on almost all datasets and tasks.
One of the tasks we're focusing on is NER, and since AllenNLP supports a CRF-based NER model, we were looking forward to seeing whether we would get even greater improvements using these new RoBERTa models combined with AllenNLP's crf_tagger. We used the same jsonnet config we were using for BERT, only switching to RoBERTa, and ran a grid search on some hyperparameters to look for the best model. We tested hyperparameters such as weight decay and learning rate (for the huggingface_adamw optimizer) and dropout (for crf_tagger), using 3 different seeds. To our surprise, the RoBERTa models weren't getting better results than the existing BERT model, which contradicted the experiments using transformers. It wasn't even a tie: the BERT model was much better (90.43% for the best BERT vs. 89.27% for the best RoBERTa).
This made us suspect that AllenNLP could somehow be biased towards BERT, so we decided to run a standard English benchmark (CoNLL 2003) for NER using both transformers and AllenNLP, and the results we got reinforced this suspicion. For AllenNLP, we ran a grid search keeping exactly the same jsonnet config, changing only the learning rate (from 8e-6 to 7e-5), the learning rate scheduler (slanted_triangular and linear_with_warmup with 10% and 3% of the steps as warmup), and, of course, the model (bert-base-cased and roberta-base). The results we got for AllenNLP were surprising: absolutely all models trained with bert-base-cased were better than all roberta-base models (the best BERT was 91.65% on the test set and the best RoBERTa was 90.63%).
For transformers, we did almost the same thing, except that we didn't change the learning rate scheduler there; we kept the default one, which is linear with warmup, using a 10% warmup ratio. We tested the same learning rates and also used 3 different seeds. The results we got for transformers were exactly the opposite: all roberta-base models were better than all bert-base-cased models (the best RoBERTa was 92.46% on the test set and the best BERT was 91.58%).
Is there something in the AllenNLP framework that could be making these trained NER models biased towards BERT and underperform with RoBERTa? Where could we start looking for possible issues? It doesn't look like a hyperparameter issue, since we have tested so many combinations with grid search so far.
Thanks!
If model-biased behavior does exist, I'd expect it to be somewhere in the implementations of the Transformer-related modules, viz. PretrainedTransformerIndexer, PretrainedTransformerTokenizer, PretrainedTransformerEmbedder, etc.
It may be worth checking whether RoBERTa's special tokens (i.e., <s>, </s>, <pad>, <unk>, and <mask>) are being used. My understanding is that AllenNLP attempts to infer these, but if this inference process failed, then it's possible that e.g. the tokenizer would be preparing sequences with another model's special tokens, e.g. [CLS] instead of <s>, etc.
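For reference, here is a quick way to see what RoBERTa's special tokens should look like on the transformers side, so you can compare against what ends up in your AllenNLP-prepared instances (just a sanity-check sketch, nothing AllenNLP-specific):
from transformers import RobertaTokenizer

rt = RobertaTokenizer.from_pretrained('roberta-base')
# Expected: <s> </s> <pad> <unk> <mask>
print(rt.cls_token, rt.sep_token, rt.pad_token, rt.unk_token, rt.mask_token)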
I think I've figured this out. This behavior is likely caused by AllenNLP's default implementation of tokenization: when a pre-existing tokenization with paired tags is provided (as I assume it is since you are working with NER datasets where tags must be paired with tokens), PretrainedTransformerTokenizer.intra_word_tokenize is used, and this tokenization function does not add a leading space to tokens, causing suboptimal wordpiece tokenization.
Recall that the RoBERTa tokenizer uses byte-pair encoding, which uses special characters (Ġ in some implementations) to indicate the initial wordpiece of whitespace-separated tokens, while BERT uses ## to indicate non-initial wordpieces of whitespace-separated tokens. Observe:
>>> from transformers import BertTokenizer, RobertaTokenizer
>>> rt = RobertaTokenizer.from_pretrained('roberta-base')
>>> bt = BertTokenizer.from_pretrained('bert-base-cased')
>>> bt.tokenize('modern artistry')
['modern', 'artist', '##ry']
>>> rt.tokenize('modern artistry')
['modern', 'Ġart', 'istry']
RoBERTa does have the option add_prefix_space, which adds a leading space to the input so that the first token also gets the Ġ marker, but this is False by default, at least on roberta-base.
>>> rt.add_prefix_space
False
>>> rt.add_prefix_space = True
>>> rt.tokenize('modern artistry')
['Ġmodern', 'Ġart', 'istry']
Now, for AllenNLP: I expect that you used the PretrainedTransformerMismatchedEmbedder and PretrainedTransformerMismatchedIndexer setup, since you're doing NER. The indexer uses the intra_word_tokenize function of PretrainedTransformerTokenizer, and a quick look at its implementation reveals that it simply invokes the tokenizer on each individual token.
Why is this a problem? Well, this works fine if you're using WordPiece tokenization (like with BERT) since whitespace does not need to be present in the tokenizer's input for good subword tokenization to occur. However, BPE tokenization does require whitespace to be in the input string, and if we're calling the tokenizer on tokens without whitespace in them, then the BPE tokenizer no longer knows how to distinguish which subwords are token-initial! Consider:
# From before
>>> rt.tokenize('modern artistry')
['modern', 'Ġart', 'istry']
# The way AllenNLP does it. Bad, no initial "Ġ" on "art"!
>>> [wp for token in ['modern', 'artistry'] for wp in rt.tokenize(token)]
['modern', 'art', 'istry']
# This is equivalent to tokenizing a whole string with no space:
>>> rt.tokenize('modernartistry')
['modern', 'art', 'istry']
This information about token boundaries is potentially meaningful. Consider two strings, ax island and axis land, which have different meanings in English. If you tokenize them the way AllenNLP does, the input IDs for the wordpieces will be substantially different (!):
# Intended
>>> rt.tokenize('axis land')
['axis', 'Ġland']
>>> rt.tokenize('ax island')
['ax', 'Ġisland']
# What AllenNLP gives you
>>> [wp for token in ['axis', 'land'] for wp in rt.tokenize(token)]
['axis', 'land']
>>> [wp for token in ['ax', 'island'] for wp in rt.tokenize(token)]
['ax', 'is', 'land']
So, to mitigate this, you would need to modify intra_word_tokenize somehow to bring the wordpieces more in line with what you'd expect. I'm not positive this is exactly what's causing the performance issues you note, but I'm pretty sure this tokenization issue is happening for you, and if it is, I would expect performance degradation due to the suboptimal wordpiece tokenization. A cheap solution would be to flip add_prefix_space on, but there may be other problems that it could subtly cause; I haven't fully thought that through yet.
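For what it's worth, here is a minimal sketch of that cheap mitigation in plain transformers; whether your AllenNLP version lets you pass this flag through to the underlying tokenizer is something you'd have to check.
from transformers import RobertaTokenizer

# Re-create the tokenizer with add_prefix_space=True so that per-token calls
# still mark token-initial subwords with "Ġ".
rt = RobertaTokenizer.from_pretrained('roberta-base', add_prefix_space=True)
print([wp for token in ['ax', 'island'] for wp in rt.tokenize(token)])
# token boundaries are preserved again, e.g. something like ['Ġax', 'Ġisland']
print([wp for token in ['axis', 'land'] for wp in rt.tokenize(token)])
# e.g. ['Ġaxis', 'Ġland'], now distinct from the "ax island" case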

BERT sentence embeddings from transformers

I'm trying to get sentence vectors from hidden states in a BERT model. Looking at the huggingface BertModel instructions here, which say:
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertModel.from_pretrained("bert-base-multilingual-cased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
So, first note: as written on the website, this does /not/ run. You get:
>>> Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'BertTokenizer' object is not callable
But it looks like a minor change fixes it, in that you don't call the tokenizer directly, but ask it to encode the input:
encoded_input = tokenizer.encode(text, return_tensors="pt")
output = model(encoded_input)
OK, that aside, the tensors I get have a different shape than I expected:
>>> output[0].shape
torch.Size([1,11,768])
This is a lot of layers. Which is the correct layer to use for sentence embeddings? [0]? [-1]? Averaging several? I have the goal of being able to do cosine similarity with these, so I need a proper 1xN vector rather than an NxK tensor.
I see that the popular bert-as-a-service project appears to use [0]
Is this correct? Is there documentation for what each of the layers are?
While the existing answer of Jindrich is generally correct, it does not address the question entirely. The OP asked which layer he should use to calculate the cosine similarity between sentence embeddings, and the short answer to this question is: none. A metric like cosine similarity requires that the dimensions of the vector contribute equally and meaningfully, but this is not the case for the BERT weights released by the original authors. Jacob Devlin (one of the authors of the BERT paper) wrote:
I'm not sure what these vectors are, since BERT does not generate meaningful sentence vectors. It seems that this is doing average pooling over the word tokens to get a sentence vector, but we never suggested that this will generate meaningful sentence representations. And even if they are decent representations when fed into a DNN trained for a downstream task, it doesn't mean that they will be meaningful in terms of cosine distance. (Since cosine distance is a linear space where all dimensions are weighted equally).
However, that does not mean you cannot use BERT for such a task. It just means that you cannot use the pre-trained weights out of the box. You can either train a classifier on top of BERT that learns which sentences are similar (using the [CLS] token), or you can use sentence-transformers, which can be used in an unsupervised scenario because they were trained to produce meaningful sentence representations.
I don't think there is a single authoritative piece of documentation saying what to use and when. You need to experiment and measure what is best for your task. Recent observations about BERT are nicely summarized in this paper: https://arxiv.org/pdf/2002.12327.pdf.
I think the rule of thumb is:
Use the last layer if you are going to fine-tune the model for your specific task. And fine-tune whenever you can: several hundred or even dozens of training examples are enough.
Use some of the middle layers (7th or 8th) if you cannot fine-tune the model. The intuition is that the layers first develop a more and more abstract and general representation of the input, and at some point the representation starts to be more targeted to the pre-training task.
bert-as-a-service uses the last layer by default (but it is configurable). Here, it would be [:, -1]. However, it always returns a list of vectors for all input tokens. The vector corresponding to the first special (so-called [CLS]) token is considered to be the sentence embedding. This is where the [0] comes from in the snippet you refer to.
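To make that rule of thumb concrete, here is a minimal sketch, assuming a reasonably recent transformers version (where the tokenizer is callable and the model returns a ModelOutput with hidden states when asked):
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertModel.from_pretrained('bert-base-multilingual-cased', output_hidden_states=True)

encoded_input = tokenizer("Replace me by any text you'd like.", return_tensors='pt')
with torch.no_grad():
    output = model(**encoded_input)

last_hidden = output.last_hidden_state               # shape (1, num_tokens, 768)
cls_vector = last_hidden[:, 0, :]                    # the [CLS] vector, what bert-as-a-service exposes
mean_vector = last_hidden.mean(dim=1)                # average pooling over tokens
middle_layer = output.hidden_states[7].mean(dim=1)   # e.g. a middle layer, if you cannot fine-tune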
As mentioned in other answers, BERT was not meant to produce sentence-level embeddings. Now, let's work on how we can leverage the power of BERT to compute context-sensitive sentence-level embeddings.
BERT does carry context at the word level; here is an example:
This is a wooden stick.
Stick to your work.
The two sentences above both contain the word 'stick', and BERT does a good job of computing embeddings of 'stick' according to the sentence (or, say, the context).
Now, let's move on to another example:
--What is your age?
--How old are you?
The two sentences above are contextually very similar, so we need a model that can accept a sentence, text chunk, or paragraph and produce the right embeddings for it as a whole. Here is how that can be achieved.
Method 1:
Use a pre-trained sentence-transformers model; here is the link to the Hugging Face hub.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
model = SentenceTransformer(r"sentence-transformers/paraphrase-MiniLM-L6-v2")
embd_a = model.encode("What is your age?")
embd_b = model.encode("How old are you?")
sim_score = cos_sim(embd_a, embd_b)
print(sim_score)
output: tensor([[0.8648]])
Now, there may be a question of how we can train our own sentence transformer, specific to a domain. Here we go.
Supervised approach:
A common challenge for data scientists or ML engineers is getting correctly annotated data; it is usually hard to get in good volume. But say you have it: here is how we can train our own sentence transformer (don't worry, there is an unsupervised approach too).
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer('distilbert-base-nli-mean-tokens')
train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
                  InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)
# Tune the model
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
More details here.
Tip: If you have a set of sentences that are similar to each other, say a CSV where columns A and B contain sentences similar to each other (i.e., each row holds a pair of similar sentences), just load the CSV, assign random values between 0.85 and 0.95 as the similarity scores, and proceed.
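For illustration, here is a small sketch of that tip; the file name and the column names A and B are hypothetical.
import random
import pandas as pd
from sentence_transformers import InputExample

# Hypothetical CSV where columns A and B hold pairs of similar sentences
df = pd.read_csv('similar_pairs.csv')
train_examples = [
    InputExample(texts=[row.A, row.B], label=random.uniform(0.85, 0.95))
    for row in df.itertuples()
]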
Unsupervised approach
Say you don't have a huge set of annotated data, but you want to train a domain-specific sentence transformer; here is how we do it. Even for unsupervised training, data will be required, i.e. a list of sentences/paragraphs, but it need not be annotated. Say you don't have any data at all; there is still a workaround (please see the last part of the answer).
Multiple approaches are available for unsupervised training; I list two of the most prominent ones. To see a list of all available approaches, please visit here.
TSDAE (link to research paper).
from sentence_transformers import SentenceTransformer, LoggingHandler
from sentence_transformers import models, util, datasets, evaluation, losses
from torch.utils.data import DataLoader
# Define your sentence transformer model using CLS pooling
model_name = 'bert-base-uncased'
word_embedding_model = models.Transformer(model_name)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), 'cls')
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
# Define a list with sentences (1k - 100k sentences)
train_sentences = ["Your set of sentences",
"Model will automatically add the noise",
"And re-construct it",
"You should provide at least 1k sentences"]
# Create the special denoising dataset that adds noise on-the-fly
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
# DataLoader to batch your data
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
# Use the denoising auto-encoder loss
train_loss = losses.DenoisingAutoEncoderLoss(model, decoder_name_or_path=model_name, tie_encoder_decoder=True)
# Call the fit method
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    weight_decay=0,
    scheduler='constantlr',
    optimizer_params={'lr': 3e-5},
    show_progress_bar=True
)
model.save('output/tsdae-model')
SimCSE (link to research paper).
from sentence_transformers import SentenceTransformer, InputExample
from sentence_transformers import models, losses
from torch.utils.data import DataLoader
# Define your sentence transformer model using CLS pooling
model_name = 'distilroberta-base'
word_embedding_model = models.Transformer(model_name, max_seq_length=32)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
# Define a list with sentences (1k - 100k sentences)
train_sentences = ["Your set of sentences",
"Model will automatically add the noise",
"And re-construct it",
"You should provide at least 1k sentences"]
# Convert train sentences to sentence pairs
train_data = [InputExample(texts=[s, s]) for s in train_sentences]
# DataLoader to batch your data
train_dataloader = DataLoader(train_data, batch_size=128, shuffle=True)
# Use the multiple negatives ranking loss (the SimCSE-style contrastive objective)
train_loss = losses.MultipleNegativesRankingLoss(model)
# Call the fit method
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    show_progress_bar=True
)
model.save('output/simcse-model')
Tip: If you observe carefully, the major difference is in the loss function used. To see a list of all the loss functions applicable to such training scenarios, visit here. Also, across all the experiments I did, I found TSDAE more useful when you want decent precision and good recall, whereas SimCSE can be used when you want very high precision and low recall.
Now, if you don't have sufficient data to fine-tune the model but you find a BERT model trained on your domain, you can directly leverage it by adding pooling and dense layers. Please do some research on what 'pooling' is, to better understand what you are doing.
from sentence_transformers import SentenceTransformer, models
from torch import nn
word_embedding_model = models.Transformer('bert-base-uncased', max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
dense_model = models.Dense(in_features=pooling_model.get_sentence_embedding_dimension(), out_features=256, activation_function=nn.Tanh())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_model])
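Once the modules are assembled, a quick sanity check might look like the sketch below (reusing the cos_sim helper from earlier; the sentences are just examples):
from sentence_transformers.util import cos_sim

emb_a = model.encode("What is your age?")
emb_b = model.encode("How old are you?")
print(cos_sim(emb_a, emb_b))
# Also try clearly unrelated sentences; see the tip below about negative testing.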
Tip: With the above approach, if you start getting extremely high cosine scores, it is a signal to do negative testing. Sometimes simply adding pooling layers may not help; you must take a few examples and check the similarity scores for inputs that are not similar (it is possible that even dissimilar sentences show high similarity, and that is when you should stop, try to collect some data, and do unsupervised training).
For people who are interested in going deeper, here is a list of topics that may help:
Pooling
Siamese Networks
Contrastive Loss
:) :)

Not sure what keys should be in class_weight dict to pass to Keras model.fit() method

I'm working on a text classification problem with very imbalanced classes, of which there are 94 total. I think that passing a dictionary of class weights to the class_weight parameter in model.fit() might prove fruitful for me. I've read up on how to generate good ratios; that's not a problem.
If I understand correctly, the dictionary should contain (class: weight) pairs. But in my model's parlance, a sample's class (or label) is a one-hot encoded vector — not an integer (which is what I've seen used as the keys in online discussion about the class_weight parameter, and which confuses me), and certainly not the label string itself. I can't have a vector as a key in a dictionary... I'd appreciate any insight.
I have tried a couple of things. I did try making a dictionary of (class: weight) pairs in which the class value was merely an integer (roughly the kind of thing sketched below; maybe Keras ingeniously knows to associate an integer label with the one-hot vectors whose 1 is in the associated index — wishful thinking!), but performance didn't change. Then I updated the dictionary to feature an INCORRECT number of classes, and got no error messages and the same performance. So... was fit just ignoring it, or...?
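For reference, this is roughly the kind of integer-keyed dict I tried. It is only a sketch: y_onehot, X_train, and model are stand-ins for my own variables.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_int = y_onehot.argmax(axis=1)          # integer class per sample, taken from the position of the 1
classes = np.unique(y_int)
weights = compute_class_weight('balanced', classes=classes, y=y_int)
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}

model.fit(X_train, y_onehot, class_weight=class_weight, epochs=10)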
Any advice would be greatly appreciated.

What is Lambda definability?

While I was reading about lambda calculus, I came across the term "lambda definability". Can someone please explain what that is, as I couldn't find any good resources on it?
Thanks
More generally, there is a line of research seeking to characterize "lambda definability" over a broad class of languages. "Lambda definability" itself is typically relative to a semantics of a language given in terms of sets. For a type T in our language, write |T| for its interpretation as a set. Now, take an element of |T| -- call it e. We want to know if there is a term in our language -- call it x : T (x of type T) -- such that |x| is e. If there is such a term, then we say that e is lambda-definable.
Now, in our perfect world, when we interpret a language into sets, we would like to say that the sets associated with each type are precisely those that contain the lambda-definable elements of that type and only the lambda-definable elements (completeness). It would also be nice, perhaps to say that we can provide an algorithm to determine if a claimed element of a set has an associated lambda term (decidability).
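As a quick illustration of why completeness is a real constraint (my own example, not part of the original answer): if a base type nat is interpreted as the set of natural numbers, then |nat -> nat| is the set of all functions from naturals to naturals, which is uncountable; since there are only countably many lambda terms, most elements of |nat -> nat| cannot be lambda-definable, so the "full" set-theoretic model is never complete in the above sense.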
Now, often we don't just model into sets, but into other funny mathematical constructions. And we don't model just from the lambda calculus, but from other related systems such as Plotkin's PCF or the like. But the property under study is typically still called "lambda-definability".
After decades of research there are still many open problems and questions in this regard -- while certain lower-order terms have been shown to have decidable lambda-definability (the classic results involve terms up to second-order), many terms do not yield so easily. This paper ("The Undecidability of lambda-Definability" by Ralph Loader) gives an important such undecidability result and characterizes some consequences: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.36.6860
See also the Church-Turing thesis, where the lambda-definable functions (from Church) are those that capture the "effectively computable" functions. Turing showed that the functions implementable on a Turing machine are equivalent to the lambda-definable functions.

Math question regarding Python's uuid4

I'm not great with statistical mathematics, etc. I've been wondering, if I use the following:
import uuid
unique_str = str(uuid.uuid4())
double_str = ''.join([str(uuid.uuid4()), str(uuid.uuid4())])
Is double_str string squared as unique as unique_str or just some amount more unique? Also, is there any negative implication in doing something like this (like some birthday problem situation, etc)? This may sound ignorant, but I simply would not know as my math spans algebra 2 at best.
The uuid4 function returns a UUID created from 16 random bytes (a few bits are fixed to mark the version and variant), and it is extremely unlikely to produce a collision, to the point where you probably shouldn't even worry about it.
If for some reason uuid4 does produce a duplicate, it is far more likely to be a programming error, such as a failure to correctly initialize the random number generator, than genuine bad luck. In that case the approach you are using will not make it any better: an incorrectly initialized random number generator can still produce duplicates even with your approach.
If you use the default implementation, random.seed(None), you can see in the source that only 16 bytes of randomness are used to initialize the random number generator, so this is an issue you would have to solve first. Also, if the OS doesn't provide a source of randomness, the system time will be used, which is not very random at all.
But ignoring these practical issues, you are basically along the right lines. To use a mathematical approach we first have to define what you mean by "uniqueness". I think a reasonable definition is the number of ids n you need to generate before the probability of generating a duplicate exceeds some probability p. An approximate formula for this is:
n ≈ sqrt(2 * d * ln(1 / (1 - p)))
where d is 2**(16*8) for a single randomly generated uuid and 2**(16*2*8) with your suggested approach. The square root in the formula is indeed due to the Birthday Paradox. But if you work it out you can see that if you square the range of values d while keeping p constant, then you also square n (up to a small constant factor).
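To see the effect numerically, here is a small sketch of that approximation, using the same d values as above (strictly, uuid4 fixes a few version/variant bits, so the true d is somewhat smaller, but that doesn't change the conclusion):
import math

def collision_threshold(d, p=0.5):
    # Approximate number of ids you can draw before the probability of a duplicate exceeds p
    return math.sqrt(2 * d * math.log(1 / (1 - p)))

print(collision_threshold(2 ** (16 * 8)))      # one uuid4: roughly 2e19 ids
print(collision_threshold(2 ** (16 * 2 * 8)))  # two concatenated uuid4s: roughly 4e38 ids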
Since uuid4 is based on a pseudo-random number generator, calling it twice is not going to square the amount of "uniqueness" (and may not even add any uniqueness at all).
See also When should I use uuid.uuid1() vs. uuid.uuid4() in python?
It depends on the random number generator, but it's almost squared uniqueness.

Resources