Do I need to train on my own data to use a BERT model as an embedding vector?

When I try the Hugging Face models with the following code, I get a warning message:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello world!", return_tensors="pt")
outputs = model(**inputs)
The warning message is:
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
My purpose is to find a pretrained model to create embedding vectors for my text, so that they can be used in downstream tasks. I don't want to train my own pretrained model to generate the embedding vectors. In this case, can I ignore those warning messages, or do I need to continue training on my own data? In another post I learned that "Most of the official models don't have pretrained output layers. The weights are randomly initialized. You need to train them for your task." My understanding is that I don't need to train if I just want a generic embedding vector for my text based on the public models, like the ones on Hugging Face. Is that right?
I am new to transformers, so any comments are appreciated.

Indeed the bert-base-uncased model is already pre-trained and will produce contextualised outputs, which should not be random.
If you're aiming to get a vector representation for the entire input sequence, this is typically done by running your sequence through your model (as you have done) and extracting the representation of the [CLS] token.
The position of the [CLS] token may change depending on the base model you are using, but it is typically the first token in the output sequence.
The FeatureExtractionPipeline (documentation here) is a wrapper for the process of extracting contextualised features from the model.
from transformers import FeatureExtractionPipeline

# model and tokenizer are the bert-base-uncased objects loaded in the question
nlp = FeatureExtractionPipeline(model=model, tokenizer=tokenizer)

sentence = "Hello world!"
outputs = nlp(sentence)        # nested list: one list of token vectors per input
embeddings = outputs[0]        # token vectors for our single sentence
cls_embedding = embeddings[0]  # the [CLS] token sits at the first position
Some checks to help verify that things are going as expected:
Check that the [CLS] embedding has the expected dimensionality
Check that the [CLS] embedding produces similar vectors for similar text, and different vectors for different text (e.g. by applying cosine similarity)
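For instance, here is a minimal sketch of both checks, assuming the model and tokenizer loaded in the question and using made-up example sentences:
import torch

def cls_vector(text):
    # model and tokenizer are the bert-base-uncased objects from the question
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # last_hidden_state has shape (batch, seq_len, hidden); take the [CLS] token
    return outputs.last_hidden_state[0, 0]

a = cls_vector("The cat sat on the mat.")
b = cls_vector("A kitten is sitting on the rug.")
c = cls_vector("Quarterly revenue grew by ten percent.")

print(a.shape)                                # torch.Size([768]) for bert-base-uncased
print(torch.cosine_similarity(a, b, dim=0))   # similar sentences: expect a higher value
print(torch.cosine_similarity(a, c, dim=0))   # unrelated sentences: expect a lower value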
Additional References: https://github.com/huggingface/transformers/issues/1950

Related

Split a pre-trained CoreML model into two

I have a Sound Classification model from turicreate example here:
https://apple.github.io/turicreate/docs/userguide/sound_classifier/
I am trying to split this model into two and save the two parts as separate CoreML models using the coremltools library. Can anyone please guide me on how to do this?
I am able to load the model and even print out the spec of the model, but I don't know where to go from here.
import coremltools
mlmodel = coremltools.models.MLModel('./EnvSceneClassification.mlmodel')
# Get spec from the model
spec = mlmodel.get_spec()
The output should be two CoreML models, i.e. the above model split into two parts.
I'm not 100% sure what the sound classifier model looks like. If it's a pipeline, you can just save each sub-model from the pipeline as its own separate mlmodel file.
If it's not a pipeline, it requires some model surgery. You will need to delete layers from the spec (with del spec.neuralNetworkClassifier.layers[a:b]).
You'll also need to change the inputs of the first model and the outputs of the second model to account for the deleted layers.
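Roughly, the surgery route could look like the sketch below. This assumes the model is a plain neural network classifier; the split index k is a placeholder, and the input/output wiring still has to be matched to the real blob names in EnvSceneClassification.mlmodel.
import coremltools

mlmodel = coremltools.models.MLModel('./EnvSceneClassification.mlmodel')

# get_spec() returns a copy, so grab two independent specs to edit
spec_a = mlmodel.get_spec()
spec_b = mlmodel.get_spec()

k = 5  # hypothetical split point, chosen by inspecting the layer list

# First half: keep layers [0, k), drop the rest
del spec_a.neuralNetworkClassifier.layers[k:]
# Second half: drop the first k layers, keep the rest
del spec_b.neuralNetworkClassifier.layers[:k]

# You still need to edit spec_a.description.output so it exposes the blob
# produced by layer k-1, and spec_b.description.input so it accepts that same
# blob as its input (matching name, type and shape).

coremltools.models.MLModel(spec_a).save('part_a.mlmodel')
coremltools.models.MLModel(spec_b).save('part_b.mlmodel')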

R CRAN Neural Network Package compute vs prediction

I am using R along with the neuralnet package (see the docs: https://cran.r-project.org/web/packages/neuralnet/neuralnet.pdf). I have used the neuralnet function to build and train my model.
Now that I have built my model, I want to test it on real data. Could someone explain whether I should use the compute or the prediction function? I have read the documentation and it isn't clear; both functions seem to do something similar.
Thanks
The short answer is to use compute to do predictions.
You can see an example of using compute on the test set here. We can also see that compute is the right one from the documentation:
compute, a method for objects of class nn, typically produced by neuralnet. Computes the outputs of all neurons for specific arbitrary covariate vectors given a trained neural network.
The above says that you can use covariate vectors in order to compute the output of the neural network i.e. make a prediction.
On the other hand, prediction does what is mentioned in its title in the documentation:
Summarizes the output of the neural network, the data and the fitted values of glm objects (if available)
Moreover, it only takes two arguments: the nn object and a list of glm models so there isn't a way to pass in the test set in order to make a prediction.
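As a minimal sketch (with a made-up two-input regression problem, since the original data isn't shown), prediction on a test set with compute looks like this:
library(neuralnet)

set.seed(1)
train <- data.frame(x1 = runif(100), x2 = runif(100))
train$y <- train$x1 + 2 * train$x2 + rnorm(100, sd = 0.05)

# Train the network
nn <- neuralnet(y ~ x1 + x2, data = train, hidden = 3, linear.output = TRUE)

# Predict on new covariate vectors: pass only the covariate columns to compute
test <- data.frame(x1 = runif(10), x2 = runif(10))
pred <- compute(nn, test)
pred$net.result   # the predicted values for the test set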

Text classification with R and SVM. Matrix features

I am playing a bit with text classification and SVM.
My understanding is that the typical way to pick the features for the training matrix is essentially a "bag of words": we end up with a matrix with as many columns as there are distinct words in our documents, and the value of each column is the number of occurrences of that word per document (each document is represented by a single row).
That all works fine, and I can train my algorithm and so on, but sometimes I get an error like
Error during wrapup: test data does not match model !
By digging a bit, I found the answer in this question, Error in predict.svm: test data does not match model, which essentially says that if your model has features A, B and C, then your new data to be classified should contain columns A, B and C. Of course with text this is a bit tricky: my new documents to classify might contain words that have never been seen by the classifier during training.
More specifically, I am using the RTextTools library, which uses the SparseM and tm libraries internally; the object used to train the SVM is of type "matrix.csr".
Regardless of the specifics of the library my question is, is there any technique in document classification to ensure that the fact that training documents and new documents have different words will not prevent new data from being classified?
UPDATE: The solution suggested by @lejlot is very simple to achieve in RTextTools by making use of the originalMatrix optional parameter of the create_matrix function. Essentially, originalMatrix should be the SAME matrix that one creates with create_matrix when TRAINING the data. So after you have trained your data and have your models, also keep the original document matrix; when scoring new examples, make sure to pass that object when creating the matrix for your prediction set.
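In code, that boils down to something like the following rough sketch (the documents, labels and most arguments are made up for illustration; the key part is passing the training matrix as originalMatrix):
library(RTextTools)

train_texts <- c("the cat sat on the mat", "stock prices fell sharply",
                 "the dog chased the cat", "markets rallied after the report")
train_labels <- c(1, 2, 1, 2)   # 1 = pets, 2 = finance

# Matrix used for training -- keep this object around
train_matrix <- create_matrix(train_texts, language = "english",
                              removeStopwords = TRUE, stemWords = TRUE)
container <- create_container(train_matrix, train_labels,
                              trainSize = 1:4, virgin = FALSE)
model <- train_model(container, "SVM")

# New, unseen documents: pass the TRAINING matrix as originalMatrix so the
# columns line up and out-of-vocabulary words are simply dropped
new_texts <- c("a kitten and a puppy", "the stock market fell")
new_matrix <- create_matrix(new_texts, originalMatrix = train_matrix,
                            language = "english",
                            removeStopwords = TRUE, stemWords = TRUE)
new_container <- create_container(new_matrix, labels = rep(0, length(new_texts)),
                                  testSize = 1:length(new_texts), virgin = TRUE)
results <- classify_model(new_container, model)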
Regardless of the specifics of the library my question is, is there any technique in document classification to ensure that the fact that training documents and new documents have different words will not prevent new data from being classified?
Yes, and it is a very trivial one. Before applying any training or classification you create a preprocessing object, which maps text to your vector representation. In particular, it stores the whole vocabulary used for training. Later on you reuse the same preprocessing object on the test documents, and you simply ignore words outside the vocabulary stored before (OOV words, as they are often referred to in the literature).
Obviously there are plenty of other, more "heuristic" approaches where, instead of discarding OOV words, you try to map them to existing words (although this is less theoretically justified). In that case you create an intermediate representation, which becomes your new "preprocessing" object and which can handle OOV words (through some Levenshtein-distance mapping, etc.).

How are the predictions obtained?

I have been unable to find information on how exactly predict.cv.glmnet works.
Specifically, when a prediction is being made, are the predictions based on a fit that uses all the available data? Or are they based on a fit where some data has been discarded as part of the cross-validation procedure when running cv.glmnet?
I would strongly assume the former, but was unable to find a sentence in the documentation that clearly states that, after cross-validation is finished, the model is fitted on all available data for new predictions.
If I have overlooked a statement along those lines, I would also appreciate a hint on where to find this.
Thanks!
In the documentation for predict.cv.glmnet:
"This function makes predictions from a cross-validated glmnet model, using the stored "glmnet.fit" object ... "
In the documentation for cv.glmnet (under value):
"glmnet.fit a fitted glmnet object for the full data."

R - How to get one "summary" prediction map instead of 5 when using 5-fold cross-validation in a maxent model?

I hope I have come to the right forum. I'm an ecologist making species distribution models using the maxent function (version 3.3.3, http://www.cs.princeton.edu/~schapire/maxent/) in R, through the dismo package. I have used the argument "replicates = 5", which tells maxent to do a 5-fold cross-validation. When running maxent from the maxent.jar file directly (the maxent software), an html file with statistics is produced, including the prediction maps. In R, an html file is also made, but the prediction maps have to be extracted afterwards, using the predict function in the dismo package in R. When I do this, I get 5 maps, due to the 5-fold cross-validation setting. However (and this is the problem), I want only one output map, one "summary" prediction map. I assume this is possible, although I don't know how maxent computes it. The maxent tutorial (see link above) says that:
"...you may want to avoid eating up disk space by turning off the “write output grids” option, which will suppress writing of output grids for the replicate runs, so that you only get the summary statistics grids (avg, stderr etc.)."
A list of arguments that can be put into R is found in this forum https://groups.google.com/forum/#!topic/maxent/yRBlvZ1_9rQ.
I have tried to use the argument "outputgrids=FALSE" both in the maxent function itself, and in the predict function, but it doesn't work. I still get 5 maps, even though I don't get any errors in R.
So my question is: How do I get one "summary" prediction map instead of the five prediction maps that results from the cross-validation?
I hope someone can help me with this; I am really stuck and haven't found any answers anywhere on the internet, not even a discussion about this. Hope my question is clear. This is the R script that I use:
model1<-maxent(x=predvars, p=presence_points, a=target_group_absence, path="//home//...//model1", args=c("replicates=5", "outputgrids=FALSE"))
model1map<-predict(model1, predvars, filename="//home//...//model1map.tif", outputgrids=FALSE)
Best regards,
Kristin
Sorry to be the bearer of bad news, but based on the source code, it looks like Dismo's predict function does not have the ability to generate a summary map.
Nitty-gritty details for those who care: When you call maxent with replicates set to something greater than 1, the maxent function returns a MaxEntReplicates object, rather than a normal MaxEnt object. When predict receives a MaxEntReplicates object, it just iterates through all of the models that it contains and calls predict on them individually.
So, what next? Fortunately, all is not lost! The reason that Dismo doesn't have this functionality is that for most kinds of model-building, there isn't actually a valid way to average parameters across your cross-validation models. I don't want to go so far as to say that that's definitely the case for MaxEnt specifically, but I suspect it is. As such, cross-validation is usually used more as a way of checking that your model building methodology works for your data than as a way of building your model directly (see this question for further discussion of that point). After verifying via cross-validation that models built using a given procedure seem to be accurate for the phenomenon you're modelling, it's customary to build a final model using all of your data. In theory this new model should only be better than models trained on a subset of your data.
So basically, assuming your cross-validated models look reasonable, you can run MaxEnt again with only one replicate. Your final result will be a model accuracy estimate based on the cross-validation and a map based on the second run with all of your data lumped together. Depending on what exactly your question is, there might be other useful summary statistics from the cross-validation that you want to use, but those are all things you've already seen in the html output.
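In dismo terms that could look roughly like the sketch below, reusing the variable names from the question (the output filename is just a placeholder):
# Step 1: 5-fold cross-validation, used only to judge model accuracy
cv_models <- maxent(x=predvars, p=presence_points, a=target_group_absence,
                    args=c("replicates=5"))

# Step 2: a single final model on all of the data, and one prediction map from it
final_model <- maxent(x=predvars, p=presence_points, a=target_group_absence)
final_map <- predict(final_model, predvars, filename="final_map.tif")
plot(final_map)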
I may have found this a couple of years later, but you could do something like this:
xm1 <- maxent(predictors, pres_train1) # maxent model for the first partition
xm2 <- maxent(predictors, pres_train2) # maxent model for the second partition, and so on
px1 <- predict(predictors, xm1, ext=ext, progress='') # prediction map #01
px2 <- predict(predictors, xm2, ext=ext, progress='') # prediction map #02
models <- stack(px1, px2) # create a stack of the predictions from all the models
final_map <- mean(models) # take the cell-wise mean of all the predictions
plot(final_map) # plot the averaged map
xm1, xm2, ... would be the maxent models for each partition in the cross-validation (with pres_train1, pres_train2, ... the presence points of each partition), and px1, px2, ... would be the predicted maps.
