How to freeze layer parameters in Flux.jl - julia

I am currently working on a transfer learning problem so I would like to freeze most of my layers such that the model is not re-trained, and only the final layer's weights are modified. How can I choose which layers to freeze with Flux.jl?

Flux provides a simple interface to do this which is to only pass in the layers you want to be modified to the Flux.params() function as shown below:
m = Chain(
Dense(784, 64, relu),
Dense(64, 64, relu),
Dense(32, 10)
)
ps = Flux.params(m[3:end])
In the above example, we chose to only update the final Dense layer (which is commonly what you do in a transfer learning example).
You can see a full example with more build up in the Flux.jl tutorial on transfer learning: https://fluxml.ai/tutorials/2020/10/18/transfer-learning.html

Related

TensorFlow Lite for Microcontrollers: Didn't find op for builtin opcode 'REDUCE_PROD' version '1'

I am using a Seeed Studio XIAO to play some machine learning.
I follow the tutorialhere
The model in this example seems to have an input size of [None, time_steps * num_features]. In its case, the input instance to the model is [1, 119 samples * 6 IMU features].
I could also run the .ino file on my Seeed Studio board without any errors.
However, I built and trained a model with an input size of [None, 30(time_steps), 6(num_features)].
Basically, my model just had Input layer (None, 30, 6), 2 dense layers, a flatten layer, and a final dense layer (None, 4).
I was able to convert the model to .tflite and .h. But when I loaded it to Arduino, the Serial Port showed error:
20:26:45.976 -> Didn't find op for builtin opcode 'REDUCE_PROD' version '1'. An older version of this builtin might be supported. Are you using an old TFLite binary with a newer model?
20:26:45.976 ->
20:26:45.976 -> Failed to get registration from op code REDUCE_PROD
Anyone has an idea?
I also noticed that Flatten() is not supported by tflite micro, then I used Reshape() in my model. However, same error was shown in the Serial Port.
In TensorFlow Lite for Microcontroller applications, what should I do to run a model with an input size [None, time_steps, num_features]?

BERT sentence embeddings from transformers

I'm trying to get sentence vectors from hidden states in a BERT model. Looking at the huggingface BertModel instructions here, which say:
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertModel.from_pretrained("bert-base-multilingual-cased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
So first note, as it is on the website, this does /not/ run. You get:
>>> Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'BertTokenizer' object is not callable
But it looks like a minor change fixes it, in that you don't call the tokenizer directly, but ask it to encode the input:
encoded_input = tokenizer.encode(text, return_tensors="pt")
output = model(encoded_input)
OK, that aside, the tensors I get, however, have a different shape than I expected:
>>> output[0].shape
torch.Size([1,11,768])
This is a lot of layers. Which is the correct layer to use for sentence embeddings? [0]? [-1]? Averaging several? I have the goal of being able to do cosine similarity with these, so I need a proper 1xN vector rather than an NxK tensor.
I see that the popular bert-as-a-service project appears to use [0]
Is this correct? Is there documentation for what each of the layers are?
While the existing answer of Jindrich is generally correct, it does not address the question entirely. The OP asked which layer he should use to calculate the cosine similarity between sentence embeddings and the short answer to this question is none. A metric like cosine similarity requires that the dimensions of the vector contribute equally and meaningfully, but this is not the case for BERT weights released by the original authors. Jacob Devlin
(one of the authors of the BERT paper) wrote:
I'm not sure what these vectors are, since BERT does not generate meaningful sentence vectors. It seems that this is doing average pooling over the word tokens to get a sentence vector, but we never suggested that this will generate meaningful sentence representations. And even if they are decent representations when fed into a DNN trained for a downstream task, it doesn't mean that they will be meaningful in terms of cosine distance. (Since cosine distance is a linear space where all dimensions are weighted equally).
However, that does not mean you can not use BERT for such a task. It just means that you can not use the pre-trained weights out-of-the-box. You can either train a classifier on top of BERT which learns which sentences are similar (using the [CLS] token) or you can use sentence-transformers which can be used in an unsupervised scenario because they were trained to produce meaningful sentence representations.
I don't think there is single authoritative documentation saying what to use and when. You need to experiment and measure what is best for your task. Recent observations about BERT are nicely summarized in this paper: https://arxiv.org/pdf/2002.12327.pdf.
I think the rule of thumb is:
Use the last layer if you are going to fine-tune the model for your specific task. And finetune whenever you can, several hundred or even dozens of training examples are enough.
Use some of the middle layers (7-th or 8-th) if you cannot finetune the model. The intuition behind that is that the layers first develop a more and more abstract and general representation of the input. At some point, the representation starts to be more target to the pre-training task.
Bert-as-services uses the last layer by default (but it is configurable). Here, it would be [:, -1]. However, it always returns a list of vectors for all input tokens. The vector corresponding to the first special (so-called [CLS]) token is considered to be the sentence embedding. This where the [0] comes from in the snipper you refer to.
As mentioned in other answers, BERT was not meant to produce sentence level embeddings. Now, let's work on the how we can leverage power of BERT for computing context-sensitive sentence level embeddings.
BERT does carry the context at word level, here is an example:
This is a wooden stick.
Stick to your work.
Above two sentences carry the word 'stick', BERT does a good job in computing embeddings of stick as per sentence(or say, context).
Now, let's move to one another example:
--What is your age?
--How old are you?
Above two sentences are contextually very similar, so, we need a model that can accept a sentence or text chunk or paragraph and produce right embeddings collectively. Here is how it can be achieved.
Method 1:
Use pre-trained sentence_transformers, here is link to huggingface hub.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
model = SentenceTransformer(r"sentence-transformers/paraphrase-MiniLM-L6-v2")
embd_a = model.encode("What is your age?")
embd_b = model.encode("How old are you?")
sim_score = cos_sim(embd_a, embd_b)
print(sim_score)
output: tensor([[0.8648]])
Now, there may be a question on how can we train our on sentence_transformer, specific to a domain. Here we go,
Supervised approach:
A common challenge for Data Scientist or MLEngineers is to get rightly annotated data, mostly it is hard to get it in good volume, but say, if you have it here is how we can train our on sentence_transformer (don't worry, there is an unsupervised approach too).
model = SentenceTransformer('distilbert-base-nli-mean-tokens')
train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)
#Tune the model
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
More details here.
Tip: If you have a set of sentences that are similar to each other, say, you have a CSV, where column A and B contains sentences similar to each other(I mean each row will have a pair of sentences which are similar to each other), just load the csv and assign random values between 0.85 to 0.95 as similarity score and proceed.
Unsupervised approach
Say you don't have a huge set of annotated data, but you want to train a domain specific sentence_transformer, here is how we do it. Even for unsupervised training, data will be required, i.e. list of sentences/paragraphs, but need not to be annotated. Say, you don't have any data at all, still there is a work round (please visit last part of the answer).
Multiple approaches are available for unsupervised training, listing two of the most prominent ones. To see list of all available approaches, please visit here.
TSDAE link to research paper.
from sentence_transformers import SentenceTransformer, LoggingHandler
from sentence_transformers import models, util, datasets, evaluation, losses
from torch.utils.data import DataLoader
# Define your sentence transformer model using CLS pooling
model_name = 'bert-base-uncased'
word_embedding_model = models.Transformer(model_name)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), 'cls')
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
# Define a list with sentences (1k - 100k sentences)
train_sentences = ["Your set of sentences",
"Model will automatically add the noise",
"And re-construct it",
"You should provide at least 1k sentences"]
# Create the special denoising dataset that adds noise on-the-fly
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
# DataLoader to batch your data
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
# Use the denoising auto-encoder loss
train_loss = losses.DenoisingAutoEncoderLoss(model, decoder_name_or_path=model_name, tie_encoder_decoder=True)
# Call the fit method
model.fit(
train_objectives=[(train_dataloader, train_loss)],
epochs=1,
weight_decay=0,
scheduler='constantlr',
optimizer_params={'lr': 3e-5},
show_progress_bar=True
)
model.save('output/tsdae-model')
SimCSE link to research paper
from sentence_transformers import SentenceTransformer, InputExample
from sentence_transformers import models, losses
from torch.utils.data import DataLoader
# Define your sentence transformer model using CLS pooling
model_name = 'distilroberta-base'
word_embedding_model = models.Transformer(model_name, max_seq_length=32)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
# Define a list with sentences (1k - 100k sentences)
train_sentences = ["Your set of sentences",
"Model will automatically add the noise",
"And re-construct it",
"You should provide at least 1k sentences"]
# Convert train sentences to sentence pairs
train_data = [InputExample(texts=[s, s]) for s in train_sentences]
# DataLoader to batch your data
train_dataloader = DataLoader(train_data, batch_size=128, shuffle=True)
# Use the denoising auto-encoder loss
train_loss = losses.MultipleNegativesRankingLoss(model)
# Call the fit method
model.fit(
train_objectives=[(train_dataloader, train_loss)],
epochs=1,
show_progress_bar=True
)
model.save('output/simcse-model')
Tip: If you carefully observer, major difference is in the loss function used. To see a list of all the loss function applicable to such training scenarios, visit here. Also, with all the experiments I did, I found that TSDAE is more useful, when you want decent precision and good recall. However, SimCSE can be used when you want very high precision and low recall.
Now, if you don't have sufficient data to fine tune the model, but you find a BERT model trained on your domain, you can directly leverage that by adding pooling and dense layers. Please do research on what is 'pooling', to have better understanding on what you are doing.
from sentence_transformers import SentenceTransformer, models
from torch import nn
word_embedding_model = models.Transformer('bert-base-uncased', max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
dense_model = models.Dense(in_features=pooling_model.get_sentence_embedding_dimension(), out_features=256, activation_function=nn.Tanh())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_model])
Tip: With above approach, if you start getting extreme high cosine score, it is an alarm to do negative testing. Sometime, simply adding pooling layers may not help, you must take few examples and check similarity scores for the inputs that are not similar (it is possible that even for dissimilar sentences, this may show good similarity, and that is the time you should stop and try to collect some data and do unsupervised training)
People who are interested in going deeper, here is a list of topics that may help you.
Pooling
Siamese Networks
Contrastive Loss
:) :)

doubts regarding batch size and time steps in RNN

In Tensorflow's tutorial of RNN: https://www.tensorflow.org/tutorials/recurrent
. It mentions two parameters: batch size and time steps. I am confused by the concepts. In my opinion, RNN introduces batch is because the fact that the to-train sequence can be very long such that backpropagation cannot compute that long(exploding/vanishing gradients). So we divide the long to-train sequence into shorter sequences, each of which is a mini-batch and whose size is called "batch size". Am I right here?
Regarding time steps, RNN consists of only a cell (LSTM or GRU cell, or other cell) and this cell is sequential. We can understand the sequential concept by unrolling it. But unrolling a sequential cell is a concept, not real which means we do not implement it in unroll way. Suppose the to-train sequence is a text corpus. Then we feed one word each time to the RNN cell and then update the weights. So why do we have time steps here? Combining my understanding of the above "batch size", I am even more confused. Do we feed the cell one word or multiple words (batch size)?
Batch size pertains to the amount of training samples to consider at a time for updating your network weights. So, in a feedforward network, let's say you want to update your network weights based on computing your gradients from one word at a time, your batch_size = 1.
As the gradients are computed from a single sample, this is computationally very cheap. On the other hand, it is also very erratic training.
To understand what happen during the training of such a feedforward network,
I'll refer you to this very nice visual example of single_batch versus mini_batch to single_sample training.
However, you want to understand what happens with your num_steps variable. This is not the same as your batch_size. As you might have noticed, so far I have referred to feedforward networks. In a feedforward network, the output is determined from the network inputs and the input-output relation is mapped by the learned network relations:
hidden_activations(t) = f(input(t))
output(t) = g(hidden_activations(t)) = g(f(input(t)))
After a training pass of size batch_size, the gradient of your loss function with respect to each of the network parameters is computed and your weights updated.
In a recurrent neural network (RNN), however, your network functions a tad differently:
hidden_activations(t) = f(input(t), hidden_activations(t-1))
output(t) = g(hidden_activations(t)) = g(f(input(t), hidden_activations(t-1)))
=g(f(input(t), f(input(t-1), hidden_activations(t-2)))) = g(f(inp(t), f(inp(t-1), ... , f(inp(t=0), hidden_initial_state))))
As you might have surmised from the naming sense, the network retains a memory of its previous state, and the neuron activations are now also dependent on the previous network state and by extension on all states the network ever found itself to be in. Most RNNs employ a forgetfulness factor in order to attach more importance to more recent network states, but that is besides the point of your question.
Then, as you might surmise that it is computationally very, very expensive to calculate the gradients of the loss function with respect to network parameters if you have to consider backpropagation through all states since the creation of your network, there is a neat little trick to speed up your computation: approximate your gradients with a subset of historical network states num_steps.
If this conceptual discussion was not clear enough, you can also take a look at a more mathematical description of the above.
I found this diagram which helped me visualize the data structure.
From the image, 'batch size' is the number of examples of a sequence you want to train your RNN with for that batch. 'Values per timestep' are your inputs.' (in my case, my RNN takes 6 inputs) and finally, your time steps are the 'length', so to speak, of the sequence you're training
I'm also learning about recurrent neural nets and how to prepare batches for one of my projects (and stumbled upon this thread trying to figure it out).
Batching for feedforward and recurrent nets are slightly different and when looking at different forums, terminology for both gets thrown around and it gets really confusing, so visualizing it is extremely helpful.
Hope this helps.
RNN's "batch size" is to speed up computation (as there're multiple lanes in parallel computation units); it's not mini-batch for backpropagation. An easy way to prove this is to play with different batch size values, an RNN cell with batch size=4 might be roughly 4 times faster than that of batch size=1 and their loss are usually very close.
As to RNN's "time steps", let's look into the following code snippets from rnn.py. static_rnn() calls the cell for each input_ at a time and BasicRNNCell::call() implements its forward part logic. In a text prediction case, say batch size=8, we can think input_ here is 8 words from different sentences of in a big text corpus, not 8 consecutive words in a sentence.
In my experience, we decide the value of time steps based on how deep we would like to model in "time" or "sequential dependency". Again, to predict next word in a text corpus with BasicRNNCell, small time steps might work. A large time step size, on the other hand, might suffer gradient exploding problem.
def static_rnn(cell,
inputs,
initial_state=None,
dtype=None,
sequence_length=None,
scope=None):
"""Creates a recurrent neural network specified by RNNCell `cell`.
The simplest form of RNN network generated is:
state = cell.zero_state(...)
outputs = []
for input_ in inputs:
output, state = cell(input_, state)
outputs.append(output)
return (outputs, state)
"""
class BasicRNNCell(_LayerRNNCell):
def call(self, inputs, state):
"""Most basic RNN: output = new_state =
act(W * input + U * state + B).
"""
gate_inputs = math_ops.matmul(
array_ops.concat([inputs, state], 1), self._kernel)
gate_inputs = nn_ops.bias_add(gate_inputs, self._bias)
output = self._activation(gate_inputs)
return output, output
To visualize how these two parameters are related to the data set and weights, Erik Hallström's post is worth reading. From this diagram and above code snippets, it's obviously that RNN's "batch size" will no affect weights (wa, wb, and b) but "time steps" does. So, one could decide RNN's "time steps" based on their problem and network model and RNN's "batch size" based on computation platform and data set.

Translational equivariance and its relationship with convolutonal layer and spatial pooling layer

In the context of convolutional neural network model, I once heard a statement that:
One desirable property of convolutions is that they are
translationally equivariant; and the introduction of spatial pooling
can corrupt the property of translationally equivalent.
What does this statement mean, and why?
Most probably you heard it from Bengio's book. I will try to give you my explanation.
In a rough sense, two transformations are equivariant if f(g(x)) = g(f(x)). In your case of convolutions and translations means that if you convolve(translate(x)) it would be the same as if you translate(convolve(x)). This is desired because if your convolution will find an eye of a cat in an image, it will find that eye if you will shift the image.
You can see this by yourself (I use 1d conv only because it is easy to calculate stuff). Lets convolve v = [4, 1, 3, 2, 3, 2, 9, 1] with k = [5, 1, 2]. The result will be [27, 12, 23, 17, 35, 21]
Now let's shift our v by appending it with something v' = [8] + v. Convolving with k you will get [46, 27, 12, 23, 17, 35, 21]. As you the result is just a previous result prepended with some new stuff.
Now the part about spatial pooling. Let's do a max-pooling of size 3 on the first result and on the second one. In the first case you will get [27, 35], in the second [46, 35, 21]. As you see 27 somehow disappeared (result was corrupted). It will be more corrupted if you will take an average pooling.
P.S. max/min pooling is the most translationally invariant of all poolings (if you can say so, if you compare the number of non-corrupt elements).
A note on translation equivariant and invariant terms. These terms are different.
Equivariant translation means that a translation of input features results in an equivalent translation of outputs. This is desirable when we need to find the pattern rectangle.
Invariant translation means that a translation of input does not change the outputs at all.
Translation invariance is so important to achieve. This effectively means after learning a certain pattern in the lower-left corner of a picture our convnet can recognize the pattern anywhere (also in the upper right corner).
As we know just a densely connected network without convolution layers in-between cannot achieve translation invariance.
We need to introduce convolution layers to bring generalization power to the deep networks and learn representations with fewer training samples.

Cheat sheet for caffe / pycaffe?

Does anyone know whether there is a cheat sheet for all important pycaffe commands?
I was so far using caffe only via Matlab interface and terminal + bash scripts.
I wanted to shift towards using ipython and work through the ipython notebook examples. However I find it hard to get an overview of all the functions that are inside the caffe module for python. (I'm also quite new to python).
The pycaffe tests and this file are the main gateway to the python coding interface.
First of all, you would like to choose whether to use Caffe with CPU or GPU. It is sufficient to call caffe.set_mode_cpu() or caffe.set_mode_gpu(), respectively.
Net
The main class that the pycaffe interface exposes is the Net. It has two constructors:
net = caffe.Net('/path/prototxt/descriptor/file', caffe.TRAIN)
which simply create a Net (in this case using the Data Layer specified for training), or
net = caffe.Net('/path/prototxt/descriptor/file', '/path/caffemodel/weights/file', caffe.TEST)
which creates a Net and automatically loads the weights as saved in the provided caffemodel file - in this case using the Data Layer specified for testing.
A Net object has several attributes and methods. They can be found here. I will cite just the ones I use more often.
You can access the network blobs by means of Net.blobs. E.g.
data = net.blobs['data'].data
net.blobs['data'].data[...] = my_image
fc7_activations = net.blobs['fc7'].data
You can access the parameters (weights) too, in a similar way. E.g.
nice_edge_detectors = net.params['conv1'].data
higher_level_filter = net.params['fc7'].data
Ok, now it's time to actually feed the net with some data. So, you will use backward() and forward() methods. So, if you want to classify a single image
net.blobs['data'].data[...] = my_image
net.forward() # equivalent to net.forward_all()
softmax_probabilities = net.blobs['prob'].data
The backward() method is equivalent, if one is interested in computing gradients.
You can save the net weights to subsequently reuse them. It's just a matter of
net.save('/path/to/new/caffemodel/file')
Solver
The other core component exposed by pycaffe is the Solver. There are several types of solver, but I'm going to use only SGDSolver for the sake of clarity. It is needed in order to train a caffe model.
You can instantiate the solver with
solver = caffe.SGDSolver('/path/to/solver/prototxt/file')
The Solver will encapsulate the network you are training and, if present, the network used for testing. Note that they are usually the same network, only with a different Data Layer. The networks are accessible with
training_net = solver.net
test_net = solver.test_nets[0] # more than one test net is supported
Then, you can perform a solver iteration, that is, a forward/backward pass with weight update, typing just
solver.step(1)
or run the solver until the last iteration, with
solver.solve()
Other features
Note that pycaffe allows you to do more stuff, such as specifying the network architecture through a Python class or creating a new Layer type.
These features are less often used, but they are pretty easy to understand by reading the test cases.
Please note that the answer by Flavio Ferrara has a litte problem which may cause you waste a lot of time:
net.blobs['data'].data[...] = my_image
net.forward()
The code above is noneffective if your first layer is a Data type layer, because when net.forward() is called, it will begin from the first layer, and then your inserted data my_image will be covered. So it will show no error but give you totally irrelevant output. The correct way is to assign the start and end layer, for example:
net.forward(start='conv1', end='fc')
Here is a Github repository of Face Verification Experiment on LFW Dataset, using pycaffe and some matlab code. I guess it could help a lot, especially the caffe_ftr.py file.
https://github.com/AlfredXiangWu/face_verification_experiment
Besides, here are some short example code of using pycaffe for image classification:
http://codrspace.com/Jaleyhd/caffe-python-tutorial/
http://prog3.com/sbdm/blog/u011762313/article/details/48342495

Resources