Worse performance with transfer learning in convolutional neural network - initialization

I am doing emotion recognition with convolutional neural networks in MatConvNet. I have one main, large dataset (A, with 40,000 images) and two harder, smaller datasets (B and C, with 5,000 images each) with the same classes. When I run my network on dataset A with random weight initialization, I get 70% accuracy.
So I wanted to increase performance by initializing with weights pretrained on datasets B and C using the same network architecture. I only take the first three layers (conv, relu, pool) from the pretrained network when fine-tuning my network on dataset A. However, I get a lower result than with random weights. I also tried taking all the layers, the first six layers, and just the first layer.
Am I understanding and implementing this correctly? Instead of random weights in the first three layers (actually just in the first one, conv), I use the ones from the pretrained network. Now I am not sure whether I understand the concept well.
I use the following code for fine-tuning:
net = load('net-epoch-100.mat');
% Note: much higher learning rates were used when pretraining on datasets B and C.
trainOpts.learningRate = [0.004*ones(1,25), 0.002*ones(1,25), ...
                          0.001*ones(1,25), 0.0005*ones(1,25)];
net.layers = net.layers(1:end-13); % keep only the first three layers of the pretrained net
% ... the rest of the layers are defined here as before, with random initialization

"I only take three initial layers(conv,relu,pool) from pretrained network when finetuning my network on dataset A."
Since relu and pool are not trainable, you essentially only used one layer from pretrained network. The first conv layer just does some edge detection and does not capture any high-level visual concepts.
The best practice for transfer learning is using ImageNet pretrained features from high layers. You can first fine-tune it on your large dataset and then fine-tune it on your small dataset.

Related

Difference between Graph Neural Networks and GraphSage

What is the difference between basic Graph Convolutional Networks and GraphSage?
Which of the two methods is better suited to unsupervised learning, and in that case how is the loss function defined?
Please share the base papers for both methods.
Graph Convolutional Networks are inherently transductive, i.e. they can only generate embeddings for the nodes present in the fixed graph used during training.
This implies that if the graph evolves and new nodes (unseen during training) make their way into the graph, the model needs to be retrained in order to compute embeddings for the new nodes. This limitation makes transductive approaches ill-suited to ever-evolving graphs (such as social networks or protein-protein interaction networks) because of their inability to generalize to unseen nodes.
GraphSage, on the other hand, simultaneously exploits the rich node features and the topological structure of each node's neighborhood to generate representations for new nodes efficiently, without retraining. In addition, GraphSage performs neighborhood sampling, which gives it its unique ability to scale up to graphs with billions of nodes.
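For intuition, here is a minimal sketch of a single GraphSage layer with mean aggregation and neighborhood sampling, written in plain R on toy data; the dimensions, neighbor lists, and random weights are purely illustrative and not taken from any GraphSage implementation.

# One GraphSage layer on a toy graph: sample neighbors, aggregate, transform.
set.seed(1)
n_nodes  <- 10  # toy graph size
in_dim   <- 4   # input feature dimension
out_dim  <- 3   # output embedding dimension
n_sample <- 3   # neighbors sampled per node
X   <- matrix(rnorm(n_nodes * in_dim), n_nodes, in_dim)              # node features
adj <- lapply(1:n_nodes, function(i) setdiff(sample(n_nodes, 5), i)) # toy neighbor lists
W_self  <- matrix(rnorm(in_dim * out_dim), in_dim, out_dim)          # weights (learned in practice)
W_neigh <- matrix(rnorm(in_dim * out_dim), in_dim, out_dim)
graphsage_layer <- function(X, adj) {
  t(sapply(1:nrow(X), function(i) {
    nbrs    <- sample(adj[[i]], min(n_sample, length(adj[[i]])))     # neighborhood sampling
    h_neigh <- colMeans(X[nbrs, , drop = FALSE])                     # mean aggregation
    pmax(X[i, ] %*% W_self + h_neigh %*% W_neigh, 0)                 # combine with own features + ReLU
  }))
}
emb <- graphsage_layer(X, adj)

Because a node's embedding depends only on its own features and a sample of its neighbors, the same learned weights can be applied to nodes that were never seen during training, which is what makes the approach inductive.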
For more detail, you can follow this blog post: https://sachinsharma9780.medium.com/a-comprehensive-case-study-of-graphsage-algorithm-with-hands-on-experience-using-pytorchgeometric-6fc631ab1067
GCN paper: Kipf & Welling, Semi-Supervised Classification with Graph Convolutional Networks (2017)
GraphSage paper: Hamilton, Ying & Leskovec, Inductive Representation Learning on Large Graphs (2017)

Merging Tree Models from two random forest models into one random forest model in H2O in R

I am relatively new to the machine learning ocean, so please excuse me if some of my questions are really basic.
Current situation: the overall goal is to improve some code using the h2o package in R, running on a supercomputer cluster. However, since the data is so large that a single node with h2o takes more than a day, we have decided to use multiple nodes to run the model. I came up with an idea:
(1) Have each node build (nTree/num_node) trees and save them into a model;
(2) run these jobs on the cluster, one per node, so each node contributes (nTree/num_node) of the trees in the forest;
(3) merge the trees back together to re-form the original forest, and average the measurement results.
I later realized this could be risky, but I cannot find an actual statement for or against it, since I am not a machine-learning-focused programmer.
Questions:
If this way of handling a random forest carries some risk, please point me to a reference so I can get a basic idea of why it is not right.
If this is actually an "ok" way to do it, what should I do to merge the trees? Is there a package or method I can borrow?
If this is actually a solved problem, please point me to the link; I may have searched the wrong keywords. Thank you!
A concrete example with real numbers:
I have a random forest task with 80k rows and 2k columns, and I want 64 trees. What I have done is put 16 trees on each node, each running on the whole dataset, so each of the four nodes comes up with an RF model. I am now trying to merge the trees from each model into one big RF model and average the measurements (from each of those four models).
There is no need to merge the models. Unlike with boosting methods, every tree in a Random Forest is grown independently (just don't set the same seed prior to kicking off RF on each node!).
You are basically doing what Random Forest does on its own, which is to grow X independent trees and then average across the votes. Many packages provide an option to specify the number of cores or threads, in order to take advantage of this feature of RF.
In your case, since you have the same number of trees per node, you'll get 4 "models" back, but those are really just collections of 16 trees. To use them, I'd just keep the 4 models separate and, when you want a prediction, average the predictions from the 4 models. Assuming you're going to be doing that more than once, you could write a small wrapper function to predict with the 4 models and average the output, as sketched below.
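A hedged sketch of such a wrapper in R with h2o (the saved-model paths are hypothetical, and a regression problem is assumed; for classification you would average the per-class probability columns instead):

library(h2o)
h2o.init()
# Load the four forests that were trained separately on the four nodes (hypothetical paths).
model_paths <- c("rf_node1", "rf_node2", "rf_node3", "rf_node4")
models <- lapply(model_paths, h2o.loadModel)
# Predict with each 16-tree forest and average the outputs, which is what one
# 64-tree forest would do with its own trees anyway.
predict_ensemble <- function(models, newdata) {
  preds <- lapply(models, function(m) as.data.frame(h2o.predict(m, newdata))$predict)
  Reduce(`+`, preds) / length(preds)
}
# yhat <- predict_ensemble(models, as.h2o(test_df))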
80,000 rows by 2,000 columns is not overly large and should not take that long to train an RF model.
It sounds like something unexpected is happening.
While you can try to average models if you know what you are doing, I don't think it should be necessary in this case.

CNN AlexNet algorithm complexity

I'm a first-year student in machine learning and I only recently started immersing myself in it.
So, my professor gave me a task: calculate the number of:
matrix additions
matrix multiplications
matrix divisions
that are performed in the well-known convolutional neural network AlexNet.
I found some materials about it, but I am really confused about where to start.
The overall structure is the standard AlexNet architecture of five convolutional layers followed by three fully connected layers (the architecture figure is omitted here).
But how can I calculate the number of operations of each type separately?
It's a convolutional network. Convolutional layers share their weights across the image, which limits the number of parameters and keeps the computation manageable while still giving good results.
This particular network is described in many places and papers, so it's not difficult to get the figures for the number of parameters and the configuration of each layer involved.
But you need to start with an understanding of how a convolutional network works. I find this is a good place to start: http://cs231n.github.io/convolutional-networks/
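As a concrete starting point, the operation count for a single convolutional layer follows directly from its output size and kernel size. A small R sketch, using AlexNet's first layer as an example (227x227x3 input, 96 filters of size 11x11x3, stride 4, giving a 55x55x96 output) and ignoring biases:

# Each output value of a convolution is a dot product of k*k*c_in kernel weights
# with the corresponding input patch.
conv_ops <- function(h_out, w_out, n_filters, k, c_in) {
  mults <- h_out * w_out * n_filters * k * k * c_in        # one multiplication per weight per output
  adds  <- h_out * w_out * n_filters * (k * k * c_in - 1)  # additions needed to sum those products
  c(multiplications = mults, additions = adds)
}
conv_ops(h_out = 55, w_out = 55, n_filters = 96, k = 11, c_in = 3)
# roughly 105 million multiplications for conv1; repeat this per layer and sum over the network

Fully connected layers are simpler: a layer with n_in inputs and n_out outputs needs n_in*n_out multiplications and about as many additions. Divisions only appear in the normalization layers (AlexNet's local response normalization) and in the final softmax.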

Unchanged training accuracy in a convolutional neural network using MXNet

I am totally new to neural networks and want to classify almost 6,000 images that belong to different games (gathered by IR). I used the steps introduced in the following link, but I get the same training accuracy in every round.
Some info about the network architecture: 2 convolutional, activation, and pooling layers; activation type: relu; the numbers of filters in the first and second conv layers are 30 and 70 respectively;
2 fully connected layers with 500 and 2 hidden units respectively.
http://firsttimeprogrammer.blogspot.de/2016/07/image-recognition-in-r-using.html
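A rough sketch of the architecture described above in mxnet's R symbol API (the 5x5 kernels and 2x2 max-pooling are assumptions, since they are not specified above):

library(mxnet)
data  <- mx.symbol.Variable("data")
conv1 <- mx.symbol.Convolution(data = data, kernel = c(5, 5), num_filter = 30)
relu1 <- mx.symbol.Activation(data = conv1, act_type = "relu")
pool1 <- mx.symbol.Pooling(data = relu1, pool_type = "max", kernel = c(2, 2), stride = c(2, 2))
conv2 <- mx.symbol.Convolution(data = pool1, kernel = c(5, 5), num_filter = 70)
relu2 <- mx.symbol.Activation(data = conv2, act_type = "relu")
pool2 <- mx.symbol.Pooling(data = relu2, pool_type = "max", kernel = c(2, 2), stride = c(2, 2))
flat  <- mx.symbol.Flatten(data = pool2)
fc1   <- mx.symbol.FullyConnected(data = flat, num_hidden = 500)
relu3 <- mx.symbol.Activation(data = fc1, act_type = "relu")
fc2   <- mx.symbol.FullyConnected(data = relu3, num_hidden = 2)
net   <- mx.symbol.SoftmaxOutput(data = fc2)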
I had a similar problem, but for regression. After trying several things (different optimizers, varying layers and nodes, learning rates, iterations, etc.), I found that the way initial values are set helps a lot. For instance, I used a random normal initializer with a scale of 0.2 (initializer = mx.init.normal(0.2)).
I came upon this value from this blog (linked below). I recommend you read it. [EDIT] An excerpt from the same:
Weight initialization. Worry about the random initialization of the weights at the start of learning.
If you are lazy, it is usually enough to do something like 0.02 * randn(num_params). A value at this scale tends to work surprisingly well over many different problems. Of course, smaller (or larger) values are also worth trying.
If it doesn’t work well (say your neural network architecture is unusual and/or very deep), then you should initialize each weight matrix with the init_scale / sqrt(layer_width) * randn. In this case init_scale should be set to 0.1 or 1, or something like that.
Random initialization is super important for deep and recurrent nets. If you don’t get it right, then it’ll look like the network doesn’t learn anything at all. But we know that neural networks learn once the conditions are set.
Fun story: researchers believed, for many years, that SGD cannot train deep neural networks from random initializations. Every time they would try it, it wouldn’t work. Embarrassingly, they did not succeed because they used the “small random weights” for the initialization, which works great for shallow nets but simply doesn’t work for deep nets at all. When the nets are deep, the many weight matrices all multiply each other, so the effect of a suboptimal scale is amplified.
But if your net is shallow, you can afford to be less careful with the random initialization, since SGD will just find a way to fix it.
You’re now informed. Worry and care about your initialization. Try many different kinds of initialization. This effort will pay off. If the net doesn’t work at all (i.e., never “gets off the ground”), keep applying pressure to the random initialization. It’s the right thing to do.
http://yyue.blogspot.in/2015/01/a-brief-overview-of-deep-learning.html
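Concretely, with mxnet's R API the initializer is passed when the model is trained. A hedged sketch (the dummy data, iteration count, and learning rate are placeholders; net is the network symbol, e.g. the one sketched in the question above):

library(mxnet)
mx.set.seed(0)
# Dummy data just to make the sketch self-contained; replace with the real images and labels.
train.x <- array(runif(28 * 28 * 1 * 200), dim = c(28, 28, 1, 200))
train.y <- sample(0:1, 200, replace = TRUE)
model <- mx.model.FeedForward.create(
  net,                               # network symbol (conv/relu/pool/... as defined above)
  X = train.x, y = train.y,
  ctx = mx.cpu(),
  num.round = 50,
  array.batch.size = 100,
  learning.rate = 0.05,
  momentum = 0.9,
  eval.metric = mx.metric.accuracy,
  initializer = mx.init.normal(0.2)  # the initialization suggested above
)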

ProClus cluster analysis in R

For my thesis assignment I need to perform a cluster analysis on a high-dimensional data set containing purchase data from a retail store (over 1,000 dimensions). Because traditional clustering algorithms are not well suited to high dimensions (and dimension reduction is not really an option), I would like to try algorithms specifically developed for high-dimensional data (e.g. ProClus).
Here, however, my problem starts.
I have no clue what value I should use for the parameter d. Can anyone help me?
This is just one of the many limitations of ProClus.
The parameter d is the average dimensionality of your clusters. ProClus assumes there is a linear (subspace) cluster somewhere in your data. This will likely not hold for purchase data, but you can try. For sparse data such as purchases, I would rather focus on frequent itemset mining.
There is no universal clustering algorithm. Any clustering algorithm will come with a variety of parameters that you need to experiment with.
For cluster analysis it is essential that you can somehow visualize or analyze the result, to be able to find out whether and how well the method worked.
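If you still want to try ProClus, a minimal hedged sketch, assuming the R subspace package (whose ProClus() takes the number of clusters k and the average dimensionality d), is to scan a few values of d and inspect the resulting clusters:

library(subspace)  # CRAN package with ProClus and other subspace clustering algorithms
# data_mat: your numeric purchase matrix (rows = customers, columns = items) - placeholder name
for (d in c(5, 10, 20, 50)) {             # coarse scan over candidate average dimensionalities
  res <- ProClus(data_mat, k = 8, d = d)  # k = 8 clusters is an arbitrary example value
  sizes <- sapply(res, function(cl) length(cl$objects))
  cat("d =", d, "-> cluster sizes:", sizes, "\n")
}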
