I would like to use the AlexNet architecture, which was originally designed for classification tasks, to solve a regression problem.
Furthermore, for the learning step I want to include a batch size parameter.
So I have several questions:
What do I need to change in the network architecture to perform regression? Specifically, is it the last layer, the loss function, or something else?
If I use a batch size of 5, what is the output size of the last layer?
Thanks!
It would be helpful to share:
Q Framework: Which deep learning framework are you working with? Sharing the specific piece of code you need help modifying would also be useful.
A: e.g. TensorFlow, PyTorch, Keras, etc.
Q Type of loss, output size: What task are you trying to achieve with regression? This affects the kind of loss you want to use, the output dimension, how you fine-tune the network, etc.
A: e.g. auto-colorization of grayscale images (here is an example) is a regression task, where you try to regress the RGB channel pixel values from a monochrome image. You might use an L2 loss (or some other loss for improved performance). The output size is independent of the batch size; it is determined by the dimension of the output of the final layer (i.e. the prediction op). The batch size is a training parameter that you can change without having to alter the model architecture or output dimensions.
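For concreteness, here is a minimal Keras-style sketch (the backbone below is only a stand-in for the real AlexNet, and all layer sizes are illustrative) of the two changes that matter: a single linear output unit instead of the 1000-way softmax, and an MSE (L2) loss instead of cross-entropy. It also shows that with a batch of 5 inputs the prediction shape is (5, 1), i.e. the batch size never changes the per-sample output dimension:

```python
# Minimal sketch, not real AlexNet: a tiny conv backbone with a regression head.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(227, 227, 3)),           # AlexNet-style input size
    layers.Conv2D(64, 11, strides=4, activation="relu"),
    layers.MaxPooling2D(3, strides=2),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(1, activation="linear"),          # regression head: one linear unit
])

model.compile(optimizer="adam", loss="mse")        # L2 loss instead of cross-entropy

batch = np.random.rand(5, 227, 227, 3).astype("float32")
print(model.predict(batch).shape)                  # -> (5, 1): one value per sample
```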
I am using the h2o package to train a GBM for a churn prediction problem.
All I want to know is what influences the size of the fitted model saved on disk (via h2o.saveModel()), but unfortunately I wasn't able to find an answer anywhere.
More specifically, when I tune the GBM to find the optimal hyperparameters (via h2o.grid()) on 3 non-overlapping rolling windows of the same length, I obtain models whose sizes are not comparable (i.e. 11 MB, 19 MB and 67 MB). The hyperparameter grid is the same, and the training set sizes are comparable.
Naturally the resulting optimized hyperparameters are different across the 3 intervals, but I cannot see how this could produce such a difference in the model sizes.
Moreover, when I train the actual models based on those hyperparameter sets, I end up with models of different sizes as well.
Any help is appreciated!
Thank you.
PS: I'm sorry, but I cannot share any dataset to make this reproducible (due to privacy restrictions).
It’s the two things you would expect: the number of trees and the depth.
But it also depends on your data: for GBM, individual trees can be cut short depending on how cleanly the data splits.
What I would do is export MOJOs and then visualize them as described in the document below to get more details on what was really produced:
http://docs.h2o.ai/h2o/latest-stable/h2o-genmodel/javadoc/index.html
Note that a model in the 60 MB range does not seem overly large, in general.
If you look at the model info you will find out things about the number of trees, their average depth, and so on. Comparing those between the three best models should give you some insight into what is making the models large.
From R, if m is your model, just printing it gives you most of that information. str(m) gives you all the information that is held.
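If you happen to use the Python client rather than R, printing the model gives the same summary; here is a rough sketch (the file name, column name and parameter values below are made up for illustration):

```python
# Rough sketch with the h2o Python client; paths, columns and parameters are made up.
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()

train = h2o.import_file("churn_train.csv")            # hypothetical training file
train["churn"] = train["churn"].asfactor()            # make the response categorical

gbm = H2OGradientBoostingEstimator(ntrees=200, max_depth=10, seed=42)
gbm.train(x=[c for c in train.columns if c != "churn"],
          y="churn", training_frame=train)

# Printing the model shows the model summary: actual number of trees,
# min/mean/max depth and leaf counts -- the figures that drive the file size.
print(gbm)

h2o.save_model(gbm, path="models/")                   # the binary model on disk
gbm.download_mojo(path="mojos/")                      # MOJO for the visualization above
```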
I think it is worth investigating. The cause is probably that two of those data windows are relatively clear-cut, and only a few fields are needed to define the trees, whereas the third window of data is more chaotic (in the mathematical sense), and you get some deep trees being made as the algorithm tries to split it apart into decision trees.
Looking into that third window more deeply might suggest some data engineering you could do that would make it easier to learn. Or it might be a difference in your data. E.g. one column is all NULL in your 2016 and 2017 data but not in your 2018 data, because 2018 was the year you started collecting it, and it is that extra column that allows/causes the trees to become deeper.
Finally, maybe the grid hyperparameters are unimportant with regard to performance, and this is a difference due to noise. E.g. you have max_depth as a hyperparameter, but its influence on MSE is minor and noise is a large factor. These random differences could let your best model stop at depth 5 for two of your data sets (while the 2nd best model was 0.01% worse but went to depth 20), yet go to depth 30 for your third data set (while the 2nd best model was 0.01% worse but only went to depth 5).
(If I understood your question correctly, you've eliminated this as a possibility, as you then trained all three data sets on the same hyperparameters? But I thought I'd include it, anyway.)
Given an indicator function that maps a finite set X to {0, 1}, and assuming the exact mapping is unknown, can I learn how each element of X contributes to the output simply by generating random samples of X and using ML?
In general, yes. This is a very basic machine learning problem: binary classification, supervised learning.
Use random samples of X and the indicator classification (0 | 1) as your data set. You can feed this to a component analysis tool, or train a model and then look at the learned function (which, ideally, will approximate your indicator function).
Learning to use these tools will take some study on your part. There is probably an ML or statistics package in your favourite language that can help. Since Stack Overflow is not really a tutorial site, I'll leave off here and let you get to those search terms and your next round of reading.
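As a rough illustration of the idea (the hidden rule, sample counts and model choice below are all made up), you can generate random membership vectors over X, label them with the unknown indicator, fit a simple classifier and read off which elements carry weight:

```python
# Toy sketch: recover which elements of X drive an unknown indicator function.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_elements = 10                                        # |X|; column i = element i present?
samples = rng.integers(0, 2, size=(5000, n_elements))

# Hidden rule we pretend not to know: elements 2 and 7 determine the output.
labels = ((samples[:, 2] == 1) & (samples[:, 7] == 0)).astype(int)

clf = LogisticRegression(max_iter=1000).fit(samples, labels)

# Large-magnitude coefficients point at the elements that matter.
for i, w in enumerate(clf.coef_[0]):
    print(f"element {i}: weight {w:+.2f}")
```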
When running RandomForest, is there a way to use the number of rows and columns of the input data, plus the options of the forest (trees and tries), to calculate the size of the forest (in bytes) before it's run?
The specific issue I'm having is that when running my final RandomForest (as opposed to an exploratory one), I want as robust a model as possible. I want to run right up to my memory limit without hitting it. Right now I'm just doing trial and error, but I'm looking for a more precise way.
I want to run right up to my memory limit without hitting it.
Why do you want to do that? Instead of pushing your resources to the limit, just use whatever resources are required to build a good random forest model. In my experience I have rarely run into memory limit problems when running random forests, because I train on a reasonably sized subset of the actual data set.
The randomForest function (from the randomForest package) has two parameters that influence how large the forest will become. The first is ntree, the number of trees used to build the forest: the fewer the trees, the smaller the model. The other is nodesize, which sets the minimum number of observations in each leaf node of each tree: the smaller the node size, the more splitting has to be done in each tree, and the larger the forest model.
You should experiment with these parameters, and also train on a reasonably-sized training set. The metric for a good model is not how close you come to maxing out your memory limit, but rather how robust a model you build.
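If you want a feel for how those two knobs translate into bytes, the quickest route is to measure it empirically on a sample of your data. Here is a sketch of that idea (using scikit-learn in Python, whose n_estimators and min_samples_leaf play roughly the roles of ntree and nodesize; in R you could do the same and check object.size() on the fitted forest):

```python
# Empirical sketch: how forest size scales with tree count and leaf size.
# Synthetic data; swap in a sample of your own data set.
import pickle
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))
y = (X[:, 0] + rng.normal(size=10_000) > 0).astype(int)

for n_trees in (50, 500):
    for min_leaf in (1, 50):
        rf = RandomForestClassifier(n_estimators=n_trees,
                                    min_samples_leaf=min_leaf,
                                    n_jobs=-1, random_state=0).fit(X, y)
        size_mb = len(pickle.dumps(rf)) / 1e6          # serialized size in MB
        print(f"trees={n_trees:4d}  min_leaf={min_leaf:3d}  ~{size_mb:.1f} MB")
```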
I am trying to understand how a neural network can predict different outputs by learning different input/output patterns. I know that weight changes are the mechanism of learning, but if an input brings about weight adjustments to achieve a particular output in the backpropagation algorithm, won't this knowledge (the weight updates) be knocked out when the network is presented with a different input pattern, thus making the network forget what it had previously learnt?
The key to avoiding "destroying" the network's current knowledge is to set the learning rate to a sufficiently low value.
Let's take a look at the mathematics for a perceptron. The weight update rule is w_i := w_i + lr * (t - o) * x_i, where lr is the learning rate, t the target output and o the perceptron's actual output.
The learning rate is always specified to be < 1. This forces the backpropagation algorithm to take many small steps towards the correct setting, rather than jumping there in large steps. The smaller the steps, the easier it is to "jitter" the weight values into the right place.
If, on the other hand, we used a learning rate of 1, we could start to experience trouble with convergence, as you mentioned. A high learning rate implies that backpropagation always prefers to satisfy the most recently observed input pattern.
Trying to adjust the learning rate to a "perfect value" is unfortunately more of an art than a science. There are, of course, implementations with adaptive learning rates; refer to this tutorial from Willamette University. Personally, I've just used a static learning rate in the range [0.03, 0.1].
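As a small illustration of the update rule above (the data and the two learning rates are made up), here is a tiny perceptron trained on two patterns presented alternately; each presentation only nudges the weights by lr * (t - o) * x, so a small rate does not wipe out what earlier patterns contributed:

```python
# Tiny perceptron sketch of the update rule w_i := w_i + lr * (t - o) * x_i.
import numpy as np

def train_perceptron(patterns, targets, lr, epochs=50):
    w = np.zeros(patterns.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(patterns, targets):
            o = 1.0 if w @ x + b > 0 else 0.0
            w += lr * (t - o) * x              # small lr -> small corrections
            b += lr * (t - o)
    return w, b

# Two patterns with different targets, presented over and over.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
t = np.array([1.0, 0.0])

for lr in (1.0, 0.05):
    w, b = train_perceptron(X, t, lr)
    preds = [1.0 if w @ x + b > 0 else 0.0 for x in X]
    print(f"lr={lr}: weights={np.round(w, 2)}, predictions={preds}")
```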
I have my own implementation of the Expectation Maximization (EM) algorithm based on this paper, and I would like to compare its performance with that of another implementation. For the tests, I am using k centroids with 1 GB of text data, and I am just measuring the time it takes to compute the new centroids in one iteration. I tried it with an EM implementation in R, but I couldn't, since the result is plotted in a graph and the run gets stuck when there is a large amount of text data. I was following the examples here.
Does anybody know of an EM implementation whose performance can be measured, or know how to do this with R?
Fair benchmarking of EM is hard. Very hard.
The initialization usually involves randomness and can differ a lot. For all I know, the R implementation by default uses hierarchical clustering to find the initial clusters, which comes at O(n^2) memory and most likely O(n^3) runtime cost. In my benchmarks, R would run out of memory because of this. I assume there is a way to specify initial cluster centers/models. A random-objects initialization will of course be much faster. In practice, k-means++ is probably a good way to choose initial centers.
EM theoretically never terminates. It just, at some point, does not change much anymore, so you set a threshold to stop. However, the exact definition of the stopping threshold varies.
There exist all kinds of model variations. A method only using fuzzy assignments, such as fuzzy c-means, will of course be much faster than an implementation using multivariate Gaussian mixture models with covariance matrices, in particular at higher dimensionality.
Covariance matrices also need O(k * d^2) memory, and inverting them takes O(k * d^3) time, so they are clearly not appropriate for text data.
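To put a rough number on that (the k and d below are made up; text vocabularies are often even larger):

```python
# Back-of-the-envelope memory cost of k full covariance matrices (8-byte doubles).
k, d = 10, 50_000
cov_bytes = k * d**2 * 8
print(f"~{cov_bytes / 1e9:.0f} GB just for the covariance matrices")   # ~200 GB
```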
Data may or may not be appropriate. If you run EM on a data set that actually has Gaussian clusters, it will usually work much better than on a data set that doesn't provide a good fit at all. When there is no good fit, you will see a high variance in runtime even with the same implementation.
For a start, try running your own algorithm several times with different initializations and check the runtime for variance. How large is the variance compared to the total runtime?
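A sketch of that variance check (synthetic Gaussian data and scikit-learn's GaussianMixture stand in here for your own data set and implementation):

```python
# Time several EM runs that differ only in their random initialization.
import time
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc=c, size=(10_000, 10)) for c in (-3, 0, 3)])

runtimes = []
for seed in range(5):
    gm = GaussianMixture(n_components=3, init_params="random",
                         max_iter=100, tol=1e-4, random_state=seed)
    start = time.perf_counter()
    gm.fit(data)
    runtimes.append(time.perf_counter() - start)

print(f"mean {np.mean(runtimes):.2f}s, spread {np.ptp(runtimes):.2f}s over 5 runs")
```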
You can try benchmarking against the EM implementation in ELKI. But I doubt the implementation will work with sparse data such as text - that data simply is not Gaussian, so it is not a proper benchmark. Most likely it will not be able to process the data at all because of this; that is expected and can be explained from theory. Try to find data sets that are dense and can be expected to have multiple Gaussian clusters (sorry, I can't give you many recommendations here; the classic Iris and Old Faithful data sets are too small to be useful for benchmarking).