I would like to set the default learning rate for my optimizer in Flux. I was looking at this example: https://fluxml.ai/Flux.jl/stable/training/optimisers/ and it appears that the interface to do so is through the update! function. Is this the way to set the learning rate, or are there other options as well?
As mentioned in the Flux.jl docs, there are a few different ways to set the learning rate. The rate is set when you construct the optimiser that you later pass to update!(). In the case of gradient descent:
Descent(η = 0.1):
Classic gradient descent optimiser with learning rate η. For each parameter
p and its gradient δp, this runs p -= η*δp
which means we can pass a learning rate (usually between 0.001 and 0.1) to the Descent constructor to set the LR.
There are a number of other optimisers for specific use cases, each taking the learning rate as an argument, and you can find those here: https://fluxml.ai/Flux.jl/stable/training/optimisers/#Optimiser-Reference
I'm working in R and trying to get started with neural networks, using the keras package.
I'd like to use a custom loss function for training my NN. It's possible to do this by writing the custom loss function as lossFn <- function(y_true, y_pred) { ... } and passing it to the compile method as model %>% compile(loss = lossFn, ...).
Now in order to use the gradient descent method of training the NN, the loss function needs to be differentiable. I understand that you'd usually accomplish this by restricting yourself to using backend functions in your loss function, e.g.
lossFn <- function(y_true, y_pred) {
  K <- backend()
  K$mean(K$square(y_true - y_pred), axis = 1L)
}
or something like that.
Now, my problem is that I cannot express my loss function this way; I need to use functions that aren't available in the backend.
So my idea was that I'd work out the gradient myself on paper, and then provide it to compile as another argument, say compile(loss = lossFn, gradient = gradientFn, ...), with gradientFn suitably defined.
The documentation for keras (the R package!) does not indicate that this is possible. At the same time, it does not suggest it's not. And googling has turned up little that is relevant.
So my question is, is it possible?
An addendum: since Google has suggested that there are other training methods for NNs that do not rely on the gradient of the loss function, I should add I'm not too hung up on the specific training method. My ultimate goal isn't to manually supply the gradient of a custom loss function, it's to use a custom loss function to train the NN. The gradient is just a technical obstacle for me right now.
Thanks!
This is certainly possible in Keras; you'll just have to move up the stack a little, implement a custom train_step method, and then call optimizer$apply_gradients() yourself.
Chapter 7 in the Deep Learning with R book covers this use case:
https://github.com/t-kalinowski/deep-learning-with-R-2nd-edition-code/blob/9f8b6d08dbb8d6565e4f5396e509aaea3e242b84/ch07.R#L608
Also, this keras guide may be useful, even though it's in Python and you're working in R. (The Python interface is very similar to the R interface).
https://keras.io/guides/writing_a_training_loop_from_scratch/
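Not the book's code, but a rough, untested sketch of the idea in R might look like the following. MSE and its hand-derived gradient stand in for your custom loss, the model and shapes are arbitrary, and output_gradients is the hook used to inject the manual gradient of the loss with respect to the predictions into the chain rule:

library(keras)
library(tensorflow)

# Stand-in loss (MSE) and its hand-derived gradient w.r.t. the predictions:
# d/d(y_pred) mean((y_true - y_pred)^2) = 2 * (y_pred - y_true) / n
loss_fn <- function(y_true, y_pred) tf$reduce_mean(tf$square(y_true - y_pred))
grad_fn <- function(y_true, y_pred)
  2 * (y_pred - y_true) / tf$cast(tf$size(y_pred), y_pred$dtype)

model <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = c(4)) %>%
  layer_dense(units = 1)

optimizer <- optimizer_adam(learning_rate = 1e-3)

train_step <- function(x, y) {
  # Record the forward pass so the tape can give us d(pred)/d(weights) ...
  with(tf$GradientTape() %as% tape, {
    y_pred <- model(x, training = TRUE)
  })
  # ... and chain it with the manual d(loss)/d(pred) via output_gradients.
  grads <- tape$gradient(y_pred, model$trainable_variables,
                         output_gradients = grad_fn(y, y_pred))
  # Pair up (gradient, variable) and let the optimizer apply the update.
  optimizer$apply_gradients(
    purrr::transpose(list(grads, model$trainable_variables)))
  loss_fn(y, y_pred)  # return the loss value for monitoring
}

# One manual update on a fake batch, just to show the call pattern.
x <- tf$random$normal(shape(8, 4))
y <- tf$random$normal(shape(8, 1))
train_step(x, y)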
I would like to use the AlexNet architecture to solve a regression problem; the architecture was originally designed for classification tasks.
Furthermore, I want to include a batch size parameter in the training step.
So I have several questions:
What do I need to change in the network architecture to perform regression? Specifically the last layer, the loss function, or something else?
If I use a batch size of 5, what is the output size of the last layer?
Thanks!
It would be helpful to share:
Q (Framework): Which deep learning framework are you working with? If possible, share the specific piece of code that you need help modifying.
A: e.g. TensorFlow, PyTorch, Keras, etc.
Q (Type of loss, output size): What is the task you are trying to achieve with regression? This would affect the kind of loss you want to use, the output dimension, how to fine-tune the network, etc.
A: e.g. Auto-colorization of grayscale images (here is an example) is a regression task, where you would try to regress the RGB channel pixel values from a monochrome image. You might use an L2 loss (or some other loss for improved performance). The output size is independent of the batch size; it is determined by the dimension of the output from the final layer (i.e. the prediction op). The batch size is a training parameter that you can change without having to alter the model architecture or output dimensions.
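That said, assuming you end up in Keras (to match the R keras discussion above), a minimal sketch of a regression head might look like this. The small convolutional base below is a placeholder, not AlexNet itself, and the input shape and layer sizes are arbitrary; the only parts that matter for regression are the final layer and the loss:

library(keras)

model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(64, 64, 3)) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_dense(units = 1)           # linear output for a single regressed value

model %>% compile(
  optimizer = "adam",
  loss = "mse"                     # L2 loss for regression
)

# With a batch size of 5, predictions have shape (5, 1): the batch size only
# affects the leading dimension, not the architecture.
x <- array(runif(5 * 64 * 64 * 3), dim = c(5, 64, 64, 3))
dim(predict(model, x))             # 5 1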
I'm using this LDA package for R. Specifically, I am trying to do supervised latent Dirichlet allocation (sLDA). In the linked package, there's an slda.em function. However, what confuses me is that it asks for alpha, eta and variance parameters. As far as I understand, these parameters are unknowns in the model. So my question is: did the author of the package mean to say that these are initial guesses for the parameters? If so, there doesn't seem to be a way of accessing them from the result of running slda.em.
Aside from coding the extra EM steps in the algorithm, is there a suggested way to guess reasonable values for these parameters?
Since you are trying to build a supervised model, the typical approach would be to use cross validation to determine the model parameters. So you hold out some of the data as your test set, train a model on the remaining data, and evaluate the model performance, repeating k times. You then continue to repeat with different model parameters to determine which ones result in the best model performance.
In the specific case of slda, I would run demo(slda) to see the author's implementation of it. When you run the demo, you'll see that he sets alpha=1.0, eta=0.1, and variance=0.25. I'd suggest using these as your starting point, and then use cross validation to determine better parameters if you need to improve model performance.
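For illustration, a cross-validation sketch over those parameters might look roughly like this. Here documents, vocab and annotations stand in for your corpus and response, the grid values are arbitrary, and the argument names follow what you see in demo(slda) (check ?slda.em and ?slda.predict against your package version):

library(lda)

K <- 10
grid <- expand.grid(alpha = c(0.5, 1.0, 2.0),
                    eta = c(0.05, 0.1, 0.2),
                    variance = c(0.1, 0.25, 0.5))

nfold <- 5
folds <- sample(rep(seq_len(nfold), length.out = length(documents)))

cv_mse <- function(alpha, eta, variance) {
  mean(sapply(seq_len(nfold), function(i) {
    train <- folds != i
    fit <- slda.em(documents = documents[train], K = K, vocab = vocab,
                   num.e.iterations = 10, num.m.iterations = 4,
                   alpha = alpha, eta = eta,
                   annotations = annotations[train],
                   params = rep(0, K),        # initial regression coefficients
                   variance = variance, method = "sLDA")
    pred <- slda.predict(documents[!train], fit$topics, fit$model,
                         alpha = alpha, eta = eta)
    mean((pred - annotations[!train])^2)      # held-out squared error
  }))
}

scores <- mapply(cv_mse, grid$alpha, grid$eta, grid$variance)
grid[which.min(scores), ]                      # best alpha/eta/variance found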
Let Y be a binary variable.
If we use logistic regression for modeling, then we can use cv.glm for cross validation and specify the cost function in the cost argument. By specifying the cost function, we can assign different unit costs to different types of errors: predicted Yes when the reference is No, or predicted No when the reference is Yes.
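For concreteness, the kind of cv.glm setup I mean might look roughly like this (mtcars standing in for the real data, and the costs of 5 and 1 are arbitrary):

library(boot)

# Asymmetric cost: predicting "No" when the reference is "Yes" costs 5,
# predicting "Yes" when the reference is "No" costs 1; call "Yes" above 0.5.
asym_cost <- function(y, prob) {
  yhat <- as.numeric(prob > 0.5)
  mean(ifelse(y == 1 & yhat == 0, 5,        # missed a Yes
       ifelse(y == 0 & yhat == 1, 1, 0)))   # false alarm
}

fit <- glm(am ~ mpg + wt, data = mtcars, family = binomial)
cv.glm(mtcars, fit, cost = asym_cost, K = 10)$delta   # CV estimate of the cost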
I am wondering if I could achieve the same in SVM. In other words, is there a way for me to specify a cost(loss) function instead of using built-in loss function?
Besides the answer by Yueguoguo, there are three more solutions: the standard wrapper approach, hyperplane (threshold) tuning, and the built-in option in e1071.
The wrapper approach (available out of the box, for example, in Weka) is applicable to almost all classifiers. The idea is to over- or undersample the data in accordance with the misclassification costs. The learned model, if trained to optimise accuracy, is then optimal under the costs.
The second idea is frequently used in text mining. SVM classifications are derived from the distance to the hyperplane. For linearly separable problems this distance is {1, -1} for the support vectors. The classification of a new example is then basically whether the distance is positive or negative. However, one can also shift this threshold: instead of making the decision at 0, move it, for example, towards 0.8. That way the classifications are shifted in one direction or the other, while the general shape of the data is not altered.
Finally, some machine learning toolkits have a built-in parameter for class-specific costs, like class.weights in the e1071 implementation. The name is due to the fact that the term cost is already taken by the SVM's C regularisation parameter.
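To make the last two options concrete, here is a small sketch with e1071; iris reduced to two classes stands in for real data, and the class weight of 5 and the shifted threshold of 0.8 are arbitrary choices:

library(e1071)

# Toy two-class problem standing in for the real data.
d <- droplevels(iris[iris$Species != "setosa", ])

# Class-specific costs via class.weights: misclassifying "versicolor" is made
# five times as expensive; "cost" itself is the usual C regularisation term.
fit_w <- svm(Species ~ ., data = d, kernel = "radial", cost = 1,
             class.weights = c(versicolor = 5, virginica = 1))
table(predicted = predict(fit_w, d), reference = d$Species)

# Shifting the decision threshold: cut the signed distance to the hyperplane
# at 0.8 instead of 0 (positive values favour the first class in the pair).
fit <- svm(Species ~ ., data = d, kernel = "radial")
dv <- attr(predict(fit, d, decision.values = TRUE), "decision.values")
shifted <- ifelse(dv[, 1] > 0.8, "versicolor", "virginica")
table(predicted = shifted, reference = d$Species)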
The loss function for the SVM hyperplane parameters is automatically tuned thanks to the beautiful theoretical foundation of the algorithm. SVM training typically applies cross-validation for tuning the hyperparameters. Say an RBF kernel is used: cross validation then selects the optimal combination of C (cost) and gamma (kernel parameter) for the best performance, measured by a chosen metric (e.g., mean squared error). In e1071, this can be done with the tune() method, where the range of hyperparameters as well as the cross-validation setting (i.e., 5-fold, 10-fold or more) can be specified.
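For illustration, a sketch of that tuning call might look like this (the parameter grids and the use of iris are arbitrary):

library(e1071)

set.seed(1)
# 10-fold cross validation over a grid of gamma and cost values.
tuned <- tune.svm(Species ~ ., data = iris,
                  gamma = 10^(-3:1), cost = 10^(-1:2),
                  tunecontrol = tune.control(cross = 10))
summary(tuned)
tuned$best.parameters   # the gamma/cost combination with the lowest CV error
tuned$best.model        # an svm refit with those parameters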
To obtain comparative cross-validation results by using Area-Under-Curve type of error measurement, one can train different models with different hyperparameter configurations and then validate the model against sets of pre-labelled data.
Hope the answer helps.
I have a complex objective function I am looking to optimize, and the optimization takes a considerable amount of time. Fortunately, I do have the gradient and the Hessian of the function available.
Is there an optimization package in R that can take all three of these inputs? The optim() function does not accept the Hessian. I have scanned the CRAN task view for optimization and nothing pops out.
For what it's worth, I am able to perform the optimization in MATLAB using fminunc with the 'GradObj' and 'Hessian' options.
I think the trust package, which does trust region optimization, will do the trick. From the documentation of trust, you see that
This function carries out a minimization or maximization of a function
using a trust region algorithm... (it accepts) an R function that
computes value, gradient, and Hessian of the function to be minimized
or maximized and returns them as a list with components value,
gradient, and hessian.
In fact, I think it uses the same algorithm used by fminunc.
By default fminunc chooses the large-scale algorithm if you supply the
gradient in fun and set GradObj to 'on' using optimset. This algorithm
is a subspace trust-region method and is based on the
interior-reflective Newton method described in [2] and [3]. Each
iteration involves the approximate solution of a large linear system
using the method of preconditioned conjugate gradients (PCG). See
Large Scale fminunc Algorithm, Trust-Region Methods for Nonlinear
Minimization and Preconditioned Conjugate Gradient Method.
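For example, a minimal sketch of calling trust() on the Rosenbrock function, with the gradient and Hessian coded by hand (argument names as in the trust documentation):

# install.packages("trust")
library(trust)

# Rosenbrock function in the list(value, gradient, hessian) form trust() expects.
objfun <- function(x) {
  x1 <- x[1]; x2 <- x[2]
  list(value    = 100 * (x2 - x1^2)^2 + (1 - x1)^2,
       gradient = c(-400 * x1 * (x2 - x1^2) - 2 * (1 - x1),
                     200 * (x2 - x1^2)),
       hessian  = matrix(c(1200 * x1^2 - 400 * x2 + 2, -400 * x1,
                           -400 * x1,                   200), nrow = 2))
}

# parinit is the starting point; rinit/rmax are the initial and maximum
# trust region radii.
fit <- trust(objfun, parinit = c(-1.2, 1), rinit = 1, rmax = 5)
fit$argument   # the minimiser, close to c(1, 1)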
Both stats::nlm() and stats::nlminb() accept analytical gradients and Hessians. Note, however, that the former (nlm()) currently does not update the analytical gradient correctly, but this is fixed in the current development version of R (since R-devel, svn rev 72555).
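A sketch of both call patterns, again on the Rosenbrock function:

# Objective, gradient and Hessian as separate functions.
f <- function(x) 100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2
g <- function(x) c(-400 * x[1] * (x[2] - x[1]^2) - 2 * (1 - x[1]),
                   200 * (x[2] - x[1]^2))
h <- function(x) matrix(c(1200 * x[1]^2 - 400 * x[2] + 2, -400 * x[1],
                          -400 * x[1], 200), nrow = 2)

# nlminb() takes them as separate arguments...
nlminb(c(-1.2, 1), f, gradient = g, hessian = h)$par

# ...while nlm() expects them as "gradient" and "hessian" attributes of the
# objective's return value.
f_attr <- function(x) structure(f(x), gradient = g(x), hessian = h(x))
nlm(f_attr, c(-1.2, 1))$estimate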