What does `train!()` do in Flux.jl? - julia

In some machine learning frameworks, the train function might not actually do the training itself, and instead just set the mode (i.e. just making sure the model and such are ready to train). Is this the case with the train function in Flux or does the train!() function actually do the training?

According to the Flux.jl docs, the train!() function does indeed do the actual training. The function signature looks like: train!(loss, params, data, opt; cb) where:
For each datapoint d in data, compute the gradient of loss with respect to params through backpropagation and call the optimizer opt.
If d is a tuple of arguments to loss call loss(d...), else call loss(d).
A callback is given with the keyword argument cb. For example, this will print "training" every 10 seconds (using Flux.throttle): train!(loss, params, data, opt, cb = throttle(() -> println("training"), 10))
The callback can call Flux.stop to interrupt the training loop.
Multiple optimisers and callbacks can be passed to opt and cb as arrays.
Another example: @epochs 2 Flux.train!(loss, ps, dataset, opt), which runs 2 training epochs. You can find more in the Flux transfer learning tutorial.
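To make the description of the loop concrete, here is a minimal plain-Julia sketch of the behaviour the docs describe. It is not Flux's implementation: toy_train! and grad are hypothetical names, the gradient is written by hand instead of coming from backpropagation, and the "optimiser" is assumed to be plain gradient descent with step size eta.

```julia
# Toy stand-in for the loop described above; NOT Flux's actual implementation.
# `grad` is a hypothetical hand-written gradient function replacing backpropagation,
# and the update is plain gradient descent with step size `eta`.
function toy_train!(grad, params::Vector{Float64}, data, eta::Float64; cb = () -> nothing)
    for d in data
        # If d is a tuple, splat it into the arguments; otherwise pass it as is.
        g = d isa Tuple ? grad(params, d...) : grad(params, d)
        params .-= eta .* g   # the "optimiser" step
        cb()                  # callback after every datapoint
    end
    return params
end

# Model: y ≈ w * x with loss (w*x - y)^2, so d(loss)/dw = 2 * (w*x - y) * x.
grad(p, x, y) = [2 * (p[1] * x - y) * x]

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # samples of y = 2x
w = [0.0]
steps = Ref(0)
for epoch in 1:200
    # One call is one pass over the data, i.e. one epoch.
    toy_train!(grad, w, data, 0.01; cb = () -> (steps[] += 1))
end
```

The point of the sketch is that a single call already performs the parameter updates for every datapoint; repeating the call (here in a for loop) is what gives multiple epochs.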

Related

Julia Flux: writing a regularizer depending on the provided regularization coefficients

As a way to get to know Julia, I am writing a script that converts a Python Keras (v1.1.0) model to a Julia Flux model, and I am struggling with implementing regularization (I have read https://fluxml.ai/Flux.jl/stable/models/regularisation/).
So, in Keras's json model I have something like: "W_regularizer": {"l2": 0.0010000000474974513, "name": "WeightRegularizer", "l1": 0.0} for each Dense layer. I want to use these coefficients to create regularization in the Flux model. The problem is that, in Flux it is added directly to the loss instead of being defined as a property of the layer itself.
To avoid posting too much code here, I've added it to the repo. Here is a small script that takes the JSON and creates a Flux Chain: https://github.com/iegorval/Keras2Flux.jl/blob/master/Keras2Flux/src/Keras2Flux.jl
Now, I want to create a penalty for each Dense layer with the predefined l1/l2 coefficient. I tried to do it like this:
using Pkg
pkg"activate /home/username/.julia/dev/Keras2Flux"
using Flux
using Keras2Flux
using LinearAlgebra
function get_penalty(model::Chain, regs::Array{Any, 1})
    index_model = 1
    index_regs = 1
    penalties = []
    for layer in model
        if layer isa Dense
            println(regs[index_regs](layer.W))
            penalty(m) = regs[index_regs](m[index_model].W)
            push!(penalties, penalty)
            #println(regs[i])
            index_regs += 1
        end
        index_model += 1
    end
    total_penalty(m) = sum([p(m) for p in penalties])
    println(total_penalty)
    println(total_penalty(model))
    return total_penalty
end
model, regs = convert_keras2flux("examples/keras_1_1_0.json")
penalty = get_penalty(model, regs)
So, I create a penalty function for each Dense layer and then sum it up to the total penalty. However, it gives me this error:
ERROR: LoadError: BoundsError: attempt to access 3-element Array{Any,1} at index [4]
I understand what it means but I really don't understand how to fix it. So, it seems that when I call total_penalty(model), it uses index_regs == 4 (so, the values of index_regs and index_model as they are AFTER the for-cycle). Instead, I want to use their actual indices that I had while pushing the given penalty to the list of penalties.
On the other hand, if I did it not as a list of functions but as a list of values, it also would not be correct, because I will define the loss as:
loss(x, y) = binarycrossentropy(model(x), y) + total_penalty(model). If I were to use just a list of values, I would have a static total_penalty, while it should be recalculated for every Dense layer every time during model training.
I would be thankful if somebody with Julia experience could give me some advice, because I am definitely failing to understand how this works in Julia and, specifically, in Flux. How would I create a total_penalty that is recalculated automatically during training?
There are a couple parts to your question, and since you are new to Flux (and Julia?), I will answer in steps. But I suggest the solution at the end as a cleaner way to handle this.
First, there is the issue of p(m) calculating the penalty using index_regs and index_model as the values after the for-loop. This is because of the scoping rules in Julia. When you define the closure penalty(m) = regs[index_regs](m[index_model].W), index_regs is bound to the variable defined in get_penalty. So, as index_regs changes, so does the output of p(m). The other issue is the naming of the function as penalty(m). Every time you run this line, you are redefining penalty and all references to it that you pushed onto penalties. Instead, you should prefer to create an anonymous function. Here is how we incorporate these changes:
function get_penalty(model::Chain, regs::Array{Any, 1})
    index_model = 1
    index_regs = 1
    penalties = []
    for layer in model
        if layer isa Dense
            println(regs[index_regs](layer.W))
            penalty = let i = index_regs, index_model = index_model
                m -> regs[i](m[index_model].W)
            end
            push!(penalties, penalty)
            index_regs += 1
        end
        index_model += 1
    end
    total_penalty(m) = sum([p(m) for p in penalties])
    return total_penalty
end
I used i and index_model in the let block to drive home the scoping rules. I'd encourage you to replace the anonymous function in the let block with global penalty(m) = ... (and remove the assignment to penalty before the let block) to see the difference of using anonymous vs named functions.
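The capture behaviour can be reproduced without Flux. The following is a small sketch in plain Julia; the function names shared_capture and let_capture are made up for illustration. The first function mimics the original bug (every closure sees the final value of the counter), the second uses a let block to freeze the value at the time the closure is created.

```julia
# Closures capture variables, not values: all three closures share the same `i`.
function shared_capture()
    i = 1
    fs = []
    for _ in 1:3
        push!(fs, () -> i)
        i += 1
    end
    return [f() for f in fs]
end

# A `let` block introduces a fresh binding `j` on each iteration,
# so each closure keeps the value `i` had when it was created.
function let_capture()
    i = 1
    fs = []
    for _ in 1:3
        g = let j = i
            () -> j
        end
        push!(fs, g)
        i += 1
    end
    return [f() for f in fs]
end

shared_result = shared_capture()   # [4, 4, 4]
let_result = let_capture()         # [1, 2, 3]
```

In the first version the closures are only evaluated after the loop has finished, when i is already 4. This is the same reason the original code indexed regs at position 4.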
But, if we go back to your original issue, you want to calculate the regularization penalty for your model using the stored coefficients. Ideally, these would be stored with each Dense layer as in Keras. You can recreate the same functionality in Flux:
using Flux, Functors, LinearAlgebra
struct RegularizedDense{T, LT<:Dense}
    layer::LT
    w_l1::T
    w_l2::T
end
@functor RegularizedDense
(l::RegularizedDense)(x) = l.layer(x)
penalty(l) = 0
penalty(l::RegularizedDense) =
    l.w_l1 * norm(l.layer.W, 1) + l.w_l2 * norm(l.layer.W, 2)
penalty(model::Chain) = sum(penalty(layer) for layer in model)
Then, in your Keras2Flux source, you can redefine get_regularization to return w_l1_reg and w_l2_reg instead of functions. And in create_dense you can do:
function create_dense(config::Dict{String,Any}, prev_out_dim::Int64=-1)
    # ... code you have already written
    dense = Dense(in, out, activation; initW = init, initb = zeros)
    w_l1, w_l2 = get_regularization(config)
    return RegularizedDense(dense, w_l1, w_l2)
end
Lastly, you can compute your loss function like so:
loss(x, y, m) = binarycrossentropy(m(x), y) + penalty(m)
# ... later for training (opt is the optimiser of your choice)
train!((x, y) -> loss(x, y, m), params(m), training_data, opt)
We define loss as a function of (x, y, m), rather than referring to a global model inside loss, to avoid performance issues.
So, in the end, this approach is cleaner because after model construction, you don't need to pass around an array of regularization functions and figure out how to index each function correctly with the corresponding dense layer.
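If you want to try the dispatch pattern behind penalty in isolation, here is a minimal sketch that runs without Flux. ToyLayer and ToyRegularized are hypothetical stand-ins for Dense and RegularizedDense, a plain tuple stands in for the Chain, and the coefficients are arbitrary.

```julia
using LinearAlgebra  # provides norm

# Hypothetical stand-ins for Flux layers, so the sketch runs without Flux.
struct ToyLayer
    W::Matrix{Float64}
end

struct ToyRegularized
    layer::ToyLayer
    w_l1::Float64
    w_l2::Float64
end

# Fallback: layers without stored coefficients contribute nothing.
penalty(l) = 0.0
# Wrapped layers use their own coefficients.
# Note: norm(W, p) on a matrix is the entrywise p-norm, not an operator norm.
penalty(l::ToyRegularized) =
    l.w_l1 * norm(l.layer.W, 1) + l.w_l2 * norm(l.layer.W, 2)
# Sum over any iterable of layers (a tuple here, a Chain in Flux).
total_penalty(layers) = sum(penalty(l) for l in layers)

W = [3.0 0.0; 0.0 4.0]   # entrywise 1-norm is 7, entrywise 2-norm is 5
layers = (ToyRegularized(ToyLayer(W), 0.1, 0.01), ToyLayer(W))
result = total_penalty(layers)   # 0.1 * 7 + 0.01 * 5 + 0
```

Because the penalty is computed from the layer's current weights each time it is called, it is recalculated on every evaluation of the loss, which is what the question asked for.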
If you prefer to keep the regularizer and model separate (i.e. have standard Dense layers in your model chain), then you can do that too. Let me know if you want that solution, but I'll leave it out for now.

Double discrete integration of periodic function with R: doubly integrated function contains linear artifact

I need to integrate a signal from an accelerometer in order to get speed and position over time.
I'm trying the code on some generated acceleration data:
1) square wave
2) sawtooth
3) sine
The speed function obtained is OK; the problem is with the position function obtained by integrating speed. In each case (square wave, sawtooth, sine) the doubly discrete-integrated function shows a linear term superimposed on the expected oscillating one.
I've performed this discrete integration with both the diffinv() function and this custom function I've written:
# function that, given a function sampled at some time values, calculates its primitive
calculatePrimitive <- function(f_t, time, initialValue) {
    F_t <- 0
    F_t[1] <- initialValue
    for (i in 2:length(f_t)) {
        F_t[i] <- F_t[i-1] + (( (f_t[i]+f_t[i-1])/2 )*(time[i]-time[i-1]) )
    }
    F_t
}
The result is the same no matter which function I use to perform the discrete integration, and it is shown in the attached graphs for cases 1) to 3).
I don't understand why this happens whenever, regardless of the acceleration data, the discrete integration is applied to data that were themselves obtained by discrete integration.

Optimizing a non differentiable function in R

There are two methods I am experimenting with for minimizing a cost function. The first is optim() and the second is optim_nm(), part of the optimization package. The problem I am facing is that my error function takes 2 parameters:
A list of variable parameters the optimization function needs to modify
A set of fixed parameters
optim(par = variableParameters, fn = error_function, par2 = fixedParameters)
optim handles this well because it takes the variable parameters first, then the function, and then a set of optional parameters through which I can pass the fixed parameters. This works; however, the function is slow.
optim_nm(fun = error_function, k = 5, start = variable_parameters)
optim_nm allows me to tune the optimization function; however, I'm unsure how to pass the fixed parameters. All the examples in the documentation use only variable parameters.
Both methods implement the Nelder and Mead algorithm, which is robust for non-differentiable error functions, and that is what I require. If there are other packages that do this fast, please mention them too!
If someone has used this, or can better interpret the documentation I could use your help.
optim_nm Documentation
optim documentation
Create a wrapper function that fills in the values for the fixed parameters:
error_function <- function(variableParameters, fixedParameters) {
    ...
}
wrapper <- function(x) {
    error_function(x, fixedParameters = 3)
}
optim_nm(fun = wrapper,
         k = 5,
         start = initial_parameter_values)
If error_function is expensive to evaluate, you may want to look into Bayesian optimization with the rBayesianOptimization or mlrMBO packages.

how to change the values of convergence in optim in r

I am working with a big and complex function. I am using optim to estimate the model parameters. From the iteration values printed by optim, I see that it does not converge even when the current and previous values are very close.
For example,
iteration 10 400.0091
iteration 20 400.0092
iteration 30 400.0093
:
:
and it keeps going, say up to iteration 1200.
So how can I change the convergence criterion of optim, so that it converges when the current iteration is very close to the previous one?
You are looking for abstol or reltol, which are components of the control argument.
See ?optim for more details. I can't recommend one without an example or context for your question, but your call will look something like:
optim(par, fn, [other vars?], control = list(reltol = 1e-5))

R: SVM performance using custom kernel (user defined kernel) is not working in kernlab

I'm trying to use a user-defined kernel. I know that kernlab offers user-defined kernels (custom kernel functions) in R. I used the spam data set included in the kernlab package.
(number of variables = 57, number of examples = 4061)
I defined the kernel as follows:
kp=function(d,e){
    as=v*d
    bs=v*e
    cs=as-bs
    cs=as.matrix(cs)
    exp(-(norm(cs,"F")^2)/2)
}
class(kp)="kernel"
It is a transformed version of the Gaussian kernel, where v is a vector of continuously changing values that are the inverses of the standard deviations of each variable, for example:
v=(0.1666667,........0.1666667)
The training set is defined as 60% of the spam data (preserving the proportions of the different classes).
If a data point's type is spam, then its label is 1 for training the SVM:
m=ksvm(xtrain,ytrain,type="C-svc",kernel=kp,C=10)
But this step does not work. It keeps waiting for a response.
So why does this happen? Is it because the number of examples is too big? Is there any other R package that can train SVMs with a user-defined kernel?
First, your kernel looks like a classic RBF kernel with v = 1/sigma, so why do you use it? You can use the built-in RBF kernel and simply set the sigma parameter. In particular, instead of using the Frobenius norm on matrices you could use the classic Euclidean norm on the vectorized matrices.
Second, this is working just fine:
> xtrain = as.matrix( c(1,2,3,4) )
> ytrain = as.factor( c(0,0,1,1) )
> v= 0.01
> m=ksvm(xtrain,ytrain,type="C-svc",kernel=kp,C=10)
> m
Support Vector Machine object of class "ksvm"
SV type: C-svc (classification)
parameter : cost C = 10
Number of Support Vectors : 4
Objective Function Value : -39.952
Training error : 0
There are at least two reasons why you are still waiting for results:
RBF kernels induce the hardest problem to optimize for SVM (especially for large C)
User-defined kernels are far less efficient than built-in ones
As I am not sure whether ksvm actually optimizes the user-defined kernel computation (in fact I'm pretty sure it does not), you could try to build the kernel matrix (K[i,j] = K(x_i, x_j), where x_i is the i-th training vector) and provide ksvm with it. You can achieve this with
K <- kernelMatrix(kp,xtrain)
m <- ksvm(K,ytrain,type="C-svc",kernel='matrix',C=10)
Precomputing the kernel matrix can be quite a long process, but the optimization itself will then be much faster, so it is a good method if you want to test many different C values (which you certainly should). Unfortunately this requires O(n^2) memory, so if you use more than 100 000 vectors, you will need a really large amount of RAM.
