Julia Flux: writing a regularizer depending on the provided regularization coefficients - julia

I am writing a script converting Python's Keras (v1.1.0) model to Julia's Flux model, and I am struggling with implementing regularization (I have read https://fluxml.ai/Flux.jl/stable/models/regularisation/) as a way to get to know Julia.
So, in Keras's json model I have something like: "W_regularizer": {"l2": 0.0010000000474974513, "name": "WeightRegularizer", "l1": 0.0} for each Dense layer. I want to use these coefficients to create regularization in the Flux model. The problem is that, in Flux it is added directly to the loss instead of being defined as a property of the layer itself.
To avoid posting too much code here, I've added it to the repo. Here is a small script that takes the json and createa Flux's Chain: https://github.com/iegorval/Keras2Flux.jl/blob/master/Keras2Flux/src/Keras2Flux.jl
Now, I want to create a penalty for each Dense layer with the predefined l1/l2 coefficient. I tried to do it like this:
using Pkg
pkg"activate /home/username/.julia/dev/Keras2Flux"
using Flux
using Keras2Flux
using LinearAlgebra
function get_penalty(model::Chain, regs::Array{Any, 1})
index_model = 1
index_regs = 1
penalties = []
for layer in model
if layer isa Dense
println(regs[index_regs](layer.W))
penalty(m) = regs[index_regs](m[index_model].W)
push!(penalties, penalty)
#println(regs[i])
index_regs += 1
end
index_model += 1
end
total_penalty(m) = sum([p(m) for p in penalties])
println(total_penalty)
println(total_penalty(model))
return total_penalty
end
model, regs = convert_keras2flux("examples/keras_1_1_0.json")
penalty = get_penalty(model, regs)
So, I create a penalty function for each Dense layer and then sum it up to the total penalty. However, it gives me this error:
ERROR: LoadError: BoundsError: attempt to access 3-element Array{Any,1} at index [4]
I understand what it means but I really don't understand how to fix it. So, it seems that when I call total_penalty(model), it uses index_regs == 4 (so, the values of index_regs and index_model as they are AFTER the for-cycle). Instead, I want to use their actual indices that I had while pushing the given penalty to the list of penalties.
On the other hand, if I did it not as a list of functions but as a list of values, it also would not be correct, because I will define the loss as:
loss(x, y) = binarycrossentropy(model(x), y) + total_penalty(model). If I was to use it just as list of values, then I would have a static total_penalty, while it should be recalculated for every Dense layer every time during the model training.
I would be thankful if somebody with Julia experience gives me some advise because I am definitely failing to understand how it works in Julia and, specifically, in Flux. How would I create total_penalty that would be recalculated automatically during training?

There are a couple parts to your question, and since you are new to Flux (and Julia?), I will answer in steps. But I suggest the solution at the end as a cleaner way to handle this.
First, there is the issue of p(m) calculating the penalty using index_regs and index_model as the values after the for-loop. This is because of the scoping rules in Julia. When you define the closure penalty(m) = regs[index_regs](m[index_model].W), index_regs is bound to the variable defined in get_penalty. So, as index_regs changes, so does the output of p(m). The other issue is the naming of the function as penalty(m). Every time you run this line, you are redefining penalty and all references to it that you pushed onto penalties. Instead, you should prefer to create an anonymous function. Here is how we incorporate these changes:
function get_penalty(model::Chain, regs::Array{Any, 1})
index_model = 1
index_regs = 1
penalties = []
for layer in model
if layer isa Dense
println(regs[index_regs](layer.W))
penalty = let i = index_regs, index_model = index_model
m -> regs[i](m[index_model].W)
end
push!(penalties, penalty)
index_regs += 1
end
index_model += 1
end
total_penalty(m) = sum([p(m) for p in penalties])
return total_penalty
end
I used i and index_model in the let block to drive home the scoping rules. I'd encourage you to replace the anonymous function in the let block with global penalty(m) = ... (and remove the assignment to penalty before the let block) to see the difference of using anonymous vs named functions.
But, if we go back to your original issue, you want to calculate the regularization penalty for your model using the stored coefficients. Ideally, these would be stored with each Dense layer as in Keras. You can recreate the same functionality in Flux:
using Flux, Functor
struct RegularizedDense{T, LT<:Dense}
layer::LT
w_l1::T
w_l2::T
end
#functor RegularizedDense
(l::RegularizedDense)(x) = l.layer(x)
penalty(l) = 0
penalty(l::RegularizedDense) =
l.w_l1 * norm(l.layer.W, 1) + l.w_l2 * norm(l.layer.W, 2)
penalty(model::Chain) = sum(penalty(layer) for layer in model)
Then, in your Keras2Flux source, you can redefine get_regularization to return w_l1_reg and w_l2_reg instead of functions. And in create_dense you can do:
function create_dense(config::Dict{String,Any}, prev_out_dim::Int64=-1)
# ... code you have already written
dense = Dense(in, out, activation; initW = init, initb = zeros)
w_l1, w_l2 = get_regularization(config)
return RegularizedDense(dense, w_l1, w_l2)
end
Lastly, you can compute your loss function like so:
loss(x, y, m) = binarycrossentropy(m(x), y) + penalty(m)
# ... later for training
train!((x, y) -> loss(x, y, m), training_data, params)
We define loss as a function of (x, y, m) to avoid performance issues.
So, in the end, this approach is cleaner because after model construction, you don't need to pass around an array of regularization functions and figure out how to index each function correctly with the corresponding dense layer.
If you prefer to keep the regularizer and model separate (i.e. have standard Dense layers in your model chain), then you can do that too. Let me know if you want that solution, but I'll leave it out for now.

Related

Correct way to generate Poisson-distributed random numbers in Julia GPU code?

For a stochastic solver that will run on a GPU, I'm currently trying to draw Poisson-distributed random numbers. I will need one number for each entry of a large array. The array lives in device memory and will also be deterministically updated afterwards. The problem I'm facing is that the mean of the distribution depends on the old value of the entry. Therefore, I would have to do naively do something like:
CUDA.rand_poisson!(lambda=array*constant)
or:
array = CUDA.rand_poisson(lambda=array*constant)
Both of which don't work, which does not really surprise me, but maybe I just need to get a better understanding of broadcasting?
Then I tried writing a kernel which looks like this:
function cu_draw_rho!(rho::CuDeviceVector{FloatType}, λ::FloatType)
idx = (blockIdx().x - 1i32) * blockDim().x + threadIdx().x
stride = gridDim().x * blockDim().x
#inbounds for i=idx:stride:length(rho)
l = rho[i]*λ
# 1. variant
rho[i] > 0.f0 && (rho[i] = FloatType(CUDA.rand_poisson(UInt32,1;lambda=l)))
# 2. variant
rho[i] > 0.f0 && (rho[i] = FloatType(rand(Poisson(lambda=l))))
end
return
end
And many slight variations of the above. I get tons of errors about dynamic function calls, which I connect to the fact that I'm calling functions that are meant for arrays from my kernels. the 2. variant of using rand() works only without the Poisson argument (which uses the Distributions package, I guess?)
What is the correct way to do this?
You may want CURAND.jl, which provides curand_poisson.
using CURAND
n = 10
lambda = .5
curand_poisson(n, lambda)

Parameters of function in Julia

Does anyone know the reasons why Julia chose a design of functions where the parameters given as inputs cannot be modified?  This requires, if we want to use it anyway, to go through a very artificial process, by representing these data in the form of a ridiculous single element table.
Ada, which had the same kind of limitation, abandoned it in its 2012 redesign to the great satisfaction of its users. A small keyword (like out in Ada) could very well indicate that the possibility of keeping the modifications of a parameter at the output is required.
From my experience in Julia it is useful to understand the difference between a value and a binding.
Values
Each value in Julia has a concrete type and location in memory. Value can be mutable or immutable. In particular when you define your own composite type you can decide if objects of this type should be mutable (mutable struct) or immutable (struct).
Of course Julia has in-built types and some of them are mutable (e.g. arrays) and other are immutable (e.g. numbers, strings). Of course there are design trade-offs between them. From my perspective two major benefits of immutable values are:
if a compiler works with immutable values it can perform many optimizations to speed up code;
a user is can be sure that passing an immutable to a function will not change it and such encapsulation can simplify code analysis.
However, in particular, if you want to wrap an immutable value in a mutable wrapper a standard way to do it is to use Ref like this:
julia> x = Ref(1)
Base.RefValue{Int64}(1)
julia> x[]
1
julia> x[] = 10
10
julia> x
Base.RefValue{Int64}(10)
julia> x[]
10
You can pass such values to a function and modify them inside. Of course Ref introduces a different type so method implementation has to be a bit different.
Variables
A variable is a name bound to a value. In general, except for some special cases like:
rebinding a variable from module A in module B;
redefining some constants, e.g. trying to reassign a function name with a non-function value;
rebinding a variable that has a specified type of allowed values with a value that cannot be converted to this type;
you can rebind a variable to point to any value you wish. Rebinding is performed most of the time using = or some special constructs (like in for, let or catch statements).
Now - getting to the point - function is passed a value not a binding. You can modify a binding of a function parameter (in other words: you can rebind a value that a parameter is pointing to), but this parameter is a fresh variable whose scope lies inside a function.
If, for instance, we wanted a call like:
x = 10
f(x)
change a binding of variable x it is impossible because f does not even know of existence of x. It only gets passed its value. In particular - as I have noted above - adding such a functionality would break the rule that module A cannot rebind variables form module B, as f might be defined in a module different than where x is defined.
What to do
Actually it is easy enough to work without this feature from my experience:
What I typically do is simply return a value from a function that I assign to a variable. In Julia it is very easy because of tuple unpacking syntax like e.g. x,y,z = f(x,y,z), where f can be defined e.g. as f(x,y,z) = 2x,3y,4z;
You can use macros which get expanded before code execution and thus can have an effect modifying a binding of a variable, e.g. macro plusone(x) return esc(:($x = $x+1)) end and now writing y=100; #plusone(y) will change the binding of y;
Finally you can use Ref as discussed above (or any other mutable wrapper - as you have noted in your question).
"Does anyone know the reasons why Julia chose a design of functions where the parameters given as inputs cannot be modified?" asked by Schemer
Your question is wrong because you assume the wrong things.
Parameters are variables
When you pass things to a function, often those things are values and not variables.
for example:
function double(x::Int64)
2 * x
end
Now what happens when you call it using
double(4)
What is the point of the function modifying it's parameter x , it's pointless. Furthermore the function has no idea how it is called.
Furthermore, Julia is built for speed.
A function that modifies its parameter will be hard to optimise because it causes side effects. A side effect is when a procedure/function changes objects/things outside of it's scope.
If a function does not modifies a variable that is part of its calling parameter then you can be safe knowing.
the variable will not have its value changed
the result of the function can be optimised to a constant
not calling the function will not break the program's behaviour
Those above three factors are what makes FUNCTIONAL language fast and NON FUNCTIONAL language slow.
Furthermore when you move into Parallel programming or Multi Threaded programming, you absolutely DO NOT WANT a variable having it's value changed without you (The programmer) knowing about it.
"How would you implement with your proposed macro, the function F(x) which returns a boolean value and modifies c by c:= c + 1. F can be used in the following piece of Ada code : c:= 0; While F(c) Loop ... End Loop;" asked by Schemer
I would write
function F(x)
boolean_result = perform_some_logic()
return (boolean_result,x+1)
end
flag = true
c = 0
(flag,c) = F(c)
while flag
do_stuff()
(flag,c) = F(c)
end
"Unfortunately no, because, and I should have said that, c has to take again the value 0 when F return the value False (c increases as long the Loop lives and return to 0 when it dies). " said Schemer
Then I would write
function F(x)
boolean_result = perform_some_logic()
if boolean_result == true
return (true,x+1)
else
return (false,0)
end
end
flag = true
c = 0
(flag,c) = F(c)
while flag
do_stuff()
(flag,c) = F(c)
end

Minimize the maximum in Julia JuMP

So I've combed through the various websites pertaining to Julia JuMP and using functions as arguments to #objective or #NLobjective, but let me try to state my problem. I'm certain that I'm doing something silly, and that this is a quick fix.
Here is a brief code snippet and what I would like to do:
using juMP;
tiLim = 1800;
x = range(1,1,M); # M stated elsewhere
solver_opt = "bonmin.time_limit=" * "$tiLim";
m = Model(solver=AmplNLSolver("bonmin",[solver_opt]));
#variables m begin
T[x];
... # Have other decision variables which are matrices
end
#NLobjective(m,:Min,maximum(T[i] for i in x));
Now from my understanding, the `maximum' function makes the problem nonlinear and is not allowed inside the JuMP objective function, so people will do one of two things:
(1) play the auxiliary variable + constraint trick, or
(2) create a function and then `register' this function with JuMP.
However, I can't seem to do either correctly.
Here is an attempt at using the auxiliary variable + constraint trick:
mymx(vec::Array) = maximum(vec) #generic function in Julia
#variable(m, aux)
#constraint(m, aux==mymx(T))
#NLobjective(m,:Min,aux)
I was hoping to get some assistance with doing this seemingly trivial task of minimizing a maximum.
Also, it should be noted that this is a MILP problem which I'm trying to solve. I've previously implemented the problem in CPLEX using the ILOG script for OPL, where this objective function is much more straightforward it seems. Though it's probably just my ignorance of using JuMP.
Thanks.
You can model this as a linear problem as follows:
#variable(m, aux)
for i in x
#constraint(m, aux >= T[i]
end
#objective(m, Min, aux)

Replacing functions with Table Lookups

I've been watching this MSDN video with Brian Beckman and I'd like to better understand something he says:
Every imperitive programmer goes through this phase of learning that
functions can be replaced with table lookups
Now, I'm a C# programmer who never went to university, so perhaps somewhere along the line I missed out on something everyone else learned to understand.
What does Brian mean by:
functions can be replaced with table lookups
Are there practical examples of this being done and does it apply to all functions? He gives the example of the sin function, which I can make sense of, but how do I make sense of this in more general terms?
Brian just showed that the functions are data too. Functions in general are just a mapping of one set to another: y = f(x) is mapping of set {x} to set {y}: f:X->Y. The tables are mappings as well: [x1, x2, ..., xn] -> [y1, y2, ..., yn].
If function operates on finite set (this is the case in programming) then it's can be replaced with a table which represents that mapping. As Brian mentioned, every imperative programmer goes through this phase of understanding that the functions can be replaced with the table lookups just for performance reason.
But it doesn't mean that all functions easily can or should be replaced with the tables. It only means that you theoretically can do that for every function. So the conclusion would be that the functions are data because tables are (in the context of programming of course).
There is a lovely trick in Mathematica that creates a table as a side-effect of evaluating function-calls-as-rewrite-rules. Consider the classic slow-fibonacci
fib[1] = 1
fib[2] = 1
fib[n_] := fib[n-1] + fib[n-2]
The first two lines create table entries for the inputs 1 and 2. This is exactly the same as saying
fibTable = {};
fibTable[1] = 1;
fibTable[2] = 1;
in JavaScript. The third line of Mathematica says "please install a rewrite rule that will replace any occurrence of fib[n_], after substituting the pattern variable n_ with the actual argument of the occurrence, with fib[n-1] + fib[n-2]." The rewriter will iterate this procedure, and eventually produce the value of fib[n] after an exponential number of rewrites. This is just like the recursive function-call form that we get in JavaScript with
function fib(n) {
var result = fibTable[n] || ( fib(n-1) + fib(n-2) );
return result;
}
Notice it checks the table first for the two values we have explicitly stored before making the recursive calls. The Mathematica evaluator does this check automatically, because the order of presentation of the rules is important -- Mathematica checks the more specific rules first and the more general rules later. That's why Mathematica has two assignment forms, = and :=: the former is for specific rules whose right-hand sides can be evaluated at the time the rule is defined; the latter is for general rules whose right-hand sides must be evaluated when the rule is applied.
Now, in Mathematica, if we say
fib[4]
it gets rewritten to
fib[3] + fib[2]
then to
fib[2] + fib[1] + 1
then to
1 + 1 + 1
and finally to 3, which does not change on the next rewrite. You can imagine that if we say fib[35], we will generate enormous expressions, fill up memory, and melt the CPU. But the trick is to replace the final rewrite rule with the following:
fib[n_] := fib[n] = fib[n-1] + fib[n-2]
This says "please replace every occurrence of fib[n_] with an expression that will install a new specific rule for the value of fib[n] and also produce the value." This one runs much faster because it expands the rule-base -- the table of values! -- at run time.
We can do likewise in JavaScript
function fib(n) {
var result = fibTable[n] || ( fib(n-1) + fib(n-2) );
fibTable[n] = result;
return result;
}
This runs MUCH faster than the prior definition of fib.
This is called "automemoization" [sic -- not "memorization" but "memoization" as in creating a memo for yourself].
Of course, in the real world, you must manage the sizes of the tables that get created. To inspect the tables in Mathematica, do
DownValues[fib]
To inspect them in JavaScript, do just
fibTable
in a REPL such as that supported by Node.JS.
In the context of functional programming, there is the concept of referential transparency. A function that is referentially transparent can be replaced with its value for any given argument (or set of arguments), without changing the behaviour of the program.
Referential Transparency
For example, consider a function F that takes 1 argument, n. F is referentially transparent, so F(n) can be replaced with the value of F evaluated at n. It makes no difference to the program.
In C#, this would look like:
public class Square
{
public static int apply(int n)
{
return n * n;
}
public static void Main()
{
//Should print 4
Console.WriteLine(Square.apply(2));
}
}
(I'm not very familiar with C#, coming from a Java background, so you'll have to forgive me if this example isn't quite syntactically correct).
It's obvious here that the function apply cannot have any other value than 4 when called with an argument of 2, since it's just returning the square of its argument. The value of the function only depends on its argument, n; in other words, referential transparency.
I ask you, then, what the difference is between Console.WriteLine(Square.apply(2)) and Console.WriteLine(4). The answer is, there's no difference at all, for all intents are purposes. We could go through the entire program, replacing all instances of Square.apply(n) with the value returned by Square.apply(n), and the results would be the exact same.
So what did Brian Beckman mean with his statement about replacing function calls with a table lookup? He was referring to this property of referentially transparent functions. If Square.apply(2) can be replaced with 4 with no impact on program behaviour, then why not just cache the values when the first call is made, and put it in a table indexed by the arguments to the function. A lookup table for values of Square.apply(n) would look somewhat like this:
n: 0 1 2 3 4 5 ...
Square.apply(n): 0 1 4 9 16 25 ...
And for any call to Square.apply(n), instead of calling the function, we can simply find the cached value for n in the table, and replace the function call with this value. It's fairly obvious that this will most likely bring about a large speed increase in the program.

Performing operations on CUDA matrices while reading from a global Point

Hey there,
I have a mathematical function (multidimensional which means that there's an index which I pass to the C++-function on which single mathematical function I want to return. E.g. let's say I have a mathematical function like that:
f = Vector(x^2*y^2 / y^2 / x^2*z^2)
I would implement it like that:
double myFunc(int function_index)
{
switch(function_index)
{
case 1:
return PNT[0]*PNT[0]*PNT[1]*PNT[1];
case 2:
return PNT[1]*PNT[1];
case 3:
return PNT[2]*PNT[2]*PNT[1]*PNT[1];
}
}
whereas PNT is defined globally like that: double PNT[ NUM_COORDINATES ]. Now I want to implement the derivatives of each function for each coordinate thus generating the derivative matrix (columns = coordinates; rows = single functions). I wrote my kernel already which works so far and which call's myFunc().
The Problem is: For calculating the derivative of the mathematical sub-function i concerning coordinate j, I would use in sequential mode (on CPUs e.g.) the following code (whereas this is simplified because usually you would decrease h until you reach a certain precision of your derivative):
f0 = myFunc(i);
PNT[ j ] += h;
derivative = (myFunc(j)-f0)/h;
PNT[ j ] -= h;
now as I want to do this on the GPU in parallel, the problem is coming up: What to do with PNT? As I have to increase certain coordinates by h, calculate the value and than decrease it again, there's a problem coming up: How to do it without 'disturbing' the other threads? I can't modify PNT because other threads need the 'original' point to modify their own coordinate.
The second idea I had was to save one modified point for each thread but I discarded this idea quite fast because when using some thousand threads in parallel, this is a quite bad and probably slow (perhaps not realizable at all because of memory limits) idea.
'FINAL' SOLUTION
So how I do it currently is the following, which adds the value 'add' on runtime (without storing it somewhere) via preprocessor macro to the coordinate identified by coord_index.
#define X(n) ((coordinate_index == n) ? (PNT[n]+add) : PNT[n])
__device__ double myFunc(int function_index, int coordinate_index, double add)
{
//*// Example: f[i] = x[i]^3
return (X(function_index)*X(function_index)*X(function_index));
// */
}
That works quite nicely and fast. When using a derivative matrix with 10000 functions and 10000 coordinates, it just takes like 0.5seks. PNT is defined either globally or as constant memory like __constant__ double PNT[ NUM_COORDINATES ];, depending on the preprocessor variable USE_CONST.
The line return (X(function_index)*X(function_index)*X(function_index)); is just an example where every sub-function looks the same scheme, mathematically spoken:
f = Vector(x0^3 / x1^3 / ... / xN^3)
NOW THE BIG PROBLEM ARISES:
myFunc is a mathematical function which the user should be able to implement as he likes to. E.g. he could also implement the following mathematical function:
f = Vector(x0^2*x1^2*...*xN^2 / x0^2*x1^2*...*xN^2 / ... / x0^2*x1^2*...*xN^2)
thus every function looking the same. You as a programmer should only code once and not depending on the implemented mathematical function. So when the above function is being implemented in C++, it looks like the following:
__device__ double myFunc(int function_index, int coordinate_index, double add)
{
double ret = 1.0;
for(int i = 0; i < NUM_COORDINATES; i++)
ret *= X(i)*X(i);
return ret;
}
And now the memory accesses are very 'weird' and bad for performance issues because each thread needs access to each element of PNT twice. Surely, in such a case where each function looks the same, I could rewrite the complete algorithm which surrounds the calls to myFunc, but as I stated already: I don't want to code depending on the user-implemented function myFunc...
Could anybody come up with an idea how to solve this problem??
Thanks!
Rewinding back to the beginning and starting with a clean sheet, it seems you want to be able to do two things
compute an arbitrary scalar valued
function over an input array
approximate the partial derivative of an arbitrary scalar
valued function over the input array
using first order accurate finite differencing
While the function is scalar valued and arbitrary, it seems that there are, in fact, two clear forms which this function can take:
A scalar valued function with scalar arguments
A scalar valued function with vector arguments
You appeared to have started with the first type of function and have put together code to deal with computing both the function and the approximate derivative, and are now wrestling with the problem of how to deal with the second case using the same code.
If this is a reasonable summary of the problem, then please indicate so in a comment and I will continue to expand it with some code samples and concepts. If it isn't, I will delete it in a few days.
In comments, I have been trying to suggest that conflating the first type of function with the second is not a good approach. The requirements for correctness in parallel execution, and the best way of extracting parallelism and performance on the GPU are very different. You would be better served by treating both types of functions separately in two different code frameworks with different usage models. When a given mathematical expression needs to be implemented, the "user" should make a basic classification as to whether that expression is like the model of the first type of function, or the second. The act of classification is what drives algorithmic selection in your code. This type of "classification by algorithm" is almost universal in well designed libraries - you can find it in C++ template libraries like Boost and the STL, and you can find it in legacy Fortran codes like the BLAS.

Resources