How to decide which mode to use for 'kaiming_normal' initialization

I have read several code examples that do layer initialization using nn.init.kaiming_normal_() in PyTorch. Some use the fan-in mode, which is the default. Of the many examples, one can be found here and is shown below.
init.kaiming_normal(m.weight.data, a=0, mode='fan_in')
However, sometimes I see people using the fan out mode as seen here and shown below.
if isinstance(m, nn.Conv2d):
    nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
Can someone give me some guidelines or tips to help me decide which mode to select? Further, I am working on image super-resolution and denoising tasks in PyTorch; which mode would be more beneficial for those?

According to the documentation:
Choosing 'fan_in' preserves the magnitude of the variance of the
weights in the forward pass. Choosing 'fan_out' preserves the
magnitudes in the backwards pass.
and according to Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification - He, K. et al. (2015):
We note that it is sufficient to use either Eqn.(14) or
Eqn.(10)
where Eqn.(10) and Eqn.(14) are fan_in and fan_out respectively. Furthermore:
This means that if the initialization properly scales the backward
signal, then this is also the case for the forward signal; and vice
versa. For all models in this paper, both forms can make them converge
so, all in all, it doesn't matter much; it's more about what you are after. I assume that if you suspect your backward pass might be more "chaotic" (greater variance), it is worth changing the mode to fan_out. This might happen when the loss oscillates a lot (e.g. very easy examples followed by very hard ones).
The correct choice of nonlinearity is more important, where nonlinearity is the activation you use after the layer you are currently initializing. The current defaults set it to leaky_relu with a=0, which is effectively the same as relu. If you are using leaky_relu you should change a to its slope.
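For completeness, here is a minimal sketch of matching nonlinearity and a to the activation that follows each layer (the layer types and the 0.2 slope are illustrative assumptions, not taken from the question):
import torch.nn as nn

def init_weights(m):
    # conv layers followed by ReLU: pass nonlinearity='relu'
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='relu')
    # layers followed by LeakyReLU(0.2): pass the slope via a=0.2
    elif isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, a=0.2, nonlinearity='leaky_relu')

model = nn.Sequential(nn.Conv2d(3, 64, 3), nn.ReLU())
model.apply(init_weights)  # applies init_weights to every submodule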

Related

openMDAO: Does the use of ExecComp maximum() interfere with constraints not being affected by design variables?

When running the optimization driver on a large model I receive:
DerivativesWarning:Constraints or objectives [('max_current.current_constraint.current_constraint', inds=[0]), ('max_current.continuous_current_constraint.continuous_current_constraint', inds=[0])] cannot be impacted by the design variables of the problem.
I read the answer to a similar question posed here.
The values do change as the design variables change, and the two constraints are satisfied during the course of optimization.
I had assumed this was due to those components' ExecComp using maximum(), as this is the only place in my model where I use a maximum function. However, when setting up a simple problem with a maximum() function in a similar manner, I do not receive an error.
My model uses explicit components that are looped: there are connections in the bottom left of the N2 diagram, and NLBGS converges the whole model. I currently think the warning is due to the use of only explicit components with NLBGS instead of implicit components.
Thank you for any insight you can give in resolving this warning.
Below is a simple script using maximum() that does not report errors. (I was so sure that was it.) Once I create a minimum working example that reproduces the error the way my larger model does, I will upload it.
import openmdao.api as om

prob = om.Problem()
prob.driver = om.ScipyOptimizeDriver()
prob.driver.options['optimizer'] = 'SLSQP'
prob.driver.options['tol'] = 1e-6
prob.driver.options['maxiter'] = 80
prob.driver.options['disp'] = True

indeps = prob.model.add_subsystem('indeps', om.IndepVarComp())
indeps.add_output('x', val=2.0, units=None)
prob.model.promotes('indeps', outputs=['*'])

prob.model.add_subsystem('y_func_1',
                         om.ExecComp('y_func_1 = x'),
                         promotes_inputs=['x'],
                         promotes_outputs=['y_func_1'])
prob.model.add_subsystem('y_func_2',
                         om.ExecComp('y_func_2 = x**2'),
                         promotes_inputs=['x'],
                         promotes_outputs=['y_func_2'])
prob.model.add_subsystem('y_max',
                         om.ExecComp('y_max = maximum(y_func_1, y_func_2)'),
                         promotes_inputs=['y_func_1', 'y_func_2'],
                         promotes_outputs=['y_max'])
prob.model.add_subsystem('y_check',
                         om.ExecComp('y_check = y_max - 1.1'),
                         promotes_inputs=['*'],
                         promotes_outputs=['*'])

prob.model.add_constraint('y_check', lower=0.0)
prob.model.add_design_var('x', lower=0.0, upper=2.0)
prob.model.add_objective('x')

prob.setup()
prob.run_driver()
print(prob.get_val('x'))
There is a problem with the maximum function in this context. Technically a maximum function is not differentiable; at least not when the index of which value is max is subject to change. If the maximum value is not subject to change, then it is differentiable... but you didn't need the max function anyway.
One correct, differentiable way to handle a max when doing gradient-based optimization is to use a KS function. OpenMDAO provides the KSComp, which implements it. There are other kinds of aggregation functions (like the p-norm) that you could use as well.
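For illustration, here is a minimal sketch of KSComp standing in for a two-input max (the shapes and rho value are assumptions for this example; rho=50.0 is the default):
import numpy as np
import openmdao.api as om

prob = om.Problem()
# KSComp aggregates a smooth, conservative upper bound on max(g);
# width=2 means each row of the input g holds 2 values to aggregate
prob.model.add_subsystem('ks', om.KSComp(width=2, rho=50.0))
prob.setup()
prob.set_val('ks.g', np.array([[1.0, 4.0]]))
prob.run_model()
print(prob.get_val('ks.KS'))  # slightly above max(1.0, 4.0), but differentiable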
However, even though maximum is not technically differentiable, you can sort of get away with it. At least, numpy (which ExecComp uses) lets you apply complex-step differentiation to the maximum function, and it seems to give a non-zero derivative. So while it's not technically correct, you can maybe get away with it. At least, it's not likely to be the core of your problem.
You mention using NLBGS, and that you have components which are looped. Your test case is purely feed-forward, though (the N2 diagram from your test case shows no feedback connections). That is an important difference.
The problem here is with your derivatives, not with the maximum function. Since you have a nonlinear solver, you need to do something to get the derivatives right. In the example Sellar optimization, the model uses this line: prob.model.approx_totals(), which tells OpenMDAO to finite-difference across the whole model (including the nonlinear solver). This is simple and keeps the example compact. It also works regardless of whether your components define derivatives or not. It is, however, slow and suffers from numerical difficulties, so use it on "real" problems at your own risk.
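Hooking that into the script above would be one extra line before setup() (a sketch; 'fd' is the default method for approx_totals):
# finite-difference total derivatives across the whole model,
# including across the nonlinear solver
prob.model.approx_totals(method='fd')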
If you don't include that (and your above example does not, so I assume your real problem does not either), then you're basically telling OpenMDAO that you want to use analytic derivatives (yay! they are so much more awesome). That means that you need a linear solver to match your nonlinear one. For most problems that you start out with, you can simply put a DirectSolver right at the top of the model and it will all work out. For more advanced models you need a more complex linear solver structure... but that's a different question.
Give this a try:
prob.model.linear_solver = om.DirectSolver()
That should give you non-zero total derivatives regardless of whether you have coupling (loops) or not.

Finite difference between old and new OpenMDAO

So I am converting code from the old OpenMDAO to the new OpenMDAO. All the outputs and the partial gradients have been verified as correct. At first the problem would not optimize at all, and then I realized that the old code had some components that did not provide gradients, so they were automatically finite-differenced. So I added fd_options['force_fd'] = True to those components, but it still does not optimize to the right value. I checked the total derivative and it was still not correct. It also takes quite a bit longer per iteration than the old OpenMDAO. The only way I can get my new code to optimize to the same value as the old OpenMDAO code is to set every component to finite difference, even the components that provide gradients. So I have a few questions about how finite difference works between the old and the new OpenMDAO:
When the old OpenMDAO did automatic finite difference did it only do it on the outputs and inputs needed for the optimization or did it calculate the entire Jacobian for all the inputs and outputs? Same question for the new OpenMDAO when you turn 'force_fd' to True.
Can you provide some parts of the Jacobian of a component and have it finite difference the rest? In the old OpenMDAO did it finite difference any gradients not provided unless you put missing_deriv_policy = 'assume_zero'?
So, the old OpenMDAO looked for groups of components without derivatives, and bundled them together into a group that could be finite differenced together. New OpenMDAO doesn't do that, so each of those components would be finite differenced separately.
We don't support that yet, and didn't in old OpenMDAO. We do have a story up on our pivotal tracker though, so we will eventually have this feature.
What I suspect might be happening for you is that the finite-difference groupings happened to be better in classic OpenMDAO. Consider one component with one input and 10 outputs connected to a second component with 10 inputs and 1 output. If you finite difference them together, only one execution is required. If you finite difference them individually, you need one execution of component one and 10 executions of component two. This could cause a noticeable or even major performance hit.
Individual FD vs group FD can also cause accuracy problems, if there is an important input that has vastly different scaling than the other variables, so that the default FD stepsize of 1.0e-6 is no good. (Note: you can set a step_size when you add a param or output and it overrides the default for that var.)
Luckily, new OpenMDAO has a way to recreate what you had in old OpenMDAO, but it is not automatic. What you would need to do is take a look at your model and figure out which components can be FD'd together, then create a sub-Group and move those components into that group. You can set fd_options['force_fd'] to True on the group, and it will finite difference that group together. For example, if you have A -> B -> C with no components in between, and none have derivatives, you can move A, B, and C into a new sub-Group with force_fd set to True, as sketched below.
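A rough sketch of that structure in the OpenMDAO 1.x API (CompA, CompB, and CompC are placeholder component classes, and the promotes lists are assumptions about how the chain is wired):
from openmdao.api import Group

fd_group = root.add('fd_group', Group())
fd_group.add('A', CompA(), promotes=['*'])
fd_group.add('B', CompB(), promotes=['*'])
fd_group.add('C', CompC(), promotes=['*'])
# finite difference the whole A -> B -> C chain as one unit
fd_group.fd_options['force_fd'] = True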
If that doesn't fix things, we may have to look more deeply at your model.

OpenMDAO 1.x relevance reduction

I have a component in OpenMDAO without outputs that serves to provide inputs to the rest of the group. apply_linear on that component is being called despite the fact that its output is not connected. Shouldn't the relevance reduction algorithm in OpenMDAO 1.x figure out that apply_linear for this component never needs to be called?
As it turns out, relevance reduction on a per-variable basis isn't turned on by default. You can turn it on with:
from openmdao.api import LinearGaussSeidel

prob.root.ln_solver = LinearGaussSeidel()
prob.root.ln_solver.options['single_voi_relevance_reduction'] = True
This option is set to False by default because it uses more memory, allocating separate vectors for each quantity of interest (each vector is smaller because it only contains relevant variables, but the total size may be larger). Also, relevance reduction is only applicable when using Linear Gauss-Seidel as the top linear solver.
My reputation isn't high enough yet to leave comments, so I'm just adding another answer instead. I just wanted to mention that if you're not running under MPI, activating single_voi_relevance_reduction is essentially free. The real increase in memory use isn't due to the vectors themselves, but instead it's due to the index arrays that we store in order to transfer the data from source arrays to target arrays. We're forced to use index arrays under MPI, because PETSc requires it, but when we're not using MPI we use python slice objects to do our data transfer. Slice objects require very little memory.
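A generic numpy illustration of the difference (not OpenMDAO code, just the underlying idea):
import numpy as np

src = np.arange(10.0)
# slice-based transfer: a Python slice object costs almost no memory
view = src[2:5]
# index-array transfer (what PETSc requires under MPI): an explicit
# integer index array must be stored for every transfer
idx = np.array([2, 3, 4])
copied = src[idx]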

arithmetic library for tracking worst case error

Is there any library or tool that allows for knowing the maximum accumulated error in arithmetic operations?
For example if I make some iterative calculation ...
myVars = initialValues;
while (notEnded) {
    myVars = updateMyVars(myVars);
}
... I want to know at the end not only the calculated values, but also the potential error (the range of possible values if the result of each individual operation took the range limits of each operand).
I have already written a Java class called EADouble.java (EA for Error Accounting) which holds and updates the maximum positive and negative errors along with the calculated value, for some basic operations, but I'm afraid I might be reinventing a square wheel.
Any libraries/whatever in Java/whatever? Any suggestions?
Updated on July 11th: Examined existing libraries and added link to sample code.
As commented by fellows, there is the concept of Interval Arithmetic, and there was a previous question (A good uncertainty (interval) arithmetic library?) on the topic. There are just a couple of small issues with respect to my intent:
I care more about the "main" value than about the upper and lower bounds. However, adding that extra value to an open library should be straightforward.
Accounting for the error as an independent floating-point value might allow for finer accuracy (e.g. for addition, the upper bound would be incremented by just half an ULP instead of a whole ULP).
Libraries I had a look at:
ia_math (Java. Just would have to add the main value. My favourite so far)
Boost/numeric/Interval (C++, Very complex/complete)
ErrorProp (Java, accounts value, and error as standard deviation)
The sample code (TestEADouble.java) runs a ballistic simulation and a calculation of the number e without problems. However, those are not very demanding scenarios.
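For reference, here is a minimal Python sketch of the half-ULP accounting idea described above (a hypothetical class in the spirit of EADouble, not one of the listed libraries; requires Python 3.9+ for math.ulp):
import math

class ErrValue:
    def __init__(self, value, err_lo=0.0, err_hi=0.0):
        self.value = value    # the "main" computed value
        self.err_lo = err_lo  # accumulated error bound below the value
        self.err_hi = err_hi  # accumulated error bound above the value

    def __add__(self, other):
        v = self.value + other.value
        # a correctly rounded addition is off by at most half an ULP
        half_ulp = math.ulp(v) / 2
        return ErrValue(v,
                        self.err_lo + other.err_lo + half_ulp,
                        self.err_hi + other.err_hi + half_ulp)

x = ErrValue(0.1) + ErrValue(0.2)
print(x.value, x.err_lo, x.err_hi)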
Probably way too late, but look at BIAS/Profil: http://www.ti3.tuhh.de/keil/profil/index_e.html
It is pretty complete and simple, accounts for machine rounding error, and if your errors are centered it gives easy access to your nominal value (through Mid(...)).

Math question regarding Python's uuid4

I'm not great with statistical mathematics, etc. I've been wondering, if I use the following:
import uuid
unique_str = str(uuid.uuid4())
double_str = ''.join([str(uuid.uuid4()), str(uuid.uuid4())])
Is double_str "squared as unique" as unique_str, or just some amount more unique? Also, is there any negative implication in doing something like this (like some birthday-problem situation, etc.)? This may sound ignorant, but I simply would not know, as my math spans algebra 2 at best.
The uuid4 function returns a UUID created from 16 random bytes and it is extremely unlikely to produce a collision, to the point at which you probably shouldn't even worry about it.
If for some reason uuid4 does produce a duplicate, it is far more likely to be a programming error, such as a failure to correctly initialize the random number generator, than genuine bad luck. In that case, the approach you are using will not make it any better: an incorrectly initialized random number generator can still produce duplicates even with your approach.
If you use the default implementation random.seed(None), you can see in the source that only 16 bytes of randomness are used to initialize the random number generator, so this is an issue you would have to solve first. Also, if the OS doesn't provide a source of randomness, the system time will be used, which is not very random at all.
But ignoring these practical issues, you are basically along the right lines. To use a mathematical approach, we first have to define what you mean by "uniqueness". I think a reasonable definition is the number of ids n you need to generate before the probability of generating a duplicate exceeds some probability p. An approximate formula for this is:
n(p; d) ≈ sqrt(2 * d * ln(1 / (1 - p)))
where d is 2**(16*8) for a single randomly generated uuid and 2**(16*2*8) with your suggested approach. The square root in the formula is indeed due to the Birthday Paradox. But if you work it out, you can see that if you square the range of values d while keeping p constant, then you also (roughly) square n.
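Working that out numerically (a quick sketch of the formula above; the printed magnitudes are approximate):
import math

def n_before_collision(d, p=0.5):
    # ids you can draw before the collision probability exceeds p
    return math.sqrt(2 * d * math.log(1 / (1 - p)))

single = 2 ** (16 * 8)      # one uuid4, treating all 128 bits as random
double = 2 ** (16 * 2 * 8)  # two concatenated uuid4s
print(f'{n_before_collision(single):.2e}')  # ~2.2e19
print(f'{n_before_collision(double):.2e}')  # ~4.0e38, roughly the square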
Since uuid4 is based on a pseudo-random number generator, calling it twice is not going to square the amount of "uniqueness" (and may not even add any uniqueness at all).
See also When should I use uuid.uuid1() vs. uuid.uuid4() in python?
It depends on the random number generator, but it's almost squared uniqueness.
