I am trying to understand how a neural network can predict different outputs by learning different input/output patterns. I know that weight changes are the mechanism of learning, but if an input brings about weight adjustments to achieve a particular output in the backpropagation algorithm, won't this knowledge (the weight updates) be knocked off when the network is presented with a different input pattern, thus making the network forget what it had previously learnt?
The key to avoiding "destroying" the network's current knowledge is to set the learning rate to a sufficiently low value.
Let's take a look at the mathematics for a perceptron:
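For a single unit trained with the delta rule, the weight update takes the form

delta_w_i = eta * (t - o) * x_i

where eta is the learning rate, t is the target output, o is the actual output, and x_i is the i-th input. Each presentation of a pattern therefore only moves each weight by a step proportional to eta.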
The learning rate is always specified to be < 1. This forces the backpropagation algorithm to take many small steps towards the correct setting, rather than jumping in large steps. The smaller the steps, the easier it will be to "jitter" the weight values into the perfect settings.
If, on the other hand, we used a learning rate of 1, we could start to have trouble with convergence, as you mention. A high learning rate implies that backpropagation should always prefer to satisfy the currently observed input pattern, at the expense of what was learnt from earlier patterns.
Trying to adjust the learning rate to a "perfect" value is unfortunately more of an art than a science. There are of course implementations with adaptive learning rates; refer to this tutorial from Willamette University. Personally, I've just used a static learning rate in the range [0.03, 0.1].
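As a rough sketch of that point (a toy single perceptron, not tied to any particular framework), training online on two different patterns with a small learning rate only nudges the weights a little on each presentation, so satisfying the second pattern doesn't wipe out the first:

import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=3)                 # weights, last entry acts as a bias
patterns = [(np.array([1.0, 0.0, 1.0]), 1.0),     # (input incl. bias term, target)
            (np.array([0.0, 1.0, 1.0]), 0.0)]
eta = 0.05                                        # small learning rate

for epoch in range(200):
    for x, t in patterns:
        o = 1.0 if w @ x > 0 else 0.0             # perceptron output
        w += eta * (t - o) * x                    # small step toward this pattern

for x, t in patterns:
    print(t, 1.0 if w @ x > 0 else 0.0)           # both patterns end up satisfied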
I would like to perform some optimizations by minimizing the maximum of a specific path variable within Dymos, or the maximum of the absolute value of such a variable.
In linear programming methods, this can be done by introducing slack variables.
Do you know if this has been attempted before with Dymos, or if there was a reason not to include it?
I understand gradient-based methods are not entirely suitable for these problems, though I think some "functions" can be introduced to mitigate this.
For example,
the space shuttle reentry problem from [Betts][1] is used as a [test example][2] in Dymos, and the original source contains a variant where the maximum heat flux is minimized. Such functionality could be implemented with a "loc" argument, e.g.:
phase.add_objective('q_c', loc='max')
[1]: J. Betts. Practical Methods for Optimal Control and Estimation Using Nonlinear Programming. Society for Industrial and Applied Mathematics, second edition, 2010. URL: https://epubs.siam.org/doi/abs/10.1137/1.9780898718577, doi:10.1137/1.9780898718577.
[2]: https://openmdao.github.io/dymos/examples/reentry/reentry.html
This has been done with pseudospectral methods before. Dymos currently doesn't have any direct way of implementing this, for a few reasons:
As you said, doing this naively can introduce discontinuous gradients that confuse the optimizer. When the node at which the maximum occurs switches, this tends to cause a sharp edge discontinuity in the gradient.
Since the pseudospectral methods are discrete, you cannot guarantee that the maximum will occur at a node. It's often fine to assume it does, but sometimes your requirements might demand more precision.
There are two possible ways to get around this.
The KSComp in OpenMDAO can be used as a "differentiable maximum". Add one after the trajectory, feed it the timeseries data for the output of interest, and set it up such that it returns a smooth approximation to the maximum. The KS function is a bit conservative, so it won't pick out the precise maximum, but depending on the value of the rho option it can be tuned to get pretty close.
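As a minimal standalone sketch (not wired into a trajectory, and with the number of nodes and rho chosen arbitrarily), the KSComp can be fed a vector of values and its KS output compared with the true maximum:

import numpy as np
import openmdao.api as om

n = 20
p = om.Problem()
# KSComp aggregates the n inputs into a smooth, slightly conservative maximum.
p.model.add_subsystem('ks', om.KSComp(width=n, rho=50.0))
p.setup()

g = np.sin(np.linspace(0.0, np.pi, n))            # sample data with a peak of 1.0
p.set_val('ks.g', g.reshape(1, n))
p.run_model()
print(p.get_val('ks.KS'), g.max())                # KS slightly over-estimates the max

In a Dymos problem, the timeseries output of interest would be connected into 'ks.g' (reshaping with src_indices as needed) and 'ks.KS' used as the objective or as a constrained output.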
When a more precise value of a maximum is needed, it's pretty common to set up a trajectory such that a phase ends when the maximum or minimum is reached.
If the variable whose maximum is being sought is a state, this can be done by adding a boundary constraint on the rate source for that state.
This ensures that the maximum occurs at the first or last node in the phase (depending on whether it's an initial or final boundary constraint). That lets you more accurately capture its value.
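A sketch of what that might look like, with hypothetical names ('q_dot' standing in for the ODE output that serves as the rate source of the state whose maximum is sought):

# Forcing the state's rate to zero at the end of the phase places the
# maximum (or minimum) of that state at the final node.
phase.add_boundary_constraint('q_dot', loc='final', equals=0.0)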
If the variable being sought is not a state, it's possible to use the polynomials that are used for fitting states and controls in a phase to interpolate the variable of interest. By then taking the time derivative of that polynomial we can get a reasonably good approximation for its rate. The master branch of Dymos has a method add_timeseries_rate_output that does this. And soon, within a few weeks hopefully, we'll add add_boundary_rate_constraint so that these interpolated rates can be easily used as boundary constraints.
In the meantime, you should be able to achieve this by adding the timeseries rate output and then manually applying the OpenMDAO method 'add_constraint' to the resulting timeseries output, using either indices=[0] or indices=[-1] to treat it as an initial or final constraint.
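Something along these lines, with hypothetical path names (the exact timeseries output name for the rate may differ in your model):

# Expose the interpolated rate of 'q_c' in the timeseries, then pin it to
# zero at the last node with a plain OpenMDAO constraint.
phase.add_timeseries_rate_output('q_c')
p.model.add_constraint('traj.phase0.timeseries.q_c_rate', indices=[-1], equals=0.0)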
This is a common enough request that we'll add some documentation on how to achieve this behavior using both the KSComp approach and the boundary constraint approach.
Personally I'm not as much of a fan of the KSComp approach, because I've had trouble getting those types of objectives to converge in the past. I've used the slack-variable approach and that has worked well. In the following example, we take a guess at the rotor power in the static analysis, and then we run a trajectory and get the actual rotor power during the mission. The objective was to minimize aircraft weight, so a large amount of power in statics costs more weight. The constraint shown below prevents us from decreasing our updated guess of rotor power in statics below the maximum power required during the trajectory.
import numpy as np
import openmdao.api as om

# Compare the rotor power computed along the trajectory against the static
# (sizing) guess; the constraint below forces the static guess to stay at or
# above the power required at every timeseries node.
p.model.add_subsystem(
    'static_power_check',
    om.ExecComp('Power_check = Power_ODE - Power_statics',
                Power_check={'value': np.ones(nn_timeseries_main_tx), 'units': 'kW'},
                Power_ODE={'value': np.ones(nn_timeseries_main_tx), 'units': 'kW'},
                Power_statics={'value': 0.0, 'units': 'kW'}),
    promotes_inputs=[('Power_ODE', 'hop0.main_phase.timeseries.Power_R'),
                     ('Power_statics', 'Power_{rotor,slack}')],
    promotes_outputs=['Power_check'])

p.model.add_constraint('Power_check', upper=0, ref=1)
The constraint on the slack variable effectively helped us ensure that our slack rotor power matched the maximum rotor power during the mission. This allowed us to get the right sizes for the rotor parts (i.e. motors).
I was hoping to get some information on how to set my defect refs in Dymos in a smart way. I found the following notes on scaling here: https://github.com/hweyandtnasa/scaling-tutorial but it lists defect scaling in Dymos as a TODO. Should I just set them equal to the ref value for the state they pertain to?
Scaling pseudospectral optimal control problems is tricky. If you can get a copy of John Betts' Practical Methods for Optimal Control and Estimation Using Nonlinear Programming, I highly recommend it. Betts suggests using the same scaling for both the state design-variable values and the defects. This is often a good rule of thumb but, as with most approaches to scaling, it isn't universal. The collocation "defects", which dictate whether the dynamics are physically correct, are just the difference between the slope of the approximating polynomial and the computed equations of motion.
In situations where the state values are large but tiny rates of change are significant, different scaling is warranted in my experience. Examples of states where this can be true are aircraft range or spacecraft orbital elements. Just recently we had a situation where a low-thrust orbit transfer of a spacecraft wasn't matching the physics. The semi-latus rectum, for instance, is typically measured in km (so on the scale of thousands when in Earth orbit). In the units being used, a "significant" difference in the defect was less than 1E-6 (the feasibility threshold being used). In this case, the problem was solved by bumping the defect_scaler up a few orders of magnitude (equivalent to bumping the defect_ref down a few orders of magnitude).
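As an illustration of that kind of adjustment (hypothetical state name, units, and values), the state and defect scaling can be set independently when setting the state options:

# State values are of order 1E4 km, but defect differences of order 1E-2 are
# already significant, so the defect is scaled much more tightly than the state.
phase.set_state_options('p', units='km', ref=1.0E4, defect_ref=1.0E-2)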
I'd also recommend this paper from Ross, Gong, Karpenko, and Proulx. It lays out some good rules of thumb and has an approachable example in the brachistochrone. It references costates a lot; Dymos doesn't provide automatic costate estimation yet, but they are closely related to the Lagrange multipliers of the problem, which are printed in the pyOptSparse output if you use SNOPT.
The GitHub repo you pointed out was the work of an intern and was based around this scaling method developed by Sagliano. We found it to work well in many situations, but it's also not a panacea.
Ultimately we want some automatic scaling options in Dymos and/or OpenMDAO, but we're not sure when they might find their way into the framework. Our past work has typically tied scaling approaches more tightly to the equations of motion, and Dymos is designed to be more general in that the user can supply whatever EOM they choose.
In Dymos, if you leave the defect_ref value unset when you call set_state_options, then the default behavior is to make the defect_ref equal to the ref value. Here is why that is done:
Defects are the differences between the computed state rate from the polynomial interpolation function and the actual state rate computed by the ODE.
As you can see here:
defect = (f_approx-f_computed) * dt_dstau
the dt_dstau just adjusts things into a normalized time space called tau, but it also multiplies by the time unit (tau itself is dimensionless). That means the defects are computed in the same units as the states themselves, so a reasonable first guess for scaling is to match the scaling between the states and the defects. As Rob Falck's answer points out, that is not always the right solution, but it's a good starting point.
I am totally new to NNs and want to classify almost 6000 images that belong to different games (gathered by IR). I used the steps introduced in the following link, but I get the same training accuracy in each round.
Some info about the NN architecture: 2 convolutional, activation, and pooling layers. Activation type: ReLU. The numbers of filters in the first and second layers are 30 and 70, respectively.
2 fully connected layers with 500 and 2 nodes, respectively.
http://firsttimeprogrammer.blogspot.de/2016/07/image-recognition-in-r-using.html
I had a similar problem, but for regression. After trying several things (different optimizers, varying layers and nodes, learning rates, iterations, etc.), I found that the way the initial values are given helps a lot. For instance, I used a random initializer with a variance of 0.2 (initializer = mx.init.normal(0.2)).
I came upon this value in this blog. I recommend you read it. [EDIT] An excerpt from the same:
Weight initialization. Worry about the random initialization of the weights at the start of learning.
If you are lazy, it is usually enough to do something like 0.02 * randn(num_params). A value at this scale tends to work surprisingly well over many different problems. Of course, smaller (or larger) values are also worth trying.
If it doesn’t work well (say your neural network architecture is unusual and/or very deep), then you should initialize each weight matrix with the init_scale / sqrt(layer_width) * randn. In this case init_scale should be set to 0.1 or 1, or something like that.
Random initialization is super important for deep and recurrent nets. If you don’t get it right, then it’ll look like the network doesn’t learn anything at all. But we know that neural networks learn once the conditions are set.
Fun story: researchers believed, for many years, that SGD cannot train deep neural networks from random initializations. Every time they would try it, it wouldn’t work. Embarrassingly, they did not succeed because they used the “small random weights” for the initialization, which works great for shallow nets but simply doesn’t work for deep nets at all. When the nets are deep, the many weight matrices all multiply each other, so the effect of a suboptimal scale is amplified.
But if your net is shallow, you can afford to be less careful with the random initialization, since SGD will just find a way to fix it.
You’re now informed. Worry and care about your initialization. Try many different kinds of initialization. This effort will pay off. If the net doesn’t work at all (i.e., never “gets off the ground”), keep applying pressure to the random initialization. It’s the right thing to do.
http://yyue.blogspot.in/2015/01/a-brief-overview-of-deep-learning.html
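In plain NumPy terms (a sketch of the two schemes quoted above with an arbitrary layer size, rather than the blog's or mxnet's actual code):

import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 500, 100                        # hypothetical layer dimensions

# "Lazy" scheme: a small fixed scale, usually fine for shallow nets.
W_small = 0.02 * rng.standard_normal((fan_in, fan_out))

# Scaled scheme for deep/unusual nets: init_scale / sqrt(layer_width).
init_scale = 1.0
W_scaled = init_scale / np.sqrt(fan_in) * rng.standard_normal((fan_in, fan_out))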
I have been trying to make a biologically accurate 2D spatial model of tissue layers, where different physiological processes happen. This includes mainly chemical reactions, diffusion and fluxes over boundaries.
I am making this model in COMSOL Multiphysics, a finite element software package that solves different physics like reaction-diffusion systems, although for my question this might not be really relevant.
In my geometry, I have really small regions between the cells of the tissue layers. These regions serve as openings where diffusion can take place between the cells (junctions). The mesh quality is not great there, and if I want to improve it (mainly by introducing more elements), my simulation time increases drastically. The lower-quality mesh also causes convergence to take longer. I added a picture of the geometry to give an idea. I tried different meshes, all with different element qualities and with the number of elements ranging from 16000 to 50000.
My background in FEM is really limited and I wanted to know if I can tackle this problem in such a way that it
doesn't negatively affect the biology (keep the tissue domain sizes/problem etc as biologically accurate as possible),
doesn't increase the simulation time drastically,
gives a better mesh quality.
So I really want to know what the best way to go is, since I have already thought of some things.
Can I go with the lower-quality mesh (which is not really bad, but not good either), so that I can keep the small regions for optimal biological accuracy and have a relatively small computation time (and hope I won't run into convergence errors)?
But maybe there are possibilities that I am missing. For instance, is it possible to make the small domain bigger and then apply some kind of correction factor to the diffusion rates? In other words, if I make the domain twice as large, do I halve the diffusion rate? Is that even consistent with the chemical/physical laws?
Hopefully I made the problem a bit clear and thank you greatly in advance for the help.
Cheers,
Mesh of the tissue model
I know this thread was posted some months back but I am unsure if you found a solution.
In order to find the relationship between accuracy and computational time, you should run a mesh (convergence) analysis on your model and see how the mesh size directly affects the results you expect to obtain (pore pressure, fluid velocity, strain, etc.). This will allow you to determine the most appropriate meshing strategy for your specific problem.
Also, keep in mind that the diffusion rate of a material will depend on the pore size and the permeability (by means of Darcy's law). So, depending on the assumptions you are making for the implementation of your constitutive law and on your problem's boundary conditions, you might be able to simplify/enlarge some of the smaller domains in your model, so long as they remain within your previously made assumptions.
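A tiny helper along those lines (plain Python/NumPy, with the quantity-of-interest values to be filled in from your own COMSOL runs) might look like:

import numpy as np

def mesh_convergence(element_counts, qoi_values, tol=0.01):
    # Relative change in the quantity of interest (QoI) between successive
    # refinements; the mesh is effectively converged once it drops below tol.
    qoi = np.asarray(qoi_values, dtype=float)
    rel_change = np.abs(np.diff(qoi)) / np.abs(qoi[1:])
    for n, dq in zip(element_counts[1:], rel_change):
        status = 'converged' if dq < tol else 'not converged'
        print(f'{n:>8d} elements: relative change {dq:.3%} ({status})')

# Example call with QoI values recorded from successive COMSOL meshes:
# mesh_convergence([16000, 25000, 40000, 50000], [qoi_1, qoi_2, qoi_3, qoi_4])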
I'm trying to design a nonlinear fitness function where I maximize variable A and minimize variable B. The issue is that maximizing A is much more important at single-digit values, almost logarithmic. B needs to be minimized and, in contrast to A, it becomes less important when small (less than one) and more important when it's larger (greater than 1), so exponential decay.
The main goal is to optimize A, so I guess an analog is A=profits, B=costs
Should I aim to keep everything positive so that I can use roulette wheel selection, or would it be better to use a rank/tournament kind of system? The purpose of my algorithm is shape optimization.
Thanks
When considering a multi-objective problem, the goal is usually to identify all solutions that lie on the Pareto curve - the Pareto-optimal set. Have a look here for a 2-dimensional visual example. When the algorithm completes, you want a set of solutions that are not dominated by any other solution. You therefore need to define a Pareto ranking mechanism to take into account both objectives - for a more in-depth explanation, as well as links to further reading, go here.
With this in mind, in order to effectively explore all solutions along the Pareto curve, you do not want an implementation that encourages premature convergence; otherwise your algorithm will only explore the search space in one specific area of the Pareto curve. I would implement a selection operator that keeps all members of each iteration's optimal set of solutions, that is, all solutions which are not dominated by another, plus a parameter-controlled percentage of other solutions. This way you encourage exploration all along the Pareto curve.
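A minimal sketch of such a non-dominance filter for this two-objective case (maximize A, minimize B; plain Python/NumPy, independent of any particular EA library):

import numpy as np

def non_dominated(points):
    # points: array-like of shape (n, 2) with columns (A, B);
    # A is to be maximized, B is to be minimized.
    pts = np.asarray(points, dtype=float)
    keep = []
    for i, (a_i, b_i) in enumerate(pts):
        dominated = np.any((pts[:, 0] >= a_i) & (pts[:, 1] <= b_i) &
                           ((pts[:, 0] > a_i) | (pts[:, 1] < b_i)))
        if not dominated:
            keep.append(i)
    return keep

# e.g. non_dominated([[3.0, 1.0], [2.0, 0.5], [1.0, 2.0]]) -> [0, 1]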
You also need to ensure your mutation and crossover operators are tuned correctly too. With any novel application of Evolutionary Algorithms, part of the problem is trying to identify an optimal parameter set for the problem domain... this is where it gets really interesting!!
The description is very vague, but assuming that you actually have an idea of what the function should look like and you're just wondering whether you need to modify it so that proportional selection can be used easily, then no. Regardless of fitness function, you should probably default to using something like tournament selection. Controlling selection pressure is one of the most important things you have to do in order to get consistently good results, and roulette wheel selection doesn't allow you that control. You typically get enormous pressure very early, which drives premature convergence. That might be preferable in a few cases, but it's not where I'd start my investigations.
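For reference, tournament selection itself is only a few lines, and the tournament size k is the knob that controls selection pressure (a rough sketch, not tied to any particular GA framework):

import random

def tournament_select(population, fitness, k=2):
    # Pick k random individuals and return the fittest; larger k means
    # higher selection pressure.
    contenders = random.sample(range(len(population)), k)
    best = max(contenders, key=lambda i: fitness[i])
    return population[best]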