Strange behavior when implementing backpropagation in a DBN - math

I'm currently trying to implement a Deep Belief Network (DBN), but I've run into a very strange problem. My source code can be found here: https://github.com/mistree/GoDeep/blob/master/GoDeep/
I first implemented the RBM using CD, and it works perfectly (using Go's concurrency features it's quite fast). Then I started to implement a normal feed-forward network with backpropagation, and that's where the strange behavior begins. It seems very unstable: when I run it on an XOR-gate test it sometimes fails; only when I set the hidden layer to 10 or more nodes does it never fail. Below is how I calculate it:
Step 1: calculate all the activations, including the bias
Step 2: calculate the output error
Step 3: back-propagate the error to each node
Step 4: calculate the delta weight and bias for each node, with momentum
I run Steps 1 to 4 over a full batch and sum up these delta weights and biases, then
Step 5: apply the averaged delta weights and biases (see the sketch below)
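Roughly, the batch update in Steps 4 and 5 looks like this simplified sketch (not the exact code from the repository; the names are made up):
// Simplified sketch of Steps 4-5: average the gradients accumulated over the
// batch and apply them with momentum.
func applyBatchUpdate(w, b, gradWSum, gradBSum, prevDW, prevDB []float64,
	batchSize int, lr, momentum float64) {
	n := float64(batchSize)
	for i := range w {
		dw := momentum*prevDW[i] - lr*gradWSum[i]/n // momentum term + averaged gradient step
		w[i] += dw
		prevDW[i] = dw
	}
	for i := range b {
		db := momentum*prevDB[i] - lr*gradBSum[i]/n
		b[i] += db
		prevDB[i] = db
	}
}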
I followed the tutorial here http://ufldl.stanford.edu/wiki/index.php/Backpropagation_Algorithm
And normally it works if I give it more hidden layer nodes. My test code is here https://github.com/mistree/GoDeep/blob/master/Test.go
So I think it should work, and I started to implement the DBN by combining the RBM and the normal NN. However, the result becomes really bad: it can't even learn an XOR gate in 1000 iterations, and sometimes goes totally wrong. To debug this, after the pre-training of the DBN I do a reconstruction. Most times the reconstruction looks good, but the backpropagation fails even when the pre-training result is perfect.
I really don't know what's wrong with the backpropagation. I must have misunderstood the algorithm or made some big mistake in the implementation.
If possible, please run the test code and you'll see how weird it is. The code itself is quite readable. Any hint would be a great help. Thanks in advance.

I remember Hinton saying you can't train RBMs on XOR; something about the vector space doesn't allow a two-layer network to model it. Deeper networks are less linear, which is what lets them handle it.

Related

Coin-flipping simulation with gain/loss

Suppose you successively toss a fair coin; each time the result is heads you win $1, while if you get tails you lose $1. Your initial capital is $3. The throws stop if your capital reaches zero or $10. Let X_n be the process that describes your capital at the nth throw.
1. Simulate the X_n process 1000 times and present the graph of its evolution, using R.
2. Estimate the average number of consecutive throws until you stop. Is the result expected?
Can someone help me solve this or at least understand the steps I am supposed to take?
Someone already posted a link to a solution of your homework in the comments. I fear, however, that this uncommented code will be incomprehensible to you, given that you asked the question in the first place.
I would therefore suggest first writing your own implementation: an outer for loop over the runs and an inner while loop conditioned on the running capital; call rbinom on each throw and update the running capital. Store the number of throws of each run in a numeric vector and call mean on that vector.
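A rough sketch of that structure (the variable names are my own, and this is deliberately the slow, loop-based version):
set.seed(1)
n_runs <- 1000
throws <- numeric(n_runs)                 # throws until stopping, per run
for (i in seq_len(n_runs)) {
  capital <- 3
  n <- 0
  while (capital > 0 && capital < 10) {
    capital <- capital + if (rbinom(1, 1, 0.5) == 1) 1 else -1
    n <- n + 1
  }
  throws[i] <- n
}
mean(throws)                              # estimated average number of throws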
It will start becoming interesting when you measure the runtime of your solution, which will be surprisingly slow. To speed it up, you must use "vectorization", which the linked solution uses, but that is a completely different topic, best left for another lesson...

How to decide which mode to use for 'kaiming_normal' initialization

I have read several codebases that do layer initialization using PyTorch's nn.init.kaiming_normal_(). Some use the fan-in mode, which is the default. Of the many examples, one can be found here and is shown below.
init.kaiming_normal(m.weight.data, a=0, mode='fan_in')
However, sometimes I see people using the fan-out mode, as seen here and shown below.
if isinstance(m, nn.Conv2d):
    nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
Can someone give me some guidelines or tips to help me decide which mode to select? Furthermore, I am working on image super-resolution and denoising tasks in PyTorch; which mode would be more beneficial for those?
According to documentation:
Choosing 'fan_in' preserves the magnitude of the variance of the
weights in the forward pass. Choosing 'fan_out' preserves the
magnitudes in the backwards pass.
and according to Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification - He, K. et al. (2015):
We note that it is sufficient to use either Eqn.(14) or
Eqn.(10)
where Eqn. (10) and Eqn. (14) correspond to fan_in and fan_out respectively. Furthermore:
This means that if the initialization properly scales the backward
signal, then this is also the case for the forward signal; and vice
versa. For all models in this paper, both forms can make them converge
So, all in all, it doesn't matter much; it's more about what you are after. I assume that if you suspect your backward pass might be more "chaotic" (greater variance), it is worth changing the mode to fan_out. This might happen when the loss oscillates a lot (e.g. very easy examples followed by very hard ones).
The correct choice of nonlinearity is more important, where nonlinearity is the activation you use after the layer you are currently initializing. The current defaults set it to leaky_relu with a=0, which is effectively the same as relu. If you are using leaky_relu, you should change a to its slope.
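For example, an initialization routine along these lines (just a sketch; the layer sizes and the 0.2 slope are placeholders) keeps nonlinearity, and a in the leaky_relu case, in sync with the activation that actually follows each layer:
import torch.nn as nn

def init_weights(m):
    # Sketch: match nonlinearity (and a, for leaky_relu) to the activation
    # that actually follows each layer in your network.
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='relu')
        if m.bias is not None:
            nn.init.zeros_(m.bias)
    elif isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, a=0.2, nonlinearity='leaky_relu')
        if m.bias is not None:
            nn.init.zeros_(m.bias)

model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(64 * 8 * 8, 10), nn.LeakyReLU(0.2),
)
model.apply(init_weights)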

IterativeClosestPoint with pcl does not give expected results

I have two point clouds. To match them I try to do a registration with ICP. The point clouds are not super similar, but I want to at least get them very close together.
When using IterativeClosestPoint from the PCL library, this works when I use point cloud A as the source and point cloud B as the target. But it doesn't work when I use B as the source and A as the target; in the latter case it even increases the distance between the two clouds.
Does anyone know what I am doing wrong? Why should there be a difference in performance when changing the source/target?
This is my code:
pcl::IterativeClosestPoint<pcl::PointXYZ, pcl::PointXYZ> icp;
icp.setInputSource(A);
icp.setInputTarget(B);
icp.setMaximumIterations(50);
icp.setTransformationEpsilon(1e-8);
icp.setEuclideanFitnessEpsilon(1);
icp.setMaxCorrespondenceDistance(0.5); // 50cm
icp.setRANSACOutlierRejectionThreshold(0.03);
icp.align(aligned_model_cloud);
I'd be happy for any ideas and input.
Edit: here are the two clouds
Cloud A
Cloud B
Update:
I tried my code using Cloud A as the source and Cloud A* as the target, where Cloud A* is a copy of Cloud A with just a translation along the x-axis. I did the same experiment with Cloud B, and both were able to converge successfully in ICP.
But as soon as I use Cloud B as the source and Cloud A as the target, it doesn't work anymore and converges after moving the cloud only a tiny bit (even in the wrong direction). I checked the convergence criteria and found that it is CONVERGENCE_CRITERIA_REL_MSE (when transformationEpsilon is almost zero). I tried reducing the relative MSE with
icp.getConvergeCriteria()->setRelativeMSE(1e-15), but this didn't succeed. When checking the value of the relative MSE after converging I get something like -124034642, which doesn't make any sense to me at all.
Update 2: I first moved the clouds quite close together manually, without ICP. When doing this, ICP works fine.
Update 3: I am using FPFH for an initial estimate and ICP afterwards. Doing it like this works too.
This question is old and OP has already found a solution, but I'll just explain in case OP and someone find it useful.
First, ICP works by iteratively estimating correspondences between the two clouds and then minimizing the overall distance between them. ICP estimates correspondences using closest-point data association (hence the name Iterative Closest Point).
And as you may know, the closest-neighbor graph is directed. That is to say, if point A has B as its closest neighbor, point B might not have A as its closest neighbor, since some other point C may be closer to B than A is!
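To see this concretely, here is a tiny standalone example, nothing to do with PCL: with points at 0, 3 and 4 on a line, the closest neighbor of 0 is 3, but the closest neighbor of 3 is 4, not 0.
#include <cmath>
#include <cstdio>
#include <vector>

// Closest neighbor of p among pts, excluding p itself.
double closest(double p, const std::vector<double>& pts) {
    double best = (pts[0] == p) ? pts[1] : pts[0];
    for (double q : pts)
        if (q != p && std::fabs(q - p) < std::fabs(best - p)) best = q;
    return best;
}

int main() {
    std::vector<double> pts = {0.0, 3.0, 4.0};
    std::printf("closest to 0: %g\n", closest(0.0, pts)); // 3  (0 -> 3)
    std::printf("closest to 3: %g\n", closest(3.0, pts)); // 4  (3 -> 4, not back to 0)
}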
Since ICP uses closest-point data association to estimate correspondences between the two clouds, specifying A as the source will produce a different correspondence set than specifying B. That explains the differences you observed.
Usually the difference is small and you may not notice it after ICP. But in your case, I found the two clouds you provided to be too different (one is much larger than the other), so the relation becomes very asymmetric.
If you want to ensure the result is symmetric, you can change the data-association step (PCL might provide an option to do that) so that closest-point correspondences come from both clouds. (This is just a variant of the classic ICP; for more information you can see my other answer.)

Using results from ODEProblem while it is running

I'm currently studying the documentation of DifferentialEquations.jl and trying to port my older computational neuroscience code to use it instead of my own, less elegant and less performant, ODE solvers. While doing this, I stumbled upon the following question: is it possible to access and use the results returned by the solver as soon as the current step is computed (instead of waiting for the whole problem to finish)?
I'm looking for a way to, e.g., plot the voltage levels of a simulated neuron in real time, which seems like a simple enough task and one that's probably trivial to do with existing Julia packages, but I can't figure out how. Does it have anything to do with callbacks? Thanks in advance.
Plots.jl doesn't seem to be animating for me right now, but I'll show you the steps anyway. Yes, you can use a DiscreteCallback for this: if you make condition(u,t,integrator)=true, then affect! is called on every step, and you could do your plotting there.
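For completeness, here's a bare-bones sketch of that callback approach on a toy scalar ODE (just printing instead of plotting):
using DifferentialEquations

condition(u, t, integrator) = true                 # fire on every accepted step
affect!(integrator) = println(integrator.t, "  ", integrator.u)  # or push! to a buffer, update a plot, ...
cb = DiscreteCallback(condition, affect!)

f(u, p, t) = -u
prob = ODEProblem(f, 1.0, (0.0, 1.0))
sol = solve(prob, Tsit5(), callback = cb)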
But, I think using the integrator interface is perfect for this case. Let me show you an example of this. Take the 2D problem from the tutorial:
using DifferentialEquations
using Plots
A = [1. 0 0 -5
4 -2 4 -3
-4 0 0 1
5 -2 2 3]
u0 = rand(4,2)
tspan = (0.0,1.0)
f(u,p,t) = A*u
prob = ODEProblem(f,u0,tspan)
Now instead of using solve, use init to get an integrator out.
integrator = init(prob,Tsit5())
The integrator interface is defined in full on its documentation page, but the basic usage is that you can step using step!. If you put that in a loop and keep stepping, then that's essentially what solve does. But it also has an iterator interface, so if you do something like for integ in integrator, then inside the for loop integ will be the current state of the integrator, with value integ.u at time point integ.t. It also has all sorts of extras, like a plot recipe and intermediate interpolation integ(t) (the interpolation works even when dense=false because it's free and doesn't require extra saving allocations, so feel free to use it).
So, you can do
p = plot(integrator, markersize=0, legend=false, xlims=tspan)
anim = @animate for integ in integrator
    plot!(p, integ, lw=3)
end
plot(p)
gif(anim, "test.gif", fps = 2)
and Plots.jl will give you the animated gif that adds the current interval at each step. Here's what the end plot looks like:
Each step is colored differently because each iteration added a new plot series, so you can see how the solution continued. Of course, you can do anything inside that loop, or if you want more control you can manually call step!(integrator) as necessary.
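And if you want the manual version, here's a rough sketch (re-initializing, since the loop above already ran the first integrator to completion):
integrator2 = init(prob, Tsit5())
while integrator2.t < tspan[2]
    step!(integrator2)
    # use the current state here however you like, e.g. push the values to a
    # live plot or write them to a file
    @show integrator2.t, integrator2.u
end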

Why does lsoda (in R) fail to complete the full run duration, with warning messages?

I am writing a numerical model in R, for an ecological system, and solving it using "lsoda" from package deSolve.
My model has 14 state variables.
I define the model, set it up fine, and give time duration according to this:
nyears<-60
ndays<-nyears*365+1
times<-seq(0,nyears*365,by=1)
The rates of change of the state variables (e.g. the rate of change of variable "A1" is "dA1") are calculated from the current values of the state variables (at time = t) and a set of parameters.
Simplified example:
dA1<-Tf*A1*(ImaxA*p_sub)
Where Tf, ImaxA and p_sub are parameters, and A1 is my state variable at time=t.
When I solve the model, I use the lsoda solver like this:
out<-as.data.frame(lsoda(start,times,model,parms))
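For context, a stripped-down toy version of this setup (only two state variables instead of my 14, with made-up parameter values) would look like this:
library(deSolve)

parms <- c(Tf = 0.9, ImaxA = 0.05, p_sub = 0.5, m = 0.01)
start <- c(A1 = 1, A2 = 0.5)

model <- function(t, state, parms) {
  with(as.list(c(state, parms)), {
    dA1 <- Tf * A1 * (ImaxA * p_sub) - m * A1
    dA2 <- m * A1 - 0.02 * A2
    list(c(dA1, dA2))   # derivatives in the same order as the state vector
  })
}

times <- seq(0, 365, by = 1)
out <- as.data.frame(lsoda(start, times, model, parms))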
Sometimes (depending on the parameter combination) the model run completes over the entire duration I have specified; however, sometimes it stops short of the mark (still giving me output up until the solver "crashes"). When it "crashes", this message is displayed:
DLSODA- At current T (=R1), MXSTEP (=I1) steps
taken on this call before reaching TOUT
In above message, I1 = 5000
In above message, R1 = 11535.5
Warning messages:
1: In lsoda(start, times, model, parms) :
an excessive amount of work (> maxsteps ) was done, but integration was not successful - increase maxsteps
2: In lsoda(start, times, model, parms) :
Returning early. Results are accurate, as far as they go
This commonly appears when one of the state variables is growing exponentially or is tending very close to zero; however, it sometimes crashes when seemingly not much change is happening. I may be wrong, but is it due to the rates of change of the state variables becoming too large? If so, why might it also "crash" when there is no fast rate of change?
Is there a way I can make the solver complete its run with the specified parameter values, maybe with a more relaxed error tolerance or a higher step limit?
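Something along these lines is what I had in mind (deSolve's lsoda does accept rtol, atol and maxsteps arguments), though I don't know whether simply raising them is advisable:
out <- as.data.frame(lsoda(start, times, model, parms,
                           rtol = 1e-6, atol = 1e-8, maxsteps = 50000))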
Thank you all for your contributions. I looked at some of the rates and found that, at the point of crashing, the model was switching between two metabolic states, and the abruptness of this binary switch caused the solver to stop, rejecting the solution because the rate of change was too large. I have fixed my model by introducing a gradual switch between states (with a logistic curve) instead of the binary switch, roughly as sketched below. I acknowledge that I didn't give enough info in the original question, so thanks for the help you offered!
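Schematically, the change looked something like this (illustrative names, not my actual equations):
# before: hard on/off switch between the two metabolic states
# s <- as.numeric(A1 > threshold)

# after: gradual logistic switch, which keeps the rates smooth
smooth_switch <- function(x, threshold, k = 50) 1 / (1 + exp(-k * (x - threshold)))

# inside the model function:
# s   <- smooth_switch(A1, threshold)
# dA1 <- s * rate_state_1 + (1 - s) * rate_state_2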

Resources