Cost function of Convolutional Neural Network not serving its intended purpose - math

So I have built a CNN and now I am trying to get the training of my network to work effectively despite my lack of a formal education on the topic. I have decided to use stochastic gradient descent with a standard mean squared error cost function. As stated in the title, the problem seems to lie within the cost function.
When I use a couple of training examples, I calculate the mean squared error for each, take the mean of those, and use that as the full error. There are two output neurons, one for face and one for not a face; whichever is higher is the class that is yielded. Essentially, if a training example yields the wrong classification, I calculate the error (with the desired value being the value of the class that was yielded).
Example:
Input an image of a face--->>>
Face: 500
Not face: 1000
So in this case, the network says that the image isn't a face, when in fact it is. The error comes out to:
500 - 1000 = -500
(-500)^2 = 250000 <<-- error
(correct me if I'm doing anything wrong)
As you can see the desired value is set to the value of the incorrect class that was selected.
Now this is all good (from what I can tell), but here is my issue:
As I perform backprop on the network multiple times, the mean cost over the entire training set falls to 0, but this is only because all of the weights in the network are going to 0, so all class outputs always become 0.
After training:
Input not face->
Face: 0
Not face: 0
--note that if the classes are the same, the first one is selected
(0 - 0)^2 = 0 <<-- error
So the error is being minimized to 0 (which is good I guess), but obviously not the way we want.
So my question is this:
How do I reduce the gap between the classes when the classification is wrong, while also getting the network to overshoot the incorrect class so that the correct class is yielded?
//example
Had this: (for input of face)
Face: 100
Not face: 200
Got this:
Face: 0
Not face: 0
Want this: (or something similar)
Face: 300
Not face: 100
I hope this question wasn't too vague...
But any help would be much appreciated!!!

The way you're computing the error doesn't correspond to the standard mean squared error. But even if you were to fix it, it makes more sense to use a type of output and error function specifically designed for classification problems.
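For comparison, here is a minimal NumPy sketch of the standard mean squared error with a one-hot target (the [1, 0] encoding and the numbers are purely illustrative, not something from the question):

import numpy as np

outputs = np.array([500.0, 1000.0])   # network outputs for [face, not face]
target  = np.array([1.0, 0.0])        # one-hot target: the image really is a face

mse = np.mean((outputs - target) ** 2)
print(mse)   # ((500 - 1)^2 + (1000 - 0)^2) / 2 = 624500.5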
One option is to use a single output unit with a sigmoid activation function. This neuron would output the probability that the input image is a face. The probability that it's not a face is given by 1 minus this value. This approach will work for binary classification problems. Softmax outputs are another option. You'd have two output units: the first outputs the probability that the input image is a face, and the second outputs the probability that it's not a face. This approach will also work for multi-class problems, with one output unit for each class.
In either case, use the cross entropy loss (also called log loss). Here, you have a target value (face or no face), which is the true class of the input image. The error is the negative log probability that the network assigns to the target.
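Here is a minimal NumPy sketch of the softmax-plus-cross-entropy option (the logit values are made up):

import numpy as np

def softmax(z):
    z = z - np.max(z)                 # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, -1.0])        # raw network outputs for [face, not face]
probs = softmax(logits)               # approx. [0.953, 0.047]

target_class = 0                      # the true class is "face"
loss = -np.log(probs[target_class])   # cross entropy / log loss, approx. 0.049
print(probs, loss)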
Most neural nets that perform classification work this way. You can find many good tutorials here, and read this online book.

Related

Error: Required number of iterations = 1087633109 exceeds iterMax = 1e+06 ; either increase iterMax, dx, dt or reduce sigma

I am getting this error, and this post tells me that I should decrease sigma, but here is the thing: this code was working fine a couple of months ago. Nothing has changed in the data or the code. I was wondering why this error appeared out of the blue.
And a second point: when I lower sigma to, say, 13.1, it looks like it is running (but I have been waiting for an hour).
sigma=203.9057
dimyx1=1024
A22den=density(Lnetwork,sigma,distance="path",continuous=TRUE,dimyx=dimyx1) #
About Lnetwork
Point pattern on linear network
69436 points
Linear network with 8417 vertices and 8563 lines
Enclosing window: rectangle = [143516.42, 213981.05] x [3353367, 3399153] units
Error: Required number of iterations = 1087633109 exceeds iterMax = 1e+06 ; either increase iterMax, dx, dt or reduce sigma
This is a question about the spatstat package.
The code for handling data on a linear network is still under active development. It has changed in recent public releases of spatstat, and has changed again in the development version. You need to specify exactly which version you are using.
The error report says that the required number of iterations of the algorithm is too large. This occurs because either the smoothing bandwidth sigma is too large, or the spacing dx between sample points along the network is too small. The number of iterations is proportional to (sigma/dx)^2 in most cases.
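As a rough illustration of that scaling (a sketch in Python rather than spatstat code; the proportionality constant is not specified, so only ratios of iteration counts are meaningful):

# iterations grow roughly like (sigma / dx)^2, so halving sigma
# (or doubling dx) cuts the required iterations by about a factor of 4
def iteration_ratio(sigma_old, dx_old, sigma_new, dx_new):
    return (sigma_new / dx_new) ** 2 / (sigma_old / dx_old) ** 2

print(iteration_ratio(203.9, 1.0, 101.95, 1.0))  # -> 0.25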
First, check that the value of sigma is physically reasonable.
Normally you shouldn't have to worry about the algorithm parameter dx because it is determined automatically by default. However, it's possible that your data are causing the code to choose a very small value of dx.
The internal code which automatically determines the spacing dx of sample points along the network has been changed recently, in order to fix several bugs.
I suggest that you specify the algorithm parameters manually. See the help file for densityHeat for information on how to control the spacings. Setting the parameters manually will also ensure greater consistency of the results between different versions of the software.
The quickest solution is to set finespacing=FALSE. This is not the best solution because it still uses some of the automatic rules which may be giving problems. Please read the help file to understand what that does.
Did you update spatstat since this last worked? Probably the internal code for determining spacing on the network etc. changed a bit. The actual computations are done by the function densityHeat(), and you can see how to manually set spacing etc. in its help file.

What if the FD steps varied w.r.t. output/input?

I am using the finite difference scheme to find gradients.
Let's say I have 2 outputs (y1, y2) and 1 input (x) in a single component, and I know in advance that the sensitivity of y1 with respect to x is not the same as the sensitivity of y2 to x. Thus I could potentially have two different steps for those, as in:
self.declare_partials(of=y1, wrt=x, method='fd',step=0.01, form='central')
self.declare_partials(of=y2, wrt=x, method='fd',step=0.05, form='central')
There is nothing that stops me (algorithmically), but it is not clear what exactly the OpenMDAO gradient calculation would do in this case.
Does it exchange information between the cases where the steps are different by looking at the step ratios, or does it simply treat them independently and therefore double the computational time?
I just tested this, and it does the finite difference twice with the two different step sizes, and only saves the requested outputs for each step. I don't think we could do anything with the ratios as you suggested, since the reason for using different step sizes to resolve individual outputs is that you don't trust the accuracy of the outputs at the smaller (or larger) step size.
This is a fair question about the effect of the API. In typical FD applications you would get only 1 function call per design variable for forward and backward difference and 2 function calls for central difference.
However in this case, you have asked for two different step sizes for two different outputs, both with central difference. So here, you'll end up with 4 function calls to compute all the derivatives. dy1_dx will be computed using the step size of .01 and dy2_dx will be computed with a step size of .05.
There is no crosstalk between the two different FD calls, and you do end up with more function calls than you would have if you just specified a single step size via:
self.declare_partials(of='*', wrt=x, method='fd',step=0.05, form='central')
If the cost is something you can bear, and you get improved accuracy, then you could use this method to get different step sizes for different outputs.
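For reference, a minimal OpenMDAO sketch of a component set up this way (the component and its compute function are made up for illustration; only the declare_partials calls mirror the question):

import openmdao.api as om

class TwoOutputs(om.ExplicitComponent):
    """Toy component: y1 and y2 both depend on the single input x."""

    def setup(self):
        self.add_input('x', val=1.0)
        self.add_output('y1', val=0.0)
        self.add_output('y2', val=0.0)

    def setup_partials(self):
        # two separate FD declarations -> two separate central-difference passes
        self.declare_partials(of='y1', wrt='x', method='fd', step=0.01, form='central')
        self.declare_partials(of='y2', wrt='x', method='fd', step=0.05, form='central')

    def compute(self, inputs, outputs):
        outputs['y1'] = 2.0 * inputs['x'] ** 2
        outputs['y2'] = 5.0 * inputs['x']

prob = om.Problem()
prob.model.add_subsystem('comp', TwoOutputs(), promotes=['*'])
prob.setup()
prob.run_model()
# compare the declared FD partials against an independent check
data = prob.check_partials(compact_print=True)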

Why has the author used the following matrices for the following standardisation?

Can somebody tell me why this author has used the following code in their normalisation?
The first line appears fine to me: they have standardised the training set using the following formula:
(x - mean(x)) / std(x)
However, in the second and third lines (validation and test) they have used the train mean (trainme) and train standard deviation (trainstd). Should they not have used the validation mean (validationme) and validation standard deviation (validationstd), along with the test mean and test standard deviation?
You can also view the page from the book at the following link (page 173)
What the authors are doing is reasonable and it's what is conventionally done. The idea is that the same normalization is applied to all inputs. This is essentially allocating some new parameters (offset and scale) and estimating them from the training data. In that scheme, if the value 100 is input, then the normalized value is (100 - offset)/scale, no matter where (training, testing, whatever) that 100 came from.
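A small NumPy sketch of that convention (the data are made up; the names trainme and trainstd follow the question):

import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(50, 10, size=(100, 3))   # stand-ins for the book's data
X_val   = rng.normal(50, 10, size=(20, 3))
X_test  = rng.normal(50, 10, size=(20, 3))

trainme  = X_train.mean(axis=0)    # offset estimated from training data only
trainstd = X_train.std(axis=0)     # scale estimated from training data only

X_train_n = (X_train - trainme) / trainstd
X_val_n   = (X_val   - trainme) / trainstd    # same transform applied to validation
X_test_n  = (X_test  - trainme) / trainstd    # and to test, no re-estimation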
I guess one can also make an argument that the offset and scale should be context dependent in the sense that if you are given a set of data and for some reason the offset and scale are very different from the original training data, maybe what's important is how big each value is relative to the others in the same data set. E.g. maybe you should treat 200 the same as 100, if the scale is twice as big in the data set containing 200.
Whether that data-dependent scaling is reasonable would have to be decided case by case. I don't remember ever having seen it, but it's plausible that it could be the right thing to do in some cases.
By the way, you'll get more interest in general statistical questions at stats.stackexchange.com and/or datascience.stackexchange.com.

Concept of Naive Bayes for demonstration purposes, how to calculate word probabilities

I need to demonstrate the Bayesian spam filter in school.
To do this I want to write a small Java application with a GUI (this isn't a problem).
I just want to make sure that I have really grasped the concept of the filter before starting to write my code. So I will describe what I am going to build and how I will program it, and I would be really grateful if you could give a "thumbs up" or "thumbs down".
Remember: it is for a short presentation, just to demonstrate. It does not have to be performant or anything else ;)
I imagine the program having 2 textareas.
In the first I want to enter a text, for example
"The quick brown fox jumps over the lazy dog"
I then want to have two buttons under this field with "good" or "bad".
When I hit one of the buttons, the program counts the appearances of each word in each section.
So for example, when I enter the following texts:
"hello you viagra" | bad
"hello how are you" | good
"hello drugs viagra" | bad
For words I do not know I assume a probability of 0.5
My "database" then looks like this:
<word>, <# times word appeared in bad message>
hello, 2
you, 1
viagra, 2
how, 0
are, 0
drugs, 1
In the second textarea I then want to enter a text to evaluate if it is "good" or "bad".
So for example:
"hello how is the viagra selling"
The algorithm then takes the whole text apart and looks up, for every word, its probability of appearing in a "bad" message.
This is now where I'm stuck:
If I calculate the probability of a word appearing in a bad message as (# times it appeared in bad messages) / (# times it appeared in all messages), the above text would have probability 0 of being in either category, because:
how never appeared in a bad message, so probability is 0
viagra never appeared in a good message, so probability also 0
When I now multiply the single probabilities, this would give 0 in both cases.
Could you please explain how I calculate the probability for a single word to be "good" or "bad"?
Best regards and many thanks in advance
me
For unseen words you would like to do Laplace smoothing. What does that mean? Having a zero count for some word is counterintuitive, since it implies that the probability of this word is 0, which is false for any word you can imagine :-) Thus you want to add a small but positive probability to every word.
Also, consider using logarithms. Long messages will have many words with probability < 1. When you multiply lots of small floating-point numbers on a computer, you can easily run into numerical issues. To overcome this, you may note that:
log (p1 * ... * pn) = log p1 + ... + log pn
So we have traded n multiplications of small numbers for n additions of relatively large (and negative) ones. You can then exponentiate the result to obtain a probability estimate.
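A tiny Python sketch of the log trick (the per-word probabilities are made up):

import math

word_probs = [0.01, 0.02, 0.005, 0.03]            # p(word | class), illustrative
direct = math.prod(word_probs)                     # can underflow for long messages
log_score = sum(math.log(p) for p in word_probs)   # safe: sum of logs
recovered = math.exp(log_score)                    # back to a probability if needed
print(direct, log_score, recovered)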
UPD: Actually, this is an interesting subtopic for your demo. It shows a drawback of NB (outputting zero probabilities) and a way to fix it. And it's not an ad hoc patch, but the result of applying a Bayesian approach (it's equivalent to adding a prior).
UPD 2: I didn't notice it the first time, but it looks like you got the Naive Bayes concept wrong, especially the Bayesian part of it.
Essentially, NB consists of 2 components:
We use Bayes' rule for the posterior distribution over class labels. This gives us p(class|X) = p(X|class) p(class) / p(X), where p(X) is the same for all classes, so it doesn't have any influence on the ordering of the probabilities. Another way to say the same thing is that p(class|X) is proportional to p(X|class) p(class) (up to a constant). As you may have guessed already, that's where the "Bayes" comes from.
The formula above does not make any model assumptions; it's a law of probability theory. However, it's too hard to apply directly, since p(X|class) denotes the probability of encountering the message X in a class, and we would never have enough data to estimate the probability of every possible message. So here comes our model assumption: we say that the words of a message are independent (which is obviously wrong, hence the method is Naive). This leads us to p(X|class) = p(x1|class) * ... * p(xn|class), where n is the number of words in X.
Now we need to somehow estimate the probabilities p(x|class). Here x is not a whole message, but just a single word. Intuitively, the probability of getting some word from a given class is equal to the number of occurrences of that word in that class divided by the total size of the class: #(word, class) / #(class) (or, we could use Bayes' rule once again: p(x|class) = p(x, class) / p(class)).
Accordingly, since p(x|class) is a distribution over the xs, we need it to sum to 1. Thus, if we apply Laplace smoothing by saying p(x|class) = (#(x, class) + a) / Z, where Z is a normalizing constant, we need to enforce the constraint sum_x p(x|class) = 1, or, equivalently, sum_x (#(x, class) + a) = Z. This gives us Z = #(class) + a * N, where N is the number of distinct words (just the number of words, not their occurrences!).
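Putting the pieces together, here is a minimal Python sketch of the whole scheme on the toy messages from the question (the choice of smoothing constant a = 1 and the handling of unseen words are my assumptions):

import math
from collections import Counter

train = [("hello you viagra", "bad"),
         ("hello how are you", "good"),
         ("hello drugs viagra", "bad")]

counts = {"good": Counter(), "bad": Counter()}   # #(word, class)
class_totals = Counter()                         # #(class): total words per class
class_msgs = Counter()                           # messages per class, for the prior

for text, label in train:
    words = text.split()
    counts[label].update(words)
    class_totals[label] += len(words)
    class_msgs[label] += 1

vocab = {w for c in counts.values() for w in c}  # N = number of distinct words
a = 1.0                                          # Laplace smoothing constant

def log_posterior(text, label):
    # log p(class) + sum_i log p(word_i | class), with Laplace smoothing
    score = math.log(class_msgs[label] / sum(class_msgs.values()))
    for w in text.split():
        p = (counts[label][w] + a) / (class_totals[label] + a * len(vocab))
        score += math.log(p)
    return score

msg = "hello how is the viagra selling"
scores = {c: log_posterior(msg, c) for c in ("good", "bad")}
print(scores, max(scores, key=scores.get))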

Why does lsoda (in R) fail to complete running duration, with warning messages?

I am writing a numerical model in R, for an ecological system, and solving it using "lsoda" from package deSolve.
My model has 14 state variables.
I define the model, set it up fine, and give time duration according to this:
nyears<-60
ndays<-nyears*365+1
times<-seq(0,nyears*365,by=1)
Rates of change of state variables (e.g. the rate of change of variable "A1" is "dA1") are calculated from the existing values of the state variables (at time = t) and a set of parameters.
Simplified example:
dA1<-Tf*A1*(ImaxA*p_sub)
Where Tf, ImaxA and p_sub are parameters, and A1 is my state variable at time=t.
When I solve the model, I use the lsoda solver like this:
out<-as.data.frame(lsoda(start,times,model,parms))
Sometimes (depending on my parameter combination) the model run completes over the entire duration I have specified; however, sometimes it stops short of the mark (still giving me output up until the solver "crashes"). When it "crashes", this message is displayed:
DLSODA- At current T (=R1), MXSTEP (=I1) steps
taken on this call before reaching TOUT
In above message, I1 = 5000
In above message, R1 = 11535.5
Warning messages:
1: In lsoda(start, times, model, parms) :
an excessive amount of work (> maxsteps ) was done, but integration was not successful - increase maxsteps
2: In lsoda(start, times, model, parms) :
Returning early. Results are accurate, as far as they go
It commonly appears when one of the state variables is growing exponentially or is tending very near to zero; however, sometimes it crashes when seemingly not much change is happening. I may be wrong, but is it due to the rate of change of the state variables becoming too large? If so, why might it also "crash" when there is not a fast rate of change?
Is there a way that I can make the solver complete its task with the specified parameter values, maybe with a more relaxed tolerance for error?
Thank you all for your contributions. I looked at some of the rates, and at the point of crashing the model was switching between two metabolic states - the fast rate of this binary switch caused the solver to stop, rejecting the solution because the rate of change was too large. I have fixed my model by introducing a gradual switch between states (with a logistic curve) instead of the binary switch. I acknowledge that I didn't give enough info in the original question, so thanks for the help you offered!
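For reference, a language-agnostic sketch (written in Python; the original model is in R/deSolve) of replacing a hard binary switch with a logistic one, as described above; the threshold and steepness values are made up:

import numpy as np

threshold = 10.0   # state value at which the metabolic switch happens (illustrative)
k = 2.0            # steepness: larger k -> closer to the original hard switch

def hard_switch(a):
    # original formulation: a discontinuous jump that upsets the ODE solver
    return 1.0 if a > threshold else 0.0

def smooth_switch(a):
    # logistic replacement: continuous and differentiable, easier for lsoda
    return 1.0 / (1.0 + np.exp(-k * (a - threshold)))

a_vals = np.linspace(0.0, 20.0, 9)
print([hard_switch(a) for a in a_vals])
print(np.round(smooth_switch(a_vals), 3))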
