Gaussian kernel - math

I was going through an article on Gaussian filters and came across this line: "In theory, the Gaussian distribution is non-zero everywhere." I gave it some thought but couldn't satisfy myself. I would love to hear others' opinions on it. Can someone explain it to me in simple terms?
Thanks in advance.

The density f(x) of a Gaussian distribution is non-zero for every x because of the way its equation is defined.
f(x) will be very close to zero, but still not zero, for large |x|: the exponential of a very large negative number is close to zero but never equal to zero. In a plot of the density, the curve is asymptotic to the x-axis; it approaches zero without ever reaching it.
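A quick numerical check, as a minimal sketch assuming the standard normal density (mean 0, standard deviation 1); the function name gaussian_pdf is only for illustration. It also shows why the statement is qualified with "in theory": in floating-point arithmetic the value eventually underflows to exactly 0.0.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

for x in (0.0, 3.0, 10.0, 30.0):
    # Tiny for large x, but still strictly positive for these inputs.
    print(x, gaussian_pdf(x))

# Mathematically still positive, but exp(-800) is smaller than the smallest
# representable double, so the computed value underflows to 0.0.
print(gaussian_pdf(40.0))
```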

Related

In what sense does K(r) [spatstat] become biased for point patterns with <15 points?

In the help file for the Kest function in spatstat there is a warning section stating:
"The estimator of K(r) is approximately unbiased for each fixed r. Bias increases with r and depends on the window geometry. For a rectangular window it is prudent to restrict the r values to a maximum of 1/4 of the smaller side length of the rectangle. Bias may become appreciable for point patterns consisting of fewer than 15 points."
I would like to know in what sense the estimator of K(r) becomes biased with increasing r and for point patterns with fewer than 15 points?
Any advice on this matter would be greatly appreciated!
I have read the book "Spatial point patterns" (Baddeley et al., 2015) but I can't seem to find the answer there (or in any other literature). I may of course have missed that section of the book, if so please let me know.
I don't know the historical facts about where n=15 comes from, but this is probably related to the fact that the estimate of K(r) is only ratio-unbiased. Typically what we can estimate directly is X(r) = lambda^2*K(r), where lambda is the true intensity of the process. We then use the estimate of this quantity, X_est(r) say, together with an estimate of lambda^2, lambda^2_est say, and estimate K(r) as K_est(r) = X_est(r) / lambda^2_est. Thus the numerator and denominator are unbiased estimates of the right things, but the ratio isn't. The problem is worst when lambda^2 is poorly estimated, i.e., when we have few data points.
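To see how much a ratio of unbiased estimates can be off, here is a stylised sketch. It is not spatstat's estimator: it assumes a Poisson number of points N in a unit-area window, takes N(N-1) as the unbiased estimate of lambda^2, and holds the numerator fixed so that only the effect of the denominator shows. The function names are made up for illustration.

```python
import math

def poisson_pmf(n, lam):
    """P(N = n) for N ~ Poisson(lam), computed on the log scale."""
    return math.exp(-lam + n * math.log(lam) - math.lgamma(n + 1))

def relative_bias_of_inverse(lam, n_max=2000):
    """E[lam^2 / (N (N - 1)) | N >= 2] for N ~ Poisson(lam).

    N (N - 1) is an unbiased estimate of lam^2 (unit-area window), so a
    value of 1.0 would mean its reciprocal is unbiased for 1 / lam^2.
    """
    weighted_sum = 0.0
    mass = 0.0
    for n in range(2, n_max):
        p = poisson_pmf(n, lam)
        weighted_sum += p * lam ** 2 / (n * (n - 1))
        mass += p
    return weighted_sum / mass

for lam in (5, 15, 50, 200):
    # Values above 1.0 mean the reciprocal overestimates 1 / lam^2 on average.
    print(lam, relative_bias_of_inverse(lam))
```

Under these assumptions the factor is well above 1 for a handful of points and approaches 1 as the expected number of points grows.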

In numerical optimizing likelihood function with R, minimum is achieved, but the hessian matrix is not positive semi-definite

Recently, I constructed a statistical model whose negative log-likelihood is to be minimized. There are nine parameters to be estimated (in fact, I want to add two more later). I have tried several optimization methods in R, including optim, GenSA, DEoptim and Solnp, and obtained a satisfactory minimum.
The next step is to compute t-values, for which I need the standard errors:
sqrt(diag(solve(hessian)))
However, an error occurs because the Hessian matrix is not positive semi-definite: negative numbers appear among the main diagonal elements. I have tried optimHess and numericHessian to compute the Hessian in other ways (the resulting Hessians differ), but they fail in the same way. The work is stalled.
I think this problem is common in statistical models with many parameters. What should I do in this situation?
There is a paper by Jeff Gill and Gary King discussing this issue. It may help. Essentially, even if the Hessian should in theory be positive definite at the minimum, numerical issues may prevent it from being so. The paper discusses methods to deal with such matrices.
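Before trying any remedy it helps to confirm what is wrong with the matrix. The following is only a diagnostic sketch in Python with NumPy, not the method from the paper; the function name, the tolerance and the two example matrices are made up. It checks the eigenvalues and computes sqrt(diag(solve(hessian))) only when they are all positive.

```python
import numpy as np

def standard_errors(hessian, tol=1e-8):
    """Return sqrt(diag(H^-1)), or None if H is not positive definite."""
    h = np.asarray(hessian, dtype=float)
    h = 0.5 * (h + h.T)                   # remove asymmetry from finite differences
    eigenvalues = np.linalg.eigvalsh(h)   # ascending order for a symmetric matrix
    if eigenvalues[0] <= tol:
        print("Hessian is not positive definite; eigenvalues:", eigenvalues)
        return None
    return np.sqrt(np.diag(np.linalg.inv(h)))

good = [[2.0, 0.0], [0.0, 8.0]]    # hypothetical well-behaved Hessian
bad = [[1.0, 0.0], [0.0, -1.0]]    # hypothetical Hessian at a saddle point
print(standard_errors(good))
print(standard_errors(bad))
```

A negative or near-zero eigenvalue usually points to a saddle point, a parameter on a boundary, or a direction in which the likelihood is almost flat.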

Why use softmax as opposed to standard normalization?

In the output layer of a neural network, it is typical to use the softmax function to approximate a probability distribution:
$$\mathrm{softmax}(q)_i = \frac{\exp(q_i)}{\sum_j \exp(q_j)}$$
This is expensive to compute because of the exponents. Why not simply perform a Z transform so that all outputs are positive, and then normalise just by dividing all outputs by the sum of all outputs?
There is one nice attribute of softmax compared with standard normalisation.
It reacts to low stimulation of your neural net (think blurry image) with a rather uniform distribution, and to high stimulation (i.e. large numbers, think crisp image) with probabilities close to 0 and 1.
Standard normalisation, by contrast, does not care as long as the proportions are the same.
Have a look at what happens when softmax gets a 10 times larger input, i.e. your neural net got a crisp image and a lot of neurones got activated:
>>> softmax([1,2]) # blurry image of a ferret
[0.26894142, 0.73105858] # it is a cat perhaps !?
>>> softmax([10,20]) # crisp image of a cat
[0.0000453978687, 0.999954602] # it is definitely a CAT !
And then compare it with standard normalisation
>>> std_norm([1,2]) # blurry image of a ferret
[0.3333333333333333, 0.6666666666666666] # it is a cat perhaps !?
>>> std_norm([10,20]) # crisp image of a cat
[0.3333333333333333, 0.6666666666666666] # it is a cat perhaps !?
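The session above does not show how softmax and std_norm are defined. A minimal sketch of definitions that would reproduce those numbers (my assumption, since the originals are not given) is:

```python
import math

def softmax(scores):
    """Exponentiate each score, then divide by the sum of the exponentials."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def std_norm(scores):
    """Divide each (positive) score by the sum of all scores."""
    total = sum(scores)
    return [s / total for s in scores]

for scores in ([1, 2], [10, 20]):
    print(scores, softmax(scores), std_norm(scores))
```

Scaling the inputs by 10 changes the softmax output but leaves std_norm untouched, because std_norm depends only on the ratios between the scores.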
I've had this question for months. It seems like we just cleverly guessed the softmax as an output function and then interpret the input to the softmax as log-probabilities. As you said, why not simply normalize all outputs by dividing by their sum? I found the answer in the Deep Learning book by Goodfellow, Bengio and Courville (2016) in section 6.2.2.
Let's say our last hidden layer gives us z as an activation. Then the softmax is defined as
$$\mathrm{softmax}(z)_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$$
Very Short Explanation
The exp in the softmax function roughly cancels out the log in the cross-entropy loss causing the loss to be roughly linear in z_i. This leads to a roughly constant gradient, when the model is wrong, allowing it to correct itself quickly. Thus, a wrong saturated softmax does not cause a vanishing gradient.
Short Explanation
The most popular method to train a neural network is Maximum Likelihood Estimation. We estimate the parameters theta in a way that maximizes the likelihood of the training data (of size m). Because the likelihood of the whole training dataset is a product of the likelihoods of each sample, it is easier to maximize the log-likelihood of the dataset and thus the sum of the log-likelihoods of the samples indexed by k:
$$\theta_{ML} = \arg\max_{\theta} \sum_{k=1}^{m} \log P_{model}(y^{(k)} \mid x^{(k)}; \theta)$$
Now, we only focus on the softmax here with z already given, so we can replace
$$\log P_{model}(y^{(k)} \mid x^{(k)}; \theta) \quad \text{with} \quad \log \mathrm{softmax}(z^{(k)})_i$$
with i being the correct class of the kth sample. Now, we see that when we take the logarithm of the softmax, to calculate the sample's log-likelihood, we get:
$$\log \mathrm{softmax}(z)_i = z_i - \log \sum_j \exp(z_j)$$
which, for large differences in z, roughly approximates to
$$z_i - \max(z)$$
First, we see the linear component z_i here. Secondly, we can examine the behavior of max(z) for two cases:
If the model is correct, then max(z) will be z_i. Thus, the log-likelihood asymptotically approaches zero (i.e. a likelihood of 1) as the difference between z_i and the other entries in z grows.
If the model is incorrect, then max(z) will be some other z_j > z_i. So, the addition of z_i does not fully cancel out -z_j and the log-likelihood is roughly (z_i - z_j). This clearly tells the model what to do to increase the log-likelihood: increase z_i and decrease z_j.
We see that the overall log-likelihood will be dominated by samples, where the model is incorrect. Also, even if the model is really incorrect, which leads to a saturated softmax, the loss function does not saturate. It is approximately linear in z_j, meaning that we have a roughly constant gradient. This allows the model to correct itself quickly. Note that this is not the case for the Mean Squared Error for example.
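The claim about gradients can be checked numerically. This is a small sketch with a made-up two-class example in which the model is confidently wrong; the function names are mine, and the squared error is taken between the softmax output and the one-hot target.

```python
import math

def softmax(z):
    shift = max(z)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_likelihood(z, i):
    """Gradient of log softmax(z)_i with respect to z: onehot(i) - softmax(z)."""
    p = softmax(z)
    return [(1.0 if j == i else 0.0) - p[j] for j in range(len(z))]

def grad_squared_error(z, i):
    """Gradient of sum_k (softmax(z)_k - onehot(i)_k)^2 with respect to z."""
    p = softmax(z)
    err = [p[k] - (1.0 if k == i else 0.0) for k in range(len(z))]
    return [
        sum(2.0 * err[k] * p[k] * ((1.0 if k == j else 0.0) - p[j])
            for k in range(len(z)))
        for j in range(len(z))
    ]

z = [-10.0, 10.0]  # saturated softmax, but the correct class is 0
print(grad_log_likelihood(z, 0))   # components of magnitude close to 1
print(grad_squared_error(z, 0))    # components close to 0
```

With the log-likelihood the gradient keeps a magnitude of about 1 in each component, while with the squared error it nearly vanishes because of the extra factor softmax(z)_k that multiplies every term.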
Long Explanation
If the softmax still seems like an arbitrary choice to you, you can take a look at the justification for using the sigmoid in logistic regression:
Why sigmoid function instead of anything else?
The softmax is the generalization of the sigmoid for multi-class problems justified analogously.
I have found the explanation here to be very good: CS231n: Convolutional Neural Networks for Visual Recognition.
On the surface the softmax algorithm seems to be a simple non-linear (we are spreading the data with exponential) normalization. However, there is more than that.
Specifically there are a couple different views (same link as above):
Information Theory - from the perspective of information theory the softmax function can be seen as trying to minimize the cross-entropy between the predictions and the truth.
Probabilistic View - from this perspective we are in fact looking at the log-probabilities, thus when we perform exponentiation we end up with the raw probabilities. In this case the softmax equation finds the MLE (Maximum Likelihood Estimate).
In summary, even though the softmax equation seems like it could be arbitrary it is NOT. It is actually a rather principled way of normalizing the classifications to minimize cross-entropy/negative likelihood between predictions and the truth.
The values of q_i are unbounded scores, sometimes interpreted as log-likelihoods. Under this interpretation, in order to recover the raw probability values, you must exponentiate them.
One reason that statistical algorithms often use log-likelihood loss functions is that they are more numerically stable: a product of probabilities may be represented by a very small floating point number. Using a log-likelihood loss function, a product of probabilities becomes a sum.
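A small illustration of the stability point, using 100 made-up probabilities of 1e-5 each:

```python
import math

probabilities = [1e-5] * 100  # hypothetical per-sample likelihoods

# The true product is 1e-500, far below the smallest positive double,
# so the running product underflows to exactly 0.0.
product = math.prod(probabilities)

# The sum of logs is an ordinary, well-scaled number.
log_likelihood = sum(math.log(p) for p in probabilities)

print(product)
print(log_likelihood)
```

Once the product has collapsed to 0.0, every parameter setting looks equally bad, whereas the sum of logs still distinguishes them.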
Another reason is that log-likelihoods occur naturally when deriving estimators for random variables that are assumed to be drawn from multivariate Gaussian distributions. See for example the Maximum Likelihood (ML) estimator and the way it is connected to least squares.
We are looking at a multiclass classification problem. That is, the predicted variable y can take one of k categories, where k > 2. In probability theory, this is usually modelled by a multinomial distribution. The multinomial distribution is a member of the exponential family. If we reconstruct the probability P(k=?|x) using properties of exponential family distributions, it coincides with the softmax formula.
If you believe the problem can be modelled by another distribution, other than multinomial, then you could reach a conclusion that is different from softmax.
For further information and a formal derivation please refer to CS229 lecture notes (9.3 Softmax Regression).
Additionally, a useful trick that is usually applied to softmax is softmax(x) = softmax(x+c): softmax is invariant to constant offsets in the input.
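The usual way to exploit this invariance is to choose c = -max(x), so that the largest exponent is 0 and exp cannot overflow. A minimal sketch (the function name is mine):

```python
import math

def stable_softmax(scores):
    """Softmax computed after subtracting the maximum score from every entry."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(stable_softmax([1.0, 2.0]))
# Same result after adding 1000 to every score; calling math.exp(1001.0)
# directly would raise OverflowError.
print(stable_softmax([1001.0, 1002.0]))
```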
The choice of the softmax function seems somehow arbitrary as there are many other possible normalizing functions. It is thus unclear why the log-softmax loss would perform better than other loss alternatives.
From "An Exploration of Softmax Alternatives Belonging to the Spherical Loss Family" https://arxiv.org/abs/1511.05042
The authors explored some other functions, among which are the Taylor expansion of exp and the so-called spherical softmax, and found that they sometimes perform better than the usual softmax.
I think one of the reasons may be to deal with negative numbers and division by zero, since exp(x) is always strictly positive.
For example, for a = [-2, -1, 1, 2] the sum is 0, so dividing by the sum would be a division by zero; softmax avoids this.
Adding to Piotr Czapla's answer: the greater the input values, the greater the probability assigned to the maximum input, for the same proportions and compared with the other inputs:
Suppose we change the softmax function so the output activations are given by
$$a^L_j = \frac{e^{c z^L_j}}{\sum_k e^{c z^L_k}}$$
where c is a positive constant. Note that c=1 corresponds to the standard softmax function. But if we use a different value of c we get a different function, which is nonetheless qualitatively rather similar to the softmax. In particular, show that the output activations form a probability distribution, just as for the usual softmax. Suppose we allow c to become large, i.e., c → ∞. What is the limiting value for the output activations a^L_j? After solving this problem it should be clear to you why we think of the c=1 function as a "softened" version of the maximum function. This is the origin of the term "softmax". You can follow the details from this source (equation 83).
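To get a feel for the limit numerically, here is a small sketch. It assumes the modified function multiplies every input by c before exponentiating, which is consistent with c=1 giving the standard softmax; the function name and the example inputs are made up.

```python
import math

def scaled_softmax(z, c=1.0):
    """a_j = exp(c * z_j) / sum_k exp(c * z_k), for a positive constant c."""
    scaled = [c * v for v in z]
    shift = max(scaled)  # subtract the max so exp cannot overflow
    exps = [math.exp(v - shift) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

z = [1.0, 2.0, 3.0]
for c in (1.0, 5.0, 50.0):
    # The outputs always sum to 1; as c grows, the mass moves to the largest z_j.
    print(c, scaled_softmax(z, c))
```

For large c the output is close to a one-hot vector marking the maximum, which is the "hard" max that c=1 softens.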
While it is indeed somewhat arbitrary, the softmax has desirable properties such as:
being easily differentiable (df_i/dx_i = f_i*(1-f_i))
when used as the output layer for a classification task, the in-fed scores are interpretable as log-odds

Nonlinear regression / Curve fitting with L-infinity norm

I am looking into time series data compression at the moment.
The idea is to fit a curve on a time series of n points so that the maximum deviation on any of the points is not greater than a given threshold. In other words, none of the values that the curve takes at the points where the time series is defined, should be "further away" than a certain threshold from the actual values.
Till now I have found out how to do nonlinear regression using the least squares estimation method in R (nls function) and other languages, but I haven't found any packages that implement nonlinear regression with the L-infinity norm.
I have found literature on the subject:
http://www.jstor.org/discover/10.2307/2006101?uid=3737864&uid=2&uid=4&sid=21100693651721
or
http://www.dtic.mil/dtic/tr/fulltext/u2/a080454.pdf
I could try to implement this in R, for instance, but I am first looking to see whether this has already been done so that I could maybe reuse it.
I have found a solution that I don't believe to be "very scientific": I use nonlinear least squares regression to find the starting values of the parameters which I subsequently use as starting points in the R "optim" function that minimizes the maximum deviation of the curve from the actual points.
Any help would be appreciated. The idea is to be able to find out if this type of curve-fitting is possible on a given time series sequence and to determine the parameters that allow it.
I hope there are other people that have already encountered this problem out there and that could help me ;)
Thank you.
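For reference, the two-stage approach described in the question (least squares first, then a general-purpose optimiser on the maximum deviation) could look roughly like this in Python with SciPy. This is only a sketch: the exponential-decay model, the synthetic data and the threshold are invented for illustration, and Nelder-Mead is just one derivative-free choice that tolerates the non-smooth objective.

```python
import numpy as np
from scipy.optimize import curve_fit, minimize

def model(t, a, b):
    # Hypothetical example model: exponential decay.
    return a * np.exp(-b * t)

def max_abs_deviation(params, t, y):
    """L-infinity norm of the residuals at the sample points."""
    return np.max(np.abs(model(t, *params) - y))

def fit_minimax(t, y, start):
    # Stage 1: nonlinear least squares gives the starting values.
    ls_params, _ = curve_fit(model, t, y, p0=start)
    # Stage 2: minimise the maximum deviation, starting from the LS fit.
    result = minimize(max_abs_deviation, ls_params, args=(t, y),
                      method="Nelder-Mead")
    return ls_params, result.x

t = np.linspace(0.0, 4.0, 25)
y = model(t, 3.0, 0.7) + 0.05 * np.sin(5.0 * t)  # deterministic perturbation
ls_params, mm_params = fit_minimax(t, y, start=(1.0, 1.0))

threshold = 0.1  # hypothetical compression tolerance
print("LS fit max deviation:     ", max_abs_deviation(ls_params, t, y))
print("minimax fit max deviation:", max_abs_deviation(mm_params, t, y))
print("within threshold:", max_abs_deviation(mm_params, t, y) <= threshold)
```

The second stage can only find a local optimum, so a result above the threshold does not prove that no admissible curve exists.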

I want to draw a curve and generate a polynomial that closely fits it. How would I go about this?

I have an arbitrary curve (defined by a set of points) and I would like to generate a polynomial that fits that curve to an arbitrary precision. What is the best way to tackle this problem, or is there already a library or online service that performs this task?
Thanks!
If your "arbitrary curve" is described by a set of points (x_i,y_i) where each x_i is unique, and if you mean by "fits" the calculation of the best least-squares polynomial approximation of degree N, you can simply obtain the coefficients b of the polynomial using
b = polyfit(X,Y,N)
where X is the vector of x_i values and Y is the vector of y_i values. In this way you can increase N until you obtain the accuracy you require. Of course you can achieve zero approximation error by calculating the interpolating polynomial. However, data fitting often requires some thought beforehand: you need to consider what you want the approximation to achieve. There are a variety of mathematical ways of assessing approximation error (by using different norms), the choice of which will depend on your requirements of the resulting approximation. There are also many potential pitfalls (such as overfitting) that you may come across, and blindly attempting to fit curves may result in an approximation that is theoretically sound but utterly useless to you in practical terms. I would suggest doing a little research on approximation theory if the above method does not meet your requirements, as has been suggested in the comments on your question.
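The "increase N until you obtain the accuracy you require" loop can be sketched in Python with NumPy, whose polyfit takes the same three arguments. The sine data, the tolerance and the function name are made up for illustration, and accuracy is measured here as the largest absolute error at the given points.

```python
import numpy as np

def fit_until_accurate(x, y, tol, max_degree=10):
    """Raise the polynomial degree until the max error at the points is <= tol."""
    for degree in range(1, max_degree + 1):
        coeffs = np.polyfit(x, y, degree)  # least squares, highest power first
        max_err = np.max(np.abs(np.polyval(coeffs, x) - y))
        if max_err <= tol:
            return coeffs, degree, max_err
    raise ValueError("no polynomial up to max_degree met the tolerance")

x = np.linspace(0.0, 3.0, 20)
y = np.sin(x)  # stand-in for the points of the drawn curve
coeffs, degree, max_err = fit_until_accurate(x, y, tol=1e-3)
print(degree, max_err)
```

Capping the degree is deliberate: high-degree fits become ill-conditioned and tend to oscillate between the data points, which is the overfitting pitfall mentioned above.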

Resources