The autocorrelation command acf is slightly off the correct value. the command i am giving is acf(X,lag.max = 1) and the output is 0.881 while the same autocorrelation calculation done using the command cor(X[1:41],X[2:42]) gives the value 0.9452. I searched a lot to understand if there is any syntax I am missing or if the two computations have some fundamental difference but could not find suitable resources. Please help.
The equation for autocorrelation used by acf() and in all time series textbooks is
where
See https://otexts.com/fpp2/autocorrelation.html for example.
The equation for regular correlation is slightly different:
where
The simpler formula is better because it uses all available data in the denominator.
Note also that you might expect the top formula to include the multiplier $T/(T-k)$ given the different number of terms in the summations. But this is not included to ensure that the corresponding autocorrelation matrix is nonnegative definite.
This set of exercises has the student use a QP solver to solve an SVM in R. The suggested solver is the quadprog package. The quadratic problem is given as:
From the remark about the linear SVM, $K=XX'$, $K$ is a singular matrix usually, at most rank $p$ where $X$ is $n\times p$. But the solver quadprog requires a positive definite matrix, not just PSD, in the place of $K$, as mentioned many places (and verified). Any ideas what the instructor had in mind?
I think the workaround would be to add a small number (such as 1e-7) to the diagonal elements of the matrix which is supposed to be positive definite. I am not certain about the math behind it, but the sources below, as well as my experience, suggest that this solution works.
source: https://stats.stackexchange.com/questions/179900/optimizing-a-support-vector-machine-with-quadratic-programming
source: https://teazrq.github.io/stat542/hw/HW6.pdf
I'm using IPOPT within Julia. My objective function will throw an error for certain parameter values (specifically, though I assume this doesn't matter, it involves a Cholesky decomposition of a covariance matrix and so requires that the covariance matrix be positive-definite). As such, I non-linearly constrain the parameters so that they cannot produce an error. Despite this constraint, IPOPT still insists on evaluating the objective function at paramaters which cause my objective function to throw an error. This causes my script to crash, resulting in misery and pain.
I'm interested why, in general, IPOPT would evaluate a function at parameters that breach the constraints. (I've ensured that it is indeed checking the constraints before evaluating the function.) If possible, I would like to know how I can stop it doing this.
I have set IPOPT's 'bound_relax_factor' parameter to zero; this doesn't help. I understand I could ask the objective function to return NaN instead of throwing an error, but when I do IPOPT seems to get even more confused and does not end up converging. Poor thing.
I'm happy to provide some example code if it would help.
Many thanks in advance :):)
EDIT:
A commenter suggested I ask my objective function to return a bad objective value when the constraints are violated. Unfortunately this is what happens when I do:
I'm not sure why Ipopt would go from a point evaluating at 2.0016x10^2 to a point evaluating at 10^10 — I worry there's something quite fundamental about IPOPT I'm not understanding.
Setting 'constr_viol_tol' and 'acceptable_constr_viol_tol' to their minimal values doesn't noticably affect optimisation, nor does 'over-constraining' my parameters (i.e. ensuring they cannot be anywhere near an unacceptable value).
The only constraints that Ipopt is guaranteed to satisfy at all intermediate iterations are simple upper and lower bounds on variables. Any other linear or nonlinear equality or inequality constraint will not necessarily be satisfied until the solver has finished converging at the final iteration (if it can get to a point that satisfies termination conditions). Guaranteeing that intermediate iterates are always feasible in the presence of arbitrary non-convex equality and inequality constraints is not tractable. The Newton step direction is based on local first and second order derivative information, so will be an approximation and may leave the space of feasible points if the problem has nontrivial curvature. Think about the space of points where x * y == constant as an example.
You should reformulate your problem to avoid needing to evaluate objective or constraint functions at invalid points. For example, instead of taking the Cholesky factorization of a covariance matrix constructed from your data, introduce a unit lower triangular matrix L and a diagonal matrix D. Impose lower bound constraints D[i, i] >= 0 for all i in 1:size(D,1), and nonlinear equality constraints L * D * L' == A where A is your covariance matrix. Then use L * sqrtm(D) anywhere you need to operate on the Cholesky factorization (this is a possibly semidefinite factorization, so more of a modified Cholesky representation than the classical strictly positive definite L * L' factorization).
Note that if your problem is convex, then there is likely a specialized formulation that a conic solver will be more efficient at solving than a general-purpose nonlinear solver like Ipopt.
In the output layer of a neural network, it is typical to use the softmax function to approximate a probability distribution:
This is expensive to compute because of the exponents. Why not simply perform a Z transform so that all outputs are positive, and then normalise just by dividing all outputs by the sum of all outputs?
There is one nice attribute of Softmax as compared with standard normalisation.
It react to low stimulation (think blurry image) of your neural net with rather uniform distribution and to high stimulation (ie. large numbers, think crisp image) with probabilities close to 0 and 1.
While standard normalisation does not care as long as the proportion are the same.
Have a look what happens when soft max has 10 times larger input, ie your neural net got a crisp image and a lot of neurones got activated
>>> softmax([1,2]) # blurry image of a ferret
[0.26894142, 0.73105858]) # it is a cat perhaps !?
>>> softmax([10,20]) # crisp image of a cat
[0.0000453978687, 0.999954602]) # it is definitely a CAT !
And then compare it with standard normalisation
>>> std_norm([1,2]) # blurry image of a ferret
[0.3333333333333333, 0.6666666666666666] # it is a cat perhaps !?
>>> std_norm([10,20]) # crisp image of a cat
[0.3333333333333333, 0.6666666666666666] # it is a cat perhaps !?
I've had this question for months. It seems like we just cleverly guessed the softmax as an output function and then interpret the input to the softmax as log-probabilities. As you said, why not simply normalize all outputs by dividing by their sum? I found the answer in the Deep Learning book by Goodfellow, Bengio and Courville (2016) in section 6.2.2.
Let's say our last hidden layer gives us z as an activation. Then the softmax is defined as
Very Short Explanation
The exp in the softmax function roughly cancels out the log in the cross-entropy loss causing the loss to be roughly linear in z_i. This leads to a roughly constant gradient, when the model is wrong, allowing it to correct itself quickly. Thus, a wrong saturated softmax does not cause a vanishing gradient.
Short Explanation
The most popular method to train a neural network is Maximum Likelihood Estimation. We estimate the parameters theta in a way that maximizes the likelihood of the training data (of size m). Because the likelihood of the whole training dataset is a product of the likelihoods of each sample, it is easier to maximize the log-likelihood of the dataset and thus the sum of the log-likelihood of each sample indexed by k:
Now, we only focus on the softmax here with z already given, so we can replace
with i being the correct class of the kth sample. Now, we see that when we take the logarithm of the softmax, to calculate the sample's log-likelihood, we get:
, which for large differences in z roughly approximates to
First, we see the linear component z_i here. Secondly, we can examine the behavior of max(z) for two cases:
If the model is correct, then max(z) will be z_i. Thus, the log-likelihood asymptotes zero (i.e. a likelihood of 1) with a growing difference between z_i and the other entries in z.
If the model is incorrect, then max(z) will be some other z_j > z_i. So, the addition of z_i does not fully cancel out -z_j and the log-likelihood is roughly (z_i - z_j). This clearly tells the model what to do to increase the log-likelihood: increase z_i and decrease z_j.
We see that the overall log-likelihood will be dominated by samples, where the model is incorrect. Also, even if the model is really incorrect, which leads to a saturated softmax, the loss function does not saturate. It is approximately linear in z_j, meaning that we have a roughly constant gradient. This allows the model to correct itself quickly. Note that this is not the case for the Mean Squared Error for example.
Long Explanation
If the softmax still seems like an arbitrary choice to you, you can take a look at the justification for using the sigmoid in logistic regression:
Why sigmoid function instead of anything else?
The softmax is the generalization of the sigmoid for multi-class problems justified analogously.
I have found the explanation here to be very good: CS231n: Convolutional Neural Networks for Visual Recognition.
On the surface the softmax algorithm seems to be a simple non-linear (we are spreading the data with exponential) normalization. However, there is more than that.
Specifically there are a couple different views (same link as above):
Information Theory - from the perspective of information theory the softmax function can be seen as trying to minimize the cross-entropy between the predictions and the truth.
Probabilistic View - from this perspective we are in fact looking at the log-probabilities, thus when we perform exponentiation we end up with the raw probabilities. In this case the softmax equation find the MLE (Maximum Likelihood Estimate)
In summary, even though the softmax equation seems like it could be arbitrary it is NOT. It is actually a rather principled way of normalizing the classifications to minimize cross-entropy/negative likelihood between predictions and the truth.
The values of q_i are unbounded scores, sometimes interpreted as log-likelihoods. Under this interpretation, in order to recover the raw probability values, you must exponentiate them.
One reason that statistical algorithms often use log-likelihood loss functions is that they are more numerically stable: a product of probabilities may be represented be a very small floating point number. Using a log-likelihood loss function, a product of probabilities becomes a sum.
Another reason is that log-likelihoods occur naturally when deriving estimators for random variables that are assumed to be drawn from multivariate Gaussian distributions. See for example the Maximum Likelihood (ML) estimator and the way it is connected to least squares.
We are looking at a multiclass classification problem. That is, the predicted variable y can take one of k categories, where k > 2. In probability theory, this is usually modelled by a multinomial distribution. Multinomial distribution is a member of exponential family distributions. We can reconstruct the probability P(k=?|x) using properties of exponential family distributions, it coincides with the softmax formula.
If you believe the problem can be modelled by another distribution, other than multinomial, then you could reach a conclusion that is different from softmax.
For further information and a formal derivation please refer to CS229 lecture notes (9.3 Softmax Regression).
Additionally, a useful trick usually performs to softmax is: softmax(x) = softmax(x+c), softmax is invariant to constant offsets in the input.
The choice of the softmax function seems somehow arbitrary as there are many other possible normalizing functions. It is thus unclear why the log-softmax loss would perform better than other loss alternatives.
From "An Exploration of Softmax Alternatives Belonging to the Spherical Loss Family" https://arxiv.org/abs/1511.05042
The authors explored some other functions among which are Taylor expansion of exp and so called spherical softmax and found out that sometimes they might perform better than usual softmax.
I think one of the reasons can be to deal with the negative numbers and division by zero, since exp(x) will always be positive and greater than zero.
For example for a = [-2, -1, 1, 2] the sum will be 0, we can use softmax to avoid division by zero.
Adding to Piotr Czapla answer, the greater the input values, the greater the probability for the maximum input, for same proportion and compared to the other inputs:
Suppose we change the softmax function so the output activations are given by
where c is a positive constant. Note that c=1 corresponds to the standard softmax function. But if we use a different value of c we get a different function, which is nonetheless qualitatively rather similar to the softmax. In particular, show that the output activations form a probability distribution, just as for the usual softmax. Suppose we allow c to become large, i.e., c→∞. What is the limiting value for the output activations a^L_j? After solving this problem it should be clear to you why we think of the c=1 function as a "softened" version of the maximum function. This is the origin of the term "softmax". You can follow the details from this source (equation 83).
While it indeed somewhat arbitrary, the softmax has desirable properties such as:
being easily diferentiable (df/dx = f*(1-f))
when used as the output layer for a classification task, the in-fed scores are interpretable as log-odds
I have several questions:
1. What's the difference between isoMDS and cmdscale?
2. May I use asymmetric matrix?
3. Is there any way to determine optimal number of dimensions (in result)?
One of the MDS methods is distance scaling and it is divided in metric and non-metric. Another one is the classical scaling (also called distance geometry by those in bioinformatics). Classical scaling can be carried out in R by using the command cmdscale. Kruskal's method of nonmetric distance scaling (using the stress function and isotonic regression) can be carried out by using the command isoMDS in library MASS.
The standard treatment of classical scaling yields an eigendecomposition problem and as such is the same as PCA if the goal is dimensionality reduction. The distance scaling methods, on the other hand, use iterative procedures to arrive at a solution.
If you refer to the distance structure, I guess you should pass a structure of the class dist which is an object with distance information. Or a (symmetric) matrix of distances, or an object which can be coerced to such a matrix using as.matrix(). (As I read in the help, only the lower triangle of the matrix is used, the rest is ignored).
(for classical scaling method): One way of determining the dimensionality of the resulting configuration is to look at the eigenvalues of the doubly centered symmetric matrix B (= HAH). The usual strategy is to plot the ordered eigenvalues (or some function of them) against dimension and then identify a dimension at which the eigenvalues become “stable” (i.e., do not change perceptively). At that dimension, we may observe an “elbow” that shows where stability occurs (for points of a n-dimensional space, stability in the plot should occur at dimension n+1). For easier graphical interpretation of a classical scaling solution, we usually choose n to be small, of the order 2 or 3.