conjugate gradient with a positive semidefinite matrix - linear-algebra

I'm working on conjugate gradient to solve Ax=b when A is symmetric and positive semidefinite.
When A is symmetric and positive semidefinite, is (A+λ I), where λ is positive and I is an identity matrix, always positive definite? Then can we use (A+λ I) instead of A in CG since (A+λ I) is symmetric and positive definite?
When A is positive semidefinite with many repeated eigenvalues of zeros, are both A and (A+λ I) not full rank? How does CG behave when the matrix is not full rank?
Many thanks!

If matrix A is positive semi-definite, then matrix A+λI where λ>0 is positive definite. The effect is to add λ to all the eigenvalues of A, so any zero eigenvalues of A become λ instead. So A+λI is always full rank.

Related

Deriving lower triangular matrix with positive diagonal in r

I am creating optimization algorithm where I need to plug initial value of A which is a lower triangular matrix with positive diagonal. My first question is how to derive a random lower triangular matrix with positive diagonal as an initial value matrix in r? And is it good idea to choose random matrix? if not what are the ways to do a better initial guess for this type of matrix?
I can imagine there are much better solutions based on your specific applications, but we can set the diagonal to half-Normal (i.e. |e| where e ~ N(0,1)) and set the lower-triangular off-diagonal elements to standard Normal values. ...
n <- 10
M <- diag(abs(rnorm(n)))
M[lower.tri(M, diag = FALSE)] <- rnorm(n*(n-1)/2)

Covariance matrix in RStan

I would like to define a covariance matrix in RStan.
Similarly to how you can provide constraints to scalar and vector values, e.g. real a, I would like to provide constraints that the leading diagonal of the covariance matrix must be positive, but the off-diagonal components could take any real value.
Is there a way to enforce that the matrix must also be positive semi-definite? Otherwise, some of the samples produced will not be valid covariance matrices.
Yes, defining
cov_matrix[K] Sigma;
ensures that Sigma is symmetric and positive definite K x K matrix. It can reduce to semidefinite due to floating point, but we'll catch that and raise exceptions to ensure it stays strictly positive definite.
Under the hood, Stan uses the Cholesky factor transform---the unconstrained representation is a lower triangular matrix with positive diagonal. We just use that as the real parameters, then transform and apply the Jacobian implicitly under the hood as described the reference manual chapter on constrained variables to create a covariance matrix with an implicit (improper) uniform prior.

Generating Positive definite matrix of dimensions 8x8

I am trying to generate a positive definite matrix (A'*A) of dimensions 8x8.
where A is 1x8.
I tried it for many randomly generated matrix A but not able to generate it.
octave-3.6.1.exe:166> A= (rand(1,8)+rand(1,8)*1i);
octave-3.6.1.exe:167> chol(A'*A);
error: chol: input matrix must be positive definite
Can anyone please tell me what is going wrong here. Thanks for the help in advance.
It's not possible to do that, since no matrix of that form is positive definite.
Claim: Given a 1xn (real, n>1) matrix A, the symmetric matrix M = A'A is not positive definite:
Proof: By definition, M is positive definite iff x'Mx > 0 for all non zero x. That is, iff x'A'Ax = (Ax)'Ax = (Ax)^2 = (A_1 x_1 + ... + A_n x_n) > 0 for all non zero x.
Since the real values A_i are linearly dependent, there exists x_i, not all zero, such that A_1 x_1 + ... + A_n x_n = 0. We found a non zero vector x such that x'Mx = 0, so M is not positive definite.
A different proof, that can be applied directly to the complex case is this: Let A be an 1xn (complex, n>1) matrix. Positive definiteness implies invertibility, so M = A*A must have full rank to be positive definite. It clearly has rank 1, so it's not invertible and thus not positive definite.
Here is how I routinelly create SPD matrix
1) Create a random Symetric Matrix
2) Make sure that all the diagonal values are greater than the sum of any row or column they appear in.
Usually for (1) I use random number between 0 and 1. Its then easy to figure out a number to use for each diagonal entries.
Cheers,

efficient computation of Trace(AB^{-1}) given A and B

I have two square matrices A and B. A is symmetric, B is symmetric positive definite. I would like to compute $trace(A.B^{-1})$. For now, I compute the Cholesky decomposition of B, solve for C in the equation $A=C.B$ and sum up the diagonal elements.
Is there a more efficient way of proceeding?
I plan on using Eigen. Could you provide an implementation if the matrices are sparse (A can often be diagonal, B is often band-diagonal)?
If B is sparse, it may be efficient (i.e., O(n), assuming good condition number of B) to solve for x_i in
B x_i = a_i
(sample Conjugate Gradient code is given on Wikipedia). Taking a_i to be the column vectors of A, you get the matrix B^{-1} A in O(n^2). Then you can sum the diagonal elements to get the trace. Generally, it's easier to do this sparse inverse multiplication than to get the full set of eigenvalues. For comparison, Cholesky decomposition is O(n^3). (see Darren Engwirda's comment below about Cholesky).
If you only need an approximation to the trace, you can actually reduce the cost to O(q n) by averaging
r^T (A B^{-1}) r
over q random vectors r. Usually q << n. This is an unbiased estimate provided that the components of the random vector r satisfy
< r_i r_j > = \delta_{ij}
where < ... > indicates an average over the distribution of r. For example, components r_i could be independent gaussian distributed with unit variance. Or they could be selected uniformly from +-1. Typically the trace scales like O(n) and the error in the trace estimate scales like O(sqrt(n/q)), so the relative error scales as O(sqrt(1/nq)).
If generalized eigenvalues are more efficient to compute, you can compute the generalized eigenvalues, A*v = lambda* B *v and then sum up all the lambdas.

What is SVD(singular value decomposition)

How does it actually reduce noise..can you suggest some nice tutorials?
SVD can be understood from a geometric sense for square matrices as a transformation on a vector.
Consider a square n x n matrix M multiplying a vector v to produce an output vector w:
w = M*v
The singular value decomposition M is the product of three matrices M=U*S*V, so w=U*S*V*v. U and V are orthonormal matrices. From a geometric transformation point of view (acting upon a vector by multiplying it), they are combinations of rotations and reflections that do not change the length of the vector they are multiplying. S is a diagonal matrix which represents scaling or squashing with different scaling factors (the diagonal terms) along each of the n axes.
So the effect of left-multiplying a vector v by a matrix M is to rotate/reflect v by M's orthonormal factor V, then scale/squash the result by a diagonal factor S, then rotate/reflect the result by M's orthonormal factor U.
One reason SVD is desirable from a numerical standpoint is that multiplication by orthonormal matrices is an invertible and extremely stable operation (condition number is 1). SVD captures any ill-conditioned-ness in the diagonal scaling matrix S.
One way to use SVD to reduce noise is to do the decomposition, set components that are near zero to be exactly zero, then re-compose.
Here's an online tutorial on SVD.
You might want to take a look at Numerical Recipes.
Singular value decomposition is a method for taking an nxm matrix M and "decomposing" it into three matrices such that M=USV. S is a diagonal square (the only nonzero entries are on the diagonal from top-left to bottom-right) matrix containing the "singular values" of M. U and V are orthogonal, which leads to the geometric understanding of SVD, but that isn't necessary for noise reduction.
With M=USV, we still have the original matrix M with all its noise intact. However, if we only keep the k largest singular values (which is easy, since many SVD algorithms compute a decomposition where the entries of S are sorted in nonincreasing order), then we have an approximation of the original matrix. This works because we assume that the small values are the noise, and that the more significant patterns in the data will be expressed through the vectors associated with larger singular values.
In fact, the resulting approximation is the most accurate rank-k approximation of the original matrix (has the least squared error).
To answer to the tittle question: SVD is a generalization of eigenvalues/eigenvectors to non-square matrices.
Say,
$X \in N \times p$, then the SVD decomposition of X yields X=UDV^T where D is diagonal and U and V are orthogonal matrices.
Now X^TX is a square matrice, and the SVD decomposition of X^TX=VD^2V where V is equivalent to the eigenvectors of X^TX and D^2 contains the eigenvalues of X^TX.
SVD can also be used to greatly ease global (i.e. to all observations simultaneously) fitting of an arbitrary model (expressed in an formula) to data (with respect to two variables and expressed in a matrix).
For example, data matrix A = D * MT where D represents the possible states of a system and M represents its evolution wrt some variable (e.g. time).
By SVD, A(x,y) = U(x) * S * VT(y) and therefore D * MT = U * S * VT
then D = U * S * VT * MT+ where the "+" indicates a pseudoinverse.
One can then take a mathematical model for the evolution and fit it to the columns of V, each of which are a linear combination the components of the model (this is easy, as each column is a 1D curve). This obtains model parameters which generate M? (the ? indicates it is based on fitting).
M * M?+ * V = V? which allows residuals R * S2 = V - V? to be minimized, thus determining D and M.
Pretty cool, eh?
The columns of U and V can also be inspected to glean information about the data; for example each inflection point in the columns of V typically indicates a different component of the model.
Finally, and actually addressing your question, it is import to note that although each successive singular value (element of the diagonal matrix S) with its attendant vectors U and V does have lower signal to noise, the separation of the components of the model in these "less important" vectors is actually more pronounced. In other words, if the data is described by a bunch of state changes that follow a sum of exponentials or whatever, the relative weights of each exponential get closer together in the smaller singular values. In other other words the later singular values have vectors which are less smooth (noisier) but in which the change represented by each component are more distinct.

Resources