Sum of residuals using lm is non-zero - r

I have defined two variables x and y. I want to regress y on x, but the sum of the residuals from lm() is non-zero.
Here are the variables:
x<-c(1,10,6,4,3,5,8,9,0,3,1,1,12,6,3,11,15,5,10,4)
y<-c(2,3,6,7,8,4,2,1,0,0,6,1,3,5,2,4,1,0,1,9)
gh<-lm(y~x)
sum(gh$residuals)
# [1] 4.718448e-16
I don't understand why the sum of residuals is non-zero. It should be zero by the procedure of OLS.
Thanks

Floating-point numbers have limited precision. Only a finite set of real numbers can be represented exactly as 32- or 64-bit floats; the rest are approximated by rounding them to the nearest number that can be represented exactly.
This means that, while mathematically the residuals should sum up to zero, in computer representation they might not.
I highly recommend What Every Computer Scientist Should Know About Floating-Point Arithmetic.
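A quick way to see this in R is to compare against zero with a tolerance instead of testing exact equality:
# Exact zero is not achievable in floating point; test with a tolerance instead
0.1 + 0.2 == 0.3                          # FALSE: both sides are rounded approximations
sum(gh$residuals) == 0                    # typically FALSE for the same reason
isTRUE(all.equal(sum(gh$residuals), 0))   # TRUE: zero up to a small numerical tolerance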

Related

Covariance matrix in RStan

I would like to define a covariance matrix in RStan.
Similarly to how you can provide constraints to scalar and vector values, e.g. real<lower=0> a, I would like to provide constraints that the leading diagonal of the covariance matrix must be positive, but the off-diagonal components could take any real value.
Is there a way to enforce that the matrix must also be positive semi-definite? Otherwise, some of the samples produced will not be valid covariance matrices.
Yes, defining
cov_matrix[K] Sigma;
ensures that Sigma is a symmetric, positive-definite K x K matrix. It can reduce to semidefinite due to floating point, but we'll catch that and raise exceptions to ensure it stays strictly positive definite.
Under the hood, Stan uses the Cholesky factor transform---the unconstrained representation is a lower triangular matrix with positive diagonal. We just use that as the real parameters, then transform and apply the Jacobian implicitly under the hood, as described in the reference manual chapter on constrained variables, to create a covariance matrix with an implicit (improper) uniform prior.
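For intuition, here is a small R sketch (not Stan's actual implementation) of that construction: any lower-triangular L with a strictly positive diagonal yields a symmetric, positive-definite Sigma = L L'.
# Illustration only: build a covariance matrix from an unconstrained
# lower-triangular factor, as the Cholesky factor transform does
K <- 3
L <- matrix(0, K, K)
L[lower.tri(L)] <- rnorm(K * (K - 1) / 2)       # off-diagonals: any real values
diag(L) <- exp(rnorm(K))                        # diagonal: strictly positive
Sigma <- L %*% t(L)                             # symmetric positive definite
all(eigen(Sigma, symmetric = TRUE)$values > 0)  # TRUE (up to floating point)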

Precision of matrix calculation R vs. Stata

I am trying to determine whether matrices are negative semi-definite or not. For this reason, I check whether all eigenvalues are less than or equal to zero. One example matrix is:
[,1] [,2] [,3] [,4]
[1,] -1.181830e-05 0.0001576663 -2.602332e-07 1.472770e-05
[2,] 1.576663e-04 -0.0116220027 3.249607e-04 -2.348050e-04
[3,] -2.602332e-07 0.0003249607 -2.616447e-05 3.492998e-05
[4,] 1.472770e-05 -0.0002348050 3.492998e-05 -9.103073e-05
The eigenvalues calculated by Stata are 1.045e-12, -0.00001559, -0.00009737, -0.01163805. However, the eigenvalues calculated by R are -1.207746e-20, -1.558760e-05, -9.737074e-05, -1.163806e-02. So the last three eigenvalues are very similar, but the first one, which is very close to zero, is not. With the eigenvalues obtained with Stata the matrix is not semi-definite, but with the eigenvalues obtained with R it is. Is there a way I can find out which calculation is more precise? Or might it even be possible to rescale the matrix in order to avoid such tiny eigenvalues?
Thank you very much in advance. Every hint will be highly appreciated.
You can't expect so much precision from a numerical algorithm using double precision floating point numbers.
You can expect no more than about 15-17 significant decimal digits, and loss of relative precision for values near zero is not uncommon. That is, given numerical error, 1e-12 and -1e-20 are both essentially indistinguishable from 0.
For instance, for the smallest eigenvalue (using coefficients you give in your comment), I get:
R 3.4.1: 5.929231e-21,
MATLAB R2017a: 3.412972022812169e-19
Stata 15: 3.2998e-20 (matrix eigenvalues) or 4.464e-19 (matrix symeigen)
Intel Fortran with MKL (DSYEV function): 2.2608e-19
You may choose a threshold, say 1e-10, and force an eigenvalue to zero when its ratio to the largest eigenvalue is less than 1e-10.
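A sketch of that thresholding in R, where A stands for the matrix in question and the 1e-10 cutoff is just an example:
# Zero out eigenvalues that are negligible relative to the largest magnitude
ev <- eigen(A, symmetric = TRUE)$values   # A: the symmetric matrix in question
tol <- 1e-10 * max(abs(ev))
ev[abs(ev) < tol] <- 0
all(ev <= 0)                              # test negative semi-definiteness on the cleaned values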
Anyway, your 1e-12 looks a bit large. You may have lost some precision when transferring data between Stata and R: a small relative error in the matrix can result in a large relative error for eigenvalues around zero.
With Stata and the data in your question (not in the comment), I get for instance 3.696e-12 for the smallest eigenvalue.
However, even with the same matrix, there may still be differences (there are, above), due to variations in:
the parser, if you enter your numbers as text
the algorithm used for eigenvalue computation
implementation details of the same algorithm (floating-point addition is not associative, for instance)
the compiler used to compile the computation routines, or compiler options
floating-point hardware
The traditional suggested reading for this kind of question:
What Every Computer Scientist Should Know About Floating-Point Arithmetic

Interpreting the psych::cor.smoother function

I've tried to contact William Revelle about this but he isn't responding.
In the psych package there is a function called cor.smoother, which determines whether or not a correlation matrix is positive definite. Its explanation is as follows:
"cor.smoother examines all of nvar minors of rank nvar-1 by systematically dropping one variable at a time and finding the eigen value decomposition. It reports those variables, which, when dropped, produce a positive definite matrix. It also reports the number of negative eigenvalues when each variable is dropped. Finally, it compares the original correlation matrix to the smoothed correlation matrix and reports those items with absolute deviations great than cut. These are all hints as to what might be wrong with a correlation matrix."
It is really the statement in bold that I am hoping someone can interpret in a more understandable way for me.
A belated answer to your question.
Correlation matrices are said to be improper (or more accurately, not positive semi-definite) when at least one of the eigen values of the matrix is less than 0. This can happen if you have some missing data and are using pair-wise complete correlations. It is particularly likely to happen if you are doing tetrachoric or polychoric correlations based upon data sets with some or even a lot of missing data.
(A correlation matrix, R, may be decomposed into a set of eigen vectors (X) and eigen values (lambda) where R = X lambda X’. This decomposition is the basis of components analysis and factor analysis, but that is more than you want to know.)
The cor.smooth function finds the eigen values and then adjusts the negative ones by making them slightly positive (and adjusting the other ones to compensate for this change).
The cor.smoother function attempts to identify the variables that are making the matrix improper. It does this by considering all the matrices generated by dropping one variable at a time and seeing which ones of those are not positive semi-definite (i.e. have eigen values < 0.) Ideally, this will identify one variable that is messing things up.
An example of this is in the burt data set where the sorrow-tenderness correlation was probably mistyped and the .87 should be .81.
cor.smoother(burt) #identifies tenderness and sorrow as likely culprits
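For intuition, the smoothing step described above amounts to something like this rough R sketch (not the actual psych::cor.smooth code; the function name and tolerance are illustrative):
# Rough sketch of eigenvalue smoothing for an improper correlation matrix R
smooth_corr <- function(R, eps = 1e-12) {
  e <- eigen(R, symmetric = TRUE)
  vals <- e$values
  vals[vals < eps] <- eps              # make negative eigenvalues slightly positive
  vals <- vals * ncol(R) / sum(vals)   # rescale so they still sum to the number of variables
  S <- e$vectors %*% diag(vals) %*% t(e$vectors)
  cov2cor(S)                           # put unit values back on the diagonal
}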

How to compute the global variance (squared standard deviation) in a parallel application?

I have a parallel application in which, on each node, I compute the variance of that node's partition of the data points based on its calculated mean, but how can I compute the global variance (combining all the partition variances)?
I thought it would simply be the sum of the variances divided by the number of nodes, but that is not giving me a result close to the correct one...
The global variation (the sum of squared deviations) is itself a sum.
You can compute parts of the sum in parallel trivially, and then add them together.
sum(x1...x100) = sum(x1...x50) + sum(x51...x100)
In the same way, you can compute the global average: compute the global sum and the total object count, then divide (don't divide by the number of nodes, but by the total number of objects).
mean = sum/count
Once you have the mean, you can compute the sum of squared deviations using the distributed sum formula above (applied to (xi-mean)^2), then divide by count-1 to get the variance.
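Here is a small R sketch of that two-pass scheme, with a list of vectors standing in for the per-node partitions (the names are illustrative):
# parts: a list of numeric vectors, one per node/partition (illustrative stand-in)
parts <- split(rnorm(1e5), rep(1:4, length.out = 1e5))

# pass 1: per-partition sums and counts -> global mean
n_total   <- sum(sapply(parts, length))
sum_total <- sum(sapply(parts, sum))
mu        <- sum_total / n_total          # divide by the total count, not the number of nodes

# pass 2: per-partition sums of squared deviations from the global mean
ss_total  <- sum(sapply(parts, function(x) sum((x - mu)^2)))
variance  <- ss_total / (n_total - 1)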
Do not use E[X^2] - (E[X])^2
While this formula "mean of square minus square of mean" is highly popular, it is numerically unstable when you are using floating point math. It's known as catastrophic cancellation.
Because the two values can be very close, you lose a lot of digits in precision when computing the difference. I've seen people get a negative variance this way...
With "big data", numerical problems gets worse...
Two ways to avoid these problems:
Use two passes. Computing the mean is stable, and gets you rid of the subtraction of the squares.
Use an online algorithm such as the one by Knuth and Welford, then use weighted sums to combine the per-partition means and variances (see the sketch below). Details are on Wikipedia. In my experience, this is often slower; but it may be beneficial on Hadoop due to startup and IO costs.
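A sketch of the weighted (pairwise) combination of two partitions' summaries, as described on the Wikipedia page; the helper name is illustrative:
# Combine (count, mean, M2) summaries of two partitions, where
# M2 = sum of squared deviations from that partition's own mean
combine_stats <- function(a, b) {
  n     <- a$n + b$n
  delta <- b$mean - a$mean
  mean  <- a$mean + delta * b$n / n
  M2    <- a$M2 + b$M2 + delta^2 * a$n * b$n / n
  list(n = n, mean = mean, M2 = M2)      # variance = M2 / (n - 1)
}

# Example: combine two partition summaries
p1 <- c(1, 2, 3, 4); p2 <- c(10, 11, 12)
s1 <- list(n = length(p1), mean = mean(p1), M2 = sum((p1 - mean(p1))^2))
s2 <- list(n = length(p2), mean = mean(p2), M2 = sum((p2 - mean(p2))^2))
s  <- combine_stats(s1, s2)
s$M2 / (s$n - 1)                         # matches var(c(p1, p2)) up to rounding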
You need to add the sums and sums of squares of each partition to get the global sum and sum of squares and then use them to calculate the global mean and variance.
UPDATE: E[X^2] - E[X]^2 and cancellation...
To figure out how important cancellation error is when calculating the standard deviation with
σ = √(E[X^2] - E[X]^2)
let us assume that we have both E[X^2] and E[X]^2 accurate to 12 significant decimal figures. This implies that σ^2 has an error of order 10^-12 × E[X^2] or, if there has been significant cancellation, equivalently 10^-12 × E[X]^2, in which case σ will have an error of approximate order 10^-6 × E[X]: one millionth of the mean.
For many, if not most, statistical analyses this is negligible, in the sense that it falls within other sources of error (like measurement error), and so you can in good conscience simply set negative variances to zero before you take the square root.
If you really do care about deviations of this magnitude (and can show that it's a feature of the thing you are measuring and not, for example, an artifact of the method of measurement) then you can start worrying about cancellation. That said, the most likely explanation is that you have used an inappropriate scale for your data, such as measuring daily temperatures in Kelvin rather than Celsius!
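A quick R demonstration of how scale drives the cancellation (the numbers are arbitrary):
# Data with a huge mean relative to its spread: E[X^2] and E[X]^2 nearly cancel
set.seed(1)
x <- rnorm(1e6, mean = 1e9, sd = 0.01)
mean(x^2) - mean(x)^2                  # garbage, often negative: the true variance (~1e-4)
                                       # is far below the rounding error of the ~1e18 terms
var(x)                                 # ~1e-4, computed stably
mean((x - 1e9)^2) - mean(x - 1e9)^2    # centring to a sensible scale first also fixes it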

PCA analysis using Correlation Matrix as input in R

I have a 7000 x 7000 correlation matrix and I have to do PCA on it in R.
I used
CorPCA <- princomp(covmat=xCor)
where xCor is the correlation matrix, but it fails with
"covariance matrix is not non-negative definite"
I assume it is because I have some negative correlations in that matrix.
I am wondering which built-in function in R I can use to get the PCA result.
One method to do the PCA is to perform an eigenvalue decomposition of the covariance matrix, see wikipedia.
The advantage of the eigenvalue decomposition is that you see which directions (eigenvectors) are significant, i.e. have a noticeable variation expressed by the associated eigenvalues. Moreover, you can tell whether the covariance matrix is positive definite (all eigenvalues greater than zero), non-negative definite (which is okay) if some eigenvalues equal zero, or indefinite (which is not okay) if there are negative eigenvalues.
Sometimes, due to numerical inaccuracies, a matrix that should be non-negative definite ends up with slightly negative eigenvalues. In that case you would observe negative eigenvalues which are almost zero, and you can set them to zero to restore the non-negative definiteness of the covariance matrix.
Furthermore, you can still interpret the result: the eigenvectors contributing the significant information are associated with the biggest eigenvalues. If the list of sorted eigenvalues declines quickly, there are a lot of directions which do not contribute significantly and can therefore be dropped.
The built-in R function is eigen
If your covariance matrix is A then
eigen_res <- eigen(A)
# sorted list of eigenvalues
eigen_res$values
# slightly negative eigenvalues, set them to small positive value
eigen_res$values[eigen_res$values<0] <- 1e-10
# and produce regularized covariance matrix
Areg <- eigen_res$vectors %*% diag(eigen_res$values) %*% t(eigen_res$vectors)
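From that same decomposition you can read off the usual PCA quantities: for a correlation matrix, the eigenvectors play the role of the component loadings and the eigenvalues give the component variances. For example:
# Principal component loadings and proportion of variance explained
loadings      <- eigen_res$vectors
prop_variance <- eigen_res$values / sum(eigen_res$values)
cumsum(prop_variance)   # helps decide how many components to keep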
not non-negative definite does not mean the covariance matrix has negative correlations. It's the linear algebra equivalent of trying to take the square root of a negative number! You can't tell whether a matrix is positive definite by looking at a few of its values.
Try adjusting some default values like tolerance in princomp call. Check this thread for example: How to use princomp () function in R when covariance matrix has zero's?
An alternative is to write some code of your own to perform what is called a NIPALS analysis. Take a look at this thread on the R-mailing list: https://stat.ethz.ch/pipermail/r-help/2006-July/110035.html
I'd even go as far as asking: where did you obtain the correlation matrix? Did you construct it yourself? Does it have NAs? If you constructed xCor from your own data, do you think you could sample the data and construct a smaller xCor matrix (say 1000 x 1000)? All these alternatives try to drive your PCA algorithm through the 'happy path' (i.e. all matrix operations can be carried out internally without difficulties in diagonalization etc., and no more 'not non-negative definite' error messages).
