Interpreting the psych::cor.smoother function - r

I've tried to contact William Revelle about this but he isn't responding.
In the psych package there is a function called cor.smoother, which determines whether or not a correlation matrix is positive definite. Its explanation is as follows:
"cor.smoother examines all of nvar minors of rank nvar-1 by systematically dropping one variable at a time and finding the eigen value decomposition. It reports those variables, which, when dropped, produce a positive definite matrix. It also reports the number of negative eigenvalues when each variable is dropped. Finally, it compares the original correlation matrix to the smoothed correlation matrix and reports those items with absolute deviations great than cut. These are all hints as to what might be wrong with a correlation matrix."
It is really the statement in bold that I am hoping someone can interpret in a more understandable way for me.

A belated answer to your question.
Correlation matrices are said to be improper (or more accurately, not positive semi-definite) when at least one of the eigen values of the matrix is less than 0. This can happen if you have some missing data and are using pair-wise complete correlations. It is particularly likely to happen if you are doing tetrachoric or polychoric correlations based upon data sets with some or even a lot of missing data.
(A correlation matrix, R, may be decomposed into a set of eigen vectors (X) and eigen values (lambda) where R = X lambda X’. This decomposition is the basis of components analysis and factor analysis, but that is more than you want to know.)
The cor.smooth function finds the eigen values and then adjusts the negative ones by making them slightly positive (and adjusting the other ones to compensate for this change).
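To make the smoothing step concrete, here is a minimal NumPy sketch of the idea described above. It is not the psych source code: the function name smooth_correlation, the eps threshold and the example matrix R_bad are assumptions made up for illustration.

```python
import numpy as np

def smooth_correlation(R, eps=1e-12):
    """Eigenvalue smoothing sketch (an illustration, not psych's exact code)."""
    R = np.asarray(R, dtype=float)
    vals, vecs = np.linalg.eigh(R)                 # R = X diag(lambda) X'
    vals = np.where(vals < eps, 100 * eps, vals)   # make negative ones slightly positive
    vals = vals * R.shape[0] / vals.sum()          # rescale the others to compensate
    S = vecs @ np.diag(vals) @ vecs.T              # rebuild the matrix
    d = np.sqrt(np.diag(S))
    return S / np.outer(d, d)                      # put ones back on the diagonal

# Hypothetical improper "correlation" matrix (its determinant is negative)
R_bad = np.array([[1.0, 0.9, 0.1],
                  [0.9, 1.0, 0.9],
                  [0.1, 0.9, 1.0]])
R_smooth = smooth_correlation(R_bad)
```

Comparing R_bad with R_smooth element by element then shows which correlations had to move the most, which is the comparison against cut mentioned in the documentation.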
The cor.smoother function attempts to identify the variables that are making the matrix improper. It does this by considering all the matrices generated by dropping one variable at a time and seeing which ones of those are not positive semi-definite (i.e. have eigen values < 0.) Ideally, this will identify one variable that is messing things up.
An example of this is in the burt data set where the sorrow-tenderness correlation was probably mistyped and the .87 should be .81.
cor.smoother(burt) #identifies tenderness and sorrow as likely culprits
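The drop-one-variable check can be sketched in a few lines of NumPy. This only illustrates the logic described above; the function names and the 4x4 matrix R_example are hypothetical, and the real cor.smoother reports more than this.

```python
import numpy as np

def count_negative_eigenvalues(R, tol=1e-10):
    # number of eigenvalues clearly below zero
    return int(np.sum(np.linalg.eigvalsh(R) < -tol))

def drop_one_diagnostics(R):
    """For each variable, drop it and count negative eigenvalues of what is left."""
    R = np.asarray(R, dtype=float)
    n = R.shape[0]
    result = {}
    for i in range(n):
        keep = [j for j in range(n) if j != i]
        result[i] = count_negative_eigenvalues(R[np.ix_(keep, keep)])
    return result

# Hypothetical matrix: variables 0-2 are fine, variable 3 has impossible correlations
R_example = np.array([[1.0,  0.5, 0.5,  0.9],
                      [0.5,  1.0, 0.5, -0.9],
                      [0.5,  0.5, 1.0,  0.0],
                      [0.9, -0.9, 0.0,  1.0]])
diagnostics = drop_one_diagnostics(R_example)
# variables whose removal leaves a matrix without negative eigenvalues
culprits = [i for i, k in diagnostics.items() if k == 0]
```

Here only dropping variable 3 leaves a proper matrix, so it is the one flagged as the likely culprit.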

Related

PCA : eigen values vs eigen vectors vs loadings in python vs R?

I am trying to calculate PCA loadings of a dataset. The more I read about it, the more I get confused because "loadings" is used differently at many places.
I am using sklearn.decomposition in Python for PCA as well as R (using the FactoMineR and factoextra libraries), as it provides easy visualization techniques. The following is my understanding:
pca.components_ give us the eigen vectors. They give us the directions of maximum variation.
pca.explained_variance_ give us the eigen values associated with the eigen vectors.
eigenvectors * sqrt(eigen values) = loadings which tell us how principal components (pc's) load the variables.
Now, what I am confused by is:
Many forums say that eigen vectors are the loadings. Then, when we multiply the eigen vectors by the sqrt(eigen values) we just get the strength of association. Others say eigenvectors * sqrt(eigen values) = loadings.
Eigen vectors squared tells us the contribution of variable to pc? I believe this is equivalent to var$contrib in R.
loading squared (eigen vector or eigenvector*sqrt(eigenvalue) I don't know which one) shows how well a pc captures a variable (closer to 1 = variable better explained by a pc). Is this equivalent of var$cos2 in R? If not what is cos2 in R?
Basically I want to know how to understand how well a principal component captures a variable and what is the contribution of a variable to a pc. I think they both are different.
What is pca.singular_values_? It is not clear from the documentation.
The first and second links I referred to (which contain R code with explanations) and the stats.stackexchange forum are what confused me.
Okay, after much research and going through many papers I have the following,
1. pca.components_ = eigen vectors. Take a transpose so that pc's are columns and variables are rows.
1.a: eigenvector**2 = variable contribution in principal components. If it's close to 1 then a particular pc is well explained by that variable.
In python -> pow(pca.components_.T, 2) [Multiply with 100 if you want percentages and not proportions] [R equivalent -> var$contrib]
2. pca.explained_variance_ = eigen values
3. pca.singular_values_ = singular values obtained from SVD.
3.a: (singular values)**2/(n-1) = eigen values
4. eigen vectors * sqrt(eigen values) = loadings matrix
4.a: vertical sum of squared loading matrix = eigen values. (Given you have taken transpose as explained in step 1)
4.b: horizontal sum of squared loading matrix = variable's variance explained by all principal components, i.e. how much of a variable's variance all pc's retain after transformation. (Given you have taken transpose as explained in step 1)
In python -> loading matrix = pca.components_.T * sqrt(pca.explained_variance_).
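The identities in the list above can be checked numerically. The sketch below uses plain NumPy SVD on made-up data rather than sklearn, so the names eigvecs, eigvals and loadings are stand-ins (my assumption) for pca.components_.T, pca.explained_variance_ and the loadings matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical data: 50 observations, 4 correlated variables
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))
Xc = X - X.mean(axis=0)                 # PCA works on centered data
n = Xc.shape[0]

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
eigvecs = Vt.T                          # columns = pc's, rows = variables
eigvals = s**2 / (n - 1)                # (singular values)**2/(n-1)
loadings = eigvecs * np.sqrt(eigvals)   # eigen vectors * sqrt(eigen values)

contrib = eigvecs**2                    # contribution of each variable to each pc
col_ss = (loadings**2).sum(axis=0)      # vertical sums of squared loadings
row_ss = (loadings**2).sum(axis=1)      # horizontal sums of squared loadings
```

With all components kept, col_ss reproduces the eigen values, row_ss reproduces each variable's variance, and every column of contrib sums to 1.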
For questions pertaining to r:
var$cos2 = var$cor squared (element-wise). var$cor is the correlation between a variable and a principal component; var$cos2 tells you, given the coordinates of the variables on a factor map, how well a variable is represented by a particular principal component.
var$contrib = Summarized by point 1. In r: (var$cos2 * 100) / (total cos2 of the component) PCA analysis in R link
Hope it helps others who are confused by PCA analysis.
Huge thanks to -- https://stats.stackexchange.com/questions/143905/loadings-vs-eigenvectors-in-pca-when-to-use-one-or-another

PCoA function pcoa extract vectors; percentage of variance explained

I have a dataset consisting of 132 observations and 10 variables.
These variables are all categorical. I am trying to see how my observations cluster and how they are different based on the percentage of variance. i.e I want to find out if a) there are any variables which helps to draw certain observation points apart from one another and b) if yes, what is the percentage of variance explained by it?
I was advised to run a PCoA (Principal Coordinates Analysis) on my data. I ran it using the vegan and ape packages. This is my code after loading my csv file into R; I call it data
data.dis <- vegdist(data, method="gower", na.rm=TRUE)
data.pcoa <- pcoa(data.dis)
I was then told to extract the vectors from the pcoa data and so
data.pcoa$vectors
It then returned me 132 rows but 20 columns of values (e.g. from Axis 1 to Axis 20)
I was perplexed over why there were 20 columns of values when I only have 10 variables. I was under the impression that I would only get 10 columns. If any kind souls out there could help to explain a) what do the vectors actually represent and b) how do I get the percentage of variance explained by Axis 1 and 2?
Another question that I had was I don't really understand the purpose of extracting the eigenvalues from data.pcoa because I saw some websites doing that after running a pcoa on their distance matrix but there was no further explanation on it.
Gower index is non-Euclidean and you can expect more real axes than the number of variables in Euclidean ordination (PCoA). However, you said that your variables are categorical. I assume that in R lingo they are factors. If so, you should not use vegan::vegdist() which only accepts numeric data. Moreover, if the variable is defined as a factor, vegan::vegdist() refuses to compute the dissimilarities and gives an error. If you managed to use vegdist(), you did not properly define your variables as factors. If you really have factor variables, you should use some other package than vegan for Gower dissimilarity (there are many alternatives).
The percentage of "variance" is a bit tricky for non-Euclidean dissimilarities, which also give some negative eigenvalues corresponding to imaginary dimensions. In that case, the sum of all positive eigenvalues (real axes) is higher than the total "variance" of data. ape::pcoa() returns the information you asked for in the element values. The proportion of variance explained is in its element values$Relative_eig. The total "variance" is returned in element trace. All this is documented in ?pcoa, where I read it.
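To see where negative eigenvalues come from, here is a minimal NumPy sketch of classical PCoA on a small made-up distance matrix. It is an illustration of the method, not the ape code; the function name pcoa_eigenvalues, the matrix D and the choice to divide by the trace are my assumptions.

```python
import numpy as np

def pcoa_eigenvalues(D):
    """Eigenvalues of the double-centered squared-distance matrix, largest first."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D**2) @ J                # Gower double-centering
    return np.linalg.eigvalsh(B)[::-1]

# Hypothetical non-Euclidean distances: three points 2 apart from each other,
# plus a fourth point closer (1.1) to all three than any Euclidean point can be
D = np.array([[0.0, 2.0, 2.0, 1.1],
              [2.0, 0.0, 2.0, 1.1],
              [2.0, 2.0, 0.0, 1.1],
              [1.1, 1.1, 1.1, 0.0]])
vals = pcoa_eigenvalues(D)
trace = vals.sum()                           # total "variance"
relative = vals / trace                      # proportion per axis
positive_share = relative[relative > 0].sum()
```

One eigenvalue is negative here, so the positive axes together account for more than 100% of the trace, which is exactly the tricky part described above.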

PCA analysis using Correlation Matrix as input in R

Now I have a 7000*7000 correlation matrix and I have to do PCA on it in R.
I used
CorPCA <- princomp(covmat=xCor)
where xCor is the correlation matrix, but it fails with
"covariance matrix is not non-negative definite"
I think it is because I have some negative correlations in that matrix.
Which built-in function in R can I use to get the result of the PCA?
One method to do the PCA is to perform an eigenvalue decomposition of the covariance matrix, see wikipedia.
The advantage of the eigenvalue decomposition is that you see which directions (eigenvectors) are significant, i.e. have a noticeable variation expressed by the associated eigenvalues. Moreover, you can detect whether the covariance matrix is positive definite (all eigenvalues greater than zero), non-negative definite (which is okay; some eigenvalues are equal to zero) or indefinite (which is not okay; some eigenvalues are negative). Sometimes numerical inaccuracies make a non-negative definite matrix look indefinite. In that case you would observe negative eigenvalues which are almost zero, and you can set these eigenvalues to zero to retain the non-negative definiteness of the covariance matrix. Furthermore, you can still interpret the result: eigenvectors contributing the significant information are associated with the biggest eigenvalues. If the list of sorted eigenvalues declines quickly, there are a lot of directions which do not contribute significantly and therefore can be dropped.
The built-in R function is eigen
If your covariance matrix is A then
eigen_res <- eigen(A)
# sorted list of eigenvalues
eigen_res$values
# slightly negative eigenvalues, set them to small positive value
eigen_res$values[eigen_res$values<0] <- 1e-10
# and produce regularized covariance matrix
Areg <- eigen_res$vectors %*% diag(eigen_res$values) %*% t(eigen_res$vectors)
"Not non-negative definite" does not mean the covariance matrix has negative correlations. It's the linear algebra equivalent of trying to take the square root of a negative number! You can't tell by looking at a few values of the matrix whether it's positive definite.
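A quick NumPy check makes the point. Both matrices below are made up for illustration: the first has negative correlations and is perfectly fine, the second has only positive correlations and is not.

```python
import numpy as np

# Negative correlations, yet positive definite
with_negatives = np.array([[ 1.0, -0.5, -0.3],
                           [-0.5,  1.0,  0.2],
                           [-0.3,  0.2,  1.0]])

# Only positive correlations, yet indefinite
only_positives = np.array([[1.0, 0.8, 0.0],
                           [0.8, 1.0, 0.8],
                           [0.0, 0.8, 1.0]])

# the smallest eigenvalue decides, not the sign of individual entries
min_eig_negatives = np.linalg.eigvalsh(with_negatives).min()
min_eig_positives = np.linalg.eigvalsh(only_positives).min()
```

The smallest eigenvalue is positive for the first matrix and negative for the second, so the sign of the entries tells you nothing by itself.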
Try adjusting some default values like tolerance in princomp call. Check this thread for example: How to use princomp () function in R when covariance matrix has zero's?
An alternative is to write some code of your own to perform what is called a NIPALS analysis. Take a look at this thread on the R-mailing list: https://stat.ethz.ch/pipermail/r-help/2006-July/110035.html
I'd even go as far as asking where you obtained the correlation matrix. Did you construct it yourself? Does it have NAs? If you constructed xCor from your own data, do you think you can sample the data and construct a smaller xCor matrix (say 1000x1000)? All these alternatives try to drive your PCA algorithm through the 'happy path' (i.e. all matrix operations can be internally carried out without difficulties in diagonalization etc., so no more 'non-negative definite' error msgs).

Is it impossible to do PCA on the data whose # of variables are bigger than that of individuals?

I am a new user of R and I am trying to do PCA on my data set using R. The dimension of the data is 20x10000, i.e. # of features is 10000 and # of individuals is 20. It seems that prcomp() cannot handle the data exactly, because the dimension of calculated eigenvectors and new data is 20x20 and 10000x20 instead of 10000x10000 and 20x10000. I tried the FactoMineR library also, but the results looked like it loses some dimensions, too. Is there any way of doing PCA on data like this? :(
By reading the manual, it looks like no components are omitted by default but check the tol argument. The problem is with the zero eigenvalues, which can come out slightly negative numerically and which are there whenever you have fewer cases than variables. (I think with 10000 variables and 20 individuals you will always have many eigenvalues that are zero or slightly negative.) See a simplified version of PCA I'm sometimes using that computes "PC loadings" the way they're usually used in psychology.
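PCA <- function(X, cut=NULL, USE="complete.obs") {
  # keep all components unless a cut-off is given
  if(is.null(cut)) cut <- ncol(X)
  # eigen decomposition of the correlation matrix
  E <- eigen(cor(X, use=USE))
  vec <- E$vectors
  val <- E$values
  # loadings = eigenvectors * sqrt(eigenvalues); NaN where an eigenvalue is negative
  P <- sweep(vec, 2, sqrt(val), "*")[, 1:cut]
  P
}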
The "loadings" are, basically, eigenvectors multiplied by the square root of eigenvalues -- but there's a problem here if you have negative eigenvalues. Something similar may happen with prcomp.
If you just want to reconstruct your data matrix exactly (for whatever reason), you can easily use svd or eigen directly. /My example used correlation matrix but the logic is not confined to this case./
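As a rough illustration of the svd route, here is a NumPy sketch on a small made-up matrix with more variables than individuals (6x40 instead of 20x10000, purely to keep it quick). The variable names are my own.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 40                                 # fewer individuals than variables
X = rng.normal(size=(n, p))                  # hypothetical data
center = X.mean(axis=0)
Xc = X - center

# economy-size SVD: only min(n, p) components exist at all
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s                               # individuals in pc space, n x n
rank = int(np.sum(s > 1e-10 * s[0]))         # centering removes one more dimension

# nothing is lost: the data matrix is recovered exactly
X_back = scores @ Vt + center
```

So the 10000x10000 matrix of eigenvectors is never needed: at most n-1 components carry any variance, and those are enough to reconstruct the data.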

how to generate pseudo-random positive definite matrix with constraints on the off-diagonal elements? [duplicate]

The user wants to impose a unique, non-trivial, upper/lower bound on the correlation between every pair of variable in a var/covar matrix.
For example: I want a variance matrix in which all variables have 0.9 > |rho(x_i,x_j)| > 0.6, rho(x_i,x_j) being the correlation between variables x_i and x_j.
Thanks.
There are MANY issues here.
First of all, are the pseudo-random deviates assumed to be normally distributed? I'll assume they are, as any discussion of correlation matrices gets nasty if we diverge into non-normal distributions.
Next, it is rather simple to generate pseudo-random normal deviates, given a covariance matrix. Generate standard normal (independent) deviates, and then transform by multiplying by the Cholesky factor of the covariance matrix. Add in the mean at the end if the mean was not zero.
And, a covariance matrix is also rather simple to generate given a correlation matrix. Just pre and post multiply the correlation matrix by a diagonal matrix composed of the standard deviations. This scales a correlation matrix into a covariance matrix.
I'm still not sure where the problem lies in this question, since it would seem easy enough to generate a "random" correlation matrix, with elements uniformly distributed in the desired range.
So all of the above is rather trivial by any reasonable standards, and there are many tools out there to generate pseudo-random normal deviates given the above information.
Perhaps the issue is the user insists that the resulting random matrix of deviates must have correlations in the specified range. You must recognize that a set of random numbers will only have the desired distribution parameters in an asymptotic sense. Thus, as the sample size goes to infinity, you should expect to see the specified distribution parameters. But any small sample set will not necessarily have the desired parameters, in the desired ranges.
For example, (in MATLAB) here is a simple positive definite 3x3 matrix. As such, it makes a very nice covariance matrix.
S = randn(3);
S = S'*S
S =
0.78863 0.01123 -0.27879
0.01123 4.9316 3.5732
-0.27879 3.5732 2.7872
I'll convert S into a correlation matrix.
s = sqrt(diag(S));
C = diag(1./s)*S*diag(1./s)
C =
1 0.0056945 -0.18804
0.0056945 1 0.96377
-0.18804 0.96377 1
Now, I can sample from a normal distribution using the statistics toolbox (mvnrnd should do the trick). It is just as easy to use a Cholesky factor.
L = chol(S)
L =
0.88805 0.012646 -0.31394
0 2.2207 1.6108
0 0 0.30643
Now, generate pseudo-random deviates, then transform them as desired.
X = randn(20,3)*L;
cov(X)
ans =
0.79069 -0.14297 -0.45032
-0.14297 6.0607 4.5459
-0.45032 4.5459 3.6549
corr(X)
ans =
1 -0.06531 -0.2649
-0.06531 1 0.96587
-0.2649 0.96587 1
If your desire was that the correlations must ALWAYS be greater than -0.188, then this sampling technique has failed, since the numbers are pseudo-random. In fact, that goal will be a difficult one to achieve unless your sample size is large enough.
You might employ a simple rejection scheme, whereby you do the sampling, then redo it repeatedly until the sample has the desired properties, with the correlations in the desired ranges. This may get tiring.
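A rejection scheme of this kind might look like the following NumPy sketch. The function name, the target matrix C_target, the bounds and the max_tries cap are all assumptions chosen for illustration.

```python
import numpy as np

def sample_with_correlations_in_range(C, n, lo, hi, rng, max_tries=1000):
    """Redraw a sample until all |sample correlations| lie strictly in (lo, hi)."""
    L = np.linalg.cholesky(C)                 # lower triangular, C = L @ L.T
    iu = np.triu_indices_from(C, k=1)         # off-diagonal positions
    for _ in range(max_tries):
        X = rng.standard_normal((n, C.shape[0])) @ L.T
        r = np.abs(np.corrcoef(X, rowvar=False)[iu])
        if np.all((r > lo) & (r < hi)):
            return X
    return None                               # gave up: the "tiring" case

# Hypothetical target: all correlations 0.75, wanted range 0.6 < |rho| < 0.9
C_target = np.full((3, 3), 0.75)
np.fill_diagonal(C_target, 1.0)
X_ok = sample_with_correlations_in_range(
    C_target, n=200, lo=0.6, hi=0.9, rng=np.random.default_rng(0))
```

With a generous range and a decent sample size the first draw usually passes; with many variables or a narrow range the number of redraws grows quickly.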
An approach that might work (but one that I've not totally thought out at this point) is to use the standard scheme as above to generate a random sample. Compute the correlations. If they fail to lie in the proper ranges, then identify the perturbation one would need to make to the actual (measured) covariance matrix of your data, so that the correlations would be as desired. Now, find a zero mean random perturbation to your sampled data that would move the sample covariance matrix in the desired direction.
This might work, but unless I knew that this is actually the question at hand, I won't bother to go any more deeply into it. (Edit: I've thought some more about this problem, and it appears to be a quadratic programming problem, with quadratic constraints, to find the smallest perturbation to a matrix X, such that the resulting covariance (or correlation) matrix has the desired properties.)
This is not a complete answer, but a suggestion of a possible constructive method:
Looking at the characterizations of the positive definite matrices (http://en.wikipedia.org/wiki/Positive-definite_matrix) I think one of the most affordable approaches could be using the Sylvester criterion.
You can start with a trivial 1x1 random matrix with positive determinant and expand it in one row and column step by step while ensuring that the new matrix has also a positive determinant (how to achieve that is up to you ^_^).
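One way to read this suggestion, as a NumPy sketch: grow the matrix one variable at a time, draw the new correlations uniformly in the wanted range, and keep the draw only if the determinant stays positive. The function name, the retry cap and the restriction to positive correlations are my own simplifying assumptions, not part of the suggestion.

```python
import numpy as np

def grow_correlation_matrix(p, lo, hi, rng, max_tries=1000):
    """Build a p x p correlation matrix with off-diagonals in (lo, hi), step by step."""
    C = np.ones((1, 1))                            # trivial 1x1 start
    for k in range(1, p):
        for _ in range(max_tries):
            c = rng.uniform(lo, hi, size=k)        # candidate new row/column
            cand = np.block([[C, c[:, None]],
                             [c[None, :], np.ones((1, 1))]])
            # Sylvester: earlier leading minors are already positive,
            # so only the new determinant needs checking
            if np.linalg.det(cand) > 1e-12:
                C = cand
                break
        else:
            return None                            # no admissible column found
    return C

C_rand = grow_correlation_matrix(4, 0.6, 0.9, np.random.default_rng(0))
```

Because every leading principal minor is positive by construction, the result is positive definite; how often a candidate column gets rejected as p grows is something I have not checked.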
Woodship,
"First of all, are the pseudo-random deviates assumed to be normally distributed?"
yes.
"Perhaps the issue is the user insists that the resulting random matrix of deviates must have correlations in the specified range."
Yes, that's the whole difficulty
"You must recognize that a set of random numbers will only have the desired distribution parameters in an asymptotic sense."
True, but this is not the problem here: your strategy works for p=2, but fails for p>2, regardless of sample size.
"If your desire was that the correlations must ALWAYS be greater than -0.188, then this sampling technique has failed, since the numbers are pseudo-random. In fact, that goal will be a difficult one to achieve unless your sample size is large enough."
It is not a sample size issue because with p>2 you do not even observe convergence to the right range for the correlations as sample size grows: I tried the technique you suggest before posting here, and it is obviously flawed.
"You might employ a simple rejection scheme, whereby you do the sampling, then redo it repeatedly until the sample has the desired properties, with the correlations in the desired ranges. This may get tiring."
Not an option, for p large (say larger than 10) this option is intractable.
"Compute the correlations. I they fail to lie in the proper ranges, then identify the perturbation one would need to make to the actual (measured) covariance matrix of your data, so that the correlations would be as desired."
Ditto
As for the QP, I understand the constraints, but I'm not sure about the way you define the objective function; by using the "smallest perturbation" off some initial matrix, you will always end up getting the same (solution) matrix: all the off-diagonal entries will be exactly equal to one of the two bounds (i.e. not pseudo-random); plus it is kind of overkill, isn't it?
Come on people, there must be something simpler
