PCA : eigen values vs eigen vectors vs loadings in python vs R? - r

I am trying to calculate PCA loadings of a dataset. The more I read about it, the more I get confused because "loadings" is used differently at many places.
I am using sklearn.decomposition in python for PCA as well as R (using the FactoMineR and factoextra libraries), as it provides easy visualization techniques. The following is my understanding:
pca.components_ gives us the eigen vectors. They give us the directions of maximum variation.
pca.explained_variance_ gives us the eigen values associated with the eigen vectors.
eigen vectors * sqrt(eigen values) = loadings, which tell us how principal components (pc's) load the variables.
Now, what I am confused by is:
Many forums say that eigen vectors are the loadings, and that multiplying the eigen vectors by sqrt(eigen values) just gives the strength of association. Others say eigen vectors * sqrt(eigen values) = loadings.
Do the squared eigen vectors tell us the contribution of a variable to a pc? I believe this is equivalent to var$contrib in R.
The squared loading (of the eigen vector or of eigen vector * sqrt(eigen value), I don't know which one) shows how well a pc captures a variable (closer to 1 = variable better explained by a pc). Is this the equivalent of var$cos2 in R? If not, what is cos2 in R?
Basically I want to know how well a principal component captures a variable and what the contribution of a variable to a pc is. I think these are two different things.
What is pca.singular_values_? It is not clear from the documentation.
The first and second links that I referred to contain R code with explanations, and the statsexchange forum thread is the one that confused me.

Okay, after much research and going through many papers, I have the following:
1. pca.components_ = eigen vectors. Take the transpose so that pc's are columns and variables are rows.
1.a: eigenvector**2 = variable contribution to the principal components. If it's close to 1, then that pc is well explained by that variable.
In python -> pow(pca.components_.T, 2) [Multiply by 100 if you want percentages and not proportions] [R equivalent -> var$contrib]
2. pca.explained_variance_ = eigen values
3. pca.singular_values_ = singular values obtained from SVD.
(singular values)**2/(n-1) = eigen values
4. eigen vectors * sqrt(eigen values) = loadings matrix
4.a: vertical sum of the squared loadings matrix = eigen values. (Given you have taken the transpose as explained in step 1)
4.b: horizontal sum of the squared loadings matrix = each variable's variance explained by all principal components, i.e. how much of a variable's variance all pc's together retain after the transformation. (Given you have taken the transpose as explained in step 1)
In python -> loadings matrix = pca.components_.T * np.sqrt(pca.explained_variance_)
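The relationships listed above can be checked numerically. The following is a minimal sketch using plain NumPy on made-up random data rather than sklearn itself; the variable names (eigvecs, eigvals, loadings, contrib) are illustrative, and the comments only indicate which sklearn attribute each quantity is assumed to correspond to.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 50 observations, 4 correlated variables
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))
n = X.shape[0]
Xc = X - X.mean(axis=0)                 # centre only, no scaling

# SVD of the centred data; rows of Vt play the role of pca.components_
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
eigvecs = Vt.T                          # variables in rows, pc's in columns
eigvals = s**2 / (n - 1)                # role of pca.explained_variance_

# The same eigen values obtained from the covariance matrix directly
cov_eigvals = np.sort(np.linalg.eigvalsh(np.cov(Xc, rowvar=False)))[::-1]

# Loadings: each eigen vector (column) scaled by sqrt of its eigen value
loadings = eigvecs * np.sqrt(eigvals)

col_sums = (loadings**2).sum(axis=0)    # vertical sums: one per pc
row_sums = (loadings**2).sum(axis=1)    # horizontal sums: one per variable
contrib = 100 * eigvecs**2              # percentage contributions per pc

print(np.allclose(eigvals, cov_eigvals))
print(np.allclose(col_sums, eigvals))
print(np.allclose(row_sums, Xc.var(axis=0, ddof=1)))
```

Note that the horizontal sums equal the variable variances only when all components are kept; with fewer components they give the part of each variable's variance that the retained pc's account for.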
For the questions pertaining to R:
var$cos2 = var$cor^2 (the squared correlation between a variable and a principal component). Given the coordinates of the variables on a factor map, cos2 shows how well a variable is represented by a particular principal component.
var$contrib = summarized by point 1. In R: (var.cos2 * 100) / (total cos2 of the component) (from the PCA analysis in R link)
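To make the difference between cos2 and contrib concrete, here is a small NumPy sketch on made-up standardized data. It does not call FactoMineR; it only assumes that cos2 is the squared variable-to-pc correlation and that contrib is cos2 rescaled within each component, as in the formula quoted above.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 60 observations, 3 correlated variables
X = rng.normal(size=(60, 3)) @ rng.normal(size=(3, 3))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardized variables

# PCA on the correlation matrix, eigen values sorted in decreasing order
R = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Z @ eigvecs                    # pc scores of the observations
p = Z.shape[1]

# Correlation of each variable (rows) with each pc (columns)
cor = np.corrcoef(Z, scores, rowvar=False)[:p, p:]

cos2 = cor**2                            # quality of representation
contrib = 100 * cos2 / cos2.sum(axis=0)  # contribution within each pc

print(np.allclose(cor, eigvecs * np.sqrt(eigvals)))  # cor = loadings here
print(np.allclose(contrib, 100 * eigvecs**2))        # contrib = eigvec**2
```

Read row-wise, cos2 sums to 1 over all components (how a variable's variance is split across pc's); read column-wise, contrib sums to 100 (how a pc is split across variables).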
Hope it helps others who are confused by PCA analysis.
Huge thanks to -- https://stats.stackexchange.com/questions/143905/loadings-vs-eigenvectors-in-pca-when-to-use-one-or-another

Related

PCoA function pcoa extract vectors; percentage of variance explained

I have a dataset consisting of 132 observations and 10 variables.
These variables are all categorical. I am trying to see how my observations cluster and how they are different based on the percentage of variance. i.e I want to find out if a) there are any variables which helps to draw certain observation points apart from one another and b) if yes, what is the percentage of variance explained by it?
I was advised to run a PCoA (Principal Coordinates Analysis) on my data. I ran it using the vegan and ape packages. This is my code after loading my csv file into R (I call it data):
data.dis <- vegdist(data, method = "gower", na.rm = TRUE)
data.pcoa <- pcoa(data.dis)
I was then told to extract the vectors from the pcoa data, and so
data.pcoa$vectors
It then returned me 132 rows but 20 columns of values (e.g. from Axis 1 to Axis 20)
I was perplexed over why there were 20 columns of values when I only have 10 variables. I was under the impression that I would only get 10 columns. If any kind souls out there could help to explain a) what do the vectors actually represent and b) how do I get the percentage of variance explained by Axis 1 and 2?
Another question that I had was I don't really understand the purpose of extracting the eigenvalues from data.pcoa because I saw some websites doing that after running a pcoa on their distance matrix but there was no further explanation on it.
Gower index is non-Euclidean and you can expect more real axes than the number of variables in Euclidean ordination (PCoA). However, you said that your variables are categorical. I assume that in R lingo they are factors. If so, you should not use vegan::vegdist() which only accepts numeric data. Moreover, if the variable is defined as a factor, vegan::vegdist() refuses to compute the dissimilarities and gives an error. If you managed to use vegdist(), you did not properly define your variables as factors. If you really have factor variables, you should use some other package than vegan for Gower dissimilarity (there are many alternatives).
The percentage of "variance" is a bit tricky for non-Euclidean dissimilarities, which also give some negative eigenvalues corresponding to imaginary dimensions. In that case, the sum of all positive eigenvalues (real axes) is higher than the total "variance" of the data. ape::pcoa() returns the information you asked for in the element values. The proportion of variance explained is in its element values$Relative_eig. The total "variance" is returned in the element trace. All this is documented in ?pcoa, where I read it.
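Both points (more axes than variables, and positive eigenvalues summing to more than the total) can be illustrated with a hand-rolled PCoA. This is only a sketch in NumPy on made-up numeric data, not what ape::pcoa() runs internally: the Gower dissimilarity is taken to be the range-scaled Manhattan distance, and relative_eig is assumed to be each eigenvalue divided by the sum of all eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 2))   # hypothetical: 8 observations, 2 numeric variables
n = X.shape[0]

# Gower dissimilarity for numeric variables: range-scaled Manhattan distance
ranges = X.max(axis=0) - X.min(axis=0)
D = (np.abs(X[:, None, :] - X[None, :, :]) / ranges).mean(axis=2)

# Classical PCoA: double-centre -0.5 * D^2, then eigen decomposition
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D**2) @ J
eigvals = np.sort(np.linalg.eigvalsh(B))[::-1]

trace = eigvals.sum()                   # total "variance"
relative_eig = eigvals / trace          # assumed analogue of Relative_eig
n_real_axes = int((eigvals > 1e-10).sum())
n_negative = int((eigvals < -1e-10).sum())

print("real axes:", n_real_axes, "negative eigenvalues:", n_negative)
print("share of first two axes:", relative_eig[:2].sum())
print("sum of positive eigenvalues >= trace:",
      eigvals[eigvals > 0].sum() >= trace - 1e-12)
```

The number of non-zero eigenvalues is bounded by the number of observations minus one, not by the number of variables, which is why the vectors element can have more columns than the data has variables.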

convert a list -class numeric- into a distance structure in R

I have a list that looks like this, it is a measure of dispersion for each sample.
1 2 3 4 5
0.11829384 0.24987017 0.08082147 0.13355495 0.12933790
To analyze this further I need it to be a distance structure; the vegan package needs it as a 'dist' object.
I found some solutions that apply to matrices > dist, but how could I change this current data into a dist object?
I am using the FD package; in the manual I found:
"Still, one potential advantage of FDis over Rao’s Q is that in the unweighted case (i.e. with presence-absence data), it opens possibilities for formal statistical tests for differences in FD between two or more communities through a distance-based test for homogeneity of multivariate dispersions (Anderson 2006); see betadisper for more details"
I wanted to use vegan betadisper function to test if there are differences among different regions (I provided this using element "region" with column "region" too)
functional <- FD(trait, comun)
mod <- betadisper(functional$FDis, region$region)
Using gowdis or fdisp from FD didn't work either.
distancias <- gowdis(rasgo)
mod <- betadisper(distancias, region$region)
dispersion <- fdisp(distancias, presence)
mod <- betadisper(dispersion, region$region)
I tried this but I need a list object. I thought I could pass those results to betadisper.
You cannot do this: FD::fdisp() does not return dissimilarities. It returns a list of three elements: the dispersions FDis for each sampling unit (SU), and the results of the eigen decomposition of input dissimilarities (eig for eigenvalues, vectors for orthonormal eigenvectors). The FDis values are summarized for each original SU, but there is no information on the differences among SUs. The eigen decomposition can be used to reconstruct the original input dissimilarities (your distancias from FD::gowdis()), but you can directly use the input dissimilarities. Function FD::gowdis() returns a regular "dist" structure that you can directly use in vegan::betadisper() if that gives you a meaningful analysis. For this, your grouping variable must be based on the same units as your distancias. In typical application of fdisp, the units are species (taxa), but it seems you want to get analysis for communities/sites/whatever. This will not be possible with these tools.

Interpreting the psych::cor.smoother function

I've tried to contact William Revelle about this but he isn't responding.
In the psych package there is a function called cor.smoother, which determines whether or not a correlation matrix is positive definite. Its explanation is as follows:
"cor.smoother examines all of nvar minors of rank nvar-1 by systematically dropping one variable at a time and finding the eigen value decomposition. It reports those variables, which, when dropped, produce a positive definite matrix. It also reports the number of negative eigenvalues when each variable is dropped. Finally, it compares the original correlation matrix to the smoothed correlation matrix and reports those items with absolute deviations great than cut. These are all hints as to what might be wrong with a correlation matrix."
It is really the statement in bold that I am hoping someone can interpret in a more understandable way for me.
A belated answer to your question.
Correlation matrices are said to be improper (or more accurately, not positive semi-definite) when at least one of the eigen values of the matrix is less than 0. This can happen if you have some missing data and are using pair-wise complete correlations. It is particularly likely to happen if you are doing tetrachoric or polychoric correlations based upon data sets with some or even a lot of missing data.
(A correlation matrix, R, may be decomposed into a set of eigen vectors (X) and eigen values (lambda) where R = X lambda X’. This decomposition is the basis of components analysis and factor analysis, but that is more than you want to know.)
The cor.smooth function finds the eigen values and then adjusts the negative ones by making them slightly positive (and adjusting the other ones to compensate for this change).
The cor.smoother function attempts to identify the variables that are making the matrix improper. It does this by considering all the matrices generated by dropping one variable at a time and seeing which ones of those are not positive semi-definite (i.e. have eigen values < 0.) Ideally, this will identify one variable that is messing things up.
An example of this is in the burt data set where the sorrow-tenderness correlation was probably mistyped and the .87 should be .81.
cor.smoother(burt) #identifies tenderness and sorrow as likely culprits
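The drop-one-variable idea can be sketched in a few lines of NumPy. This is not the psych implementation: the 4-variable matrix is made up, the function names are hypothetical, and smooth_sketch is a simplified stand-in for cor.smooth (it only lifts negative eigen values to a small positive number and rescales to a unit diagonal).

```python
import numpy as np

# Hypothetical 4-variable "correlation" matrix: every correlation is 0.9
# except the 0-3 entry, which is inconsistent with the rest
R = np.full((4, 4), 0.9)
np.fill_diagonal(R, 1.0)
R[0, 3] = R[3, 0] = -0.9

def n_negative_eigenvalues(mat, tol=1e-12):
    """Count eigen values below zero (beyond a small numerical tolerance)."""
    return int((np.linalg.eigvalsh(mat) < -tol).sum())

def smooth_sketch(mat, eps=1e-8):
    """Simplified smoothing: lift eigen values to at least eps, rebuild,
    then rescale so the diagonal is 1 again."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, eps, None)
    out = (vecs * vals) @ vecs.T
    d = np.sqrt(np.diag(out))
    return out / np.outer(d, d)

# Drop one variable at a time; a variable is a suspect if the remaining
# matrix has no negative eigen values
culprits = []
for j in range(R.shape[0]):
    keep = [i for i in range(R.shape[0]) if i != j]
    if n_negative_eigenvalues(R[np.ix_(keep, keep)]) == 0:
        culprits.append(j)

smoothed = smooth_sketch(R)
print("negative eigen values in R:", n_negative_eigenvalues(R))
print("suspect variables:", culprits)
```

As in the burt example, a single bad entry points at the two variables it connects, because dropping either of them removes that entry from the matrix.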

Getting factor analysis scores when using the factanal function

I have used the factanal function in R to do a factor analysis on a data set.
Viewing the summary of the output, I see I have access to the loading and other objects, but I am interested in the scores of the factor analysis.
How can I get the scores when using the factanal function?
I attempted to calculate the scores myself:
m <- t(as.matrix(factor$loadings))
n <- (as.matrix(dataset))
scores <- m%*%n
and got the error:
Error in m %*% n : non-conformable arrays
I'm not sure why, since I double-checked the dimensions of the data and they seem to agree.
Thanks everyone for your help.
Ah.
factormodel$loadings[,1] %*% t(dataset)
This question might be a bit dated, but nevertheless:
factanal returns a matrix of scores. You simply call it like you called the loadings: factor$scores. No need to calculate it yourself. But you do need to specify in the function that you want to produce the scores, by using the "scores" argument.
Your solution, of multiplying the loadings by the observation matrix, is wrong. According to the FA model, the observed dataset should be the multiplication of loadings and scores (plus the unique contributions, and then rotation). This is not equivalent to what you wrote. I think you treated the loadings as the coefficients from observed data to scores, rather than the other way around (from scores to observations).
I found this paper that explains different ways to extract scores; it might be useful.
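For illustration, regression-type scores (the kind factanal is documented to return with scores = "regression") can be computed by hand as Z R^-1 L, where Z is the standardized data, R its correlation matrix and L the loadings. The sketch below uses NumPy with made-up loadings and simulated data, so the numbers are hypothetical; it only shows the direction of the model (scores times loadings gives observations) and the matrix shapes involved.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, k = 500, 6, 2

# Hypothetical loadings: p variables (rows) x k factors (columns)
L = np.array([[0.8, 0.0], [0.7, 0.0], [0.6, 0.0],
              [0.0, 0.8], [0.0, 0.7], [0.0, 0.6]])
psi = 1.0 - (L**2).sum(axis=1)          # uniquenesses

# Simulate from the FA model: observations = scores @ loadings' + unique part
F_true = rng.normal(size=(n, k))
X = F_true @ L.T + rng.normal(size=(n, p)) * np.sqrt(psi)

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(Z, rowvar=False)

# Regression scores: (n x p) @ (p x p)^-1 @ (p x k) -> (n x k)
scores = Z @ np.linalg.solve(R, L)

# t(loadings) %*% dataset would be (k x p) times (n x p): non-conformable
print(scores.shape)
print(np.corrcoef(scores[:, 0], F_true[:, 0])[0, 1])
```

The estimated scores correlate with the simulated factors but are not identical to them, since the unique parts cannot be separated out exactly.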

Is it impossible to do PCA on the data whose # of variables are bigger than that of individuals?

I am a new user of R and I am trying to do PCA on my data set using R. The dimension of the data is 20x10000, i.e. # of features is 10000 and # of individuals is 20. It seems that prcomp() cannot handle the data exactly, because the dimensions of the calculated eigenvectors and the new data are 20x20 and 10000x20 instead of 10000x10000 and 20x10000. I tried the FactoMineR library also, but the results looked like it loses some dimensions, too. Is there any way to do PCA on data like this? :(
By reading the manual, it looks like no components are omitted by default, but check the tol argument. The problem is with negative eigenvalues that may be there (and often are) when you have fewer cases than variables. (I think with 10000 variables and 20 individuals you will always have many negative eigenvalues.) See a simplified version of PCA I sometimes use that computes "PC loadings" the way they're usually used in psychology.
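PCA <- function(X, cut = NULL, USE = "complete.obs") {
  # Keep all components unless a cut-off is given
  if (is.null(cut)) cut <- ncol(X)
  # Eigen decomposition of the correlation matrix
  E <- eigen(cor(X, use = USE))
  vec <- E$vectors
  val <- E$values
  # Loadings: scale each eigenvector (column) by the sqrt of its eigenvalue
  P <- sweep(vec, 2, sqrt(val), "*")[, 1:cut]
  P
}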
The "loadings" are, basically, eigenvectors multiplied by the square root of eigenvalues -- but there's a problem here if you have negative eigenvalues. Something similar may happen with prcomp.
If you just want to reconstruct your data matrix exactly (for whatever reason), you can easily use svd or eigen directly. /My example used correlation matrix but the logic is not confined to this case./
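The exact reconstruction via SVD can be sketched as follows. This is a NumPy illustration on random data with 20 rows and 1000 columns as a smaller, hypothetical stand-in for the 20x10000 case. It shows that with n individuals there are at most n - 1 components with non-zero variance after centring, and that those components are enough to rebuild the data.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 20, 1000                 # hypothetical stand-in for 20 x 10000
X = rng.normal(size=(n, p))
col_means = X.mean(axis=0)
Xc = X - col_means              # centre each variable

# Thin SVD: U is n x n, s has n values, Vt is n x p
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
eigvals = s**2 / (n - 1)        # variances of the components, never negative

# Centring removes one dimension, so at most n - 1 components are non-zero
n_nonzero = int((s > 1e-10 * s[0]).sum())

scores = U * s                  # coordinates of the 20 individuals
X_rebuilt = scores @ Vt + col_means

print(Vt.shape, n_nonzero)
print(np.allclose(X_rebuilt, X))
```

A 10000x10000 matrix of eigenvectors is therefore not needed: all remaining directions have zero variance in a sample of 20 individuals.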

Resources