Compute Gram matrix in R

I need some help computing an n x n Gram matrix K for a given kernel. Here is my R code that generates simulated data. The matrix I can be any positive definite matrix; I take a diagonal (identity) matrix for simplicity.
set.seed(3)
n=20
x=runif(n)
y=rnorm(n)
df<-cbind(x,y)
I=diag(2)
kernel<-function(x,y) {
t(x)%*%I%*%y
}
# for example
#K[1,1]
t(df[1,])%*%I%*%df[1,]
[,1]
[1,] 0.5829376
#K[1,2]
t(df[1,])%*%I%*%df[2,]
[,1]
[1,] 0.978207
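For reference, a minimal sketch of one way to fill the whole n x n Gram matrix K from the kernel defined above (reusing df, n, I, and kernel() from the code block above):
# Sketch: build the full n x n Gram matrix from the kernel above
K <- matrix(0, n, n)
for (i in 1:n) {
  for (j in 1:n) {
    K[i, j] <- kernel(df[i, ], df[j, ])
  }
}
# For this linear kernel the loop is equivalent to the one-liner
K2 <- df %*% I %*% t(df)
all.equal(K, K2)  # TRUE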

Example in the case of a linear regression model for a data set called data, with a response Y and predictors X1 and X2:
#The regression model
model=lm(Y~X1+X2, data)
#Estimating residuals
r=model$res
#Estimating hat values
h=hatvalues(model)
#Computing Gram matrix
d=r/(1-h)
#Estimating Gram determinant (which summarizes the information in the Gram matrix)
press=t(d)%*%d
round(press,2)
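As a self-contained illustration of the same recipe, here is a sketch using the built-in mtcars data; the model formula and variable names are just placeholders, not from the original post:
model <- lm(mpg ~ wt + hp, data = mtcars)
r <- residuals(model)
h <- hatvalues(model)
d <- r / (1 - h)
press <- sum(d^2)   # same as t(d) %*% d, the PRESS statistic
round(press, 2)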
I hope, despite the delay, this may be useful for someone. Best regards.

Related

R principal component analysis interpreting the princomp() and eigen() functions for a non-square matrix

I'm trying to learn about and implement principal component analysis and study in particular how it relates to eigenvectors and eigenvalues and other things from linear algebra. Cross Validated has been helpful but I do have questions I haven't seen an answer for so far.
I've read online that eigenvalues and eigenvectors are for square matrices and singular value decomposition is like an extension of that for non-square matrices. Here is what I find on Google when I search the question:
Note. Eigenvalues and eigenvectors are only for square matrices. Eigenvectors are by definition nonzero. Eigenvalues may be equal to zero.
But if I take, for example, a selection from the mtcars dataset, by keeping only the first six columns but all of the observations, and then ask about the dimensions of this new dataset, I see that I have an m x n matrix, namely a 32 x 6 matrix.
library(dplyr)  # loaded for the %>% pipe
mtcars_selection <- mtcars %>%
  dplyr::select(mpg:wt)
nrow(mtcars_selection)   # 32 observations (rows)
length(mtcars_selection) # 6 variables (columns)
Now turning to principal component analysis, when I run these lines of code:
prcomp_attempt = stats::prcomp(mtcars_selection, scale = FALSE)
summary(prcomp_attempt)
I get the following as part of the output.
                            PC1      PC2     PC3     PC4     PC5    PC6
Standard deviation     136.5265 38.11828 3.04062 0.67678 0.36761 0.3076
Proportion of Variance   0.9272  0.07228 0.00046 0.00002 0.00001 0.0000
Cumulative Proportion    0.9272  0.99951 0.99997 0.99999 1.00000 1.0000
Similarly, when I change prcomp() to princomp() I get a similar output.
princomp_attempt = stats::princomp(mtcars_selection, scale = FALSE)
summary(princomp_attempt)
                            Comp.1      Comp.2       Comp.3        Comp.4         Comp.5         Comp.6
Standard deviation     134.3763479 37.51795949 2.9927313123 0.66612093849 0.361823645756 0.302784248968
Proportion of Variance   0.9272258  0.07228002 0.0004599126 0.00002278484 0.000006722546 0.000004707674
Cumulative Proportion    0.9272258  0.99950587 0.9999657849 0.99998856978 0.999995292326 1.000000000000
From ?prcomp() I see that the computation is done using singular value decomposition.
The calculation is done by a singular value decomposition of the (centered and possibly scaled) data matrix, not by using eigen on the covariance matrix. This is generally the preferred method for numerical accuracy.
And from ?princomp() I see:
The calculation is done using eigen on the correlation or covariance matrix, as determined by cor. This is done for compatibility with the S-PLUS result.
But doesn't this all mean that one of the code chunks above should work and one of them should not work? In particular, how did princomp() work if the matrix that went into the princomp() function is a non-square matrix?
Now when I take a look at the eigen() function on the covariance matrix, which is non-square, I get an output that looks like it only printed the first six rows.
eigen(cov(mtcars_selection))
I see in this particular output
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.038119604 -0.009201962 0.99282561 -0.057597550 0.094821246 0.021236444
[2,] -0.012035481 0.003373585 -0.06561936 -0.965667568 0.050583003 0.245897079
[3,] -0.899622021 -0.435427435 0.03153733 0.006753430 -0.006294596 -0.001825989
[4,] -0.434782990 0.900148911 0.02503332 0.006406853 0.004534789 -0.002722171
[5,] 0.002660336 0.003898813 0.03993024 0.187172744 -0.494914521 0.847590278
[6,] -0.006239859 -0.004861028 -0.08231475 0.170435844 0.862235306 0.469748461
In the eigen() and the princomp() functions, is the data being conformed to a square matrix by slicing off the rows that are greater in number than the columns, so m = n?
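For what it's worth, a quick check (a sketch reusing mtcars_selection from above) shows that the covariance matrix handed to eigen() is in fact square, and that the SVD route of prcomp() and the eigen route differ only in the n-1 versus n divisor used for the variance:
dim(cov(mtcars_selection))   # 6 6 -- the covariance matrix is square
ev <- eigen(cov(mtcars_selection))$values
sqrt(ev)                     # matches prcomp()'s standard deviations
n <- nrow(mtcars_selection)
sqrt(ev * (n - 1) / n)       # matches princomp()'s standard deviations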

Applying PCA to a covariance matrix

I am having some difficulty understanding some steps in a procedure. In short, they take coordinate data, find the covariance matrix, apply PCA, then extract the standard deviations as the square roots of the eigenvalues. I am trying to reproduce this process, but I am stuck on the steps.
The Steps Taken
The data set consists of one matrix, R, that contains coordinate pairs (x(i), y(i)) with i = 1, ..., N, where N is the total number of instances recorded. We applied PCA to the covariance matrix of the R input data set, and the following variables were obtained:
a) the principal components of the new coordinate system, the eigenvectors u and v, and
b) the eigenvalues (λ1 and λ2) corresponding to the total variability explained by each principal component.
With these variables, a graphical representation was created for each item. Two orthogonal segments were centred on the mean of the coordinate data. The segments’ directions were driven by the eigenvectors of the PCA, and the length of each segment was defined as one standard deviation (σ1 and σ2) around the mean, which was calculated by extracting the square root of each eigenvalue, λ1 and λ2.
My Steps
# reproducible data
set.seed(1)
x<-rnorm(10,50,4)
y<-rnorm(10,50,7)
# Note: my real data is not perfectly distributed in this fashion
df<-data.frame(x,y) # this is my R matrix
covar.df<-cov(df,use="all.obs",method='pearson') # this is my covariance matrix
pca.results<-prcomp(covar.df) # this applies PCA to the covariance matrix
pca.results$sdev # these are the standard deviations of the principal components
# which is what I believe I am looking for.
This is where I am stuck, because I am not sure if I am trying to get the sdev output from prcomp() or if I should scale my data first. They are all on the same scale, so I do not see the issue with it.
My second question is: how do I extract the standard deviation in the x and y direction?
You don't apply prcomp to the covariance matrix, you do it on the data itself.
result= prcomp(df)
If by scaling you mean normalizing or standardizing, that happens before you do prcomp(). For more information, see this introductory link on the procedure: pca on R. It can walk you through the basics. To get the sdev, use summary() on the result object and the sdev component:
summary(result)
result$sdev
You don't apply prcomp to the covariance matrix. scale = TRUE bases the PCA on the correlation matrix and scale = FALSE on the covariance matrix:
df.cor = prcomp(df, scale=TRUE)
df.cov = prcomp(df, scale=FALSE)
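Putting those answers together, a minimal sketch of the whole procedure described in the question (reusing the simulated df from above; the plotting details are just one plausible way to draw the segments):
pca <- prcomp(df)      # PCA on the data itself, not on cov(df)
ctr <- colMeans(df)    # centre of the coordinate data
sds <- pca$sdev        # square roots of the eigenvalues of cov(df)
plot(df$x, df$y, asp = 1)
# one segment per principal component, of length one standard deviation
segments(ctr[1], ctr[2],
         ctr[1] + sds[1] * pca$rotation[1, 1], ctr[2] + sds[1] * pca$rotation[2, 1])
segments(ctr[1], ctr[2],
         ctr[1] + sds[2] * pca$rotation[1, 2], ctr[2] + sds[2] * pca$rotation[2, 2])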

Principal component analysis with EQUAMAX rotation

I need to do a principal component analysis (PCA) with EQUAMAX-rotation in R.
Unfortunately the function principal() I normally use for PCA does not offer this kind of rotation.
I could find out that it may be possible somehow with the package GPArotation but I could not yet figure out how to use this in the PCA.
Maybe someone can give an example on how to do an equamax-rotation PCA?
Or is there a function for PCA in another package that offers the use of equamax-rotation directly?
The psych package, from which I guess you are using principal(), has the rotations varimax, quartimax, promax, oblimin, simplimax, and cluster, but not equamax (psych p. 232), which is a compromise between varimax and quartimax.
An excerpt from the STATA manual (mvrotate, p. 3):
Rotation criteria
In the descriptions below, the matrix to be rotated is denoted as A, p denotes the number of rows of A, and f denotes the number of columns of A (factors or components). If A is a loading matrix from factor or pca, p is the number of variables, and f is the number of factors or components.
Criteria suitable only for orthogonal rotations
varimax and vgpf apply the orthogonal varimax rotation (Kaiser 1958). varimax maximizes the variance of the squared loadings within factors (columns of A). It is equivalent to cf(1/p) and to oblimin(1). varimax, the most popular rotation, is implemented with a dedicated fast algorithm and ignores all optimize options. Specify vgpf to switch to the general GPF algorithm used for the other criteria.
quartimax uses the quartimax criterion (Harman 1976). quartimax maximizes the variance of
the squared loadings within the variables (rows of A). For orthogonal rotations, quartimax is equivalent to cf(0) and to oblimax.
equamax specifies the orthogonal equamax rotation. equamax maximizes a weighted sum of the
varimax and quartimax criteria, reflecting a concern for simple structure within variables (rows of A) as well as within factors (columns of A). equamax is equivalent to oblimin(p/2) and cf(#), where # = f /(2p).
Now, the cf (Crawford-Ferguson) method is also available in GPArotation:
cfT orthogonal Crawford-Ferguson family
cfT(L, Tmat=diag(ncol(L)), kappa=0, normalize=FALSE, eps=1e-5, maxit=1000)
The argument kappa parameterizes the family for the Crawford-Ferguson method. If m is the number of factors and p is the number of indicators then kappa values having special names are 0=Quartimax, 1/p=Varimax, m/(2*p)=Equamax, (m-1)/(p+m-2)=Parsimax, 1=Factor parsimony.
X <- matrix(rnorm(500), ncol=10)
C <- cor(X)
eig <- eigen(C)
# PCA loadings by hand: eigenvectors scaled by the square roots of the eigenvalues
eig$vectors %*% diag(sqrt(eig$values))
require(psych)
PCA0 <- principal(C, rotate='none', nfactors=10) #PCA by psych
PCA0
# the original loadings in PCA0 are scaled by the square roots of their eigenvalues
apply(PCA0$loadings^2, 2, sum) # SS loadings
## PCA with Equamax rotation
# Now I think the Equamax rotation can be performed by cfT with kappa = m/(2*p)
# p = number of variables (10)
# m (or f in the STATA manual) = number of components (10)
# m == p here --> kappa = 0.5
PCA.EQ <- cfT(PCA0$loadings, kappa=0.5)
PCA.EQ
I upgraded some of my own PCA knowledge thanks to your question; hope it helps, good luck.
Walter's answer helped a great deal!
I'll add some sidenotes for what it's worth:
R's psych::principal says under the option "rotate" that more rotations are available. Under the linked "fa" help page, there is in fact an "equamax". Sadly, the results are replicable neither with STATA nor with SPSS, at least not with the standard syntax I tried:
# R:
PCA.5f=principal(data, nfactors=5, rotate="equamax", use="complete.obs")
Walter's solution replicates SPSS's equamax rotation (Kaiser-normalized by default) in the first 3 decimal places (i.e. loadings and rotation matrix fairly equivalent) using the following syntax, with m = number of factors and p = number of indicators:
# R:
PCA.5f=principal(data, nfactors=5, rotate="none", use="complete.obs")
PCA.5f.eq = cfT(PCA.5f$loadings, kappa=m/(2*p), normalize=TRUE) # replace kappa factor formula with your actual numbers!
# SPSS:
FACTOR
/VARIABLES listofvariables
/MISSING LISTWISE
/ANALYSIS listofvariables
/PRINT ROTATION
/CRITERIA FACTORS(5) ITERATE(1000)
/EXTRACTION PC
/CRITERIA ITERATE(1000)
/ROTATION EQUAMAX
/METHOD=CORRELATION.
STATA's equamax (Kaiser-normalized and unnormalized) is replicable at least in the first 4 decimal places with kappa = .5 irrespective of your actual number of factors and indicators, which seems to contradict their manual (cf. Walter's citation).
# R:
PCA.5f=principal(data, nfactors=5, rotate="none", use="complete.obs")
PCA.5f.eq = cfT(PCA.5f$loadings, kappa=.5, normalize=TRUE)
# STATA:
factor listofvars, pcf factors(5)
rotate, equamax normalize # drop the "normalize" option to replicate R's normalize=FALSE
mat list e(r_L)

PCA analysis using Correlation Matrix as input in R

Now I have a 7000 x 7000 correlation matrix and I have to do PCA on this in R.
I used
CorPCA <- princomp(covmat=xCor)
where xCor is the correlation matrix, but it comes back with
"covariance matrix is not non-negative definite"
I think it is because I have some negative correlations in that matrix.
I am wondering which built-in function in R I can use to get the PCA result.
One method to do the PCA is to perform an eigenvalue decomposition of the covariance matrix, see wikipedia.
The advantage of the eigenvalue decomposition is that you see which directions (eigenvectors) are significant, i.e. have a noticeable variation expressed by the associated eigenvalues. Moreover, you can tell whether the covariance matrix is positive definite (all eigenvalues greater than zero), positive semi-definite, i.e. non-negative definite (which is okay; some eigenvalues equal zero), or indefinite (which is not okay; some eigenvalues are negative). Sometimes it also happens that, due to numerical inaccuracies, a non-negative definite matrix ends up with slightly negative eigenvalues. In that case you would observe negative eigenvalues that are almost zero, and you can set them to zero (or a tiny positive value) to restore the non-negative definiteness of the covariance matrix. Furthermore, you can still interpret the result: the eigenvectors contributing the significant information are associated with the biggest eigenvalues. If the list of sorted eigenvalues declines quickly, there are a lot of directions which do not contribute significantly and can therefore be dropped.
The built-in R function is eigen
If your covariance matrix is A then
eigen_res <- eigen(A)
# sorted list of eigenvalues
eigen_res$values
# slightly negative eigenvalues, set them to small positive value
eigen_res$values[eigen_res$values<0] <- 1e-10
# and produce regularized covariance matrix
Areg <- eigen_res$vectors %*% diag(eigen_res$values) %*% t(eigen_res$vectors)
"Not non-negative definite" does not mean the covariance matrix has negative correlations. It's the linear algebra equivalent of trying to take the square root of a negative number! You can't tell whether a matrix is positive definite by looking at a few of its values.
Try adjusting some default values, like the tolerance, in the princomp call. Check this thread for example: How to use princomp () function in R when covariance matrix has zero's?
An alternative is to write some code of your own to perform what is called a NIPALS analysis. Take a look at this thread on the R-mailing list: https://stat.ethz.ch/pipermail/r-help/2006-July/110035.html
I'd even go as far as asking: where did you obtain the correlation matrix? Did you construct it yourself? Does it have NAs? If you constructed xCor from your own data, do you think you could sample the data and construct a smaller xCor matrix (say 1000 x 1000)? All these alternatives try to drive your PCA algorithm through the 'happy path', i.e. all matrix operations can be carried out internally without difficulties in diagonalization, and no more 'non-negative definite' error messages.
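Combining the regularization shown above with princomp()'s covmat argument, a hedged sketch of one possible workflow, where xCor is your correlation matrix and xCorReg is just an illustrative name (keep in mind that eigen() on a 7000 x 7000 matrix is expensive):
eigen_res <- eigen(xCor)
# clip slightly negative eigenvalues caused by numerical inaccuracy
eigen_res$values[eigen_res$values < 0] <- 1e-10
xCorReg <- eigen_res$vectors %*% diag(eigen_res$values) %*% t(eigen_res$vectors)
CorPCA <- princomp(covmat = xCorReg)
summary(CorPCA)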

K-means and Mahalanobis distance

I'd like to use the Mahalanobis distance in the K-means algorithm, because I have 4 variables which are highly correlated (0.85)
It appears to me that it's better to use the Mahalanobis distance in this case.
The problem is I don't know how to implement it in R, with the K-means algorithm.
I think I need to "fake" it by transforming the data before the clustering step, but I don't know how.
I tried the classical kmeans with the Euclidean distance on standardized data, but as I said, there is too much correlation.
fit <- kmeans(mydata.standardize, 4)
I also tried to find a distance parameter, but I think it doesn't exist in the kmeans() function.
The expected result is a way to apply the K-means algorithm with the Mahalanobis distance.
You can rescale the data before running the algorithm, using the Cholesky decomposition of the variance matrix: the Euclidean distance after the transformation is the Mahalanobis distance before.
# Sample data
n <- 100
k <- 5
x <- matrix( rnorm(k*n), nr=n, nc=k )
x[,1:2] <- x[,1:2] %*% matrix( c(.9,1,1,.9), 2, 2 )
var(x)
# Rescale the data
C <- chol( var(x) )
y <- x %*% solve(C)
var(y) # The identity matrix
kmeans(y, 4)
But this assumes that all the clusters have the same shape and orientation as the whole data. If this is not the case, you may want to look at models that explicitly allow for elliptical clusters, e.g., in the mclust package.
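If you want to try that route, a minimal sketch with mclust (the 4-cluster setting and the "VVV" model, i.e. ellipsoidal clusters with varying shape and orientation, are just illustrative choices here):
# install.packages("mclust")   # if needed
library(mclust)
fit <- Mclust(x, G = 4, modelNames = "VVV")  # model-based clustering with elliptical clusters
summary(fit)
head(fit$classification)                     # cluster labels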
You can see on page 10 of Brian S. Everitt's book "An R and S-PLUS® Companion to Multivariate Analysis" the formula for the Mahalanobis distance. Euclidean distance is a special case of the Mahalanobis distance, obtained when the sample covariance is the identity matrix. So the Euclidean distance on the rescaled data y is the Mahalanobis distance on x.
# Rescale the data
C <- chol( var(x) )
y <- x %*% solve(C)
var(y) # The identity matrix
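As a quick sanity check (a sketch reusing the x and y from the sample data above), base R's mahalanobis(), which returns squared distances, should agree with the Euclidean distance computed on the rescaled y:
d_maha <- sqrt(mahalanobis(x, colMeans(x), var(x)))
d_eucl <- sqrt(rowSums(sweep(y, 2, colMeans(y))^2))
all.equal(d_maha, d_eucl)   # TRUE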
