I need to do a principal component analysis (PCA) with EQUAMAX-rotation in R.
Unfortunately the function principal() I normally use for PCA does not offer this kind of rotation.
I could find out that it may be possible somehow with the package GPArotation but I could not yet figure out how to use this in the PCA.
Maybe someone can give an example on how to do an equamax-rotation PCA?
Or is there a function for PCA in another package that offers the use of equamax-rotation directly?
The package psych from i guess you are using principal() has the rotations varimax, quatimax, promax, oblimin, simplimax, and cluster but not equamax (psych p.232) which is a compromise between Varimax and Quartimax
excerpt from the STATA manual: mvrotate p.3
Rotation criteria
In the descriptions below, the matrix to be rotated is denoted as A, p denotes the number of rows of A, and f denotes the number of columns of A (factors or components). If A is a loading matrix from factor or pca, p is the number of variables, and f is the number of factors or components.
Criteria suitable only for orthogonal rotations
varimax and vgpf apply the orthogonal varimax rotation (Kaiser 1958). varimax maximizes the variance of the squared loadings within factors (columns of A). It is equivalent to cf(1/p) and to oblimin(1). varimax, the most popular rotation, is implemented with a dedicated fast algorithm and ignores all optimize options. Specify vgpf to switch to the general GPF algorithm used for the other criteria.
quartimax uses the quartimax criterion (Harman 1976). quartimax maximizes the variance of
the squared loadings within the variables (rows of A). For orthogonal rotations, quartimax is equivalent to cf(0) and to oblimax.
equamax specifies the orthogonal equamax rotation. equamax maximizes a weighted sum of the
varimax and quartimax criteria, reflecting a concern for simple structure within variables (rows of A) as well as within factors (columns of A). equamax is equivalent to oblimin(p/2) and cf(#), where # = f /(2p).
now the cf (Crawford-Ferguson) method is also available in GPArotation
cfT orthogonal Crawford-Ferguson family
cfT(L, Tmat=diag(ncol(L)), kappa=0, normalize=FALSE, eps=1e-5, maxit=1000)
The argument kappa parameterizes the family for the Crawford-Ferguson method. If m is the number of factors and p is the number of indicators then kappa values having special names are 0=Quartimax, 1/p=Varimax, m/(2*p)=Equamax, (m-1)/(p+m-2)=Parsimax, 1=Factor parsimony.
X <- matrix(rnorm(500), ncol=10)
C <- cor(X)
eig <- eigen(C)
# PCA by hand scaled by sqrt
eig$vectors * t(matrix(rep(sqrt(eig$values), 10), ncol=10))
require(psych)
PCA0 <- principal(C, rotate='none', nfactors=10) #PCA by psych
PCA0
# as the original loadings PCA0 are scaled by their squarroot eigenvalue
apply(PCA0$loadings^2, 2, sum) # SS loadings
## PCA with Equimax rotation
# now i think the Equamax rotation can be performed by cfT with m/(2*p)
# p number of variables (10)
# m (or f in STATA manual) number of components (10)
# gives m==p --> kappa=0.5
PCA.EQ <- cfT(PCA0$loadings, kappa=0.5)
PCA.EQ
I upgraded some of my PCA knowledge by your question, hope it helps, good luck
Walter's answer helped a great deal!
I'll add some sidenotes for what it's worth:
R's psych::principal says under option "rotate", that more rotations are available. Under the linked "fa", there's in fact an "equamax". Sadly, the results are neither replicable with STATA nor with SPSS, at least not with the standard syntax I tried:
# R:
PCA.5f=principal(data, nfactors=5, rotate="equamax", use="complete.obs")
Walter's solution replicates SPSS' equamax rotation (Kaiser-normalized by default) in the first 3 decimal places (i.e. loadings and rotating matrix fairly equivalent) using the following syntax with m=no of factors and p=no of indicators:
# R:
PCA.5f=principal(data, nfactors=5, rotate="none", use="complete.obs")
PCA.5f.eq = cfT(PCA.5f$loadings, kappa=m/(2*p), normalize=TRUE) # replace kappa factor formula with your actual numbers!
# SPSS:
FACTOR
/VARIABLES listofvariables
/MISSING LISTWISE
/ANALYSIS listofvariables
/PRINT ROTATION
/CRITERIA FACTORS(5) ITERATE(1000)
/EXTRACTION PC
/CRITERIA ITERATE(1000)
/ROTATION EQUAMAX
/METHOD=CORRELATION.
STATA's equamax - Kaiser-normalized and unnormalized - is replicable at least in the first 4 decimal places with Kappa .5 irrespective of your actual number of factors and indicators which seems to contradict their manual (c.f. Walter's citation).
# R:
PCA.5f=principal(data, nfactors=5, rotate="none", use="complete.obs")
PCA.5f.eq = cfT(PCA.5f$loadings, kappa=.5, normalize=TRUE)
# STATA:
factor listofvars, pcf factors(5)
rotate, equamax normalize # kick the "normalize" to replicate R's "normalize=FALSE"
mat list e(r_L)
Related
I am have some difficulty understanding some steps in a procedure. They take coordinate data, find the covariance matrix, apply PCA, then extract the standard deviation from the square root of each eigenvalue in short. I am trying to re-produce this process, but I am stuck on the steps.
The Steps Taken
The data set consists of one matrix, R, that contains coordiante paris, (x(i),y(i)) with i=1,...,N for N is the total number of instances recorded. We applied PCA to the covariance matrix of the R input data set, and the following variables were obtained:
a) the principal components of the new coordinate system, the eigenvectors u and v, and
b) the eigenvalues (λ1 and λ2) corresponding to the total variability explained by each principal component.
With these variables, a graphical representation was created for each item. Two orthogonal segments were centred on the mean of the coordinate data. The segments’ directions were driven by the eigenvectors of the PCA, and the length of each segment was defined as one standard deviation (σ1 and σ2) around the mean, which was calculated by extracting the square root of each eigenvalue, λ1 and λ2.
My Steps
#reproducable data
set.seed(1)
x<-rnorm(10,50,4)
y<-rnorm(10,50,7)
# Note my data is not perfectly distirbuted in this fashion
df<-data.frame(x,y) # this is my R matrix
covar.df<-cov(df,use="all.obs",method='pearson') # this is my covariance matrix
pca.results<-prcomp(covar.df) # this applies PCA to the covariance matrix
pca.results$sdev # these are the standard deviations of the principal components
# which is what I believe I am looking for.
This is where I am stuck because I am not sure if I am trying to get the sdev output form prcomp() or if I should scale my data first. They are all on the same scale, so I do not see the issue with it.
My second question is how do I extract the standard deviation in the x and y direciton?
You don't apply prcomp to the covariance matrix, you do it on the data itself.
result= prcomp(df)
If by scaling you mean normalize or standardize, that happens before you do prcomp(). For more information on the procedure see this link that is introductory to the procedure: pca on R. That can walk you through the basics. To get the sdev use the the summary on the result object
summary(result)
result$sdev
You don't apply prcomp to the covariance matrix. scale=T bases the PCA on the correlation matrix and F on the covariance matrix
df.cor = prcomp(df, scale=TRUE)
df.cov = prcomp(df, scale=FALSE)
I have about 90 variables stored in data[2-90]. I suspect about 4 of them will have a parabola-like correlation with data[1]. I want to identify which ones have the correlation. Is there an easy and quick way to do this?
I have tried building a model like this (which I could do in a loop for each variable i = 2:90):
y <- data$AvgRating
x <- data$Hamming.distance
x2 <- x^2
quadratic.model = lm(y ~ x + x2)
And then look at the R^2/coefficient to get an idea of the correlation. Is there a better way of doing this?
Maybe R could build a regression model with the 90 variables and chose the ones which are significant itself? Would that be in any way possible? I can do this in JMP for linear regression, but I'm not sure I could do non-linear regression with R for all the variables at ones. Therefore I was manually trying to see if I could see which ones are correlated in advance. It would be helpful if there was a function to use for that.
You can use nlcor package in R. This package finds the nonlinear correlation between two data vectors.
There are different approaches to estimate a nonlinear correlation, such as infotheo. However, nonlinear correlations between two variables can take any shape.
nlcor is robust to most nonlinear shapes. It works pretty well in different scenarios.
At a high level, nlcor works by adaptively segmenting the data into linearly correlated segments. The segment correlations are aggregated to yield the nonlinear correlation. The output is a number between 0 to 1. With close to 1 meaning high correlation. Unlike a pearson correlation, negative values are not returned because it has no meaning in nonlinear relationships.
More details about this package here
To install nlcor, follow these steps:
install.packages("devtools")
library(devtools)
install_github("ProcessMiner/nlcor")
library(nlcor)
After you install it,
# Implementation
x <- seq(0,3*pi,length.out=100)
y <- sin(x)
plot(x,y,type="l")
# linear correlation is small
cor(x,y)
# [1] 6.488616e-17
# nonlinear correlation is more representative
nlcor(x,y, plt = T)
# $cor.estimate
# [1] 0.9774
# $adjusted.p.value
# [1] 1.586302e-09
# $cor.plot
As shown in the example the linear correlation was close to zero although there was a clear relationship between the variables that nlcor could detect.
Note: The order of x and y inside the nlcor is important. nlcor(x,y) is different from nlcor(y,x). The x and y here represent 'independent' and 'dependent' variables, respectively.
Fitting a generalized additive model, will help you identify curvature in the
relationships between the explanatory variables. Read the example on page 22 here.
Another option would be to compute mutual information score between each pair of variables. For example, using the mutinformation function from the infotheo package, you could do:
set.seed(1)
library(infotheo)
# corrleated vars (x & y correlated, z noise)
x <- seq(-10,10, by=0.5)
y <- x^2
z <- rnorm(length(x))
# list of vectors
raw_dat <- list(x, y, z)
# convert to a dataframe and discretize for mutual information
dat <- matrix(unlist(raw_dat), ncol=length(raw_dat))
dat <- discretize(dat)
mutinformation(dat)
Result:
| | V1| V2| V3|
|:--|---------:|---------:|---------:|
|V1 | 1.0980124| 0.4809822| 0.0553146|
|V2 | 0.4809822| 1.0943907| 0.0413265|
|V3 | 0.0553146| 0.0413265| 1.0980124|
By default, mutinformation() computes the discrete empirical mutual information score between two or more variables. The discretize() function is necessary if you are working with continuous data transform the data to discrete values.
This might be helpful at least as a first stab for looking for nonlinear relationships between variables, such as that described above.
The following R program creates an interpolated surface using 470 data points using walker Lake data in gstat package.
source("D:/kriging/allfunctions.r") # Reads in all functions.
source("D:/kriging/panel.gamma0.r") # Reads in panel function for xyplot.
library(lattice) # Needed for "xyplot" function.
library(geoR) # Needed for "polygrid" function.
library(akima)
library(gstat);
library(sp);
walk470 <- read.table("D:/kriging/walk470.txt",header=T)
attach(walk470)
coordinates(walk470) = ~x+y
walk.var1 <- variogram(v ~ x+y,data=walk470,width=10) #the width has to be tuned resulting different point pairs
plot(walk.var1,xlab="Distance",ylab="Semivariance",main="Variogram for V, Lag Spacing = 5")
model1.out <- fit.variogram(walk.var1,vgm(70000,"Sph",40,20000))
plot(walk.var1, model=model1.out,xlab="Distance",ylab="Semivariance",main="Variogram for V, Lag Spacing = 10")
poly <- chull(coordinates(walk470))
plot(coordinates(walk470),type="n",xlab="X",ylab="Y",cex.lab=1.6,main="Plot of Sample and Prediction Sites",cex.axis=1.5,cex.main=1.6)
lines(coordinates(walk470)[poly,])
poly.in <- polygrid(seq(2.5,247.5,5),seq(2.5,297.5,5),coordinates(walk470)[poly,])
points(poly.in)
points(coordinates(walk470),pch=16)
coordinates(poly.in) <- ~ x+y
krige.out <- krige(v ~ 1, walk470,poly.in, model=model1.out)
print(krige.out)
This program calculates the following for each point of 2688 points
(470x470) matrix inversion
(470x470) and (470x1) matrix multiplication
Is gstat package is using some smart way for calculation. I knew from previous stackoverflow query that it uses cholesky decomposition for matrix inversion. Is it normal speed for one machine to calculate it so quickly.
It uses LDL' decomposition, which is similar to Choleski. As you are using global kriging, the covariance matrix needs to be decomposed only once; then, for each prediction point, a system is solved, which is O(n). No 470x470 matrix gets ever inverted, neither are solutions obtained by multiplying it. Inverses are notational devices, but avoided as computational strategy when possible. In R, for instance, compare runtime of solve(A,b) with solve(A) %*% b.
Use the source, Luke!
Now i have a 7000*7000 correlation matrix and I have to do PCA on this in R.
I used the
CorPCA <- princomp(covmat=xCor)
, xCor is the correlation matrix
but it comes out
"covariance matrix is not non-negative definite"
it is because i have some negative correlation in that matrix.
I am wondering which inbuilt function in R that i can use to get the result of PCA
One method to do the PCA is to perform an eigenvalue decomposition of the covariance matrix, see wikipedia.
The advantage of the eigenvalue decomposition is that you see which directions (eigenvectors) are significant, i.e. have a noticeable variation expressed by the associated eigenvalues. Moreover, you can detect if the covariance matrix is positive definite (all eigenvalues greater than zero), not negative-definite (which is okay) if there are eigenvalues equal zero or if it is indefinite (which is not okay) by negative eigenvalues. Sometimes it also happens that due to numerical inaccuracies a non-negative-definite matrix becomes negative-definite. In that case you would observe negative eigenvalues which are almost zero. In that case you can set these eigenvalues to zero to retain the non-negative definiteness of the covariance matrix. Furthermore, you can still interpret the result: eigenvectors contributing the significant information are associated with the biggest eigenvalues. If the list of sorted eigenvalues declines quickly there are a lot of directions which do not contribute significantly and therefore can be dropped.
The built-in R function is eigen
If your covariance matrix is A then
eigen_res <- eigen(A)
# sorted list of eigenvalues
eigen_res$values
# slightly negative eigenvalues, set them to small positive value
eigen_res$values[eigen_res$values<0] <- 1e-10
# and produce regularized covariance matrix
Areg <- eigen_res$vectors %*% diag(eigen_res$values) %*% t(eigen_res$vectors)
not non-negative definite does not mean the covariance matrix has negative correlations. It's a linear algebra equivalent of trying to take square root of negative number! You can't tell by looking at a few values of the matrix, whether it's positive definite.
Try adjusting some default values like tolerance in princomp call. Check this thread for example: How to use princomp () function in R when covariance matrix has zero's?
An alternative is to write some code of your own to perform what is called a n NIPLAS analysis. Take a look at this thread on the R-mailing list: https://stat.ethz.ch/pipermail/r-help/2006-July/110035.html
I'd even go as far as asking where did you obtain the correlation matrix? Did you construct it yourself? Does it have NAs? If you constructed xCor from your own data, do you think you can sample the data and construct a smaller xCor matrix? (say 1000X1000). All these alternatives try to drive your PCA algorithm through the 'happy path' (i.e. all matrix operations can be internally carried out without difficulties in diagonalization etc..i.e., no more 'non-negative definite error msgs)
I am ecologist, using mainly the vegan R package.
I have 2 matrices (sample x abundances) (See data below):
matrix 1/ nrow= 6replicates*24sites, ncol=15 species abundances (fish)
matrix 2/ nrow= 3replicates*24sites, ncol=10 species abundances (invertebrates)
The sites are the same in both matrices. I want to get the overall bray-curtis dissimilarity (considering both matrices) among pairs of sites. I see 2 options:
option 1, averaging over replicates (at the site scale) fishes and macro-invertebrates abundances, cbind the two mean abundances matrix (nrow=24sites, ncol=15+10 mean abundances) and calculating bray-curtis.
option 2, for each assemblage, computing bray-curtis dissimilarity among pairs of sites, computing distances among sites centroids. Then summing up the 2 distance matrix.
In case I am not clear, I did these 2 operations in the R codes below.
Please, could you tell me if the option 2 is correct and more appropriate than option 1.
thank you in advance.
Pierre
here is below the R code exemples
generating data
library(plyr);library(vegan)
#assemblage 1: 15 fish species, 6 replicates per site
a1.env=data.frame(
Habitat=paste("H",gl(2,12*6),sep=""),
Site=paste("S",gl(24,6),sep=""),
Replicate=rep(paste("R",1:6,sep=""),24))
summary(a1.env)
a1.bio=as.data.frame(replicate(15,rpois(144,sample(1:10,1))))
names(a1.bio)=paste("F",1:15,sep="")
a1.bio[1:72,]=2*a1.bio[1:72,]
#assemblage 2: 10 taxa of macro-invertebrates, 3 replicates per site
a2.env=a1.env[a1.env$Replicate%in%c("R1","R2","R3"),]
summary(a2.env)
a2.bio=as.data.frame(replicate(10,rpois(72,sample(10:100,1))))
names(a2.bio)=paste("I",1:10,sep="")
a2.bio[1:36,]=0.5*a2.bio[1:36,]
#environmental data at the sit scale
env=unique(a1.env[,c("Habitat","Site")])
env=env[order(env$Site),]
OPTION 1, averaging abundances and cbind
a1.bio.mean=ddply(cbind(a1.bio,a1.env),.(Habitat,Site),numcolwise(mean))
a1.bio.mean=a1.bio.mean[order(a1.bio.mean$Site),]
a2.bio.mean=ddply(cbind(a2.bio,a2.env),.(Habitat,Site),numcolwise(mean))
a2.bio.mean=a2.bio.mean[order(a2.bio.mean$Site),]
bio.mean=cbind(a1.bio.mean[,-c(1:2)],a2.bio.mean[,-c(1:2)])
dist.mean=vegdist(sqrt(bio.mean),"bray")
OPTION 2, computing for each assemblage distance among centroids and summing the 2 distances matrix
a1.dist=vegdist(sqrt(a1.bio),"bray")
a1.coord.centroid=betadisper(a1.dist,a1.env$Site)$centroids
a1.dist.centroid=vegdist(a1.coord.centroid,"eucl")
a2.dist=vegdist(sqrt(a2.bio),"bray")
a2.coord.centroid=betadisper(a2.dist,a2.env$Site)$centroids
a2.dist.centroid=vegdist(a2.coord.centroid,"eucl")
summing up the two distance matrices using Gavin Simpson 's fuse()
dist.centroid=fuse(a1.dist.centroid,a2.dist.centroid,weights=c(15/25,10/25))
summing up the two euclidean distance matrices (thanks to Jari Oksanen correction)
dist.centroid=sqrt(a1.dist.centroid^2 + a2.dist.centroid^2)
and the 'coord.centroid' below for further distance-based analysis (is it correct ?)
coord.centroid=cmdscale(dist.centroid,k=23,add=TRUE)
COMPARING OPTION 1 AND 2
pco.mean=cmdscale(vegdist(sqrt(bio.mean),"bray"))
pco.centroid=cmdscale(dist.centroid)
comparison=procrustes(pco.centroid,pco.mean)
protest(pco.centroid,pco.mean)
An easier solution is just to flexibly combine the two dissimilarity matrices, by weighting each matrix. The weights need to sum to 1. For two dissimilarity matrices the fused dissimilarity matrix is
d.fused = (w * d.x) + ((1 - w) * d.y)
where w is a numeric scalar (length 1 vector) weight. If you have no reason to weight one of the sets of dissimilarities more than the other, just use w = 0.5.
I have a function to do this for you in my analogue package; fuse(). The example from ?fuse is
train1 <- data.frame(matrix(abs(runif(100)), ncol = 10))
train2 <- data.frame(matrix(sample(c(0,1), 100, replace = TRUE),
ncol = 10))
rownames(train1) <- rownames(train2) <- LETTERS[1:10]
colnames(train1) <- colnames(train2) <- as.character(1:10)
d1 <- vegdist(train1, method = "bray")
d2 <- vegdist(train2, method = "jaccard")
dd <- fuse(d1, d2, weights = c(0.6, 0.4))
dd
str(dd)
This idea is used in supervised Kohonen networks (supervised SOMs) to bring multiple layers of data into a single analysis.
analogue works closely with vegan so there won't be any issues running the two packages side by side.
The correctness of averaging distances depends on what are you doing with those distances. In some applications you may expect that they really are distances. That is, they satisfy some metric properties and have a defined relation to the original data. Combined dissimilarities may not satisfy these requirements.
This issue is related to the controversy of partial Mantel type analysis of dissimilarities vs. analysis of rectangular data that is really hot (and I mean red hot) in studies of beta diversities. We in vegan provide tools for both, but I think that in most cases analysis of rectangular data is more robust and more powerful. With rectangular data I mean normal sampling units times species matrix. The preferred dissimilarity based methods in vegan map dissimilarities onto rectangular form. These methods in vegan include db-RDA (capscale), permutational MANOVA (adonis) and analysis of within-group dispersion (betadisper). Methods working with disismilarities as such include mantel, anosim, mrpp, meandis.
The mean of dissimilarities or distances usually has no clear correspondence to the original rectangular data. That is: mean of the dissimilarities does not correspond to the mean of the data. I think that in general it is better to average or handle data and then get dissimilarities from transformed data.
If you want to combine dissimilarities, analogue::fuse() style approach is most practical. However, you should understand that fuse() also scales dissimilarity matrices into equal maxima. If you have dissimilarity measures in scale 0..1, this is usually minor issue, unless one of the data set is more homogeneous and has a lower maximum dissimilarity than others. In fuse() they are all equalized so that it is not a simple averaging but averaging after range equalizing. Moreover, you must remember that averaging dissimilarities usually destroys the geometry, and this will matter if you use analysis methods for rectangularized data (adonis, betadisper, capscale in vegan).
Finally about geometry of combining dissimilarities. Dissimilarity indices in scale 0..1 are fractions of type A/B. Two fractions can be added (and then divided to get the average) directly only if the denominators are equal. If you ignore this and directly average the fractions, then the result will not be equal to the same fraction from averaged data. This is what I mean with destroying geometry. Some open-scaled indices are not fractions and may be additive. Manhattan distances are additive. Euclidean distances are square roots of squared differences, and their squares are additive but not the distances directly.
I demonstrate these things by showing the effect of adding together two dissimilarities (and averaging would mean dividing the result by two, or by suitable weights). I take the Barro Colorado Island data of vegan and divide it into two subsets of slightly unequal sizes. A geometry preserving addition of distances of subsets of the data will give the same result as the analysis of the complete data:
library(vegan) ## data and vegdist
library(analogue) ## fuse
data(BCI)
dim(BCI) ## [1] 50 225
x1 <- BCI[, 1:100]
x2 <- BCI[, 101:225]
## Bray-Curtis and fuse: not additive
plot(vegdist(BCI), fuse(vegdist(x1), vegdist(x2), weights = c(100/225, 125/225)))
## summing distances is straigthforward (they are vectors), but preserving
## their attributes and keeping the dissimilarities needs fuse or some trick
## like below where we make dist structure dtmp to be replaced with the result
dtmp <- dist(BCI) ## dist skeleton with attributes
dtmp[] <- dist(x1, "manhattan") + dist(x2, "manhattan")
## manhattans are additive and can be averaged
plot(dist(BCI, "manhattan"), dtmp)
## Fuse rescales dissimilarities and they are no more additive
dfuse <- fuse(dist(x1, "man"), dist(x2, "man"), weights=c(100/225, 125/225))
plot(dist(BCI, "manhattan"), dfuse)
## Euclidean distances are not additive
dtmp[] <- dist(x1) + dist(x2)
plot(dist(BCI), dtmp)
## ... but squared Euclidean distances are additive
dtmp[] <- sqrt(dist(x1)^2 + dist(x2)^2)
plot(dist(BCI), dtmp)
## dfuse would rescale squared Euclidean distances like Manhattan (not shown)
I only considered addition above, but if you cannot add, you cannot average. It is a matter of taste if this is important. Brave people will average things that cannot be averaged, but some people are more timid and want to follow the rules. I rather go the second group.
I like this simplicity of this answer, but it only applies to adding 2 distance matrices:
d.fused = (w * d.x) + ((1 - w) * d.y)
so I wrote my own snippet to combine an array of multiple distance matrices (not just 2), and using standard R packages:
# generate array of distance matrices
x <- matrix(rnorm(100), nrow = 5)
y <- matrix(rnorm(100), nrow = 5)
z <- matrix(rnorm(100), nrow = 5)
dst_array <- list(dist(x),dist(y),dist(z))
# create new distance matrix with first element of array
dst <- dst_array[[1]]
# loop over remaining array elements, add them to distance matrix
for (jj in 2:length(dst_array)){
dst <- dst + dst_array[[jj]]
}
You could also use a vector of similar size to dst_array in order to define scaling factors
dst <- dst + my_scale[[jj]] * dst_array[[jj]]