Summing 2 distance matrices to get a third 'overall' distance matrix (ecological context) - R

I am an ecologist, mainly using the vegan R package.
I have 2 matrices (samples x abundances) (see data below):
matrix 1: nrow = 6 replicates * 24 sites, ncol = 15 species abundances (fish)
matrix 2: nrow = 3 replicates * 24 sites, ncol = 10 species abundances (invertebrates)
The sites are the same in both matrices. I want to get the overall Bray-Curtis dissimilarity (considering both matrices) among pairs of sites. I see 2 options:
option 1: average fish and macro-invertebrate abundances over replicates (at the site scale), cbind the two mean abundance matrices (nrow = 24 sites, ncol = 15 + 10 mean abundances) and calculate Bray-Curtis.
option 2: for each assemblage, compute Bray-Curtis dissimilarities, compute distances among site centroids, then sum the 2 distance matrices.
In case I am not clear, I carried out these 2 operations in the R code below.
Could you please tell me whether option 2 is correct and more appropriate than option 1?
Thank you in advance.
Pierre
Here are the R code examples.
Generating the data:
library(plyr);library(vegan)
#assemblage 1: 15 fish species, 6 replicates per site
a1.env=data.frame(
Habitat=paste("H",gl(2,12*6),sep=""),
Site=paste("S",gl(24,6),sep=""),
Replicate=rep(paste("R",1:6,sep=""),24))
summary(a1.env)
a1.bio=as.data.frame(replicate(15,rpois(144,sample(1:10,1))))
names(a1.bio)=paste("F",1:15,sep="")
a1.bio[1:72,]=2*a1.bio[1:72,]
#assemblage 2: 10 taxa of macro-invertebrates, 3 replicates per site
a2.env=a1.env[a1.env$Replicate%in%c("R1","R2","R3"),]
summary(a2.env)
a2.bio=as.data.frame(replicate(10,rpois(72,sample(10:100,1))))
names(a2.bio)=paste("I",1:10,sep="")
a2.bio[1:36,]=0.5*a2.bio[1:36,]
#environmental data at the site scale
env=unique(a1.env[,c("Habitat","Site")])
env=env[order(env$Site),]
OPTION 1, averaging abundances and cbind
a1.bio.mean=ddply(cbind(a1.bio,a1.env),.(Habitat,Site),numcolwise(mean))
a1.bio.mean=a1.bio.mean[order(a1.bio.mean$Site),]
a2.bio.mean=ddply(cbind(a2.bio,a2.env),.(Habitat,Site),numcolwise(mean))
a2.bio.mean=a2.bio.mean[order(a2.bio.mean$Site),]
bio.mean=cbind(a1.bio.mean[,-c(1:2)],a2.bio.mean[,-c(1:2)])
dist.mean=vegdist(sqrt(bio.mean),"bray")
OPTION 2, computing for each assemblage the distances among centroids and summing the 2 distance matrices
a1.dist=vegdist(sqrt(a1.bio),"bray")
a1.coord.centroid=betadisper(a1.dist,a1.env$Site)$centroids
a1.dist.centroid=vegdist(a1.coord.centroid,"eucl")
a2.dist=vegdist(sqrt(a2.bio),"bray")
a2.coord.centroid=betadisper(a2.dist,a2.env$Site)$centroids
a2.dist.centroid=vegdist(a2.coord.centroid,"eucl")
summing up the two distance matrices using Gavin Simpson's fuse()
dist.centroid=fuse(a1.dist.centroid,a2.dist.centroid,weights=c(15/25,10/25))
summing up the two Euclidean distance matrices (thanks to Jari Oksanen's correction)
dist.centroid=sqrt(a1.dist.centroid^2 + a2.dist.centroid^2)
and the 'coord.centroid' below for further distance-based analysis (is it correct?)
coord.centroid=cmdscale(dist.centroid,k=23,add=TRUE)
COMPARING OPTIONS 1 AND 2
pco.mean=cmdscale(vegdist(sqrt(bio.mean),"bray"))
pco.centroid=cmdscale(dist.centroid)
comparison=procrustes(pco.centroid,pco.mean)
protest(pco.centroid,pco.mean)

An easier solution is just to flexibly combine the two dissimilarity matrices, by weighting each matrix. The weights need to sum to 1. For two dissimilarity matrices the fused dissimilarity matrix is
d.fused = (w * d.x) + ((1 - w) * d.y)
where w is a numeric scalar (length 1 vector) weight. If you have no reason to weight one of the sets of dissimilarities more than the other, just use w = 0.5.
I have a function to do this for you in my analogue package; fuse(). The example from ?fuse is
train1 <- data.frame(matrix(abs(runif(100)), ncol = 10))
train2 <- data.frame(matrix(sample(c(0,1), 100, replace = TRUE),
ncol = 10))
rownames(train1) <- rownames(train2) <- LETTERS[1:10]
colnames(train1) <- colnames(train2) <- as.character(1:10)
d1 <- vegdist(train1, method = "bray")
d2 <- vegdist(train2, method = "jaccard")
dd <- fuse(d1, d2, weights = c(0.6, 0.4))
dd
str(dd)
This idea is used in supervised Kohonen networks (supervised SOMs) to bring multiple layers of data into a single analysis.
analogue works closely with vegan so there won't be any issues running the two packages side by side.

The correctness of averaging distances depends on what you are doing with those distances. In some applications you may expect that they really are distances, that is, they satisfy some metric properties and have a defined relation to the original data. Combined dissimilarities may not satisfy these requirements.
This issue is related to the controversy of partial Mantel-type analysis of dissimilarities vs. analysis of rectangular data, which is really hot (and I mean red hot) in studies of beta diversity. We in vegan provide tools for both, but I think that in most cases analysis of rectangular data is more robust and more powerful. By rectangular data I mean the normal sampling units x species matrix. The preferred dissimilarity-based methods in vegan map dissimilarities onto rectangular form. These methods in vegan include db-RDA (capscale), permutational MANOVA (adonis) and analysis of within-group dispersion (betadisper). Methods working with dissimilarities as such include mantel, anosim, mrpp and meandist.
The mean of dissimilarities or distances usually has no clear correspondence to the original rectangular data; that is, the mean of the dissimilarities does not correspond to the mean of the data. I think that in general it is better to average or otherwise handle the data first and then get dissimilarities from the transformed data.
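For instance, a minimal sketch with the simulated fish data from the question above (aggregate is just one base-R way to average within sites; object names follow the question's code):
## average abundances per site first, then compute the dissimilarity at the site level
a1.site.mean <- aggregate(a1.bio, by = list(Site = a1.env$Site), FUN = mean)
a1.site.dist <- vegdist(sqrt(a1.site.mean[, -1]), method = "bray")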
If you want to combine dissimilarities, an analogue::fuse() style approach is the most practical. However, you should understand that fuse() also scales the dissimilarity matrices to equal maxima. If you have dissimilarity measures on a 0..1 scale, this is usually a minor issue, unless one of the data sets is more homogeneous and has a lower maximum dissimilarity than the others. In fuse() they are all equalized, so it is not simple averaging but averaging after range equalization. Moreover, you must remember that averaging dissimilarities usually destroys the geometry, and this will matter if you use analysis methods for rectangularized data (adonis, betadisper, capscale in vegan).
Finally, about the geometry of combining dissimilarities: dissimilarity indices on a 0..1 scale are fractions of type A/B. Two fractions can be added (and then divided to get the average) directly only if the denominators are equal. If you ignore this and directly average the fractions, the result will not be equal to the same fraction computed from the averaged data. This is what I mean by destroying the geometry. Some open-scaled indices are not fractions and may be additive. Manhattan distances are additive. Euclidean distances are square roots of squared differences: their squares are additive, but the distances themselves are not.
I demonstrate these things by showing the effect of adding together two dissimilarities (averaging would mean dividing the result by two, or by suitable weights). I take the Barro Colorado Island data of vegan and divide it into two subsets of slightly unequal sizes. A geometry-preserving addition of the distances of the subsets will give the same result as the analysis of the complete data:
library(vegan) ## data and vegdist
library(analogue) ## fuse
data(BCI)
dim(BCI) ## [1] 50 225
x1 <- BCI[, 1:100]
x2 <- BCI[, 101:225]
## Bray-Curtis and fuse: not additive
plot(vegdist(BCI), fuse(vegdist(x1), vegdist(x2), weights = c(100/225, 125/225)))
## summing distances is straightforward (they are vectors), but preserving
## their attributes and keeping the dissimilarities needs fuse or some trick
## like below where we make dist structure dtmp to be replaced with the result
dtmp <- dist(BCI) ## dist skeleton with attributes
dtmp[] <- dist(x1, "manhattan") + dist(x2, "manhattan")
## manhattans are additive and can be averaged
plot(dist(BCI, "manhattan"), dtmp)
## fuse rescales dissimilarities and they are no longer additive
dfuse <- fuse(dist(x1, "man"), dist(x2, "man"), weights=c(100/225, 125/225))
plot(dist(BCI, "manhattan"), dfuse)
## Euclidean distances are not additive
dtmp[] <- dist(x1) + dist(x2)
plot(dist(BCI), dtmp)
## ... but squared Euclidean distances are additive
dtmp[] <- sqrt(dist(x1)^2 + dist(x2)^2)
plot(dist(BCI), dtmp)
## dfuse would rescale squared Euclidean distances like Manhattan (not shown)
I only considered addition above, but if you cannot add, you cannot average. It is a matter of taste whether this is important. Brave people will average things that cannot be averaged, but some people are more timid and want to follow the rules. I would rather go with the second group.

I like the simplicity of this answer, but it only applies to adding 2 distance matrices:
d.fused = (w * d.x) + ((1 - w) * d.y)
so I wrote my own snippet to combine an array (list) of multiple distance matrices (not just 2), using standard R packages:
# generate array of distance matrices
x <- matrix(rnorm(100), nrow = 5)
y <- matrix(rnorm(100), nrow = 5)
z <- matrix(rnorm(100), nrow = 5)
dst_array <- list(dist(x),dist(y),dist(z))
# create new distance matrix with first element of array
dst <- dst_array[[1]]
# loop over remaining array elements, add them to distance matrix
for (jj in 2:length(dst_array)){
dst <- dst + dst_array[[jj]]
}
You could also use a vector of the same length as dst_array to define scaling factors:
dst <- dst + my_scale[[jj]] * dst_array[[jj]]
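For completeness, a sketch of the full weighted loop (my_scale is a hypothetical numeric vector of weights, here normalized to sum to 1):
# hypothetical weights, one per distance matrix, normalized to sum to 1
my_scale <- c(0.5, 0.3, 0.2)
my_scale <- my_scale / sum(my_scale)
# start from a zeroed copy so every matrix enters with its own weight
dst <- 0 * dst_array[[1]]
for (jj in seq_along(dst_array)){
dst <- dst + my_scale[[jj]] * dst_array[[jj]]
}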

Related

Working with spatial data: How to find the nearest neighbour of points without replacement?

I am currently working with some forest inventory data.
The data were collected on sample plots whose positions are available as point data (spatial data).
I have two datasets:
dataset dat.1 with n sample plots of species A
dataset dat.2 with k sample plots of species B
with n < k
What I want to do is to match every point of dat.1 with a point of dat.2. The result should be n pairs of points. So n of k plots from dat.2 should be selected.
The criteria for matching are:
spatial distance between a pair of points is as close as possible
one point of dat.2 can only be matched with one point in dat.1 and vice versa. So if there is a pair of points, these points should not be used in any other pair, even if it would be useful in terms of shortest distance. The "occupied" points should not be replaced and should not be used in the further matching process.
I have been looking for a very long time for ways to perform this analysis. There are functions like st_nn from 'nngeo' or nn2 from 'RANN' which return the k nearest neighbours of a point. However, with these functions it is not possible to exclude replacement.
In the package 'matchIt' there are possibilities to perform nearest-neighbour matching without replacement. Yet these functions are designed to find the closest distance between control variables, not between spatial locations.
Could anyone come up with an idea for how to meet these requirements?
I would really appreciate any hints or suggestions for packages and / or functions that could help me with this issue.
The first thing you should do is create your own distance matrix. The rows should correspond to those in dat.1 and the columns to those in dat.2, and each entry in the matrix is the distance between the plot in the row and the plot in the column. You can do this manually by looping through your datasets and computing the Euclidean (or other) distance between the points. You can also use the match_on function in the optmatch package to do this with the following code:
d <- rbind(dat.1, dat.2)
d$dat <- c(rep(1, nrow(dat.1)), rep(0, nrow(dat.2)))
dist <- optmatch::match_on(dat ~ x.coord + y.coord, data = d,
method = "euclidean")
Once you have a distance matrix in this form, you can supply it to pairmatch in the optmatch package. pairmatch performs K:1 optimal matching without replacement. The matching is optimal in that the sum of the absolute distances between matched pairs in the matched sample is as low as possible. It doesn't guarantee that any one unit will get its nearest neighbor, but it does yield matched samples that ensure no units are matched to other units too far apart from them. You can specify an argument to controls to choose how many dat.2 units you want to be matched to each dat.1 unit. For example, to match 2 plots from dat.2 to each unit in dat.1, you can use
d$pairs <- optmatch::pairmatch(dist, controls = 2)
The output is a factor containing pair membership for each unit. Unmatched units will have a value of NA.
You can also do this in one single step with
d$pairs <- optmatch::pairmatch(dat ~ x.coord + y.coord, data = d,
method = "euclidean")
Then you can subset your dataset so only matched plots remain:
matched <- d[!is.na(d$pairs),]
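If you strictly need 1:1 pairs, an alternative sketch is to feed the same distance matrix to the assignment-problem solver solve_LSAP() from the clue package, which minimizes the total matched distance without replacement (this assumes dist can be coerced to a plain matrix with the dat.1 plots as rows and the dat.2 plots as columns, nrow <= ncol):
library(clue)
# optimal 1:1 assignment minimizing the summed distance; no plot is reused
idx <- solve_LSAP(as.matrix(dist), maximum = FALSE)
pairs <- data.frame(plot.1 = rownames(dist),
plot.2 = colnames(dist)[as.integer(idx)])
pairs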

How to remove outliers from distance matrix or Hierarchical clustering in R?

I have some questions
First, I don't know how to find and remove outliers in a distance matrix or symmetric matrix.
Second, I used hierarchical clustering with average linkage.
My data is engmale161 (already a symmetric matrix made with DTW).
engmale161 <- na.omit(engmale161)
engmale161 <- scale(engmale161)
d <- dist(engmale161, method = "euclidean")
hc1_engmale161 <- hclust(d, method="average")
and I found the optimal number of clusters, k = 4, with the silhouette, WSS & gap statistics.
> sub_grp <- cutree(hc1_engmale161, h = 60, k = 4)
> table(sub_grp)
sub_grp
  1   2   3   4
741  16   7   1
> subset(sub_grp, sub_grp == 4)
4165634865
         4
> fviz_cluster(list(data = engmale161, cluster = sub_grp), geom = "point")
So I think the upper-right point (4165634865) is an outlier, and its cluster contains only that one point.
How do I delete the outlier in the hierarchical clustering?
Just some ideas.
In a nutshell:
don't do "na.omit" on engmale161
find the outlier(s) using quantiles and box-and-whiskers
put outliers to NA in the dist matrix
proceed with your processing
Long version:
"dist" behaves nicely with NAs (from the R documentation: "Missing values are allowed, and are excluded from all computations involving the rows within which they occur. Further, when Inf values are involved, all pairs of values are excluded when their contribution to the distance gave NaN or NA.")
To find an outlier I would use concepts from exploratory statistics: use "quantile" with the default probs and na.rm = TRUE (because your dist matrix still contains NAs). You'll get values for the quartiles (the data set split in 4: 0-25%, 25-50%, and so on); 25-75% is the "box". How to find the "whiskers" is a debated topic; the standard approach is to compute the interquartile range (IQR), which is the third minus the first quartile. Then the first quartile - 1.5*IQR is the "lower" whisker, and the third quartile + 1.5*IQR is the "upper" whisker. Any value outside the whiskers is to be considered an outlier. Mark them as NA, and proceed.
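A minimal sketch of that box-and-whisker rule, assuming d is the "dist" object built from engmale161 (whether downstream functions tolerate the introduced NAs still has to be checked case by case):
dv <- as.numeric(d)                             # the dissimilarities as a plain vector
q <- quantile(dv, probs = c(0.25, 0.75), na.rm = TRUE)
iqr <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr                       # lower whisker
upper <- q[2] + 1.5 * iqr                       # upper whisker
out <- !is.na(dv) & (dv < lower | dv > upper)   # TRUE only for non-missing outliers
d[out] <- NA                                    # mark them as NA and proceed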
Best of luck, and my compliments for being someone who actually looks at the data!

How to calculate the volume under a surface defined by discrete data?

I need to determine the volume beneath a series of surfaces represented by discrete data points. In my data, each sample is stored as a separate data frame within a list of data frames. Here is some (small) example data:
df1 <- data.frame(x=c(2,2,2,3,3,3,4,4,4,5,5,5,6,6,6),
y=c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3),
z=c(0,2,0,4,6,7,3,2,1,2,7,8,9,4,2))
df2 <- data.frame(x=c(2,2,2,3,3,3,4,4,4,5,5,5,6,6,6),
y=c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3),
z=c(1,1,2,3,5,6,2,1,3,3,8,9,8,3,1))
DF <- list(df1,df2)
Answers to similar questions are either in other languages (MATLAB, Python), or the answers do not contain usable script to address the problem (as here). I can think of two acceptable ways to estimate the volume beneath each surface: 1) write out a discretized version of Simpson's rule as a function in R that is applied across the list of data frames (DF); 2) calculate an arbitrary relationship between x, y, and z and use multivariate numerical integration to find the volume under the surface (with functions like simpson2d / quad2d in the package pracma or adaptIntegrate in cubature).
Regarding the first approach, the formula for the composite Simpson's rule (that I would like to use) is here, but due to its complexity, I have been unsuccessful in writing a working double-summation function. In this expression, I(lambda(em), lambda(ex)) is equal to z in the above datasets at each x,y grid point, and Delta(em) and Delta(ex) represent the intervals between the x and y points.
The second approach would essentially extend the approach found here to multivariate spline fits and then pass the predicted z values as a function for integration. Here's what I have tried so far for this approach:
require(pracma)
df1.loess <- loess(z ~ x + y, data=DF[[1]])
mod.fun <- function(x,y) predict(df1.loess, newdata=x,y)
simpson2d(mod.fun, x=c(2,6), y=c(1,3))
But this does not yield useful results.
In reality, I have a list of almost 100 data frames for individual samples, so I really need to be able to express the solution as a series of lapply functions that automate these calculations across all data frames in the list. An example looks something like this:
require(akima)
DF.splines <- lapply(DF, function(x,y,z) interp(x = "x", y = "y", z = "z",
linear=F, nx=4, ny=2))
Unfortunately, this produces an exception for missing values and Infs. I'm extremely open to any suggestions for how to successfully implement one of these strategies, or to utilize a different (simpler?) approach. Could a kriging function (like km in the DiceKriging package) produce a better fit that could be passed on for numerical integration?
I am assuming that the volume surface mesh is defined by connecting points via straight lines. Then you can find the volume beneath that surface via
triangular tessellation of the (x,y) grid into triangles T_i with area A_i
finding the corresponding z values Z_i for each of the triangles T_i
calculating the volume V_i of the truncated prisms (defined by T_i and Z_i) via V_i=A_i*sum(Z_i)/3 (see https://en.wikipedia.org/wiki/Prism_(geometry) and https://math.stackexchange.com/questions/2371139/volume-of-truncated-prism)
summing up all truncated prism volumes V_i
Keep in mind, however, that the volume does depend on your tessellation and that the tessellation is not unique. But your problem is not fully defined in the sense that it does not describe how one should interpolate between points. So any approach to calculate a volume will have to make additional assumptions.
Going back to my solution approach, points 1 and 2 can be achieved via the geometry package.
Here is some code:
library(geometry)
getVolume=function(df) {
#find triangular tessellation of (x,y) grid
res=delaunayn(as.matrix(df[,-3]),full=TRUE,options="Qz")
#calculate sum of truncated prism volumes
sum(mapply(function(triPoints,A) A/3*sum(df[triPoints,"z"]),
split.data.frame(res$tri,seq_along(res$areas)),
res$areas))
}
sapply(DF,getVolume)
#[1] 32.50000 30.33333
Since it's hard to check whether the results are consistent, here is a simple example where we know the right answer. It's a cube with side length 2 where we have cut out a wedge along the x axis. The cut-out region is 1/4 of the total volume.
cutOutCube=expand.grid(c(0,1,2),c(0,1,2))
colnames(cutOutCube)=c("x","y")
cutOutCube$z=ifelse(cutOutCube$x==1,1,2)
sapply(list(cutOutCube),getVolume)
#[1] 6
That's correct since 2^3*(1-1/4)=6.
Another sanity check can be performed by calculating the "complement" of the volume w.r.t. a simple cuboid where all z values are set to the max z value (in your case max(z) = 9 in both cases). The simple cuboid volumes are 72 for both of your cases. Now let's define the complement surfaces and sum up the volume and complement volume:
df1c=df1
df1c$z=max(df1c$z)-df1c$z
df2c=df2
df2c$z=max(df2c$z)-df2c$z
DFc=list(df1c,df2c)
sapply(DFc,getVolume)+sapply(DF,getVolume)
#[1] 72 72
So volume and complement volume give the right simple cuboid volume in both cases.
You could approximate the surface through a "barycentric Lagrangian" approach as implemented in function barylag2d in the pracma package. Then, to avoid any vectorization problems, apply the Gaussian quadrature rules explicitly.
library(pracma)
df1 <- data.frame(x=c(2,2,2,3,3,3,4,4,4,5,5,5,6,6,6),
y=c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3),
z=c(0,2,0,4,6,7,3,2,1,2,7,8,9,4,2))
# Define the nodes in x- and y-direction
xn <- df1$x[c(1,4,7,10,13)]
yn <- df1$y[1:3]
# Define the matrix representing the function
m1 <- matrix(df1$z, nrow=5, byrow=TRUE)
f <- function(x, y)
c(pracma::barylag2d(m1, xn, yn, x, y))
# 32 nodes in integration intervals
n <- 32
xa <- 2; xb <- 6; ya <- 1; yb <- 3
# Apply quadrature rules explicitly
cx <- gaussLegendre(n, xa, xb)
x <- cx$x; wx <- cx$w
cy <- gaussLegendre(n, ya, yb)
y <- cy$x; wy <- cy$w
# Sum weights * values over all nodes
I <- 0
for (i in 1:n) {
for (j in 1:n) {
I <- I + wx[i] * wy[j] * f(x[i], y[j])
}
}
I # 40.37037
The integral value of 40 seems reasonable given the data. simpson2d or quad2d will not work in this setting.
You may try whether adaptIntegrate will work with the function f defined this way.
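A sketch of that check: adaptIntegrate from the cubature package expects a function of a single point vector, so the barycentric interpolant only needs a thin wrapper (same m1, xn, yn and integration limits as defined above):
library(cubature)
# wrap f so it takes one point c(x, y) and returns a scalar
f2 <- function(u) c(pracma::barylag2d(m1, xn, yn, u[1], u[2]))
adaptIntegrate(f2, lowerLimit = c(xa, ya), upperLimit = c(xb, yb))$integral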

Calculate Rao's quadratic entropy

Rao QE is a weighted Euclidean distance matrix. I have the vectors for the elements of the d_ij's in a data table dt, one column per element (say there are x of them). p is the final column. nrow = S. The double sums are over the lower-left (or upper-right, since it is symmetric) elements of the distance matrix.
If I only needed an unweighted distance matrix I could simply do dist() over the x columns. How do I weight the d_ijs by the product of p_i and p_j?
An example data set is at https://github.com/GeraldCNelson/nutmod/blob/master/RaoD_example.csv with the p's in the column called foodQ.ratio.
You still start with dist for the raw Euclidean distance matrix. Let it be D. As you will read from R - How to get row & column subscripts of matched elements from a distance matrix, a "dist" object is not a real matrix, but a 1D array. So first do D <- as.matrix(D) or D <- dist2mat(D) to convert it to a complete matrix before the following.
Now, let p be the vector of weights; Rao's QE is just the quadratic form p'Dp / 2:
c(crossprod(p, D %*% p)) / 2
Note, I am not doing everything in the most efficient way. I have performed a symmetric matrix-vector multiplication D %*% p using the full D rather than just its lower triangular part. However, R does not have a routine for triangular matrix-vector multiplication. So I compute the full version and then divide by 2.
This doubles the amount of computation necessary; also, making D a full matrix doubles memory costs. But if your problem is small to medium sized, this is absolutely fine. For large problems, if you are an R and C wizard, call the BLAS routine dtrmv or even dtpmv for the triangular matrix-vector computation.
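A small self-contained illustration of the quadratic form (made-up traits and weights, not the linked data set):
set.seed(1)
traits <- matrix(rnorm(15), nrow = 5)   # 5 species x 3 traits
p <- runif(5); p <- p / sum(p)          # relative abundances summing to 1
D <- as.matrix(dist(traits))            # Euclidean trait distance matrix
raoQE <- c(crossprod(p, D %*% p)) / 2   # p'Dp / 2
raoQE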
Update
I just found this simple paper: Rao's quadratic entropy as a measure of functional diversity based on multiple traits, for the definition and use of Rao's QE. It mentions that we can replace the Euclidean distance with the Mahalanobis distance. In case we want to do this, use my code in Mahalanobis distance of each pair of observations for fast computation of the Mahalanobis distance matrix.

Principal component analysis with EQUAMAX rotation

I need to do a principal component analysis (PCA) with EQUAMAX-rotation in R.
Unfortunately the function principal() I normally use for PCA does not offer this kind of rotation.
I found out that it may be possible somehow with the package GPArotation, but I could not yet figure out how to use it for the PCA.
Maybe someone can give an example on how to do an equamax-rotation PCA?
Or is there a function for PCA in another package that offers the use of equamax-rotation directly?
The package psych, from which I guess you are using principal(), has the rotations varimax, quartimax, promax, oblimin, simplimax, and cluster, but not equamax (psych p. 232), which is a compromise between varimax and quartimax.
excerpt from the STATA manual: mvrotate p.3
Rotation criteria
In the descriptions below, the matrix to be rotated is denoted as A, p denotes the number of rows of A, and f denotes the number of columns of A (factors or components). If A is a loading matrix from factor or pca, p is the number of variables, and f is the number of factors or components.
Criteria suitable only for orthogonal rotations
varimax and vgpf apply the orthogonal varimax rotation (Kaiser 1958). varimax maximizes the variance of the squared loadings within factors (columns of A). It is equivalent to cf(1/p) and to oblimin(1). varimax, the most popular rotation, is implemented with a dedicated fast algorithm and ignores all optimize options. Specify vgpf to switch to the general GPF algorithm used for the other criteria.
quartimax uses the quartimax criterion (Harman 1976). quartimax maximizes the variance of
the squared loadings within the variables (rows of A). For orthogonal rotations, quartimax is equivalent to cf(0) and to oblimax.
equamax specifies the orthogonal equamax rotation. equamax maximizes a weighted sum of the
varimax and quartimax criteria, reflecting a concern for simple structure within variables (rows of A) as well as within factors (columns of A). equamax is equivalent to oblimin(p/2) and cf(#), where # = f /(2p).
now the cf (Crawford-Ferguson) method is also available in GPArotation
cfT orthogonal Crawford-Ferguson family
cfT(L, Tmat=diag(ncol(L)), kappa=0, normalize=FALSE, eps=1e-5, maxit=1000)
The argument kappa parameterizes the family for the Crawford-Ferguson method. If m is the number of factors and p is the number of indicators then kappa values having special names are 0=Quartimax, 1/p=Varimax, m/(2*p)=Equamax, (m-1)/(p+m-2)=Parsimax, 1=Factor parsimony.
X <- matrix(rnorm(500), ncol=10)
C <- cor(X)
eig <- eigen(C)
# PCA by hand scaled by sqrt
eig$vectors * t(matrix(rep(sqrt(eig$values), 10), ncol=10))
require(psych)
PCA0 <- principal(C, rotate='none', nfactors=10) #PCA by psych
PCA0
# the original loadings in PCA0 are scaled by the square root of their eigenvalues
apply(PCA0$loadings^2, 2, sum) # SS loadings
## PCA with Equamax rotation
# now I think the Equamax rotation can be performed by cfT with kappa = m/(2*p)
# p number of variables (10)
# m (or f in the STATA manual) number of components (10)
# gives m==p --> kappa=0.5
PCA.EQ <- cfT(PCA0$loadings, kappa=0.5)
PCA.EQ
I upgraded some of my PCA knowledge thanks to your question. Hope it helps, good luck!
Walter's answer helped a great deal!
I'll add some sidenotes for what it's worth:
R's psych::principal says under the option "rotate" that more rotations are available. Under the linked "fa", there is in fact an "equamax". Sadly, the results are replicable with neither STATA nor SPSS, at least not with the standard syntax I tried:
# R:
PCA.5f=principal(data, nfactors=5, rotate="equamax", use="complete.obs")
Walter's solution replicates SPSS's equamax rotation (Kaiser-normalized by default) to the first 3 decimal places (i.e. loadings and rotation matrix fairly equivalent) using the following syntax, with m = number of factors and p = number of indicators:
# R:
PCA.5f=principal(data, nfactors=5, rotate="none", use="complete.obs")
PCA.5f.eq = cfT(PCA.5f$loadings, kappa=m/(2*p), normalize=TRUE) # replace kappa factor formula with your actual numbers!
# SPSS:
FACTOR
/VARIABLES listofvariables
/MISSING LISTWISE
/ANALYSIS listofvariables
/PRINT ROTATION
/CRITERIA FACTORS(5) ITERATE(1000)
/EXTRACTION PC
/CRITERIA ITERATE(1000)
/ROTATION EQUAMAX
/METHOD=CORRELATION.
STATA's equamax - Kaiser-normalized and unnormalized - is replicable at least to the first 4 decimal places with kappa = .5 irrespective of your actual number of factors and indicators, which seems to contradict their manual (cf. Walter's citation).
# R:
PCA.5f=principal(data, nfactors=5, rotate="none", use="complete.obs")
PCA.5f.eq = cfT(PCA.5f$loadings, kappa=.5, normalize=TRUE)
# STATA:
factor listofvars, pcf factors(5)
rotate, equamax normalize # drop the "normalize" to replicate R's "normalize=FALSE"
mat list e(r_L)
