PCA scores for only the first principal component are of "wrong" sign - R

I am currently trying to get into principal component analysis and regression. I therefore tried calculating the principal components of a given matrix by hand and comparing them with the results you get out of R's princomp function.
The following is the code for doing PCA by hand:
### compute principal component loadings and scores by hand
df <- matrix(nrow = 5, ncol = 3, c(90, 90, 60, 60, 30,
                                   60, 90, 60, 60, 30,
                                   90, 30, 60, 90, 60))
# calculate the covariance matrix to see the variances and covariances of the variables
cov.mat <- cov.wt(df)
cen <- cov.mat$center
n.obs <- cov.mat$n.obs
cv <- cov.mat$cov * (1 - 1/n.obs)
## calculate the eigenvalues and eigenvectors
edc <- eigen(cv, symmetric = TRUE)
ev <- edc$values
evec <- edc$vectors
cn <- paste0("Comp.", 1L:ncol(cv))
### get loadings (or principal component weights) out of the eigenvectors and compute scores
loadings <- structure(edc$vectors, class = "loadings")
df.scaled <- scale(df, center = cen, scale = FALSE)
scr <- df.scaled %*% evec
I compared my results to the ones obtained with the princomp function:
pca.mod <- princomp(df)
loadings.mod <- pca.mod$loadings
scr.mod <- pca.mod$scores
scr
scr.mod
> scr
[,1] [,2] [,3]
[1,] -6.935190 32.310906 7.7400588
[2,] -48.968014 -19.339313 -0.3529382
[3,] 1.733797 -8.077726 -1.9350147
[4,] 13.339605 18.519500 -9.5437444
[5,] 40.829802 -23.413367 4.0916385
> scr.mod
Comp.1 Comp.2 Comp.3
[1,] 6.935190 32.310906 7.7400588
[2,] 48.968014 -19.339313 -0.3529382
[3,] -1.733797 -8.077726 -1.9350147
[4,] -13.339605 18.519500 -9.5437444
[5,] -40.829802 -23.413367 4.0916385
So apparently I did quite well: the computed scores match, at least in magnitude. However, the scores for the first principal component differ in sign, which is not the case for the other two.
This leads to two questions:
I have read that it is no problem to multiply the loadings and the scores of principal components by minus one. Does this still hold when only one of the principal components has a different sign?
What am I doing "wrong" from a computational standpoint? The procedure seems straightforward to me and I don't see what I could change in my own calculations to get the same signs as princomp.
When checking this with the mtcars data set, the signs for my first PC matched, but now the second and fourth PC scores have different signs compared to the package. I cannot make any sense of this. Any help is appreciated!

The signs of eigenvectors and loadings are arbitrary, so there is nothing "wrong" here. The only thing that you should expect to be preserved is the overall pattern of signs within each column of loadings or scores, i.e. in the example above the princomp answer for the PC1 scores gives +,+,-,-,- while yours gives -,-,+,+,+. That's fine. If yours gave e.g. -,+,-,-,+ that would be trouble (because the two would no longer be equivalent up to multiplication by -1).
However, while it's generally true that the signs are arbitrary and hence could vary across algorithms, compilers, operating systems, etc., there's an easy solution in this particular case. princomp has a fix_sign argument:
fix_sign: Should the signs of the loadings and scores be chosen so that
the first element of each loading is non-negative?
Try princomp(df,fix_sign=FALSE)$scores and you'll see that the signs (probably!) line up with your results. (In general the fix_sign=TRUE option is useful because it breaks the symmetry in a specific way and thus will always result in the same answers across all platforms.)
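For what it's worth, here is a small sketch of both checks (it assumes an R version new enough to have fix_sign, and reuses the objects from the question):
## scores without the sign fix should match the hand-rolled ones up to
## a per-component sign flip
pca.raw <- princomp(df, fix_sign = FALSE)
all.equal(abs(unclass(pca.raw$scores)), abs(scr), check.attributes = FALSE)
## or mimic fix_sign = TRUE yourself: flip each component so that the
## first element of its loading vector is non-negative
flip <- ifelse(evec[1, ] < 0, -1, 1)
all.equal(unclass(scr.mod), scr %*% diag(flip), check.attributes = FALSE)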

Related

Removing Multivariate Outliers With mvoutlier

Problem
I have a data frame that consists of > 5 variables at any time and am trying to run K-means on it. Because K-means is greatly affected by outliers, I've been trying for a few hours to find out how to calculate and remove multivariate outliers. Most examples I found use only 2 variables.
Possible Solutions Explored
mvoutlier - Kind user here noted that mvoutlier may be what I need.
Another Outlier Detection Method - Poster here commented with a mix of R functions to generate an ordered list of outliers.
Issues thus Far
Regarding mvoutlier, I was unable to generate a result because it noted my dataset contained negatives and it could not work because of that. I'm not sure how to shift my data to be positive only, since I need the negatives in the set I am working with.
Regarding Another Outlier Detection Method, I was able to come up with a list of outliers, but am unsure how to exclude them from the current data set. Also, I know that these calculations are done after K-means, so I will probably apply the math prior to doing K-means.
Minimal Verifiable Example
Unfortunately, the dataset I'm using cannot be shared, so any random data set with more than 3 variables will do. The code below is adapted from the Another Outlier Detection Method post to work with my data. It should work dynamically with a random data set as well, but the data should be large enough that 5 cluster centers are reasonable.
clusterAmount <- 5
cluster <- kmeans(dataFrame, centers = clusterAmount, nstart = 20)
centers <- cluster$centers[cluster$cluster, ]
distances <- sqrt(rowSums((dataFrame - centers)^2))
m <- tapply(distances, cluster$cluster, mean)
d <- distances/(m[cluster$cluster])
# 1% outliers
outliers <- d[order(d, decreasing = TRUE)][1:(nrow(dataFrame) * .01)]
Output: a list of outliers ordered by their distance from the center of the cluster they belong to, I believe. The issue then is pairing these results up with the corresponding rows of the data frame and removing them so I can start my K-means procedure. (Note: while in the example I used K-means prior to removing outliers, I'll make sure to take the necessary steps and remove outliers before K-means once I have a solution.)
Question
With Another Outlier Detection Method example in place, how do I pair the results with the information in my current data frame to exclude those rows before doing K-Means?
I don't know if this is exactly helpful, but if your data are multivariate normal you may want to try a method based on Wilks (1963). Wilks showed that the Mahalanobis distances of multivariate normal data follow a Beta distribution. We can take advantage of this (the iris Sepal data are used as an example):
test.dat <- iris[, c(1, 2)]  # the Sepal columns
Wilks.function <- function(dat){
  n <- nrow(dat)
  p <- ncol(dat)
  # beta distribution
  u <- n * mahalanobis(dat, center = colMeans(dat), cov = cov(dat))/(n-1)^2
  w <- 1 - u
  F.stat <- ((n-p-1)/p) * (1/w-1) # computing F statistic
  p <- 1 - round( pf(F.stat, p, n-p-1), 3) # p value for each row
  cbind(w, F.stat, p)
}
plot(test.dat,
     col = "blue",
     pch = c(15,16,17)[as.numeric(iris$Species)])
dat.rows <- Wilks.function(test.dat); head(dat.rows)
# w F.stat p
#[1,] 0.9888813 0.8264127 0.440
#[2,] 0.9907488 0.6863139 0.505
#[3,] 0.9869330 0.9731436 0.380
#[4,] 0.9847254 1.1400985 0.323
#[5,] 0.9843166 1.1710961 0.313
#[6,] 0.9740961 1.9545687 0.145
Then we can simply find which rows of our multivariate data are significantly different from the beta distribution.
outliers <- which(dat.rows[,"p"] < 0.05)
points(test.dat[outliers,],
       col = "red",
       pch = c(15,16,17)[as.numeric(iris$Species[outliers])])
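To connect this back to the question, a minimal sketch of the exclusion step (it assumes you then want 5 centers, as in the question):
## drop the flagged rows (guarding against the case of no outliers),
## then run k-means on the cleaned data
clean.dat <- if (length(outliers) > 0) test.dat[-outliers, ] else test.dat
cl <- kmeans(clean.dat, centers = 5, nstart = 20)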

Different results when performing PCA in R with princomp() and principal()

I tried to use princomp() and principal() to do PCA in R with the data set USArrests. However, I got two different results for the loadings/rotation and the scores.
First, I centered and normalised the original data frame so it is easier to compare the outputs.
library(psych)
trans_func <- function(x){
  x <- (x - mean(x))/sd(x)
  return(x)
}
A <- USArrests
USArrests <- apply(USArrests, 2, trans_func)
princompPCA <- princomp(USArrests, cor = TRUE)
principalPCA <- principal(USArrests, nfactors=4 , scores=TRUE, rotate = "none",scale=TRUE)
Then I got the results for the loadings and scores using the following commands:
princompPCA$loadings
principalPCA$loadings
Could you please help me explain why there is a difference, and how should we interpret these results?
At the very end of the help document of ?principal:
"The eigen vectors are rescaled by the sqrt of the eigen values to produce the component loadings more typical in factor analysis."
So principal returns the scaled loadings. In fact, principal produces a factor model estimated by the principal component method.
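To see this concretely, here is a small check (a sketch on the USArrests data from the question; it assumes that principal's unrotated loadings are exactly the rescaled eigenvectors described in the help text, and uses abs() because signs may differ per component):
library(psych)
e <- eigen(cor(USArrests))
rescaled <- e$vectors %*% diag(sqrt(e$values))       # eigenvectors * sqrt(eigenvalues)
pr <- principal(USArrests, nfactors = 4, rotate = "none")
max(abs(abs(rescaled) - abs(unclass(pr$loadings))))  # should be essentially 0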
Four years on, I would like to provide a more accurate answer to this question. I use the iris data as an example.
data = iris[, 1:4]
First, do PCA by the eigen-decomposition
eigen_res = eigen(cov(data))
l = eigen_res$values
q = eigen_res$vectors
Then the eigenvector corresponding to the largest eigenvalue gives the factor loadings of the first component:
q[,1]
We can treat this as a reference, i.e. the correct answer. Now let's check the results from different R functions.
First, the function princomp:
res1 = princomp(data)
res1$loadings[,1]
# compare with
q[,1]
No problem: this function actually just returns the same result as eigen. Now move on to principal:
library(psych)
res2 = principal(data, nfactors=4, rotate="none")
# the loadings of the first PC is
res2$loadings[,1]
# compare it with the results by eigendecomposition
sqrt(l[1])*q[,1] # re-scale the eigen vector by sqrt of eigen value
You may find they are still different. The problem is that the principal function does the eigendecomposition on the correlation matrix by default. Note: PCA is not invariant to rescaling of the variables. If you modify the code as
res2 = principal(data, nfactors=4, rotate="none", cor="cov")
# the loadings of the first PC is
res2$loadings[,1]
# compare it with the results by eigendecomposition
sqrt(l[1])*q[,1] # re-scale the eigen vector by sqrt of eigen value
Now, you will get the same results as 'eigen' and 'princomp'.
To summarize:
If you want to do PCA, you'd better use the princomp function.
PCA is a special case of the factor model, or a simplified version of the factor model. It is just equivalent to eigendecomposition.
We can apply PCA to get an approximation of a factor model. It doesn't care about the specific factors, i.e. the epsilons in a factor model. So, if you change the number of factors in your model, you will get the same estimates of the loadings. This is different from maximum likelihood estimation.
If you are estimating a factor model, you'd better use the principal function, since it provides more functionality, like rotation, calculating the scores by different methods, and so on.
Rescaling the loadings of a PCA model doesn't affect the results much, since you still project the data onto the same optimal directions, i.e. the ones that maximize the variance of the resulting PCs.
ev <- eigen(R) # R is a correlation matrix of DATA (assumed here to have 2 variables)
ev$vectors %*% diag(ev$values) %*% t(ev$vectors)
pc <- princomp(scale(DATA, center = F, scale = T), cor = TRUE)
p <- principal(DATA, rotate = "none")
# eigenvalues
ev$values^0.5
pc$sdev
p$values^0.5
# eigenvectors - loadings
ev$vectors
pc$loadings
p$weights %*% diag(p$values^0.5)
pc$loadings %*% diag(pc$sdev)
p$loadings
# weights
ee <- diag(0, 2)
for (j in 1:2) {
  for (i in 1:2) {
    ee[i,j] <- ev$vectors[i,j]/p$values[j]^0.5
  }
}
ee
# scores
s <- as.matrix(scale(DATA, center = T, scale = T)) %*% ev$vectors
scale(s)
p$scores
scale(pc$scores)

Custom contrasts in R: contrast coefficient matrix or contrast matrix / coding scheme? And how to get there?

Custom contrasts are very widely used in analyses, e.g.: "Do DV values at level 1 and level 3 of this three-level factor differ significantly?"
Intuitively, this contrast is expressed in terms of cell means as:
c(1,0,-1)
One or more of these contrasts, bound as columns, form a contrast coefficient matrix, e.g.
mat = matrix(ncol = 2, byrow = TRUE, data = c(
1, 0,
0, 1,
-1, -1)
)
[,1] [,2]
[1,] 1 0
[2,] 0 1
[3,] -1 -1
However, when it comes to running these contrasts specified by the coefficient matrix, there is a lot of (apparently contradictory) information on the web and in books. My question is which information is correct?
Claim 1: contrasts(factor) takes a coefficient matrix
In some examples, the user is shown that the intuitive contrast coefficient matrix can be used directly via the contrasts() or C() functions. So it's as simple as:
contrasts(myFactor) <- mat
Claim 2: Transform coefficients to create a coding scheme
Elsewhere (e.g. the UCLA stats pages) we are told the coefficient matrix (or basis matrix) must be transformed from a coefficient matrix into a contrast matrix before use. This involves taking the inverse of the transpose of the coefficient matrix, (mat')⁻¹, or, in Rish:
contrasts(myFactor) = solve(t(mat))
This method requires padding the coefficient matrix with an initial column representing the intercept (mean) so that it is square and invertible. To avoid this, some sites recommend using a generalized inverse function which can cope with non-square matrices, i.e. MASS::ginv():
contrasts(myFactor) = ginv(t(mat))
Third option: premultiply by the transpose, take the inverse, and postmultiply by the transpose
Elsewhere again (e.g. a note from SPSS support), we learn the correct algebra is (mat'mat)⁻¹mat',
implying to me that the correct way to create the contrast matrix should be:
x = solve(t(mat)%*% mat)%*% t(mat)
           [,1]       [,2]       [,3]
[1,]  0.6666667 -0.3333333 -0.3333333
[2,] -0.3333333  0.6666667 -0.3333333
contrasts(myFactor) = x
My question is, which is right? (If I am interpreting and describing each piece of advice accurately). How does one specify custom contrasts in R for lm, lme etc?
Refs
Claim 2 is correct (see the answers here and here), and sometimes claim 1 is, too. This is because there are cases in which the generalized inverse of the (transposed) coefficient matrix is equal to the matrix itself.
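As a quick sanity check (a sketch using the mat from the question), the padded-and-inverted route and the ginv() route produce the same coding, and in this non-orthogonal example it is clearly not mat itself:
library(MASS)  # for ginv()
mat <- matrix(ncol = 2, byrow = TRUE, data = c(1, 0,  0, 1,  -1, -1))
## claim 2, padded with a column of ones, inverted, intercept column dropped
coding1 <- solve(t(cbind(1, mat)))[, -1]
## claim 2 via the generalized inverse
coding2 <- ginv(t(mat))
all.equal(coding1, coding2, check.attributes = FALSE)  # TRUE
round(coding1, 3)                                      # not equal to mat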
For what it's worth....
If you have a factor with 3 levels (A, B, and C) and you want to test the following orthogonal contrasts, A vs. B and the average of A and B vs. C, your contrast codes would be:
Cont1<- c(1,-1, 0)
Cont2<- c(.5,.5, -1)
If you do as directed on the UCLA site (transform the coefficients to make a coding scheme), as such:
contrasts(Variable) <- solve(t(cbind(c(1,1,1), Cont1, Cont2)))[,2:3]
then your results are IDENTICAL to what you would get if you had created two dummy variables, e.g.:
Dummy1<- ifelse(Variable=="A", 1, ifelse(Variable=="B", -1, 0))
Dummy2<- ifelse(Variable=="A", .5, ifelse(Variable=="B", .5, -1))
and entered them both into the regression equation instead of your factor, which makes me inclined to think that this is the correct way.
PS I don't write the most elegant R code, but it gets the job done. Sorry, I'm sure there are easier ways to recode variables, but you get the gist.
I'm probably missing something, but in each of your three examples, you specify the contrast matrix in the same way, i.e.
## Note it should be the plural: contrasts
contrasts(myFactor) = x
The only thing that differs is the value of x.
Using the data from the UCLA website as an example
hsb2 = read.table('http://www.ats.ucla.edu/stat/data/hsb2.csv', header=T, sep=",")
#creating the factor variable race.f
hsb2$race.f = factor(hsb2$race, labels=c("Hispanic", "Asian", "African-Am", "Caucasian"))
We can specify either the treatment version of the contrasts
contrasts(hsb2$race.f) = contr.treatment(4)
summary(lm(write ~ race.f, hsb2))
or the sum version
contrasts(hsb2$race.f) = contr.sum(4)
summary(lm(write ~ race.f, hsb2))
Alternatively, we can specify a bespoke contrast matrix, as sketched below.
See ?contr.sum for other standard contrasts.
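For the bespoke case, here is a hedged sketch that applies the generalized-inverse recipe from the question to the four-level race.f factor (the contrasts chosen are purely illustrative):
library(MASS)  # for ginv()
## illustrative contrasts (level order: Hispanic, Asian, African-Am, Caucasian):
## 1) Hispanic vs Caucasian, 2) Asian vs African-American,
## 3) (Hispanic + Caucasian)/2 vs (Asian + African-American)/2
coef.mat <- cbind(c(1, 0, 0, -1),
                  c(0, 1, -1, 0),
                  c(1, -1, -1, 1)/2)
contrasts(hsb2$race.f) <- ginv(t(coef.mat))
summary(lm(write ~ race.f, hsb2))
Each race.f coefficient should then estimate the corresponding contrast of the group means.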

Chi squared goodness of fit for a geometric distribution

As an assignment I had to develop an algorithm and generate samples from a geometric distribution with PMF P(X = n) = p(1 - p)^(n-1) for n = 1, 2, ..., with p = 0.3.
Using the inverse transform method, I came up with the expression X = ceiling(log(U)/log(1 - p)) for generating the values, where U represents a value, or n values depending on the size of the sample, drawn from a Unif(0,1) distribution and p is 0.3 as stated in the PMF above.
I have the algorithm, the implementation in R, and I have already generated QQ plots to visually assess how well the empirical values fit the theoretical ones (generated with R), i.e., whether the generated sample indeed follows the geometric distribution.
Now I want to subject the generated sample to a goodness-of-fit test, namely the chi-square test, yet I'm having trouble doing this in R.
[I think this was moved a little hastily, in spite of your response to whuber's question, since I think before solving the 'how do I write this algorithm in R' problem, it's probably more important to deal with the 'what you're doing is not the best approach to your problem' issue (which certainly belongs where you posted it). Since it's here, I will deal with the 'doing it in R' aspect, but I would urge you to go back and ask about the second question (as a new post).]
Firstly the chi-square test is a little different depending on whether you test
H0: the data come from a geometric distribution with parameter p
or
H0: the data come from a geometric distribution with parameter 0.3
If you want the second, it's quite straightforward. First, with the geometric, if you want to use the chi-square approximation to the distribution of the test statistic, you will need to group adjacent cells in the tail. The 'usual' rule - much too conservative - suggests that you need an expected count in every bin of at least 5.
I'll assume you have a nice large sample size. In that case, you'll have many bins with substantial expected counts and you don't need to worry so much about keeping it so high, but you will still need to choose how you will bin the tail (whether you just choose a single cut-off above which all values are grouped, for example).
I'll proceed as if n were say 1000 (though if you're testing your geometric random number generation, that's pretty low).
First, compute your expected counts:
dgeom(0:20,.3)*1000
[1] 300.0000000 210.0000000 147.0000000 102.9000000 72.0300000 50.4210000
[7] 35.2947000 24.7062900 17.2944030 12.1060821 8.4742575 5.9319802
[13] 4.1523862 2.9066703 2.0346692 1.4242685 0.9969879 0.6978915
[19] 0.4885241 0.3419669 0.2393768
Warning: dgeom and friends go from x=0, not x=1; while you can shift the inputs and outputs to the R functions, it's much easier to subtract 1 from all your geometric values and test those. I will proceed as if your sample has had 1 subtracted, so that it goes from 0.
I'll cut that off at the 15th term (x=14), and group 15+ into its own group (a single group in this case). If you wanted to follow the 'greater than five' rule of thumb, you'd cut it off after the 12th term (x=11). In some cases (such as smaller p), you might want to split the tail across several bins rather than one.
> expec <- dgeom(0:14,.3)*1000
> expec <- c(expec, 1000-sum(expec))
> expec
[1] 300.000000 210.000000 147.000000 102.900000 72.030000 50.421000
[7] 35.294700 24.706290 17.294403 12.106082 8.474257 5.931980
[13] 4.152386 2.906670 2.034669 4.747562
The last cell is the "15+" category. We also need the probabilities.
Now we don't yet have a sample; I'll just generate one:
y <- rgeom(1000,0.3)
but now we want a table of observed counts:
(x <- table(factor(y,levels=0:14),exclude=NULL))
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 <NA>
292 203 150 96 79 59 47 25 16 10 6 7 0 2 5 3
Now you could compute the chi-square directly and then calculate the p-value:
> (chisqstat <- sum((x-expec)^2/expec))
[1] 17.76835
(pval <- pchisq(chisqstat,15,lower.tail=FALSE))
[1] 0.2750401
but you can also get R to do it:
> chisq.test(x,p=expec/1000)
Chi-squared test for given probabilities
data: x
X-squared = 17.7683, df = 15, p-value = 0.275
Warning message:
In chisq.test(x, p = expec/1000) :
Chi-squared approximation may be incorrect
Now the case of unspecified p is similar, but (to my knowledge) you can no longer get chisq.test to do it directly; you have to do it the first way, estimating the parameter from the data (by maximum likelihood or minimum chi-square), and then test as above, except that you lose one degree of freedom for the estimated parameter.
See the example of doing a chi-square for a Poisson with an estimated parameter here; the geometric follows much the same approach as above, with the adjustments described at the link (dealing with the unknown parameter, including the loss of 1 degree of freedom).
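For concreteness, a rough sketch of that estimated-p case (it assumes y is the shifted, from-0 sample used above and mirrors the binning choices of the fully specified example):
## ML estimate of p for a geometric starting at 0: p = 1/(1 + mean)
p.hat <- 1/(1 + mean(y))
## expected counts, lumping 15+ into a single tail bin as before
expec2 <- dgeom(0:14, p.hat) * 1000
expec2 <- c(expec2, 1000 - sum(expec2))
## observed counts over the same 16 bins
obs <- table(factor(pmin(y, 15), levels = 0:15))
chisqstat2 <- sum((obs - expec2)^2/expec2)
## df = 16 bins - 1 - 1 estimated parameter = 14
pchisq(chisqstat2, df = 14, lower.tail = FALSE)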
Let us assume you've got your randomly-generated variates in a vector x. You can do the following:
x <- rgeom(1000,0.2)
x_tbl <- table(x)
x_val <- as.numeric(names(x_tbl))
x_df <- data.frame(count=as.numeric(x_tbl), value=x_val)
# Expand to fill in "gaps" in the values caused by 0 counts
all_x_val <- data.frame(value = 0:max(x_val))
x_df <- merge(all_x_val, x_df, by="value", all.x=TRUE)
x_df$count[is.na(x_df$count)] <- 0
# Get theoretical probabilities
x_df$eprob <- dgeom(x_df$value, 0.2)
# Chi-square test: once with asymptotic dist'n,
# once with bootstrap evaluation of chi-sq test statistic
chisq.test(x=x_df$count, p=x_df$eprob, rescale.p=TRUE)
chisq.test(x=x_df$count, p=x_df$eprob, rescale.p=TRUE,
simulate.p.value=TRUE, B=10000)
There's a "goodfit" function described as "Goodness-of-fit Tests for Discrete Data" in package "vcd".
G.fit <- goodfit(x, type = "nbinomial", par = list(size = 1))
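If I remember the vcd interface correctly, the fitted object then has summary and plot methods for a formal test and a quick diagnostic (a sketch):
summary(G.fit)  # likelihood-ratio goodness-of-fit test
plot(G.fit)     # rootogram of observed vs fitted counts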
I was going to use the code you had posted in an earlier question, but it now appears that you have deleted that code. I find that offensive. Are you using this forum to gather homework answers and then defacing it to remove the evidence? (Deleted questions can still be seen by those of us with sufficient rep, and the interface prevents deletion of questions with upvoted answers, so you should not be able to delete this one.)
Generate a QQ Plot for testing a geometrically distributed sample
--- question---
I have a sample of n elements generated in R with
sim.geometric <- function(nvals)
{
p <- 0.3
u <- runif(nvals)
ceiling(log(u)/log(1-p))
}
for which I want to test the distribution, specifically whether it indeed follows a geometric distribution. I want to generate a QQ plot but have no idea how to.
--------reposted answer----------
A QQ plot should be a straight line when the sample is compared to a "true" sample drawn from a geometric distribution with the same probability parameter. One gives two vectors to the function, which essentially compares their inverse ECDFs at each quantile. (Your attempt is not particularly successful:)
sim.res <- sim.geometric(100)
sim.rgeom <- rgeom(100, 0.3)
qqplot(sim.res, sim.rgeom)
Here I follow the lead of the authors of qqplot's help page (which results in flipping that upper curve around the line of identity):
png("QQ.png")
qqplot(qgeom(ppoints(100), prob = 0.3), sim.res,
       main = expression("Q-Q plot for" ~~ {G}[n == 100]))
dev.off()
---image not included---
You can add a "line of good fit" by plotting a line through through the 25th and 75th percentile points for each distribution. (I added a jittering feature to this to get a better idea where the "probability mass" was located:)
sim.res <- sim.geometric(500)
qqplot(jitter(qgeom(ppoints(500), prob = 0.3)), jitter(sim.res),
       main = expression("Q-Q plot for" ~~ {G}[n == 500]),
       ylim = c(0, max(qgeom(ppoints(500), prob = 0.3), sim.res)),
       xlim = c(0, max(qgeom(ppoints(500), prob = 0.3), sim.res)))
qqline(sim.res, distribution = function(p) qgeom(p, 0.3),
       prob = c(0.25, 0.75), col = "red")

Impossible to create correlated variables from this correlation matrix?

I would like to generate correlated variables specified by a correlation matrix.
First I generate the correlation matrix:
require(psych)
require(Matrix)
cor.table <- matrix(sample(c(0.9, -0.9), 2500, prob = c(0.8, 0.2), replace = TRUE), 50, 50)
k = 1
while (k <= length(cor.table[1,])) {
  cor.table[1,k] <- 0.55
  k = k + 1
}
k = 1
while (k <= length(cor.table[,1])) {
  cor.table[k,1] <- 0.55
  k = k + 1
}
ind<-lower.tri(cor.table)
cor.table[ind]<-t(cor.table)[ind]
diag(cor.table) <- 1
This correlation matrix is not consistent (it is not positive semi-definite, i.e. it has negative eigenvalues), so it cannot serve as a valid correlation matrix.
To make it consistent I use nearPD:
cor.pd <- as.matrix(nearPD(cor.table)$mat)
Once this is done I generate the correlated variables:
fit <- principal(cor.pd, nfactors = 50, rotate = "none")
fit$loadings
loadings <- matrix(fit$loadings[1:50, 1:50], nrow = 50, ncol = 50, byrow = FALSE)
loadings
cases <- t(replicate(50, rnorm(10)) )
multivar <- loadings %*% cases
T_multivar <- t(multivar)
var<-as.data.frame(T_multivar)
cor(var)
However the resulting correlations are far from anything that I specified initially.
Is it not possible to create such correlations or am I doing something wrong?
UPDATE: from Greg Snow's comment it became clear that the problem is that my initial correlation matrix is unreasonable.
The question then is how can I make the matrix reasonable. The goal is:
each of the 49 variables should correlate >.5 with the first variable.
~40 of the variables should have a high >.8 correlation with each other
the remaining ~9 variables should have a low or negative correlation with each other.
Is this whole requirement impossible?
Try using the mvrnorm function from the MASS package rather than trying to construct the variables yourself.
Edit:
Here is a matrix that is positive definite (so it works as a correlation matrix) and comes close to your criteria; you can tweak the values from there (all the eigenvalues need to be positive, so you can see how changing a number affects things):
cor.mat <- matrix(0.2,nrow=50, ncol=50)
cor.mat[1,] <- cor.mat[,1] <- 0.55
cor.mat[2:41,2:41] <- 0.9
cor.mat[42:50, 42:50] <- 0.25
diag(cor.mat) <- 1
eigen(cor.mat)$values
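From there, a minimal sketch of the mvrnorm step suggested above (n and the zero mean vector are arbitrary choices here):
library(MASS)
set.seed(1)
dat <- mvrnorm(n = 1000, mu = rep(0, 50), Sigma = cor.mat)
round(cor(dat)[1:5, 1:5], 2)  # sample correlations should sit close to cor.mat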
Some numerical experimentation based on your specifications above suggests that the generated matrix will never (what never? well, hardly ever ...) be positive definite, but it also doesn't look far from PD with these values (making lcor below negative will almost certainly make things worse ...)
rmat <- function(n=49, nhcor=40, hcor=0.8, lcor=0) {
  m <- matrix(lcor, n, n)  ## fill matrix with 'lcor'
  ## select high-cor variables
  hcorpos <- sample(n, size=nhcor, replace=FALSE)
  ## make all of these highly correlated
  m[hcorpos, hcorpos] <- hcor
  ## compute min real part of eigenvalues
  min(Re(eigen(m, only.values=TRUE)$values))
}
set.seed(101)
r <- replicate(1000,rmat())
## NEVER pos definite
max(r)
## [1] -1.069413e-15
par(las=1,bty="l")
png("eighist.png")
hist(log10(abs(r)),breaks=50,col="gray",main="")
dev.off()
