Impossible to create correlated variables from this correlation matrix? - r

I would like to generate correlated variables specified by a correlation matrix.
First I generate the correlation matrix:
require(psych)
require(Matrix)
cor.table <- matrix( sample( c(0.9,-0.9) , 2500 , prob = c( 0.8 , 0.2 ) , repl = TRUE ) , 50 , 50 )
k=1
while (k<=length(cor.table[1,])){
cor.table[1,k]<-0.55
k=k+1
}
k=1
while (k<=length(cor.table[,1])){
cor.table[k,1]<-0.55
k=k+1
}
ind<-lower.tri(cor.table)
cor.table[ind]<-t(cor.table)[ind]
diag(cor.table) <- 1
This correlation matrix is not consistent, therefore, eigenvalue decomposition is impossible.
TO make it consistent I use nearPD:
c<-nearPD(cor.table)
Once this is done I generate the correlated variables:
fit<-principal(c, nfactors=50,rotate="none")
fit$loadings
loadings<-matrix(fit$loadings[1:50, 1:50],nrow=50,ncol=50,byrow=F)
loadings
cases <- t(replicate(50, rnorm(10)) )
multivar <- loadings %*% cases
T_multivar <- t(multivar)
var<-as.data.frame(T_multivar)
cor(var)
However the resulting correlations are far from anything that I specified initially.
Is it not possible to create such correlations or am I doing something wrong?
UPDATE from Greg Snow's comment it became clear that the problem is that my initial correlation matrix is unreasonable.
The question then is how can I make the matrix reasonable. The goal is:
each of the 49 variables should correlate >.5 with the first variable.
~40 of the variables should have a high >.8 correlation with each other
the remaining ~9 variables should have a low or negative correlation with each other.
Is this whole requirement impossible ?

Try using the mvrnorm function from the MASS package rather than trying to construct the variables yourself.
**Edit
Here is a matrix that is positive definite (so it works as a correlation matrix) and comes close to your criteria, you can tweak the values from there (all the Eigen values need to be positive, so you can see how changing a number affects things):
cor.mat <- matrix(0.2,nrow=50, ncol=50)
cor.mat[1,] <- cor.mat[,1] <- 0.55
cor.mat[2:41,2:41] <- 0.9
cor.mat[42:50, 42:50] <- 0.25
diag(cor.mat) <- 1
eigen(cor.mat)$values

Some numerical experimentation based on your specifications above suggests that the generated matrix will never (what never? well, hardly ever ...) be positive definite, but it also doesn't look far from PD with these values (making lcor below negative will almost certainly make things worse ...)
rmat <- function(n=49,nhcor=40,hcor=0.8,lcor=0) {
m <- matrix(lcor,n,n) ## fill matrix with 'lcor'
## select high-cor variables
hcorpos <- sample(n,size=nhcor,replace=FALSE)
## make all of these highly correlated
m[hcorpos,hcorpos] <- hcor
## compute min real part of eigenvalues
min(Re(eigen(m,only.values=TRUE)$values))
}
set.seed(101)
r <- replicate(1000,rmat())
## NEVER pos definite
max(r)
## [1] -1.069413e-15
par(las=1,bty="l")
png("eighist.png")
hist(log10(abs(r)),breaks=50,col="gray",main="")
dev.off()

Related

Pearson coefficient per rows on large matrices

I'm currently working with a large matrix (4 cols and around 8000 rows).
I want to perform a correlation analysis using Pearson's correlation coefficient between the different rows composing this matrix.
I would like to proceed the following way:
Find Pearson's correlation coefficient between row 1 and row 2. Then between rows 1 and 3... and so on with the rest of the rows.
Then find Pearson's correlation coefficient between row 2 and row 3. Then between rows 2 and 4... and so on with the rest of the rows. Note I won't find the coefficient with row 1 again...
For those coefficients being higher or lower than 0.7 or -0.7 respectively, I would like to list on a separate file the row names corresponding to those coefficients, plus the coefficient. E.g.:
row 230 - row 5812 - 0.76
I wrote the following code for this aim. Unfortunately, it takes a too long running time (I estimated almost a week :( ).
for (i in 1:7999) {
print("Analyzing row:")
print(i)
for (j in (i+1):8000) {
value<- cor(alpha1k[i,],alpha1k[j,],use = "everything",method = "pearson")
if(value>0.7 | value<(-0.7)){
aristi <- c(row.names(alpha1k)[i],row.names(alpha1k)[j],value)
arist1p<-rbind(arist1p,aristi)
}
}
Then my question is if there's any way I could do this faster. I read about making these calculations in parallel but I have no clue on how to make this work. I hope I made myself clear enough, thank you on advance!
As Roland pointed out, you can use the matrix version of cor to simplify your task. Just transpose your matrix to get a "row" comparison.
mydf <- data.frame(a = c(1,2,3,1,2,3,1,2,3,4), b = rep(5,2,10), c = c(1:10))
cor_mat <- cor(t(mydf)) # correlation of your transposed matrix
idx <- which((abs(cor_mat) > 0.7), arr.ind = T) # get relevant indexes in a matrix form
cbind(idx, cor_mat[idx]) # combine coordinates and the correlation
Note that parameters use = everything and method = "pearson" are used by default for correlation. There is no need to specify them.

Removing Multivariate Outliers With mvoutlier

Problem
I have a dataframe that composes of > 5 variables at any time and am trying to do a K-Means of it. Because K-Means is greatly affected by outliers, I've been trying to look for a few hours on how to calculate and remove multivariate outliers. Most examples demonstrated are with 2 variables.
Possible Solutions Explored
mvoutlier - Kind user here noted that mvoutlier may be what I need.
Another Outlier Detection Method - Poster here commented with a mix of R functions to generate an ordered list of outliers.
Issues thus Far
Regarding mvoutlier, I was unable to generate a result because it noted my dataset contained negatives and it could not work because of that. I'm not sure how to alter my data to only positive since I need negatives in the set I am working with.
Regarding Another Outlier Detection Method I was able to come up with a list of outliers, but am unsure how to exclude them from the current data set. Also, I do know that these calculations are done after K-Means, and thus I probably will apply the math prior to doing K-Means.
Minimal Verifiable Example
Unfortunately, the dataset I'm using is off-limits to be shown to anyone, so what you'll need is any random data set with more than 3 variables. The code below is code converted from the Another Outlier Detection Method post to work with my data. It should work dynamically if you have a random data set as well. But it should have enough data where cluster center amount should be okay with 5.
clusterAmount <- 5
cluster <- kmeans(dataFrame, centers = clusterAmount, nstart = 20)
centers <- cluster$centers[cluster$cluster, ]
distances <- sqrt(rowSums(clusterDataFrame - centers)^2)
m <- tapply(distances, cluster$cluster, mean)
d <- distances/(m[cluster$cluster])
# 1% outliers
outliers <- d[order(d, decreasing = TRUE)][1:(nrow(clusterDataFrame) * .01)]
Output: A list of outliers ordered by their distance away from the center they reside in I believe. The issue then is getting these results paired up to the respective rows in the data frame and removing them so I can start my K-Means procedure. (Note, while in the example I used K-Means prior to removing outliers, I'll make sure to take the necessary steps and remove outliers before K-Means upon solution).
Question
With Another Outlier Detection Method example in place, how do I pair the results with the information in my current data frame to exclude those rows before doing K-Means?
I don't know if this is exactly helpful but if your data is multivariate normal you may want to try out a Wilks (1963) based method. Wilks showed that the mahalanobis distances of multivariate normal data follow a Beta distribution. We can take advantage of this (iris Sepal data used as an example):
test.dat <- iris[,-c(1,2))]
Wilks.function <- function(dat){
n <- nrow(dat)
p <- ncol(dat)
# beta distribution
u <- n * mahalanobis(dat, center = colMeans(dat), cov = cov(dat))/(n-1)^2
w <- 1 - u
F.stat <- ((n-p-1)/p) * (1/w-1) # computing F statistic
p <- 1 - round( pf(F.stat, p, n-p-1), 3) # p value for each row
cbind(w, F.stat, p)
}
plot(test.dat,
col = "blue",
pch = c(15,16,17)[as.numeric(iris$Species)])
dat.rows <- Wilks.function(test.dat); head(dat.rows)
# w F.stat p
#[1,] 0.9888813 0.8264127 0.440
#[2,] 0.9907488 0.6863139 0.505
#[3,] 0.9869330 0.9731436 0.380
#[4,] 0.9847254 1.1400985 0.323
#[5,] 0.9843166 1.1710961 0.313
#[6,] 0.9740961 1.9545687 0.145
Then we can simply find which rows of our multivariate data are significantly different from the beta distribution.
outliers <- which(dat.rows[,"p"] < 0.05)
points(test.dat[outliers,],
col = "red",
pch = c(15,16,17)[as.numeric(iris$Species[outliers])])

Generating Random Variables with given correlations between pairs of them:

I want to generate 2 continuous random variables Q1, Q2 (quantitative traits, each are normal) and 2 binary random variables Z1, Z2 (binary traits) with given pairwise correlations between all possible pairs of them.
Say
(Q1,Q2):0.23
(Q1,Z1):0.55
(Q1,Z2):0.45
(Q2,Z1):0.4
(Q2,Z2):0.5
(Z1,Z2):0.47
Please help me generate such data in R.
This is crude but might get you started in the right direction.
library(copula)
options(digits=3)
probs <- c(0.5,0.5)
corrs <- c(0.23,0.55,0.45,0.4,0.5,0.47) ## lower triangle
Simulate correlated values (first two quantitative, last two transformed to binary)
sim <- function(n,probs,corrs) {
tmp <- normalCopula( corrs, dim=4 , "un")
getSigma(tmp) ## test
x <- rCopula(1000, tmp)
x2 <- x
x2[,3:4] <- qbinom(x[,3:4],size=1,prob=rep(probs,each=nrow(x)))
x2
}
Test SSQ distance between observed and target correlations:
objfun <- function(corrs,targetcorrs,probs,n=1000) {
cc <- try(cor(sim(n,probs,corrs)),silent=TRUE)
if (is(cc,"try-error")) return(NA)
sum((cc[lower.tri(cc)]-targetcorrs)^2)
}
See how bad things are when input corrs=target:
cc0 <- cor(sim(1000,probs=probs,corrs=corrs))
cc0[lower.tri(cc0)]
corrs
objfun(corrs,corrs,probs=probs) ## 0.112
Now try to optimize.
opt1 <- optim(fn=objfun,
par=corrs,
targetcorrs=corrs,probs=c(0.5,0.5))
opt1$value ## 0.0208
Stops after 501 iterations with "max iterations exceeded". This will never work really well because we're trying to use a deterministic hill-climbing algorithm on a stochastic objective function ...
cc1 <- cor(sim(1000,probs=c(0.5,0.5),corrs=opt1$par))
cc1[lower.tri(cc1)]
corrs
Maybe try simulated annealing?
opt2 <- optim(fn=objfun,
par=corrs,
targetcorrs=corrs,probs=c(0.5,0.5),
method="SANN")
It doesn't seem to do much better than the previous value. Two possible problems (left as an exercise for the reader are) (1) we have specified a set of correlations that are not feasible with the marginal distributions we have chosen, or (2) the error in the objective function surface is getting in the way -- to do better we would have to average over more replicates (i.e. increase n).

How to find significant correlations in a large dataset

I'm using R.
My dataset has about 40 different Variables/Vektors and each has about 80 entries. I'm trying to find significant correlations, that means I want to pick one variable and let R calculate all the correlations of that variable to the other 39 variables.
I tried to do this by using a linear modell with one explaining variable that means: Y=a*X+b.
Then the lm() command gives me an estimator for a and p-value of that estimator for a. I would then go on and use one of the other variables I have for X and try again until I find a p-value thats really small.
I'm sure this is a common problem, is there some sort of package or function that can try all these possibilities (Brute force),show them and then maybe even sorts them by p-value?
You can use the function rcorr from the package Hmisc.
Using the same demo data from Richie:
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
Then:
library(Hmisc)
correlations <- rcorr(as.matrix(the_data))
To access the p-values:
correlations$P
To visualize you can use the package corrgram
library(corrgram)
corrgram(the_data)
Which will produce:
In order to print a list of the significant correlations (p < 0.05), you can use the following.
Using the same demo data from #Richie:
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
Install Hmisc
install.packages("Hmisc")
Import library and find the correlations (#Carlos)
library(Hmisc)
correlations <- rcorr(as.matrix(the_data))
Loop over the values printing the significant correlations
for (i in 1:m){
for (j in 1:m){
if ( !is.na(correlations$P[i,j])){
if ( correlations$P[i,j] < 0.05 ) {
print(paste(rownames(correlations$P)[i], "-" , colnames(correlations$P)[j], ": ", correlations$P[i,j]))
}
}
}
}
Warning
You should not use this for drawing any serious conclusion; only useful for some exploratory analysis and formulate hypothesis. If you run enough tests, you increase the probability of finding some significant p-values by random chance: https://www.xkcd.com/882/. There are statistical methods that are more suitable for this and that do do some adjustments to compensate for running multiple tests, e.g. https://en.wikipedia.org/wiki/Bonferroni_correction.
Here's some sample data for reproducibility.
m <- 40
n <- 80
the_data <- as.data.frame(replicate(m, runif(n), simplify = FALSE))
colnames(the_data) <- c("y", paste0("x", seq_len(m - 1)))
You can calculate the correlation between two columns using cor. This code loops over all columns except the first one (which contains our response), and calculates the correlation between that column and the first column.
correlations <- vapply(
the_data[, -1],
function(x)
{
cor(the_data[, 1], x)
},
numeric(1)
)
You can then find the column with the largest magnitude of correlation with y using:
correlations[which.max(abs(correlations))]
So knowing which variables are correlated which which other variables can be interesting, but please don't draw any big conclusions from this knowledge. You need to have a proper think about what you are trying to understand, and which techniques you need to use. The folks over at Cross Validated can help.
If you are trying to predict y using only one variable than you have to take the one that is mainly correlated with y.
To do this just use the command which.max(abs(cor(x,y))). If you want to use more than one variable in your model then you have to consider something like the lasso estimator
One option is to run a correlation matrix:
cor_result=cor(data)
write.csv(cor_result, file="cor_result.csv")
This correlates all the variables in the file against each other and outputs a matrix.

using k-NN in R with categorical values

I'm looking to perform classification on data with mostly categorical features. For that purpose, Euclidean distance (or any other numerical assuming distance) doesn't fit.
I'm looking for a kNN implementation for [R] where it is possible to select different distance methods, like Hamming distance.
Is there a way to use common kNN implementations like the one in {class} with different distance metric functions?
I'm using R 2.15
As long as you can calculate a distance/dissimilarity matrix (in whatever way you like) you can easily perform kNN classification without the need of any special package.
# Generate dummy data
y <- rep(1:2, each=50) # True class memberships
x <- y %*% t(rep(1, 20)) + rnorm(100*20) < 1.5 # Dataset with 20 variables
design.set <- sample(length(y), 50)
test.set <- setdiff(1:100, design.set)
# Calculate distance and nearest neighbors
library(e1071)
d <- hamming.distance(x)
NN <- apply(d[test.set, design.set], 1, order)
# Predict class membership of the test set
k <- 5
pred <- apply(NN[, 1:k, drop=FALSE], 1, function(nn){
tab <- table(y[design.set][nn])
as.integer(names(tab)[which.max(tab)]) # This is a pretty dirty line
}
# Inspect the results
table(pred, y[test.set])
If anybody knows a better way of finding the most common value in a vector than the dirty line above, I'd be happy to know.
The drop=FALSE argument is needed to preserve the subset of NN as matrix in the case k=1. If not it will be converted to a vector and apply will throw an error.

Resources