Add p-values in corrplot matrix - r

I calculated the Spearman correlation between two matrices and I'm plotting the r values using corrplot. How can I plot only the significant correlations, i.e. only those with a p-value lower than 0.05, and drop those with a higher p-value, even if they are strong correlations (high value of r)? I generated the correlation matrix using corr.test in the psych package, so I already have the p-values in cor.matrix$p.
This is the code I'm using:
library(corrplot)
library(psych)
cor.matrix <- corr.test(mydata_t1, mydata_t2, method="spearman")
# col1 is a colour palette defined elsewhere by the OP
M <- corrplot(cor.matrix$r, method="square", type="lower", col=col1(100), is.corr=TRUE, mar=c(1,1,1,1), tl.cex=0.5)
How can I modify it to plot only significant correlations?

Take a look at the examples in corrplot (run ?corrplot); it has options for doing what you want.
You can plot the p-values on the graph itself, which I think is better than putting stars, since people not familiar with that convention would have one more thing to look up.
To put p-values on the graph, do corrplot(cor.matrix$r, p.mat = cor.matrix$p, insig = "p-value"), where cor.matrix is the object holding the result of corr.test.
The insig option can (a runnable sketch follows this list):
print the p-values in the insignificant cells (as shown above)
blank out insignificant correlations, with corrplot(cor.matrix$r, p.mat = cor.matrix$p, insig = "blank")
cross out (put an X on) insignificant correlations, with corrplot(cor.matrix$r, p.mat = cor.matrix$p, insig = "pch") (the default)
do nothing to the plot, with corrplot(cor.matrix$r, p.mat = cor.matrix$p, insig = "n")
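For a self-contained illustration of these options, here is a minimal sketch using the built-in mtcars data instead of the question's mydata_t1/mydata_t2 (which are not shown); adjust = "none" is used so that cor.matrix$p holds the unadjusted p-values:
library(psych)
library(corrplot)
# Spearman correlations and p-values on a built-in data set
cor.matrix <- corr.test(mtcars, method = "spearman", adjust = "none")
# p-values printed in the insignificant cells
corrplot(cor.matrix$r, p.mat = cor.matrix$p, insig = "p-value")
# insignificant cells left blank
corrplot(cor.matrix$r, p.mat = cor.matrix$p, sig.level = 0.05, insig = "blank")
# insignificant cells crossed out (the default)
corrplot(cor.matrix$r, p.mat = cor.matrix$p, insig = "pch")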
If you do want stars and p-values on the correlation matrix plot, take a look at this thread: Correlation Corrplot Configuration.
Though I have to say I really like @Sven Hohenstein's elegant subset solution.

Create a copy of cor.matrix and replace the correlation coefficients corresponding to non-significant p-values with zero:
cor.matrix2 <- cor.matrix
# find cells with p-values > 0.05 and replace the corresponding
# correlation coefficients with zero
cor.matrix2$r[cor.matrix2$p > 0.05] <- 0
# use this matrix for corrplot
M <- corrplot(cor.matrix2$r, method="square", type="lower", col=col1(100),
              is.corr=TRUE, mar=c(1,1,1,1), tl.cex=0.5)
The replaced values will appear as white cells.

What you are asking is similar to what subset does:
Return subsets of vectors, matrices or data frames which meet
conditions.
So you can do:
cor.matrix <- subset(cor.matrix, p < 0.05)
P <- corrplot(cor.matrix$r, method="square", type="lower", col=col1(100), is.corr=TRUE, mar=c(1,1,1,1), tl.cex=0.5)

Related

Removing Multivariate Outliers With mvoutlier

Problem
I have a data frame that consists of more than 5 variables at any given time, and I am trying to run K-Means on it. Because K-Means is greatly affected by outliers, I have spent a few hours looking for how to calculate and remove multivariate outliers. Most examples demonstrated use only 2 variables.
Possible Solutions Explored
mvoutlier - Kind user here noted that mvoutlier may be what I need.
Another Outlier Detection Method - Poster here commented with a mix of R functions to generate an ordered list of outliers.
Issues thus Far
Regarding mvoutlier, I was unable to generate a result because it reported that my dataset contained negative values and could not work because of that. I'm not sure how to shift my data to positive values only, since I need the negatives in the set I am working with.
Regarding Another Outlier Detection Method I was able to come up with a list of outliers, but am unsure how to exclude them from the current data set. Also, I do know that these calculations are done after K-Means, and thus I probably will apply the math prior to doing K-Means.
Minimal Verifiable Example
Unfortunately, the dataset I'm using cannot be shared, so you'll need any random data set with more than 3 variables. The code below is converted from the Another Outlier Detection Method post to work with my data; it should work dynamically with a random data set as well, provided there is enough data for 5 cluster centers to be reasonable.
clusterAmount <- 5
cluster <- kmeans(dataFrame, centers = clusterAmount, nstart = 20)
centers <- cluster$centers[cluster$cluster, ]
distances <- sqrt(rowSums((dataFrame - centers)^2))
m <- tapply(distances, cluster$cluster, mean)
d <- distances/(m[cluster$cluster])
# top 1% as outliers
outliers <- d[order(d, decreasing = TRUE)][1:(nrow(dataFrame) * .01)]
Output: a list of outliers ordered by their distance from the center they belong to, I believe. The issue then is pairing these results up with the respective rows in the data frame and removing them so I can start my K-Means procedure. (Note: while in this example I ran K-Means before removing outliers, once I have a solution I will make sure to remove the outliers before running K-Means.)
Question
With the Another Outlier Detection Method example in place, how do I pair the results with the rows of my current data frame so I can exclude those rows before doing K-Means?
I don't know if this is exactly what you need, but if your data are multivariate normal you may want to try a Wilks (1963)-based method. Wilks showed that the Mahalanobis distances of multivariate normal data follow a Beta distribution. We can take advantage of this (the iris Sepal data are used as an example):
test.dat <- iris[, 1:2]  # Sepal.Length and Sepal.Width only
Wilks.function <- function(dat){
  n <- nrow(dat)
  p <- ncol(dat)
  # scaled Mahalanobis distances follow a Beta distribution
  u <- n * mahalanobis(dat, center = colMeans(dat), cov = cov(dat)) / (n - 1)^2
  w <- 1 - u
  F.stat <- ((n - p - 1) / p) * (1 / w - 1)    # computing the F statistic
  p <- 1 - round(pf(F.stat, p, n - p - 1), 3)  # p-value for each row
  cbind(w, F.stat, p)
}
plot(test.dat,
     col = "blue",
     pch = c(15, 16, 17)[as.numeric(iris$Species)])
dat.rows <- Wilks.function(test.dat); head(dat.rows)
# w F.stat p
#[1,] 0.9888813 0.8264127 0.440
#[2,] 0.9907488 0.6863139 0.505
#[3,] 0.9869330 0.9731436 0.380
#[4,] 0.9847254 1.1400985 0.323
#[5,] 0.9843166 1.1710961 0.313
#[6,] 0.9740961 1.9545687 0.145
Then we can simply flag the rows of our multivariate data whose distances are extreme under this distribution (p < 0.05):
outliers <- which(dat.rows[,"p"] < 0.05)
points(test.dat[outliers, ],
       col = "red",
       pch = c(15, 16, 17)[as.numeric(iris$Species[outliers])])
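To tie this back to the original question, the flagged rows can then be dropped before running K-Means (a minimal sketch; the choice of 3 centers is arbitrary here, and the guard covers the case where nothing is flagged):
# remove the flagged rows, then cluster the cleaned data
clean.dat <- if (length(outliers) > 0) test.dat[-outliers, ] else test.dat
cluster <- kmeans(clean.dat, centers = 3, nstart = 20)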

Different results when performing PCA in R with princomp() and principal()

I tried to use princomp() and principal() to do PCA in R with the USArrests data set. However, I got two different results for the loadings/rotation and the scores.
First, I centered and normalised the original data frame so it is easier to compare the outputs.
library(psych)
trans_func <- function(x){
  x <- (x - mean(x)) / sd(x)
  return(x)
}
A <- USArrests
USArrests <- apply(USArrests, 2, trans_func)
princompPCA <- princomp(USArrests, cor = TRUE)
principalPCA <- principal(USArrests, nfactors=4 , scores=TRUE, rotate = "none",scale=TRUE)
Then I got the results for the loadings and scores using the following commands:
princompPCA$loadings
principalPCA$loadings
Could you please help me understand why there is a difference, and how we can interpret these results?
At the very end of the help document of ?principal:
"The eigen vectors are rescaled by the sqrt of the eigen values to produce the component loadings more typical in factor analysis."
So principal returns the scaled loadings. In fact, principal produces a factor model estimated by the principal component method.
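A quick way to verify this relation on the question's data (a minimal sketch; abs() is used because individual components may come back with flipped signs):
library(psych)
pc <- princomp(USArrests, cor = TRUE)
pr <- principal(USArrests, nfactors = 4, rotate = "none")
# principal()'s loadings are princomp()'s eigenvectors scaled by the sqrt of the eigenvalues
all.equal(abs(unclass(pc$loadings) %*% diag(pc$sdev)),
          abs(unclass(pr$loadings)),
          check.attributes = FALSE)  # should return TRUE (up to numerical tolerance)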
Four years later, I would like to provide a more accurate answer to this question. I use the iris data as an example.
data = iris[, 1:4]
First, do PCA via an eigendecomposition of the covariance matrix:
eigen_res = eigen(cov(data))
l = eigen_res$values
q = eigen_res$vectors
Then the eigenvector corresponding to the largest eigenvalue gives the loadings of the first principal component:
q[,1]
We can treat this as a reference, i.e. the correct answer. Now we check the results from the different R functions.
First, by function 'princomp'
res1 = princomp(data)
res1$loadings[,1]
# compare with
q[,1]
No problem: this function actually just returns the same results as 'eigen'. Now move on to 'principal':
library(psych)
res2 = principal(data, nfactors=4, rotate="none")
# the loadings of the first PC is
res2$loadings[,1]
# compare it with the results by eigendecomposition
sqrt(l[1])*q[,1] # re-scale the eigen vector by sqrt of eigen value
You may find they are still different. The reason is that the 'principal' function performs the eigendecomposition on the correlation matrix by default. Note: PCA is not invariant to rescaling of the variables. If you modify the code as
res2 = principal(data, nfactors=4, rotate="none", cor="cov")
# the loadings of the first PC is
res2$loadings[,1]
# compare it with the results by eigendecomposition
sqrt(l[1])*q[,1] # re-scale the eigen vector by sqrt of eigen value
Now, you will get the same results as 'eigen' and 'princomp'.
To summarize:
If you want to do PCA, you'd better apply the 'princomp' function.
PCA is a special case of the factor model, or a simplified version of the factor model. It is essentially just an eigendecomposition.
We can apply PCA to get an approximation of a factor model. It does not model the specific factors, i.e. the epsilons in a factor model. So, if you change the number of factors in your model, you will still get the same estimates of the loadings (see the quick check below). This is different from maximum likelihood estimation.
If you are estimating a factor model, you'd better use the 'principal' function, since it provides more functionality, like rotation, calculating the scores by different methods, and so on.
Rescaling the loadings of a PCA model doesn't affect the results much, since you still project the data onto the same optimal directions, i.e. the directions that maximize the variance of the resulting PCs.
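As a quick check of the point about the number of components (a minimal sketch reusing the iris-based data from above):
library(psych)
data <- iris[, 1:4]
# with rotate = "none", the first component's loadings do not depend on nfactors
l2 <- principal(data, nfactors = 2, rotate = "none")$loadings[, 1]
l4 <- principal(data, nfactors = 4, rotate = "none")$loadings[, 1]
all.equal(l2, l4)  # should return TRUE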
# Another way to see the relations between eigen(), princomp() and principal().
# Here DATA is assumed to be a numeric data matrix with 2 columns and R its correlation matrix.
ev <- eigen(R)                                     # R is the correlation matrix of DATA
ev$vectors %*% diag(ev$values) %*% t(ev$vectors)   # reconstructs R
pc <- princomp(scale(DATA, center = FALSE, scale = TRUE), cor = TRUE)
p  <- principal(DATA, nfactors = 2, rotate = "none")  # nfactors = 2 so that all components are returned
# eigenvalues (square roots, i.e. component standard deviations)
ev$values^0.5
pc$sdev
p$values^0.5
# eigenvectors - loadings
ev$vectors
pc$loadings
p$weights %*% diag(p$values^0.5)
pc$loadings %*% diag(pc$sdev)
p$loadings
# weights: eigenvectors divided by the square roots of the eigenvalues (should match p$weights up to sign)
ee <- diag(0, 2)
for (j in 1:2) {
  for (i in 1:2) {
    ee[i, j] <- ev$vectors[i, j] / p$values[j]^0.5
  }
}
ee
# scores
s <- as.matrix(scale(DATA, center = TRUE, scale = TRUE)) %*% ev$vectors
scale(s)
p$scores
scale(pc$scores)

Z-scores rounded to infinity for small p-values in R

I am working with a genome-wide association study dataset, with p-values ranging from 1E-30 to 1. I have an R data frame "data" which includes a variable "p" for the p-values.
I need to perform genomic correction of the p-values, which I am doing using the following code:
p=data$p
Zsq = qchisq(1-p, 1)
lambda = median(Zsq)/0.456
newZsq = Zsq/lambda
Newp = 1-pchisq(newZsq, 1)
In the command on the second line, where I use the qchisq function to convert p-values to (squared) z-scores, the values for p-values < 1E-16 come out as infinity, because 1 - p is rounded to exactly 1 in double precision. This means the p-values for my most significant data points are rounded to 0 after the genomic correction, and I lose their ranking.
Is there any way around this?
Read help(".Machine"). Then set lower.tail=FALSE and avoid taking differences with 1:
p <- 1e-17
Zsq = qchisq(p, 1, lower.tail=FALSE)
lambda = median(Zsq)/0.456
newZsq = Zsq/lambda
Newp = pchisq(newZsq, 1, lower.tail=FALSE)
#[1] 0.4994993
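Applied to a whole vector of p-values in the range mentioned in the question (a minimal sketch with made-up values; data$p would take the place of p here):
# simulated p-values spanning roughly 1E-30 to 1
p <- c(1e-30, 1e-17, 1e-8, 0.01, 0.5, 0.99)
Zsq <- qchisq(p, 1, lower.tail = FALSE)   # finite even for p = 1e-30
lambda <- median(Zsq) / 0.456             # lambda is only meaningful for a genome-wide set of tests
newZsq <- Zsq / lambda
Newp <- pchisq(newZsq, 1, lower.tail = FALSE)
Newp                                      # no zeros; the ranking of the smallest p-values is preserved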

How do I weight variables with gower distance in r

I am new to R and am working on a data set including nominal, ordinal and metric data.
Therefore I am using the gower distance. In the next step I use this distance with hclust(x, method="complete") to create clusters based on this distance.
Now I want to know how I can put different weights on variables in the gower distance.
The documentation says:
daisy(x, metric = c("euclidean", "manhattan", "gower"), stand = FALSE, type = list(), weights = rep.int(1, p))
So there is a way, but I am unsure about the syntax (weights = ...).
The documentation of weights and rep.int did not help.
I also didn't find any other helpful explanation.
I would be very glad if someone can help out.
Not sure if this is what you are getting at, but...
Let's say you have 5 variables, e.g. 5 columns in your data frame or matrix. Then weights would be a vector of length=5 containing the weights for the corresponding columns.
The notation weights=rep.int(1,p) in the documentation simply means that the default value of weights is a vector of length p containing all 1's, i.e. the weights are all equal to 1. Elsewhere in the documentation it explains that p is the number of columns.
Also, note that daisy(...) produces a dissimilarity matrix. This is what you use in hclust(...). So if x is a data frame or matrix with five columns for your variables, then:
d <- daisy(x, metric="gower", weights=c(1,2,3,4,5))
hc <- hclust(d, method="complete")
EDIT (Response to OP's comments)
The code below shows how the clustering depends on the weights.
clust.anal <- function(df, w, h) {
  require(cluster)
  d <- daisy(df, metric = "gower", weights = w)
  hc <- hclust(d, method = "complete")
  clust <- cutree(hc, h = h)
  plot(hc, sub = paste("weights=", paste(w, collapse = ",")))
  rect.hclust(hc, h = h, border = "red")
}
df <- read.table("ExampleClusterData.csv", sep=";",header=T)
df[1] <- factor(df[[1]])
df[2] <- factor(df[[2]])
# weights increase with col number...
wts=c(1,2,3,4,5,6,7)
clust.anal(df,wts,h=0.8)
# weights decrease with col number...
wts=c(7,6,5,4,3,2,1)
clust.anal(df,wts,h=0.8)
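Since ExampleClusterData.csv is not available, here is a stand-in: a small synthetic data frame with two nominal and five metric columns (an assumption about the original file's structure) so that clust.anal can be run as written:
library(cluster)
set.seed(42)
n <- 50
df <- data.frame(
  f1 = factor(sample(letters[1:3], n, replace = TRUE)),     # nominal
  f2 = factor(sample(c("low", "high"), n, replace = TRUE)), # nominal
  x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
  x4 = rnorm(n), x5 = rnorm(n)                              # metric
)
wts <- c(1, 2, 3, 4, 5, 6, 7)  # one weight per column
clust.anal(df, wts, h = 0.8)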

correlation failure - Pearson

I want to write correlation information to a data file as follows:
korelacja = cor(p2, d2, method = "pearson", use = "complete.obs")
korelacja2 = cor(p2, d2, method = "kendall", use = "complete.obs")
korelacja3 = cor(p2, d2, method = "spearman", use = "complete.obs")
dane = paste(korelacja, korelacja2, korelacja3, sep = ';')
write(dane, file = nazwa, append = TRUE)
The results look strange to me: the Pearson correlation is very high (always equal to one), but Kendall and Spearman are very low. I created scatterplots and I don't see a linear relationship.
It's not hard to replicate this pattern if you have some large outliers in your data that dominate the Pearson correlation but are relatively insignificant in the non-parametric (Kendall/Spearman) approaches. For example, here's a concocted data set with nothing going on except for one large outlier:
> set.seed(1001)
> x <- c(runif(1000),1e5)
> y <- c(runif(1000),1e5)
> cor(x,y,method="pearson")
[1] 1
> cor(x,y,method="kendall")
[1] -0.02216583
> cor(x,y,method="spearman")
[1] -0.03335352
This is consistent with your description so far, although you ought in this case to be able to see the outliers in your scatterplots ...
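As a quick follow-up check (reusing the simulated x and y above), dropping that single extreme point brings the Pearson estimate back down to the same near-zero level as the rank-based measures:
cor(x[-1001], y[-1001], method = "pearson")  # near zero once the outlier is removed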
