Creating correlation matrix p-values [duplicate]

I can get a correlation matrix using the following commands:
> df <- data.frame(x = c(5, 6, 5, 9, 4, 2, 1, 3, 5, 7),
+                  y = c(3.1, 2.5, 3.8, 5.4, 6.5, 2.5, 1.5, 8.1, 7.1, 6.1),
+                  z = c(5, 6, 4, 9, 2, 4, 1, 6, 2, 4))
> cor(df)
          x         y         z
x 1.0000000 0.2923939 0.6566866
y 0.2923939 1.0000000 0.1167084
z 0.6566866 0.1167084 1.0000000
I can get an individual p-value with:
> cor.test(df$x, df$y)$p.value
[1] 0.4123234
How can I get a matrix of p-values for all these correlation coefficients? Thanks for your help.

You can also use the Hmisc package.
An example of how it works:
mycor <- rcorr(as.matrix(df), type = "pearson")
mycor$r holds the correlation matrix and mycor$P the matrix of corresponding p-values (note the capital P).
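A minimal runnable sketch with the df from the question:
library(Hmisc)  # provides rcorr()

mycor <- rcorr(as.matrix(df), type = "pearson")
mycor$r  # correlation matrix
mycor$P  # p-value matrix; NA on the diagonal, and e.g. P["x", "y"] should match
         # the cor.test() value shown above (~0.412)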

This example calculates the p-value for each pair of columns. It is not an optimal solution (the x-y and y-x p-values are both calculated, for example), but it should provide some inspiration. The main trick is to use expand.grid to generate the combinations of columns and mapply to call cor.test on each combination:
col_combinations = expand.grid(names(df), names(df))
cor_test_wrapper = function(col_name1, col_name2, data_frame) {
  cor.test(data_frame[[col_name1]], data_frame[[col_name2]])$p.value
}
p_vals = mapply(cor_test_wrapper,
                col_name1 = col_combinations[[1]],
                col_name2 = col_combinations[[2]],
                MoreArgs = list(data_frame = df))
matrix(p_vals, 3, 3, dimnames = list(names(df), names(df)))
           x         y          z
x 0.00000000 0.4123234 0.03914453
y 0.41232343 0.0000000 0.74814951
z 0.03914453 0.7481495 0.00000000

One way is to use corr.test (note the double r) from the psych package.
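A minimal sketch with the df from the question (adjust = "none" keeps the raw, unadjusted p-values):
library(psych)  # provides corr.test()

ct <- corr.test(df, adjust = "none")
ct$r  # correlation matrix
ct$p  # matrix of p-values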
Or, if you're a fan of mapply and sapply, you could write your own function to do this; something like:
rrapply <- function(A, FUN, ...) {
  mapply(function(a, B) lapply(B, function(x) FUN(a, x, ...)),
         a = A, MoreArgs = list(B = A))
}
cor.tests <- rrapply(df, cor.test) # a matrix of cor.tests
apply(cor.tests, 1:2, function(x) x[[1]]$p.value) # and it's there
And now you can use the same logic to make a matrix of t-tests or, say, confidence intervals of the correlations.
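For instance, a sketch of pulling the lower bound of each 95% CI out of the same cor.tests matrix (conf.int is a standard component of cor.test's output):
# lower bound of the 95% confidence interval for each pair
apply(cor.tests, 1:2, function(x) x[[1]]$conf.int[1])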

Related

R - Cleanest way to run statistical test on every permutation of multiple populations

I have three populations stored as individual vectors. I need to run a statistical test (Wilcoxon, if it matters) on each pair of these three populations.
I want to input three vectors into some block of code and get as output a vector of 6 p-values (one p-value per test, each a double).
I have a method that works, but I am new to R, and from what I've been reading I feel there should be a better way, possibly involving storing the vectors in a data frame and using vectorization.
Here is the code I have:
library(arrangements)

runAllTests <- function(pop1, pop2, pop3) {
  populations <- list(pop1 = pop1, pop2 = pop2, pop3 = pop3)
  colLabels <- c("pop1", "pop2", "pop3")
  # This line makes a data frame where each column is a pair of labels
  perms <- data.frame(t(permutations(colLabels, 2)))
  pvals <- vector()
  # This for loop gets each column of that data frame
  for (pair in perms[,]) {
    pair <- as.vector(pair)
    p1 <- as.numeric(unlist(populations[pair[1]]))
    p2 <- as.numeric(unlist(populations[pair[2]]))
    pvals <- append(pvals, wilcox.test(p1, p2, alternative = c("less"))$p.value)
  }
  return(pvals)
}
What is a more idiomatic way to write this in R?
Note: Generating populations and comparing them all to each other is a common enough thing (and tricky enough to code) that I think this question will apply to more people than myself.
EDIT: I forgot that my actual populations are of different sizes. This means I cannot make a data frame out of the vectors (as far as I know). I can make a list of vectors though. I have updated my code with a version that works.
Yes, this is indeed common; so common, in fact, that R has a built-in function for exactly this scenario: pairwise.table.
p <- list(pop1, pop2, pop3)
# p.adjust.method is a required argument of pairwise.table; "none" gives raw p-values
pairwise.table(function(i, j) {
  wilcox.test(p[[i]], p[[j]])$p.value
}, 1:3, p.adjust.method = "none")
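This snippet (and the next one) assumes pop1, pop2, and pop3 already exist; for a self-contained check, hypothetical populations of different sizes could be generated first:
set.seed(1)
pop1 <- rnorm(10)             # made-up data; sizes deliberately unequal
pop2 <- rnorm(12, mean = 1)
pop3 <- rnorm(8,  mean = 2)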
There are also dedicated versions for t-tests, proportion tests, and Wilcoxon tests; here's an example using pairwise.wilcox.test.
p <- list(pop1, pop2, pop3)
d <- data.frame(x=unlist(p), g=rep(seq_along(p), sapply(p, length)))
with(d, pairwise.wilcox.test(x, g))
Also, make sure you look into the p.adjust.method parameter to correctly adjust for multiple comparisons.
Per your comments, you're interested in tests where the order matters; that's hard to imagine (and isn't true for the Wilcoxon test you mentioned), but still...
This is the pairwise.table function, edited to do tests in both directions.
pairwise.table.all <- function(compare.levels, level.names, p.adjust.method) {
  ix <- setNames(seq_along(level.names), level.names)
  pp <- outer(ix, ix, function(ivec, jvec)
    sapply(seq_along(ivec), function(k) {
      i <- ivec[k]; j <- jvec[k]
      if (i != j) compare.levels(i, j) else NA
    }))
  pp[] <- p.adjust(pp[], p.adjust.method)
  pp
}
This is a version of pairwise.wilcox.test which uses the above function, and also runs on a list of vectors, instead of a data frame in long format.
pairwise.lazerbeam.test <- function(dat, p.adjust.method = p.adjust.methods) {
  p.adjust.method <- match.arg(p.adjust.method)
  level.names <- if (!is.null(names(dat))) names(dat) else seq_along(dat)
  PVAL <- pairwise.table.all(function(i, j) {
    wilcox.test(dat[[i]], dat[[j]])$p.value
  }, level.names, p.adjust.method = p.adjust.method)
  ans <- list(method = "Lazerbeam's special method",
              data.name = paste(level.names, collapse = ", "),
              p.value = PVAL, p.adjust.method = p.adjust.method)
  class(ans) <- "pairwise.htest"
  ans
}
Output, both before and after tidying, looks like this:
> p <- list(a = 1:5, b = 2:8, c = 10:16)
> out <- pairwise.lazerbeam.test(p)
> out

    Pairwise comparisons using Lazerbeam's special method

data:  a, b, c

  a      b      c
a -      0.2821 0.0101
b 0.2821 -      0.0035
c 0.0101 0.0035 -

P value adjustment method: holm
> pairwise.lazerbeam.test(p) %>% broom::tidy()
# A tibble: 6 x 3
  group1 group2 p.value
  <chr>  <chr>    <dbl>
1 b      a      0.282
2 c      a      0.0101
3 a      b      0.282
4 c      b      0.00350
5 a      c      0.0101
6 b      c      0.00350
Here is an example of one approach that uses combn(), which has a function argument that makes it easy to apply wilcox.test() to every combination of variables.
set.seed(234)

# Create dummy data
df <- data.frame(replicate(3, sample(1:5, 100, replace = TRUE)))

# Apply wilcox.test to all combinations of variables in the data frame
res <- combn(names(df), 2, function(x)
  list(data = paste(x[1], x[2]),
       p = wilcox.test(x = df[[x[1]]], y = df[[x[2]]])$p.value),
  simplify = FALSE)

# Bind results
do.call(rbind, res)
     data    p
[1,] "X1 X2" 0.45282
[2,] "X1 X3" 0.06095539
[3,] "X2 X3" 0.3162251

Equivalent of rowsum function for Matrix-class (dgCMatrix)

For the base R matrix class we have the rowsum function, which is very fast for computing column sums across groups of rows.
Is there an equivalent function or approach implemented in the Matrix package?
I'm particularly interested in a fast alternative to rowsum for large dgCMatrix objects (i.e. millions of rows, but roughly 95% sparse).
I know this is an old question, but Matrix::rowSums might be the function you are looking for.
The DelayedArray Bioconductor package now has a rowsum function that accepts sparse matrices; it was very fast when I tried it.
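A minimal sketch, assuming DelayedArray is installed from Bioconductor and using a sparse matrix m and grouping vector group like those in the next answer (the DelayedArray() wrapper is a precaution in case rowsum() does not dispatch on a dgCMatrix directly):
library(Matrix)
library(DelayedArray)  # BiocManager::install("DelayedArray")

# group-wise row sums: one output row per level of 'group'
msum <- rowsum(DelayedArray(m), group)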
Here is an approach using matrix multiplication, based on an example in https://slowkow.com/notes/sparse-matrix/. First, let's create a sparse matrix to play with,
library(magrittr)
library(forcats)
library(stringr)
library(Matrix)

set.seed(42)
m <- sparseMatrix(
  i = sample(x = 1e4, size = 1e4),
  j = sample(x = 1e4, size = 1e4),
  x = rnorm(n = 1e4)
)
colnames(m) <- str_c("col", seq(ncol(m)))
rownames(m) <- str_c("row", seq(nrow(m)))
and a grouping vector defining which rows to sum,
group <- sample(1:10, nrow(m), replace = TRUE) %>%
  paste0("new_row", .) %>%
  fct_inorder()
Whether group is a factor, and the order of its levels, will affect the final row order in the merged matrix. I made group a factor with levels ordered by first appearance so that the row order resembles that of a rowsum() call with reorder = FALSE.
Next, we create a (sparse) matrix with which to left-multiply m, giving a version of m whose rows have been summed based on group,
group_mat <- sparse.model.matrix(~ 0 + group) %>% t()

# Adjust row names to get the correct final row names
rownames(group_mat) <- rownames(group_mat) %>% str_extract("(?<=^group).+")

msum <- group_mat %*% m
The result matches base::rowsum() on the dense version of the matrix,
d <- as.matrix(m)
dsum <- rowsum(d, group, reorder = FALSE)
all.equal(as.matrix(msum), dsum)
#> [1] TRUE
but the sparse-matrix multiplication method is much faster,
bench::mark( msum <- group_mat %*% m )$median
#> [1] 344µs
bench::mark( dsum <- rowsum(d, group) )$median
#> [1] 146ms

Rolling Granger Causality

I am using the MSBVAR package in R to calculate Granger causality between two variables. The data and commands are the same as those used in the package examples:
data(IsraelPalestineConflict)
granger.test(IsraelPalestineConflict, p=6)
It gives the following results:
           F-statistic      p-value
p2i -> i2p    17.63100 0.000000e+00
i2p -> p2i    10.91235 7.134737e-12
I want to apply a loop/rollapply with this function and save the results to a file. I tried the following, after looking at past answers on rollapply, but as I am new to R I don't know how to make it work.
rollapply(zoo(IsraelPalestineConflict), width = 1275,
          FUN = function(t) {
            t = granger.test(IsraelPalestineConflict, p = 6)
          },
          by.column = FALSE, align = "right")
But it gives the same results, with the first column replaced by years, and I don't know how to save the F-statistics and p-values from rollapply.
          F-statistic      p-value
2003.8077    17.63100 0.000000e+00
2003.8269    10.91235 7.134737e-12
Kind answer is requested, please.
Perhaps you want this:
library(zoo)  # provides rollapplyr; granger.test comes from MSBVAR, loaded above
granger.test.c <- function(x) c(granger.test(x, p = 6))
rollapplyr(IsraelPalestineConflict, 1275, granger.test.c, by.column = FALSE)
This creates a list of the above for p = 2, 3, 4, 5:
granger.test.c <- function(x, p) c(granger.test(x, p = p))
p <- 2:5
roll <- function(p, DF) rollapplyr(DF, 1275, granger.test.c, by.column = FALSE, p = p)
L <- lapply(p, roll, DF = IsraelPalestineConflict)
names(L) <- p
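To save the rolling results to a file, as the question asks, zoo's write.zoo() could be used (a sketch; the file name is made up):
r6 <- rollapplyr(IsraelPalestineConflict, 1275, granger.test.c, by.column = FALSE, p = 6)
write.zoo(r6, file = "granger_rolling_p6.csv", sep = ",")  # hypothetical file name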

Spearman correlation between two matrices of same dimensions

I have two matrices of equal dimensions (p and e), and I would like to compute the Spearman correlation between columns of the same name. I want the output of the pairwise correlations in a matrix (M).
I used the corr.test() function from the psych package, and here is what I did:
library(psych)
M <- data.frame(matrix(ncol = 3, nrow = ncol(p)))
M[,1] <- as.character()
G <- colnames(p)
for (rs in 1:ncol(p)) {
  M[rs,1] <- G[rs]
  cor <- corr.test(p[,rs], e[,rs], method = "spearman", adjust = "none")
  M[rs,2] <- cor$r
  M[rs,3] <- cor$p
}
But I get an error message:
Error in 1:ncol(y) : argument of length 0
Could you please show me what is wrong, or suggest another method?
No need for all this looping and indexing etc:
# test data (with column names so the output below is labelled)
p <- matrix(data = rnorm(100), nrow = 10, dimnames = list(NULL, letters[1:10]))
e <- matrix(data = rnorm(100), nrow = 10, dimnames = list(NULL, letters[1:10]))
cor <- corr.test(p, e, method = "spearman", adjust = "none")
data.frame(name = colnames(p), r = diag(cor$r), p = diag(cor$p))
#   name           r         p
# a    a  0.36969697 0.2930501
# b    b  0.16363636 0.6514773
# c    c -0.15151515 0.6760652
# etc etc
If the names of the matrices don't already match, then match them:
cor <- corr.test(p, e[,match(colnames(p),colnames(e))], method="spearman", adjust="none")
Since the two matrices are huge, it would take a very long time to run corr.test() on all possible pairs, but the loop that finally worked is as follows:
library(psych)
M <- data.frame(matrix(ncol = 3, nrow = ncol(p)))
M[,1] <- as.character()
G <- colnames(p)
for (rs in 1:ncol(p)) {
  M[rs,1] <- G[rs]
  cor <- corr.test(as.data.frame(p[,rs]), as.data.frame(e[,rs]),
                   method = "spearman", adjust = "none")
  M[rs,2] <- cor$r
  M[rs,3] <- cor$p
}
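A loop-free sketch of the same per-column test, using base R's cor.test and assuming p and e have their columns in matching order:
# per-column Spearman tests without an explicit loop
res <- mapply(function(a, b) {
  ct <- cor.test(a, b, method = "spearman")
  c(rho = unname(ct$estimate), p = ct$p.value)
}, as.data.frame(p), as.data.frame(e))
t(res)  # one row per column, giving rho and the p-value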

How do you find the sample sizes used in calculations in R?

I am running correlations between variables, some of which have missing data, so the sample sizes for the correlations are likely different. I tried print and summary, but neither shows me how big my n is for each correlation. This is a fairly simple problem, but I cannot find the answer anywhere.
Like this?
x <- c(1:100,NA)
length(x)
length(x[!is.na(x)])
You can also get the degrees of freedom like this:
y <- c(1:100,NA)
x <- c(1:100,NA)
cor.test(x,y)$parameter
But I think it would be best if you showed the code for how you are estimating the correlation, for exact help.
Here's an example of how to find the pairwise sample sizes among the columns of a matrix. If you want to apply it to (certain) numeric columns of a data frame, combine them accordingly, coerce the resulting object to matrix and apply the function.
# Example matrix:
xx <- rnorm(3000)
# Generate some NAs
vv <- sample(3000, 200)
xx[vv] <- NA
# reshape to a matrix
dd <- matrix(xx, ncol = 3)
# find the number of NAs per column
apply(dd, 2, function(x) sum(is.na(x)))
# tack on some column names
colnames(dd) <- paste0("x", seq(3))
# Function to find the number of pairwise complete observations
# among all pairs of columns in a matrix. It returns a data frame
# whose first two columns comprise all column pairs
pairwiseN <- function(mat)
{
  u <- if (is.null(colnames(mat))) paste0("x", seq_len(ncol(mat))) else colnames(mat)
  # stringsAsFactors = FALSE so the pairs index mat by column name, not factor code
  h <- expand.grid(x = u, y = u, stringsAsFactors = FALSE)
  f <- function(x, y)
    sum(apply(mat[, c(x, y)], 1, function(z) !any(is.na(z))))
  h$n <- mapply(f, h[, 1], h[, 2])
  h
}
# Call it
pairwiseN(dd)
The function can easily be improved; for example, you could set h <- expand.grid(x = u[-1], y = u[-length(u)]) to cut down on the number of calculations, you could return an n x n matrix instead of a three-column data frame, etc.
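One more improvement worth noting: the whole pairwise-n matrix can be computed in a single step with a cross-product of the non-NA indicator matrix (a sketch using the dd matrix from above):
# element (i, j) counts the rows where both column i and column j are non-NA
crossprod(!is.na(dd))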
Here is a for-loop implementation of Dennis' function above that outputs an n x n matrix directly, rather than having to pivot_wider() the result. On my Databricks cluster it cut the compute time for an 1865-row x 69-column matrix from 2.5-3 minutes down to 30-40 seconds.
Thanks for your answer Dennis; this helped me with my work.
pairwise_nxn <- function(mat)
{
  cols <- if (is.null(colnames(mat))) paste0("x", seq_len(ncol(mat))) else colnames(mat)
  nn <- data.frame(matrix(nrow = length(cols), ncol = length(cols)))
  rownames(nn) <- colnames(nn) <- cols
  f <- function(x, y)
    sum(apply(mat[, c(x, y)], 1, function(z) !any(is.na(z))))
  for (i in 1:nrow(nn))
    for (j in 1:ncol(nn))
      nn[i, j] <- f(rownames(nn)[i], colnames(nn)[j])
  nn
}
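Called on the dd matrix from above, it returns a labelled 3 x 3 data frame of pairwise counts:
pairwise_nxn(dd)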
If your variables are vectors named a and b, would something like sum(is.na(a) | is.na(b)) help you?
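If it's the usable n itself you're after, the complement of that count gives it; a one-line sketch:
# number of complete pairs between vectors a and b
sum(!is.na(a) & !is.na(b))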
