I am trying to create a set of correlation matrices by different levels of a factor variable.
A similar question has been answered before (spearman correlation by group in R), but only for a vector result, and as far as I can see that approach doesn't generalize to a matrix.
The code below works, but its result can't be written to a csv because by() outputs a list; the error is "cannot coerce class "by" to a data.frame".
cor1 <- by(data, INDICES = data$factor0,
           FUN = function(x) cor(x[, c("x","y","z","a","b","c")],
                                 method = "spearman", use = "pairwise"))
So I am looking for a way either to coerce the above into a data frame that I can write to a csv, or to produce the same result by an alternative method that outputs a data frame.
Any help greatly appreciated
The reason you get a list is that if x is a matrix then cor(x) will be a matrix as well, not a scalar. In this case it will be a 6x6 matrix, so the result is a list of 6x6 matrices, one for each factor level.
This is the natural way to represent the result, it seems to me. You can make it into a single data frame if you want, though I'm not sure what you want the rows and columns to represent exactly. Here is one option.
data <- matrix(rnorm(500), 100, 5)
colnames(data) <- letters[1:5]
factors <- sample(LETTERS[1:3], 100, replace = TRUE)
cors <- by(data, factors, cor)
cors[[1]]
# a b c d e
# a 1.00000000 0.05389618 -0.16944040 0.25747174 0.21660217
# b 0.05389618 1.00000000 0.22735796 -0.06002965 -0.30115444
# c -0.16944040 0.22735796 1.00000000 -0.06625523 -0.01120225
# d 0.25747174 -0.06002965 -0.06625523 1.00000000 0.10402791
# e 0.21660217 -0.30115444 -0.01120225 0.10402791 1.00000000
# stack the upper triangle of each group's correlation matrix into one row
corsMatrix <- do.call(rbind, lapply(cors, function(x) x[upper.tri(x)]))
# build matching "varXvar" column labels
nms <- outer(colnames(data), colnames(data), paste, sep = "X")
colnames(corsMatrix) <- nms[upper.tri(nms)]
corsMatrix
# aXb aXc bXc aXd bXd cXd
# A 0.05389618 -0.16944040 0.22735796 0.25747174 -0.06002965 -0.06625523
# B -0.34231682 -0.14225269 0.20881053 -0.14237661 0.25970138 0.27254840
# C 0.27199944 -0.01333377 0.06402734 0.02583126 -0.03336077 -0.02207024
# aXe bXe cXe dXe
# A 0.216602173 -0.3011544 -0.01120225 0.10402791
# B 0.347006942 -0.2207421 0.33123175 -0.05290809
# C 0.007748369 -0.1257357 0.23048709 0.16037247
I'm not sure if this is what you are looking for. Another option is to export each correlation matrix to its own csv file.
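For the per-file route, a minimal sketch (assuming the cors object from above; the file-name pattern is just illustrative):
# one csv per factor level, named after the level
for (lev in names(cors)) {
  write.csv(cors[[lev]], file = paste0("cor_", lev, ".csv"))
}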
You can use ddply from the plyr package:
library(plyr)
n <- 1e2
mdat <- data.frame(factor0 = factor(LETTERS[sample(26, n, TRUE)]), x = rnorm(n),
y = rnorm(n), z = rnorm(n), a = rnorm(n), b = rnorm(n),
c = rnorm(n))
ddply(mdat, .(factor0), function(d) {
  ret <- as.data.frame(cor(d[, letters[c(1:3, 24:26)]],
                           method = "spearman", use = "pairwise"))
  ret$col <- letters[c(1:3, 24:26)]
  ret[, c(7, 1:6)]
})
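Since ddply returns an ordinary data frame, this also answers the csv part of the question; a minimal sketch (the file name is illustrative):
# capture the ddply result and write it straight to csv
res <- ddply(mdat, .(factor0), function(d) {
  ret <- as.data.frame(cor(d[, letters[c(1:3, 24:26)]],
                           method = "spearman", use = "pairwise"))
  ret$col <- letters[c(1:3, 24:26)]
  ret[, c(7, 1:6)]
})
write.csv(res, "cor_by_group.csv", row.names = FALSE)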
Your question is not entirely clear, at least to me. If I understood it correctly, you may need to build a pairwise matrix first, before computing the correlations.
You may want to try the following functions from the SciencesPo package.
require(SciencesPo)
m <- rprob(mtcars, df = nrow(mtcars) - 2)
The following will stack your matrix, making it easier to check r and the related p-values.
rstack(m)
I am trying to assess the stability of a correlation analysis by iteratively dropping a variable, and re-running the analysis.
As I understand it, this requires me to (1) create matrices of length p-1, by iteratively/sequentially dropping a variable from a dataframe, (2) run a correlation function over a series of matrices, and (3) feed the output into a common dataframe or list, for subsequent analysis.
I am able to achieve each of these steps manually, as follows:
#required library for cc function
library(CCA)
#set seed
set.seed(123)
#X and Y dataframes
X_df <- data.frame(replicate(4,sample(1:10,10,rep=TRUE)))
Y_df <- data.frame(replicate(3,sample(1:10,10,rep=TRUE)))
#X and Y as scaled matrices
X <- scale(X_df)
Y <- scale(Y_df)
#manually omit a variable/column from the X df
X1 <- scale(X_df[2:4])
X2 <- scale(X_df[c(1, 3:4)])
X3 <- scale(X_df[c(1:2, 4)])
X4 <- scale(X_df[1:3])
#manually omit a variable/column from the Y df
Y1 <- scale(Y_df[2:3])
Y2 <- scale(Y_df[c(1, 3)])
Y3 <- scale(Y_df[1:2])
#perform canonical correlation - X sets and Y
cX1 <- cc(X1,Y)$cor
cX2 <- cc(X2,Y)$cor
cX3 <- cc(X3,Y)$cor
cX4 <- cc(X4,Y)$cor
#perform canonical correlation - Y sets and X
cY1 <- cc(X,Y1)$cor
cY2 <- cc(X,Y2)$cor
cY3 <- cc(X,Y3)$cor
#get canonical correlation values into a df
XVALS <- as.data.frame(rbind(cX1, cX2, cX3, cX4))
YVALS <- as.data.frame(rbind(cY1, cY2, cY3))
Of course, I know it's very bad to do this manually, and my real data is much larger.
Unfortunately, I am pretty new to R (and coding) and have been struggling to do any of these steps in a better way. I am familiar with the (existence of) the apply functions, and also with some functions in dplyr that seem relevant (e.g., select), but despite hours of reading documentation and seemingly similar posts I just can't get it to work. Any guidance would be greatly appreciated.
Don't scale.
First of all, there is no need for scaled vectors as the code below shows.
The reason why vectors are scaled is a variant of R FAQ 7.31, see also this SO post.
With older processors the precision loss was a real problem, leading to clearly wrong results. This is no longer true, at least not in the general case.
#perform canonical correlation - original X sets and Y
cX1b <- cc(X_df[2:4], Y)$cor
cX2b <- cc(X_df[c(1, 3:4)], Y)$cor
cX3b <- cc(X_df[c(1:2, 4)], Y)$cor
cX4b <- cc(X_df[1:3], Y)$cor
XVALSb <- as.data.frame(rbind(cX1b, cX2b, cX3b, cX4b))
XVALS and XVALSb have different row names; make them equal in order to please all.equal().
row.names(XVALS) <- 1:4
row.names(XVALSb) <- 1:4
The results are not exactly equal but are within floating-point accuracy. In this case I'm testing equality with all.equal's default of .Machine$double.eps^0.5.
identical(XVALS, XVALSb)
#[1] FALSE
all.equal(XVALS, XVALSb)
#[1] TRUE
XVALS - XVALSb
# V1 V2 V3
#1 0.000000e+00 1.110223e-16 0.000000e+00
#2 -1.110223e-16 1.110223e-16 5.551115e-17
#3 1.110223e-16 -2.220446e-16 2.220446e-16
#4 1.110223e-16 4.440892e-16 1.110223e-16
The question.
To get all combinations of columns leaving one out there is function combn.
Function cc_df_one_out first calls combn on each of its arguments, then applies to those index sets an anonymous function that computes CCA::cc.
Note that the row order is not the same as in your posted example, since combn does not follow your order of column indices.
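For reference, here are the index sets combn produces for the four X columns; each column of the output leaves one variable out, which is why the rows come out in a different order:
combn(4, 3)
#      [,1] [,2] [,3] [,4]
# [1,]    1    1    1    2
# [2,]    2    2    3    3
# [3,]    3    4    4    4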
cc_df_one_out <- function(X, Y) {
  # all ways to keep ncol - 1 columns, i.e. leave one column out
  f <- function(x) combn(ncol(x), ncol(x) - 1)
  X_inx <- f(X)
  Y_inx <- f(Y)
  ccX <- t(apply(X_inx, 2, function(i) cc(X[, i], Y)$cor))
  ccY <- t(apply(Y_inx, 2, function(i) cc(X, Y[, i])$cor))
  list(XVALS = as.data.frame(ccX), YVALS = as.data.frame(ccY))
}
cc_df_one_out(X_df, Y_df)
#$XVALS
# V1 V2 V3
#1 0.8787169 0.6999526 0.5073979
#2 0.8922514 0.7244302 0.2979096
#3 0.8441566 0.7807032 0.3331449
#4 0.9059585 0.7371382 0.1344559
#
#$YVALS
# V1 V2
#1 0.8975949 0.7309265
#2 0.8484323 0.7488632
#3 0.8721945 0.7452478
I have three populations stored as individual vectors. I need to run a statistical test (Wilcoxon, if it matters) on each pair of these three populations.
I want to input three vectors into some block of code and get as output a vector of 6 p-values (each p-value, a double, is the result of one test).
I have a method that works but I am new to R and from what I've been reading I feel like there should be a better way, possibly involving storing the vectors as a data frame and using vectorization, to write this code.
Here is the code I have:
library(arrangements)
runAllTests <- function(pop1, pop2, pop3) {
  populations <- list(pop1 = pop1, pop2 = pop2, pop3 = pop3)
  colLabels <- c("pop1", "pop2", "pop3")
  # each column of this data frame is an ordered pair of labels
  perms <- data.frame(t(permutations(colLabels, 2)))
  pvals <- vector()
  # loop over each column (label pair) of that data frame
  for (pair in perms) {
    pair <- as.vector(pair)
    p1 <- as.numeric(unlist(populations[pair[1]]))
    p2 <- as.numeric(unlist(populations[pair[2]]))
    pvals <- append(pvals, wilcox.test(p1, p2, alternative = "less")$p.value)
  }
  return(pvals)
}
What is a more R appropriate way to write this code?
Note: Generating populations and comparing them all to each other is a common enough thing (and tricky enough to code) that I think this question will apply to more people than myself.
EDIT: I forgot that my actual populations are of different sizes. This means I cannot make a data frame out of the vectors (as far as I know). I can make a list of vectors though. I have updated my code with a version that works.
Yes, this is indeed common; so common, in fact, that R has a built-in function for exactly this scenario: pairwise.table.
p <- list(pop1, pop2, pop3)
pairwise.table(function(i, j) {
wilcox.test(p[[i]], p[[j]])$p.value
}, 1:3)
There are also specific versions for t tests (pairwise.t.test), proportion tests (pairwise.prop.test), and Wilcoxon tests; here's an example using pairwise.wilcox.test.
p <- list(pop1, pop2, pop3)
d <- data.frame(x=unlist(p), g=rep(seq_along(p), sapply(p, length)))
with(d, pairwise.wilcox.test(x, g))
Also, make sure you look into the p.adjust.method parameter to correctly adjust for multiple comparisons.
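For example, to switch from the default Holm correction to Bonferroni:
# same pairwise tests, with Bonferroni-adjusted p-values
with(d, pairwise.wilcox.test(x, g, p.adjust.method = "bonferroni"))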
Per your comments, you're interested in tests where the order matters; that's really hard to imagine (and it isn't true for the two-sided Wilcoxon test you mentioned), but still...
This is the pairwise.table function, edited to do tests in both directions.
pairwise.table.all <- function(compare.levels, level.names, p.adjust.method) {
  ix <- setNames(seq_along(level.names), level.names)
  pp <- outer(ix, ix, function(ivec, jvec)
    sapply(seq_along(ivec), function(k) {
      i <- ivec[k]; j <- jvec[k]
      if (i != j) compare.levels(i, j) else NA
    }))
  pp[] <- p.adjust(pp[], p.adjust.method)
  pp
}
This is a version of pairwise.wilcox.test which uses the above function, and also runs on a list of vectors, instead of a data frame in long format.
pairwise.lazerbeam.test <- function(dat, p.adjust.method = p.adjust.methods) {
  p.adjust.method <- match.arg(p.adjust.method)
  level.names <- if (!is.null(names(dat))) names(dat) else seq_along(dat)
  PVAL <- pairwise.table.all(function(i, j) {
    wilcox.test(dat[[i]], dat[[j]])$p.value
  }, level.names, p.adjust.method = p.adjust.method)
  ans <- list(method = "Lazerbeam's special method",
              data.name = paste(level.names, collapse = ", "),
              p.value = PVAL, p.adjust.method = p.adjust.method)
  class(ans) <- "pairwise.htest"
  ans
}
Output, both before and after tidying, looks like this:
> p <- list(a=1:5, b=2:8, c=10:16)
> out <- pairwise.lazerbeam.test(p)
> out
Pairwise comparisons using Lazerbeam's special method
data: a, b, c
a b c
a - 0.2821 0.0101
b 0.2821 - 0.0035
c 0.0101 0.0035 -
P value adjustment method: holm
> pairwise.lazerbeam.test(p) %>% broom::tidy()
# A tibble: 6 x 3
group1 group2 p.value
<chr> <chr> <dbl>
1 b a 0.282
2 c a 0.0101
3 a b 0.282
4 c b 0.00350
5 a c 0.0101
6 b c 0.00350
Here is an example of one approach that uses combn(), whose function argument makes it easy to apply wilcox.test() to all variable combinations.
set.seed(234)
# Create dummy data
df <- data.frame(replicate(3, sample(1:5, 100, replace = TRUE)))
# Apply wilcox.test to all combinations of variables in data frame.
res <- combn(names(df), 2, function(x) {
  list(data = paste(x[1], x[2]),
       p = wilcox.test(x = df[[x[1]]], y = df[[x[2]]])$p.value)
}, simplify = FALSE)
# Bind results
do.call(rbind, res)
data p
[1,] "X1 X2" 0.45282
[2,] "X1 X3" 0.06095539
[3,] "X2 X3" 0.3162251
I have a data frame that looks like this:
set.seed(42)
data <- runif(1000)
utility <- sample(c("abc","bcd","cde","def"),1000,replace=TRUE)
stage <- sample(c("vwx","wxy","xyz"),1000,replace=TRUE)
x <- data.frame(data,utility,stage)
head(x)
data utility stage
1 0.9148060 def xyz
2 0.9370754 abc wxy
3 0.2861395 def xyz
4 0.8304476 cde xyz
5 0.6417455 bcd xyz
6 0.5190959 abc xyz
and I want to generate cumulative distribution functions for the unique combinations of utility and stage. In my real application I'll end up generating about 100 cdfs but this random data will have 12 (4x3) unique combinations. But I'll be using each of those cdfs thousands of times, so I don't want to calculate the cdf on the fly each time. The ecdf() function works exactly as I'd like, except I'd need to vectorize it. The following code doesn't work, but it's the gist of what I'm trying to do:
ecdf_multiple <- function(x) {
  i <- 0
  utilities <- levels(x$utilities)
  stages <- levels(x$stages)
  for (utility in utilities) {
    for (stage in stages) {
      i <- i + 1
      # calculate ecdf for the unique util/stage combo
      y <- ecdf(x[x$utilities == utility & x$stage == stage, 1])
      # then assign it to a data element (list, data frame, json, whatever)
      # note - this doesn't actually work
      z[i] <- list(y, utility, stage)
    }
  }
  z  # return value
}
so after running ecdf_multiple and assigning it to a variable, I'd reference that variable somehow by passing a value (for which I want the cdf), the utility, and the stage.
Is there a way to vectorize the ecdf function (or use/build another) so that I can use the output several times without needing to generate the distributions over and over?
------- Added to respond to @Pascal's excellent suggestion -------
How might one expand this to a more general case of taking "n" dimensions of categories? This is my stab, based on Pascal's case of two dimensions. Notice how I tried to assign "y":
set.seed(42)
data <- runif(1000)
utility <- sample(c("abc","bcd","cde","def"),1000,replace=TRUE)
stage <- sample(c("vwx","wxy","xyz"),1000,replace=TRUE)
openclose <- sample(c("open","close"),1000,replace=TRUE)
x <- data.frame(data,utility,stage,openclose)
numlabels <- length(names(x))-1
y <- split(x, list(x[,2:(numlabels+1)]))
l <- lapply(y,function(x) ecdf(x[,"data"]))
#execute
utility <- "abc"
stage <- "xyz"
openclose <- "close"
comb <- paste(utility, stage, openclose, sep = ".")
# call the function
l[[comb]](.25)
During the assignment of "y" above, I get this error message:
"Error in sort.list(y) : 'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?"
The following might help:
# we create a list of criteria by excluding
# the first column of the data.frame
y <- split(x, as.list(x[,-1]))
l <- lapply(y, function(x) ecdf(x[,"data"]))
utility <- "abc"
stage <- "xyz"
comb <- paste(utility, stage, sep = ".")
l[[comb]](0.25)
# [1] 0.2613636
plot(l[[comb]])
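The same pattern covers the n-dimensional follow-up. The error in your edit comes from list(x[, 2:(numlabels+1)]), which wraps the whole sub-data-frame in a single-element list; as.list() instead turns each category column into its own grouping factor. A sketch, assuming the four-column x from the edit:
# one grouping factor per category column
y <- split(x, as.list(x[, -1]))
l <- lapply(y, function(d) ecdf(d[, "data"]))
comb <- paste("abc", "xyz", "close", sep = ".")
l[[comb]](0.25)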
I can get a correlation matrix using the following commands:
> df <- data.frame(x = c(5,6,5,9,4,2,1,3,5,7),
+                  y = c(3.1,2.5,3.8,5.4,6.5,2.5,1.5,8.1,7.1,6.1),
+                  z = c(5,6,4,9,2,4,1,6,2,4))
> cor(df)
x y z
x 1.0000000 0.2923939 0.6566866
y 0.2923939 1.0000000 0.1167084
z 0.6566866 0.1167084 1.0000000
I can get an individual p-value using:
> cor.test(df$x, df$y)$p.value
[1] 0.4123234
How can I get a matrix of p-values for all these correlation coefficients? Thanks for your help.
You can also use the package Hmisc.
An example of how it works:
library(Hmisc)
mycor <- rcorr(as.matrix(df), type = "pearson")
mycor$r contains the correlation matrix and mycor$P (capital P) the matrix of corresponding p-values.
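A quick sanity check against the single-pair result from the question (using the df defined there):
# the x-y entry should match cor.test(df$x, df$y)$p.value
mycor$P["x", "y"]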
This example calculates the p-value for each of the column combinations. It is not an optimal solution (the x-y and y-x p-values are both calculated, for example), but it should provide some inspiration. The main trick is to use expand.grid to generate the combinations of columns, and mapply to call cor.test on each combination:
# stringsAsFactors = FALSE keeps the column names as characters,
# so they can be used to index the data frame by name
col_combinations <- expand.grid(names(df), names(df), stringsAsFactors = FALSE)

cor_test_wrapper <- function(col_name1, col_name2, data_frame) {
  cor.test(data_frame[[col_name1]], data_frame[[col_name2]])$p.value
}

p_vals <- mapply(cor_test_wrapper,
                 col_name1 = col_combinations[[1]],
                 col_name2 = col_combinations[[2]],
                 MoreArgs = list(data_frame = df))

matrix(p_vals, 3, 3, dimnames = list(names(df), names(df)))
x y z
x 0.00000000 0.4123234 0.03914453
y 0.41232343 0.0000000 0.74814951
z 0.03914453 0.7481495 0.00000000
One way is to use corr.test (notice the double r) from the psych package.
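A minimal sketch with the question's df; corr.test returns, among other things, the matrices r and p:
library(psych)
ct <- corr.test(df)
ct$r  # correlation matrix
ct$p  # p-values; entries above the diagonal are adjusted for multiple tests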
... or, if you're a fan of mapply and sapply, you could write your own function to do this. Something like:
rrapply <- function(A, FUN, ...) {
  mapply(function(a, B) lapply(B, function(x) FUN(a, x, ...)),
         a = A, MoreArgs = list(B = A))
}

cor.tests <- rrapply(df, cor.test)                 # a matrix of cor.tests
apply(cor.tests, 1:2, function(x) x[[1]]$p.value)  # and it's there
And now you can use the same logic to make a matrix of t tests or, say, CIs of the correlations.
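For instance, reusing rrapply from above (a sketch; the lower confidence bound is just one choice of what to extract):
# matrix of pairwise t-test p-values
t.tests <- rrapply(df, t.test)
apply(t.tests, 1:2, function(x) x[[1]]$p.value)
# matrix of lower bounds of the correlation confidence intervals
apply(cor.tests, 1:2, function(x) x[[1]]$conf.int[1])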
I am calculating sums of matrix columns for each group, where the corresponding group labels are likewise stored in matrix columns. At the moment I am using a loop as follows:
index <- matrix(c("A","A","B","B","B","B","A","A"), 4, 2)
x <- matrix(1:8, 4, 2)
for (i in 1:2) {
  print(tapply(x[, i], index[, i], sum))
}
At the end of the day I need the following result:
1 2
A 3 15
B 7 11
Is there a way to do this using matrix operations, without a loop? On top of that, the real data is large (e.g. 500 x 10000), so it has to be fast.
Thanks in advance.
Here are a couple of solutions:
# 1. aggregate + xtabs
ag <- aggregate(c(x), data.frame(index = c(index), col = c(col(x))), sum)
xt <- xtabs(x ~ ., ag)

# 2. mapply + rowsum
m <- mapply(rowsum, as.data.frame(x), as.data.frame(index))
dimnames(m) <- list(levels(factor(index)), 1:ncol(index))
The second only works if every column of index contains at least one of each level, and it also requires at least 2 levels; however, it's faster.
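If some column of index might lack a level, one workaround (a sketch; it assumes NA is an acceptable value for empty groups) is to fix the level set explicitly:
lev <- sort(unique(c(index)))
m2 <- mapply(function(vals, grp) tapply(vals, factor(grp, levels = lev), sum),
             as.data.frame(x), as.data.frame(index))
rownames(m2) <- lev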
This is ugly but it works; there's surely a much better, more generalizable way to do it. Just getting the ball rolling.
data.frame("col1"=as.numeric(table(rep(index[,1], x[,1]))),
"col2"=as.numeric(table(rep(index[,2], x[,2]))),
row.names=names(table(index)))
I still suspect there's a better option, but this seems reasonably fast actually:
index <- matrix(sample(LETTERS[1:4], size = 500*10000, replace = TRUE), 500, 10000)
x <- matrix(sample(1:10, 500*10000, replace = TRUE), 500, 10000)

rs <- matrix(NA, 4, 10000)
rownames(rs) <- LETTERS[1:4]

for (i in LETTERS[1:4]) {
  tmp <- x
  tmp[index != i] <- 0   # zero out entries outside group i
  rs[i, ] <- colSums(tmp)
}
It runs in ~0.8 seconds on my machine. I upped the number of categories to four and scaled it up to the size of data you have. But I don't like having to copy x each time.
You can get clever with matrix multiplication, but I think you still have to do one row or column at a time.
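A sketch of that idea, one column at a time: build a 0/1 indicator matrix for the groups and multiply it into the data column. It is not faster than the loop above, but it shows the shape of the approach:
lev <- LETTERS[1:4]
rs2 <- sapply(seq_len(ncol(x)), function(i) {
  ind <- outer(lev, index[, i], "==") * 1  # groups x rows indicator matrix
  drop(ind %*% x[, i])                     # group sums for column i
})
rownames(rs2) <- lev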
You used tapply. If you add mapply, you can complete your objective.
It does the same thing as that for loop.
index <- matrix(c("A","A","B","B","B","B","A","A"), 4, 2)
x <- matrix(1:8, 4, 2)
mapply(function(i) tapply(x[, i], index[, i], sum), 1:2)
result:
[,1] [,2]
A 3 15
B 7 11