I have a data frame with n columns and want to apply a function to each combination of columns. This is very similar to how the cor() function takes a data frame as input and produces a correlation matrix as output, for example:
X <- data.frame(A=rnorm(100), B=rnorm(100), C=rnorm(100))
cor(X)
Which will generate this output:
> cor(X)
A B C
A 1.00000000 -0.01199511 0.02337429
B -0.01199511 1.00000000 0.07918920
C 0.02337429 0.07918920 1.00000000
However, I have a custom function that I need to apply to each combination of columns. I am now using a solution that uses nested for loops, which works:
f <- function(x, y) sum((x+y)^2) # some placeholder function
out <- matrix(NA, ncol = ncol(X), nrow = ncol(X)) # pre-allocate
for(i in seq_along(X)) {
for(j in seq_along(X)) {
out[i, j] <- f(X[, i], X[, j]) # apply f() to each combination
}
}
Which produces:
> out
[,1] [,2] [,3]
[1,] 422.4447 207.0833 211.4198
[2,] 207.0833 409.1242 218.2430
[3,] 211.4198 218.2430 397.5321
I am currently trying to transition into the tidyverse and would prefer to avoid using for loops. Could someone show me a tidy solution for this situation? Thanks!
You could do
library(tidyverse)
f <- function(x, y) sum((x+y)^2)
X <- data.frame(A=rnorm(100), B=rnorm(100), C=rnorm(100))
as.list(X) %>%
expand.grid(., .) %>%
mutate(out = map2_dbl(Var1, Var2, f)) %>%
as_tibble()
This isn’t a tidyverse solution, but it does avoid using for loops. We use RcppAlgos (I am the author) to generate all pair-wise permutations of columns and apply your custom function to each of these. After that, we coerce to a matrix.
set.seed(42)
X <- data.frame(A=rnorm(100), B=rnorm(100), C=rnorm(100))
library(RcppAlgos)
matrix(permuteGeneral(ncol(X), 2, repetition = TRUE, FUN = function(y) {
sum((X[,y[1]] + X[,y[2]])^2)
}), ncol = ncol(X))
# [,1] [,2] [,3]
# [1,] 429.8549 194.4271 179.4449
# [2,] 194.4271 326.8032 197.2585
# [3,] 179.4449 197.2585 409.6313
Using base R you could do:
set.seed(42)
X <- data.frame(A=rnorm(100), B=rnorm(100), C=rnorm(100))
OUT = diag(colSums((X+X)^2))
OUT[lower.tri(OUT)] = combn(X, 2, function(x) sum(do.call('+', x)^2)) #combn(X,2,function(x)sum(rowSums(x)^2))
OUT[upper.tri(OUT)] = t(OUT)[upper.tri(OUT)] # mirror the lower triangle; copying OUT[lower.tri(OUT)] directly only lines up for 3 columns
OUT
[,1] [,2] [,3]
[1,] 429.8549 194.4271 179.4449
[2,] 194.4271 326.8032 197.2585
[3,] 179.4449 197.2585 409.6313
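Another base R option is outer() over the column indices, which works for any number of columns and keeps the column names. This is only a sketch: pairwise() is a hypothetical helper name (not part of base R), and it assumes f returns a single number for each pair of columns.

```r
set.seed(42)
X <- data.frame(A = rnorm(100), B = rnorm(100), C = rnorm(100))
f <- function(x, y) sum((x + y)^2)  # placeholder function from the question

# pairwise() is a hypothetical helper, not a base R function
pairwise <- function(df, fun) {
  idx <- seq_along(df)
  # outer() passes whole index vectors, so Vectorize() is needed to make
  # the wrapper call fun() once per (i, j) pair
  out <- outer(idx, idx, Vectorize(function(i, j) fun(df[[i]], df[[j]])))
  dimnames(out) <- list(names(df), names(df))
  out
}

res <- pairwise(X, f)
res
```

Like the nested loops, this evaluates f for every ordered pair, so it does not assume that f is symmetric.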
Related
I am familiar with Matlab, but recently started to use R. I encountered a problem when using parallel computing in R.
I want to get a matrix or a 3-d array as the output of a parallel computation. In Matlab, an example of what I want to do is as follows.
X=zeros(10,5,100);
Y=zeros(100,2);
parfor i=1:100;
X(:,:,i) = randn(10,5);
Y(i,:) = randn(1,2);
end
However, as far as I can tell, foreach in R seems to return only vectors (lists?), and a matrix or an array does not seem to be allowed. I'm wondering how I should write the code to do what the Matlab code does.
Here is my suggested solution based on the Matlab code in the question.
#install.packages("foreach")
#install.packages("doParallel")
library(foreach)
library(doParallel)
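registerDoParallel(cores = 2) # register a parallel backend; adjust the number of cores as needed
# Each iteration returns one 10 x 5 matrix; simplify2array() stacks the list into a 10 x 5 x 100 array
X <- simplify2array(foreach(i = 1:100) %dopar% {
  matrix(rnorm(10 * 5), nrow = 10, ncol = 5)
})
# Each iteration returns one row of length 2; .combine = rbind stacks the rows into a 100 x 2 matrix
Y <- foreach(i = 1:100, .combine = rbind) %dopar% {
  rnorm(2)
}
stopImplicitCluster() # release the workers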
You can use the following R code to replicate your Matlab code. Note that R is vectorized, so for loops are often unnecessary.
X <- rep(0, 10*5*100)
Y <- rep(0, 100*2)
dim(X) <- c(10,5,100)
dim(Y) <- c(100,2)
set.seed(1234) # This is just for replication. Can omit if you want fully random numbers
X[] <- rnorm(length(X))
Y[] <- rnorm(length(Y))
apply(X, c(1,2), mean)
apply(Y, 2, mean)
Output:
> apply(X, c(1,2), mean)
[,1] [,2] [,3] [,4] [,5]
[1,] 0.055024278 -0.0205575146 -0.071816174 -0.087659326 0.12707484
[2,] -0.018254630 0.0821710032 -0.005589141 0.049391162 -0.01413225
[3,] -0.009338280 0.0284398951 -0.004083183 0.013750904 -0.02076238
[4,] 0.039544807 -0.0425689205 0.016054568 0.013936608 -0.06183537
[5,] 0.053641184 0.1362104005 0.069674908 0.008190821 -0.21042331
[6,] 0.065895652 -0.0098767327 -0.082148255 0.038705556 -0.05255018
[7,] 0.007917696 -0.0002747114 -0.045812106 -0.062452164 0.23984287
[8,] -0.173435275 0.0328011859 -0.053835173 0.057308693 -0.03760174
[9,] 0.047481900 -0.0225967973 -0.161777736 0.005625679 -0.05406814
[10,] -0.031460338 0.0628553230 -0.176667084 -0.098304874 0.06060704
> apply(Y, 2, mean)
[1] 0.06083109 -0.03338719
You can also simplify this a little with the following commands:
set.seed(1234)
X <- rnorm(10*5*100)
Y <- rnorm(100*2)
dim(X) <- c(10,5,100)
dim(Y) <- c(100,2)
apply(X, c(1,2), mean)
apply(Y, 2, mean)
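If each slice really has to be computed in its own iteration (as in the Matlab parfor loop), the per-iteration results can still be collected into an array without a for loop. The following is a minimal sequential sketch in base R; n_iter is a name chosen here for illustration.

```r
set.seed(1234)  # only for reproducibility
n_iter <- 100

# Each call returns one 10 x 5 matrix; with a matrix-shaped FUN.VALUE,
# vapply() stacks the results into a 10 x 5 x 100 array.
X <- vapply(seq_len(n_iter),
            function(i) matrix(rnorm(10 * 5), nrow = 10, ncol = 5),
            FUN.VALUE = matrix(0, nrow = 10, ncol = 5))

# Each call returns a length-2 vector; vapply() gives a 2 x 100 matrix,
# so transpose it to get one row per iteration.
Y <- t(vapply(seq_len(n_iter), function(i) rnorm(2), FUN.VALUE = numeric(2)))

dim(X)  # 10 5 100
dim(Y)  # 100 2
```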
I am thinking about a problem: how can I compute this in R
(A is a square matrix, k is any natural number) WITHOUT "for"?
If I've interpreted your notation correctly, perhaps something like this in base R...
A <- matrix(c(1,2,3,4), nrow = 2) #example matrix
k <- 10
B <- Reduce(`%*%`, (rep(list(A), k)), accumulate = TRUE) #list of A^(1:k)
BB <- lapply(1:k, function(k) B[[k]]/k) #list of A^(1:k)/k
Reduce(`+`, BB) #sum of series BB
[,1] [,2]
[1,] 603684.8 1319741
[2,] 879827.1 1923425
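As a sanity check, here is a sketch that assumes the same reading of the notation as the answer above, namely the sum of A^i / i for i = 1, ..., k. It builds the series with Reduce() and Map(), then compares it with an explicit loop that is used only as a reference.

```r
A <- matrix(c(1, 2, 3, 4), nrow = 2)  # example matrix from the answer
k <- 10

# A^1, A^2, ..., A^k as a list of matrices
powers <- Reduce(`%*%`, rep(list(A), k), accumulate = TRUE)

# Divide the i-th power by i, then add everything up
series <- Reduce(`+`, Map(`/`, powers, seq_len(k)))

# Reference implementation with a loop, only to verify the result
ref <- matrix(0, nrow = nrow(A), ncol = ncol(A))
P <- diag(nrow(A))
for (i in seq_len(k)) {
  P <- P %*% A        # P is now A^i
  ref <- ref + P / i
}

all.equal(series, ref)
```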
I need to calculate this
where x is a vector of length n and f is a function.
What is the most efficient calculation for this in R?
One method is a double for loop, but that is obviously slow.
One fast way to do this is the following:
Assume we have this vector:
x = c(0,1,2)
i.e. n = 3, and assume f is multiplication, f(a, b) = a * b.
Now we use the custom function expand.grid.unique, which produces the unique combinations within a vector; in other words, it is similar to the base function expand.grid, but returns each unordered pair only once:
expand.grid.unique <- function(x, y, include.equals=FALSE)
{
x <- unique(x)
y <- unique(y)
g <- function(i)
{
z <- setdiff(y, x[seq_len(i-include.equals)])
if(length(z)) cbind(x[i], z, deparse.level=0)
}
do.call(rbind, lapply(seq_along(x), g))
}
In our vector case, when we call expand.grid.unique(x,x), it produces the following result:
> expand.grid.unique(x,x)
[,1] [,2]
[1,] 0 1
[2,] 0 2
[3,] 1 2
Let's assign the result to two_by_two:
two_by_two <- expand.grid.unique(x,x)
Since our function is assumed to be multiplication, we need the sum of products, i.e. the dot product of the first and second columns of two_by_two. For this we use the %*% operator:
output <- two_by_two[,1] %*% two_by_two[,2]
> output
[,1]
[1,] 2
See ?combn
x <- 0:2
combn(x, 2)
# unique combos
#     [,1] [,2] [,3]
#[1,]    0    0    1
#[2,]    1    2    2
sum(combn(x, 2))
#[1] 6
combn() creates all the unique combinations. If you have a function that you want to sum, you can add a FUN to the call:
random_f <- function(x){x[1] + 2 * x[2]}
combn(x, 2, FUN = random_f)
#[1] 2 4 5
sum(combn(x, 2, FUN = random_f))
#[1] 11
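If f takes two separate arguments rather than one vector, it can be wrapped before passing it to combn(). The sketch below uses multiplication as a placeholder for f and also shows a fully vectorized alternative with outer(), which assumes f accepts vectors. Unlike expand.grid.unique above, neither version drops repeated values in x.

```r
x <- 0:2
f <- function(a, b) a * b  # placeholder two-argument function

# combn() hands each pair to FUN as a length-2 vector, so unpack it
sum(combn(x, 2, FUN = function(p) f(p[1], p[2])))
#[1] 2

# Vectorized alternative: evaluate f on the full grid, keep only i < j
m <- outer(x, x, f)
sum(m[upper.tri(m)])
#[1] 2
```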
I have two matrices with same number of rows and different number of columns as:
mat1 <- matrix(rnorm(20), 4, 5)
mat2 <- matrix(rnorm(12), 4, 3)
Since I have the same number of rows, I want to calculate the following correlations between the columns of the matrices:
cor.test(mat1[,1], mat2[,1])
cor.test(mat1[,1], mat2[,2])
cor.test(mat1[,1], mat2[,3])
cor.test(mat1[,2], mat2[,1])
cor.test(mat1[,2], mat2[,2])
cor.test(mat1[,2], mat2[,3])
...........
...........
cor.test(mat1[,5], mat2[,3])
for(i in 1:5){
for(j in 1:3){
pv[i,j] <- cor.test(mat1[, i], mat2[ , j])$p.value
}
}
At the end I want a matrix (5 x 3) or a vector containing the correlation values. Can anyone help?
Can I use this to return both p.value and estimate?
FUN <- function(x, y) {
res <- cor.test(x, y, method="spearman", exact=F)
return(list(c = res$estimate, p = res$p.value))
}
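# mat1 and mat2 have no column names, so iterate over column indices instead of colnames()
r1 <- outer(seq_len(ncol(mat1)), seq_len(ncol(mat2)), Vectorize(function(i,j) FUN(mat1[,i], mat2[,j])$p))
r2 <- outer(seq_len(ncol(mat1)), seq_len(ncol(mat2)), Vectorize(function(i,j) FUN(mat1[,i], mat2[,j])$c))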
Thank you.
Why don't you just use the cor function to calculate the Pearson correlation?
set.seed(1)
mat1 <- matrix(rnorm(20), 4, 5)
mat2 <- matrix(rnorm(12), 4, 3)
cor(mat1, mat2)
[,1] [,2] [,3]
[1,] 0.4406765 -0.70959590 0.10731768
[2,] -0.2566199 -0.01588993 -0.63630159
[3,] -0.9813313 0.85082165 -0.77172317
[4,] 0.6121358 -0.38564314 0.87077092
[5,] -0.6897573 0.66272015 -0.08380553
To double check,
> col_1 <- 3
> col_2 <- 2
# all.equal is used to compare numeric equality where `==` is discouraged
> all.equal(cor(mat1, mat2)[col_1, col_2], cor(mat1[,col_1], mat2[,col_2]))
[1] TRUE
They are equal!
An alternative, slightly easier to understand than loops in my opinion:
sapply(
data.frame(mat1),
function(x) Map(function(a,b) cor.test(a,b)$p.value,
list(x),
as.data.frame(mat2))
)
Result:
# X1 X2 X3 X4 X5
#[1,] 0.7400541 0.8000358 0.5084979 0.4441933 0.9104712
#[2,] 0.2918163 0.2764817 0.956807 0.6072979 0.4395218
#[3,] 0.2866105 0.4095909 0.5648188 0.1746428 0.9125866
I suppose you would like to do it without for loops. With base R, here is the double apply approach:
apply(mat1, 2, function(col_mat1){
apply(mat2, 2, function(col2, col1) {
cor.test(col2, col1)$p.value
}, col1=col_mat1)
})
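The outer apply iterates over the columns of mat1 and supplies one side of cor.test(). The inner one does the same for mat2 and supplies the other side of cor.test(). In practice, apply replaces the for loops. Note that the result is 3 x 5 (one row per column of mat2); wrap it in t() if you want 5 x 3.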
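To get both the estimate and the p-value from a single pass, one option is to run cor.test() once per column pair and then reshape the two results. This is a sketch: cor_pairs() is a hypothetical helper name, and the seed is set only to make the example reproducible.

```r
set.seed(1)
mat1 <- matrix(rnorm(20), 4, 5)
mat2 <- matrix(rnorm(12), 4, 3)

# cor_pairs() is a hypothetical helper, not part of any package
cor_pairs <- function(m1, m2, ...) {
  # expand.grid() varies i fastest, which matches column-major matrix filling
  grid <- expand.grid(i = seq_len(ncol(m1)), j = seq_len(ncol(m2)))
  res <- Map(function(i, j) cor.test(m1[, i], m2[, j], ...), grid$i, grid$j)
  shape <- function(v) matrix(v, nrow = ncol(m1), ncol = ncol(m2))
  list(
    estimate = shape(vapply(res, function(r) unname(r$estimate), numeric(1))),
    p.value  = shape(vapply(res, function(r) r$p.value, numeric(1)))
  )
}

out <- cor_pairs(mat1, mat2)
dim(out$estimate)  # 5 3
dim(out$p.value)   # 5 3
```

Extra arguments are forwarded to cor.test(), so cor_pairs(mat1, mat2, method = "spearman", exact = FALSE) would give the Spearman version asked about in the question.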
I think all you need is to define your matrix first
mat_cor <- matrix(nrow=ncol(mat1), ncol=ncol(mat2))
for(i in 1:5)
{
for(j in 1:3)
{
mat_cor[i,j] <- cor.test(mat1[, i], mat2[ , j])$p.value
}
}
Output
mat_cor
[,1] [,2] [,3]
[1,] 0.9455569 0.8362242 0.162569342
[2,] 0.7755360 0.9849619 0.775006329
[3,] 0.8799139 0.8050564 0.001358697
[4,] 0.1574388 0.1808167 0.618624825
[5,] 0.8571844 0.8897125 0.879818822
You can try with something like this
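pv <- c()
for(i in 1:dim(mat1)[2]){
  for(j in 1:dim(mat2)[2]){
    pv <- c(pv, cor.test(mat1[, i], mat2[ , j])$estimate) # append to pv, not to the function c
  }
}
# the values were collected row by row (j varies fastest), so fill the matrix by row
pv <- matrix(pv, nrow = dim(mat1)[2], ncol = dim(mat2)[2], byrow = TRUE)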
For example: I have a list of matrices, and I would like to evaluate their differences, sort of a 3-D diff. So if I have:
m1 <- matrix(1:4, ncol=2)
m2 <- matrix(5:8, ncol=2)
m3 <- matrix(9:12, ncol=2)
mat.list <- list(m1,m2,m3)
I want to obtain
mat.diff <- list(m2-m1, m3-m2)
The solution I found is the following:
mat.diff <- mapply(function (A,B) B-A, mat.list[-length(mat.list)], mat.list[-1], SIMPLIFY = FALSE) # SIMPLIFY = FALSE keeps the result as a list of matrices
Is there a nicer/built-in way to do this?
You can do this with just lapply or other ways of looping:
mat.diff <- lapply( tail( seq_along(mat.list), -1 ),
function(i) mat.list[[i]] - mat.list[[ i-1 ]] )
You can use combn to generate the pairs of list indices and apply a function to each combination.
# l is the list of matrices, e.g. l <- mat.list
combn(1:length(l),2,FUN=function(x)
if(diff(x) == 1) ## apply just for consecutive indices
l[[x[2]]]-l[[x[1]]],
simplify = FALSE) ## to get a list
Using @Arun's data, I get:
[[1]]
[,1] [,2]
[1,] 4 4
[2,] 4 4
[[2]]
NULL
[[3]]
[,1] [,2]
[1,] 4 4
[2,] 4 4
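For comparison, here is a sketch with Map(), which always returns a list and so avoids both the NULL entries above and any simplification to a matrix. It pairs each matrix with its predecessor by offsetting the list by one.

```r
m1 <- matrix(1:4, ncol = 2)
m2 <- matrix(5:8, ncol = 2)
m3 <- matrix(9:12, ncol = 2)
mat.list <- list(m1, m2, m3)

# mat.list[-1] is (m2, m3); mat.list[-length(mat.list)] is (m1, m2)
mat.diff <- Map(`-`, mat.list[-1], mat.list[-length(mat.list)])

length(mat.diff)  # 2
mat.diff[[1]]     # m2 - m1
```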