parallel computing in R


I am familiar with Matlab, but recently started to use R. I encountered a problem when using parallel computing in R.
I want to get a matrix or a 3-d array as the output of a parallel computation. In Matlab, an example of what I want to do is as follows.
X=zeros(10,5,100);
Y=zeros(100,2);
parfor i=1:100;
X(:,:,i) = randn(10,5);
Y(i,:) = randn(1,2);
end
However, as far as I can tell, foreach in R seems to return only vectors (or lists?), and returning a matrix or an array does not seem to be allowed. I'm wondering how I need to write the code to do what the Matlab code does.

Here is my suggested solution based on the Matlab code in the question.
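#install.packages("foreach")
#install.packages("doParallel")
library(foreach)
library(doParallel)
registerDoParallel(cores = 2) # %dopar% needs a registered parallel backend
# Each iteration returns one 10 x 5 slice; foreach collects them in a list
X <- foreach(i = 1:100) %dopar% matrix(rnorm(10 * 5), nrow = 10, ncol = 5)
X <- simplify2array(X) # stack the 100 slices into a 10 x 5 x 100 array
# Each iteration returns one row of length 2; .combine = rbind builds a 100 x 2 matrix
Y <- foreach(i = 1:100, .combine = rbind) %dopar% rnorm(2)
stopImplicitCluster() # release the workers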
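The same pattern (let each task return one piece, then assemble the pieces afterwards) also works with the parallel package that ships with R. The following is only a sketch under some assumptions: two local workers, and the names x_list and y_list are made up for illustration.

```r
library(parallel)

cl <- makeCluster(2)                 # assumption: two local workers
clusterSetRNGStream(cl, 1234)        # reproducible random streams on the workers

# Each task returns one piece of the result
x_list <- parLapply(cl, 1:100, function(i) matrix(rnorm(10 * 5), nrow = 10))
y_list <- parLapply(cl, 1:100, function(i) rnorm(2))
stopCluster(cl)

# Assemble the pieces after the parallel part has finished
X <- simplify2array(x_list)          # 10 x 5 x 100 array
Y <- do.call(rbind, y_list)          # 100 x 2 matrix
```

Assigning into X[ , , i] inside the parallel loop does not work the way it does in Matlab's parfor, because each worker only modifies its own copy of X.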

You can use the following R code to replicate your Matlab code. Note that R is vectorized, so for loops are often unnecessary.
X <- rep(0, 10*5*100)
Y <- rep(0, 100*2)
dim(X) <- c(10,5,100)
dim(Y) <- c(100,2)
set.seed(1234) # This is just for replication. Can omit if you want fully random numbers
X[] <- rnorm(length(X))
Y[] <- rnorm(length(Y))
apply(X, c(1,2), mean)
apply(Y, 2, mean)
Output:
> apply(X, c(1,2), mean)
[,1] [,2] [,3] [,4] [,5]
[1,] 0.055024278 -0.0205575146 -0.071816174 -0.087659326 0.12707484
[2,] -0.018254630 0.0821710032 -0.005589141 0.049391162 -0.01413225
[3,] -0.009338280 0.0284398951 -0.004083183 0.013750904 -0.02076238
[4,] 0.039544807 -0.0425689205 0.016054568 0.013936608 -0.06183537
[5,] 0.053641184 0.1362104005 0.069674908 0.008190821 -0.21042331
[6,] 0.065895652 -0.0098767327 -0.082148255 0.038705556 -0.05255018
[7,] 0.007917696 -0.0002747114 -0.045812106 -0.062452164 0.23984287
[8,] -0.173435275 0.0328011859 -0.053835173 0.057308693 -0.03760174
[9,] 0.047481900 -0.0225967973 -0.161777736 0.005625679 -0.05406814
[10,] -0.031460338 0.0628553230 -0.176667084 -0.098304874 0.06060704
> apply(Y, 2, mean)
[1] 0.06083109 -0.03338719
You can also simplify this a little with the following commands:
set.seed(1234)
X <- rnorm(10*5*100)
Y <- rnorm(100*2)
dim(X) <- c(10,5,100)
dim(Y) <- c(100,2)
apply(X, c(1,2), mean)
apply(Y, 2, mean)

Related

Calculation of the sum of matrices in R

I am thinking about a problem: how can I calculate the sum of matrices in R
(A is a square matrix, k is any natural number) WITHOUT "for"?

If I've interpreted your notation correctly, perhaps something like this in base R...
A <- matrix(c(1,2,3,4), nrow = 2) #example matrix
k <- 10
B <- Reduce(`%*%`, (rep(list(A), k)), accumulate = TRUE) #list of A^(1:k)
BB <- lapply(1:k, function(k) B[[k]]/k) #list of A^(1:k)/k
Reduce(`+`, BB) #sum of series BB
[,1] [,2]
[1,] 603684.8 1319741
[2,] 879827.1 1923425
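To sanity-check this approach, it can be wrapped in a small function and compared with the series written out by hand for a small k. This is only a sketch: series_sum is a hypothetical helper name, and it assumes the intended sum is A^1/1 + A^2/2 + ... + A^k/k, which is how the code above reads the notation.

```r
# Hypothetical helper: sum of A^i / i for i = 1..k (assumed reading of the question)
series_sum <- function(A, k) {
  powers <- Reduce(`%*%`, rep(list(A), k), accumulate = TRUE)  # list of A^1, ..., A^k
  Reduce(`+`, Map(`/`, powers, seq_len(k)))                    # sum of A^i / i
}

A <- matrix(c(1, 2, 3, 4), nrow = 2)

# The same series written out by hand for k = 3
manual <- A + (A %*% A) / 2 + (A %*% A %*% A) / 3
isTRUE(all.equal(series_sum(A, 3), manual))
```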

Apply a function to each combination of columns

I have a data frame with n columns and want to apply a function to each combination of columns. This is very similar to how the cor() function takes a data frame as input and produces a correlation matrix as output, for example:
X <- data.frame(A=rnorm(100), B=rnorm(100), C=rnorm(100))
cor(X)
Which will generate this output:
> cor(X)
A B C
A 1.00000000 -0.01199511 0.02337429
B -0.01199511 1.00000000 0.07918920
C 0.02337429 0.07918920 1.00000000
However, I have a custom function that I need to apply to each combination of columns. I am now using a solution that uses nested for loops, which works:
f <- function(x, y) sum((x+y)^2) # some placeholder function
out <- matrix(NA, ncol = ncol(X), nrow = ncol(X)) # pre-allocate
for(i in seq_along(X)) {
for(j in seq_along(X)) {
out[i, j] <- f(X[, i], X[, j]) # apply f() to each combination
}
}
Which produces:
> out
[,1] [,2] [,3]
[1,] 422.4447 207.0833 211.4198
[2,] 207.0833 409.1242 218.2430
[3,] 211.4198 218.2430 397.5321
I am currently trying to transition into the tidyverse and would prefer to avoid using for loops. Could someone show me a tidy solution for this situation? Thanks!

You could do
library(tidyverse)
f <- function(x, y) sum((x+y)^2)
X <- data.frame(A=rnorm(100), B=rnorm(100), C=rnorm(100))
as.list(X) %>%
expand.grid(., .) %>%
mutate(out = map2_dbl(Var1, Var2, f)) %>%
as_tibble()

This isn’t a tidyverse solution, but it does avoid using for loops. We use RcppAlgos (I am the author) to generate all pair-wise permutations of columns and apply your custom function to each of these. After that, we coerce to a matrix.
set.seed(42)
X <- data.frame(A=rnorm(100), B=rnorm(100), C=rnorm(100))
library(RcppAlgos)
matrix(permuteGeneral(ncol(X), 2, repetition = TRUE, FUN = function(y) {
sum((X[,y[1]] + X[,y[2]])^2)
}), ncol = ncol(X))
# [,1] [,2] [,3]
# [1,] 429.8549 194.4271 179.4449
# [2,] 194.4271 326.8032 197.2585
# [3,] 179.4449 197.2585 409.6313

Using base R you could do:
set.seed(42)
X <- data.frame(A=rnorm(100), B=rnorm(100), C=rnorm(100))
OUT = diag(colSums((X+X)^2))
OUT[lower.tri(OUT)] = combn(X, 2, function(x) sum(do.call('+', x)^2)) #combn(X,2,function(x)sum(rowSums(x)^2))
OUT[upper.tri(OUT)] = t(OUT)[upper.tri(OUT)] # mirror the lower triangle; also correct for more than 3 columns
OUT
[,1] [,2] [,3]
[1,] 429.8549 194.4271 179.4449
[2,] 194.4271 326.8032 197.2585
[3,] 179.4449 197.2585 409.6313
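Another base R option is outer() with a vectorized wrapper around column indices. This is a sketch rather than a benchmarked solution; pair_f is a made-up name, and f is the placeholder function from the question.

```r
set.seed(42)
X <- data.frame(A = rnorm(100), B = rnorm(100), C = rnorm(100))
f <- function(x, y) sum((x + y)^2)   # placeholder function from the question

# outer() passes whole index vectors, so f must be wrapped to work pair by pair
pair_f <- Vectorize(function(i, j) f(X[[i]], X[[j]]))
out <- outer(seq_along(X), seq_along(X), pair_f)
dimnames(out) <- list(names(X), names(X))
out
```

Unlike the combn() version, this evaluates f for every ordered pair, so it does not rely on f being symmetric.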

Parallelization Apply to parRapply

My data set is:
ll <- matrix(c(5, 6, 60, 60), ncol=2)
And I use the function spDistsN1 from the library "sp" to obtain a distance matrix with apply:
apply(ll, 1, function(x) spDistsN1(as.matrix(ll), x, longlat = T))
But I want to do it with parallelization, so for that:
library(parallel)
ncore <- detectCores()
cl <- makeCluster(ncore)
clusterEvalQ(cl = cl, expr = c(library(sp)))
parRapply(cl = cl, x = ll, FUN = function(x) spDistsN1(as.matrix(ll), x,
longlat = T))
It shows the following error:
Error in checkForRemoteErrors(val) :
4 nodes produced errors; first error: object 'll' not found
How do I fix it?

An easier alternative to using parallel's parApply() or parRapply() is to use future_apply() of the future.apply package (disclaimer: I'm the author) because global variables are automatically exported - no need to worry about parallel::clusterExport() etc. Just use it as you would use apply(), e.g.
library(sp)
library(future.apply)
plan(multisession) ## parallelize on local machine (multiprocess is deprecated in newer versions of future)
ll <- matrix(c(5, 6, 60, 60), ncol = 2)
## Sequentially
y0 <- apply(ll, 1, function(x) spDistsN1(ll, x, longlat = TRUE))
print(y0)
# [,1] [,2]
# [1,] 0.00000 55.79918
# [2,] 55.79918 0.00000
## In parallel
y1 <- future_apply(ll, 1, function(x) spDistsN1(ll, x, longlat = TRUE))
print(y1)
# [,1] [,2]
# [1,] 0.00000 55.79918
# [2,] 55.79918 0.00000
print(identical(y1, y0))
# [1] TRUE
You may also find the blog post future.apply - Parallelize Any Base R Apply Function helpful.

You need to export all variables to workers. See ?parallel::clusterExport.
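A minimal sketch of that fix, under a few assumptions: two local workers, and row_dist as a hypothetical stand-in for spDistsN1 (plain Euclidean distance) so the example runs without the sp package. With sp, you would keep the clusterEvalQ(cl, library(sp)) call from the question and only add the clusterExport() line.

```r
library(parallel)

ll <- matrix(c(5, 6, 60, 60), ncol = 2)

# Hypothetical stand-in for spDistsN1: Euclidean distance from point x to every row of ll.
# Note that it refers to the global object ll, just like the function in the question.
row_dist <- function(x) sqrt(rowSums(sweep(ll, 2, x)^2))

cl <- makeCluster(2)                           # assumption: two local workers
clusterExport(cl, "ll", envir = environment()) # without this: object 'll' not found
res <- parRapply(cl, ll, row_dist)             # returns a plain vector
stopCluster(cl)

d <- matrix(res, nrow = nrow(ll))              # reshape into a distance matrix
d
```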

How to calculate correlation between matrices with different column dimension in R

I have two matrices with same number of rows and different number of columns as:
mat1 <- matrix(rnorm(20), 4, 5)
mat2 <- matrix(rnorm(12), 4, 3)
Since I have the same number of rows, I want to calculate the following correlations between the columns of the matrices:
cor.test(mat1[,1], mat2[,1])
cor.test(mat1[,1], mat2[,2])
cor.test(mat1[,1], mat2[,3])
cor.test(mat1[,2], mat2[,1])
cor.test(mat1[,2], mat2[,2])
cor.test(mat1[,2], mat2[,3])
...........
...........
cor.test(mat1[,5], mat2[,3])
for(i in 1:5){
for(j in 1:3){
pv[i,j] <- cor.test(mat1[, i], mat2[ , j])$p.value
}
}
In the end I want a matrix (5 x 3) or a vector containing the correlation values. Can anyone help?
Can I use this to return both the p-value and the estimate?
FUN <- function(x, y) {
res <- cor.test(x, y, method="spearman", exact=F)
return(list(c = res$estimate, p = res$p.value))
}
r1 <- outer(colnames(mat1), colnames(mat2), Vectorize(function(i,j) FUN(mat1[,i], mat2[,j])$p))
r2 <- outer(colnames(mat1), colnames(mat2), Vectorize(function(i,j) FUN(mat1[,i], mat2[,j])$c))
Thank you.

Why don't you just use the cor function to calculate the Pearson correlation?
set.seed(1)
mat1 <- matrix(rnorm(20), 4, 5)
mat2 <- matrix(rnorm(12), 4, 3)
cor(mat1, mat2)
[,1] [,2] [,3]
[1,] 0.4406765 -0.70959590 0.10731768
[2,] -0.2566199 -0.01588993 -0.63630159
[3,] -0.9813313 0.85082165 -0.77172317
[4,] 0.6121358 -0.38564314 0.87077092
[5,] -0.6897573 0.66272015 -0.08380553
To double check,
> col_1 <- 3
> col_2 <- 2
# all.equal is used to compare numeric equality where `==` is discouraged
> all.equal(cor(mat1, mat2)[col_1, col_2], cor(mat1[,col_1], mat2[,col_2]))
[1] TRUE
They are equal!
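If the p-values are needed as well, cor() alone is not enough, since it only returns the estimates. One possible sketch uses outer() over column indices (the matrices here have no column names, so colnames() would return NULL). The names idx1, idx2, est and pval are made up for illustration.

```r
set.seed(1)
mat1 <- matrix(rnorm(20), 4, 5)
mat2 <- matrix(rnorm(12), 4, 3)

idx1 <- seq_len(ncol(mat1))
idx2 <- seq_len(ncol(mat2))

# 5 x 3 matrix of correlation estimates (unname() drops the "cor" label)
est <- outer(idx1, idx2, Vectorize(function(i, j)
  unname(cor.test(mat1[, i], mat2[, j])$estimate)))

# 5 x 3 matrix of the corresponding p-values
pval <- outer(idx1, idx2, Vectorize(function(i, j)
  cor.test(mat1[, i], mat2[, j])$p.value))

isTRUE(all.equal(est, cor(mat1, mat2)))  # estimates agree with cor()
```

This runs cor.test() twice per pair, which is fine for small matrices; for larger ones it may be worth computing each test once and extracting both values.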

An alternative, slightly easier to understand than loops in my opinion:
sapply(
data.frame(mat1),
function(x) Map(function(a,b) cor.test(a,b)$p.value,
list(x),
as.data.frame(mat2))
)
Result:
# X1 X2 X3 X4 X5
#[1,] 0.7400541 0.8000358 0.5084979 0.4441933 0.9104712
#[2,] 0.2918163 0.2764817 0.956807 0.6072979 0.4395218
#[3,] 0.2866105 0.4095909 0.5648188 0.1746428 0.9125866

I suppose you would like to do it without for loops. With base functions, here is the double apply approach:
apply(mat1, 2, function(col_mat1){
apply(mat2, 2, function(col2, col1) {
cor.test(col2, col1)$p.value
}, col1=col_mat1)
})
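The outer apply iterates over the columns of mat1 and supplies one side of cor.test(). The inner one does the same over the columns of mat2 and fills the second side of cor.test(). In practice, apply is replacing the for loops. The result has one row per column of mat2 and one column per column of mat1, so wrap it in t() if you want a 5 x 3 matrix.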

I think all you need is to define your matrix first
mat_cor <- matrix(nrow=ncol(mat1), ncol=ncol(mat2))
for(i in 1:5)
{
for(j in 1:3)
{
mat_cor[i,j] <- cor.test(mat1[, i], mat2[ , j])$p.value
}
}
Output
mat_cor
[,1] [,2] [,3]
[1,] 0.9455569 0.8362242 0.162569342
[2,] 0.7755360 0.9849619 0.775006329
[3,] 0.8799139 0.8050564 0.001358697
[4,] 0.1574388 0.1808167 0.618624825
[5,] 0.8571844 0.8897125 0.879818822

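You can try something like this
pv <- c()
for(i in 1:dim(mat1)[2]){
for(j in 1:dim(mat2)[2]){
pv <- c(pv, cor.test(mat1[, i], mat2[ , j])$estimate) # append to pv, not to the function c
}
}
# values were collected row by row (j varies fastest), so fill the matrix by row
pv <- matrix(pv, nrow = dim(mat1)[2], ncol = dim(mat2)[2], byrow = TRUE)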

Finding dimensional index in a multi-dimensional array in R

I am looking at, say, a 3-dimensional array M: M <- array(dim = c(3,3,3))
I want to find an efficient way to populate M with the following rule:
M[i,j,k] = i/10 + j^2 + sqrt(k),
ideally without having to write a loop with a for statement.
For clarification, there is a simple way of accomplishing this if M were 2-dimensional. If I wanted to have
M[i,j] = i/10 + j^2,
then I could just do
M<-row(M)/10 + col(M)*col(M)
Is there something equivalent for 3-or-higher dimensional arrays?

@James's answer is better, but I think the narrow answer to your question (multidimensional equivalent of row()/col()) is slice.index ...
M<- array(dim=c(3,3,3))
slice.index(M,1)/10+slice.index(M,2)^2+sqrt(slice.index(M,3))
It would be a good idea if someone (I or someone else) posted a suggestion on the r-devel list to make slice.index a "See also" entry on ?row/?col ...
Alternatively (similar to @flodel's new answer):
d <- do.call(expand.grid,lapply(dim(M),seq)) ## create data.frame of indices
v <- with(d,Var1/10+Var2^2+sqrt(Var3)) ## default names Var1, ... Varn
dim(v) <- dim(M) ## reshape into array

How about using nested outers?
outer(1:3/10,outer((1:3)^2,sqrt(1:3),"+"),"+")
, , 1
[,1] [,2] [,3]
[1,] 2.1 5.1 10.1
[2,] 2.2 5.2 10.2
[3,] 2.3 5.3 10.3
, , 2
[,1] [,2] [,3]
[1,] 2.514214 5.514214 10.51421
[2,] 2.614214 5.614214 10.61421
[3,] 2.714214 5.714214 10.71421
, , 3
[,1] [,2] [,3]
[1,] 2.832051 5.832051 10.83205
[2,] 2.932051 5.932051 10.93205
[3,] 3.032051 6.032051 11.03205
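As a quick check, the nested outer() call gives the same array as the slice.index() approach from the other answer. The names via_slice and via_outer are just for illustration.

```r
M <- array(dim = c(3, 3, 3))

# Index arrays along each dimension, combined with the rule i/10 + j^2 + sqrt(k)
via_slice <- slice.index(M, 1) / 10 + slice.index(M, 2)^2 + sqrt(slice.index(M, 3))

# Nested outer: the inner call builds j^2 + sqrt(k), the outer call adds i/10
via_outer <- outer(1:3 / 10, outer((1:3)^2, sqrt(1:3), "+"), "+")

isTRUE(all.equal(via_slice, via_outer))
via_outer[2, 3, 1]  # 0.2 + 9 + 1 = 10.2
```

The order of the nesting matters: the first argument of the outermost call becomes the first dimension, so i must go outside and k innermost.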

You can also use arrayInd:
M <- array(dim = c(3, 3, 3))
foo <- function(dim1, dim2, dim3) dim1/10 + dim2^2 + sqrt(dim3)
idx <- arrayInd(seq_along(M), dim(M), useNames = TRUE)
M[] <- do.call(foo, as.data.frame(idx))
I feel this approach has potential for less typing as the number of dimensions increases.

Doing it from the "ground up" so to speak.
i <- rep(1:3, times=3*3)
j <- rep(1:3 , times= 3, each=3)
k <- rep(1:3 , each= 3*3)
M <- array( i/10 + j^2 + sqrt(k), c(3, 3, 3))
M