Convert R apply statement to lapply for parallel processing

I have the following R "apply" statement:
for (i in 1:NROW(dataframe_stuff_that_needs_lookup_from_simulation))
{
  matrix_of_sums[,i] <-
    apply(simulation_results[, colnames(simulation_results) %in%
      dataframe_stuff_that_needs_lookup_from_simulation[i,]], 1, sum)
}
So, I have the following data structures:
simulation_results: A matrix with column names that identify every possible piece of desired simulation lookup data for 2000 simulations (rows).
dataframe_stuff_that_needs_lookup_from_simulation: Contains, among other items, fields whose values match the column names in the simulation_results data structure.
matrix_of_sums: When the function is run, a 2000-row x 250,000-column (# of simulations x items being simulated) structure meant to hold the summed simulation results.
So, for each of the 250,000 rows in the data frame, the apply call looks up the matching simulation columns, computes the sum for every simulation row, and stores the result in the matrix_of_sums structure.
Unfortunately, this processing takes a very long time. I have explored rowSums as an alternative, which cut the processing time in half, but I would like to try multi-core processing to see if that reduces it even more. Can someone help me convert the code above from "apply" to "lapply"?
Thanks!

With base R parallel, try
library(parallel)
cl <- makeCluster(detectCores())
# PSOCK workers need the data shipped to them explicitly
clusterExport(cl, c("simulation_results", "dataframe_stuff_that_needs_lookup_from_simulation"))
# each list element is the vector of row sums for one item
matrix_of_sums <- parLapply(cl, 1:nrow(dataframe_stuff_that_needs_lookup_from_simulation), function(i)
  rowSums(simulation_results[, colnames(simulation_results) %in%
    dataframe_stuff_that_needs_lookup_from_simulation[i,]]))
stopCluster(cl)
# bind the per-item row-sum vectors into the final matrix
ans <- Reduce("cbind", matrix_of_sums)
You could also try foreach %dopar%
library(doParallel) # will load parallel, foreach, and iterators
cl <- makeCluster(detectCores())
registerDoParallel(cl)
matrix_of_sums <- foreach(i = 1:NROW(dataframe_stuff_that_needs_lookup_from_simulation)) %dopar% {
  rowSums(simulation_results[, colnames(simulation_results) %in%
    dataframe_stuff_that_needs_lookup_from_simulation[i,]])
}
stopCluster(cl)
ans <- Reduce("cbind", matrix_of_sums)
I wasn't quite sure how you wanted your output at the end, but it looks like you're doing a cbind of each result. Let me know if you're expecting something else, however.
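A small aside that goes beyond the original answer: with roughly 250,000 columns, Reduce("cbind", ...) re-copies the growing matrix on every step, so binding the whole list in a single call is usually much faster:
# one-shot bind of the list of row-sum vectors into the result matrix
ans <- do.call(cbind, matrix_of_sums)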

Without really having any applicable or sample data to go off of... the process would look like this:
Create a holding matrix (matrix_of_sums)
Loop by row through the variable table (dataframe_stuff_that_needs_lookup_from_simulation)
Find the matching column indices within the simulation results (simulation_results)
Bind the rowSums into the holding matrix (matrix_of_sums)
I recreated a sample set, which is meaningless but produces equivalent results and should work with your data:
# Loop in parallel; each call returns a vector of row sums
# (a global assignment with `<<-` would not propagate back from the
#  forked workers, so results are returned and bound afterwards)
msums_list <- parallel::mclapply(1:nrow(ts_df), function(i){
  # Store the row to its own variable for ease
  d <- ts_df[i,]
  # rowSums over the matching simulation columns
  rowSums(
    sim_df[, which(colnames(sim_df) %in% colnames(d))]
  )
}, mc.cores = parallel::detectCores(), mc.allow.recursive = TRUE)
# Holding matrix which will be our end-goal
msums <- do.call(cbind, msums_list)

Related

Using "sample" within mclapply in R not working properly

I'm trying to run multiple iterations of a function using a different subset of my dataframe each time. In reality the function takes a very long time, so I want to split the iterations across multiple cores using mclapply. For each iteration I'm using sample to randomly select a subset of the dataframe, and this is inside the function I have written to give to mclapply. However, the results of each of the iterations in the output list are identical, suggesting that mclapply is not re-running the sample lines each time. This must be something to do with how I have written the code; any ideas where I have gone wrong?
Here is a reproducible example of a small dataset that runs quickly. You will notice that the 10 iterations in the d.val.all output list are identical, which is not what I am after.
library(bipartite)
library(doBy)
library(parallel)
# create dummy data
ecto.matrix1=data.frame(replicate(10,sample(0:80,81,rep=TRUE)),Species.mix.90=c(sample(c("R","M","S","B"),81,rep=TRUE)))
# set up the function
funct.resample.d <- function(i) {
  RedSites <- row.names(ecto.matrix1)[ecto.matrix1$Species.mix.90=="R"]
  MountainSites <- row.names(ecto.matrix1)[ecto.matrix1$Species.mix.90=="M"]
  randomSilverSites <- sample(row.names(ecto.matrix1)[ecto.matrix1$Species.mix.90=="S"],8,replace=F)
  randomBlackSites <- sample(row.names(ecto.matrix1)[ecto.matrix1$Species.mix.90=="B"],8,replace=F)
  resampledSites <- c(RedSites,MountainSites,randomSilverSites,randomBlackSites) # make vector of the site names
  matrix=ecto.matrix1[resampledSites,] # select only those rows from the resampled row names
  matrix1 = matrix[,colSums(matrix[,-c(ncol(matrix))]) > 0] # drop cols that sum to 0
  matrix2=summaryBy(matrix1[,-c(ncol(matrix1))]~Species.mix.90,data=matrix1,FUN=sum)
  for (col in 1:ncol(matrix2)){
    colnames(matrix2)[col] <- sub(".sum", "", colnames(matrix2)[col]) # remove the sum bit from the col names
  }
  row.names(matrix2)<-matrix2$Species.mix.90 # make row names
  matrix2=subset(matrix2, select=-c(Species.mix.90)) # drop host col
  d.val <- dfun(matrix2)$dprime
}
# run mclapply
reps=c(1:10)
d.val.all <- mclapply(reps, funct.resample.d, mc.cores = 10)
In case anyone else is having similar issues, I figured out that the problem was with the summaryBy function rather than sample. I replaced summaryBy with aggregate, and the randomization worked fine.
matrix2=aggregate(. ~ Species.mix.90, matrix1, sum)
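As an aside that goes beyond the original post: if you also want the parallel draws to be reproducible from run to run, the usual approach with mclapply is to switch to the L'Ecuyer-CMRG generator and set a seed before the call (a minimal sketch, reusing the objects defined above):
library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(123)  # each forked job then gets its own reproducible RNG stream
d.val.all <- mclapply(1:10, funct.resample.d, mc.cores = 10)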

parallel programming for a function taking two arguments from a list of data frames using R

I have a function called DTW in a similarity measure package. It takes two matrices or data frames as its arguments and returns the dynamic time warping distance. Those data frames are the longitudes and latitudes of a trajectory.
My program looks like this, and all the data frames (df1, df2, df3, and so on) are available:
distance <- function(arg1,arg2) {
  DTW(arg1, arg2)
}
for(i in 1:length(LIST)){
  for(j in 1:length(LIST)){
    a <- get(paste0("df",i))
    b <- get(paste0("df",j))
    ddist[i,j] <- distance(a,b)
    print(ddist)
  }
}
I am making a matrix ddist into which all the values returned by the distance function are inserted. The program is working fine. I want to make it faster using parallel programming, such as parApply or parLapply.
Here is a simple method to give you an idea of how to make it parallel
k <- length(LIST)
ddist <- matrix(0,k,k)
library("doParallel")
cl <- makeCluster(4, outfile='')
registerDoParallel(cl)
for(i in 1:k) {
  a <- get(paste0("df",i))
  ddist[i,] = foreach(j = 1:k, .combine='cbind', .export=paste0("df",1:k)) %dopar% {
    b <- get(paste0("df",j))
    distance(a,b)
  }
}
stopCluster(cl)
Having said that, here are some things to evaluate:
Only use parallel execution if the distance function takes more than about 2 seconds per call.
Storing each data frame as df1, df2, etc. may not be a good idea; store them in a list as df[[1]], df[[2]], which is better than using the get function (see the sketch after this list).
If k is very large, the time taken to transfer the exported df1, df2, etc. to the workers becomes significant, so try to hit the performance sweet spot across various iteration counts.
You can look at the data.table option, which allows in-place edits; using it instead of ddist might be faster.
If this code is called within a function, then you might also need to export ddist, i.e. .export = c("ddist", paste0("df", 1:k)).
Change the "4" in makeCluster to choose the number of cores you want; as a rule of thumb, keep it at detectCores() - 1.
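A minimal sketch of the list-based storage suggested in the second point (the list name dfs is mine, not from the original answer):
# collect the existing df1, df2, ... into one list up front
dfs <- lapply(1:k, function(i) get(paste0("df", i)))
# the loop can then index the list directly instead of calling get() on the workers
for (i in 1:k) {
  ddist[i,] <- foreach(j = 1:k, .combine = 'c', .export = c("dfs", "distance")) %dopar% {
    # the package that provides DTW may also need to be listed via .packages
    distance(dfs[[i]], dfs[[j]])
  }
}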

clusterMap split list of data.frames

I'm working with two lists of data.frames and currently run something similar to this (simplified version of what I'm doing):
df1 <- data.frame("a","a1","L","R","b","c",1,2,3,4)
df2 <- data.frame("a","a1","L","R","b","c",4,4,4,4,4,44)
df3 <- data.frame(7,7,7,7)
df4 <- data.frame(5,5,5,5,9,9)
L1 <- list(df1,df2)
L2 <- list(df3,df4)
myfun <- function(x,y) {
  difa = rowSums(abs(x[c(T,F)] - x[c(F,T)]))
  difb = sum(abs(as.numeric(y[-c(1:6)])[c(T,F)] - as.numeric(y[-c(1:6)])[c(F,T)]))
  diff <- difa + difb
  return(diff)
}
output1 <- mapply(myfun, x = L2, y = L1)
The same number of data frames is in each list, and each data frame from one list corresponds to a data frame in the other list. The data frames in one list contain a single row, while those in the second list contain a dynamic number of rows; hence the use of sum and rowSums. The number of numeric columns is also dynamic but always the same between corresponding data frames.
I'm looking to use parallel processing to speed up the computation when dealing with 1-10 million dataframes per list. I tried the following:
library(parallel)
if(detectCores() > 1) {no_cores <- detectCores() - 1} else {no_cores <- 1}
if(.Platform$OS.type == "unix") {ptype <- "FORK"} else {ptype <- "PSOCK"}
cl <- makeCluster(no_cores, type = ptype)
clusterMap(cl, myfun, x = L2, y = L1)
stopCluster(cl)
However, due to the significant amount of data I'm using, this quickly fills up memory. I assume that is due to loading the entire lists of data frames on each worker? I'm new to parallel processing in R and have read that splitting the data into chunks according to the number of available cores is required for some parallel functions that don't do it automatically, so I tried the following, which does not work:
library(parallel)
if(detectCores() > 1) {no_cores <- detectCores() - 1} else {no_cores <- 1}
if(.Platform$OS.type == "unix") {ptype <- "FORK"} else {ptype <- "PSOCK"}
cl <- makeCluster(no_cores, type = ptype)
output1 <- clusterMap(cl, myfun, x = split(L2, ceiling(seq_along(L2)/no_cores)), y = split(L1, ceiling(seq_along(L1)/no_cores)))
stopCluster(cl)
Can someone help a newbie out? Most of the information I've been reading uses parApply/parLapply/etc. I was able to use mcmapply, but since it relies on forking, I cannot use it everywhere. My code has to run on both unix and windows systems; hence my test of OS.type to decide whether to fork.
UPDATE: So I think it is working correctly in the sense that it's handing out chunks to different workers, but the data type is not playing nice with binary operators inside the workers. The issue appears to be that each chunk is now a list of lists of data frames, which is treated as non-numeric on the workers.
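One way to deal with that (my own sketch, not from the post) is to give clusterMap a small wrapper that applies myfun pairwise within each chunk, so the workers receive lists of data frames and the result can be flattened afterwards:
# hypothetical wrapper: each worker receives one chunk of each list
chunk_fun <- function(x, y) mapply(myfun, x = x, y = y)
chunks_x <- split(L2, ceiling(seq_along(L2) / no_cores))
chunks_y <- split(L1, ceiling(seq_along(L1) / no_cores))
# with a PSOCK cluster, myfun would also need to be sent over with clusterExport()
output1 <- unlist(clusterMap(cl, chunk_fun, x = chunks_x, y = chunks_y), use.names = FALSE)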

incorrect number of dimensions error using parLapply

I am trying to parallelize a function on the 4 cores of my machine using parLapply.
My function defines two nested loops which are meant to fill out some empty columns of a predefined matrix M.
However, when I run the code below, I obtain the following error:
2 nodes produced errors; first error: incorrect number of dimensions
Code:
require("parallel")
TheData<-list(E,T) # list of 2 matrices of different dimensions, T is longer and wider than E
myfunction <- function(TheData) {
  for (k in 1:length(TheData[[1]][,1])) {
    distance<-matrix(,nrow=length(TheData[[1]][,1]),ncol=1)
    for (j in 1:length(TheData[[2]][,1])) {
      distance[j]<-sqrt((as.numeric(TheData[[2]][j,1])-as.numeric(TheData[[1]][k,2]))^2+(as.numeric(TheData[[2]][j,2])-as.numeric(TheData[[1]][k,1]))^2)
    }
    index<-which(distance == min(distance))
    M[k,4:9]<-c(as.numeric(TheData[[2]][index,1]),as.numeric(TheData[[2]][index,2]),as.numeric(TheData[[2]][index,3]),as.numeric(TheData[[2]][index,4]),as.numeric(TheData[[2]][index,5]),as.numeric(TheData[[2]][index,6]))
    rm(distance)
    gc()
  }
}
n_cores <- 4
Cl = makeCluster(n_cores)
Results <- parLapplyLB(Cl, TheData, myfunction)
# I also tried: Results <- parLapply(Cl, TheData, myfunction)
In your example, parLapply is iterating over a list of matrices, and passing those matrices as the argument to "myfunction". However, "myfunction" seems to expect its argument to be a list of two matrices, and so an error occurs. I can reproduce that error with:
> E <- matrix(0, 4, 4)
> E[[1]][,1]
Error in E[[1]][, 1] : incorrect number of dimensions
I'm not sure what you're really trying to do, but with the current implementation of "myfunction", I would expect you to call parLapply with a list of lists containing two matrices, such as:
TheDataList <- list(list(A,B), list(C,D), list(E,F), list(G,H))
Passing this as the second argument to parLapply would result in "myfunction" being called four times, each time with a list containing two matrices.
But your example has another problem. It looks like you expect parLapply to modify the matrix "M" as a side-effect, but it can't. I think you should change "myfunction" to return a matrix. parLapply will return the matrices in a list which you can then bind together into the desired result.
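For example (a minimal sketch of that pattern, assuming "myfunction" were rewritten to build and return its block of rows instead of writing into "M"):
# each call returns a small matrix; bind the pieces together on the master, e.g. by rows
resultList <- parLapply(Cl, TheDataList, myfunction)
M <- do.call(rbind, resultList)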
Update
From your comment, I now believe that you essentially want to parallelize "myfunction". Here's my attempt to do that:
library(parallel)
cl <- makeCluster(4)
myfunction <- function(Exy) {
  iM <- integer(nrow(Exy))
  for (k in 1:nrow(Exy)) {
    distance <- sqrt((Txy[,1] - Exy[k,2])^2 + (Txy[,2] - Exy[k,1])^2)
    iM[k] <- which.min(distance)
  }
  iM
}
# Random example data for testing
T <- matrix(rnorm(150), 10)
E <- matrix(rnorm(120), 10)
# Only export the first two columns of T to the workers
Txy <- T[,1:2]
clusterExport(cl, c('Txy'))
# Parallelize "myfunction" by calling it in parallel on block rows of "E".
ExyList <- parallel:::splitRows(E[,1:2], length(cl))
iM <- do.call('c', clusterApply(cl, ExyList, myfunction))
# Update "M" using data from "T" indexed by "iM"
M <- matrix(0, nrow(E), 9) # more fake data: one row per row of E, as in the original loop
for (k in seq_along(iM)) {
  M[k, 4:9] <- T[iM[k], 1:6]
}
print(M)
stopCluster(cl)
Notes:
I vectorized myfunction which should make it more efficient. Hopefully it's nearly correct.
I also modified myfunction to return a vector of indices into "T" to reduce the amount of data sent back to the master.
The splitRows function from the parallel package is used to split the first two columns of "E" into a list of submatrices.
splitRows isn't exported by parallel, so I used ':::'. If this offends you, then use the splitRows function from snow which is exported.
The first two columns of "T" are exported to each of the workers since each task requires the entire first two columns.
clusterApply is used rather than parLapply since we need to iterate over submatrices of E.

Need help on combine function in a Parallel Simulation study using doMC

I want to ask for some help on writing a combine function for foreach(). Consider the function below:
library(mvtnorm)
library(doMC)
mySimFunc <- function(){
  myNum <- runif(1)
  myVec <- rnorm(10)
  myMat <- rmvnorm(5, rep(0, 3), diag(3))
  myListRslt <- list("myNum" = myNum, "myVec" = myVec, "myMat" = myMat)
  return (myListRslt)
}
Now I'd like to run the code above 1000 times using foreach() %dopar%, and in each iteration I'd like to:
return myNum as is
get average of myVec and return it
get colMeans() of myMat and return it.
I'd like foreach() %dopar% to return a final list including:
a vector of length 1000 including 1000 myNum each corresponding to an iteration
a vector of length 1000 including 1000 average of myVec in each iteration
a matrix with 1000 rows where each row includes colMeans of myMat in that iteration
My Ideal solution
My ideal solution is to find a way to make foreach() act exactly like a for loop, so that I can simply define:
myNumRslt <- NULL
myVecRslt <- NULL
myMatRslt <- NULL
# and then simply aggregate the result of each iteration into the variables above:
foreach(i = 1:1000) %dopar% {
  rslt <- mySimFunc()
  myNumRslt <- c(myNumRslt, rslt$myNum)
  myVecRslt <- c(myVecRslt, mean(rslt$myVec))
  myMatRslt.tmp <- colMeans(rslt$myMat)
  myMatRslt <- rbind(myMatRslt, myMatRslt.tmp)
}
BUT, unfortunately, it seems that it's not possible to do that with foreach(), so I think the only solution is to write a combine function that performs an aggregation similar to the one above.
Challenge
1) How could I write a combine function that returns what I explained above?
2) When we use %dopar% (say, with the doMC package), does doMC distribute each iteration to a CPU core, or does it go further and divide each iteration into smaller pieces and distribute those?
3) Is there any better (more efficient) way than using doMC and foreach() ?
Ideas
In this question, Brian mentioned a brilliant way to deal with lists containing numeric values. In my case, I have numeric values as well as vectors and matrices, and I don't know how to extend Brian's idea to my case.
Thanks very much for your help.
Edit
Cleaned up, generalizable solution using .combine:
# Modify the function to include aggregation
mySimFunc2 <- function(){
  myNum <- runif(1)
  myVec <- mean(rnorm(10))
  myMat <- colMeans(rmvnorm(5, rep(0, 3), diag(3)))
  myListRslt <- list("myNum" = myNum, "myVec" = myVec, "myMat" = myMat)
  return (myListRslt)
}
# .combine function
MyComb1 <- function(...) {
  lst <- list(...)
  vec <- sapply(1:length(lst), function (i) return(lst[[i]][[1]]))
  vecavg <- sapply(1:length(lst), function (i) return(lst[[i]][[2]]))
  colmeans <- t(sapply(1:length(lst), function (i) return(lst[[i]][[3]])))
  final <- list(vec, vecavg, colmeans)
  names(final) <- c("vec", "vecavg", "colmeans")
  return(final)
}
library(doParallel)
cl <- makeCluster(3) # set cores
registerDoParallel(cl)
foreach(i = 1:1000, .export = c("mySimFunc2", "MyComb1"), .combine = MyComb1,
        .multicombine = TRUE, .maxcombine = 1000, .packages = c("mvtnorm")) %dopar% {mySimFunc2()}
You should now have a list output containing the three desired objects, which I've named vec, vecavg, and colmeans respectively. Note that you must set .maxcombine to the number of iterations if it is greater than 100.
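For instance (a hypothetical usage sketch; the name simResults is mine, since the original snippet does not capture the foreach return value):
simResults <- foreach(i = 1:1000, .export = c("mySimFunc2", "MyComb1"), .combine = MyComb1,
                      .multicombine = TRUE, .maxcombine = 1000, .packages = "mvtnorm") %dopar% { mySimFunc2() }
length(simResults$vec)     # 1000 values of myNum
length(simResults$vecavg)  # 1000 means of myVec
dim(simResults$colmeans)   # 1000 x 3 matrix of column means
stopCluster(cl)            # shut the cluster down when finished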
As a side note, it does not make sense to parallelize for this example task, although I'm guessing the real task may be more complex.
