I'd like input on how my code below is structured. I'd like to know whether it needs to be organized differently to execute faster, and specifically whether I need to be using foreach and %dopar% differently in the nested loops. Currently, the inner loop is the bulk of the work (ddply with 1-8 breakdown variables, each of which has 10-200 levels), and that's what I have running in parallel. I left out the code details for simplicity.
Any ideas? My code, as organized below, does work, but it takes a few hours on a 6-core, 41 GB machine. The dataset isn't that large (< 20k records).
for (m in 1:length(Predictors)) { # has up to three elements in the vector
  # construct the dataframe based on the specified predictor
  # subset the original dataframe based on the breakdown variables, outcome, predictor and covariates
  for (l in 1:nrow(pairwisematrixReduced)) { # this has 1-6 rows; subset based on correct comparison groups
    # some code here
    cl <- makeCluster(detectCores())
    registerDoParallel(cl)
    for (i in 1:nrow(subsetting_table)) { # this table has about 50 rows
      # this uses the columns specified by k in the glm; the prior columns will be used as breakdown variables
      # up to 10 covariates
      result[[length(result) + 1]] <- foreach(k = 11:17, .packages = c('plyr', 'reshape2', 'fastmatch')) %dopar% {
        ddply(
          df,
          b, # vector of breakdown variables
          function(x) {
            # run a GLM and manipulate the output
          }, # close the anonymous function
          .parallel = TRUE) # close ddply
      } # close k loop -- set of covariates
    } # close i loop -- subsetting table
  } # close l loop -- group combinations
} # close m loop -- this is the pairwise predictor matrix
stopCluster(cl)
result <- unlist(result, recursive = FALSE)
tmp2 <- do.call(rbind.fill, result)
Copied out of vignette("nested"), section 3, "Using %:% with %dopar%":
When parallelizing nested for loops, there is always a question of which loop to parallelize. The standard advice is...
You are also using foreach %dopar% along with ddply and .parallel = TRUE. With a six-core processor (and presumably hyperthreading), the foreach block would start 12 worker environments, and then ddply would start 12 more within each of those, for 144 simultaneous environments. The foreach should be changed to %do% to be consistent with your question's text of running the inner loop in parallel. Or, to make it cleaner, change both to foreach and use %dopar% for one loop and %:% for the other.
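A rough sketch of what that second suggestion could look like for the inner part of the question's code (the m and l loops stay as they are; the data-preparation steps that were left out of the question are only indicated by comments, so treat this as a skeleton rather than a drop-in replacement):

library(doParallel)

cl <- makeCluster(detectCores())
registerDoParallel(cl)

# %:% turns the i and k loops into one flat set of (i, k) tasks, so %dopar%
# can spread all of them over the workers; ddply itself then runs sequentially
# inside each worker (.parallel = FALSE) instead of spawning more workers.
result <- foreach(i = 1:nrow(subsetting_table), .combine = 'c') %:%
  foreach(k = 11:17, .packages = c('plyr', 'reshape2', 'fastmatch')) %dopar% {
    # ... build df and b for this (i, k) combination, as in the question ...
    ddply(df, b, function(x) {
      # run a GLM and manipulate the output
    }, .parallel = FALSE)
  }

stopCluster(cl)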
I'm trying to run multiple iterations of a function using a different subset of my dataframe each time. In reality the function takes a very long time, so I want to split the iterations across multiple cores using mclapply. For each iteration I'm using sample to randomly select a subset of the dataframe, and this is inside the function I have written to give to mclapply. However, the results of each of the iterations in the output list are identical, suggesting that mclapply is not re-running the sample lines each time. This must be something to do with how I have written the code; any ideas where I have gone wrong?
Here is a reproducible example with a small dataset that runs quickly. You will notice that the 10 iterations in the d.val.all output list are identical, which is not what I am after.
library(bipartite)
library(doBy)
library(parallel)
# create dummy data
ecto.matrix1=data.frame(replicate(10,sample(0:80,81,rep=TRUE)),Species.mix.90=c(sample(c("R","M","S","B"),81,rep=TRUE)))
# set up the function
funct.resample.d <- function(i) {
  RedSites <- row.names(ecto.matrix1)[ecto.matrix1$Species.mix.90 == "R"]
  MountainSites <- row.names(ecto.matrix1)[ecto.matrix1$Species.mix.90 == "M"]
  randomSilverSites <- sample(row.names(ecto.matrix1)[ecto.matrix1$Species.mix.90 == "S"], 8, replace = F)
  randomBlackSites <- sample(row.names(ecto.matrix1)[ecto.matrix1$Species.mix.90 == "B"], 8, replace = F)
  resampledSites <- c(RedSites, MountainSites, randomSilverSites, randomBlackSites) # make vector of the site names
  matrix <- ecto.matrix1[resampledSites, ] # select only those rows from the resampled row names
  matrix1 <- matrix[, colSums(matrix[, -c(ncol(matrix))]) > 0] # drop cols that sum to 0
  matrix2 <- summaryBy(matrix1[, -c(ncol(matrix1))] ~ Species.mix.90, data = matrix1, FUN = sum)
  for (col in 1:ncol(matrix2)) {
    colnames(matrix2)[col] <- sub(".sum", "", colnames(matrix2)[col]) # remove the sum bit from the col names
  }
  row.names(matrix2) <- matrix2$Species.mix.90 # make row names
  matrix2 <- subset(matrix2, select = -c(Species.mix.90)) # drop host col
  d.val <- dfun(matrix2)$dprime
}
# run mclapply
reps=c(1:10)
d.val.all <- mclapply(reps, funct.resample.d, mc.cores = 10)
In case anyone else is having similar issues, I figured out that the problem was with the summaryBy function rather than sample. I replaced summaryBy with aggregate, and the randomization worked fine.
matrix2=aggregate(. ~ Species.mix.90, matrix1, sum)
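A separate aside, not the cause of the problem above (which was summaryBy): when calling sample() inside mclapply, it is worth switching to the parallel-safe RNG so that the draws in the forked workers are independent and the overall set of results is reproducible for a given seed. A minimal sketch, assuming a Unix-alike where mclapply actually forks:

library(parallel)
# L'Ecuyer-CMRG lets mclapply give each forked job its own random stream,
# and set.seed() makes the whole set of draws reproducible.
RNGkind("L'Ecuyer-CMRG")
set.seed(42)
d.val.all <- mclapply(1:10, funct.resample.d, mc.cores = 10)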
I'm using the R package QCA (https://cran.r-project.org/web/packages/QCA/index.html) for Qualitative Comparative Analysis. I want to be able to try out many different combinations, which is taking a very long time. On my faster CPU, trying all the options that I am interested in takes a little over 24 hours. R seems to be using only one of the cores available on my CPU and requires relatively little memory (just under 100MB). I am hoping someone has a good idea on how to speed up this process, perhaps through parallelization?
Here's what I'm doing:
Loading my data set (data), which is a CSV file with the outcome condition and all the options for my causal conditions. The causal conditions are in 4 groups A, B, C, and D. There are approx. 200 observations, i.e., rows in the data set.
Starting a log file with sink()
Creating a series of nested loops to generate each combination of causal conditions I want to examine.
Running the minimize() function within the nested loops. Specifically, it looks like this:
for (a in causal_condition_group_A) {
  for (b in causal_condition_group_B) {
    for (c in causal_condition_group_C) {
      for (d in causal_condition_group_D) {
        minimize(data, outcome = my_outcome, conditions = paste0(a, ", ", b, ", ", c, ", ", d), ...)
      }
    }
  }
}
The minimize function's conditions argument essentially takes a character vector as input and this is all my nested loops are creating. For example, a random conditions argument might read:
conditions = "causal_condition_A_87, causal_condition_B_2, causal_condition_C_42, causal_condition_D_219"
I tried several different things in an attempt to parallelize this approach, but so far I have not been successful. I experimented with both parSapply and foreach %dopar%, but I am running into various problems: either I can't get the actual parallelization to work properly, or - in some of my toy experiments - I have trouble logging all the output, which is essential.
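For reference, the kind of cluster setup I have been experimenting with looks roughly like this (simplified); pointing the outfile argument of makeCluster() at a log file is one thing I have been trying for the logging side, since sink() in the master session does not capture output produced on the workers:

library(doParallel)
# Redirect stdout/stderr from the PSOCK workers to a log file (file name arbitrary);
# outfile = "" would instead print worker output to the master's console.
cl <- makeCluster(detectCores() - 1, type = "PSOCK", outfile = "qca_worker_log.txt")
registerDoParallel(cl)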
Please let me know if I can provide additional information to help clear things up! Thanks for your help!
EDIT:
I was able to create a working foreach() loop based on @HenrikB's advice, but I'm running into a different problem now.
Here's my test solution so far. It includes one less nested loop than I want in the final solution, but that's not important for now:
# SET QCA OUTCOME CONDITION
outcome = "c_outcome"
# LOAD LIBRARIES
library(doParallel)
library(QCA)
# CREATE CLUSTER FOR PARALLELIZATION
cores <- detectCores()
cl <- makeCluster(cores[1]-1, type = "PSOCK", outfile="")
registerDoParallel(cl)
# LOAD AND SET UP DATA
outcomecond <- read.csv("outcomes.csv", header=TRUE, row.names="ID")
causalcond <- read.csv("causal_conditions.csv", header=TRUE, row.names="ID")
data <- cbind(outcomecond[outcome], causalcond)
temp <- data[!is.na(data[outcome]), ] #keep only rows where outcome is not NA
# EXPORT CURRENT DATASET TO CLUSTERS
clusterExport(cl, "temp")
# CREATE CAUSAL CONDITION LISTS
causal_condition_group_A <- colnames(causalcond[, 1:99])
causal_condition_group_B <- colnames(causalcond[, 100:141])
causal_condition_group_C <- colnames(causalcond[, 142:183])
# EXPORT LIBRARIES TO CLUSTERS
clusterEvalQ(cl, library(doParallel))
clusterEvalQ(cl, library(QCA))
# START TIMER
start.time <- Sys.time()
# THREE NESTED FOREACH LOOPS (ONE FOR EACH CAUSAL CONDITION GROUP)
x <-
  foreach(c = causal_condition_group_C, .combine = 'cbind') %:%
  foreach(b = causal_condition_group_B, .combine = 'cbind') %:%
  foreach(a = causal_condition_group_A, .combine = 'cbind') %dopar% {
    tryCatch({
      minimize(temp, outcome = outcome, conditions = paste0(a, ",", b, ",", c),
               n.cut = 1, incl.cut = 0.400, include = "?", details = TRUE, use.letters = TRUE)
    }, error = function(e) { cat("ERROR :", conditionMessage(e), "\n") })
  }
# END TIMER
end.time <- Sys.time()
# PROGRAM RUNNING TIME:
print(end.time - start.time)
# END CLUSTER
stopCluster(cl)
When I run this as a sequential %do% loop, I need something like 3 GB of memory to store x. However, when I run it in parallel with %dopar%, the memory usage rises exponentially. Here's a screenshot from shortly before I gave up:
Screenshot of task manager
Does someone know why %dopar% is using so much more memory and is there a way to avoid this?
Could I, for example, write x to file once in a while and purge memory after doing so? x is a list of dimension 14 by the number of iterations in the foreach() loops. Here is what one of the "columns" in x looks like (a sketch of the chunked write-to-disk idea follows the listing below):
result.98
tt List,11
options List,10
negatives Numeric,3
initials "~A*B"
PIchart TRUE
primes Integer,2
solution List,1
essential "~A*B"
inputcases "205,245,253,306,425,468,490,511,514,585,587,657,684,696,739,740,784,796"
pims List,1
IC List,4
numbers Numeric,4
SA List,1
call Expression
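One way that chunk-and-write idea could look, reusing the objects from the code above (a sketch only: the chunk size and file names are arbitrary, and whether it actually bounds memory depends on where the growth comes from):

# Split one condition group into chunks, run the nested foreach per chunk,
# save each chunk with saveRDS(), and drop it from memory before the next one.
chunks <- split(causal_condition_group_C, ceiling(seq_along(causal_condition_group_C) / 10))
for (chunk_id in seq_along(chunks)) {
  x_chunk <-
    foreach(c = chunks[[chunk_id]], .combine = 'cbind') %:%
    foreach(b = causal_condition_group_B, .combine = 'cbind') %:%
    foreach(a = causal_condition_group_A, .combine = 'cbind') %dopar% {
      tryCatch({
        minimize(temp, outcome = outcome, conditions = paste0(a, ",", b, ",", c),
                 n.cut = 1, incl.cut = 0.400, include = "?", details = TRUE, use.letters = TRUE)
      }, error = function(e) { cat("ERROR :", conditionMessage(e), "\n") })
    }
  saveRDS(x_chunk, sprintf("qca_results_chunk_%03d.rds", chunk_id))
  rm(x_chunk)
  gc()
}
# Later, the chunks can be re-loaded one at a time with readRDS() for analysis.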
I have a function with two nested loops, and the result is two lists of data.
The structure:
function() {
  for (i in 1:50) {
    for (j in 1:100) {
      # "do something"
      # "get results a and b"
      # a goes into a list, b goes into a list
    }
    # use the series of a and b to calculate two parameters, A and B
    # put A and B into their lists
    listA <- append(listA, A)
    # or: listB <- cbind(listB, B)   # I don't know which one is better
  }
  # plot the figure using listA and listB
  # save the results
  dataframe <- data.frame(listA, listB)
  # write dataframe to csv
}
The code needs to run the simulation 5000 times, and each iteration takes at least 1 minute.
I want to run the whole function using parallel programming. I tried lapply, but it only works well with one loop; when I do that, the results are not consistent and the plotting does not work, i.e. I cannot get the results.
Also, I find that some parallel code does not work on Windows and some does not work on Mac, which is confusing.
Each step in the loop is independent, so one alternative I thought of is just to divide the jobs and run them simultaneously, but I need the results in a consistent order (the order they should be in).
To use the data for the later plotting I need to save the results; I'm having trouble with this part (above) and also with the parallel part.
The way I save the results looks like a mess.
For example, what I want is:
A B
0 0
0.1 1
1.2 4
3 9
6 12
... ...
but what I got is:
V1
0 0 0.1 1 1.2 4 3 9 6 12 ... ...
I don't know how to save two columns data from parallel programming.
I like using the foreach package for tasks like this (check its documentation). foreach() is like a for loop, but it can run on a cluster: each iteration is done separately and the results are combined afterwards. I made a small example with the structure you are using; you can modify it for your task.
library(foreach)
library(doParallel)

# number of cluster processes; I chose 4
cl <- makeCluster(4)
registerDoParallel(cl)

# use your own range instead of z = 1:10; .combine declares how the results are
# combined afterwards, and .inorder makes sure they come back sorted in the
# right order (TRUE is the default)
df <- foreach(z = 1:10, .combine = rbind, .inorder = TRUE) %dopar% {
  list_a <- list()
  list_b <- list()
  for (i in 1:50) {
    for (j in 1:100) {
      # some random task you are doing
      a <- i
      b <- 50 - i
    }
    # combining them
    list_a <- cbind(list_a, a)
    list_b <- cbind(list_b, b)
  }
  # make sure you return the values, otherwise they don't get combined by foreach
  return(do.call(rbind, Map(data.frame, A = list_a, B = list_b)))
}

# foreach returns nested lists, so you can change the result to a dataframe easily
df <- as.data.frame(df)
View(df)
stopCluster(cl)
I have the following R "apply" statement:
for (i in 1:NROW(dataframe_stuff_that_needs_lookup_from_simulation)) {
  matrix_of_sums[, i] <-
    apply(simulation_results[, colnames(simulation_results) %in%
                               dataframe_stuff_that_needs_lookup_from_simulation[i, ]], 1, sum)
}
So, I have the following data structures:
simulation_results: A matrix with column names that identify every possible piece of desired simulation lookup data for 2000 simulations (rows).
dataframe_stuff_that_needs_lookup_from_simulation: Contains, among other items, fields whose values match the column names in the simulation_results data structure.
matrix_of_sums: When the function is run, a 2000-row x 250,000-column (# of simulations x items being simulated) structure meant to hold the simulation results.
So, the apply call looks up the relevant columns for each row of the 250,000-row data set, computes the row sums, and stores them in the matrix_of_sums data structure.
Unfortunately, this processing takes a very long time. I have explored the use of rowSums as an alternative, and it has cut the processing time in half, but I would like to try multi-core processing to see if that cuts processing time even more. Can someone help me convert the code above from "apply" to "lapply"?
Thanks!
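(For reference, the rowSums variant mentioned above is not shown in the question; it presumably looks something like this:)

for (i in 1:NROW(dataframe_stuff_that_needs_lookup_from_simulation)) {
  # rowSums is a vectorized replacement for apply(..., 1, sum)
  matrix_of_sums[, i] <-
    rowSums(simulation_results[, colnames(simulation_results) %in%
                                 dataframe_stuff_that_needs_lookup_from_simulation[i, ]])
}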
With base R parallel, try
library(parallel)
cl <- makeCluster(detectCores())
# with a PSOCK cluster, the objects used inside the function must be exported to the workers
clusterExport(cl, c("simulation_results", "dataframe_stuff_that_needs_lookup_from_simulation"))
matrix_of_sums <- parLapply(cl, 1:nrow(dataframe_stuff_that_needs_lookup_from_simulation), function(i)
  rowSums(simulation_results[, colnames(simulation_results) %in%
                               dataframe_stuff_that_needs_lookup_from_simulation[i, ]]))
stopCluster(cl)
ans <- Reduce("cbind", matrix_of_sums)
You could also try foreach %dopar%
library(doParallel) # will load parallel, foreach, and iterators
cl <- makeCluster(detectCores())
registerDoParallel(cl)
matrix_of_sums <- foreach(i = 1:NROW(dataframe_stuff_that_needs_lookup_from_simulation)) %dopar% {
  rowSums(simulation_results[, colnames(simulation_results) %in%
                               dataframe_stuff_that_needs_lookup_from_simulation[i, ]])
}
stopCluster(cl)
ans <- Reduce("cbind", matrix_of_sums)
I wasn't quite sure how you wanted your output at the end, but it looks like you're doing a cbind of each result. Let me know if you're expecting something else however.
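One possible tweak, not part of the original answer: with around 250,000 results, Reduce("cbind", ...) grows the matrix incrementally and copies it repeatedly, so combining the list in a single call should be noticeably faster:

# Bind all row-sum vectors into the 2000 x 250,000 result matrix in one call.
ans <- do.call(cbind, matrix_of_sums)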
Without really having any applicable or sample data to go off of... the process would look like this:
Create a holding matrix(matrix_of_sums)
loop by row through variable table(dataframe_stuff_that_needs_lookup_from_simulation)
find matching indices within the simulation model(simulation_results)
bind the rowSums into the holding matrix(matrix of sums)
I recreated a sample set which is meaningless and produces identical results, but it should work for your data:
# Holding matrix which will be our end-goal
msums <- matrix(nrow = 2000, ncol = 0)
# Loop by row; mclapply returns a list of row-sum vectors. Note that assigning
# into msums with `<<-` inside the workers would not work here, because the
# forked child processes cannot modify objects in the parent session.
sums_list <- parallel::mclapply(1:nrow(ts_df), function(i) {
  # Store the row to its own variable for ease
  d <- ts_df[i, ]
  rowSums(
    sim_df[, which(colnames(sim_df) %in% colnames(d))]
  )
}, mc.cores = parallel::detectCores())
# cbind the returned vectors into the holding matrix
msums <- do.call(cbind, sums_list)
So I've got this function meant to group measurements from multiple probes that fall into defined regions.
HMkit.dmr <- function(Mat, Classes, method.fdr = c("BH", "bonferroni"), probe.features) {
  #Annotate first...
  require(plyr)
  require(dplyr)
  #Filter matrix for testing and stuff...
  message("Setting up merged table")
  Mat2 <- Mat[match(probe.features$probe, rownames(Mat)), ]
  #Split by classes
  if (!is.factor(Classes)) {
    Classes <- as.factor(Classes)
  }
  Class.1 <- levels(Classes)[[1]]
  Class.2 <- levels(Classes)[[2]]
  C1.Mat <- Mat2[, Classes == Class.1]
  C2.Mat <- Mat2[, Classes == Class.2]
  #Summarise and run Wilcoxon's test for each dmr...
  num.regions <- length(unique(as.character(probe.features$region.id)))
  pvals.vec <- numeric(length = num.regions)
  unique.regions <- unique(as.character(probe.features$region.id))
  message(num.regions)
  Meds.1 <- numeric(length = num.regions); Meds.2 <- numeric(length = num.regions)
  for (i in 1:num.regions) {
    region <- probe.features %>% filter(region.id %in% unique.regions[[i]])
    Set1.Mat <- as.numeric(C1.Mat[rownames(C1.Mat) %in% region$probe, ])
    Set2.Mat <- as.numeric(C2.Mat[rownames(C2.Mat) %in% region$probe, ])
    pvals.vec[[i]] <- wilcox.test(Set1.Mat, Set2.Mat)$p.value
    Meds.1[[i]] <- median(Set1.Mat)
    Meds.2[[i]] <- median(Set2.Mat)
    message(i)
  }
  #Output frame
  dmrs.frame <- data.frame(region = unique.regions, pval = pvals.vec, G1 = Meds.1, G2 = Meds.2, dB = Meds.1 - Meds.2)
  dmrs.frame$q.val <- p.adjust(dmrs.frame$pval, method = method.fdr)
  groups.ids <- levels(Classes)
  return(list(dmrs = dmrs.frame, groups = groups.ids))
}
The code basically splits a matrix into two groups by samples and then pulls in the values of all probes that are defined as being in a region, calls a wilcox.test and a median summarisation step and uses it to populate vectors created beforehand.
I have tried to replace the for loop with foreach and %dopar% (with a doParallel backend) from the foreach package, but have not been able to get it to populate the vectors with the correct outcomes. I want to know how to correctly use parallelisation with the function above - either by modifying the for loop, or by modifying the function call so that the regions are broken down into chunks that are processed in parallel.
Example objects follow below...
Mat<-matrix(runif(200,0,1), ncol=10,nrow=20)
rownames(Mat)<-paste0("p",1:20)
colnames(Mat)<-paste0("S",1:10)
Classes<-as.character(c(rep("G1",6),rep("G2",4)))
probe.features <- data.frame(probe = paste0("p", 1:20), region.id = c(rep("R1", 5), rep("R2", 3), rep("R3", 4), rep("R5", 4), rep("R6", 4)))
and the function is run using
x<-HMkit.dmr(Mat,Classes,method.fdr=c("BH"),probe.features=probe.features)
In practice, there are 30,000 regions I am looking at, and I want to parallelise the function across multiple cores on Windows because serial execution can take up to 40 minutes. How do I do this?
Addendum - I tried to do this with
library(doParallel)
ncores<-2
Cl<-makeCluster(2)
registerDoParallel(Cl)
x<-foreach(i=1:length(unique(probe.features$region.id)),packages=c("plyr","dplyr"))%dopar%HMkit.dmr(Mat,Classes,probe.features=probe.features,method.fdr="BH")
However, doing that just returned two copies of the same results as the serial function; what I want it to do is break the regions in probe.features$region.id down into chunks that go to different cores.
It appears to me that your "for" loop can be easily parallelized. It's just building up three vectors, one element per iteration, where each vector will become a column of "dmrs.frame". So each iteration is computing one row of the result.
To use "foreach", you can simply concatenate those three values into a vector. The .combine option is used to combine all of those the vectors into a matrix with "rbind":
m <- foreach(uregion = unique.regions, .combine = 'rbind',
             .packages = c('plyr', 'dplyr')) %dopar% {
  region <- probe.features %>% filter(region.id %in% uregion)
  Set1.Mat <- as.numeric(C1.Mat[rownames(C1.Mat) %in% region$probe, ])
  Set2.Mat <- as.numeric(C2.Mat[rownames(C2.Mat) %in% region$probe, ])
  c(wilcox.test(Set1.Mat, Set2.Mat)$p.value,
    median(Set1.Mat), median(Set2.Mat))
}
I got rid of the "i" variable since I think it's more readable to simply iterate over the elements of "unique.regions".
Now you can create "dmrs.frame" using the columns of matrix "m":
dmrs.frame <- data.frame(region = unique.regions,
                         pval = m[,1], G1 = m[,2], G2 = m[,3], dB = m[,2] - m[,3])
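For completeness, a sketch of the backend setup this assumes (not part of the original answer): on Windows a PSOCK cluster has to be created and registered with doParallel before %dopar% will actually run in parallel; without a registered backend, foreach falls back to sequential execution with a warning.

library(doParallel)

# Register a parallel backend before calling HMkit.dmr() with the foreach
# version of the loop, and stop the cluster when done.
cl <- makeCluster(2)  # adjust to the number of cores available
registerDoParallel(cl)

x <- HMkit.dmr(Mat, Classes, method.fdr = "BH", probe.features = probe.features)

stopCluster(cl)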