Using %dopar% with a custom function in R

So I've got this function meant to group measurements from multiple probes that fall into defined regions.
HMkit.dmr <- function(Mat, Classes, method.fdr=c("BH","bonferroni"), probe.features) {
  # Annotate first...
  require(plyr)
  require(dplyr)
  # Filter matrix for testing and stuff...
  message("Setting up merged table")
  Mat2 <- Mat[match(probe.features$probe, rownames(Mat)), ]
  # Split by classes
  if (!is.factor(Classes)) {
    Classes <- as.factor(Classes)
  }
  Class.1 <- levels(Classes)[[1]]
  Class.2 <- levels(Classes)[[2]]
  C1.Mat <- Mat2[, Classes==Class.1]
  C2.Mat <- Mat2[, Classes==Class.2]
  # Summarise and run Wilcoxon's test for each dmr...
  num.regions <- length(unique(as.character(probe.features$region.id)))
  pvals.vec <- numeric(length=num.regions)
  unique.regions <- unique(as.character(probe.features$region.id))
  message(num.regions)
  Meds.1 <- numeric(length=num.regions)
  Meds.2 <- numeric(length=num.regions)
  for (i in 1:num.regions) {
    region <- probe.features %>% filter(region.id %in% unique.regions[[i]])
    Set1.Mat <- as.numeric(C1.Mat[rownames(C1.Mat) %in% region$probe, ])
    Set2.Mat <- as.numeric(C2.Mat[rownames(C2.Mat) %in% region$probe, ])
    pvals.vec[[i]] <- wilcox.test(Set1.Mat, Set2.Mat)$p.value
    Meds.1[[i]] <- median(Set1.Mat)
    Meds.2[[i]] <- median(Set2.Mat)
    message(i)
  }
  # Output frame
  dmrs.frame <- data.frame(region=unique.regions, pval=pvals.vec, G1=Meds.1, G2=Meds.2, dB=Meds.1-Meds.2)
  dmrs.frame$q.val <- p.adjust(dmrs.frame$pval, method=method.fdr)
  groups.ids <- levels(Classes)
  return(list(dmrs=dmrs.frame, groups=groups.ids))
}
The code splits a matrix into two groups of samples, pulls in the values of all probes defined as belonging to a region, runs wilcox.test and a median summarisation step, and uses the results to populate the vectors created beforehand.
I have tried to replace the for loop with foreach/%dopar% from the doParallel package, but have not been able to get it to populate the vectors with the correct results. I want to know how to correctly use parallelisation with the function above, either by modifying the for loop or by modifying the function call so that regions are broken down into chunks that are processed in parallel.
Example objects follow below...
Mat<-matrix(runif(200,0,1), ncol=10,nrow=20)
rownames(Mat)<-paste0("p",1:20)
colnames(Mat)<-paste0("S",1:10)
Classes<-as.character(c(rep("G1",6),rep("G2",4)))
probe.features<-data.frame(probe=paste0("p",1:20),region.id=c(rep("R1",5),rep("R2",3),rep("R3",4),rep("R5",4),rep("R6",4)))
and the function is run using
x<-HMkit.dmr(Mat,Classes,method.fdr=c("BH"),probe.features=probe.features)
In practice, there are 30,000 regions I am looking at, and I want to parallelise the function across multiple cores on Windows because serial execution can take up to 40 minutes. How do I do this?
Addendum - I tried to do this with
library(doParallel)
ncores<-2
Cl<-makeCluster(2)
registerDoParallel(Cl)
x<-foreach(i=1:length(unique(probe.features$region.id)),packages=c("plyr","dplyr"))%dopar%HMkit.dmr(Mat,Classes,probe.features=probe.features,method.fdr="BH")
However, doing that just returned two copies of the same results as the serial function. What I want it to do is break down the regions in probe.features$region.id into chunks that go to different cores.

It appears to me that your "for" loop can be easily parallelized. It's just building up three vectors, one element per iteration, where each vector will become a column of "dmrs.frame". So each iteration is computing one row of the result.
To use "foreach", you can simply concatenate those three values into a vector. The .combine option is used to combine all of those the vectors into a matrix with "rbind":
m <- foreach(uregion=unique.regions, .combine='rbind',
             .packages=c('plyr', 'dplyr')) %dopar% {
  region <- probe.features %>% filter(region.id %in% uregion)
  Set1.Mat <- as.numeric(C1.Mat[rownames(C1.Mat) %in% region$probe, ])
  Set2.Mat <- as.numeric(C2.Mat[rownames(C2.Mat) %in% region$probe, ])
  c(wilcox.test(Set1.Mat, Set2.Mat)$p.value,
    median(Set1.Mat), median(Set2.Mat))
}
I got rid of the "i" variable since I think it's more readable to simply iterate over the elements of "unique.regions".
Now you can create "dmrs.frame" using the columns of matrix "m":
dmrs.frame <- data.frame(region=unique.regions,
                         pval=m[,1], G1=m[,2], G2=m[,3], dB=m[,2]-m[,3])
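To tie this back to the original question (running across multiple cores on Windows), here is a minimal sketch of how the replacement might sit inside HMkit.dmr, with the cluster setup, the q-value step and the teardown added. This assumes two workers; the doParallel backend should export C1.Mat, C2.Mat and probe.features from the function's environment automatically:
# inside HMkit.dmr(), in place of the serial for loop
library(doParallel)

cl <- makeCluster(2)               # PSOCK cluster, works on Windows
registerDoParallel(cl)

m <- foreach(uregion = unique.regions, .combine = 'rbind',
             .packages = c('plyr', 'dplyr')) %dopar% {
  region   <- probe.features %>% filter(region.id %in% uregion)
  Set1.Mat <- as.numeric(C1.Mat[rownames(C1.Mat) %in% region$probe, ])
  Set2.Mat <- as.numeric(C2.Mat[rownames(C2.Mat) %in% region$probe, ])
  c(wilcox.test(Set1.Mat, Set2.Mat)$p.value,
    median(Set1.Mat), median(Set2.Mat))
}

stopCluster(cl)

dmrs.frame <- data.frame(region = unique.regions,
                         pval = m[, 1], G1 = m[, 2], G2 = m[, 3],
                         dB = m[, 2] - m[, 3])
dmrs.frame$q.val <- p.adjust(dmrs.frame$pval, method = method.fdr)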

Related

Convert R apply statement to lapply for parallel processing

I have the following R "apply" statement:
for(i in 1:NROW(dataframe_stuff_that_needs_lookup_from_simulation))
{
  matrix_of_sums[,i] <-
    apply(simulation_results[, colnames(simulation_results) %in%
                               dataframe_stuff_that_needs_lookup_from_simulation[i,]], 1, sum)
}
So, I have the following data structures:
simulation_results: A matrix with column names that identify every possible piece of desired simulation lookup data for 2000 simulations (rows).
dataframe_stuff_that_needs_lookup_from_simulation: Contains, among other items, fields whose values match the column names in the simulation_results data structure.
matrix_of_sums: When the function is run, a 2000-row x 250,000-column (# of simulations x items being simulated) structure meant to hold the simulation results.
So, the apply function is looking up the dataframe columns values for each row in a 250,000 data set, computing the sum, and storing it in the matrix_of_sums data structure.
Unfortunately, this processing takes a very long time. I have explored the use of rowSums as an alternative, and it has cut the processing time in half, but I would like to try multi-core processing to see if that cuts processing time even more. Can someone help me convert the code above from "apply" to "lapply"?
Thanks!
With base R parallel, try
library(parallel)
cl <- makeCluster(detectCores())
# PSOCK workers start with an empty workspace, so export the objects they need
clusterExport(cl, c("simulation_results",
                    "dataframe_stuff_that_needs_lookup_from_simulation"))
matrix_of_sums <- parLapply(cl, 1:nrow(dataframe_stuff_that_needs_lookup_from_simulation), function(i)
  rowSums(simulation_results[, colnames(simulation_results) %in%
                               dataframe_stuff_that_needs_lookup_from_simulation[i,]]))
stopCluster(cl)
ans <- Reduce("cbind", matrix_of_sums)
You could also try foreach %dopar%
library(doParallel) # will load parallel, foreach, and iterators
cl <- makeCluster(detectCores())
registerDoParallel(cl)
matrix_of_sums <- foreach(i = 1:NROW(dataframe_stuff_that_needs_lookup_from_simulation)) %dopar% {
rowSums(simulation_results[,colnames(simulation_results) %in%
dataframe_stuff_that_needs_lookup_from_simulation[i,]])
}
stopCluster(cl)
ans <- Reduce("cbind", matrix_of_sums)
I wasn't quite sure how you wanted your output at the end, but it looks like you're doing a cbind of each result. Let me know if you're expecting something else however.
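To make the behaviour concrete, here is a small self-contained toy run of the parLapply approach; sim_res and lookup_df are invented stand-ins for simulation_results and the lookup data frame, not the poster's real objects:
library(parallel)

# Toy data: 20 simulations x 6 named columns, and a lookup table whose
# values are column names of the simulation matrix.
sim_res   <- matrix(rnorm(120), nrow = 20,
                    dimnames = list(NULL, paste0("c", 1:6)))
lookup_df <- data.frame(v1 = c("c1", "c2", "c3"),
                        v2 = c("c4", "c5", "c6"),
                        stringsAsFactors = FALSE)

cl <- makeCluster(2)
clusterExport(cl, c("sim_res", "lookup_df"))
sums <- parLapply(cl, seq_len(nrow(lookup_df)), function(i)
  rowSums(sim_res[, colnames(sim_res) %in% lookup_df[i, ]]))
stopCluster(cl)

ans <- do.call(cbind, sums)   # 20 x 3 matrix, one column per lookup row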
Without really having any applicable or sample data to go off of, the process would look like this:
1. Create a holding matrix (matrix_of_sums)
2. Loop by row through the lookup table (dataframe_stuff_that_needs_lookup_from_simulation)
3. Find matching indices within the simulation results (simulation_results)
4. Bind the rowSums into the holding matrix (matrix_of_sums)
I recreated a meaningless sample set, but the same process should work for your data:
# Loop by row through the lookup table in parallel and collect the rowSums.
# Returning the results and binding them afterwards is more reliable than
# assigning into a shared matrix with `<<-`, because changes made inside
# forked mclapply workers do not propagate back to the parent session.
# (mclapply forks, so it needs Linux/macOS; on Windows use parLapply as above.)
msums_list <- parallel::mclapply(1:nrow(ts_df), function(i){
  # Store the row to its own variable for ease
  d <- ts_df[i,]
  rowSums(sim_df[, which(colnames(sim_df) %in% colnames(d))])
}, mc.cores = parallel::detectCores())
# Holding matrix which is our end-goal: one column per row of ts_df
msums <- do.call(cbind, msums_list)

parallel programming for a function taking two arguments from a list of data frames using R

I have a function called DTW from a similarity-measure package. It takes two matrices or data frames as its arguments and returns the dynamic time warping distance. Those data frames are the longitudes and latitudes of trajectories.
My program looks like this, and all the data frames (df1, df2, df3 and so on) are available:
distance <- function(arg1, arg2) {
  DTW(arg1, arg2)
}

for (i in 1:length(LIST)) {
  for (j in 1:length(LIST)) {
    a <- get(paste0("df", i))
    b <- get(paste0("df", j))
    ddist[i, j] <- distance(a, b)
    print(ddist)
  }
}
I am building a matrix ddist into which all the values returned by the distance function are inserted. The program is working fine. I want to make it faster using parallel programming, for example with the parApply or parLapply functions.
Here is a simple method to give you an idea of how to make it parallel:
k <- length(LIST)
ddist <- matrix(0, k, k)
library("doParallel")
cl <- makeCluster(4, outfile='')
registerDoParallel(cl)
for (i in 1:k) {
  a <- get(paste0("df", i))
  # if distance() relies on a package for DTW, add it via .packages so it is loaded on the workers
  ddist[i,] <- foreach(j = 1:k, .combine='cbind', .export=paste0("df", 1:k)) %dopar% {
    b <- get(paste0("df", j))
    distance(a, b)
  }
}
stopCluster(cl)
Having said that, things to evaluate:
- Only use parallel if the distance function takes more than 2 seconds per call.
- Separate df1, df2, etc. objects may not be a good idea; store each data frame as df[[1]], df[[2]], and so on. That is better than using the get function (see the sketch after this list).
- If k is very large, the time taken to transfer the exported df1, df2, etc. to the workers is considerable, so try to hit the sweet spot of performance with various iterations.
- Consider data.table, which supports in-place editing; using it instead of ddist might be faster.
- If this code is called within a function, then you might also need to export ddist, e.g. .export=c("ddist", paste0("df", 1:k)).
- Change the "4" in makeCluster to choose the number of cores you want; as a rule of thumb, keep it at detectCores()-1.
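To illustrate the list-based point, here is a minimal sketch under the assumption that the trajectories are stored in a single list called traj (a hypothetical name) rather than as separate df1, df2, ... objects:
library(parallel)

# traj is a hypothetical list holding the trajectories, i.e. traj[[1]], traj[[2]], ...
k <- length(traj)

cl <- makeCluster(4)
# ship the whole list and the distance function to the workers once
clusterExport(cl, c("traj", "distance"))
# if distance() depends on a package for DTW, load it on the workers as well, e.g.:
# clusterEvalQ(cl, library(yourDTWpackage))

# one row of the distance matrix per task
rows <- parLapply(cl, 1:k, function(i)
  sapply(1:k, function(j) distance(traj[[i]], traj[[j]])))
stopCluster(cl)

ddist <- do.call(rbind, rows)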

Optimizing nested foreach dopar in R

I'd like input on how my code below is structured. Would like to know if it needs to be organized in a different way to execute faster. Specifically, whether I need to be using foreach and dopar differently in the nested loops. Currently, the inner loop is the bulk of the work (ddply with between 1-8 breakdown variables, each of which has 10-200 levels), and that's what I have running in parallel. I left out the code details for simplicity.
Any ideas? My code, as organized below, does work, but it takes a few hours on a 6-core, 41gb machine. The dataset isn't that large (< 20k records).
for(m in 1:length(Predictors)){ # has up to three elements in the vector
# construct the dataframe based on the specified predictor
# subset the original dataframe based on the breakdown variables, outcome, predictor and covariates
for(l in 1:nrow(pairwisematrixReduced)){ # this has 1-6 rows;subset based on correct comparison groups
# some code here
cl <- makeCluster(detectCores())
registerDoParallel(cl)
for (i in 1:nrow(subsetting_table)){ # this table has about 50 rows
# this uses the columns specified by k in the glm; the prior columns will be used as breakdown variables
# up to 10 covariates
result[[length(result) + 1]] <- foreach(k = 11:17, .packages=c('plyr','reshape2', 'fastmatch')) %dopar% {
ddply(
df,
b, # vector of breakdown variables
function(x) {
  # run a GLM and manipulate the output
},
.parallel = TRUE) # close ddply
} # close k loop -- set of covariates
} # close i loop -- subsetting table
} #close l -- group combinations
} # close m loop - this is the pairwise predictor matrix
stopCluster(cl)
result <- unlist(result, recursive = FALSE)
tmp2<-do.call(rbind.fill, result)
Copied out of vignette("nested")
3 Using %:% with %dopar%
When parallelizing nested for loops, there is always a question of which loop to parallelize. The standard advice is...
You are also using foreach %dopar% together with ddply and .parallel=TRUE. With a six-core processor (and presumably hyperthreading), the foreach block would start 12 environments, and the ddply call would then start 12 environments within each of those, for 144 simultaneous environments. The foreach should be changed to %do% to be consistent with your question's text of running the inner loop in parallel. Or, to make it cleaner, change both loops to foreach and use %dopar% for one and %:% for the other.
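For the i/k pair specifically, a minimal sketch of that nested-operator form might look like the following. It reuses the placeholder objects from the question (subsetting_table, df, b) and leaves the outer m and l loops untouched:
library(doParallel)

cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

# %:% merges the two loops into a single stream of (i, k) tasks, so only one
# level of parallelism is active and the workers are not oversubscribed;
# ddply then runs serially inside each task.
result <- foreach(i = 1:nrow(subsetting_table), .combine = 'c') %:%
  foreach(k = 11:17, .packages = c('plyr', 'reshape2', 'fastmatch')) %dopar% {
    ddply(df, b, function(x) {
      # run a GLM and manipulate the output, as in the original inner loop
    })
  }

stopCluster(cl)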

lapply with growing data.table function in R

I come from a Java/Python comp sci theory background, so I am still getting used to the various R packages and how they can save run time in functions.
Basically, I am working on a few projects, all of which involve taking individual factors in a long-list data set (15,000 to 200,000 factors), performing calculations against individual factors in an equally large data set, and concurrently storing the results of those calculations in an exponentially longer data frame.
So far I have been using nested while loops and concatenating into a growing list, but that is taking days. I've recently learned about 'lapply' and the 'data.table' options in R, and I would love to see an example of how to apply (no pun intended) them to the following basic correlation function:
Corr <- function(miRdf, mRNAdf)
{
  j = 1
  k = 1
  m = 1
  n = 1
  c = 0
  corrList = NULL
  while (n <= 71521)
  {
    while (m <= 1477)
    {
      corr = cor(as.numeric(miRdf[k, 2:13]), as.numeric(mRNAdf[j, 2:13]), use = "complete.obs")
      corrList <- c(corrList, corr)
      j = j + 1
      c = c + 1
      print(c) # just a counter to see how far the function has run
      m = m + 1
    }
    k = k + 1
    n = n + 1
    j = 1
    m = 1 # to reset the inner while loop
  }
  corrList <- matrix(unlist(corrList), ncol = 1477, byrow = FALSE)
  colnames(corrList) <- miRdf[, 1]
  rownames(corrList) <- mRNAdf[, 1]
  write.csv(corrList, "testCorrWhole.csv")
}
As you can see, the nested while loop results in 105,636,517 (71521x1477) miRNA vs mRNA expression-value correlation scores that need to be performed and stored in a data frame that is 1477 cols x 71521 rows in order to generate a scoring matrix.
My question is, can anyone shed light on how to turn the above monstrosity into an efficient function that utilizes 'lapply' instead of the while loops, and uses the 'data.table' set() function to do away with the inefficiency of concatenating a list during every pass through the loop?
Thank you in advance!
Your names end with 'df', which makes it seem like your data are a data.frame. But @Troy's answer uses a matrix. A matrix is appropriate when the data are homogeneous, and generally matrix operations are much faster than data.frame operations. So you can see already that if you'd provided a small example of your data set (e.g., dput(mRNAdf[1:10,])), people might be in a better position to help you; this is what they're asking for.
In large numerical calculations it makes sense to 'hoist' any repeated calculations outside the loop, so they are performed only once. Repeated calculations in your case include sub-setting to columns 2:13, and coercion to numeric. With this idea, and guessing that you actually have a data.frame where each column is already a numeric vector, I'd start with
mRNAmatrix <- as.matrix(mRNAdf[,2:13])
miRmatrix <- as.matrix(miRdf[,2:13])
From the help page ?cor we see that the arguments can be a matrix, and if so the correlation is calculated between columns. You're interested in the result when the arguments are transposed relative to your current representation. So
result <- cor(t(mRNAmatrix), t(miRmatrix), use="complete.obs")
This is fast enough for your purposes
> m1 = matrix(rnorm(71521 * 12), 71521)
> m2 = matrix(rnorm(1477 * 12), 1477)
> system.time(ans <- cor(t(m1), t(m2)))
user system elapsed
9.124 0.200 9.340
> dim(ans)
[1] 71521 1477
result is the same as your corrList -- it's not a list, but a matrix; probably the row and column names have been carried forward. You'd write this to a file as you do above, write.csv(result, "testCorrWhole.csv")
UPDATED BELOW TO SHOW PARALLEL PROCESSING - ABOUT A 60% SAVING
Using apply() might not be quick enough for you. Here's how to do it, though. Will have a think about performance, since this example (1M output correlations in a 1000x1000 grid) takes over a minute on a laptop.
miRdf=matrix(rnorm(13000,10,1),ncol=13)
mRNAdf=matrix(rnorm(13000,10,1),ncol=13)
miRdf[,1]<-1:nrow(miRdf) # using column 1 as indices since they're not in the calc.
mRNAdf[,1]<-1:nrow(mRNAdf)
corRow<-function(y){
apply(miRdf,1,function(x)cor(as.numeric(x[2:13]), as.numeric(mRNAdf[y,2:13]), use ="complete.obs"))
}
system.time(apply(mRNAdf,1,function(x)corRow(x[1])))
# user system elapsed
# 72.94 0.00 73.39
And with parallel::parApply on a 4 core Win64 laptop
require(parallel) ## Library to allow parallel processing
miRdf=matrix(rnorm(13000,10,1),ncol=13)
mRNAdf=matrix(rnorm(13000,10,1),ncol=13)
miRdf[,1]<-1:nrow(miRdf) # using column 1 as indices since they're not in the calc.
mRNAdf[,1]<-1:nrow(mRNAdf)
corRow<-function(y){
apply(miRdf,1,function(x)cor(as.numeric(x[2:13]), as.numeric(mRNAdf[y,2:13]), use ="complete.obs"))
}
# Make a cluster from all available cores
cl=makeCluster(detectCores())
# Use clusterExport() to distribute the function and data.frames needed in the apply() call
clusterExport(cl,c("corRow","miRdf","mRNAdf"))
# time the call
system.time(parApply(cl,mRNAdf,1,function(x)corRow(x[[1]])))
# Stop the cluster
stopCluster(cl)
# time the call without clustering
system.time(apply(mRNAdf,1,function(x)corRow(x[[1]])))
## WITH CLUSTER (4)
user system elapsed
0.04 0.03 29.94
## WITHOUT CLUSTER
user system elapsed
73.96 0.00 74.46

Loop within a function and automatically create objects in R

I am trying to calculate the column means for different groups in R. There are several methods to assign groups, so two columns were created that contain different groupings.
# create a test df
df.abcd.2<-data.frame(Grouping1=c("a","f","a","d","d","f","a"),Grouping2=c("y","y","z","z","x","x","q"),Var1=sample(1:7),Var2=sample(1:7),Var3=rnorm(1:7))
df.abcd.2
Now I created a loop with assign, lapply, split and colMeans to get my results and store them in different data frames. The loop works fine.
# Loop to create the colMeans and store them in data frames
for (i in 1:2) {
  nam <- paste("RRRRRR", deparse(i), sep=".")
  assign(nam, as.data.frame(
    lapply(
      split(df.abcd.2[, 3:5], df.abcd.2[, i]), colMeans)
    )
  )
}
So now I would like to create a function to apply this method to different data frames. My attempt looked like this:
# 1. function to calculate colMeans for different groups
# df = desired dataframe
# a = starting column: beginning of the columns that contain the groups, b = end of the columns that contain the groups
# c = starting column: beginning of the columns to be analysed, d = end of the columns to be analysed
function.split.colMeans <- function(df, a, b, c, d) {
  for (i in a:b) {
    nam <- paste("OOOOO", deparse(i), sep=".")
    assign(nam, as.data.frame(
      lapply(
        split(df[, c:d], df[, i]), colMeans)
      )
    )
  }
}
#test the function
function.split.colMeans(df.abcd.2,1,2,3,5)
So when I test this function I get neither an error message nor results... Can anyone help me out, please?
It's working perfectly. Read the help for assign. Learn about frames and environments.
In other words, it's creating the variables inside your function, but they don't leak out into the environment you see when you do ls() at the command line. If you put print(ls()) inside your function's loop you'll see them, but when the function ends, they disappear.
Normally, the only way functions interact with their calling environment is by their return value. Any other method is entering a whole world of pain.
DON'T use assign to create things with sequential or informative names. Ever. Unless you know what you are doing, which you don't... Stick them in lists; then you can index the parts for looping and so on (a minimal sketch follows below).
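For example, a minimal sketch of a version that returns a named list instead of assigning into the calling environment (the list element names here are simply the grouping column names) might be:
function.split.colMeans <- function(df, a, b, c, d) {
  out <- list()
  for (i in a:b) {
    # one data.frame of group-wise column means per grouping column
    out[[names(df)[i]]] <- as.data.frame(
      lapply(split(df[, c:d], df[, i]), colMeans)
    )
  }
  out   # the function's return value; nothing is assign()ed
}

# usage
res <- function.split.colMeans(df.abcd.2, 1, 2, 3, 5)
res$Grouping1
res$Grouping2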
