Using R Parallel with other R packages

I am working on a very time-intensive analysis using the LQMM package in R. I set the model to start running on Thursday; it is now Monday and it is still running. I am confident in the model itself (tested as a standard MLM), and I am confident in my LQMM code (I have run several other very similar LQMMs with the same dataset, and they all took over a day to run). But I'd really like to figure out how to make this run faster, if possible, using the parallel processing capabilities of the machines I have access to (note: all are Microsoft Windows based).
I have read through several tutorials on using parallel, but I have yet to find one that shows how to use the parallel package in concert with other R packages... am I overthinking this, or is it not possible?
Here is the code that I am running using the R package LQMM:
install.packages("lqmm")
library(lqmm)
g1.lqmm <- lqmm(y ~ x + IEP + pm + sd + IEPZ + IEP*x + IEP*pm + IEP*sd +
                  IEP*IEPZ + x*pm + x*sd + x*IEPZ,
                random = ~ 1 + x + IEP + pm + sd + IEPZ,
                group = peers,
                tau = c(.1, .2, .3, .4, .5, .6, .7, .8, .9),
                na.action = na.omit, data = g1data)
The dataset has 122433 observations on 58 variables. All variables are z-scored or dummy coded.

The dependent libraries will need to be loaded on all your nodes. The parallel package provides the clusterEvalQ function for this purpose. You might also need to export some of your data to the global environments of your subnodes; for this you can use the clusterExport function. Also see this page for more information on other relevant functions that might be useful to you.
In general, to speed up your application by using multiple cores you have to split your problem into multiple subpieces that can be processed in parallel on different cores. To achieve this in R, you first need to create a cluster and assign a particular number of cores to it. Next, you have to register the cluster, export the required variables to the nodes, and then evaluate the necessary libraries on each of your subnodes. The exact way you set up your cluster and launch the nodes depends on the libraries and functions you will use. As an example, your cluster setup might look like this when you choose to use the doParallel package (and most of the other parallelisation libraries/functions):
library(doParallel)

nrCores <- detectCores()
cl <- makeCluster(nrCores)
registerDoParallel(cl)

clusterExport(cl, c("g1data"), envir = environment())
clusterEvalQ(cl, library("lqmm"))
The cluster is now prepared, and you can assign subparts of the global task to each individual node in your cluster. In the general example below, each node will process subpart i of the global task using the foreach %dopar% functionality provided by the doParallel package:
The doParallel package provides a parallel backend for the foreach/%dopar% function using the parallel package of R 2.14.0 and later.
Subresults will automatically be added to resultList. Finally, when all subprocesses are finished, we merge the results:
resultList <- foreach(i = 1:nrCores) %dopar% {
  # process part i of your data
}

stopCluster(cl)
# merge data ...
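To make the shape of the splitting step a bit more concrete, the generic foreach call above might, for instance, become something like the following (purely illustrative: the body just counts rows as a placeholder, and for a single lqmm fit the per-chunk results could not simply be recombined into the full model):

# Split the row indices of g1data into one chunk per core.
chunkIdx <- split(seq_len(nrow(g1data)),
                  cut(seq_len(nrow(g1data)), nrCores, labels = FALSE))

resultList <- foreach(rows = chunkIdx) %dopar% {
  chunk <- g1data[rows, ]   # g1data was exported to the nodes above
  nrow(chunk)               # placeholder: process 'chunk' and return a subresult
}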
Since your question was not specifically about how to split up your data, I will let you figure out the details of that part for yourself. However, you can find a more detailed example using the doParallel package in my answer to this post.

It sounds like you want to use parallel computing to make a single call of the lqmm function execute more quickly. To do that, you either have to:
Split the one call of lqmm into multiple function calls;
Parallelize a loop inside lqmm.
Some functions can be split up into multiple smaller pieces by specifying a smaller iteration value. Examples include parallelizing randomForest over the ntree argument, or parallelizing kmeans over the nstart argument. Another common case is to split the input data into smaller pieces, operate on the pieces in parallel, and then combine the results. That is often done when the input data is a data frame or a matrix.
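For instance, since your call passes nine values of tau, one hedged possibility, assuming (and you should verify) that lqmm fits each quantile independently so that nine single-tau fits match the one multi-tau call, is to fit one quantile per worker:

library(doParallel)
library(lqmm)

cl <- makeCluster(detectCores())
registerDoParallel(cl)
clusterExport(cl, "g1data")

taus <- c(.1, .2, .3, .4, .5, .6, .7, .8, .9)

# One single-tau lqmm fit per worker; the result is a list of nine lqmm
# objects rather than one multi-tau fit.
fits.by.tau <- foreach(t = taus, .packages = "lqmm") %dopar% {
  lqmm(y ~ x + IEP + pm + sd + IEPZ + IEP*x + IEP*pm + IEP*sd + IEP*IEPZ +
         x*pm + x*sd + x*IEPZ,
       random = ~ 1 + x + IEP + pm + sd + IEPZ,
       group = peers, tau = t, na.action = na.omit, data = g1data)
}

stopCluster(cl)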
But often, in order to parallelize a function, you have to modify it. That may actually be easier, because you may not have to figure out how to split up the problem and combine the partial results: you may only need to convert an lapply call into a parallel lapply, or convert a for loop into a foreach loop. However, it can take time to understand the code, and it's a good idea to profile it first so that the parallelization really speeds up the function call.
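As a toy illustration of those two conversions (f here is just a stand-in for the real per-iteration work inside whatever function you are modifying):

library(parallel)
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

f <- function(i) i^2                       # placeholder for the real work

res.seq <- lapply(1:10, f)                 # a sequential lapply ...
res.par <- parLapply(cl, 1:10, f)          # ... becomes a parallel lapply

res.foreach <- foreach(i = 1:10) %dopar% f(i)   # a for loop becomes a foreach loop

stopCluster(cl)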
I suggest that you download the source distribution of the lqmm package and start reading the code. Try to understand its structure and get an idea of which loops could be executed in parallel. If you're lucky, you might figure out a way to split one call into multiple calls, but otherwise you'll have to rebuild a modified version of the package on your machine.

Related

Executing parallel function calls in R

Currently I am using a foreach loop in R to run parallel function calls on multiple cores of the same machine, and the code looks something like this:
library(doParallel)
registerDoParallel(cores = 12)   # backend registered once; 12 cores, as described below

result <- foreach(i = 1:length(list_of_dataframes)) %dopar% {
  temp <- some_function(list_of_dataframes[[i]])
  temp
}
In list_of_dataframes each data frame is for one product and has a different number of columns; some_function does the modelling task on each data frame. There are multiple function calls inside this function: some do data wrangling, others perform some sort of variable selection, and so on. The result is a list of lists, with each sub-list being a list of 3 data frames. For now I have barely 500 products, and I am using a 32 GB machine with 12 cores to perform this task in parallel using doParallel and foreach. My first question is: how do I scale this up, say to 500000 products, and which framework would be ideal for that? My second question is: can I use SparkR for this? Is Spark meant to perform tasks like these? Would sparkR.lapply() be a good thing to use? I have read that it should be used as a last resort.
I am very new to all this parallel stuff; any help or suggestions would be much appreciated. Thanks in advance.

Running Multiple Bayesian Chains in Parallel R: 3 nodes produced errors; first error: incorrect number of dimensions

I am trying to run replicate chains of a Bayesian statistical function on multiple cores, one chain per core. The function mcmc below is a script to run a single MCMC chain. My thinking is that I can just run the mcmc function three times, each instance on a separate core. I found a couple of examples I tried to modify, but I have been unable to get it to run properly. I get the following error: 3 nodes produced errors; first error: incorrect number of dimensions. This makes me think I am not understanding how to use the parallel version of the apply function. I keep thinking it should be straightforward but can't seem to find my error. I am learning so much about Bayesian statistics, programming, and computers on the fly. Can anyone tell me what I am doing wrong?
My apologies if this has been answered previously, I was unable to find an answer that helped.
library(parallel)
library(snowfall)
library(rlecuyer)

cps <- detectCores() - 5   # I have access to 8 cores, but want to target only three
sfInit(parallel = TRUE, cpus = cps)
sfExportAll()
sfClusterSetupRNG()

# necessary input; GB, all.layers, ind defined previously
nchain <- 3
n.mcmc <- 2000
df <- 9

# mcmc is a function to run a single mcmc chain
tmp.fcn <- function(i){
  tmp.out[i] <- mcmc(GB, all.layers, ind, df, n.mcmc)
}

sfExport("GB", "all.layers", "ind", "df", "n.mcmc", "nchain")

tmp.time <- Sys.time()
score.list <- sfClusterApplySR(1:nchain, tmp.fcn)
time.1 <- Sys.time() - tmp.time
I figured out what my issues were. First, I switched packages and used parallel instead of snowfall. Second, I figured out I needed a list which breaks up my problem into pieces to apply my function to. In my case, instead of having the starting values for my variable t randomly generated inside the function I wanted to apply, I made a list of 3 random start values then applied my function (tmp.func) to that list (t.start).
library(parallel)

n.core <- detectCores() - 5
cl <- makeCluster(n.core)

# write function
tmp.func <- function(i){
  tmp.out <- mcmc(GB, all.layers, ind, df = 8, cov.1 = .5, cov.0 = .15,
                  t.start[i], n.mcmc = 11000)
  tmp.out
}

t.start <- c(.15, .1, .2)

clusterEvalQ(cl, {
  library(adegenet); library(Matrix); library(MASS); library(mvtnorm)
  library(fda); library(raster); library(rgdal)
})
clusterExport(cl, c("t.start", "GB", "all.layers", "ind", "mcmc"))

# apply function
out.11k <- parLapply(cl, 1:length(t.start), tmp.func)
stopCluster(cl)

When does foreach call .combine?

I have written some code using foreach which processes and combines a large number of CSV files. I am running it on a 32 core machine, using %dopar% and registering 32 cores with doMC. I have set .inorder=FALSE, .multicombine=TRUE, verbose=TRUE, and have a custom combine function.
I notice that if I run this on a sufficiently large set of files, R appears to process EVERY file before calling .combine the first time. My evidence is that, monitoring my server with htop, I initially see all cores maxed out, and then for the remainder of the job only one or two cores are used while it does the combines in batches of ~100 (.maxcombine's default), as seen in the verbose console output. What's really telling is that the more jobs I give to foreach, the longer it takes to see "First call to combine"!
This seems counter-intuitive to me; I naively expected foreach to process .maxcombine files, combine them, then move on to the next batch, combining those with the output of the last call to .combine. I suppose for most uses of .combine it wouldn't matter as the output would be roughly the same size as the sum of the sizes of inputs to it; however my combine function pares down the size a bit. My job is large enough that I could not possibly hold all 4200+ individual foreach job outputs in RAM simultaneously, so I was counting on my space-saving .combine and separate batching to see me through.
Am I right that .combine doesn't get called until ALL my foreach jobs are individually complete? If so, why is that, and how can I optimize for that (other than making the output of each job smaller) or change that behavior?
The short answer is to use either doMPI or doRedis as your parallel backend. They work more as you expect.
The doMC, doSNOW and doParallel backends are relatively simple wrappers around functions such as mclapply and clusterApplyLB, and don't call the combine function until all of the results have been computed, as you've observed. The doMPI, doRedis, and (now defunct) doSMP backends are more complex, and get inputs from the iterators as needed and call the combine function on-the-fly, as you have assumed they would. These backends have a number of advantages in my opinion, and allow you to handle an arbitrary number of tasks if you have appropriate iterators and combine function. It surprises me that so many people get along just fine with the simpler backends, but if you have a lot of tasks, the fancy ones are essential, allowing you to do things that are quite difficult with packages such as parallel.
I've been thinking about writing a more sophisticated backend based on the parallel package that would handle results on the fly like my doMPI package, but there hasn't been any call for it to my knowledge. In fact, yours is the only question of this sort that I've seen.
Update
The doSNOW backend now supports on-the-fly result handling. Unfortunately, this can't be done with doParallel because the parallel package doesn't export the necessary functions.
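For reference, a minimal sketch of that kind of setup with doSNOW and a size-reducing combine function (the file list, the per-file processing, and myCombine are all placeholders for your own code):

library(parallel)
library(doSNOW)

cl <- makeCluster(32)
registerDoSNOW(cl)

# Placeholder combine that pares results down (here: element-wise sums)
# instead of accumulating every raw result.
myCombine <- function(...) Reduce(`+`, list(...))

files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)

totals <- foreach(f = files,
                  .combine      = myCombine,
                  .multicombine = TRUE,
                  .maxcombine   = 100,
                  .inorder      = FALSE,
                  .verbose      = TRUE) %dopar% {
  x <- read.csv(f)
  colSums(Filter(is.numeric, x))   # placeholder per-file summary
}

stopCluster(cl)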

parallelize process in missForest package

I am using a package called missForest to estimate the missing values in my data set.
My question is: how can we parallelize this process to shorten the time that it takes to get the results?
Please refer to this example (from the missForest package):
data(iris)
summary(iris)
The data contains four continuous and one categorical variable.
Artificially produce missing values using the prodNA function:
set.seed(81)
iris.mis <- prodNA(iris, noNA = 0.2)
summary(iris.mis)
Impute missing values providing the complete matrix for illustration. Use ’verbose’ to see what happens between iterations:
iris.imp <- missForest(iris.mis, xtrue = iris, verbose = TRUE)
Yesterday I submitted version 1.4 of missForest to CRAN; the Windows and Linux packages are ready, the Mac version will follow soon.
The new function has an additional argument, "parallelize", which allows you either to compute the single forests in a parallel fashion (parallelize = "forests") or to compute several forests on multiple variables at the same time (parallelize = "variables"). The default is no parallel computing (parallelize = "no").
Do not forget to register a suitable parallel backend, e.g. using the package "doParallel", before trying it for the first time. The "doParallel" vignette gives an illustrative example in Section 4.
Due to some other details I had to temporarily remove the "missForest" vignette from the package. But I will resolve this in due course and release it as version 1.4-1.
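A minimal usage sketch along those lines (assuming missForest >= 1.4 and the doParallel backend; the worker count is arbitrary, and the example reuses the iris data from above):

library(missForest)
library(doParallel)

cl <- makeCluster(4)          # pick a worker count that suits your machine
registerDoParallel(cl)

set.seed(81)
iris.mis <- prodNA(iris, noNA = 0.2)

# Grow the individual random forests in parallel; parallelize = "variables"
# would instead impute several variables at the same time.
iris.imp <- missForest(iris.mis, xtrue = iris, verbose = TRUE,
                       parallelize = "forests")

stopCluster(cl)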
It's a bit tricky to do a good job of parallelizing the missForest function. There seem to be two basic ways to do it:
Create the randomForest model objects in parallel;
Execute multiple randomForest operations (create model and predict) in parallel for each of the columns of the data frame that contain NA's.
Method 1 is rather easy to implement, except that you have to compute the error estimates yourself since the randomForest combine function doesn't compute them for you. However, if the randomForest objects don't take that long to compute and there are many columns containing NA's, you may get very little if any speed up, even though the operations in aggregate take a long time to compute.
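As a rough sketch of method 1 in isolation (the generic pattern of growing chunks of trees in parallel and merging them with randomForest's combine; this is not missForest's internal code, and as noted the merged object's error estimates would have to be recomputed separately):

library(randomForest)
library(doParallel)

cl <- makeCluster(4)
registerDoParallel(cl)

x <- iris[, -5]
y <- iris[, 5]

# 500 trees in total, grown as 4 chunks of 125 trees on separate workers
# and merged into a single forest afterwards.
rf <- foreach(ntree = rep(125, 4), .combine = randomForest::combine,
              .multicombine = TRUE, .packages = "randomForest") %dopar% {
  randomForest(x, y, ntree = ntree)
}

stopCluster(cl)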
Method 2 is a bit harder to implement because the sequential algorithm updates the columns of the xmis data frame after each randomForest operation. I think the right way to parallelize this is to process n columns in parallel at a time (where n is the number of worker processes), thus requiring another loop around the n columns in order to process all of the columns of the data frame. My experiments suggest that unless this is done, the outer loop takes longer to converge, thus losing the benefit of executing in parallel.
In general, to get a performance improvement you will need to implement both of these methods and choose which to use based on your input data. If you have just a few columns with NA's but the randomForest models take a long time to compute, you should choose method 1. If you have many columns with NA's, you should probably choose method 2, even if the individual randomForest models take a long time to compute, because this can be done more efficiently, although it's possible that it will still require an extra iteration of the outer while loop.
In the process of experimenting with missForest, I eventually developed a parallel version of the package. I put the modified version of library.R on a GitHub Gist; however, it isn't trivial to use in that form, especially without documentation. So I contacted the author of missForest, and he is very interested in incorporating at least some of my modifications into the official package, so hopefully the next version of missForest posted to CRAN will support parallel execution.

Using all cores for R MASS::stepAIC process

I've been struggling to perform this sort of analysis and posted on the stats site about whether I was taking things in the right direction, but as I've been investigating I've also found that my lovely beefy processor (Linux OS, i7) is only actually using 1 of its cores. Turns out this is the default behaviour, but I have a fairly large dataset and between 40 and 50 variables to select from.
A stepAIC function that checks various different models seems like the ideal sort of thing for parallelising, but I'm a relative newb with R and I only have sketchy notions about parallel computing.
I've taken a look at the documentation for the packages parallel and snowfall, but these seem to have built-in list functions for parallelisation, and I'm not sure how to morph stepAIC into a form that can be run in parallel using these packages.
Does anyone know 1) whether this is a feasible exercise, 2) how to do what I'm looking to do, and can give me a sort of basic structure/list of keywords I'll need?
Thanks in advance,
Steph
I think that a process in which each step depends on the last (as in stepwise selection) is not trivial to do in parallel.
The simplest way to do something in parallel I know is:
library(doMC)
registerDoMC()
l <- foreach(i=1:X) %dopar% { fun(...) }
In my limited understanding of stepwise selection, at each step you remove (or, in forward/backward selection, add) a variable from the model and measure the fit; if, say, the model fits best after removing a particular variable, you keep that model. Inside the foreach parallel loop each iteration is blind to the others, so maybe you could write your own function to perform each step's model comparisons in parallel (a rough sketch of that idea is given at the end of this answer), as in
http://beckmw.wordpress.com/tag/stepwise-selection/
I looked at this code, and it seems to me that you could use parallel computing with the vif_func function...
I think you should also check optimized code for this task, such as the leaps package:
http://cran.r-project.org/web/packages/leaps/index.html
hope this helps...
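For illustration only, here is a hedged sketch of parallelising one forward step of AIC-based selection with the same doMC/foreach pattern (every name here is hypothetical, and this is not how MASS::stepAIC is implemented):

library(doMC)
registerDoMC()

# One forward step: for every candidate variable not yet in the model, fit the
# model with that variable added (in parallel) and return the candidate with
# the lowest AIC.
forward_step <- function(base_formula, candidates, data) {
  aics <- foreach(v = candidates, .combine = c) %dopar% {
    f <- update(as.formula(base_formula), as.formula(paste(". ~ . +", v)))
    AIC(lm(f, data = data))
  }
  candidates[which.min(aics)]
}

# Hypothetical usage: starting from an intercept-only model of 'response',
# pick the single best variable to add from the character vector 'vars'.
# best <- forward_step("response ~ 1", vars, mydata)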
