Executing parallel function calls in R

Currently I am using a foreach loop in R to run function calls in parallel on multiple cores of the same machine, and the code looks something like this:
result <- foreach(i = 1:length(list_of_dataframes)) %dopar% {
  some_function(list_of_dataframes[[i]])
}
In this list_of_dataframes, each data frame is for one product and has a different number of columns. some_function is a function that does a modelling task on each data frame; there are multiple function calls inside it, some doing data wrangling, others performing some sort of variable selection, and so on. The result is a list of lists, with each sub-list being a list of 3 data frames.
For now I have barely 500 products, and I am using a 32GB machine with 12 cores to perform this task in parallel using doParallel and foreach.
My first question is: how do I scale this up, say when I have 500,000 products, and which framework would be ideal for that?
My second question is: can I use SparkR for this? Is Spark meant to perform tasks like these? Would spark.lapply() be a good thing to use? I have read that it should be used as a last resort.
I am very new to all this parallel stuff; any help or suggestions would be greatly appreciated. Thanks in advance.
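For the SparkR route, a minimal sketch of what the question describes might look like the following. It reuses the question's some_function and list_of_dataframes and assumes a Spark installation is available; spark.lapply ships each list element to an executor and collects the results back to the driver, so the combined results still have to fit in driver memory, and any packages used inside some_function must be installed on the executors.
library(SparkR)
sparkR.session()   # assumes Spark is installed and configured

# run some_function on each product's data frame across the cluster;
# each element of list_of_dataframes is serialized and sent to an executor
result <- spark.lapply(list_of_dataframes, function(df) {
  some_function(df)
})

sparkR.session.stop()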

Related

Parallel processing a function being applied with mapply over a large dataset

My problem is the following: I have a large dataset in R (I run it in VS Code), which I'll call full, of about 1.2 GB (roughly 28 million rows with 10 columns), and a subset of this dataset, which I'll call main (roughly 4.3 million rows and 10 columns). I use a Windows machine with an i7-10700K CPU (3.8 GHz, 8 cores) and 16 GB of RAM.
These datasets contain unique identifiers for products, which then span multiple time periods and stores. For each product-store combination, I need to calculate summary statistics for similar products, excluding that store and product. For this reason, I essentially need the full dataset to be loaded, and I cannot split it.
I have created a function that takes a given product-store, filters the dataset to exclude that product-store, and then performs the summary statistics.
There are over 1 million product-stores, so an apply would take 1 million runs. Each run of the function is taking about 0.5 seconds, which is a lot.
I then decided to use furrr's future_map2 along with plan(cluster, workers=8) to try and parallelize the process.
One piece of advice that normally goes against parallelization is that, if a lot of data needs to be moved around for each worker, the process can take a long time. My understanding is that the parallelization would move the large dataset to each worker once, and then perform the apply in parallel. This seems to imply that my process will be more efficient under parallelization, even with a large dataset.
I wanted to know if overall I am doing the most advisable thing in terms of speeding up the function. I already switched fully to data.table functions to improve speed, so I don't believe there's a lot to be done within the function.
Tried parallelizing; worried about what's the smartest approach.
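As a point of reference, a minimal sketch of the setup described above might look like the following. The table and column names (full_dt, combos, product_id, store_id, price) and the summary function are hypothetical stand-ins; the key detail is that the big table is exported to each worker once as a global, and future's default export limit usually has to be raised for a table this size.
library(data.table)
library(furrr)

plan(multisession, workers = 8)                   # the question used plan(cluster, workers = 8)
options(future.globals.maxSize = 2 * 1024^3)      # allow the ~1.2 GB table to be shipped to the workers

# hypothetical summary: statistics over similar products, excluding the given product-store
summarise_similar <- function(prod, store, dt) {
  dt[!(product_id == prod & store_id == store),
     .(mean_price = mean(price), n = .N)]
}

# combos is assumed to hold one row per product-store pair
results <- future_map2(combos$product_id, combos$store_id,
                       ~ summarise_similar(.x, .y, full_dt),
                       .options = furrr_options(seed = TRUE))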

Force foreach to combine at some point during a %dopar%

I'm currently facing an issue using foreach (with doParallel) in R. I use sum as the combine function, but it appears that R waits for all the tasks to be completed before performing the sum. The fact is that my objects are pretty heavy and numerous, and my machine is unable to store all of them.
Is there any way to sum on the fly (in other words, to force a call to the combine function within the foreach)? Thank you :)
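For reference, the pattern being described looks roughly like the sketch below (heavy_task is a hypothetical stand-in for the expensive per-task computation). Whether the combine is applied incrementally as results arrive, rather than only after everything finishes, depends on the backend and on options such as .inorder and .maxcombine, so this is a sketch of the setup rather than a guaranteed fix for the memory problem.
library(doParallel)

cl <- makeCluster(4)
registerDoParallel(cl)

# sum the (large) per-task results instead of keeping them all in a list;
# .inorder = FALSE gives foreach the freedom to combine results as they come back
total <- foreach(i = 1:1000, .combine = `+`, .inorder = FALSE) %dopar% {
  heavy_task(i)   # hypothetical: returns a big numeric object
}

stopCluster(cl)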

Spawn processes that run in parallel in R

I am writing a script that needs to run continuously, storing information in a MySQL database.
However, at some point of the day I would like to produce some summaries of the data being collected, but writing this in the same script would stop data collection while the summaries run. Here's a sketch of the problem:
while (TRUE) {
  # get data and store it in the relational database

  # at some point of the day (or time interval) run the summaries
  if (time == certain_time) {
    source("analyze_data.R")
  }
}
The problem is that I'd like the data collection not to stop, with the summaries being executed by another core of the computer.
I have seen references to the packages parallel and multicore, but my impression is that they are meant for repetitive tasks applied over vectors or lists.
You can use parallel to fork a process, but you are right that the program will wait for all the forked processes to come back together before proceeding (that is essentially the use case of parallel).
Why not run two separate R programs, one that collects the data and one that summarizes it? Then you simply run one continuously in the background and the other at set times. The problem then becomes one of getting the data out of the continuous data-gathering program and into the summary program.
Do the logic outside of R:
Write two scripts: one with a while loop storing data, the other with the summary/check. Run the while loop as one process and just leave it running.
Meanwhile, run the other (checking) script on demand to crunch the data, or put it in a cron job.
There are robust tools outside of R to handle this kind of thing; why do it inside R?
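If the launching is to stay inside R, a minimal sketch of the two-script idea could look like the collector below. It assumes the summary code lives in analyze_data.R and that Rscript is on the PATH; system(..., wait = FALSE) starts the summary as a separate process, so collection never blocks.
# collector.R
repeat {
  # ... fetch data and write it to the MySQL database ...

  if (format(Sys.time(), "%H:%M") == "03:00") {
    # launch the summary as a separate R process; wait = FALSE returns immediately
    system("Rscript analyze_data.R", wait = FALSE)
    Sys.sleep(60)   # avoid launching it again within the same minute
  }

  Sys.sleep(1)
}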

Using R Parallel with other R packages

I am working on a very time-intensive analysis using the lqmm package in R. I set the model to start running on Thursday; it is now Monday and it is still running. I am confident in the model itself (tested as a standard MLM), and I am confident in my lqmm code (I have run several other very similar LQMMs with the same dataset, and they all took over a day to run). But I'd really like to figure out how to make this run faster, if possible, using the parallel-processing capabilities of the machines I have access to (note that all are Microsoft Windows based).
I have read through several tutorials on using parallel, but I have yet to find one that shows how to use the parallel package in concert with other R packages. Am I overthinking this, or is it not possible?
Here is the code that I am running using the R package lqmm:
install.packages("lqmm")
library(lqmm)
g1.lqmm <- lqmm(y ~ x + IEP + pm + sd + IEPZ + IEP*x + IEP*pm + IEP*sd + IEP*IEPZ + x*pm + x*sd + x*IEPZ,
                random = ~ 1 + x + IEP + pm + sd + IEPZ,
                group = peers,
                tau = c(.1, .2, .3, .4, .5, .6, .7, .8, .9),
                na.action = na.omit,
                data = g1data)
The dataset has 122433 observations on 58 variables. All variables are z-scored or dummy coded.
The dependent libraries will need to be loaded on all your nodes. The function clusterEvalQ in the parallel package is provided for this purpose. You might also need to export some of your data to the global environments of your subnodes: for this you can use the clusterExport function. See the parallel package documentation for other relevant functions that might be useful to you.
In general, to speed up your application by using multiple cores you will have to split up your problem into multiple sub-pieces that can be processed in parallel on different cores. To achieve this in R, you first need to create a cluster and assign a particular number of cores to it. Next, you have to register the cluster, export the required variables to the nodes and then evaluate the necessary libraries on each of your subnodes. The exact way you set up your cluster and launch the nodes will depend on the sub-libraries and functions you use. As an example, your cluster setup might look like this when you choose to utilize the doParallel package (and most of the other parallelisation sub-libraries/functions):
library(doParallel)
nrCores <- detectCores()
cl <- makeCluster(nrCores)
registerDoParallel(cl)
clusterExport(cl, c("g1data"), envir = environment())
clusterEvalQ(cl,library("lqmm"))
The cluster is now prepared. You can now assign subparts of the global task to each individual node in your cluster. In the general example below each node in your cluster will process subpart i of the global task. In the example we will use the foreach %dopar% functionality that is provided by the doParallel package:
The doParallel package provides a parallel backend for the foreach/%dopar% function using the parallel package of R 2.14.0 and later.
resultList <- foreach(i = 1:nrCores) %dopar% {
  # process part i of your data
}
Subresults will automatically be added to resultList. Finally, when all subprocesses are finished, stop the cluster and merge the results:
stopCluster(cl)
# merge data...
Since your question was not specifically on how to split up your data I will let you figure out the details of this part for yourself. However, you can find a more detailed example using the doParallel package in my answer to this post.
It sounds like you want to use parallel computing to make a single call of the lqmm function execute more quickly. To do that, you either have to:
Split the one call of lqmm into multiple function calls;
Parallelize a loop inside lqmm.
Some functions can be split up into multiple smaller pieces by specifying a smaller iteration value. Examples include parallelizing randomForest over the ntree argument, or parallelizing kmeans over the nstart argument. Another common case is to split the input data into smaller pieces, operate on the pieces in parallel, and then combine the results. That is often done when the input data is a data frame or a matrix.
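As a concrete illustration of the first pattern, a common sketch splits randomForest's ntree argument across workers and merges the resulting forests afterwards (here on the built-in iris data purely as a stand-in):
library(doParallel)
library(randomForest)

cl <- makeCluster(4)
registerDoParallel(cl)

# grow 500 trees as 4 forests of 125 trees each, in parallel, then merge them
rf <- foreach(ntree = rep(125, 4),
              .combine = randomForest::combine,
              .packages = "randomForest") %dopar% {
  randomForest(Species ~ ., data = iris, ntree = ntree)
}

stopCluster(cl)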
But many times, in order to parallelize a function you have to modify it. That can actually be easier, because you may not have to figure out how to split up the problem and combine the partial results yourself: you may only need to convert an lapply call into a parallel lapply (see the sketch below), or convert a for loop into a foreach loop. However, it's often time consuming to understand the code. It's also a good idea to profile the code so that you can verify the parallelization really speeds up the function call.
I suggest that you download the source distribution of the lqmm package and start reading the code. Try to understand its structure and get an idea of which loops could be executed in parallel. If you're lucky, you might figure out a way to split one call into multiple calls, but otherwise you'll have to rebuild a modified version of the package on your machine.
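For the lapply-to-parallel-lapply conversion mentioned above, the change inside the package code would typically look something like this (work_on_piece and pieces are hypothetical placeholders):
library(parallel)

cl <- makeCluster(detectCores() - 1)

# sequential version:
#   results <- lapply(pieces, work_on_piece)
# parallel version (any globals or packages that work_on_piece needs must be
# made available on the workers with clusterExport() / clusterEvalQ()):
results <- parLapply(cl, pieces, work_on_piece)

stopCluster(cl)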

Running two commands in parallel in R on Windows

I have tried reading around on the net about using parallel computing in R.
My problem is that I want to utilize all the cores on my PC, and after reading different resources I am not sure whether I need packages like multicore for my purposes, which unfortunately does not work on Windows.
Can I simply split my very large datasets into several sub-datasets, run the same function on each, and have the function run on different cores? They don't need to talk to each other, and I just need the output from each. Is that really impossible to do on Windows?
Suppose I have a function called timeanalysis() and two datasets, 1 and 2. Can't I call the same function twice, and tell it to use a different core each time?
timeanalysis(1)
timeanalysis(2)
I have found the snowfall package to be the easiest to use on Windows for parallel tasks.
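A minimal sketch of the two-dataset case with snowfall might look like the following; timeanalysis, dataset1 and dataset2 are the asker's own objects and are assumed to exist in the workspace:
library(snowfall)

sfInit(parallel = TRUE, cpus = 2)   # socket-based workers, so this works on Windows

# anything timeanalysis() relies on (packages, helper objects) must be made
# available on the workers, e.g. with sfLibrary() / sfExport()
results <- sfLapply(list(dataset1, dataset2), timeanalysis)

sfStop()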
