Running two commands in parallel in R on Windows - r

I have tried reading around on the net about using parallel computing in R.
My problem is that I want to utilize all the cores on my PC, and after reading different resources I am not sure I even need packages like multicore for my purposes, which unfortunately does not work on Windows.
Can I simply split my very large data sets into several sub-datasets, run the same function on each, and have each run on a different core? They don't need to talk to each other, and I just need the output from each. Is that really impossible to do on Windows?
Suppose I have a function called timeanalysis() and two datasets, 1 and 2. Can't I call the same function twice and tell it to use a different core each time?
timeanalysis(1)
timeanalysis(2)

I have found the snowfall package to be the easiest to use on Windows for parallel tasks.
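For example, a minimal sketch with snowfall (assuming your function is called timeanalysis() and your two sub-datasets are named dataset1 and dataset2) might look like this:
library(snowfall)
sfInit(parallel = TRUE, cpus = 2)   # start two workers; socket-based, so it works on Windows
results <- sfLapply(list(dataset1, dataset2), timeanalysis)
sfStop()
# results[[1]] and results[[2]] contain the output of each call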

Related

Using multiple programs in OpenCL

I am writing a piece of code that utilizes the GPU using OpenCL. I succeeded in making a kernel that runs Vector addition (in a function called VecAdd), so I know it is working. Suppose I want to make a second kernel for Vector subtraction VecSub. How should I go about that? Or more specifically: can I use the same context for both the VecAdd and VecSub function?
Hi @debruss, welcome to StackOverflow!
Yes, you certainly can run multiple Kernels in the same Context.
You can define the Kernels in the same or multiple Programs. You could even run them simultaneously in two different Command Queues, or in a single Command Queue configured for out-of-order execution.
There is an example (in rust) of defining and running two Kernels in a Program here: opencl2_kernel_test.rs.

R - Is it possible to run two (or more) consoles using the SAME environment simultaneously?

And if so, how?
I use RStudio. I know I can fork a project in order to perform calculations over two copies of the same environment (as described here). However, it doesn't fit my needs because the environment I'm currently using is very big, and I don't have enough RAM to duplicate it.
Therefore, I am wondering whether there is some way to open two (or more) consoles that share the same environment (in particular, I am interested in not having to replicate the very big data frames).
Is there a way to use RStudio like this, or is there any other IDE or tool that supports it?
Thank you for your help.
EDIT:
I will explain what I'm trying to do: I'm developing some machine learning models based on a large dataset.
I load the dataset into a data frame.
Then I perform different treatments over the data in order to transform them into ML-friendly data.
I perform these two steps in one R script, and I end up with an environment loaded with a heavy data frame, libraries and some other objects.
Then I'm using this dataset to feed several ML models: those models are of different classes, and within each class I'm trying several models with different parameters.
I have one R script for each class of models, and I would like to run and score each class in parallel. The models within each class will run sequentially.
The key here is: I know I can use different projects to do this, but that would mean loading the same environment several times, which is problematic for me because it would mean loading the same big data frame into RAM several times. Therefore I would like to know if there is a way to have several R scripts run in parallel while using the same environment.
Then I will use another script to rank all the models.

Sharing a data.table in memory for parallel computing

Following the post about data.table and parallel computing, I'm trying to find a way to parallelize an operation on a data.table.
I have a data.table with 4 million rows and 14 variables and would like to share it in common memory so that operations on it can be parallelized using the "parallel" package with parLapply, without having to copy the table for each node in the cluster (which is what parLapply does). At the moment the cost of moving the data.table around is bigger than the benefit of parallel computation.
I found the "bigmemory" package as an answer for sharing memory, but it doesn't maintain the data.table structure of the data. So does anyone know a way to:
1) put the data.table in shared memory
2) maintain the "data.table"-structure of the data by doing so
3) use parallel processing on this data.table?
Thanks in advance!
Old question, but here is an answer since nobody else has answered and it might be helpful. I assume the problem you are having is that you are on Windows and have to use the PSOCK type of cluster. Unfortunately, on Windows this means you have to copy the data to each node. However, there is a workaround. Get hold of docker and spin up an Rserve instance on the docker VM (e.g. stevenpollack/docker-rserve). Since this will be Linux based, you can create a FORK cluster on the docker VM. Then, using your native R instance, you can send only one copy of the data over to the Rserve instance (check out the RSclient library), do your parallelized job on the VM, and collect the results back into your native R.
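A rough sketch of that workflow with RSclient (the VM address, port and column name below are placeholders, not values from the question):
library(RSclient)
con <- RS.connect(host = "192.168.99.100", port = 6311)  # address of the Rserve instance on the docker VM
RS.assign(con, "dt", dt)                                 # ship the data.table over once
res <- RS.eval(con, {
  library(data.table); library(parallel)
  cl <- makeCluster(4, type = "FORK")                    # FORK works because the VM runs Linux
  out <- parLapply(cl, 1:4, function(i) dt[, mean(V1)])  # stand-in for your real operation
  stopCluster(cl)
  out
})
RS.close(con)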
The "complete" solution, shared read and write access from multiple processes, and their problems is discussed here: https://github.com/Rdatatable/data.table/issues/3104
As rookie mentioned, if you fork an R process (with parallel::makeCluster(type = "FORK") or future::plan(multicore); note that this does not work reliably in RStudio), the operating system will reuse memory pages that are not modified by the child process. So your workers will share the same memory as long as they don't modify it (copy-on-write). But this only works if all parallel workers are on the same machine, and fork() has its own problems (although this might be going too far if you simply want to conduct some parallel analysis).
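A minimal sketch of that copy-on-write behaviour on a Linux/macOS machine (the column names here are made up):
library(parallel)
library(data.table)
dt <- data.table(x = rnorm(4e6), g = sample(1:14, 4e6, replace = TRUE))  # stand-in for your table
cl <- makeCluster(4, type = "FORK")   # workers inherit dt without copying it; not available on Windows
res <- parLapply(cl, unique(dt$g), function(k) dt[g == k, mean(x)])
stopCluster(cl)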
Meanwhile, you could find the packages feather and fst interesting. feather provides a file format that can be read by both R and Python, and if I understood the docs correctly, feather::feather() gives you a file-backed, read-only data frame, albeit not a data.table. This allows for moving data between those two languages.
fst employs the Zstandard compression algorithm to achieve very fast reading and writing speeds to disk. You can read in a part of a fst file using the fst() function (instead of read_fst()). So, every worker could just read the part of your table that it needs. Concurrent writing to the fst file is not possible. You would need to save every result in its own file and concatenate them afterwards.
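A hedged sketch of the fst route (the file name and row ranges are placeholders):
library(fst)
write_fst(dt, "big_table.fst")   # write the table to disk once
# each worker then reads only the rows it needs:
chunk <- read_fst("big_table.fst", from = 1, to = 1e6, as.data.table = TRUE)
# ... process chunk, save the result to its own file, and concatenate the files afterwards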
Alternatively, for concurrent reading and writing, you could switch to a database, albeit that is slower than data.table. See SO/SQLite concurrent access

Using R Parallel with other R packages

I am working on a very time-intensive analysis using the LQMM package in R. I set the model running on Thursday; it is now Monday and it is still running. I am confident in the model itself (tested as a standard MLM), and I am confident in my LQMM code (I have run several other very similar LQMMs with the same dataset, and they all took over a day to run). But I'd really like to figure out how to make this run faster, if possible, using the parallel processing capabilities of the machines I have access to (note they are all Microsoft Windows based).
I have read through several tutorials on using parallel, but I have yet to find one that shows how to use the parallel package in concert with other R packages. Am I overthinking this, or is it not possible?
Here is the code that I am running using the R package LQMM:
install.packages("lqmm")
library(lqmm)
g1.lqmm <- lqmm(y ~ x + IEP + pm + sd + IEPZ + IEP*x + IEP*pm + IEP*sd + IEP*IEPZ + x*pm + x*sd + x*IEPZ,
                random = ~ 1 + x + IEP + pm + sd + IEPZ,
                group = peers,
                tau = c(.1, .2, .3, .4, .5, .6, .7, .8, .9),
                na.action = na.omit,
                data = g1data)
The dataset has 122433 observations on 58 variables. All variables are z-scored or dummy coded.
The dependent libraries will need to be loaded on all your nodes. The parallel package provides the clusterEvalQ function for this purpose. You might also need to export some of your data to the global environments of your subnodes; for this you can use the clusterExport function. Also see this page for more info on other relevant functions that might be useful to you.
In general, to speed up your application by using multiple cores you will have to split up your problem into multiple subpieces that can be processed in parallel on different cores. To achieve this in R, you will first need to create a cluster and assign a particular number of cores to it. Next, you will have to register the cluster, export the required variables to the nodes and then evaluate the necessary libraries on each of your subnodes. The exact way that you set up your cluster and launch the nodes will depend on the type of sublibraries and functions that you will use. As an example, your cluster setup might look like this when you choose to utilize the doParallel package (and most of the other parallelisation sublibraries/functions):
library(doParallel)
nrCores <- detectCores()
cl <- makeCluster(nrCores)
registerDoParallel(cl)
clusterExport(cl, c("g1data"), envir = environment())
clusterEvalQ(cl, library("lqmm"))
The cluster is now prepared. You can now assign subparts of the global task to each individual node in your cluster. In the general example below each node in your cluster will process subpart i of the global task. In the example we will use the foreach %dopar% functionality that is provided by the doParallel package:
The doParallel package provides a parallel backend for the
foreach/%dopar% function using the parallel package of R 2.14.0 and
later.
resultList <- foreach(i = 1:nrCores) %dopar%
{
  # process part i of your data
}
Subresults will automatically be added to resultList. Finally, when all subprocesses are finished, we stop the cluster and merge the results:
stopCluster(cl)
# merge data..
Since your question was not specifically on how to split up your data I will let you figure out the details of this part for yourself. However, you can find a more detailed example using the doParallel package in my answer to this post.
It sounds like you want to use parallel computing to make a single call of the lqmm function execute more quickly. To do that, you either have to:
Split the one call of lqmm into multiple function calls;
Parallelize a loop inside lqmm.
Some functions can be split up into multiple smaller pieces by specifying a smaller iteration value. Examples include parallelizing randomForest over the ntree argument, or parallelizing kmeans over the nstart argument. Another common case is to split the input data into smaller pieces, operate on the pieces in parallel, and then combine the results. That is often done when the input data is a data frame or a matrix.
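For illustration, a sketch of the randomForest case (x and y below are placeholders for your predictors and response):
library(doParallel)
library(randomForest)
cl <- makeCluster(4)
registerDoParallel(cl)
rf <- foreach(nt = rep(125, 4), .combine = randomForest::combine,
              .packages = "randomForest") %dopar% {
  randomForest(x, y, ntree = nt)   # each worker grows 125 of the 500 trees
}
stopCluster(cl)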
But many times, in order to parallelize a function you have to modify it. That may actually be easier, because you may not have to figure out how to split up the problem and combine the partial results. You may only need to convert an lapply call into a parallel lapply, or convert a for loop into a foreach loop. However, it's often time consuming to understand the code. It's also a good idea to profile the code to make sure that your parallelization really speeds up the function call.
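For instance, converting a serial lapply into a parallel one might look like this (fit_one_tau and tau_values are hypothetical names for illustration, not actual lqmm internals):
library(parallel)
results <- lapply(tau_values, fit_one_tau)           # original serial loop
cl <- makeCluster(detectCores())
clusterExport(cl, "g1data")                          # export whatever objects the function needs
clusterEvalQ(cl, library(lqmm))
results <- parLapply(cl, tau_values, fit_one_tau)    # parallel drop-in replacement
stopCluster(cl)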
I suggest that you download the source distribution of the lqmm package and start reading the code. Try to understand its structure and get an idea of which loops could be executed in parallel. If you're lucky, you might figure out a way to split one call into multiple calls, but otherwise you'll have to rebuild a modified version of the package on your machine.

Running R jobs on a grid computing environment

I am running some large regression models in R in a grid computing environment. As far as I know, the grid just gives me more memory and faster processors, so I think this question would also apply to those who are using R on a powerful computer.
The regression models I am running have lots of observations, and several factor variables that have many (10s or 100s) of levels each. As a result, the regression can get computationally intensive. I have noticed that when I line up 3 regressions in a script and submit it to the grid, it exits (crashes) due to memory constraints. However, if I run it as 3 different scripts, it runs fine.
I'm doing some cleanup, so after each model runs I save the model object to a separate file, call rm(list=ls()) to clear all memory, and then run gc() before the next model is run. Still, running all three in one script seems to crash, but breaking up the job seems to be fine.
The sysadmin says that breaking it up is important, but I don't see why, if I'm cleaning up after each run. Running 3 in one script just runs them in sequence anyway. Does anyone have an idea why running three individual scripts works, but running all the models in one script causes R to have memory issues?
thanks! EXL
Similar questions that are worth reading through:
Forcing garbage collection to run in R with the gc() command
Memory Usage in R
My experience has been that R isn't superb at memory management. You can try putting each regression in a function in the hope that letting variables go out of scope works better than gc(), but I wouldn't hold your breath. Is there a particular reason you can't run each in its own batch? More information as Joris requested would help as well.
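A sketch of the wrap-each-regression-in-a-function idea (the formula, data and file names below are placeholders):
run_model <- function(formula, data, path) {
  fit <- lm(formula, data = data)   # or your actual regression call
  saveRDS(fit, path)                # save the model object to its own file
  invisible(NULL)                   # fit goes out of scope when the function returns
}
run_model(y ~ x1 + x2 + factor(region), bigdata, "model1.rds")
gc()
run_model(y ~ x1 + x3 + factor(region), bigdata, "model2.rds")
gc()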
