Is it possible to launch multiple EC2 instances within R?

This is more of a beginner's question. Say I have the following code:
library("multicore")
library("iterators")
library("foreach")
library("doMC")
registerDoMC(16)
foreach(i in 1:M) %dopar% {
##do stuff
}
This code will then run on 16 cores, if they are available. Now, if I understand correctly, a single Amazon EC2 instance gives me only a few cores, depending on the instance type. So if I want to run simulations on 16 cores, I need to use several instances, which, as far as I understand, means launching new R processes. But then I need to write additional code outside of R to gather the results.
So my question is: is there an R package that lets me launch EC2 instances from within R, automagically distributes the load between those instances, and gathers the results back in the R session that launched them?

To be precise, the largest instance type on EC2 currently has 8 cores, so anyone, R user or not, needs multiple instances in order to run concurrently on more than 8 cores.
If you want to use more instances, then you have two options for deploying R: "regular" R invocations or MapReduce invocations. In the former case, you will have to set up code to launch instances, distribute tasks (e.g. the independent iterations in foreach), return results, and so on. This is doable, but you're not likely to enjoy it. Here you can use something like rmr or RHIPE to manage a MapReduce grid, or snow and many other HPC tools to create a simple grid. Using snow may make it easier to keep your code intact, but you will have to learn how to tie this stuff together (a sketch follows below).
In the latter case, you can build upon infrastructure that Amazon provides, such as Elastic MapReduce (EMR) and packages that make it simpler, such as JD's segue. I'd recommend segue as a good starting point, as others have done, since it has a gentler learning curve. The developer is also on SO, so you can easily query him when it breaks.
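For the snow-style route, here is a minimal sketch using the parallel and doParallel packages (parallel incorporates snow's socket clusters). The host names, login user, and Rscript path are placeholders; it assumes the EC2 instances are already running, reachable over passwordless ssh, and have R plus the needed packages installed.
library(foreach)
library(doParallel)

# Hypothetical public host names of EC2 instances you have already launched
hosts <- c("ec2-host-1.compute-1.amazonaws.com",
           "ec2-host-2.compute-1.amazonaws.com")

# One socket worker per remote core (here assuming 8 cores per instance)
cl <- makePSOCKcluster(rep(hosts, each = 8),
                       user = "ubuntu",               # placeholder ssh login
                       rscript = "/usr/bin/Rscript")  # path to Rscript on the workers
registerDoParallel(cl)

M <- 1000
results <- foreach(i = 1:M) %dopar% {
  ## do stuff, exactly as in the original loop
}

stopCluster(cl)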


Overview of the different Parallel programming methods [closed]

I'm learning how to use parallel programming (specifically in R, but I'm trying to make this question as general as possible). There are many different libraries that work with it, and understanding their differences is hard without knowing the computer science terms used in their descriptions.
I identified some attributes that define these categories, such as: fine-grained and coarse-grained, explicit and implicit, levels of parallelism (bit-level etc.), classes of parallel computers (multi-core computing, grid computing, etc.), and what I call "methods" (I will explain what I mean by that later).
First question: is that list complete? Or are there other relevant attributes that define categories of parallel programming?
Secondary question: for each attribute, what are the pros and cons of the different options? When to use each option?
About the "methods": I saw materials talking about socket and forking; other talking about Parallel Virtual Machine (PVM) and Message Passing Interface (MPI) (and another option called "NWS"); and some other specific methods like Hadoop/map-reduce and futures (from R's "future" package).
Second question: I don't understand some of them, and I'm not sure if it makes sense to join them in this list that I called (parallel processing) "methods". Also, are there other "methods" that I left out?
Secondary question: what are the pros and cons of each of these methods, and when is it better to use each?
Third question: in light of this categorization and the discussion on the pros and cons of each category, how can we make an overview of the parallel computing libraries in R?
My post might be too broad and ask too much at once, so I'll answer it with what I have found so far, and maybe you can correct it or add to it in your own answer. The points I feel are most lacking are the pros/cons of each "method" and of each R package.
OP here. I can't answer the "is that list complete?" part of the questions, but here are the explanations of each attribute and the pros and cons of each option. To reiterate: I'm new to this subject and might write something false or misleading.
Categories
Fine-grained, coarse-grained, and embarrassing parallelism (ref):
Attribute: how often their subtasks need to synchronize or communicate with each other.
Fine-grained parallelism: subtasks must communicate many times per second;
Coarse-grained parallelism: subtasks do not communicate many times per second;
Embarrassing parallelism: subtasks rarely or never have to communicate.
When to use each: self-explanatory
Explicit and implicit parallelism (ref):
Attribute: the need to write code that directly instructs the computer to parallelize
Explicit: needs it
Implicit: automatically detects a task that needs parallelism
When to use each: parallelizing a task introduces overhead and complexity, so implicit parallelism, which leaves the decision to the system, can lead to inefficiencies in some cases; explicit parallelism gives you that control at the cost of extra code.
Types/levels of parallelism (ref):
Attribute: the code level where parallelism happens.
Bit-level, instruction-level, task and data-level, superword-level
When to use each: from what I understood, the most common statistics/R applications use task- and data-level parallelism, so I didn't research when to use the other ones.
Classes of parallel computers (ref)
Attribute: the level at which the hardware supports parallelism
Multi-core computing, Symmetric multiprocessing, Distributed computing, Cluster computing, Massively parallel computing, Grid computing, Specialized parallel computers.
When to use each: from what I understood, unless you have a really big task that needs several external computers, multi-core computing (using only your own machine) is enough.
Parallelism "method":
Attribute: different methods (there probably is a better word for this).
This post makes a distinction between the socket approach (launches a fresh copy of R on each core) and the forking approach (copies the entire current R session onto each core). Forking isn't supported on Windows.
This post distinguishes between the Parallel Virtual Machine (PVM) and the Message Passing Interface (MPI) (a distinction I couldn't quite understand). Apparently there is another option called "NWS", about which I couldn't find much information.
This CRAN task view contains a list of packages, grouped by topic, that are useful for high-performance computing in R. It refers to Hadoop, which is built on the map-reduce model, and to the "future" package, which uses "futures" to introduce parallelism to R.
When to use each: with sockets, each worker starts clean (avoiding cross-contamination) and runs on any system, while forking is faster and gives each process access to your entire workspace. I couldn't find enough information to discuss PVM, MPI, and NWS, and I didn't go in depth into Hadoop and futures, so there is a lot of room to contribute to this paragraph. A small socket-vs-fork sketch follows below.
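A minimal sketch of the two approaches with base R's parallel package (the object and worker counts are illustrative):
library(parallel)

x <- rnorm(1e6)  # an object living in the current workspace

# Socket ("PSOCK") cluster: fresh R sessions, works on every OS,
# but objects must be shipped to the workers explicitly.
cl <- makeCluster(4, type = "PSOCK")
clusterExport(cl, "x")
res_sock <- parLapply(cl, 1:4, function(i) mean(x) + i)
stopCluster(cl)

# Forking: workers are copies of the current session, so they already
# "see" x without exporting it, but this is not available on Windows.
res_fork <- mclapply(1:4, function(i) mean(x) + i, mc.cores = 4)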
R packages
The CRAN task view is a great reference for this. It separates packages that do explicit parallel processing from those that do implicit parallel processing, most of which work with multicore processing (while also pointing out some that do grid processing). It points to a specific group of packages that use Hadoop, and to other groups of parallel processing tools and specific applications. As they're more common, I'll list the explicit parallelism ones. Package names inside parentheses are successors or extensions of the listed package.
rpvm (no longer actively maintained): PVM dedicated;
Rmpi (pdbMPI): MPI dedicated;
snow (snowFT, snowfall): works with PVM, MPI, NWS and socket, but not forking;
parallel (parallelly): built upon multicore (forking-focused and no longer actively maintained) and snow; it ships with base R;
future (future.apply and furrr): "parallel evaluations via abstraction of futures"
foreach: needs a "parallel backend", which can be doMPI (using Rmpi), doMC (using multicore), doSNOW (using snow), doPararell (using parallel), and doFuture (using future)
RHIPE, rmr, segue, and RProtoBuf: use Hadoop and other map-reduce techniques.
I also didn't go in depth into how each package works, so there is room to add pros/cons and when to use each; a small example with the future family is sketched below.
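As one concrete illustration of the futures approach listed above (the slow_square toy function is made up for the example):
library(future)
library(future.apply)

plan(multisession, workers = 4)   # socket-style workers; plan(multicore) would fork on Unix

slow_square <- function(x) {      # made-up toy function standing in for real work
  Sys.sleep(0.5)
  x^2
}

# Same call signature as sapply, but the iterations run in parallel
res <- future_sapply(1:8, slow_square)

plan(sequential)                  # return to ordinary sequential evaluation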

Cyclic Workflow in Data Comparison Process

I am searching for a solution to automate an iterative data comparison process that repeats until all data packages are consistent. My first thought is to use something like Apache Airflow, but the iterative nature seems to imply a cyclic graph, and Apache Airflow only allows DAGs (directed acyclic graphs). Since I don't have much knowledge of Airflow either, I am a bit lost and would appreciate some expert knowledge here.
Current status: I am in a position where I regularly need to compare data packages for consistency and manually communicate errors to, and between, two different parties.
On the one hand there is a design data set and on the other hand there are measured data sets. Both datasets involve many manual steps from different parties. So if an inconsistency occurs, I contact one or the other party and the error is removed manually. There are also regular changes to both data sets that can introduce new errors to already checked datasets.
I guess this process has not been automated yet because the datasets are not directly comparable; some transformations need to happen in between. I automated this transformation process over the last few weeks, so all that remains on my side is to run the script and communicate the errors.
What I need now is a tool that runs my script against the correct datasets and contacts the relevant people as long as errors exist. Whenever something changes or is added, the script needs to run again.
My first guess was that I would need to create a workflow in Apache Airflow, but this iterative process looks to me like a cyclic graph, which is not allowed in Airflow. Do you have any suggestions, or is this a common situation for which Airflow solutions also exist?
I think one way to solve your problem could be to have a DAG for the main task of comparing the datasets and sending notifications, and then run a periodic job in Cron, Quartz, etc. that triggers that DAG. You are correct that Airflow does not allow cyclic workflows.
I worked on Cylc, a cyclic-graph workflow tool. Cyclic workflows (or workflows with loops) are very common in areas such as Numerical Weather Prediction (NWP, the reason Cylc was created) and in other fields such as optimization.
In NWP workflows, some steps may be waiting for datasets, and the workflow may stall and send notifications if the data is not as expected (e.g. some satellite imaging data is missing, or the tide model output file is missing).
Also, in production, NWP models run multiple times a day, either because you have new observation data, or new input data, or because you want to run ensemble models, etc. So you end up with multiple runs of the workflow in parallel, where the workflow manager is responsible for managing dependencies, optimizing the use of resources, sending notifications, and more.
Cyclic workflows are complicated, which is probably why most implementations opt to support only DAGs.
If you'd like to try Cylc, the team has been trying to make it more generic so that it's not specific to NWP only. It has a new GUI, and the input format and documentation were improved with ease of use in mind.
There are other tools that support cyclic workflows too, such as StackStorm and Prefect, and I am currently checking whether Autosubmit supports them as well. Take a look at these tools if you'd like.
If you are in the life sciences, or interested in reproducible workflows, the CWL standard also has some ongoing discussion about adding support for loops, which I reckon could let you achieve something akin to what you described.

Is it possible to run two websocket connections asynchronously or use the same variables in two separate sessions in R?

I'm setting up an algorithm in R which involves performing tasks on data that is streamed from a websocket. I have successfully implemented a connection to a websocket (using the package https://github.com/rstudio/websocket), but my algorithm does not perform optimally due to the linear processing of data received from the websocket (some important tasks get delayed because less important ones get triggered before them). As the tasks could easily be divided, I am wondering:
1) whether it would be possible to run two websocket connections simultaneously, provided that there is a single data frame (as a global variable) that gets updated in one instance and is accessible from the other?
2) is it possible to inspect the queue of messages from the websocket and prioritize certain tables?
3) I am also considering a solution with two separate R sessions, but I am not sure whether there is a way to access data that gets updated in real time in another R session. Is there a workaround that does not involve saving a table in one session and loading it in the other?
I have already tried this with async (https://github.com/r-lib/async) without much success, and I have also tried the 'Jobs' panel in newer versions of RStudio.
I would be happy to provide some code, but at this point I think it might be irrelevant, as my question is more about extending the code than fixing it. I am also aware that Python probably offers an easier solution, but I would still like to exhaust every option that R offers.
Any help would be greatly appreciated!
In the end I was able to accomplish this by running two separate R instances and connecting them using R socket connections.
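A minimal sketch of that pattern with base R's socketConnection (the port number and the example data frame are arbitrary; run the first chunk in one R session and the second chunk in another):
## Session 1 (server): waits for a client, then sends a data frame
con <- socketConnection(host = "localhost", port = 6011,
                        server = TRUE, blocking = TRUE, open = "a+b")
df <- data.frame(symbol = c("A", "B"), price = c(1.2, 3.4))
serialize(df, con)     # write the object onto the socket
close(con)

## Session 2 (client): connects to the server and reads the object
con <- socketConnection(host = "localhost", port = 6011,
                        server = FALSE, blocking = TRUE, open = "a+b")
df <- unserialize(con) # receives the data frame sent by session 1
close(con)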

Memory virtualization with R on cluster

I know almost nothing about parallel computing, so this question might be very naive, and maybe what I would like to do is impossible.
I am using a Linux cluster with 40 nodes; however, since I don't know how to write parallel code in R, I am limited to using only one. On this node I am trying to analyse data which floods the memory (around 64 GB). So my problem isn't a lack of computational power but rather a memory limitation.
My question is whether it is even possible to use some R package (like doSNOW) for implicit parallelisation across 2-3 nodes to increase the RAM limit, or would I have to rewrite the script from the ground up to make it explicitly parallelised?
Sorry if my question is naive; any suggestions are welcome.
Thanks,
Simon
I don't think there is such a package. The reason is that it would not make much sense to have one. Memory access is very fast, and accessing data from another computer over the network is very slow compared to that. So if such a package existed it would be almost useless, since the processor would need to wait for data over the network all the time, and this would make the computation very very slow.
This is true for common computing clusters, built from off-the-shelf hardware. If you happen to have a special cluster where remote memory access is fast, and is provided as a service of the operating system, then of course it might be not that bad.
Otherwise, what you need to do is to try to divide up the problem into multiple pieces, manually, and then parallelize, either using R, or another tool.
An alternative would be to keep some of the data on disk, instead of loading all of it into memory. You still need to (more or less) divide up the problem, to make sure that the part of the data in memory is used for a reasonable amount of computation before loading the next part.
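A minimal sketch of that keep-data-on-disk approach in base R, processing a large numeric CSV in chunks (the file name, chunk size, and column-sum task are placeholders):
path <- "big_data.csv"   # placeholder file name
chunk_rows <- 1e6        # number of rows held in memory at a time

con <- file(path, open = "r")
header <- strsplit(readLines(con, n = 1), ",")[[1]]   # column names from the first line
totals <- setNames(numeric(length(header)), header)

repeat {
  lines <- readLines(con, n = chunk_rows)
  if (length(lines) == 0) break                       # end of file reached
  chunk <- read.csv(text = lines, header = FALSE, col.names = header)
  totals <- totals + colSums(chunk)                   # in-memory work on this chunk only
}
close(con)
totals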
Whether it is worth (or even possible) doing either of these depends completely on your application.
Btw. a good list of high performance computing tools in R is here:
http://cran.r-project.org/web/views/HighPerformanceComputing.html
For future inquiry:
You may want to have a look at two packages "snow" and "parallel".
Library "snow" extends the functionality of apply/lapply/sapply... to work on more than one core and/or one node.
Of course, you can perform simple parallel computing using more than one core:
#SBATCH --cpus-per-task= (enter some number here)
You can also perform parallel computing using more than one node (preferably with the previously mentioned libraries) using:
#SBATCH --ntasks-per-node= (enter some number here)
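Putting those pieces together, a hedged sketch of a multi-node setup inside a SLURM job might look like the following (the scontrol call and the per-node worker count are assumptions about a typical SLURM installation):
library(parallel)

# Host names of the nodes SLURM allocated to this job (assumes the
# standard scontrol utility is available on the cluster)
nodes <- system("scontrol show hostnames $SLURM_JOB_NODELIST", intern = TRUE)

# e.g. 8 workers per node; match this to --ntasks-per-node / --cpus-per-task
cl <- makePSOCKcluster(rep(nodes, each = 8))

res <- parLapply(cl, 1:100, function(i) {
  ## one unit of work per iteration
  i^2
})

stopCluster(cl)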
However, for some applications you may want to consider using Python instead of R, where parallelism can be much more efficient with "Dask" workers.
You might want to take a look at TidalScale, which can allow you to aggregate nodes on your cluster to run a single instance of Linux with the collective resources of the underlying nodes (www.tidalscale.com). Even though the R application may be inherently single-threaded, you'll be able to give it a single, coherent memory space across the nodes that is transparent to the application.
Good luck with your project!

R Parallel, is it better to create a new cluster every time you use parallel apply?

I am using the parLapply function on 20 cores. I guess it is the same for the other functions, parSapply etc.
First, is it a bad practice to pass a cluster as an argument into a function so that the function can then dispatch the use of the cluster between different subfunctions?
Second, since I pass this cluster argument into a function, I suppose it is the same cluster every time I use parLapply. Would it be better to create a new cluster for every parLapply call?
Thanks
Rgds
I am not an expert in parallel computing but will venture an answer anyway.
1) It is not bad practice to pass the cluster around as a function argument. It is just a collection of connections to worker processes, similar to connections to a file.
2) Restarting the cluster between calls is not needed. There will be problems if something has gone seriously wrong with a worker process, but in that case I would recommend cancelling the whole computation and restarting the master process too.
It's not bad practice to pass cluster objects as function arguments, so I don't see anything wrong with using them to dispatch between different sub-functions.
The problem with creating cluster objects for each operation is that it may take significant time, especially when starting many workers via ssh on a cluster, for example. However, it may be useful in some cases, and I think that may be the purpose of fork clusters, which are created by the makeForkCluster function in the parallel package. Fork cluster workers are very much like the workers forked by mclapply, which does create new workers every time it's called. However, I think this is rarely done.
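A minimal sketch of the create-once, reuse-everywhere pattern discussed above (the sub-function names and the toy model fit are illustrative):
library(parallel)

cl <- makeCluster(20)   # create the socket cluster once

# Sub-functions simply take the cluster as an argument and reuse it
fit_models <- function(cl, datasets) {
  parLapply(cl, datasets, function(d) lm(y ~ x, data = d))
}
summarise_fits <- function(cl, fits) {
  parSapply(cl, fits, function(m) summary(m)$r.squared)
}

datasets <- replicate(100, data.frame(x = rnorm(50), y = rnorm(50)),
                      simplify = FALSE)
fits <- fit_models(cl, datasets)
r2   <- summarise_fits(cl, fits)

stopCluster(cl)         # shut the workers down once, at the end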
