Distributed scheduling system for R scripts

I would like to schedule and distribute the execution of R scripts across several machines, Windows or Ubuntu (each task runs on only one machine), using Rserve for instance.
I don't want to reinvent the wheel and would like to use a system that already exists to distribute these tasks in an optimal manner, ideally with a GUI to monitor the proper execution of the scripts.
1/ Is there an R package or a library that can be used for that?
2/ One framework that seems to be quite widely used is MapReduce with Apache Hadoop.
I have no experience with this framework. What installation/plugin/setup would you advise for my purpose?
Edit: Here are more details about my setup:
I do indeed have an office full of machines (small servers or workstations) that are sometimes also used for other purposes. I want to use the computing power of all these machines and distribute my R scripts across them.
I also need a scheduler, i.e. a tool to run the scripts at a fixed time or on a regular basis.
I am using both Windows and Ubuntu, but a good solution on one of the systems would be sufficient for now.
Finally, I don't need the server to collect the results of the scripts. The scripts do things like accessing a database and saving files, but do not return anything. I would just like to get back the errors/warnings, if there are any.

If what you want to do is distribute jobs for parallel execution on machines you have physical access to, I HIGHLY recommend the doRedis backend for foreach. You can read the vignette PDF for more details. The gist is as follows:
Why write a doRedis package? After all, the foreach package already
has available many parallel back end packages, including doMC, doSNOW
and doMPI. The doRedis package allows for dynamic pools of workers.
New workers may be added at any time, even in the middle of running
computations. This feature is relevant, for example, to modern cloud
computing environments. Users can make an economic decision to "turn on"
more computing resources at any time in order to accelerate running
computations. Similarly, modern cluster resource allocation systems can
dynamically schedule R workers as cluster resources become available.
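To make that concrete, here is a minimal sketch of what a doRedis/foreach setup typically looks like. The Redis host name "redis-host" and the queue name "jobs" are placeholders of mine, not values from the vignette, and the toy computation stands in for a call to your real script.

```r
# Master side: register the Redis-backed parallel backend for foreach.
library(doRedis)
library(foreach)

registerDoRedis("jobs", host = "redis-host")   # queue name and host are placeholders

# Iterations are handed out to whatever workers are listening on the queue;
# .errorhandling = "pass" returns error objects instead of aborting the loop,
# which fits the "I only need the errors/warnings back" requirement.
results <- foreach(i = 1:100, .errorhandling = "pass") %dopar% {
  sqrt(i)  # replace with a call to your actual R script / function
}

removeQueue("jobs")

# Worker side (run on each Windows or Ubuntu machine):
# startLocalWorkers(n = 2, queue = "jobs", host = "redis-host")
```

Workers can be started or stopped on any of those machines while a loop is running, which is exactly the dynamic-pool behaviour the vignette describes; scheduling the master script at fixed times can then be left to cron or the Windows Task Scheduler.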
Hadoop works best if the machines running Hadoop are dedicated to the cluster and not borrowed. There is also considerable overhead to setting up Hadoop, which can be worth the effort if you need the map/reduce algorithm and the distributed storage that Hadoop provides.
So what, exactly, is your configuration? Do you have an office full of machines you want to distribute R jobs on? Do you have a dedicated cluster? Is this going to be EC2 or some other "cloud"-based setup?
The devil is in the details, so you can get better answers if the details are explicit.
If you want the workers to do jobs and have the results of those jobs collected back on one master node, you'll be much better off using a dedicated R solution rather than a system like TakTuk or dsh, which are more general parallelization tools.

Look into TakTuk and dsh as starting points. You could perhaps roll your own mechanism with pssh or clusterssh, though these may be more effort.

Related

Is there a way to run complex R workloads in a serverless manner in the cloud?

We are currently running complex R workloads manually on a monster VM in the Azure cloud. Some workloads consume all of the VM's resources and create bottlenecks. Workloads typically take 30 minutes to 3 hours.
Is there a way to improve performance by running R workloads in a serverless and isolated manner, perhaps using containers or cloud functions?
We are also interested in investing in a tool that we could use to manage/administer/orchestrate workloads in a seamless, end-to-end fashion.
Something like Azure Data Factory but for stitching together stuff in R.
Any helpful suggestions would be appreciated. Thank you.
There are several options for running R in Azure. I would think HDInsight, Databricks, or Azure Machine Learning (AML) would be good ones for you to review.

Is there a way to protect my R code that runs on an AWS account owned by a client?

I just joined a company that needs to build an ETL pipeline inside an AWS account owned by a client.
There's one part of the ETL pipeline that runs code written in R. The problem is, this R code is a very important part of our business and our intellectual property. Our clients can't see this code.
Is there any way to run this in their AWS environment without them having access to our code? R is not compilable, so we can't just deploy an executable file there. And we HAVE to run this in their environment. I suggested creating an API to run this in our AWS environment, but this is not an option.
In my experience, these are the options I've seen in situations like this, in increasing order of difficulty:
Take the computation off-premises. This does not sound like an option for you.
Generate an API (e.g., shiny, opencpu, plumber) that is callable from their premises. This might require some finessing on their end, as I'm inferring (since they want it all done within their environment) that they might prefer a locked-down computation (perhaps disabling network access).
Rewrite the sensitive portions in Rcpp. While this does have the possible benefit of speed improvements, it makes it slightly harder for them to "discover" the underlying intellectual property. Realize that R and Rcpp are both GPL, which means that anything linked to by R must also be GPL, meaning source-code available. (It is feasible that, since you are not making it public, you can argue your case here, but I am not a lawyer and would not want to be the first consultant found on the wrong side of GPL law here. Again, IANAL.)
Rewrite the sensitive portions in a non-R executable (note that I don't say "as a non-R library and link to it via R calls", since the linking action taints the library with R's GPL). This executable can be called by your otherwise releasable R package (via system or processx::run).
(For the record, one might infer C or C++ here, but other higher-level languages do allow compilable executables and are not GPL. Python has some such modes. Be sure to obfuscate your variables :-)
I think your "safest" options are #2 and #4.
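As a rough sketch of option #4, the releasable R code only shells out to a compiled binary that holds the sensitive logic. The binary name ./scoring-engine, its flags, and the JSON hand-off below are hypothetical; they just illustrate the shape of the call.

```r
# Hypothetical wrapper around a compiled, non-R executable that contains the
# protected logic. Only this thin R layer would ship to the client's account.
library(processx)
library(jsonlite)

run_model <- function(input_df) {
  input_json <- tempfile(fileext = ".json")
  jsonlite::write_json(input_df, input_json)

  # run() captures stdout/stderr and errors out on a non-zero exit status
  res <- processx::run("./scoring-engine",             # hypothetical binary
                       args = c("--input", input_json))

  jsonlite::fromJSON(res$stdout)                       # binary prints JSON results
}
```

Base R's system2() would work just as well here if you want to avoid the processx dependency; the point is only that the R side contains nothing worth protecting.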

How to get all running processes in Qt

I have two questions:
Is there any API in Qt for getting all the processes that are running right now?
Given the name of a process, can I check if there is such a process currently running?
Process APIs are notoriously platform-dependent. Qt provides just the bare minimum for spawning new processes with QProcess. Interacting with any processes on the system (that you didn't start) is out of its depth.
It's also beyond the reach of things like Boost.Process. Well, at least for now. Note their comment:
Boost.Process' long-term goal is to provide a portable abstraction layer over the operating system that allows the programmer to manage any running process, not only those spawned by it. Due to the complexity in offering such an interface, the library currently focuses on child process management alone.
I'm unaware of any good C++ library for cross-platform arbitrary process listing and management. You kind of have to just pick the platforms you want to support and call their APIs. (Or call out to an external utility of some kind that will give you back the info you need.)

Can the R compiler be used with a web server?

I'm interested in using a statistical programming language within a web site I'm building to do high-performance stats processing that will then be displayed on the web.
I'm wondering if an R compiler can be embedded within a web server and threaded to work well with the LAMP stack so that it can work smoothly with the front-end and back-end of the web site and improve the performance of the site.
If R is not the right choice for such an application, then perhaps there is another tool that is?
The general rule is that the web server should do NO calculations -- whatever you do, it will always end in a bad user experience. Instead, the server should respond to a calculation request by scheduling the job on some worker process, show the user a nice working status, and then push the results obtained from the worker when they are ready (most likely with AJAX polling or a more recent Comet-style approach).
Of course, this requires some RPC protocol to R and some queuing agent -- this can be done either with background processes (easy yet slow), R HTTP servers (more difficult yet faster), or real RPC like Rserve or triggr (hard, yet fast to ultra-fast).
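As a hedged sketch of the "real RPC" route, this is roughly what handing a computation to a separate Rserve process looks like from the RSclient package; the host, port, and the toy model are placeholders, and the queuing/status-pushing machinery around it is left out.

```r
# Client side of an Rserve round trip: the web backend (or a queue worker
# acting on its behalf) pushes the heavy computation to a separate R process.
library(RSclient)

con <- RS.connect(host = "localhost", port = 6311)   # placeholder host/port

# The expression is evaluated in the Rserve worker, not in the web server
result <- RS.eval(con, {
  fit <- lm(mpg ~ wt + hp, data = mtcars)            # stand-in for the real job
  summary(fit)$coefficients
})

RS.close(con)
```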
You are confusing two issues.
Yes, R can be used via a web platform. In fact, the R FAQ has an entire section on this. In the fifteen-plus years that both R and 'the Web' have risen to prominence, many such frameworks have been proposed. And since version 2.13.0, R even has its own embedded web server (to drive documentation display).
Yes, R scripts can run faster via the bytecode compiler, but that does not give you orders of magnitude.
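For completeness, here is a small illustration of the byte-code compiler referred to above, via the base compiler package; the toy function is mine, and on recent R versions the JIT already byte-compiles functions by default, so the gap is usually modest rather than orders of magnitude.

```r
library(compiler)

slow_sum <- function(x) {
  s <- 0
  for (v in x) s <- s + v     # deliberately loop-heavy interpreted code
  s
}

fast_sum <- cmpfun(slow_sum)  # byte-compiled version of the same function

x <- runif(1e6)
system.time(slow_sum(x))
system.time(fast_sum(x))      # typically faster, but the same order of magnitude
```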

Build Server Hardware Configuration

So I've seen this question, but I'm looking for some more general advice: How do you spec out a build server? Specifically what steps should I take to decide exactly what processor, HD, RAM, etc. to use for a new build server. What factors should I consider to decide whether to use virtualization?
I'm looking for the general steps I need to take to come to a decision about what hardware to buy -- steps that lead me to specific conclusions, like "I will need 4 GB of RAM" instead of "as much RAM as you can afford".
P.S. I'm deliberately not giving specifics because I'm looking for the teach-a-man-to-fish answer, not an answer that will only apply to my situation.
The answer is: what requirements will the machine need in order to "build" your code? That is entirely dependent on the code you're talking about.
If it's a few thousand lines of code, then just pull that old desktop out of the closet. If it's a few billion lines of code, then speak to the bank manager about giving you a loan for a blade enclosure!
I think the best place to start with a build server, though, is to buy yourself a new developer machine and then rebuild your old one to be your build server.
I would start by collecting some performance metrics on the build on whatever system you currently use to build. I would specifically look at CPU and memory utilization, the amount of data read and written from disk, and the amount of network traffic (if any) generated. On Windows you can use perfmon to get all of this data; on Linux, you can use tools like vmstat, iostat and top. Figure out where the bottlenecks are -- is your build CPU bound? Disk bound? Starved for RAM? The answers to these questions will guide your purchase decision -- if your build hammers the CPU but generates relatively little data, putting in a screaming SCSI-based RAID disk is a waste of money.
You may want to try running your build with varying levels of parallelism as you collect these metrics as well. If you're using gnumake, run your build with -j 2, -j 4 and -j 8. This will help you see if the build is CPU or disk limited.
Also consider the possibility that the right build server for your needs might actually be a cluster of cheap systems rather than a single massive box -- there are lots of distributed build systems out there (gmake/distcc, pvmgmake, ElectricAccelerator, etc) that can help you leverage an array of cheap computers better than you could a single big system.
Things to consider:
How many projects are going to be expected to build simultaneously? Is it acceptable for one project to wait while another finishes?
Are you going to do CI or scheduled builds?
How long do your builds normally take?
What build software are you using?
Most web projects are small enough (build times under 5 minutes) that buying a large server just doesn't make sense.
As an example,
We have about 20 devs actively working on 6 different projects. We are using a single TFS Build server running CI for all of the projects. They are set to build on every check in.
All of our projects build in under 3 minutes.
The build server is a single quad core with 4GB of RAM. The primary reason we use it is to perform dev and staging builds for QA. Once a build completes, that application is auto-deployed to the appropriate server(s). It is also responsible for running unit and web tests against those projects.
The type of build software you use is very important. TFS can take advantage of each core to parallel build projects within a solution. If your build software can't do that, then you might investigate having multiple build servers depending on your needs.
Our shop supports 16 products that range from a few thousand lines of code to hundreds of thousands of lines (maybe a million+ at this point). We use 3 HP servers (about 5 years old), dual quad-core with 10GB of RAM. The disks are 7200 RPM SCSI drives. Everything is compiled via msbuild on the command line with parallel compilation enabled.
With that setup, our biggest bottleneck by far is the disk I/O. We will completely wipe our source code and re-checkout on every build, and the delete and checkout times are really slow. The compilation and publishing times are slow as well. The CPU and RAM are not remotely taxed.
I am in the process of refreshing these servers, so I am going the route of workstation-class machines, going with 4 instead of 3, and replacing the SCSI drives with the best/fastest SSDs I can afford. If you have a setup similar to this, then disk I/O should be a consideration.
