Install R on the nodes for Azure batch services - r

I can create the batch service resources using Power shell as described here: https://learn.microsoft.com/en-us/azure/batch/batch-powershell-cmdlets-get-started
I want to run a R script on the nodes and I need R installed on the nodes as none of the available VM's(windows or linux) come with R installed. I have currently installed R by manually logging into the VM. But I want to create the batch resources and then install R on the nodes preferably through a script before I run the R code. How can I go about this?

There are 4 main ways to load necessary software on to VMs:
Create a start task along with potentially resource files to prep the compute node per your requirements.
Create a custom image that already contains all of your software preconfigured.
Use containers instead of directly loading software on the compute node.
Utilize application packages.

Related

Using Git to deploy R Package to remote server

I'm barely new to CI/CD piplines with GIT and hope that my questions is related to that functionality. I want to achieve the following:
Whenever a new commit (or tag) of a specific package in our companies local git server is created, I want to execute a shell script on a remote server. This shell script should download this new version of the package and install it into the global R-Installation, making it available for different processes that use this R environment. So far this is more like a manual step but I really would like to automate the process. Is this possible?
Thanks!

Running R script on hadoop and mapreduce

I have an R-script that does stuff with a bunch of tweets and I would like to use the same script on the same data but saved in an Hadoop file system. According to this Hortonworks tutorial I could use R code with data from my HDFS, but it is not quite clear.
Can I use the very same R-script, taking advantage of the mapreduce paradigm, by using this Revolution R? Should I change my code or is there a way to execute the same functions optimized for an Hadoop architecture?
My wish would be to write my code on a standard R IDE like R-Studio and then use it, or use the most of it, on my cloud services (such as Microsoft Azure) with mapreduce on the base.
Yes, you can run any R script across different data platform from Hadoop to Spark to Teradata and SQL Server by using environment specific compute context.
Following two links should help you get started on how to use Revolution R / Microsoft R Server on Hadoop:
https://msdn.microsoft.com/en-us/microsoft-r/scaler-hadoop-getting-started
https://github.com/Azure/Azure-MachineLearning-DataScience/blob/master/Misc/MicrosoftR/Samples/NYCTaxi/NYC2013_MRS_LinearBinary.Rmd

How do I setup and run SparkR projects and scripts (like a jar file)?

We have successfully gone through all the SparkR tutorials about setting it up and running basic programs in RStudio on an EC2 instance.
What we can't figure out now is how to then create a project with SparkR as a dependency, compile/jar it, and run any of the various R programs within it.
We're coming from Scala and Java, so we may be coming at this with the wrong mindset. Is this even possible in R or is it done differently that Java's build files and jars or do you just have to run each R script individually without a packaged jar?
do you just have to run each R script individually without a packaged jar?
More or less. While you can create a R package(-s) to store reusable parts of your code (see for example devtools::create or R packages) and optionally distribute it over the cluster (since current public API is limited to high level interactions with JVM backend it shouldn't be required), what you pass to spark-submit is simply a single R script which:
creates a SparkContext - SparkR::sparkR.init
creates a SQLContext / HiveContexts - SparkR::sparkRSQL.init / SparkR::sparkRHive.init
executes the rest of your code
stops SparkContext - SparkR::sparkR.stop
Assuming that external dependencies are present on the workers, missing packages can installed on the runtime using if not require pattern, for example:
if(!require("some_package")) install.packages("some_package")
or
if(!require("some_package")) devtools::install_github("some_user/some_package")

use julia language without internet connection (mirror?)

Problem:
I would like to make julia available for our developers on our corporate network, which has no internet access at all (no proxy), due to sensitive data.
As far as I understand julia is designed to use github.
For instance julia> Pkg.init() tries to access:
git://github.com/JuliaLang/METADATA.jl
Example:
I solved this problem for R by creating a local CRAN repository (rsync) and setting up a local webserver.
I also solved this problem for python the same way by creating a local PyPi repository (bandersnatch) + webserver.
Question:
Is there a way to create a local repository for metadata and packages for julia?
Thank you in advance.
Roman
Yes, one of the benefits from using the Julia package manager is that you should be able to fork METADATA and host it anywhere you'd like (and keep a branch where you can actually check new packages before allowing your clients to update). You might be one of the first people to actually set up such a system, so expect that you will need to submit some issues (or better yet; pull requests) in order to get everything working smoothly.
See the extra arguments to Pkg.init() where you specify the METADATA repo URL.
If you want a simpler solution to manage I would also think about having a two tier setup where you install packages on one system (connected to the internet), and then copy the resulting ~/.julia directory to the restricted system. If the packages you use have binary dependencies, you might run into problems if you don't have similar systems on both sides, or if some of the dependencies is installed globally, but Pkg.build("Pkgname") might be helpful.
This is how I solved it (for now), using second suggestion by
ivarne.I use a two tier setup, two networks one connected to internet (office network), one air gapped network (development network).
System information: openSuSE-13.1 (both networks), julia-0.3.5 (both networks)
Tier one (office network)
installed julia on an NFS share, /sharename/local/julia.
soft linked /sharename/local/bin/julia to /sharename/local/julia/bin/julia
appended /sharename/local/bin/ to $PATH using a script in /etc/profile.d/scriptname.sh
created /etc/gitconfig on all office network machines: [url "https://"] insteadOf = git:// (to solve proxy server problems with github)
now every user on the office network can simply run # julia
Pkg.add("PackageName") is then used to install various packages.
The two networks are connected periodically (with certain security measures ssh, firewall, routing) for automated data exchange for a short period of time.
Tier two (development network)
installed julia on NFS share equal to tier one.
When the networks are connected I use a shell script with rsync -avz --delete to synchronize the .julia directory of tier one to tier two for every user.
Conclusion (so far):
It seems to work reasonably well.
As ivarne suggested there are problems if a package is installed AND something more than just file copying is done (compiled?) on tier one, the package wont run on tier two. But this can be resolved with Pkg.build("Pkgname").
PackageCompiler.jl seems like the best tool for using modern Julia (v1.8) on secure systems. The following approach requires a build server with the same architecture as the deployment server, something your institution probably already uses for developing containers, etc.
Build a sysimage with PackageCompiler's create_sysimage()
Upload the build (sysimage and depot) along with the Julia binaries to the secure system
Alias a script to julia, similar to the following example:
#!/bin/bash
set -Eeu -o pipefail
unset JULIA_LOAD_PATH
export JULIA_PROJECT=/Path/To/Project
export JULIA_DEPOT_PATH=/Path/To/Depot
export JULIA_PKG_OFFLINE=true
/Path/To/julia -J/Path/To/sysimage.so "$#"
I've been able to run a research pipeline on my institution's secure system, for which there is a public version of the approach.

Amazon EMR: Using R code in Amazon EMR

I have a very beginner question. I've just been reading through some of the documentation regarding Amazon's EMR. Before I sign up etc. I just wanted to ask about using R in it.
I have one R module that calls several other modules, and then, just before it finishes running, saves several variables as .txt files.
My rather basic question is, can I do this in Amazon's EMR? And will I be able to access the .txt output files? Finally, my R script reads in some data from Excel spreadsheets. Will it still be able to do this from the EMR if I upload the Excel files into the system?
Thanks
Mike
#Mike, Answers to your 3 questions below
Running R on EMR: Yes you can.
You can run R programs on EMR once you have installed R on the EMR instance. I assume that you would write MapReduce moules if you plan to use multi-instance cluster. If you program is just about a "plain" R program then you may have to just use one sizable instance. I would rather use an EC2 instance with R AMI (look for Louis Aslett).
Moving output files:
Yes you can. It is possible to transfer your program output from EMR to S3 storage bucket of your choice. You will have to add a step calling S3DistCp command to move the files. An example from my project -
--jar /home/hadoop/lib/emr-s3distcp-1.0.jar --args '--src,hdfs:///contents,--dest,s3://<bucket-name>/'
Reading spreadsheets: AFAIK, If you are able to do this on local installation of R, then you should also be able to do it on EMR. You have to ensure that the necessary packages/libraries are installed during the bootstrap process.
I am able to install squeezy-cran and rmr2 on an EMR instance with all their dependencies (RCpp, reshap2, digest, RJSONIO, functional etc.). I am still unable to call the R program as a step. I am having to use SSH session and run R CMD commands on the shell prompt. Being on Windows, putty.exe works for me.

Resources