How to see the background code of an rmr MapReduce job in RHadoop

I am new to RHadoop. I am able to run the MapReduce functions of the rmr package with Hadoop. My understanding is that, in the background, R runs this MapReduce code as Java, i.e. that R converts the R MapReduce code to Java. If so, can I get that background Java code when running MapReduce?
Can anyone help me?

In RHadoop, R does not convert R MapReduce code to Java. RHadoop provides a MapReduce interface: the mapper and reducer are written in R and then called from R.
The RHadoop packages submit the R code to the Hadoop cluster using Hadoop Streaming. Hadoop Streaming is a utility that comes with the Hadoop distribution; it allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer.
You can see how this works by going through the RHadoop package code on GitHub. RHadoop submits the Hadoop Streaming job using the system() command in R.
You can get an idea of this from the streaming.R script in the rmr package. The relevant code in streaming.R is given below.
final.command =
  paste(
    hadoop.command,
    stream.mapred.io,
    if (is.null(backend.parameters)) ""
    else
      do.call(paste.options, backend.parameters),
    input,
    output,
    mapper,
    combiner,
    reducer,
    image.cmd.line,
    m.fl,
    r.fl,
    c.fl,
    input.format.opt,
    output.format.opt,
    "2>&1")
if (verbose) {
  retval = system(final.command)
  if (retval != 0) stop("hadoop streaming failed with error code ", retval, "\n")}
else {
  console.output = tryCatch(system(final.command, intern = TRUE),
                            warning = function(e) stop(e))
  0}}
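The same submission pattern can be sketched in miniature: assemble a shell command string with paste() and hand it to system(). Here echo stands in for the hadoop binary, and the streaming flags are illustrative only:

```r
# Minimal sketch of rmr's job submission: build one command string,
# then let the OS run it. "echo" is a stand-in for the hadoop binary.
final.command <- paste(
  "echo",              # would be: hadoop jar .../hadoop-streaming.jar
  "-input /tmp/in",    # illustrative Hadoop Streaming flags
  "-output /tmp/out")
console.output <- system(final.command, intern = TRUE)
print(console.output)
```

With the real hadoop command, intern = TRUE captures the streaming job's console output, which is how rmr reports progress and errors back into R.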


using rstudioapi in devtools tests

I'm making a package which contains a function that calls rstudioapi::jobRunScript(), and I would like to be able to write tests for this function that can be run normally by devtools::test(). The package is only intended for use during interactive RStudio sessions.
Here's a minimal reprex:
After calling usethis::create_package() to initialize my package, and then usethis::use_r("rstudio") to create R/rstudio.R, I put:
foo_rstudio <- function(...) {
  script.file <- tempfile()
  write("print('hello')", file = script.file)
  rstudioapi::jobRunScript(
    path = script.file,
    name = "foo",
    importEnv = FALSE,
    exportEnv = "R_GlobalEnv"
  )
}
I then call use_test() to make an accompanying test file, in which I put:
test_that("foo works", {
  foo_rstudio()
})
I then run devtools::test() and get:
I think I understand the basic problem here: devtools runs a separate R session for the tests, and that session doesn't have access to RStudio. I see here that rstudioapi can work inside child R sessions, but seemingly only those "normally launched by RStudio."
I'd really like to use devtools to test my function as I develop it. I suppose I could modify my function to accept an argument, passed from the test code, that simply runs the job in the R session itself or in some other kind of child R process instead of as an RStudio job. But then I'm not actually testing the normal intended functionality, and if there were an issue specific to the rstudioapi::jobRunScript() call that could occur during normal use, my tests wouldn't be able to pick it up.
Is there a way to initialize an RStudio process from within a devtools::test() session, or some other solution here?
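One hedged sketch of the injection workaround the question contemplates. The run_script parameter is an invention for illustration, not part of the original function; because R evaluates default arguments lazily, the test path never touches rstudioapi at all:

```r
# Hypothetical refactor: make the job-runner injectable. Interactive use
# keeps the rstudioapi default; tests pass a stub instead.
foo_rstudio <- function(..., run_script = rstudioapi::jobRunScript) {
  script.file <- tempfile(fileext = ".R")
  write("print('hello')", file = script.file)
  run_script(path = script.file, name = "foo",
             importEnv = FALSE, exportEnv = "R_GlobalEnv")
}

# In a test: exercise everything except the actual RStudio job launch.
ran <- foo_rstudio(run_script = function(path, ...) path)
```

This still leaves jobRunScript() itself untested, which is the asker's sticking point; guarding such tests with testthat::skip_if_not(rstudioapi::isAvailable()) is the usual compromise.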

Calling R workbook in Databricks

Let's say I create a basic function in R:
Addn <- function(X, n) {
  X + n
}
and this is saved to a Databricks notebook at some filepath: "/shared/x/y/z/Addnfunction"
In RStudio, I would typically call that function from another script by writing something like:
source("/shared/x/y/z/Addnfunction.r")
If I open a new Databricks notebook and want to call the above function (for example, a shared team function) using the source() approach, I just get an error about the function/connection.
Is there a best practice for leveraging shared functions/scripts in R on Databricks?
Actually this was pretty straightforward:
%run "../z/Addnfunction"

How can I evaluate a C function from a dynamic library in an R package?

I’m trying to implement parallel computing in an R package that calls C from R with the .C function. It seems that the nodes of the cluster can’t access the dynamic library. I have made a parallel socket cluster, like this:
cl <- makeCluster(2)
I would like to evaluate a C function called valgrad from my R package on each of the nodes in my cluster using clusterEvalQ() from the parallel package. However, my code produces an error. I compile my package, but when I run
out <- clusterEvalQ(cl, cresults <- .C(C_valgrad, …))
where … represents the arguments of the C function valgrad, I get this error:
Error in checkForRemoteErrors(lapply(cl, recvResult)) :
2 nodes produced errors; first error: object 'C_valgrad' not found
I suspect there is a problem with clusterEvalQ’s ability to access the dynamic library. I attempted to fix this problem by loading the glmm package into the cluster using
clusterEvalQ(cl, library(glmm))
but that did not fix the problem.
I can evaluate valgrad on each of the cluster nodes using the foreach function from the foreach package, like this:
out <- foreach(1:no_cores) %dopar% {.C(C_valgrad, …)}
no_cores is the number of nodes in my cluster. However, this function doesn’t allow any of the results of the evaluation of valgrad to be accessed in any subsequent calculation on the cluster.
How can I either
(1) make the results of the evaluation of valgrad accessible for later calculations on the cluster or
(2) use clusterEvalQ to evaluate valgrad?
You have to load the external library, but this is not done with library() calls; it's done with dyn.load().
The following two functions are useful if you work with more than one operating system; they use the built-in variable .Platform$dynlib.ext.
Note also the unload function. You will need it if you develop a library of C functions: if you change a C function before testing it, the dynamic library has to be unloaded and then (the new version) reloaded.
See Writing R Extensions (file R-exts.pdf in the doc folder, or on CRAN), section 5.
dynLoad <- function(dynlib) {
  dynlib <- paste(dynlib, .Platform$dynlib.ext, sep = "")
  dyn.load(dynlib)
}
dynUnload <- function(dynlib) {
  dynlib <- paste(dynlib, .Platform$dynlib.ext, sep = "")
  dyn.unload(dynlib)
}
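A sketch of how these helpers combine with the cluster from the question: the shared library must be loaded on every worker, not just in the master session, since each worker is a separate R process. The "src/glmm" path is illustrative:

```r
library(parallel)

cl <- makeCluster(2)

# Each worker needs its own dyn.load() before .C() can find the symbol, e.g.:
# clusterEvalQ(cl, dyn.load(paste0("src/glmm", .Platform$dynlib.ext)))
# out <- clusterEvalQ(cl, cresults <- .C(C_valgrad, ...))

# Workers are full R sessions with their own loaded-DLL tables:
exts <- clusterEvalQ(cl, .Platform$dynlib.ext)
stopCluster(cl)
```

clusterEvalQ() returns one result per worker, so out above would collect each node's .C() results for later use, which is what the foreach approach in the question did not provide.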

Is an rJava object exportable with future (a package for asynchronous computing in R)?

I'm trying to speed up my R code with the future package, using the multicore plan on Linux. In the future definition I create a Java object and try to pass it to .jcall(), but I get a null value for the Java object inside the future. Could anyone please help me resolve this? Below is sample code:
library("future")
plan(multicore)
library(rJava)
.jinit()

# preprocess is a user-defined function
preprocess <- function(a) {
  # some preprocessing task here
  # time-consuming statistical analysis here
  return(lreturn)  # returns a list of 3 components
}
my_value <- preprocess(a = value)

obj <- .jnew("java.custom.class")
f <- future({
  .jcall(obj, "V", "CustomJavaMethod", my_value)
})
Basically I'm dealing with large streaming data. In the code above I send the string of streaming data to a user-defined function for statistical analysis, which returns a list of 3 components. I then want to send this list to the custom Java class [java.custom.class] for further processing via the custom Java method [CustomJavaMethod].
Without future my code runs fine, but I receive 12 streaming records per minute and have observed delays: the processing falls behind.
Currently I'm using Unix with 16 cores. With the future package the processing is fast, but tracing back through my code, something goes wrong in .jcall().
Hope this clarifies my pain.
(Author of the future package here:)
Unfortunately, there are certain types of objects in R that cannot be sent to another R process for further processing. To clarify, this is a limitation of those types of objects, not of the parallel framework used (here, the future framework). The simplest example of such an object may be a file connection, e.g. con <- file("my-local-file.txt", open = "wb"). I've documented some examples in the section 'Non-exportable objects' of the 'Common Issues with Solutions' vignette (https://cran.r-project.org/web/packages/future/vignettes/future-4-issues.html).
As mentioned in the vignette, you can set an option (*) such that the future framework looks for these types of objects and gives an informative error before attempting to launch the future ("early stopping"). Here is your example with this check activated:
library("future")
plan(multisession)

## Assert that global objects can be sent back and forth between
## the main R process and background R processes ("workers")
options(future.globals.onReference = "error")

library("rJava")
.jinit()

end <- .jnew("java/lang/String", " World!")

f <- future({
  start <- .jnew("java/lang/String", "Hello")
  .jcall(start, "Ljava/lang/String;", "concat", end)
})
# Error in FALSE :
# Detected a non-exportable reference ('externalptr') in one of the
# globals ('end' of class 'jobjRef') used in the future expression
So, yes, your example actually works when using plan(multicore). The reason is that 'multicore' uses forked processes (available on Unix and macOS but not Windows). However, I would try my best not to limit your software to parallelizing only on "forkable" systems; if you can find an alternative approach, I would aim for that. That way your code will also work on, say, a huge cloud cluster.
(*) The reason these checks are not enabled by default is that (a) they are still in beta testing, and (b) they come with overhead, because we basically need to scan for non-supported objects among all the globals. Whether these checks will be enabled by default in the future or not will be discussed over at https://github.com/HenrikBengtsson/future.
The code in the question calls an unknown Method1 method and my_value is undefined, so it is hard to know what you are really trying to achieve.
Take a look at the following example, maybe you can get inspiration from it:
library(future)
plan(multicore)
library(rJava)
.jinit()
end <- .jnew("java/lang/String", " World!")
f <- future({
  start <- .jnew("java/lang/String", "Hello")
  .jcall(start, "Ljava/lang/String;", "concat", end)
})
value(f)
[1] "Hello World!"

import R forecast library JAR files into java

I am trying to import the R package 'forecast' in NetBeans to use its functions. I have managed to make the JRI connection, and I have also imported the javaGD library and experimented with it with some success. The problem with the forecast package is that I cannot find the corresponding JAR files to include as a library in my project. I am loading it normally, re.eval("library(forecast)"), but when I call one of the library's functions, a null value is returned. Although I am quite sure that the code is correct, I am posting it just in case.
Thanks in advance.
Rengine re = new Rengine(Rargs, false, null);
System.out.println("rengine created, waiting for R!");
if (!re.waitForR()) {
    System.out.println("cannot load R");
    return;
}
re.eval("library(forecast)");
re.eval("library(tseries)");
re.eval("myData <- read.csv('C:/.../I-35E-NB_1.csv', header=F, dec='.', sep=',')");
System.out.println(re.eval("myData"));
re.eval("timeSeries <- ts(myData, start=1, frequency=24)");
System.out.println("this is the time series object: " + re.eval("timeSeries"));
re.eval("fitModel <- auto.arima(timeSeries)");
REXP fc = re.eval("forecast(fitModel, h=20)");  // forecast() takes h, not n
System.out.println("these are the forecast output values: " + fc);
You did not convert the values from R into Java. You should first create a numeric vector from the auto.arima output in R, and then use the method asDoubleArray() to read it into Java.
I gave a complete example in "How can I load add-on R libraries into JRI and execute from Java?", which shows exactly how to use the auto.arima function in Java using JRI.
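On the R side, the fix the answer describes amounts to reducing the forecast object to a plain numeric vector before pulling it into Java. A sketch, with the built-in AirPassengers series standing in for the question's CSV data:

```r
library(forecast)

fitModel <- auto.arima(AirPassengers)              # example model
fc <- as.numeric(forecast(fitModel, h = 20)$mean)  # plain doubles, JRI-friendly
# In Java, after re.eval() of the lines above:
#   double[] values = re.eval("fc").asDoubleArray();
```

Evaluating a plain numeric vector avoids the null REXP: JRI cannot directly represent a complex S3 object like the one forecast() returns, but a double vector maps straight onto double[].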
