Running Groovy Script from R

I have a Groovy script file (.groovy) that I want to execute from R. Please note that the Groovy script takes three different input variables, and I have created all three variables in the form of a list.

# This installs the package used to read the Groovy file as text
install.packages("readr")

variables <- list()
variables["first"]  <- "First"
variables["second"] <- "Second"
variables["third"]  <- "Third"

library(readr)
myscript <- read_file("path/to/groovycode.groovy")

# Execute() is assumed to come from whichever Groovy-integration package you use;
# readr only reads the script text into R.
result <- Execute(groovyScript = myscript, variables = variables)
result
I didn't test the code, but it can be a good starting point if you fix any issues when you try to run it.
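Another option, sketched below and untested: if the groovy command-line interpreter is installed and on the PATH, the script can be invoked directly with system2() and the three values passed as command-line arguments, which the Groovy script would then read from its implicit args array:
# Assumes the `groovy` executable is available on the PATH
groovy_args <- c("path/to/groovycode.groovy", "First", "Second", "Third")
result <- system2("groovy", args = groovy_args, stdout = TRUE)  # capture stdout as a character vector
result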

Related

Creating a dataframe in Azure ML Notebook with R kernel

I have written some scripts in R which I have to run in an Azure ML notebook, but I have not found much documentation on how to create a dataset by running code in a notebook with the R kernel. I have written the following Python code, which works with the Python kernel:
from azureml.core import Dataset, Datastore,Workspace
subscription_id = 'abc'
resource_group = 'pqr'
workspace_name = 'xyz'
workspace = Workspace(subscription_id, resource_group, workspace_name)
datastore = Datastore.get(workspace, 'workspaceblobstore')
# create tabular dataset from all parquet files in the directory
tabular_dataset_3 = Dataset.Tabular.from_parquet_files(path=(datastore,'/UI/09-17-2022_125003_UTC/userdata1.parquet'))
df=tabular_dataset_3.to_pandas_dataframe()
It works fine with the Python kernel, but I want to execute the equivalent R code in a notebook with the R kernel.
Can anyone please tell me what the equivalent R code is? Any help would be appreciated.
To use the dataset from an R script, we first need to register the dataset in the portal. Once the dataset has been added, get the dataset URL, open a notebook, and use the R kernel.
1. Upload the dataset and get the data source URL.
2. Go to Machine Learning studio and create a new notebook.
3. Use the R script below to get the dataset and convert it to a dataframe.
azureml_main <- function(dataframe1, dataframe2){
  print("R script run.")
  run <- get_current_run()
  ws <- workspacename
  dataset <- azureml$core$dataset$Dataset$get_by_name(ws, "./path/insurance.csv")
  dataframe2 <- dataset$to_pandas_dataframe()
  # Return datasets as a Named List
  return(list(dataset1 = dataframe1, dataset2 = dataframe2))
}
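The script above mixes R and Python-SDK idioms, so here is a minimal, untested sketch of another way to do it from the R kernel: call the Python SDK via reticulate. This assumes the azureml-core Python package is available in the notebook's environment; the subscription, resource group, workspace, and parquet path are the placeholders from the question.
library(reticulate)
azureml <- import("azureml.core")                 # the same SDK the Python code uses

ws        <- azureml$Workspace("abc", "pqr", "xyz")
datastore <- azureml$Datastore$get(ws, "workspaceblobstore")

# Same parquet path as in the Python example; reticulate's tuple() mirrors the
# Python (datastore, path) tuple
ds <- azureml$Dataset$Tabular$from_parquet_files(
  path = tuple(datastore, "/UI/09-17-2022_125003_UTC/userdata1.parquet")
)

df <- ds$to_pandas_dataframe()   # reticulate converts the pandas DataFrame to an R data.frame
head(df)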

User input and output path of files in R

I am aiming to write an R script that takes the path of an input file and the name of an output file from the user, processes the input, and stores the result in the output file.
Normally, if I wrote this code in RStudio, it would look like this:
d <- read.table("data.txt", header = TRUE)
r <- summary(d)
print(r)
The output that is displayed should also be written to the output file.
Here data.txt contains:
1
2
3
4
45
58
10
What I would like to do is put the code in a file called script.R and then run it as follows:
R script.R input_file_path_name output_file_name
Could anyone spare a minute or two and help me out?
Many thanks in advance.
The most natural way to pass arguments from the command line is to use the function commandArgs. This function scans the arguments which have been supplied when the current R session was invoked. So creating a script named sillyScript.R which starts with
#!/usr/bin/env Rscript
args = commandArgs(trailingOnly=TRUE)
and running the following command line
Rscript --vanilla sillyScript.R iris.txt out.txt
will create a string vector args which contains the entries iris.txt and out.txt.
Use args[1] and args[2] as the input and output file paths.
https://www.r-bloggers.com/passing-arguments-to-an-r-script-from-command-lines/
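Putting it together, a minimal sketch of what script.R could look like for the summary example above (the file names are whatever the user supplies on the command line):
#!/usr/bin/env Rscript
args <- commandArgs(trailingOnly = TRUE)
input_file  <- args[1]
output_file <- args[2]

d <- read.table(input_file, header = TRUE)
r <- summary(d)
print(r)                               # show the summary on the console
capture.output(r, file = output_file)  # write the same summary to the output file
Run it as: Rscript script.R data.txt result.txt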
You can consider this method:
script <- function(input_file) {
  input_data <- read.table(input_file, header = TRUE)
  r <- summary(input_data)
  return(r)
}
If you want to manually choose the input file location you can use:
read.table(file.choose(),header=T)
Just calling the function with the input file name is sufficient to return the desired output. For example:
result <- script("data.txt")
If you also want to export the data from R, you can modify the script as follows:
script <- function(input_file, output_file_name) {
  input_data <- read.table(input_file, header = TRUE)
  r <- summary(input_data)
  write.table(r, paste0(output_file_name, ".txt"))
  return(r)
}
You have to pass the file names in quotes:
result <- script("data.txt", "output_file_name")
By default, the output will be written to the current working directory. You can also specify the output location by putting it before the output file name within the script:
write.table(r, paste0("output_location/", output_file_name, ".txt"))
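As a side note, a slightly more robust way to build that path (a sketch, assuming output_location is a folder) is file.path(), which inserts the separator for you:
write.table(r, file = file.path("output_location", paste0(output_file_name, ".txt")))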

Debugging R Scripts in azure-ml: Where can stdout and stderr logs be found? (or why are they empty?)

I'm using "studio (preview)" from Microsoft Azure Machine Learning to create a pipeline that applies machine learning to a dataset in a blob storage that is connected to our data warehouse.
In the "Designer", an "Exectue R Script" action can be added to the pipeline. I'm using this functionality to execute some of my own machine learning algorithms.
I've got a 'hello world' version of this script working (including using the "script bundle" to load the functions in my own R files). It applies a very simple manipulation (compute the days difference with the date in the date column and 'today'), and stores the output as a new file. Given that the exported file has the correct information, I know that the R script works well.
The script looks like this:
# R version: 3.5.1
# The script MUST contain a function named azureml_main
# which is the entry point for this module.
# The entry point function can contain up to two input arguments:
# Param<medals>: a R DataFrame
# Param<matches>: a R DataFrame
azureml_main <- function(dataframe1, dataframe2){
  message("STARTING R script run.")

  # If a zip file is connected to the third input port, it is
  # unzipped under "./Script Bundle". This directory is added
  # to sys.path.
  message('Adding functions as source...')
  if (FALSE) {
    # This works...
    source("./Script Bundle/first_function_for_script_bundle.R")
  } else {
    # And this works as well!
    message('Sourcing all available functions...')
    functions_folder = './Script Bundle'
    list.files(path = functions_folder)
    list_of_R_functions <- list.files(path = functions_folder, pattern = "^.*[Rr]$",
                                      include.dirs = FALSE, full.names = TRUE)
    for (fun in list_of_R_functions) {
      message(sprintf('Sourcing <%s>...', fun))
      source(fun)
    }
  }

  message('Executing R pipeline...')
  dataframe1 = calculate_days_difference(dataframe = dataframe1)

  # Return datasets as a Named List
  return(list(dataset1=dataframe1, dataset2=dataframe2))
}
And although I do print some messages in the R Script, I haven't been able to find the "stdoutlogs" nor the "stderrlogs" that should contain these printed messages.
I need the printed messages for 1) information on how the analysis went and -most importantly- 2) debugging in case the code failed.
Now, I have found (in multiple locations) the files "stdoutlogs.txt" and "stderrlogs.txt". These can be found under "Logs" when I click on "Execute R Script" in the "Designer".
I can also find "stdoutlogs.txt" and "stderrlogs.txt" files under "Experiments" when I click on a finished "Run" and then both under the tab "Outputs" and under the tab "Logs".
However... all of these files are empty.
Can anyone tell me how I can print messages from my R Script and help me locate where I can find the printed information?
Can you please click on the "Execute R Script" module and download 70_driver.log? I tried message("STARTING R script run.") in an R sample and could find the output there.
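For reference, a small sketch (not from the answer above) of how R's output streams relate to those files: message() and warning() write to stderr, while print() and cat() write to stdout, so they should surface in the stderr and stdout logs (or the driver log) respectively; where exactly Azure ML surfaces them is an assumption here.
azureml_main <- function(dataframe1, dataframe2) {
  cat("This line goes to stdout\n")     # should appear in the stdout log
  message("This line goes to stderr")   # should appear in the stderr log
  return(list(dataset1 = dataframe1, dataset2 = dataframe2))
}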

Calling external program in parallel using foreach and doSNOW: How to import results?

I'm using R to call an external program in parallel on a cluster with multiple nodes and multiple cores. The external program requires three input data files and produces one output file (all files are stored in the same subfolder).
To run the program in parallel (or rather call it in a parallel fashion) I've initially used the foreach function together with the doParallel library. This works fine as long as I'm just using multiple cores on a single node.
However, I wanted to use multiple nodes with multiple cores. Therefore I modified my code accordingly to use the doSNOW library in conjunction with foreach (I tried Rmpi and doMPI, but I did not manage to run the code on multiple nodes with those libraries).
This works fine, i.e. the external program is now indeed run on multiple nodes (with multiple cores), and the cluster logfile shows that it produces the required results. The problem I'm facing now, however, is that the external program no longer stores the results/output files on the master node/in the specified subfolder of the working directory (it did so when I was using doParallel). This makes it impossible for me to import the results into R.
Indeed, if I check the content of the relevant folder it does not contain any output files, despite the logfile clearly showing that the external program ran successfully. I guess they are stored on the different nodes (?).
What modifications do I have to make to either my foreach function or the way I set up my cluster, to get those files saved on the master node/in the specified subfolder in my working directory?
Here some example R code, to showcase, what I'm doing:
# #Set working directory in non-interactive mode
setwd(system("pwd", intern = T))
# #Load some libraries
library(foreach)
library(parallel)
library(doParallel)
# ####Parallel tasks####
# #Create doSNOW cluster for parallel tasks
library(doSNOW)
nCoresPerNode <- as.numeric(Sys.getenv("PBS_NUM_PPN"))
nodeNames <- system("cat $PBS_NODEFILE | uniq", intern=TRUE)
machines <- rep(nodeNames, each = nCoresPerNode)
cl <- makeCluster(machines, type = "SOCK")
registerDoSNOW(cl)
# #How many workers are we using?
getDoParWorkers()
#####DUMMY CODE#####
# #The following 3 lines of code are just dummy code:
# #The idea is to create input files for the external program "myprogram"
external_Command_Script.cmd # #command file necessary for external program "myprogram" to run
startdata # #some input data for "myprogram"
enddata # #additional input data for "myprogram"
####DUMMY CODE######
# #Write necessary command and data files for external program: THIS WORKS!
for (i in 1:100) {
write(external_Command_Script.cmd[[i]], file=paste("./mysubfolder/external_Command_Script.",i,".cmd", sep=""))
write.table(startdata, file=paste("./mysubfolder/","startdata.",i,".txt", sep=""), col.names = FALSE, quote=FALSE)
write.table(enddata, file=paste("./mysubfolder/","enddata.",i,".txt", sep=""), col.names = FALSE, quote=FALSE)
}
# #Run external program "myprogram" in parallel: THIS WORKS!
foreach(i = 1:100) %dopar% {
system(paste('(cd ./mysubfolder && ',"myprogram",' ' ,"enddata.",i,".txt ", "startdata.",i,".txt", sep="",' < external_Command_Script.',i,'.cmd)'))
}
# #Import results of external program: THIS DOES NOT WORK WHEN RUN ON MULTIPLE NODES!
results <- list()
for (i in 1:100) {
results[[i]] = read.table(paste("./mysubfolder/","enddata.txt.",i,".log.txt", sep=""), sep = "\t", quote="\"", header = TRUE)
}
# #The import does NOT work as the files created by the external program are NOT stored on the master node/in the
# #subfolder of the working directory!
# #Instead I get the following error message:
# #sh: line 0: cd: ./mysubfolder: No such file or directory
# #Error in { : task 6 failed - "cannot open the connection"
My pbs script for the cluster looks something like this:
#!/bin/bash
# request resources:
#PBS -l nodes=2:ppn=8
#PBS -l walltime=00:30:00
module add languages/R-3.3.3-ATLAS
export PBS_O_WORKDIR="/panfs/panasas01/gely/xxxxxxx/workingdirectory"
# on compute node, change directory to 'submission directory':
cd $PBS_O_WORKDIR
# run your program and time it:
time Rscript ./R_script.R
I'd like to suggest that you look into the batchtools package. It provides methods for interacting with TORQUE / PBS from R.
If you're OK with using its predecessor BatchJobs for a while, I'd also recommend trying that, and once you understand how it works, look into the doFuture foreach adaptor. This will allow you to use the future.BatchJobs package. The combination of doFuture, future.BatchJobs, and BatchJobs allows you to do everything from within R, and you don't have to worry about creating temporary R scripts etc. (Disclaimer: I'm the author of both.)
Example of what it will look like once you've got it set up:
## Tell foreach to use futures
library("doFuture")
registerDoFuture()
## Tell futures to use TORQUE / PBS with help from BatchJobs
library("future.BatchJobs")
plan(batchjobs_torque)
and then you use:
res <- foreach(i = 1:100) %dopar% {
my_function(pathname[i], arg1, arg2)
}
This will evaluate each iteration in a separate PBS job, i.e. you'll see 100 jobs added to the queue.
The future.BatchJobs vignettes have more examples and info.
UPDATE 2017-07-30: The future.batchtools package is on CRAN since May 2017. This package is now recommended over future.BatchJobs. The usage is very similar to the above, e.g. instead of plan(batchjobs_torque) you now use plan(batchtools_torque).
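Putting that update together, a short sketch of the batchtools-based setup (assuming a TORQUE/PBS cluster with a batchtools template configured, as described in the package documentation):
## Tell foreach to use futures
library("doFuture")
registerDoFuture()
## Tell futures to use TORQUE / PBS via batchtools
library("future.batchtools")
plan(batchtools_torque)
res <- foreach(i = 1:100) %dopar% {
  my_function(pathname[i], arg1, arg2)
}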
Problem solved:
I made a mistake: the external program was actually NOT running; I misinterpreted the log file. The reason the external program did not run is that the subfolder (containing the necessary input data) was not found. It seems that the cluster defaults to the user directory instead of the working directory specified in the PBS submission script. This behaviour is different from clusters created with doParallel, which do recognize the working directory. The problem is therefore solved by adding the relative path to the working directory and subfolder in the R script, i.e. ./workingdirectory/mysubfolder/ instead of just ./mysubfolder/. Alternatively, you can also use the full path to the folder.
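A minimal, untested sketch of that fix, using the folder and file names from the question: build an absolute path to the subfolder once on the master and use it inside the parallel loop, so the workers are unaffected by whatever their default working directory is (this assumes the cluster registration shown in the question's script):
library(foreach)
subfolder <- normalizePath(file.path(getwd(), "mysubfolder"))
results <- foreach(i = 1:100) %dopar% {
  cmd <- sprintf("(cd %s && myprogram enddata.%d.txt startdata.%d.txt < external_Command_Script.%d.cmd)",
                 shQuote(subfolder), i, i, i)
  system(cmd)
  read.table(file.path(subfolder, sprintf("enddata.txt.%d.log.txt", i)),
             sep = "\t", quote = "\"", header = TRUE)
}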

Loop to import arguments into R

I am new to R and I am trying to have a script get arguments from a file. I am using the following code within my R script:
args <- commandArgs(TRUE)
covr <- args[1]
rpts <- args[2]
The arguments will come from a parameters.tsv which will have two fields, one for each argument.
I want to run the R script once for each line of parameters.tsv, until all lines have been used.
The end result will be qsub'ing a bash script that feeds each line to the R script.
This is what I came up with:
#!/bin/bash
cat parameters.tsv | while read v1 v2; do RScript --slave '--args $v1 $v2' myscript.R; done
It's currently terminating almost immediately after I submit it, and I don't understand why.
Any help is greatly appreciated since I am very new to this, and nothing I read beforehand explained it in enough detail to grasp.
How about something like:
var_df <- read.csv([your_file_here])   # or read.table() with the correct specs
for (i in 1:dim(var_df)[1]) {          # could be vectorised for speed; a loop keeps it clearer
  this_var_a <- var_df[i, 1]
  this_var_b <- var_df[i, 2]
  source([Rscript file here], local = TRUE)  # set local=TRUE, otherwise the vars
                                             # will not be visible to the script's operations
}
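Alternatively, an untested sketch that stays closer to the original plan: read parameters.tsv in R and launch Rscript once per row, so myscript.R can keep picking the values up with commandArgs(TRUE) as shown in the question (this assumes parameters.tsv holds two tab-separated columns):
params <- read.delim("parameters.tsv", header = FALSE,
                     col.names = c("covr", "rpts"))
for (i in seq_len(nrow(params))) {
  system2("Rscript",
          args = c("myscript.R", params$covr[i], params$rpts[i]))
}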
