Creating a dataframe in Azure ML Notebook with R kernel - r

I have written some scripts in R which I have to run in an Azure ML notebook, but I have not found much documentation on how to create a dataset by running code in a notebook with the R kernel. I have written the following Python code, which works with the Python kernel:
from azureml.core import Dataset, Datastore,Workspace
subscription_id = 'abc'
resource_group = 'pqr'
workspace_name = 'xyz'
workspace = Workspace(subscription_id, resource_group, workspace_name)
datastore = Datastore.get(workspace, 'workspaceblobstore')
# create tabular dataset from all parquet files in the directory
tabular_dataset_3 = Dataset.Tabular.from_parquet_files(path=(datastore,'/UI/09-17-2022_125003_UTC/userdata1.parquet'))
df=tabular_dataset_3.to_pandas_dataframe()
It works fine with the Python kernel, but I want to execute the equivalent code in a notebook with the R kernel.
Can anyone tell me what the equivalent R code is? Any help would be appreciated.

To create an R script and use the dataset, we first need to register the dataset in the portal. Once the dataset is added to the portal, we need to get the dataset URL, open the notebook, and use the R kernel.
Upload the dataset and get the data source URL.
Go to Machine Learning studio and create a new notebook.
Use the R script below to get the dataset and convert it to a dataframe.
azureml_main <- function(dataframe1, dataframe2) {
  print("R script run.")
  run <- get_current_run()
  ws <- run$experiment$workspace   # the workspace the current run belongs to
  # get_by_name() expects the name the dataset was registered under
  # (here assumed to be "insurance"), not a file path
  dataset <- azureml$core$dataset$Dataset$get_by_name(ws, "insurance")
  dataframe2 <- dataset$to_pandas_dataframe()
  # Return datasets as a named list
  return(list(dataset1 = dataframe1, dataset2 = dataframe2))
}
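If you are working in a plain notebook with the R kernel rather than inside an Execute R Script module, a more direct route is to call the same Python SDK through reticulate. The sketch below simply mirrors the Python code from the question (the subscription, resource group, workspace, and parquet path are the question's placeholders) and assumes the azureml-core Python package is available to the kernel's environment:
library(reticulate)
azureml <- import("azureml.core")

# Same values as in the Python snippet above
ws <- azureml$Workspace("abc", "pqr", "xyz")   # subscription_id, resource_group, workspace_name
datastore <- azureml$Datastore$get(ws, "workspaceblobstore")

# Build the tabular dataset from the parquet file and pull it into an R data frame
tabular_dataset <- azureml$Dataset$Tabular$from_parquet_files(
  path = tuple(datastore, "/UI/09-17-2022_125003_UTC/userdata1.parquet")
)
df <- tabular_dataset$to_pandas_dataframe()   # reticulate converts the result to an R data.frame
head(df)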

Related

write_xlsx function in R path argument

I have a script in R that does some calculations. At the end, the results are stored in a data frame like this one:
n= 10
x= rnorm(n)
t = seq(1:n)
d = data.frame(t,x);d
I want to send this script to many people, some of whom have Mac, others Windows, and some Linux, have them run the script, and have the output be an xlsx file on their desktop. I know how to do it to my own desktop with the write_xlsx() function from the writexl library. But how can anyone run the script and have the xlsx file exported to their desktop?
It might be something like:
write_xlsx(d, path = "my question")
but I don't know how. Any help?
You can achieve it using the writexl package:
library("writexl")
write_xlsx(d, "path/name.xlsx")
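If the goal is for the file to land on each recipient's desktop regardless of their operating system, one way is to build the path at run time. This is a sketch, assuming the desktop folder is ~/Desktop on macOS/Linux and %USERPROFILE%\Desktop on Windows (the usual defaults, but not guaranteed):
library(writexl)

# Locate the current user's home directory in a cross-platform way
home <- if (.Platform$OS.type == "windows") Sys.getenv("USERPROFILE") else path.expand("~")
desktop <- file.path(home, "Desktop")

# Write the data frame built earlier in the script
write_xlsx(d, file.path(desktop, "results.xlsx"))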

running r script in AWS

Looking at this page and this piece of code in particular:
import boto3
account_id = boto3.client("sts").get_caller_identity().get("Account")
region = boto3.session.Session().region_name
ecr_repository = "r-in-sagemaker-processing"
tag = ":latest"
uri_suffix = "amazonaws.com"
processing_repository_uri = "{}.dkr.ecr.{}.{}/{}".format(
    account_id, region, uri_suffix, ecr_repository + tag
)
# Create ECR repository and push Docker image
!docker build -t $ecr_repository docker
!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)
!aws ecr create-repository --repository-name $ecr_repository
!docker tag {ecr_repository + tag} $processing_repository_uri
!docker push $processing_repository_uri
This is not pure Python, obviously? Are these AWS CLI commands? I have used Docker previously, but I find this example very confusing. Is anyone aware of an end-to-end example of simply running some R job in AWS using SageMaker/Docker? Thanks.
This is Python code mixed with shell magic calls (the lines starting with !).
Magic commands aren't unique to this platform; you can use them in Jupyter, but this particular code is meant to be run on their platform, in what seems like a fairly convoluted way of running R scripts as processing jobs.
However, the only thing you really need to focus on is the R script, and the final two cell blocks. The instruction at the top (don't change this line) creates a file (preprocessing.R) which gets executed later, and then you can see the results.
Just run all the code cells in that order, with your own custom R code in the first cell. Note the line plot_key = "census_plot.png" in the last cell; this refers to the image being created in the R code. For other output types (e.g. text) you'll have to look up the necessary Python package (PIL is an image manipulation package) and adapt accordingly.
Try this to get the CSV file that the R script is also generating (this code is not validated, so you might need to fix any problems that arise):
import csv

csv_key = "plot_data.csv"
csv_in_s3 = "{}/{}".format(preprocessed_csv_data, csv_key)
!aws s3 cp {csv_in_s3} .

# Read the downloaded file and show its rows (a csv.reader is an iterator,
# so materialise it into a list before displaying)
with open(csv_key) as f:
    rows = list(csv.reader(f))
display(rows)
So now you should have an idea of how the two different output types the example R script generates are handled, and from there you can adapt your own R code based on what it outputs.
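As a rough illustration of the other side, the end of your own preprocessing.R might write its outputs to the processing job's output directory. This is only a sketch: the /opt/ml/processing/output path depends on how the ProcessingOutput is configured in the notebook, and plot_data is a hypothetical data frame standing in for whatever your script computes:
# Assumed output location for the processing job; adjust to match your ProcessingOutput
output_dir <- "/opt/ml/processing/output"

# plot_data is a placeholder for the script's result
write.csv(plot_data, file.path(output_dir, "plot_data.csv"), row.names = FALSE)

png(file.path(output_dir, "census_plot.png"))
plot(plot_data)
dev.off()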

Debugging R Scripts in azure-ml: Where can stdout and stderr logs be found? (or why are they empty?)

I'm using "studio (preview)" from Microsoft Azure Machine Learning to create a pipeline that applies machine learning to a dataset in a blob storage that is connected to our data warehouse.
In the "Designer", an "Exectue R Script" action can be added to the pipeline. I'm using this functionality to execute some of my own machine learning algorithms.
I've got a 'hello world' version of this script working (including using the "script bundle" to load the functions in my own R files). It applies a very simple manipulation (compute the days difference with the date in the date column and 'today'), and stores the output as a new file. Given that the exported file has the correct information, I know that the R script works well.
The script looks like this:
# R version: 3.5.1
# The script MUST contain a function named azureml_main
# which is the entry point for this module.
# The entry point function can contain up to two input arguments:
# Param<medals>: a R DataFrame
# Param<matches>: a R DataFrame
azureml_main <- function(dataframe1, dataframe2){
  message("STARTING R script run.")

  # If a zip file is connected to the third input port, it is
  # unzipped under "./Script Bundle". This directory is added
  # to sys.path.
  message('Adding functions as source...')
  if (FALSE) {
    # This works...
    source("./Script Bundle/first_function_for_script_bundle.R")
  } else {
    # And this works as well!
    message('Sourcing all available functions...')
    functions_folder = './Script Bundle'
    list.files(path = functions_folder)
    list_of_R_functions <- list.files(path = functions_folder, pattern = "^.*[Rr]$", include.dirs = FALSE, full.names = TRUE)
    for (fun in list_of_R_functions) {
      message(sprintf('Sourcing <%s>...', fun))
      source(fun)
    }
  }

  message('Executing R pipeline...')
  dataframe1 = calculate_days_difference(dataframe = dataframe1)

  # Return datasets as a Named List
  return(list(dataset1=dataframe1, dataset2=dataframe2))
}
And although I do print some messages in the R script, I haven't been able to find the "stdoutlogs" or the "stderrlogs" that should contain these printed messages.
I need the printed messages for 1) information on how the analysis went and, most importantly, 2) debugging in case the code fails.
Now, I have found (in multiple locations) the files "stdoutlogs.txt" and "stderrlogs.txt". These can be found under "Logs" when I click on "Execute R Script" in the "Designer".
I can also find "stdoutlogs.txt" and "stderrlogs.txt" files under "Experiments" when I click on a finished "Run" and then both under the tab "Outputs" and under the tab "Logs".
However... all of these files are empty.
Can anyone tell me how I can print messages from my R Script and help me locate where I can find the printed information?
Can you please click on the "Execute R Script" module and download the 70_driver.log? I tried message("STARTING R script run.") in an R sample and could find the output there.
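One small aside that may help when debugging: the answer above found message() output in 70_driver.log, and message() writes to the standard error stream while cat()/print() write to standard output, so if one log looks empty it is worth emitting to both. A minimal sketch:
azureml_main <- function(dataframe1, dataframe2) {
  message("This goes to stderr")   # message() writes to the standard error stream
  cat("This goes to stdout\n")     # cat()/print() write to standard output
  return(list(dataset1 = dataframe1, dataset2 = dataframe2))
}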

Read / write to folders with AWS and R

I am trying to use the free tier of Amazon Web Services EC2 with Ubuntu and R. I created a simple R file that I hope will read a small CSV input data file in one folder, perform a trivial operation, and write the output to a CSV file in a separate folder. However, the output CSV file is not being created.
Here are the contents of the R file:
my.data <- read.csv('/my_cloud_input_file_test/my_input_test_data_Nov22_2019.csv')
my.data$c <- my.data$a + my.data$b
write.csv(my.data, '/my_cloud_output_file_test/my_output_test_data_Nov22_2019.csv', row.names = FALSE, quote = FALSE)
Here are the contents of the input data file:
a,b
100,12
200,22
300,32
400,42
500,52
Here are the only two lines I used in PuTTY after connecting to the instance:
ubuntu#ip-122-31-22-243:~$ sudo su
root#ip-122-31-22-243:/home/ubuntu# R CMD BATCH Cloud_test_R_file_Nov22_2019.R
The R file is located in the ubuntu folder according to FileZilla, as are my input and output folders.
Can someone please point out my mistake? If I put the R file and input data set both in the ubuntu folder then the output data set is created in the ubuntu folder without me having to use a setwd statement (after I modify the read.csv and write.csv statements to eliminate my input and output folder names). So, I am not using a setwd statement here. If I need a setwd statement here what should it be?
Sorry for such a trivial question.
The problem is that your paths start with "/", so R treats them as absolute paths at the root of the filesystem rather than relative to /home/ubuntu, where the folders actually live. This code worked:
setwd('/home/ubuntu/')
my.data <- read.csv('retest_input_data/my_input_data_Nov24_2019.csv')
my.data$c <- my.data$a + my.data$b
write.csv(my.data, 'retest_output_data/my_output_data_Nov24_2019.csv', row.names = FALSE, quote = FALSE)
PuTTY line:
ubuntu#ip-122-31-22-243:~$ R CMD BATCH retest_R_file_Nov24_2019.R
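Equivalently, you could skip setwd() and spell out the full paths, assuming the folders sit under /home/ubuntu as in the working version above:
my.data <- read.csv('/home/ubuntu/retest_input_data/my_input_data_Nov24_2019.csv')
my.data$c <- my.data$a + my.data$b
write.csv(my.data, '/home/ubuntu/retest_output_data/my_output_data_Nov24_2019.csv',
          row.names = FALSE, quote = FALSE)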

Running Groovy Script from R

I have a Groovy script file (.groovy) that I want to execute from R. Please note that the Groovy script takes three different variables as input, and I have created all three variables in R as a list.
# This installs the package on your machine to read the groovy file
install.packages("readr")

variables <- list()
variables["first"]  <- "First"
variables["second"] <- "Second"
variables["third"]  <- "Third"

library(readr)
myscript <- read_file("path/to/groovycode.groovy")

# Execute() is not a base R function; it is assumed to come from whatever
# Groovy/JVM integration you have available
result <- Execute(groovyScript = myscript, variables = variables)
result
I didn't test the code, but it can be a good starting point if you fix any issues that come up when you try to run it.
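Alternatively, a minimal sketch that avoids the hypothetical Execute() call is to invoke the Groovy launcher directly with system2(). This assumes the groovy command is installed and on the PATH, and that your script reads its three inputs from its command-line arguments (the script path is the question's placeholder):
variables <- list(first = "First", second = "Second", third = "Third")

# Run the script, passing the three values as command-line arguments,
# and capture whatever it prints to stdout
output <- system2(
  "groovy",
  args = c("path/to/groovycode.groovy", unlist(variables)),
  stdout = TRUE
)
print(output)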
