Read / write to folders with AWS and R

I am trying to use the free tier of Amazon Web Services EC2 with Ubuntu and R. I created a simple R file that I hope will read a small CSV input data file from one folder, perform a trivial operation, and write the output to a CSV file in a separate folder. However, the output CSV file is not being created.
Here are the contents of the R file:
my.data <- read.csv('/my_cloud_input_file_test/my_input_test_data_Nov22_2019.csv')
my.data$c <- my.data$a + my.data$b
write.csv(my.data, '/my_cloud_output_file_test/my_output_test_data_Nov22_2019.csv', row.names = FALSE, quote = FALSE)
Here are the contents of the input data file:
a,b
100,12
200,22
300,32
400,42
500,52
Here are the only two lines I used in PuTTY after connecting to the instance:
ubuntu@ip-122-31-22-243:~$ sudo su
root@ip-122-31-22-243:/home/ubuntu# R CMD BATCH Cloud_test_R_file_Nov22_2019.R
The R file is located in the ubuntu folder according to FileZilla, as are my input and output folders.
Can someone please point out my mistake? If I put the R file and the input data set both in the ubuntu folder, then the output data set is created in the ubuntu folder without my having to use a setwd statement (after I modify the read.csv and write.csv statements to drop the input and output folder names). So I am not using a setwd statement here. If I do need one, what should it be?
Sorry for such a trivial question.

This code worked:
setwd('/home/ubuntu/')
my.data <- read.csv('retest_input_data/my_input_data_Nov24_2019.csv')
my.data$c <- my.data$a + my.data$b
write.csv(my.data, 'retest_output_data/my_output_data_Nov24_2019.csv', row.names = FALSE, quote = FALSE)
PuTTY line:
ubuntu@ip-122-31-22-243:~$ R CMD BATCH retest_R_file_Nov24_2019.R
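For anyone hitting the same problem: the original script failed because a leading slash makes a path absolute, so read.csv looked for /my_cloud_input_file_test at the root of the filesystem instead of under /home/ubuntu. An equivalent fix, assuming the folders from the question live under /home/ubuntu, is to keep the leading slash but spell the paths out in full:
my.data <- read.csv('/home/ubuntu/my_cloud_input_file_test/my_input_test_data_Nov22_2019.csv')
my.data$c <- my.data$a + my.data$b
write.csv(my.data, '/home/ubuntu/my_cloud_output_file_test/my_output_test_data_Nov22_2019.csv', row.names = FALSE, quote = FALSE)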

Related

Creating the user's specific directory in R

I want to read the CSV file "mydata.csv" as input and create the output in the same directory using R. I have hard-coded the CSV input (Domain_test.csv) and output (MyData.csv) paths as shown below. But I will have to share the same Rscript and the corresponding CSV files with one of the users so that he/she can execute it and collect the results. I want the user to be able to select whatever path he wants and run the script without hard-coding the input/output paths in it.
How it should be done in R?
#reading csv from this current directory
data <- read.csv("C:/Users/Desktop/input_output_directory/Domain_test.csv")
#generating the output In this same directory
write.csv(data, "C:/Users/Desktop/input_output_directory/MyData.csv", row.names = FALSE)
You can use
wd <- choose.dir(default = "", caption = "Select folder")
setwd(wd)
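A fuller sketch of how this fits into the script (note that choose.dir() is Windows-only; tcltk::tk_choose.dir() is a rough cross-platform substitute, and the file names here are the ones from the question):
# let the user pick the folder that holds the input file
wd <- choose.dir(default = "", caption = "Select folder")
setwd(wd)
# read and write relative to the chosen folder
data <- read.csv("Domain_test.csv")
write.csv(data, "MyData.csv", row.names = FALSE)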

How to source R code from another Jupyter notebook file?

I am new to using the Jupyter notebook with R kernel.
I have R code written in two files Settings.ipynb and Main_data.ipynb.
My Settings.ipynb file has a lot of details; I am showing a sample below:
Schema = "dist"
resultsSchema = "results"
sourceName = "hos"
dbms = "postgresql" #Should be "sql server", "oracle", "postgresql" or "redshift"
user <- "hos"
pw <- "hos"
server <- "localhost/hos"
port <- "9763"
I would like to source the Settings file in the Main_data code file.
When I was using RStudio, this was easy, as I just used the line below:
source('Settings.R')
But now, in the Main_data Jupyter notebook with the R kernel, when I write the below piece of code
source('Settings.R') # settings file is in same directory as main_data file
I get the below error
Error in source("Settings.R"): Settings.R:2:11: unexpected '['
1: {
2: "cells": [
^
Traceback:
1. source("Settings.R")
When I try the following instead, I get a similar error:
source('Settings.ipynb')
Error in source("Settings.ipynb"): Settings.ipynb:2:11: unexpected '['
1: {
2: "cells": [
^
Traceback:
1. source("Settings.ipynb")
How can I source R code, and what is the right way to save it (.ipynb or .R format) from a Jupyter notebook that uses the R kernel? Can you help me with this, please?
We could create a .INI file in the same working directory (or a different one) and use the ConfigParser package to parse all the elements. The .INI file would be:
Settings.INI
[settings-info]
schema = dist
resultsSchema = results
sourceName = hos
dbms = postgresql
user = hos
pw = hos
server = localhost/hos
Then we initialize a parser object and read the contents from the file. There can be multiple subheadings (here there is only 'settings-info'), and the components can be extracted using either [[ or $:
library(ConfigParser)
# initialize the parser and read the INI file into a named list
props <- ConfigParser$new()
props <- props$read("Settings.INI")$data
# pull a single value out of the [settings-info] section
props[["settings-info"]]$schema
Trying to save a Jupyter notebook in .R format will not work, as the result is mangled by the notebook's JSON structure (due to the presence of things like { "cells": [...). You can verify this by opening your .R file in Jupyter Notebook.
However, you can use a text editor such as vim, or RStudio, to create a .R file. This will keep the contents as they are, without any format issues such as { "cells": [....
Later, from another Jupyter notebook, you can source the .R file created with the editor or RStudio. This resolved the issue for me.
In summary, don't use Jupyter Notebook to create .R files and then source them from another Jupyter notebook file.
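As an aside, and worth verifying on your own setup: Jupyter's nbconvert can export a notebook's code cells to a plain script, which sidesteps the JSON problem entirely. For an R-kernel notebook the exported script typically gets a .r extension:
# a minimal sketch: export the notebook's code cells to a plain script,
# then source the result (the exported file name/extension may differ)
system("jupyter nbconvert --to script Settings.ipynb")
source("Settings.r")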

Debugging R Scripts in azure-ml: Where can stdout and stderr logs be found? (or why are they empty?)

I'm using "studio (preview)" from Microsoft Azure Machine Learning to create a pipeline that applies machine learning to a dataset in a blob storage that is connected to our data warehouse.
In the "Designer", an "Exectue R Script" action can be added to the pipeline. I'm using this functionality to execute some of my own machine learning algorithms.
I've got a 'hello world' version of this script working (including using the "script bundle" to load the functions in my own R files). It applies a very simple manipulation (computing the difference in days between the date column and 'today') and stores the output as a new file. Given that the exported file has the correct information, I know that the R script works well.
The script looks like this:
# R version: 3.5.1
# The script MUST contain a function named azureml_main
# which is the entry point for this module.
# The entry point function can contain up to two input arguments:
#   Param<medals>: an R DataFrame
#   Param<matches>: an R DataFrame
azureml_main <- function(dataframe1, dataframe2) {
    message("STARTING R script run.")
    # If a zip file is connected to the third input port, it is
    # unzipped under "./Script Bundle". This directory is added
    # to sys.path.
    message('Adding functions as source...')
    if (FALSE) {
        # This works...
        source("./Script Bundle/first_function_for_script_bundle.R")
    } else {
        # And this works as well!
        message('Sourcing all available functions...')
        functions_folder <- './Script Bundle'
        list_of_R_functions <- list.files(path = functions_folder, pattern = "^.*[Rr]$",
                                          include.dirs = FALSE, full.names = TRUE)
        for (fun in list_of_R_functions) {
            message(sprintf('Sourcing <%s>...', fun))
            source(fun)
        }
    }
    message('Executing R pipeline...')
    dataframe1 <- calculate_days_difference(dataframe = dataframe1)
    # Return datasets as a Named List
    return(list(dataset1 = dataframe1, dataset2 = dataframe2))
}
Although I do print some messages in the R script, I haven't been able to find the "stdoutlogs" or "stderrlogs" that should contain these printed messages.
I need the printed messages for 1) information on how the analysis went and, most importantly, 2) debugging in case the code fails.
Now, I have found (in multiple locations) the files "stdoutlogs.txt" and "stderrlogs.txt". These can be found under "Logs" when I click on "Execute R Script" in the "Designer".
I can also find "stdoutlogs.txt" and "stderrlogs.txt" files under "Experiments" when I click on a finished "Run", both under the "Outputs" tab and under the "Logs" tab.
However... all of these files are empty.
Can anyone tell me how I can print messages from my R Script and help me locate where I can find the printed information?
Can you please click on the "Execute R Script" module and download the 70_driver.log? I tried message("STARTING R script run.") in an R sample and could find the output there.
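One general R detail that may help when hunting for the output (this is base R behaviour, not Azure-specific): message() writes to the stderr stream, while cat() and print() write to stdout, so the two log files capture different calls:
# message() goes to stderr, so it should surface in the stderr log
message("STARTING R script run.")
# cat() and print() go to stdout, so they should surface in the stdout log
cat("STARTING R script run.\n")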

How to pass several files to a command function and output them to different files?

I have a directory containing 50 files; here's an excerpt showing how the files are named:
input1.txt
input2.txt
input3.txt
input4.txt
I'm writing the script in R, but I'm running bash commands inside it using system().
I have a command X that takes one input file and writes one output file,
for example:
X input1.txt output1.txt
I want input1.txt to output to output1.txt, input2.txt to output to output2.txt etc..
I've been trying this:
for (i in 1:50)
{
  setwd("outputdir");
  create.file(paste("output", i, ".txt", sep = ""));
  setwd("homedir");
  system(paste("/usr/local/bin/command", paste("input", i, ".txt", sep = ""), paste("/outputdir/output", i, ".txt", sep = "")));
}
What am I doing wrong? I'm getting an error at the system line; it says "incorrect string constant" and I don't get it. Did I apply the system command in a wrong manner?
Is there a way to get all the input and output file names into system without going through paste to build them?
There is a pretty easy method in R to copy files to a new directory without using system commands. This also has the benefit of working across different operating systems (you just have to change the file paths).
Modified code from: "Copying files with R" by Amy Whitehead
Using your method of running files 1:50, here is some pseudocode. You will need to change current.folder and new.folder to match your own directories.
# identify the folders
current.folder <- "/usr/local/bin/command"
new.folder <- "/outputdir/output"
# find the files that you want
i <- 1:50
# Instead of looping we can use vector pasting to get multiple results at once!
inputfiles <- paste0(current.folder,"/input",i,".txt")
outputfiles <- paste0(new.folder,"/output",i,".txt")
# copy the files to the new folder
file.copy(inputfiles, outputfiles)
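That said, if the goal is to run the external command on each file rather than copy the files, a minimal sketch along the same vectorized lines (assuming the command really lives at /usr/local/bin/command and the paths match the question):
# build every command string at once with vectorized sprintf()
i <- 1:50
inputs <- paste0("input", i, ".txt")
outputs <- paste0("/outputdir/output", i, ".txt")
cmds <- sprintf("/usr/local/bin/command %s %s", inputs, outputs)
# system() runs one command string at a time, so loop over them
for (cmd in cmds) system(cmd)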

In R, opening an object saved to Excel through shell.exec

I would like to be able to open files quickly in Excel after saving them. I learned how from "R opening a specific worksheet in an excel workbook using shell.exec" here on SO.
On my Windows system, I can do so with the following code, and could perhaps turn it into a function along the lines of saveOpen <- function(...) {...}. However, I suspect there are better ways to accomplish this modest goal.
I would appreciate any suggestions to improve this multi-step effort.
# create tiny data frame
df <- data.frame(names = c("Alpha", "Baker"), cities = c("NYC", "Rome"))
# save the data frame to an Excel file in the working directory
save.xls(df, filename = "test file.xlsx")
# I have to reenter the file name and add a forward slash for the paste0() command below to create a proper file path
name <- "/test file.xlsx"
# add the working directory path to the file name
file <- paste0(getwd(), name)
# with shell and .exec for Windows, open the Excel file
shell.exec(file = file)
Do you just want to create a helper function to make this easier? How about
save.xls.and.open <- function(dataframe, filename, ...) {
  save.xls(dataframe, filename = filename, ...)
  cmd <- file.path(getwd(), filename)
  shell.exec(cmd)
}
then you just run
save.xls.and.open(df, filename = "testfile.xlsx")
I guess it doesn't seem like all that many steps to me.
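For what it's worth, a package-based alternative (assuming the openxlsx package is an acceptable substitute for save.xls here) keeps the whole round trip to two calls:
library(openxlsx)
# write the data frame, then open the resulting workbook in Excel
write.xlsx(df, "test file.xlsx")
openXL("test file.xlsx")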
