Looking at this page and this piece of code in particular:
import boto3
account_id = boto3.client("sts").get_caller_identity().get("Account")
region = boto3.session.Session().region_name
ecr_repository = "r-in-sagemaker-processing"
tag = ":latest"
uri_suffix = "amazonaws.com"
processing_repository_uri = "{}.dkr.ecr.{}.{}/{}".format(
    account_id, region, uri_suffix, ecr_repository + tag
)
# Create ECR repository and push Docker image
!docker build -t $ecr_repository docker
!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)
!aws ecr create-repository --repository-name $ecr_repository
!docker tag {ecr_repository + tag} $processing_repository_uri
!docker push $processing_repository_uri
This is not pure Python, obviously. Are these AWS CLI commands? I have used Docker previously, but I find this example very confusing. Is anyone aware of an end-to-end example of simply running an R job in AWS using SageMaker/Docker? Thanks.
This is Python code mixed with shell commands invoked via magic calls (the lines starting with !).
Magic commands aren't unique to SageMaker; you can use them in any Jupyter notebook, but this particular code is meant to be run on the SageMaker platform, in what is admittedly a fairly convoluted way of running R scripts as processing jobs.
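If you ever need to run this outside a notebook, each ! line can be replaced by an ordinary subprocess call in plain Python. A minimal sketch (the `sh` helper is hypothetical; substitute the real docker/aws commands for the placeholder):

```python
import subprocess

def sh(cmd):
    # Plain-Python stand-in for a notebook "!command" line; shell-style
    # variable interpolation ($region etc.) becomes f-string formatting.
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    print(result.stdout, end="")
    return result.returncode

# e.g. sh(f"docker build -t {ecr_repository} docker")
sh("echo hello")  # placeholder command
```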
However, the only thing you really need to focus on is the R script and the final two cell blocks. The instruction at the top ("don't change this line") creates a file (preprocessing.R) which gets executed later, after which you can see the results.
Just run all the code cells in that order, with your own custom R code in the first cell. Note the line plot_key = "census_plot.png" in the last cell. This refers to the image being created in the R code. As for other output types (eg text) you'll have to look up the necessary Python package (PIL is an image manipulation package) and adapt accordingly.
Try this to get the CSV file that the R script is also generating (this code is not validated, so you might need to fix any problems that arise):
import csv
csv_key = "plot_data.csv"
csv_in_s3 = "{}/{}".format(preprocessed_csv_data, csv_key)
!aws s3 cp {csv_in_s3} .
with open(csv_key) as file:
    dat = list(csv.reader(file))
display(dat)
So now you should have an idea of how the two different output types generated by the example R script are handled, and from there you can adapt your own R code based on what it outputs.
Related
I am working in .NET Interactive (aka Polyglot) Notebooks in F# (but I believe the same would apply to C#). In my code, I am running functions that ultimately produce an F# list of floating point values, or alternatively might be an F# list of tuples which contain floating point values.
When I ask the notebook to display the variable, it shows the first 20 values and says ".. (more)." Ideally, I would like to either be able to download this data by pressing a link next to the table that's displayed, or alternatively, run some function that can copy the full data to the clipboard - similar to Pandas' to_clipboard function.
Is there a way to do this?
If you want to create a cell that, when run, copies the contents of a data frame to a clipboard, you can do this using the TextCopy package. For testing, I used the following (including also Deedle and extension for nicely rendering frames):
#i "nuget:https://www.myget.org/F/gregs-experimental-packages/api/v3/index.json"
#r "nuget:Deedle"
#r "nuget:Deedle.DotNet.Interactive.Extension,0.1.0-alpha9"
#r "nuget:TextCopy"
open Deedle
Let's create a sample data frame and a function to get its contents as CSV string:
let df =
    Frame.ofRecords
        [ for i in 0 .. 100 -> {| Name = $"Joe {i}" |} ]

let getFrameAsCsv (df:Frame<_, _>) =
    let sb = System.Text.StringBuilder()
    use sw = new System.IO.StringWriter(sb)
    df.SaveCsv(sw)
    sb.ToString()
To copy df to the clipboard, you can run:
TextCopy.ClipboardService.SetText(getFrameAsCsv df)
If you want to create a download link in the notebook output, this is also possible. You can use the HTML helper to output custom HTML and inside that, you can use the data: format to embed your CSV as a linked file in <a href=...> (as long as it is not too big):
let csv =
    System.Convert.ToBase64String
        (System.Text.UTF8Encoding.UTF8.GetBytes(getFrameAsCsv df))

HTML($"<a href='data:text/csv;name=file.csv;base64,{csv}'>Download CSV</a>")
On 2021-08-18 Microsoft (for our convenience?) made the following changes to their Azure ML SDK:
Azure Machine Learning Experimentation User Interface. Run Display Name.
The Run Display Name is a new, editable and optional display name that can be assigned to a run.
This name can help with more effectively tracking, organizing and discovering the runs.
The Run Display Name is defaulted to an adjective_noun_guid format (Example: awesome_watch_2i3uns).
This default name can be edited to a more customizable name. This can be edited from the Run details page in the Azure Machine Learning studio user interface.
Before this change to the SDK, Run Display Name = experiment name + hash.
I was assigning the experiment name from the SDK:
from azureml.core import Experiment
experiment_name = 'yus_runExperiment'
experiment=Experiment(ws,experiment_name)
run = experiment.submit(src)
After the change the Run Display Names are auto-generated.
I do not want to manually edit the Run Display Name, as I may sometimes run 100s of experiments a day.
I tried to find an answer in the Microsoft documentation, but my attempts have failed.
Is there an Azure SDK function to assign the Run Display Name?
Just tested in SDK v1.38.0. You could do it like this:
from azureml.core import Experiment
experiment_name = 'yus_runExperiment'
experiment=Experiment(ws,experiment_name)
run = experiment.submit(src)
run.display_name = "Training"
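If you submit many runs a day, you can set the name programmatically right after submit, for example by recreating something like the old "experiment name + hash" default. A minimal sketch (make_display_name is a hypothetical helper, not part of the SDK):

```python
import hashlib
import time

def make_display_name(experiment_name, salt=None):
    # Hypothetical helper: mimic the pre-2021 "experiment name + hash"
    # default by appending a short hash of the name plus a salt.
    salt = str(time.time()) if salt is None else salt
    digest = hashlib.sha1((experiment_name + salt).encode()).hexdigest()[:8]
    return "{}_{}".format(experiment_name, digest)

# run = experiment.submit(src)
# run.display_name = make_display_name(experiment_name)
print(make_display_name("yus_runExperiment", salt="demo"))
```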
It's undocumented (so try at your own risk, I guess), but I successfully changed the display name using the following Python code.
from azureml.core import Experiment, Run, Workspace
ws = Workspace.from_config()
exp = Experiment(workspace=ws, name=<name>)
run = Run(exp, <run id>)
run.display_name = <new display name>
Tested inside notebook on AML compute instance and with azureml-sdk version 1.37.0:
from azureml.core import Workspace, Run

run = Run.get_context()
if "OfflineRun" not in run.id:
    # Submitted run inside AML: the workspace comes with the run context
    workspace = run.experiment.workspace
else:
    # Offline run (e.g. in a notebook on an AML compute instance):
    # do your auth for AML here, for example:
    # workspace = Workspace(subscription_id, resource_group, workspace_name, auth)
    pass  # replace this pass when you need the workspace in an offline run
run.display_name = "CustomDisplayName"
run.description = "CustomDescription"
print(f"workspace has the run {run.display_name} with the description {run.description}")
Output:
workspace has the run CustomDisplayName with the description CustomDescription
I'm using "studio (preview)" from Microsoft Azure Machine Learning to create a pipeline that applies machine learning to a dataset in a blob storage that is connected to our data warehouse.
In the "Designer", an "Execute R Script" action can be added to the pipeline. I'm using this functionality to execute some of my own machine learning algorithms.
I've got a 'hello world' version of this script working (including using the "script bundle" to load the functions in my own R files). It applies a very simple manipulation (compute the days difference with the date in the date column and 'today'), and stores the output as a new file. Given that the exported file has the correct information, I know that the R script works well.
The script looks like this:
# R version: 3.5.1
# The script MUST contain a function named azureml_main
# which is the entry point for this module.
# The entry point function can contain up to two input arguments:
# Param<medals>: a R DataFrame
# Param<matches>: a R DataFrame
azureml_main <- function(dataframe1, dataframe2){
  message("STARTING R script run.")

  # If a zip file is connected to the third input port, it is
  # unzipped under "./Script Bundle". This directory is added
  # to sys.path.
  message('Adding functions as source...')
  if (FALSE) {
    # This works...
    source("./Script Bundle/first_function_for_script_bundle.R")
  } else {
    # And this works as well!
    message('Sourcing all available functions...')
    functions_folder = './Script Bundle'
    list.files(path = functions_folder)
    list_of_R_functions <- list.files(
      path = functions_folder, pattern = "^.*[Rr]$",
      include.dirs = FALSE, full.names = TRUE
    )
    for (fun in list_of_R_functions) {
      message(sprintf('Sourcing <%s>...', fun))
      source(fun)
    }
  }

  message('Executing R pipeline...')
  dataframe1 = calculate_days_difference(dataframe = dataframe1)

  # Return datasets as a Named List
  return(list(dataset1=dataframe1, dataset2=dataframe2))
}
And although I do print some messages in the R script, I haven't been able to find the "stdoutlogs" or "stderrlogs" that should contain these printed messages.
I need the printed messages for 1) information on how the analysis went and -most importantly- 2) debugging in case the code failed.
Now, I have found (in multiple locations) the files "stdoutlogs.txt" and "stderrlogs.txt". These can be found under "Logs" when I click on "Execute R Script" in the "Designer".
I can also find "stdoutlogs.txt" and "stderrlogs.txt" files under "Experiments" when I click on a finished "Run" and then both under the tab "Outputs" and under the tab "Logs".
However... all of these files are empty.
Can anyone tell me how I can print messages from my R Script and help me locate where I can find the printed information?
Can you please click on the "Execute R Script" module and download 70_driver.log? I tried message("STARTING R script run.") in an R sample and could find the output there.
I have a script to scrape tweets using the rtweet package in R. I am using the following code.
rt <- search_tweets(
  q = "اجرک",
  n = 5000,
  include_rts = FALSE,
  geocode = lookup_coords(),
  parse = TRUE,
  lang = 'ur',
  retryonratelimit = TRUE,
  token = create_token()
)
The code works fine in RStudio (create_token and lookup_coords have respective inputs that are removed here), and I am able to get a few hundred tweets containing the search query. The aim is to run this script via the Windows Task Scheduler. However, when the same script is run from the command line, e.g.
Rscript -e "source('path\\to\\script.R')"
the script runs but the resulting data frame has zero rows. Using my very limited understanding of debugging, I pinpointed the problem to the query string passed to the function above: if I use Latin characters, for example 'ajrak', the command-line run does return a data frame with tweets.
In short, the R script behaves differently in RStudio versus the Windows command line, and the root cause is the UTF-8 query. After searching around a lot, I could not find a solution. Is there any way to fix this problem?
Two workarounds:
Use Linux or macOS, where the command line handles UTF-8 natively.
Use escaped Unicode characters (\uXXXX) in the query instead of the UTF-8 text.
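To get the escaped form of a query, you can generate it programmatically. A sketch in Python (the resulting \uXXXX escapes can be pasted into the R script as-is, since R string literals use the same escape syntax):

```python
def to_unicode_escapes(s):
    # Replace every non-ASCII character with its \uXXXX escape so the
    # query survives the Windows command-line encoding round-trip.
    return "".join(c if ord(c) < 128 else "\\u%04x" % ord(c) for c in s)

print(to_unicode_escapes("اجرک"))  # \u0627\u062c\u0631\u06a9
```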
These are the steps I am trying to achieve:
Upload a PDF document on the server.
Convert the PDF document to a set of images using GhostScript (every page is converted to an image).
Send the collection of images back to the client.
So far, I am interested in #2.
First, I downloaded both gswin32c.exe and gsdll32.dll and managed to manually convert a PDF to a collection of images (I opened cmd and ran the command below):
gswin32c.exe -dSAFER -dBATCH -dNOPAUSE -sDEVICE=jpeg -r150 -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -dMaxStripSize=8192 -sOutputFile=image_%d.jpg somepdf.pdf
Then I thought, I'll put gswin32c.exe and gsdll32.dll into ClientBin of my web project, and run the .exe via a Process.
System.Diagnostics.Process process1 = new System.Diagnostics.Process();
process1.StartInfo.WorkingDirectory = Request.MapPath("~/");
process1.StartInfo.FileName = Request.MapPath("ClientBin/gswin32c.exe");
process1.StartInfo.Arguments = "-dSAFER -dBATCH -dNOPAUSE -sDEVICE=jpeg -r150 -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -dMaxStripSize=8192 -sOutputFile=image_%d.jpg somepdf.pdf";
process1.Start();
Unfortunately, nothing was output to ClientBin. Anyone got an idea why? Any recommendation will be highly appreciated.
I've tried your code and it seems to be working fine. I would recommend checking the following things:
Verify that your somepdf.pdf is in the working folder of the gs process, or specify the full path to the file in the command line. It would also be useful to see Ghostscript's output by doing something like this:
....
process1.StartInfo.RedirectStandardOutput = true;
process1.StartInfo.UseShellExecute = false;
process1.Start();
// read output
string output = process1.StandardOutput.ReadToEnd();
...
process1.WaitForExit();
...
If gs can't find your file, you would get an "Error: /undefinedfilename in (somepdf.pdf)" in the output stream.
Another possibility is that your script proceeds without waiting for the gs process to finish and generate the resulting image_N.jpg files. Adding process1.WaitForExit() should solve that.
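For reference, the same capture-and-wait pattern sketched in Python (a placeholder echo command stands in for gswin32c.exe so the sketch runs anywhere):

```python
import subprocess

# Run the converter, capture its output, and wait for it to exit
# before looking for the generated image_N.jpg files.
proc = subprocess.run(
    ["echo", "Error: /undefinedfilename in (somepdf.pdf)"],  # placeholder for gswin32c.exe + args
    capture_output=True,
    text=True,
)
print(proc.stdout)  # Ghostscript's messages would appear here
if "undefinedfilename" in proc.stdout:
    print("Ghostscript could not find the input PDF")
```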