How to set Azure experiment name from the code after the 2021-08-18 SDK change? - azuremlsdk

On 2021-08-18, Microsoft (for our convenience?) made the following changes to their Azure ML SDK:
Azure Machine Learning Experimentation User Interface. Run Display Name.
The Run Display Name is a new, editable and optional display name that can be assigned to a run.
This name can help with more effectively tracking, organizing and discovering the runs.
The Run Display Name is defaulted to an adjective_noun_guid format (Example: awesome_watch_2i3uns).
This default name can be edited to a more customizable name. This can be edited from the Run details page in the Azure Machine Learning studio user interface.
Before this change to the SDK, Run Display Name = experiment name + hash.
I was assigning the experiment name from the SDK:
from azureml.core import Experiment

experiment_name = 'yus_runExperiment'
experiment = Experiment(ws, experiment_name)
run = experiment.submit(src)
After the change the Run Display Names are auto-generated.
I do not want to manually edit the Run Display Name, as I may sometimes run hundreds of experiments a day.
I tried to find an answer in the Microsoft documentation, but my attempts have failed.
Is there an Azure SDK function to assign the Run Display Name?

Just tested in SDK v1.38.0.
You could do it like this:
from azureml.core import Experiment

experiment_name = 'yus_runExperiment'
experiment = Experiment(ws, experiment_name)
run = experiment.submit(src)
run.display_name = "Training"
Screenshot

It's undocumented (so try at your own risk, I guess), but I successfully changed the display name using the following Python code.
from azureml.core import Experiment, Run, Workspace
ws = Workspace.from_config()
exp = Experiment(workspace=ws, name=<name>)
run = Run(exp, <run id>)
run.display_name = <new display name>

Tested inside a notebook on an AML compute instance, with azureml-sdk version 1.37.0:
from azureml.core import Workspace, Run

run = Run.get_context()
if "OfflineRun" not in run.id:
    # submitted run: the workspace is available from the run context
    workspace = run.experiment.workspace
else:
    # offline run (e.g. a notebook on an AML compute instance): authenticate yourself, e.g.:
    # workspace = Workspace(subscription_id, resource_group, workspace_name, auth)
    pass  # remove this pass once you have filled in the authentication above
run.display_name = "CustomDisplayName"
run.description = "CustomDescription"
print(f"workspace has the run {run.display_name} with the description {run.description}")
Output:
workspace has the run CustomDisplayName with the description CustomDescription

Related

Running R script in AWS

Looking at this page and this piece of code in particular:
import boto3
account_id = boto3.client("sts").get_caller_identity().get("Account")
region = boto3.session.Session().region_name
ecr_repository = "r-in-sagemaker-processing"
tag = ":latest"
uri_suffix = "amazonaws.com"
processing_repository_uri = "{}.dkr.ecr.{}.{}/{}".format(
account_id, region, uri_suffix, ecr_repository + tag
)
# Create ECR repository and push Docker image
!docker build -t $ecr_repository docker
!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)
!aws ecr create-repository --repository-name $ecr_repository
!docker tag {ecr_repository + tag} $processing_repository_uri
!docker push $processing_repository_uri
This is not pure Python, obviously? Are these AWS CLI commands? I have used Docker previously, but I find this example very confusing. Is anyone aware of an end-to-end example of simply running some R job in AWS using SageMaker/Docker? Thanks.
This is Python code mixed with shell commands invoked through the notebook's ! escape (the lines starting with !).
Such commands aren't unique to this platform - you can use them in any Jupyter notebook - but this particular code is meant to be run on SageMaker, in what seems like a fairly convoluted way of running R scripts as processing jobs.
However, the only things you really need to focus on are the R script and the final two cells. The instruction at the top ("don't change this line") creates a file (preprocessing.R) which gets executed later, and then you can see the results.
Just run all the code cells in that order, with your own custom R code in the first cell. Note the line plot_key = "census_plot.png" in the last cell; this refers to the image being created in the R code. For other output types (e.g. text) you'll have to look up the necessary Python package (PIL is an image manipulation package) and adapt accordingly.
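For the image output specifically, the idea in the notebook is to copy the PNG from the processing job's S3 output back into the notebook and open it with PIL. Here is a minimal sketch of that, assuming a variable such as preprocessed_image_data holds the job's S3 output prefix (the notebook may use a different name):
from PIL import Image

plot_key = "census_plot.png"
plot_in_s3 = "{}/{}".format(preprocessed_image_data, plot_key)

# copy the plot from the job's S3 output into the notebook's working directory
!aws s3 cp {plot_in_s3} .

# as the last expression in a cell, this renders the image inline
Image.open(plot_key)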
Try this to get the CSV file that the R script is also generating (this code is not validated, so you might need to fix any problems that arise):
import csv

csv_key = "plot_data.csv"
csv_in_s3 = "{}/{}".format(preprocessed_csv_data, csv_key)

# copy the CSV from the job's S3 output, then read and display its rows
!aws s3 cp {csv_in_s3} .
with open(csv_key) as file:
    dat = list(csv.reader(file))
display(dat)
So now you should have an idea of how the two different output types that the example R script generates are handled, and from there you can adapt the approach to whatever your own R code outputs.

Uploading large dataset from FiftyOne to CVAT

I'm trying to upload around 15 GB of data from FiftyOne to CVAT using the 'annotate' function in order to fix annotations. The task is divided into jobs of 50 samples. During the sample upload, I get a '504 Gateway Time-Out' error. I can see the images in CVAT, but they are without the current annotations.
I tried uploading the annotations separately using the 'task_id' and changing the 'cvat.py' file in FiftyOne, but I wasn't able to load the changed annotations.
I can't break this down into multiple tasks, since all the tasks would have the same name, which makes it inconvenient.
In order to be able to use 'load_annotations' to update the dataset, I understand that I have to upload it using the 'annotate' function (unless there is another way).
Update: This seems to be a limitation of CVAT on the maximum size of requests to their API. In order to circumvent this for the time being, we are adding a task_size parameter to the annotate() method of FiftyOne which automatically breaks an annotation run into multiple tasks of a maximum task_size to avoid large data or annotation uploads.
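Once task_size is available, the usage would look roughly like the sketch below. This is a hedged illustration rather than a confirmed final API: task_size is the new parameter described above, and the remaining arguments follow the existing annotate() signature.
import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart").clone()

# Break the annotation run into CVAT tasks of at most 50 samples each,
# so that no single upload request becomes too large
dataset.annotate(
    "fix_annotations",
    label_field="ground_truth",
    task_size=50,
)

# ... annotate in CVAT, then merge the edits back into the dataset
dataset.load_annotations("fix_annotations", cleanup=True)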
Previous Answer:
The best way to manage this workflow now would be to break down your annotations into multiple tasks but then upload them to one CVAT project to be able to group and manage them nicely.
For example:
import fiftyone as fo
import fiftyone.zoo as foz
dataset = foz.load_zoo_dataset("quickstart").clone()
# The label schema is automatically inferred from the existing labels
# Alternatively, it can be specified with the `label_schema` kwarg
# when calling `annotate()`
label_field = "ground_truth"
# Upload batches of your dataset to different tasks
# all stored in the same project
project_name = "multiple_task_example"
anno_keys = []
for i in range(int(len(dataset) / 50)):
    anno_key = "example_%d" % i
    view = dataset.skip(i * 50).limit(50)
    view.annotate(
        anno_key,
        label_field=label_field,
        project_name=project_name,
    )
    anno_keys.append(anno_key)

# Annotate in CVAT...

# Load all annotations and cleanup tasks/project when complete
anno_keys = dataset.list_annotation_runs()
for anno_key in anno_keys:
    dataset.load_annotations(anno_key, cleanup=True)
    dataset.delete_annotation_run(anno_key)
Uploading to existing tasks and the project_name argument will be available in the next release. If you want to use this immediately, you can install FiftyOne from source: https://github.com/voxel51/fiftyone#installing-from-source
We are working on further optimizations and stability improvements for large CVAT annotation jobs like yours.

Azure ML Python SDK mini_batch_size not working as expected on ParallelRunConfig for TabularDataset

I am using the Azure ML Python SDK to build a custom experiment pipeline. I am trying to run the training on my tabular dataset in parallel on a cluster of 4 VMs with GPUs. I am following the documentation at https://learn.microsoft.com/en-us/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps.parallelrunconfig?view=azure-ml-py
The issue I am facing is that no matter what value I set for mini_batch_size, the individual runs get all the rows. I am using EntryScript().logger to check the number of rows passed to each process. What I see is that my data is being processed 4 times by 4 VMs, not split into 4 parts. I have tried setting mini_batch_size to 1KB, 10KB, and 1MB, but nothing seems to make a difference.
Here is my code for ParallelRunConfig and ParallelRunStep. Any hints are appreciated. Thanks
#------------------------------------------------#
# Step 2a - Batch config for parallel processing #
#------------------------------------------------#
from azureml.pipeline.steps import ParallelRunConfig
# python script step for batch processing
dataprep_source_dir = "./src"
entry_point = "batch_process.py"
mini_batch_size = "1KB"
time_out = 300
parallel_run_config = ParallelRunConfig(
    environment=custom_env,
    entry_script=entry_point,
    source_directory=dataprep_source_dir,
    output_action="append_row",
    mini_batch_size=mini_batch_size,
    error_threshold=1,
    compute_target=compute_target,
    process_count_per_node=1,
    node_count=vm_max_count,
    run_invocation_timeout=time_out,
)
#-------------------------------#
# Step 2b - Run Processing Step #
#-------------------------------#
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep
from azureml.pipeline.steps import ParallelRunStep
from datetime import datetime
# create upload dataset output for processing
output_datastore_name = processed_set_name
output_datastore = Datastore(workspace, output_datastore_name)
processed_output = PipelineData(name="scores",
                                datastore=output_datastore,
                                output_path_on_compute="outputs/")
# pipeline step for parallel processing
parallel_step_name = "batch-process-" + datetime.now().strftime("%Y%m%d%H%M")
process_step = ParallelRunStep(
    name=parallel_step_name,
    inputs=[data_input],
    output=processed_output,
    parallel_run_config=parallel_run_config,
    allow_reuse=False,
)
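For context, the batch_process.py entry script checks the mini-batch size roughly as in the sketch below. This is an illustrative reconstruction rather than the poster's actual script: EntryScript comes from the azureml_user.parallel_run helper module, and for a TabularDataset input each mini batch is handed to run() as a pandas DataFrame.
# batch_process.py (illustrative sketch)
from azureml_user.parallel_run import EntryScript


def init():
    global logger
    logger = EntryScript().logger


def run(mini_batch):
    # mini_batch is a pandas DataFrame for TabularDataset inputs
    logger.info("received %d rows in this mini batch", len(mini_batch))
    # with output_action="append_row", return one result per processed batch/item
    return ["processed %d rows" % len(mini_batch)]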
I have found the cause of this issue. What the documentation neglects to mention is that mini_batch_size only works if your tabular dataset comprises multiple files, e.g. multiple Parquet files with X rows per file. If you have one gigantic file that contains all the rows, mini_batch_size cannot extract only part of it to be processed in parallel. I solved the problem by configuring my Azure Synapse workspace data pipeline to store only a few rows per file.
Currently it works on CSV but not on Parquet. You can mini-batch a CSV file, e.g. https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/parallel-run/tabular-dataset-inference-iris.ipynb
The documentation does not make it clear that certain file types are treated differently.
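If you control how the input is produced, a hedged workaround is to write the data out as many small files before registering the TabularDataset, so ParallelRunStep has something to partition. A minimal sketch using pandas and the v1 SDK, where the datastore name, file paths, and the 5,000-row chunk size are placeholders:
import os

import pandas as pd
from azureml.core import Dataset, Datastore, Workspace

ws = Workspace.from_config()
datastore = Datastore.get(ws, "my_datastore")  # placeholder datastore name

# split one large file into many small CSV files (~5,000 rows each)
df = pd.read_parquet("big_input.parquet")  # placeholder input file
os.makedirs("chunks", exist_ok=True)
for i, start in enumerate(range(0, len(df), 5000)):
    df.iloc[start:start + 5000].to_csv("chunks/part_%05d.csv" % i, index=False)

# upload the chunks and register them as a single TabularDataset;
# ParallelRunStep can then split the work across the individual files
datastore.upload(src_dir="chunks", target_path="batch_input", overwrite=True)
tabular_ds = Dataset.Tabular.from_delimited_files(path=(datastore, "batch_input/*.csv"))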

R extension write local data

I am creating a package and would like to store settings data locally, since the settings are unique to each user of the package and should not have to be set again each time the package is loaded.
How can I do this in the best way?
You could keep the necessary data in an object and write it out with saveRDS() whenever a change is made, or when the user leaves or explicitly asks to save.
saveRDS() serializes a single R object to a file at the specified path.
saveRDS(<obj>, "path/to/filename.rds")
You can load it again the next time the package starts using readRDS(). A nice property of readRDS() is that it returns the object, so you can assign it to whatever name you like - you don't have to remember the name it had when it was saved, and nothing extra is added to your namespace.
newly.assigned.name <- readRDS("path/to/filename.rds")
Where to store it
Windows
You can use the %systemdrive% and %homepath% environment variables to get the user's home directory:
Running echo %systemdrive% on the command prompt gives C:
Running echo %homepath% gives \Users\<username>
Concatenated, they give the desired path: C:\Users\<username>
Linux/macOS
Either store it in the package's own installation directory:
path.to.package <- find.package("name.of.your.package",
                                lib.loc = NULL, quiet = FALSE,
                                verbose = getOption("verbose"))

# then construct the path to the final destination with file.path(),
# which automatically picks the correct separator ('/' or '\') for
# Unix-derived systems (Linux/macOS) versus Windows
destination.folder.path <- file.path(path.to.package,
                                     "subfoldername", "filename")
Or use the user's $HOME directory and store the settings there in a file whose name begins with "." (e.g. ".your-packages-name.rds"); such dotfiles are the Unix (Linux/macOS) convention for files that hold a program's configuration.
If anybody has a better solution, please help!

control dropbox in R without authentication every time?

Hi, I want to use R to check Dropbox and get the latest file. Currently I'm using library(rdrop2):
library(rdrop2)
# drop_auth() # username/password for 1st time
drop.file = drop_dir('daily_export')
which1 = grepl("^daily_export/hat.*.gz$", drop.file$path) # files begin with hat and end with .gz
drop.file = drop.file[which1, ]
drop.file = drop.file[drop.file$path == max(drop.file$path), 'path']  # max file name indicates latest
drop_get(drop.file$path) #download to current folder
It works, but when I restart R, drop_dir() needs my authentication again - I have to click 'agree' in the browser.
I want to automate and schedule the R code, so I'm wondering if there's a way to avoid authenticating every time. Ways using other tools are welcome too. Thanks!
