Azure ML Python SDK mini_batch_size not working as expected on ParallelRunConfig for TabularDataset - azure-machine-learning-studio

I am using the Azure ML Python SDK to build a custom experiment pipeline. I am trying to run training on my tabular dataset in parallel on a cluster of 4 VMs with GPUs. I am following the documentation available at https://learn.microsoft.com/en-us/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps.parallelrunconfig?view=azure-ml-py
The issue I am facing is that no matter what value I set for mini_batch_size, the individual runs get all rows. I am using EntryScript().logger to check the number of rows passed to each process. What I see is that my data is being processed 4 times by 4 VMs, not split into 4 parts. I have tried setting mini_batch_size to 1KB, 10KB, and 1MB, but nothing seems to make a difference.
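For context, the relevant part of my entry script looks roughly like this (a simplified sketch, assuming the standard EntryScript helper that ParallelRunStep provides via azureml_user.parallel_run; the real batch_process.py does the actual training):
# simplified sketch of batch_process.py
from azureml_user.parallel_run import EntryScript

def init():
    # one-time setup per process goes here
    pass

def run(mini_batch):
    # for a TabularDataset input, mini_batch arrives as a pandas DataFrame
    logger = EntryScript().logger
    logger.info("rows in this mini batch: %d" % len(mini_batch))
    # ... training / scoring on this slice ...
    return ["processed %d rows" % len(mini_batch)]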
Here is my code for ParallelRunConfig and ParallelRunStep. Any hints are appreciated. Thanks
#------------------------------------------------#
# Step 2a - Batch config for parallel processing #
#------------------------------------------------#
from azureml.pipeline.steps import ParallelRunConfig
# python script step for batch processing
dataprep_source_dir = "./src"
entry_point = "batch_process.py"
mini_batch_size = "1KB"
time_out = 300
parallel_run_config = ParallelRunConfig(
    environment=custom_env,
    entry_script=entry_point,
    source_directory=dataprep_source_dir,
    output_action="append_row",
    mini_batch_size=mini_batch_size,
    error_threshold=1,
    compute_target=compute_target,
    process_count_per_node=1,
    node_count=vm_max_count,
    run_invocation_timeout=time_out
)
#-------------------------------#
# Step 2b - Run Processing Step #
#-------------------------------#
from azureml.core import Datastore
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep
from azureml.pipeline.steps import ParallelRunStep
from datetime import datetime
# create upload dataset output for processing
output_datastore_name = processed_set_name
output_datastore = Datastore(workspace, output_datastore_name)
processed_output = PipelineData(name="scores",
                                datastore=output_datastore,
                                output_path_on_compute="outputs/")
# pipeline step for parallel processing
parallel_step_name = "batch-process-" + datetime.now().strftime("%Y%m%d%H%M")
process_step = ParallelRunStep(
    name=parallel_step_name,
    inputs=[data_input],
    output=processed_output,
    parallel_run_config=parallel_run_config,
    allow_reuse=False
)

I have found the cause of this issue. What the documentation neglects to mention is that mini_batch_size only works as expected if your tabular dataset comprises multiple files, e.g., multiple Parquet files with X rows per file. If you have one gigantic file that contains all rows, mini_batch_size cannot extract partial data from it to be processed in parallel. I solved the problem by configuring an Azure Synapse workspace data pipeline to store only a few rows per file.
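For illustration only (this is not the Synapse pipeline I used, and the file names and row count are made up), splitting one large file into many small Parquet files with pandas before registering the TabularDataset achieves the same effect:
import os
import pandas as pd

rows_per_file = 10000  # arbitrary; pick a size that gives sensible mini batches
os.makedirs("partitioned", exist_ok=True)
df = pd.read_parquet("all_rows.parquet")  # hypothetical single large file

for i, start in enumerate(range(0, len(df), rows_per_file)):
    df.iloc[start:start + rows_per_file].to_parquet("partitioned/part_%05d.parquet" % i)
# register the "partitioned" folder as the TabularDataset fed to ParallelRunStep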

It works on CSV but not Parquet at the moment: a single CSV file can be split into mini batches, e.g. https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/parallel-run/tabular-dataset-inference-iris.ipynb
The documentation does not make it clear that certain file types are treated differently.

Related

Running an R script in AWS

Looking at this page and this piece of code in particular:
import boto3
account_id = boto3.client("sts").get_caller_identity().get("Account")
region = boto3.session.Session().region_name
ecr_repository = "r-in-sagemaker-processing"
tag = ":latest"
uri_suffix = "amazonaws.com"
processing_repository_uri = "{}.dkr.ecr.{}.{}/{}".format(
    account_id, region, uri_suffix, ecr_repository + tag
)
# Create ECR repository and push Docker image
!docker build -t $ecr_repository docker
!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)
!aws ecr create-repository --repository-name $ecr_repository
!docker tag {ecr_repository + tag} $processing_repository_uri
!docker push $processing_repository_uri
This is not pure Python, obviously? Are these AWS CLI commands? I have used Docker previously, but I find this example very confusing. Is anyone aware of an end-to-end example of simply running some R job in AWS using SageMaker/Docker? Thanks.
This is Python code mixed with shell magic calls (the ! commands).
Magic commands aren't unique to this platform; you can use them in Jupyter, but this particular code is meant to be run on their platform, in what seems like a fairly convoluted way of running R scripts as processing jobs.
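Stripped of the Docker build plumbing, the processing job itself boils down to something like this sketch (using the SageMaker Python SDK's ScriptProcessor; the instance type and output path are placeholders, and processing_repository_uri is the ECR image built above):
import sagemaker
from sagemaker.processing import ScriptProcessor, ProcessingOutput

role = sagemaker.get_execution_role()

# run the R script inside the custom R image as a processing job
processor = ScriptProcessor(
    image_uri=processing_repository_uri,  # the image built and pushed above
    command=["Rscript"],
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",         # placeholder instance type
)

processor.run(
    code="preprocessing.R",               # the file the notebook writes out
    outputs=[ProcessingOutput(source="/opt/ml/processing/output")],
)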
However, the only thing you really need to focus on is the R script, and the final two cell blocks. The instruction at the top (don't change this line) creates a file (preprocessing.R) which gets executed later, and then you can see the results.
Just run all the code cells in that order, with your own custom R code in the first cell. Note the line plot_key = "census_plot.png" in the last cell; this refers to the image being created in the R code. As for other output types (e.g. text), you'll have to look up the necessary Python package (PIL is an image manipulation package) and adapt accordingly.
Try this to get the CSV file that the R script is also generating (this code is not validated, so you might need to fix any problems that arise):
import csv
csv_key = "plot_data.csv"
csv_in_s3 = "{}/{}".format(preprocessed_csv_data, csv_key)
!aws s3 cp {csv_in_s3} .
# read the downloaded CSV and show its rows
with open(csv_key) as f:
    dat = list(csv.reader(f))
display(dat)
So now you should have an idea of how the two different output types the example R script generates are handled, and from there you can adapt your own R code based on what it outputs.

Uploading large dataset from FiftyOne to CVAT

I'm trying to upload around 15 GB of data from FiftyOne to CVAT using the 'annotate' function in order to fix annotations. The task is divided into jobs of 50 samples. During the sample upload, I get a '504 Gateway Time-Out' error. I can see the images in CVAT, but they are without the current annotations.
I tried uploading the annotations separately using the 'task_id' and changing the 'cvat.py' file in FiftyOne, but I wasn't able to load the changed annotations.
I can't break this down into multiple tasks since all tasks would have the same name, which makes it inconvenient.
In order to be able to use 'load_annotations' to update the dataset, I understand that I have to upload it using the 'annotate' function (unless there is another way).
Update: This seems to be a limitation of CVAT on the maximum size of requests to their API. In order to circumvent this for the time being, we are adding a task_size parameter to the annotate() method of FiftyOne which automatically breaks an annotation run into multiple tasks of a maximum task_size to avoid large data or annotation uploads.
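Once that parameter is released, usage should look roughly like the sketch below (the dataset name and annotation key are placeholders, and the exact signature may differ in the final release):
import fiftyone as fo

dataset = fo.load_dataset("my_dataset")  # placeholder dataset name

# hypothetical usage of the upcoming task_size parameter:
# the annotation run is split into CVAT tasks of at most 50 samples each
dataset.annotate(
    "fix_annotations",          # annotation run key
    label_field="ground_truth",
    task_size=50,
)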
Previous Answer:
The best way to manage this workflow now would be to break down your annotations into multiple tasks but then upload them to one CVAT project to be able to group and manage them nicely.
For example:
import fiftyone as fo
import fiftyone.zoo as foz
dataset = foz.load_zoo_dataset("quickstart").clone()
# The label schema is automatically inferred from the existing labels
# Alternatively, it can be specified with the `label_schema` kwarg
# when calling `annotate()`
label_field = "ground_truth"
# Upload batches of your dataset to different tasks
# all stored in the same project
project_name = "multiple_task_example"
anno_keys = []
for i in range(int(len(dataset) / 50)):
    anno_key = "example_%d" % i
    view = dataset.skip(i * 50).limit(50)
    view.annotate(
        anno_key,
        label_field=label_field,
        project_name=project_name,
    )
    anno_keys.append(anno_key)
# Annotate in CVAT...
# Load all annotations and cleanup tasks/project when complete
anno_keys = dataset.list_annotation_runs()
for anno_key in anno_keys:
dataset.load_annotations(anno_key, cleanup=True)
dataset.delete_annotation_run(anno_key)
Uploading to existing tasks and the project_name argument will be available in the next release. If you want to use this immediately you can install FiftyOne from source: https://github.com/voxel51/fiftyone#installing-from-source
We are working on further optimizations and stability improvements for large CVAT annotation jobs like yours.

rxImport fails on large dataset

I've been trying unsuccessfully for the past two days to convert a large CSV (9 GB) into XDF format using the rxImport function.
The process seems to start off well, with R Server reading in the data chunk by chunk, but after a few minutes it slows to a crawl and then fails completely after around 6 hours, with Windows stopping the server and saying it has run out of RAM.
The code I'm using is as follows:
pd_in_file <- RxTextData("cca_pd_entity.csv", delimiter = ",") # file to import
pd_out_file <- file.path("cca_pd_entity.xdf") # desired output file
pd_data <- rxImport(inData = pd_in_file, outFile = pd_out_file,
                    stringsAsFactors = TRUE, overwrite = TRUE)
I'm running Microsoft R Server, version 9.0.1, on a Windows 7 machine with 16 GB of RAM.
Thanks
It's been solved by following Hong Ooi's recommendation to set colInfo in rxTextData. I'm not sure why it made such a big difference, but it converted the entire 9 GB dataset in less than 2 minutes, where before it had completely failed to import after several hours.

checkpointing DataFrames in SparkR

I am looping over a number of CSV data files using R/Spark. About 1% of each file must be retained (filtered based on certain criteria) and merged with the next data file (I have used union/rbind). However, as the loop runs, the lineage of the data gets longer and longer, as Spark remembers all the previous datasets and filter() calls.
Is there a way to do checkpointing in the Spark R API? I have learned that Spark 2.1 has checkpointing for DataFrames, but this does not seem to be available from R.
We had the same issue with Scala/GraphX on quite a large graph (a few billion records) when searching for connected components.
I'm not sure what is available in R for your specific version, but a usual workaround is to break the lineage by "saving" the data and then reloading it. In our case, we break the lineage every 15 iterations:
def refreshGraph[VD: ClassTag, ED: ClassTag](g: Graph[VD, ED], checkpointDir: String, iterationCount: Int, numPartitions: Int): Graph[VD, ED] = {
  val path = checkpointDir + "/iter-" + iterationCount
  saveGraph(g, path)
  g.unpersist()
  loadGraph(path, numPartitions)
}
An incomplete solution/workaround is to collect() your DataFrame into an R object and later re-parallelize it with createDataFrame(). This works well for small data, but for larger datasets it becomes too slow and complains about tasks being too large.

How to Import SQLite data (gathered by an Android device) into either Octave or MATLAB?

I have some data gathered by an Android phone, stored as an SQLite file. I would like to play around with this data (analysing it) using either MATLAB or Octave.
I was wondering what commands you would use to import this data into MATLAB, say, to put it into a vector or matrix? Do I need any special toolboxes or packages, like the Database Package, to access the SQL format?
There is the mksqlite tool.
I've used it personally; I had some issues getting the correct version for my version of MATLAB, but after that, no problems. You can even query the database file directly to reduce the amount of data you import into MATLAB.
Although mksqlite looks nice, it is not available for Octave and may not be suitable as a long-term solution. Exporting the tables to CSV files is an option, but importing them (into Octave) can be quite slow for larger data sets because of the string parsing involved.
As an alternative, I ended up writing a small Python script to convert my SQLite table into a MAT file, which is fast to load into either MATLAB or Octave. MAT files are platform-neutral binary files, and the method works for columns of both numbers and strings.
import sqlite3
import scipy.io

conn = sqlite3.connect('my_data.db')
csr = conn.cursor()
res = csr.execute('SELECT * FROM MY_TABLE')
db_parms = list(map(lambda x: x[0], res.description))
# Remove those variables in db_parms you do not want to export
X = {}
for prm in db_parms:
    csr.execute('SELECT "%s" FROM MY_TABLE' % (prm))
    v = csr.fetchall()
    # v is now a list of 1-tuples
    X[prm] = list(*zip(*v))
scipy.io.savemat('my_data.mat', X)
