Uploading large dataset from FiftyOne to CVAT

I'm trying to upload around 15GB of data from FiftyOne to CVAT using the 'annotate' function in order to fix annotations. The task is divided into jobs of 50 samples. During the sample upload, I get an 'Error 504 Gateway Time-Out'. I can see the images in CVAT, but they appear without their current annotations.
I tried uploading the annotations separately using the 'task_id' and modifying the 'cvat.py' file in FiftyOne, but I wasn't able to load the changed annotations.
I can't break this down into multiple tasks, since all the tasks end up with the same name, which makes them inconvenient to manage.
In order to be able to use 'load_annotations' to update the dataset, I understand that I have to upload it using the 'annotate' function (unless there is another way).

Update: This appears to be a limitation on the maximum request size accepted by the CVAT API. To work around it for the time being, we are adding a task_size parameter to FiftyOne's annotate() method, which automatically breaks an annotation run into multiple tasks of at most task_size samples each, avoiding overly large data or annotation uploads.
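Once that parameter is available, usage would look roughly like this (a sketch only; the dataset name and annotation key below are placeholders, and the exact release that ships task_size may differ):
import fiftyone as fo

dataset = fo.load_dataset("my_dataset")  # placeholder dataset name
dataset.annotate(
    "fix_annotations",           # placeholder annotation run key
    label_field="ground_truth",
    task_size=50,                # split the run into CVAT tasks of at most 50 samples
)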
Previous Answer:
The best way to manage this workflow now would be to break down your annotations into multiple tasks but then upload them to one CVAT project to be able to group and manage them nicely.
For example:
import fiftyone as fo
import fiftyone.zoo as foz
dataset = foz.load_zoo_dataset("quickstart").clone()
# The label schema is automatically inferred from the existing labels
# Alternatively, it can be specified with the `label_schema` kwarg
# when calling `annotate()`
label_field = "ground_truth"
# Upload batches of your dataset to different tasks
# all stored in the same project
project_name = "multiple_task_example"
anno_keys = []
for i in range(int(len(dataset) / 50)):
    anno_key = "example_%d" % i
    view = dataset.skip(i * 50).limit(50)
    view.annotate(
        anno_key,
        label_field=label_field,
        project_name=project_name,
    )
    anno_keys.append(anno_key)
# Annotate in CVAT...
# Load all annotations and cleanup tasks/project when complete
anno_keys = dataset.list_annotation_runs()
for anno_key in anno_keys:
    dataset.load_annotations(anno_key, cleanup=True)
    dataset.delete_annotation_run(anno_key)
Support for uploading to existing tasks and the project_name argument will be available in the next release. If you want to use them immediately, you can install FiftyOne from source: https://github.com/voxel51/fiftyone#installing-from-source
We are working on further optimizations and stability improvements for large CVAT annotation jobs like yours.

Related

Validate data from website before downloading in R

I have a bunch of weather data files I want to download, but there's a mix of website URLs that have data and URLs that don't. I'm using the download.file function in R to download the text files, which is working fine, but I'm also downloading a lot of empty text files, because all the URLs are valid even when no data is present.
For example, this url provides good data.
http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST&YEAR=2021&MONTH=12&FROM=3000&TO=3000&STNM=72645
But this one doesn't.
http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST&YEAR=1970&MONTH=12&FROM=3000&TO=3000&STNM=72645
Is there a way to check to see if there's valid data in the text file before I download it? I've looked for something in the RCurl package, but didn't see what I needed. Thank you.
You can use httr::HEAD to determine the size of the data before downloading it. Note that this only saves you the "pain" of downloading: if the query has any cost on the server side, the server still pays it even if you never download the result. (These two queries seem quick enough that it's probably not a problem.)
# good data
res1 <- httr::HEAD("http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST&YEAR=2021&MONTH=12&FROM=3000&TO=3000&STNM=72645")
httr::headers(res1)$`content-length`
# [1] "9435"
# no data
res2 <- httr::HEAD("http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST&YEAR=1970&MONTH=12&FROM=3000&TO=3000&STNM=72645")
httr::headers(res2)$`content-length`
# NULL
If the API provides a function for estimating size (or at least presence of data), then it might be nicer to the remote end to use that instead of using this technique. For example: let's assume that an API call requires a 20 second SQL query. A call to HEAD will take 20 seconds, just like a call to GET, the only difference being that you don't get the data. If you see that you will get data and then subsequently call httr::GET(.), then you'll wait another 20 seconds (unless the remote end is caching queries).
Alternatively, they may have a cheap heuristic for detecting the presence of data, perhaps just a simple yes/no, that only takes a few seconds. In that case, it would be much "nicer" of you to make a 3-second "is data present" API call before issuing the 20-second full query.
Bottom line: if the API has a "data size" estimator, use it, otherwise HEAD should work fine.
As an alternative to HEAD, just GET the data, check the content-length, and save to file only if found:
res1 <- httr::GET("http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST&YEAR=2021&MONTH=12&FROM=3000&TO=3000&STNM=72645")
stuff <- as.character(httr::content(res1))
if (!is.null(httr::headers(res1)$`content-length`)) {
  writeLines(stuff, "somefile.html")
}
# or do something else with the results, in-memory

Running an R script in AWS

Looking at this page and this piece of code in particular:
import boto3
account_id = boto3.client("sts").get_caller_identity().get("Account")
region = boto3.session.Session().region_name
ecr_repository = "r-in-sagemaker-processing"
tag = ":latest"
uri_suffix = "amazonaws.com"
processing_repository_uri = "{}.dkr.ecr.{}.{}/{}".format(
    account_id, region, uri_suffix, ecr_repository + tag
)
# Create ECR repository and push Docker image
!docker build -t $ecr_repository docker
!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)
!aws ecr create-repository --repository-name $ecr_repository
!docker tag {ecr_repository + tag} $processing_repository_uri
!docker push $processing_repository_uri
This is obviously not pure Python? Are these AWS CLI commands? I have used Docker previously, but I find this example very confusing. Is anyone aware of an end-to-end example of simply running some R job in AWS using SageMaker/Docker? Thanks.
This is Python code mixed with shell commands invoked through magic calls (the ! commands).
Magic commands aren't unique to this platform; you can use them in any Jupyter notebook, but this particular code is meant to be run on their platform, in what seems like a fairly convoluted way of running R scripts as processing jobs.
However, the only things you really need to focus on are the R script and the final two cell blocks. The instruction at the top ("don't change this line") creates a file (preprocessing.R) which gets executed later, and then you can see the results.
Just run all the code cells in that order, with your own custom R code in the first cell. Note the line plot_key = "census_plot.png" in the last cell; it refers to the image being created in the R code. For other output types (e.g. text) you'll have to look up the necessary Python package (PIL is an image-manipulation package) and adapt accordingly.
Try this to get the CSV file that the R script is also generating (this code is not validated, so you might need to fix any problems that arise):
import csv

csv_key = "plot_data.csv"
csv_in_s3 = "{}/{}".format(preprocessed_csv_data, csv_key)
!aws s3 cp {csv_in_s3} .

with open(csv_key) as f:
    dat = list(csv.reader(f))  # materialize the rows so they can be displayed
display(dat)
So now you should have an idea of how the two different output types generated by the example R script are handled, and from there you can adapt your own R code based on what it outputs.

Schedule a task (update data) each Monday in Shiny

I have a dashboard hosted on Shiny Server Pro that shows different analyses. The data comes from a long query that takes around 20 minutes to complete.
In my current setup, I have a button that updates the data. It:
queries new data
transforms the data
saves the data to an .RData file
saves the data in a global object (using data <<-)
Just in case, outside the server and ui functions I have a statement that checks whether the data object exists. If it does not, it reads the data from the .RData file instead of running the query again.
Now I would like to update the data each Monday at 5:00 pm (I do not want to have to open the app and push the button every Monday). I think the best way to do this is with a cron job via cronR. The code would be located in app.R, outside the server and ui functions. Now I have the following questions:
If I am using Shiny Server Pro, how many times will the app create the cron job if it is located in app.R outside the server and ui functions?
How can I replace the data object in the Shiny app, so that if a user opens the app on Monday after 5:00 pm the new data is already in place, without needing to read the .RData file and, of course, without running the query again?
What is the best practice?
Just create your cron process with cronR completely outside the Shiny application and make sure it saves your data to the correct place.
Create the R code which gets your data:
library(...)
# ...
# x <- mydata
save(x, file = "NewData.Rda")
Create the cron job:
cmd <- cron_rscript("path/to/getdata.R")
cron_add(cmd, frequency = 'daily', id = 'job5', at = '05:00')
I can't quite see your point 1. The app will not create the cron job as long as the script is not named "global.R", "ui.R", or "server.R", I think. Also, you don't have to put your code under the /srv/shiny-server/ directory.
For your point 2, check the reactiveFileReader function from the shiny package. It checks a file's last-modified time and re-reads the file if it has changed:
data <- reactiveFileReader(5 * 60 * 1000, session = NULL, filePath = "NewData.Rda",
                           readFunc = function(f) get(load(f)))  # return the loaded object itself, not its name

Azure ML Python SDK mini_batch_size not working as expected on ParallelRunConfig for TabularDataset

I am using the Azure ML Python SDK to build a custom experiment pipeline. I am trying to run training on my tabular dataset in parallel on a cluster of 4 VMs with GPUs. I am following the documentation at https://learn.microsoft.com/en-us/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps.parallelrunconfig?view=azure-ml-py
The issue I am facing is that no matter what value I set for mini_batch_size, the individual runs get all the rows. I am using EntryScript().logger to check the number of rows passed to each process. What I see is that my data is being processed 4 times by the 4 VMs rather than being split into 4 parts. I have tried setting mini_batch_size to 1KB, 10KB, and 1MB, but nothing seems to make a difference.
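For reference, this is roughly how the row count is being logged inside the entry script (a trimmed-down sketch of batch_process.py following the standard init()/run() contract; the actual training code is omitted):
import pandas as pd
from azureml_user.parallel_run import EntryScript

def init():
    # runs once per worker process before any mini-batches are handled
    global logger
    logger = EntryScript().logger

def run(mini_batch):
    # for a TabularDataset input, mini_batch arrives as a pandas DataFrame;
    # its size should be controlled by mini_batch_size, but I see all rows here
    logger.info("received {} rows".format(len(mini_batch)))
    # ... training on mini_batch ...
    return pd.DataFrame({"rows_processed": [len(mini_batch)]})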
Here is my code for ParallelRunConfig and ParallelRunStep. Any hints are appreciated. Thanks
#------------------------------------------------#
# Step 2a - Batch config for parallel processing #
#------------------------------------------------#
from azureml.pipeline.steps import ParallelRunConfig
# python script step for batch processing
dataprep_source_dir = "./src"
entry_point = "batch_process.py"
mini_batch_size = "1KB"
time_out = 300
parallel_run_config = ParallelRunConfig(
    environment=custom_env,
    entry_script=entry_point,
    source_directory=dataprep_source_dir,
    output_action="append_row",
    mini_batch_size=mini_batch_size,
    error_threshold=1,
    compute_target=compute_target,
    process_count_per_node=1,
    node_count=vm_max_count,
    run_invocation_timeout=time_out
)
#-------------------------------#
# Step 2b - Run Processing Step #
#-------------------------------#
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep
from azureml.pipeline.steps import ParallelRunStep
from datetime import datetime
# create upload dataset output for processing
output_datastore_name = processed_set_name
output_datastore = Datastore(workspace, output_datastore_name)
processed_output = PipelineData(name="scores",
                                datastore=output_datastore,
                                output_path_on_compute="outputs/")
# pipeline step for parallel processing
parallel_step_name = "batch-process-" + datetime.now().strftime("%Y%m%d%H%M")
process_step = ParallelRunStep(
    name=parallel_step_name,
    inputs=[data_input],
    output=processed_output,
    parallel_run_config=parallel_run_config,
    allow_reuse=False
)
I have found the cause of this issue. What the documentation neglects to mention is that mini_batch_size only works if your tabular dataset comprises multiple files, e.g. multiple Parquet files with X rows per file. If you have one gigantic file containing all the rows, mini_batch_size cannot extract partial data from it to be processed in parallel. I solved the problem by configuring my Azure Synapse workspace data pipeline to store only a few rows per file.
It currently works on CSV but not Parquet. You can batch a CSV file; see e.g. https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/parallel-run/tabular-dataset-inference-iris.ipynb
The documentation does not make it clear that certain file types are treated differently.
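If re-exporting through Synapse isn't convenient, one possible workaround (an unvalidated sketch; the paths and chunk size are placeholders) is to split the single large file into many small CSVs yourself before registering the TabularDataset, since the per-file boundaries are what the batching keys off:
import os
import pandas as pd

df = pd.read_parquet("big_dataset.parquet")  # the single large source file (placeholder path)
os.makedirs("chunks", exist_ok=True)
rows_per_file = 500  # placeholder; choose to match the mini-batch size you want
for i, start in enumerate(range(0, len(df), rows_per_file)):
    df.iloc[start:start + rows_per_file].to_csv(
        "chunks/part_{:05d}.csv".format(i), index=False
    )
# then register ./chunks as the TabularDataset so each file becomes a batching unit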

Force sys.source() to put functions in specified environment

Question:
I'm using sys.source() to source a script into a new environment. However, that script itself source()'s some things as well.
When it sources functions, they (and their output) get loaded into R_GlobalEnv instead of into the environment specified by sys.source(). It seems the functions' enclosing and binding environments end up under R_GlobalEnv instead of the environment you specify in sys.source().
Is there a way like sys.source() to run a script and keep everything it makes in a separate environment? An ideal solution would not require modifying the scripts I'm sourcing and still have "chdir = TRUE" style functionality.
Example:
Running this should show you what I mean:
# setup an external folder
other.folder = tempdir()
# make a functions script, it just adds "1" to the argument.
# Note: the strange-looking "assign(x=" bit is important
# to what I'm actually doing, so any solution needs to be
# robust to this.
functions = file.path(other.folder, "functions.R")
writeLines("myfunction = function(a){assign(x=c('function.output'), a+1, pos = 1)}", functions)
# make a parent script, which source()'s functions.R
# and invokes it on some data, and then modifies that data
parent = file.path(other.folder, "parent.R")
writeLines("source('functions.R')\n
original.data=1\n
myfunction(original.data)\n
resulting.data = function.output + 1", parent)
# make a separate environment
myenv = new.env()
# source parent.R into that new environment,
# using chdir=TRUE so parent.R can find functions.R
sys.source(parent, myenv, chdir = TRUE)
# You can see "myfunction" and "function.output"
# end up in R_GlobalEnv.
# Whereas "original.data" and "resulting.data" end up in the intended environment.
ls(myenv)
More information (what I'm actually trying to do):
I have data from several similar experiments. I'm trying to keep everything in line with "reproducible research" ideals (for my own sanity if nothing else). So what I'm doing is keeping each experiment in its own folder. The folder contains the raw data, and all the metadata which describes each sample (treatment, genotype, etc.). The folder also contains the necessary R scripts to read the raw data, match it with metadata, process it, and output graphs and summary statistics. These are tied into a "mother script" which will do the whole process for each experiment.
This works really well, but if I want to do some meta-analysis or just compare results between experiments, there are some difficulties. Right now I am thinking the best way would be to run each experiment's "mother script" in its own environment and then pull the data out of each environment to do my meta-analysis. An alternative approach might be to run each mother script in its own R instance, save the .RData files separately, and then re-load them into new environments in a fresh instance. That seems kind of hacky, though, and I feel like there's a more elegant solution.
