Equivalent API in fastai 2 - fastai

Previously I had fastai version 1 and was using the following for training.
from fastai.basic_data import DataBunch
from fastai.train import Learner
from fastai.metrics import accuracy
# DataBunch takes the datasets and internally creates the data loaders
data = DataBunch.create(train_ds, valid_ds, bs=batch_size, path='./data')
# Learner uses Adam as the default optimizer
learner = Learner(data, model, loss_func=F.cross_entropy, metrics=[accuracy])
# Gradients are clipped
learner.clip = 0.1
Now I have updated to fastai==2.1.6, and importing fastai.basic_data, fastai.train, and fastai.metrics raises ModuleNotFoundError.
What are the equivalent APIs in fastai 2?

This is one of the main differences between 2.x and 1.x.
The 2.x way of doing this is the DataBlock API.
Learner receives a DataLoaders object, not a DataBunch.
But if you already have datasets, you can easily create DataLoaders from them:
dls = DataLoaders.from_dsets(train, valid,
                             after_batch=[Normalize.from_stats(*imagenet_stats), *aug_transforms()])
Check the Siamese Tutorial if in doubt.
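For the full pipeline in the question, a minimal sketch of the fastai 2.x equivalent could look like the following. The import locations and the GradientClip callback are assumptions about 2.x (GradientClip may require a newer release than 2.1.6), so treat this as a starting point rather than the official recipe:
from fastai.data.core import DataLoaders
from fastai.learner import Learner
from fastai.metrics import accuracy
from fastai.callback.training import GradientClip  # assumption: available in newer 2.x releases
import torch.nn.functional as F

# DataLoaders built directly from existing PyTorch-style datasets
dls = DataLoaders.from_dsets(train_ds, valid_ds, bs=batch_size, path='./data')

# Learner still defaults to Adam; gradient clipping is now a callback
learner = Learner(dls, model, loss_func=F.cross_entropy, metrics=[accuracy],
                  cbs=[GradientClip(0.1)])  # roughly replaces learner.clip = 0.1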

Related

RobotFramework: Purpose and best practice for the resource- and library-folders

I am pondering the purpose of, and best practice for, using the Resource and Library folders in RobotFramework.
Below I have formulated some statements that serve to illustrate my questions. (Abbreviations used: KW = keyword, RF = RobotFramework, TS = test suite.)
Statements/questions:
Every KW that is designed to be shared among TSs and written in RF syntax should be put inside a .resource file in the Resource folder?
Every KW written in Python should be put (as a method inside a .py file) in the Library folder?
I.e. the line between the Resource and Library folders is drawn based on the syntax used when writing the KW (RF KWs go into the Resource folder and Python KWs go into the Library folder)?
Or should the line rather be drawn based on closeness to the test rig and the system under test (i.e. high- vs. low-level keywords, where low-level keywords are the ones that interact with the system under test), and hence you could place Python KWs (methods) in the Resource folder?
My take: yes on everything, even on the last paragraph starting with "Or". Everything up to that point was a question about the content/syntax of a file. And if your Python (library) file has KWs that contextually belong in a folder with other similar RF (resource) files, place it there.
Remember two things. First, for Robot Framework the distinction between a resource and a library is mainly which syntax it expects and how the target is imported; it doesn't enforce any rigid expectations about purpose.
E.g. nothing stops you from having a high-level keyword developed in Python, like
def login_with_user_and_do_complex_compound_action(user, password, other_arg):
nor from creating a relatively low-level KW written in Robot Framework syntax:
Keyword For Complex Math That Should Better Be In Python
    [Arguments]    ${complex_number}    ${transformer_function}    ${other_arg}
Second, Robot Framework is the tool(-set) with which you construct your automated testing framework for the SUT. By your framework I mean the structure and organization of suites and tests, their interconnections and hierarchy, and the "helpers" for their operation: the aforementioned resource (RF) and library (py) files.
As long as this framework is logically sound, has established conventions, and is easy to grasp and follow, you can use any structure that suits you.
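To illustrate that the syntax, not the folder, is what matters, here is a minimal sketch of a Python keyword file (the file and keyword names are made up for illustration); whether it lives under Resources or Libraries is an organizational choice, not something Robot Framework enforces:
# my_login_keywords.py -- hypothetical Python keyword library
from robot.api.deco import keyword

@keyword("Login With User And Do Complex Compound Action")
def login_with_user_and_do_complex_compound_action(user, password, other_arg):
    """High-level keyword implemented in Python; a suite would import it with
    'Library    path/to/my_login_keywords.py'."""
    # ... interact with the SUT here ...
    pass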

AzureML Dataset.File.from_files creation extremely slow even with 4 files

I have a few thousand video files in my Blob Storage, which I have set as a datastore.
This blob storage receives new files every night, and I need to split the data and register each split as a new version of an AzureML Dataset.
This is how I do the data split, simply getting the blob paths and splitting them:
from pathlib import Path
from azure.storage.blob import ContainerClient

container_client = ContainerClient.from_connection_string(AZ_CONN_STR, 'keymoments-clips')
blobs = container_client.list_blobs('soccer')
blobs = map(lambda x: Path(x['name']), blobs)
train_set, test_set = get_train_test(blobs, 0.75, 3, class_subset={'goal', 'hitWoodwork', 'penalty', 'redCard', 'contentiousRefereeDecision'})
valid_set, test_set = split_data(test_set, 0.5, 3)
train_set, test_set, and valid_set are just n×2 numpy arrays containing the blob storage path and the class.
Here is where I try to create a new version of my Dataset:
from azureml.core import Datastore, Dataset

datastore = Datastore.get(workspace, 'clips_datastore')
dataset_train = Dataset.File.from_files([(datastore, b) for b, _ in train_set[:4]],
                                        validate=True, partition_format='**/{class_label}/*.mp4')
dataset_train.register(workspace, 'train_video_clips', create_new_version=True)
How is it possible that the Dataset creation seems to hang indefinitely, even with only 4 paths?
I saw in the docs that providing a list of Tuple[datastore, path] is perfectly fine. Do you know why?
Thanks
Do you have your Azure Machine Learning Workspace and your Azure Storage Account in different Azure Regions? If that's true, latency may be a contributing factor with validate=True.
Another possibility may be slowness in the way datastore paths are resolved. This is an area where improvements are being worked on.
As an experiment, could you try creating the dataset using a url instead of datastore? Let us know if that makes a difference to performance, and whether it can unblock your current issue in the short term.
Something like this:
dataset_train = Dataset.File.from_files(path="https://bloburl/**/*.mp4?accesstoken", validate=True, partition_format='**/{class_label}/*.mp4')
dataset_train.register(workspace, 'train_video_clips', create_new_version=True)
I'd be interested to see what happens if you run the dataset creation code twice in the same notebook/script. Is it faster the second time? I ask because it might be an issue with the .NET Core runtime startup (which would only happen the first time you run the code).
EDIT 9/16/20
While it doesn't seem to make sense for .NET Core to be invoked when no data is moving, I suspect it is the validate=True parameter that requires all the data to be inspected (which can be computationally expensive). I'd be interested to see what happens if that parameter is False.
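For example, a quick way to test that hypothesis is the call from the question with validation turned off (datastore, train_set, and workspace as defined above):
# Same dataset creation as in the question, but without eager validation
dataset_train = Dataset.File.from_files(
    [(datastore, b) for b, _ in train_set[:4]],
    validate=False,  # skip checking that every referenced path exists
    partition_format='**/{class_label}/*.mp4')
dataset_train.register(workspace, 'train_video_clips', create_new_version=True)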

Is there a way to expand groups with the XDSM diagram creation in OpenMDAO?

Most of my test files involve the creation of an IndepVarComp that gets connected to a group. When I create an XDSM from the test file, it only shows the IndepVarComp box and the Group box. Is there a way to get it to expand the group and show what's inside?
This would also be useful when dealing with a top level model that contains many levels of groups where I want to expand one or two levels and leave the rest closed.
There is a recurse option, which controls whether or not groups are expanded. Here is a small example with the Sellar problem to explore this option. The disciplines d1 and d2 are part of a Group called cycle.
import numpy as np
import openmdao.api as om
from openmdao.test_suite.components.sellar import SellarNoDerivatives
from omxdsm import write_xdsm
prob = om.Problem()
prob.model = model = SellarNoDerivatives()
model.add_design_var('z', lower=np.array([-10.0, 0.0]),
                     upper=np.array([10.0, 10.0]), indices=np.arange(2, dtype=int))
model.add_design_var('x', lower=0.0, upper=10.0)
model.add_objective('obj')
model.add_constraint('con1', equals=np.zeros(1))
model.add_constraint('con2', upper=0.0)
prob.setup()
prob.final_setup()
# Write the output. The PDF will only be created if pdflatex is installed.
write_xdsm(prob, filename='sellar_pyxdsm', out_format='pdf', show_browser=True,
           quiet=False, output_side='left', recurse=True)
With recurse=False, d1 and d2 are not shown; their Group cycle appears instead:
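Only the flag (and, here, an illustrative filename) changes:
write_xdsm(prob, filename='sellar_pyxdsm_flat', out_format='pdf', show_browser=True,
           quiet=False, output_side='left', recurse=False)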
To enable the recursion from the command line, use the --recurse flag:
openmdao xdsm sellar_pyxdsm.py -f pdf --recurse
With the function it is turned on by default; on the command line you have to include the flag. If this does not work as expected for you, please provide an example.
You can find a lot of examples with different options in the tests of the XDSM plugin. Some of the options, like recurse, include_indepvarcomps, include_solver and model_path control what is included in the XDSM.
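As a hedged sketch of how some of those options might be combined (the argument names are taken from the list above, but the exact behavior, defaults, and accepted values may differ between plugin versions):
# Render only the 'cycle' subgroup, with solver blocks shown and the
# IndepVarComp boxes hidden; adjust to taste.
write_xdsm(prob, filename='sellar_cycle_only', out_format='pdf',
           show_browser=False, recurse=True,
           include_solver=True,
           include_indepvarcomps=False,
           model_path='cycle')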

Can a parameter be used to set the unit attribute for a component?

So far, using Wolfram System Modeler 4.3 and 5.1, the following minimal example would compile without errors:
model UnitErrorModel
  MyComponent c( hasUnit = "myUnit" );

  block MyComponent
    parameter String hasUnit = "1";
    output Real y( unit = hasUnit );
  equation
    y = 10;
  end MyComponent;
end UnitErrorModel;
But with the new release of WSM 12.0 (the jump in version is due to an alignment with the current release of Wolfram's flagship Mathematica) I am getting an error message:
Internal error: Codegen.getValueString: Non-constant expression:c.hasUnit
(Note: the error is given by WSMLink`WSMSimulate in Mathematica 12.0, which runs System Modeler 12.0 internally; here I am asking for the "InternalValues" property of the above model, since I have not installed WSM 12.0 right now.)
Trying to simulate the above model in OpenModelica [OMEdit v. 1.13.2 (64-bit)] reveals:
SimCodeUtil.mo: 8492:9-8492:218]: Internal error Unexpected expression (should have been handled earlier, probably in the front-end. Unit/displayUnit expression is not a string literal: c.hasUnit
So it seems that, to set the unit attribute, I cannot make use of a variable that has parameter variability? Why is this? Shouldn't it suffice that the compiler can hard-wire the unit when compiling for runtime (after all, the given model runs without any error in WSM 4.3 and 5.1)?
EDIT: From the answer to an older question of mine, I had believed that at least final parameters might be used to set the unit attribute. Making the modification final (e.g. c( final hasUnit = "myUnit" )) does not resolve the issue.
I have been given feedback on Wolfram Community by someone from Wolfram MathCore regarding this issue:
You are correct in that it's not in violation with the specification,
although making it a constant makes more sense since you would
invalidate all your static unit checking if you are allowed to change
the unit after building the simulation. We filed an issue on the
specification regarding this (Modelica Specification Issue # 2362).
So, MathCore is a bit ahead of the game, proposing a Modelica specification change that they have already implemented. ;-)
Note that in Wolfram System Modeler (12.0) using the annotation Evaluate = true will not cure the problem (cf. the comment above by @matth).
As a workaround, variables used to set the unit attribute should have constant variability, but they can nevertheless be included in user dialogs for interactive changes using annotation(Dialog(group = "GroupName")).

using dask for scraping via requests

I like the simplicity of dask and would love to use it for scraping a local supermarket. My multiprocessing.cpu_count() is 4, but this code only achieves a 2x speedup. Why?
from bs4 import BeautifulSoup
import dask, requests, time
import pandas as pd

base_url = 'https://www.lider.cl/supermercado/category/Despensa/?No={}&isNavRequest=Yes&Nrpp=40&page={}'

def scrape(id):
    page = id + 1
    start = 40 * page
    bs = BeautifulSoup(requests.get(base_url.format(start, page)).text, 'lxml')
    # .text is already extracted here, so no second pass over prods is needed
    prods = [prod.text for prod in bs.find_all('span', attrs={'class': 'product-description js-ellipsis'})]
    brands = [b.text for b in bs.find_all('span', attrs={'class': 'product-name'})]
    sdf = pd.DataFrame({'product': prods, 'brand': brands})
    return sdf

data = [dask.delayed(scrape)(id) for id in range(10)]
df = dask.delayed(pd.concat)(data)
df = df.compute()
Firstly, a 2x speedup - hurray!
You will want to start by reading http://dask.pydata.org/en/latest/setup/single-machine.html
In short, the following three things may be important here:
you only have one network, and all the data has to come through it, so that may be a bottleneck
by default, you are using threads to parallelise, but the python GIL limits concurrent execution (see the link above)
the concat operation is happening in a single task, so this cannot be parallelised, and with some data types may be a substantial part of the total time. You are also drawing all the final data into your client's process with the .compute().
There are meaningful differences between multiprocessing and multithreading. See my answer here for a brief commentary on the differences. In your case that results in only getting a 2x speedup instead of, say, a 10x-50x speedup.
Basically, your problem doesn't scale as well with more cores as it would with more threads (since it's I/O-bound, not processor-bound).
Configure Dask to run in multithreaded mode instead of multiprocessing mode. I'm not sure how to do this in Dask, but this documentation may help.
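For what it's worth, a minimal sketch of selecting the scheduler explicitly; dask.delayed uses the threaded scheduler by default, which suits I/O-bound scraping:
import dask

# Per call: choose the scheduler when computing
df = dask.delayed(pd.concat)(data).compute(scheduler='threads')

# Or globally for the whole session ('processes' would use a process pool instead)
dask.config.set(scheduler='threads')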
