What is the best practice on folder structure for Azure Machine Learning service (preview) projects - azure-machine-learning-studio

I'm very excited on the newly released Azure Machine Learning service (preview), which is a great step up from the previous (and deprecated) Machine Learning Workbench.
However, I am thinking a lot about the best practice on structuring the folders and files in my project(s). I'll try to explain my thoughts.
Looking at the documentation for training a model (e.g. Tutorial #1), it seems to be good practice to put all training scripts and any additional scripts they need inside a subfolder, so that only that folder is passed into the Estimator object rather than every other file in the project. This is fine.
But when working with the deployment of the service, specifically the deployment of the image, the documentation (e.g. Tutorial #2) seems to indicate that the scoring script needs to be located in the root folder. If I try to refer to a script located in a subfolder, I get an error message saying:
WebserviceException: Unable to use a driver file not in current directory. Please navigate to the location of the driver file and try again.
This may not be a big deal, except that I have some additional scripts which I import in both the training script and the scoring script, and I don't want to duplicate them just to be able to import them in both places.
I work mainly in Jupyter Notebooks when executing the training and the deployment, and I could of course use some tricks: read the scripts in question from the other folder, save copies of them to disk, run the training or deployment against the copies, and finally delete the copies. That would be a decent workaround, but it seems to me there should be a better way than just decent.
What do you think?

Currently, score.py needs to be in the current working directory, but dependency scripts - passed via the dependencies argument of ContainerImage.image_configuration - can live in a subfolder.
Therefore, you should be able to use a folder structure like this:
./score.py
./myscripts/train.py
./myscripts/common.py
Note that the relative folder structure is preserved during web service deployment; if you reference the common file in the subfolder from your score.py, that reference should remain valid inside the deployed image.
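For illustration, here is a minimal sketch of how that image configuration might look with the preview-era azureml-core SDK; the conda file name and helper import are my own assumptions, not part of the original answer.
from azureml.core.image import ContainerImage
# Sketch only: "myenv.yml" is a hypothetical conda environment file.
# score.py sits in the current working directory; the myscripts/ folder is
# shipped into the image via the dependencies argument, so score.py can
# still do e.g. `from myscripts.common import some_helper`.
image_config = ContainerImage.image_configuration(
    execution_script="score.py",
    runtime="python",
    conda_file="myenv.yml",
    dependencies=["myscripts"])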

Related

Airflow/Composer recommended folder structure

Do you have any recommendations for a Composer folder/directory structure? The way it should be structured is different from the way our internal Airflow server is set up right now.
Based on the Google documentation (https://cloud.google.com/composer/docs/concepts/cloud-storage):
plugins/: Stores your custom plugins, operators and hooks.
dags/: Stores DAGs and any data the web server needs to parse a DAG.
data/: Stores the data that tasks produce and use.
I had trouble before when I put the key.json file in the data/ folder: the DAGs could not be parsed using the keys in the data/ folder. So now I tend to put all the supporting files in the dags/ folder.
Would the performance of the scheduler be impacted if I put the supporting files (SQL, keys, schemas) for a DAG in the dags/ folder? Is there a good use case for the data/ folder?
It would be helpful if you could show an example of how to structure the Composer folders to support multiple projects with different DAGs, plugins and supporting files.
Right now, we only have one GitHub repository for the entire Airflow folder. Is it better to have a separate repository per project?
Thanks!
The impact on the scheduler should be fairly minimal as long as the files you place in the dags folder are not .py files; however, you can also place the files in the plugins folder which is also synced via copy.
I would use top-level folders to separate projects (e.g. dags/projectA/dagA.py), or even separate environments if the projects are large enough.
First question:
I had trouble before when I put the key.json file in the data/ folder: the DAGs could not be parsed using the keys in the data/ folder. So now I tend to put all the supporting files in the dags/ folder.
You only need to set the correct paths to read these files from the data/ or plugins/ folders. The difference between running Airflow in Composer and running it locally is that the path to these folders changes.
To help with that, in another post I describe a solution for finding the correct path to these folders. Quoting my comment from that post:
"If the path I entered does not work for you, you will need to find the path for your Cloud Composer instance. This is not hard to find. In any DAG you could simply log the sys.path variable and see the path printed."
Second question:
Would the performance of the scheduler be impacted if I put the supporting files (SQL, keys, schemas) for a DAG in the dags/ folder? Is there a good use case for the data/ folder?
Yes: at a minimum, the scheduler needs to check whether these files are Python scripts or not. It's not much, but it does have an impact.
Once you solve the issue of reading from the data/ or plugins/ folders, you should move these files there.
Third question:
Right now, we only have one GitHub repository for the entire Airflow folder. Is it better to have a separate repository per project?
If your projects require different PyPI packages, having a separate repository for each one, and separate Airflow environments too, would be ideal. The reason is that you'll reduce the risk of running into PyPI package dependency conflicts and keep build times down.
On the other hand, if your projects use the same PyPI packages, I'd suggest keeping everything in a single repository until it becomes worthwhile to give every project its own repo; having everything in a single repo also makes deployment easier.

Single git repo setup tracking multiple locations on hard drive

I'm very new to the world of git (done some svn in the past) and would like some advice on trying to accomplish the following.
My current workflow is that I set up the static HTML files using Middleman to get the base HTML structure and styles before porting them over to a Wordpress template. These static files are located at C:/git/project-name/HTMLTemplates.
My Wordpress setup uses XAMPP, so the theme files are kept in C:/Xampp/wordpress/wp-content/themes/project-theme.
What I would like to do is have a single git repo that tracks the changes in the two different locations (HTMLTemplates and project-theme).
Is this at all possible, or do I simply create two individual repos (e.g. project-static and project-wordpress)?
No, there is no mechanism in git for this. Git assumes that all files it manages (the "working copy") live in a single directory (and its subdirectories); there is no support for managing two separate directories in one repo.
So you'll have to somehow keep everything in one directory, probably as subdirectories HTMLTemplates and theme or similar.
You could use two git repos, but I'd strongly advise against this. A single repo should contain a whole "project", i.e. everything needed to build one piece of software (excluding things like external libraries). If you split your project across two repositories, you cannot usefully branch and merge (because you'd have to do it in both repos simultaneously), you cannot easily check out old versions, etc.
To solve your problem, I see a few possible solutions:
Have some build / deployment script that copies everything to the right places. You probably already have a script that invokes Middleman, and possibly tells Wordpress to refresh its cache, so you could add it there.
Set up a symbolic link for the wordpress directory. On UNIX-like systems this is easy and commonly done. On Windows, you can create "junction points", which I believe work similarly.
Configure Wordpress / Apache to read the directory directly from your git working copy. The path should be configurable.
I would prefer the first solution; this has the added advantage that it will decouple your development environment from the server configuration. This will make it easier if your setup later changes or your project needs to run in a different environment (development on a different machine, someone else also wants to work on your project, you want to deploy to a hosted server somewhere etc.).
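As a rough sketch of that first option (the paths are copied from the question; the script name and the remove-then-copy approach are my own assumptions):
import shutil
from pathlib import Path

# Paths taken from the question; adjust to your machine.
SRC = Path("C:/git/project-name/HTMLTemplates")
DEST = Path("C:/Xampp/wordpress/wp-content/themes/project-theme")

def deploy():
    # Replace the theme folder wholesale so it always mirrors the repo.
    if DEST.exists():
        shutil.rmtree(DEST)
    shutil.copytree(SRC, DEST)

if __name__ == "__main__":
    deploy()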
Note: The problem is, I believe, that you are trying to use git as a deployment tool. While many people do this, git is not really suitable for this purpose. Deployment should usually be a separate step.

Require local file in PHPUnit

I am trying to test our source tree using PHPUnit with old, web-based legacy code, making as few changes as possible to begin with. Once testing is in place, I can then change the library functions for better use and better unit testing. However, I need the tests in place before I can change it.
Question: We share the library code across many projects of our application, and they all use a common directory structure. When the website runs, local directories are available when we require files in the library.
Consider this:
APP1
APP2
Library
Library/COMM
Library/UTIL
...
When you launch the application, you point to either APP1 or APP2 for the different applications. They have the common code (messaging, DB access, etc.) in Library. The problem is that the library functions need special parameters to work, as they are coded today. These libraries simply require('Config.php'); since it will be found in either APP1 or APP2 (they both have one with application-specific settings) and the web server uses APP1 or APP2 as the current directory when the Library files are require()'d.
While this works, it fails when attempting to run the code in PHPUnit. My question is how to include the Config.php file without having to change the legacy code too much before the testing is in place.
I know this is the wrong format, but this is what I inherited.
I cannot simply require('../../APP1/Config.php'); since both applications share this library.
Any suggestions are greatly appreciated.
Note: We are trying to test the library and all projects as we begin writing tests, so I'm not sure the include_path will solve it. I am contemplating different phpunit.xml.dist files for each application, but am trying to avoid that for now due to corporate pressure to test all applications right away.
In phpunit.xml (<phpunit bootstrap="./bootstrap.php"> ...) or from the CLI (--bootstrap ./bootstrap.php) you can specify a bootstrap file. In that file you can do the inclusion you are looking for.
As a word of advice, when starting to test a legacy code base, don't start with unit tests. Your first goal should be to "get some kind of automated tests in place". For most people, this will be system tests, that is, testing the stack/site as a whole. A common tool for this is Selenium.
This is still no small task. What you are going to have to work out is "how do I put my system into a consistent state?". The first thing you may need to do is automate importing and emptying test data in your database. Once you can do that, you will be able to run automated tests reliably. You will need to make many other things consistent as well, dates and times being a good example.
My point is that, from experience, starting with unit tests will not give you the value you need to prove that automated testing is worth the effort.
Good luck!

Looking for a good web application deployment strategy (ASP.NET MVC3)

I'm looking for a good deployment strategy for deploying an ASP.NET MVC3 application. What I imagine is that each deployment would be some kind of commit to a source management system, in the sense that a deployment tool could automatically do the following:
1) Upon generating a deployment package (a commit), the tool would remember the state of my Web.Config file, the state of a folder of auto-generated scripts containing new database changes, the state of a folder of batch files that contain new tasks to be run on the server, the state of files specifying IIS settings changes, etc.
2) When I build a package the next time, the tool would know to only package the new script files, web.config changes, new batch files and new IIS settings since my last package.
3) Apply the package onto my web application.
I started looking into MS Deploy, but it only seems to do number 3. I've been searching around for either an application that does what I imagine, or a strategy to combine some source management system with MS Deploy. I'm hoping that someone has already solved the problem I feel I have here. My last resort, of course, is to build the tool myself, but again, that would be my last resort.
Are you using Team Foundation Server? If so, TFS comes with tools to automate builds (including labeling code, running unit tests, deploying, et cetera.) Take a look at http://msdn.microsoft.com/en-us/library/ms181710(v=vs.80).aspx
TFS is not exactly easy to configure and get going but it's free if you are already using TFS.
If you are not using TFS, look for continuous integration tools like NAnt or TeamCity.
Have you used Web Deploy and the "Publish" feature under Build in Visual Studio?
You can set options for things like leaving the previous files on the server.
As for your web.config file: do you mean the main one, or one that already exists elsewhere on the server? Your web.config file should be copied from your project to the server; or are there settings that differ between running locally and on the server? If so, look at using transforms to modify web.config.
This is only a partial answer to #1 for you, but we looked for a long time for a migration tool that we liked... We ultimately found Migrator.Net: http://code.google.com/p/migratordotnet/
With this, you can turn DB migrations into a batch command.

SCM for ASP.net

As part of my overall development practices review I'm looking at how best to streamline and automate our ASP.net web development practices.
At the moment, our process goes something like this:
Designer builds frontend as static HTML/CSS on a network share. This gets tweaked until signed off. (e.g. http://myserver/acmesite_design)
Once signed off, developer takes over and copies over frontend HTML/CSS to a new directory on the same server (e.g. http://myserver/acmesite_development)
Multiple developers work on local copy until project is complete.
Developer publishes code to an external publicly accessible server for a client to review/signoff.
Edits made locally based on feedback.
Republish to external server.
Signoff
Developer publishes to live public server
What goes wrong? Lots of things!
Version Control — this is obviously a must and is being introduced
Configuration errors — many, many times there are environment-specific paths and variables (such as DB names, image upload directories, web server paths, etc.) which get copied incorrectly from local to staging to live, with very embarrassing results.
I'm pretty confident I've got no. 1 under control. What about configuration management? Does anyone have any advice on how best to manage an application's structure within ASP.NET apps to minimize these kinds of problems?
I found that using SVN, NAnt and NUnit with CruiseControl.NET solves a lot of the issues you describe. I think it works well for small groups, and it's all free. You just need to learn how to use them.
CruiseControl.net helps you put together builds and continuous integration.
Use NAnt or MSBuild to do different environment builds (DEV, TEST, PROD, etc).
http://confluence.public.thoughtworks.org/display/CCNET/Welcome+to+CruiseControl.NET
You got the most important part right. Use version control. Subversion is a good choice.
I usually store configuration along with the site; i.e. when coding a PHP-based site I have a file named config.php-dist. If you want the site to work at all, you'll have to copy it and edit in all the required parameters (this avoids storing passwords in version control). The -dist file should have reasonable defaults.
Upload directories should be relative if possible; actually all directories should be relative. I'm not experienced in ASP.net, but if it's anything like PHP the current directory is always the directory of the file being requested. If you channel all requests through a single file (i.e. index.asp), then this can even be found programmatically. Or you could find it programmatically by using the equivalent of dirname(__FILE__) in your configuration file.
I also recommend installing IIS (or whatever web server you are using) on all development workstations (including the designers'). It makes life easier, as no one can step on each other's toes. All one has to do is add test hosts to the hosts file (\windows\system32\drivers\etc\hosts, iirc) in addition to adding a site to the local IIS. This plays well with version control (checkout, add site to IIS and hosts file, edit edit edit, commit).
One thing that really helps is making sure you keep your paths relative where you can and centralise them where you can't, so when I've been working with ASP.NET I have tended to use web.config to store any configuration and path-related data that can't be found programmatically. It is quite possible to find information like your current application path programmatically through the Request object; it's worth looking in some detail at what the environment makes available to you.
One way to make sure you don't end up with something that depends on the path name is to have a continuous integration server execute your test suite against your application. Each time this happens, you create a random filepath. As soon as someone introduces a dependency on the filepath, the build will fail.
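A rough sketch of that idea as a Python-driven CI step; the app folder location and the test command are placeholders, not anything from the original answer.
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path

APP_DIR = Path("app")  # hypothetical checkout location of the application

def run_tests_from_random_path():
    # Copy the application into a freshly generated temporary directory so
    # that any hidden dependency on the original path makes the suite fail.
    random_root = Path(tempfile.mkdtemp(prefix="ci-"))
    work_copy = random_root / "app"
    shutil.copytree(APP_DIR, work_copy)
    try:
        # "run_tests.cmd" is a placeholder for whatever launches your suite.
        return subprocess.run(["run_tests.cmd"], cwd=str(work_copy)).returncode
    finally:
        shutil.rmtree(random_root)

if __name__ == "__main__":
    sys.exit(run_tests_from_random_path())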
