How do I create a DataSet with Data Type: ModelDirectory in Azure Machine Learning Studio?

I'm attempting to manually create a DataSet with Data Type: ModelDirectory in Azure Machine Learning Studio, in order to use it in an Inference Pipeline. I have taken an existing ModelDirectory DataSet and attempted to replicate it. Everything is identical, except that the replica has Data Type: AnyDirectory and cannot be hooked up to the input of a Score Model node in the designer. How can I (manually in the UI or, better yet, programmatically) create a DataSet with Data Type: ModelDirectory from the output files of a trained model?
Existing DataSet:
Existing DataSet outputs:
Manually Created Replica DataSet:
Manually Created Replica DataSet outputs:
As you can see, the outputs of both DataSets are identical. The only difference between the two DataSets seems to be the 'Data Type' property, although in the output view you can see that both have 'type: ModelDirectory'.
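For what it's worth, if the underlying goal is to get trained-model files into a pipeline programmatically, one avenue to explore is registering them as a model rather than as a dataset - the ModelDirectory typing appears to be applied by the designer itself rather than being settable on a hand-registered DataSet. A hedged sketch using the azuremlsdk R package (an assumption - the question does not name an SDK; the path and name below are placeholders):
library(azuremlsdk)

ws <- load_workspace_from_config()  # assumes a config.json in the working directory

# "outputs/trained_model" is a placeholder path to the trained model's files
model <- register_model(ws,
                        model_path = "outputs/trained_model",
                        model_name = "my-trained-model")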

Related

Using two R scripts in a PowerBI dashboard where script B depends on script A

I am trying to translate some code that we previously used in software similar to PowerBI into a form that's compatible with PowerBI. One thing I need to do for that is to generate a model fit to some data and use that fit to display some information about it (in some further visual elements).
From a sequential point of view, this is trivial: generate an object, then work on that object and print some data. But from what I understand about PowerBI, this kind of interdependency between R scripts / visual elements (generate an object, then hand that object to other procedures to generate further output) is not intended. Since I need to use several visual elements, and all of them depend on the output of the first, I have no idea how to work this out.
I need to use several visual elements, and all of them depend on the output of the first
Then the data needs to be created in Power Query and loaded into the data model. You can run R in Power Query to generate the data, and visualize it with regular Power BI Visuals and the R Visual.
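For illustration, a minimal sketch of such a "Run R script" step in Power Query: Power Query binds the current query's table to a data frame called dataset, and every data frame the script creates is offered as an output table. The column names y and x and the table name fit_summary are placeholders, not anything from the question.
# `dataset` is supplied by Power Query as the current table
model <- lm(y ~ x, data = dataset)  # fit the model once, here

# Each data frame below becomes a loadable output table
fit_summary <- data.frame(term = names(coef(model)),
                          estimate = unname(coef(model)))
scored <- cbind(dataset, fitted = fitted(model), residual = residuals(model))
Downstream visuals then read fit_summary and scored from the data model, so nothing has to be recomputed per visual.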

How to manage more than one dataset - Machine Learning Azure

Is there any module that accepts more than one dataset for processing?
For instance "Split Data" , "Edit meta data" and "select columns in dataset" do not accept more than one dataset as input.
This is what I did :
There are several numeric and categorical variables in my model. I used the "Convert to Indicator Values" module to create dummy variables for my data. How do I combine the indicator variables and the numeric variables into one dataset so that I can split the data for my model?
As of now, I'm doing data wrangling in Python and moving the datasets in Azure MLS for modeling. Ideally, I need to work on data wrangling in Azure MLS.
I expect to have one module that consolidates both the categorical binned variables and numeric variables in Azure MLS
Yup, there are several modules that accept multiple datasets - Add Columns, Apply SQL Transformation, and Execute Python Script, to name a few.
Not sure why you need them for indicator values, though - assuming you're talking about a train/test split, I would just split the data after invoking the "Convert to Indicator Values" module.
I will add to the above answer: you can also use Execute R Script, or Join Data if the datasets have common keys (see the sketch below for the Execute R Script route).
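For illustration, a minimal sketch of the Execute R Script route, assuming the classic Studio module (where inputs arrive via maml.mapInputPort) and that both inputs hold the same rows in the same order:
# Port 1: the numeric variables; Port 2: the indicator-variable output
dataset1 <- maml.mapInputPort(1)
dataset2 <- maml.mapInputPort(2)

# Column-bind the two inputs into a single dataset
combined <- cbind(dataset1, dataset2)

# Expose the combined data frame on the module's output port
maml.mapOutputPort("combined")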

R - Is it possible to run two (or more) consoles using the SAME environment simultaneously?

And if so, how?
I use RStudio. I know I can fork a project in order to perform calculations over two copies of the same environment (as described here). However, that doesn't fit my needs, because the environment I'm currently using is very big and I don't have enough RAM to duplicate it.
Therefore, I am wondering whether there is some way to open two (or more) consoles using the same environment (in particular, I am interested in not having to replicate the very big data frames).
Is there a way in which I can use RStudio this way, or is there any other IDE or tool which supports it?
Thank you for your help.
EDIT:
I will explain what I'm trying to do: I'm developing some machine learning models based on a large dataset.
I load the dataset into a data frame.
Then I perform different treatments on the data to transform it into ML-friendly form.
I perform these two steps in one R script, and I end up with an environment loaded with a heavy data frame, libraries and some other objects.
Then I'm using this dataset to feed several ML models: those models are of different classes, and within each class I'm trying several models with different parameters.
I have one R script for each class of models, and I would like to run and score each class in parallel. Each model within each class will run sequentially.
The key here is: I know I can use different projects to do this, but that would mean loading the same environment several times, which is problematic for me because it would mean holding several copies of the same big data frame in RAM. Therefore, I would like to know if there is a way to have several R scripts run in parallel while using the same environment.
Then I will use another script to rank all the models.
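One way to get this without duplicating the data frame - on Unix-like systems only - is fork-based parallelism: parallel::mclapply forks the running R session, so each worker sees the parent environment via copy-on-write instead of a full copy. A minimal sketch, where big_df and fit_class are placeholder names rather than anything from the question:
library(parallel)

big_df <- mtcars  # stand-in for the large, already-prepared data frame

# Placeholder: train all models of one class sequentially and return scores
fit_class <- function(class_name, data) {
  nrow(data)  # stand-in result
}

classes <- c("glm", "rf", "gbm")  # hypothetical model classes

# Forked workers share big_df copy-on-write; mclapply does not fork on Windows
results <- mclapply(classes, fit_class, data = big_df,
                    mc.cores = length(classes))
Note that this all happens inside a single R session rather than in multiple consoles, but it achieves the stated goal: several model classes running in parallel against one in-memory copy of the data.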

How to export an R Random Forest model for use in Excel VBA without API calls

Problem:
I have a Random Forest model trained in R. I need to deploy this model in a standalone Excel tool that will be used by 350 people across a sales network to perform real-time predictions based on data entered into the spreadsheet by users.
How can I do this?
Constraints:
It is not an option to require users to install R on their local machines.
It is not an option to have a server (physical or cloud) providing a scoring API.
What have I done so far?
1. PMML
I can export the model as PMML (an XML structure). From my research I can see there are libraries for loading and executing PMML in Python and Java; however, I haven't found anything implemented in VBA / VB.
2. Zementis
I looked into a solution called Zementis, which offers an Excel add-in to deploy PMML models. However, from my understanding this requires web-service calls to a cloud server (e.g. AWS) where the actual model execution happens. My IT security department will not allow this.
3. Others
The most common recommendation seems to be to call R to load the model and run the predict function. As noted above, this is not a viable option.
Detailed Context:
The Random Forest model is trained in R, with c. 30 variables. The model is used to recommend "personalised" prices for products as part of a sales process.
The model needs to be distributed to the sales network, with about 350 users. The business's preference is to integrate the model into an existing spreadsheet tool that sales teams currently use to calculate deal profitability.
This means that I need to be able to export the model in a way that it can be implemented in Excel VBA.
Given the timescales, the implementation needs to be self-contained, with no IT infrastructure or additional application installs. We are working with the organisation's IT team on a server-based solution; however, their deployment timescales are 12 months+, which means we need a tactical solution in the short term.
Here's one approach to get the "rules" for the trees (example using the mtcars dataset):
# Install and load the randomForest package
install.packages("randomForest")
library(randomForest)
head(mtcars)
set.seed(1)  # make the tree construction reproducible
fit <- randomForest(mpg ~ ., data=mtcars, importance=TRUE, proximity=TRUE)
print(fit)
## Look at variable importance:
importance(fit)
# Print the rules for each tree in the forest (printRandomForests comes from rattle)
install.packages("rattle")
library(rattle)
printRandomForests(fit)
It is probably unrealistic to use the rules for all 500 trees, but maybe you could implement 100 trees in your VBA and then take the average of the results (for a continuous response) or predict the class with the most votes across the trees (for a categorical response).
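If you go down that road, R can report the individual tree predictions directly, which is handy for checking how many trees a VBA port really needs. A small sketch continuing from the fit above (predict.all = TRUE is a documented argument of predict.randomForest):
# $individual holds one column of predictions per tree
per_tree <- predict(fit, newdata = mtcars, predict.all = TRUE)

# Compare the average of the first 100 trees with the full-forest prediction
subset_avg <- rowMeans(per_tree$individual[, 1:100])
full_avg   <- per_tree$aggregate
head(cbind(subset_avg, full_avg))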
Maybe you could recreate the model on a worksheet.
As far as I know, Excel can import XML structures (via the Developer tab).
Edit:
1) Save the PMML structure from a plain-text editor as an .xml file.
2) Open the file in Excel 2013 (other versions may work as well).
3) Click through the error message and open the file anyway. The trees open as a table - a bit odd-looking, but recognizable.
4) Create the prediction calculation (a generic function in VBA) to operate on the tree.
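For reference, producing that PMML file from R is short with the pmml package - a sketch assuming the randomForest fit from the earlier answer:
# pmml() converts the fitted forest; saveXML() comes from the XML package
install.packages("pmml")
library(pmml)
library(XML)
saveXML(pmml(fit), file = "random_forest.xml")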

Building a predictive model with HDFS or Hive as the source of the training and test sets in a production environment

Can someone tell me whether the steps I have written below reflect how predictive modelling is done in the real world?
1. Source is an Oracle database.
2. Use Apache Sqoop to bring the data into HDFS (I use --query to bring the features into HDFS).
(Here I am confused as to whether to process the data further in HDFS or to bring it directly into Hive.)
3. Access the data from Hive or HDFS.
4. Data manipulation and pruning using R.
5. Split the data into training and test sets.
6. Build the model using R and save it as PMML.
7. Evaluate the model using a ROC curve or AUC (a small sketch of steps 5-7 follows this list).
8. Deploy the model.
9. Predict values using new datasets.
10. Visualize the new values using Tableau.
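For concreteness, a minimal R sketch of steps 5-7 on a stand-in dataset; the pROC and pmml packages and the glm model are illustrative assumptions, not things the question specifies:
install.packages(c("pROC", "pmml"))
library(pROC)
library(pmml)
library(XML)  # for saveXML

mtcars$am <- factor(mtcars$am)  # stand-in binary target

# Step 5: split into training and test sets
set.seed(42)
train_idx <- sample(nrow(mtcars), floor(0.7 * nrow(mtcars)))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Step 6: build the model and save it as PMML
model <- glm(am ~ mpg + wt, data = train, family = binomial)
saveXML(pmml(model), file = "model.pmml")

# Step 7: evaluate with AUC on the held-out set
scores <- predict(model, newdata = test, type = "response")
print(auc(roc(test$am, scores)))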
Please let me know whether it is best practice to make Hive the source of the training and test data, or to process files directly from HDFS to build the model and store the results in Hive. Which approach is adopted in real production environments?
The steps above seem quite strange on several levels. If you have your data in Oracle, you would likely use it or another RDBMS to do your modeling. There is no need to lose structure going to HDFS and then regain it in Hive; there are several analytically focused columnar databases that could do quite well. The PMML step also seems unnecessary: compliant PMML is only available for a few model types, and I have only seen it used for linear regressions. If your data is big enough that it needs Hive, using R is possible but may not be the best choice - working with data that size (out of core) is fairly advanced R.
In short, there is likely a scenario where that set of steps is right, but it raises a lot of issues for me. Ask a lot of questions before proceeding.
