Azure ML Studio Error 0035: Features for The vocabulary is empty

I am attempting to classify various bodies of text using Azure ML Studio, and everything runs successfully until I deploy and test a web service. Once I deploy the web service and attempt to test it, I get the following error:
Error 0035: Features for The vocabulary is empty. Please check the Minimum n-gram document frequency. required but not provided., Error code: ModuleExecutionError, Http status code: 400
The vocabularies for the Extract N-Gram Features modules are not empty. The only thing that changes between the working training experiment and the failing web service is the web service input.
[Screenshot: training experiment]
[Screenshot: predictive experiment]

You need to create two N-Grams modules in your training experiment, as shown in the screenshot at the link below.
Follow these steps from the link:
Copy your N-Grams module.
Set one to "Create" mode (for training) and the other to "ReadOnly" mode (for testing).
Connect the vocabulary output of the training N-Grams module to the input vocabulary of the testing N-Grams module.
Save the output vocabulary from the testing side through a "Convert to Dataset" module.
You can then use this dataset to feed your deployment's N-Grams step in "ReadOnly" mode.
See the steps here =>
https://learn.microsoft.com/fr-fr/azure/machine-learning/algorithm-module-reference/extract-n-gram-features-from-text

The MS documentation is misleading here.
What Azure ML Studio is looking for is a second input on the "Extract N-Gram Features from Text" module in the predictive experiment.
The desired second input is a dataset: the one you want is produced by the "Extract N-Gram Features from Text" module in the training experiment. To get and use this dataset, go to your training experiment and add a "Convert to CSV" module on the second output node of the Extract N-Grams module. Then save this as a dataset.
Now use it as the second input on your predictive model's N-Grams module. You should be good to go!

From the Azure GitHub: https://github.com/MicrosoftDocs/azure-docs/blob/master/articles/machine-learning/algorithm-module-reference/extract-n-gram-features-from-text.md
Score or publish a model that uses n-grams
1. Copy the Extract N-Gram Features from Text module from the training dataflow to the scoring dataflow.
2. Connect the Result Vocabulary output from the training dataflow to the Input Vocabulary on the scoring dataflow.
3. In the scoring workflow, modify the Extract N-Gram Features from Text module and set the Vocabulary mode parameter to ReadOnly. Leave all else the same.
4. To publish the pipeline, save Result Vocabulary as a dataset. Connect the saved dataset to the Extract N-Gram Features from Text module in your scoring graph.

Related

Extract sample of features used to build each tree in H2O

In a GBM model, the following parameters are used:
col_sample_rate
col_sample_rate_per_tree
col_sample_rate_change_per_level
I understand how the sampling works and how many variables get considered for splitting at each level of every tree. I am trying to understand how many times each feature gets considered for making a decision. Is there a way to easily extract the sample of features used for each splitting decision from the model object?
Referring to the explanation provided by H2O, http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/col_sample_rate.html, is there a way to know the 60 randomly chosen features for each split?
Thank you for your help!
If you want to see which features were used at a given split in a given tree, you can navigate the H2OTree object.
For R, see the documentation here and here.
For Python, see the documentation here.
You can also take a look at this blog post (if the link ever dies, just do a Google search for the H2OTree class).
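For example, here is a minimal R sketch using h2o.getModelTree, assuming an already-trained H2O GBM called model (not shown here); the features slot lists the split variable at each node, so tabulating it counts how often each feature was used in that tree:
library(h2o)
h2o.init()
# 'model' is assumed to be an already-trained H2O GBM
tree <- h2o.getModelTree(model, tree_number = 1)
# tree@features holds the split feature at each node (NA at leaf nodes)
table(na.omit(tree@features))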
I don’t know if I would call this easy, but the MOJO tree visualizer spits out a Graphviz dot data file, which is then turned into a visualization. This has the information you are interested in.
http://docs.h2o.ai/h2o/latest-stable/h2o-genmodel/javadoc/overview-summary.html#viewing-a-mojo
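As a rough sketch (the jar and output file names here are placeholders), you can download the MOJO from R and then run the PrintMojo tool described at that link:
# download the trained model as a MOJO, together with the genmodel jar
mojo_file <- h2o.download_mojo(model, path = ".", get_genmodel_jar = TRUE)
# render tree 0 as Graphviz dot data, then inspect the splits in tree0.gv
system(paste("java -cp h2o-genmodel.jar hex.genmodel.tools.PrintMojo",
             "--tree 0 -i", mojo_file, "-o tree0.gv"))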

Core ML giving same prediction every time

So I used Turi Create to build a handwritten-digit recognition model on the famous MNIST dataset, with enough data to train the model...
But when I deploy the model in an iOS project, it always predicts the same output (i.e. 1) with a probability of 1.0, while the other probabilities are all very small.
This usually happens when the input images are not preprocessed on iOS the same way they were preprocessed during training. With Turi Create and Core ML, the tools should do this for you automatically (especially when you use Vision to drive Core ML).
But as you've not given any details about exactly how you're using the Core ML model, it's impossible to provide a specific solution.

How to implement data mining with unstructured data?

I have unstructured data (screenshots of an app) and semi-structured data (screen dump files), which I chose to store in HBase. My goal is to find defects or issues in the app (meaningful data). Now I'd like to apply data mining to these, so is this a kind of text mining? And how can I apply data mining techniques to this data?
To begin with, you can use a rule-based approach where you define a set of rules that detect the defect scenario.
Then you can prepare a training data set that has many instances of defect and non-defect scenarios. In this step, for each screenshot or screen dump file you collect, you would manually tag it as defect or non-defect.
Then you can train a classifier using this training data (a sketch follows below). The classifier would try to generalize from the training samples to predict the output label for samples not seen in the past.
Since your input is non-standard, you might need some preprocessing to convert it to a standard form. For example, to process screenshots you might need image processing, OCR, or computer vision libraries.
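Here is a minimal sketch of the classifier step in R, assuming you have already extracted features from each screenshot or dump and tagged it; the feature columns here are hypothetical:
library(randomForest)
# toy training set: one row per screenshot/dump, manually tagged
train <- data.frame(
  error_keyword_count = c(3, 0, 5, 1, 0, 4),
  blank_screen_ratio  = c(0.8, 0.1, 0.9, 0.2, 0.0, 0.7),
  label = factor(c("defect", "ok", "defect", "ok", "ok", "defect"))
)
clf <- randomForest(label ~ ., data = train)
# predict labels for new, unseen screenshots (reusing rows here for brevity)
predict(clf, newdata = train[1:2, ])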

How to export an R Random Forest model for use in Excel VBA without API calls

Problem:
I have a Random Forest model trained in R. I need to deploy this model in a standalone Excel tool that will be used by 350 people across a sales network to perform real-time predictions based on data entered into the spreadsheet by users.
How can I do this?
Constraints:
It is not an option to require users to install R on their local machines.
It is not an option to have a server (physical or cloud) providing a scoring API.
What have I done so far?
1. PMML
I can export the model in PMML (XML structure). From research I can see there are libraries for loading and executing PMML inputs in Python and Java. However I haven't found anything implemented in VBA / VB.
2. Zementis
I looked into a solution called Zementis which offers an Excel add-in to deploy PMML models. However from my understanding this requires web-service calls to a cloud server (e.g. AWS) where the actual model execution happens. My IT security department will not allow this.
3. Others
The most common recommendation seems to be to call R to load the model and run the predict function. As noted above, this is not a viable option.
Detailed Context:
The Random Forest model is trained in R, with c. 30 variables. The model is used to recommend "personalised" prices for products as part of a sales process.
The model needs to be distributed to the sales network, with about 350 users. The business's preference is to integrate the model into an existing spreadsheet tool that sales teams currently use to calculate deal profitability.
This means that I need to be able to export the model in a way that it can be implemented in Excel VBA.
Given timescales, the implementation needs to be self-contained with no IT infrastructure or additional application installs. We are working with the organisation's IT team on a server based solution, however their deployment timescales are 12 months+ which means we need a tactical solution in the short-term.
Here's one approach to get the "rules" for the trees (example using the mtcars dataset):
# Fit a random forest on the built-in mtcars dataset
install.packages("randomForest")
library(randomForest)
head(mtcars)
set.seed(1)  # for reproducible tree structure
fit <- randomForest(mpg ~ ., data = mtcars, importance = TRUE, proximity = TRUE)
print(fit)
## Look at variable importance:
importance(fit)
# Print the decision rules for each tree in the forest
install.packages("rattle")
library(rattle)
printRandomForests(fit)
It is probably unrealistic to use the rules for all 500 trees, but maybe you could implement 100 trees in your VBA and then take the average of the results (for a continuous response) or predict the class with the most votes across the trees (for a categorical response).
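To sanity-check that idea in R before porting it to VBA, randomForest can return per-tree predictions; a sketch averaging the first 100 trees of the fit above:
# per-tree predictions for the mtcars example above
p <- predict(fit, newdata = mtcars, predict.all = TRUE)
# average the first 100 trees' predictions (continuous response)
rowMeans(p$individual[, 1:100])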
Maybe you could recreate the model on a worksheet.
As far as I know, Excel can import XML structures (via the Developer tab on the ribbon).
Edit:
1) Save the PMML structure from a plain-text editor as an .xml file.
2) Open the file in Excel 2013 (other versions may work as well).
3) Click through the error message and open the file anyway. The trees open as a table, a bit odd-looking, but recognizable.
4) Create a prediction calculation (a generic function in VBA) to operate on the tree.
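For step 1, here is a minimal sketch of producing that .xml file from R, assuming the pmml package supports your randomForest object:
library(pmml)
library(XML)
# convert the trained forest ('fit' from the answer above) to a PMML document
doc <- pmml(fit)
# write it out as the .xml file to open in Excel
saveXML(doc, file = "rf_model.xml")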

What type should the scores returned from an R scoring script be?

I am attempting to develop an Azure ML experiment that uses R to perform predictions of a continuous response variable. The initial experiment is relatively simple, incorporating only a few experiment items, including "Create R Model", "Train Model" and "Score Model", along with some data input.
I have written a training script and a scoring script, both of which appear to execute without errors when I run the experiment within ML Studio. However, when I examine the scored dataset, the score values are all missing values. So I am concerned that my scoring script could be returning scores incorrectly. Can anyone advise what type I should be returning? Is it meant to be a single column data.frame, or something else?
It is also possible that my scores are not being calculated properly within the scoring script, although I have run the training and scoring scripts in RStudio, which shows the expected results. It would also be helpful if someone could suggest how to debug my scoring script, so that I can determine where the code is failing to behave as expected.
Thanks, Paul
Try using this sample and compare with yours - https://gallery.cortanaintelligence.com/Experiment/Compare-Sample-5-in-R-vs-Azure-ML-1
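For reference, here is a minimal sketch of a scoring script for "Create R Model", assuming the usual contract that the module supplies model and dataset and expects a data.frame named scores:
# 'model' and 'dataset' are provided to the scoring script by the module
preds <- predict(model, newdata = dataset)
# return a single-column data.frame; all-NA scores often point to a
# column-name or factor-level mismatch between training and scoring data
scores <- data.frame(predicted_value = preds)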
My suggestion is to do data preprocessing before the data input step. Clean up missing values and outliers, using the relevant data preprocessing techniques for those operations.
