How can I train a model to detect multiple values for the same key in one page? - microsoft-cognitive

I am using Forms Recognizer v2, specifically the Sample Labeling tool, to build and train a model that aims to parse information from packing lists. After labeling more than 5 documents, I train the model and then pass one of the documents to perform a prediction. However, the model can only predict one value for each key/tag, even though during the labeling process, I assigned multiple values to each tag. Is this a limitation in the current version of the tool, or am I missing something?
Best regards,
João Amaro

We can handle multiple value text lines if they are very close to each other (e.g., an address that runs multiple lines next to each other). However, if the multiple values are spread in different places, we suggest you use different keys for them.

Related

julia decisiontree model - get features list

I had created a DecisionTree model in Julia using some features that I had created through an algorithm.
model_rest2 = DecisionTreeClassifier(n_subfeatures=0)
#time DecisionTree.fit!(model_rest2, convert(Matrix, df3[:,[16:45;49:50;52:81;83:end]]), df3[:,:type_1])
I seem to have modified the feature building steps due to which while testing the model now it is not running as the number of features in the model is different from the number of features available in the test set. Is there a way to find the list of features being used by the model so that I can add the missing features?
Since I didn't get any answers, I changed my modus Operandi and entered all the column names instead of the numeric column range. This ensured I dont get into this issue again.

How can I keep track of which weights correspond to which layers and inputs in Keras/Tensorflow?

What is a good method of keeping track of what weights go where? When I do get_weights(model), I get a list of arrays, and I have to guess which ones correspond to which part of my model by their dimensionality. How can I figure out which layers they're emanating from and connecting to, so that I can manipulate them programmatically?
I'm working in R, but answers in python can probably be translated.
I would suggest using tensorflow debugger in combination with naming your layers with names that make sense. After that you can execute a certain amount of iterations and check the values of the weights for a layer by using its name. Also you can check gradients for these layers.

Suggested Neural Network for small, highly varying dataset?

I am currently working with a small dataset of training values, no more than 20, and am getting large MSE. The input data vectors themselves consist of 16 parameters, many of which are binary variables. Across all the training values, a majority of the 16 parameters stay the same (but not all). The remaining input variables, across all the exemplars, vary a lot amongst one another. This is to say, two exemplars might appear to be the same except for two parameters in which they differ, one parameter being a binary variable, and another being a continuous variable, where the difference could be greater than a single standard deviation (for that variable's set of values).
My single output variable (as of now) can either be a continuous variable, OR depending on the true difficulty of reducing the error in my situation, I can make this a classification problem instead, with 12 different forms for classification.
I have long been researching different neural networks than my current implementation of a feed-forward MLP, as I have read into Stochastic NNs, Ladder NNs, and many forms of recurrent NNs. I am stuck with which one I should investigate, as I do not have time to try every NN available.
While my description may be vague, could anyone make a suggestion as to which network I should investigate to minimize my cost function (as of now, MSE) the most?
If my current setup must be rendered implacable because of the sheer difficulty involved with predicting correct output for such a small set of highly variant training values, which network would best work, should my dataset be expanded to the order of thousands of exemplars (at the cost of having a significantly more redundant, seemingly homogenous set of input values)?
Any help is most certainly appreciated.
20 samples is very small especially if you have 16 input variables. It will be hard to determine which one of those inputs is responsible for your output value. If you keep your network simple (fewer layers) you may be able to use as many samples as you would need for traditional regression.

Text classification with R and SVM. Matrix features

I am playing a bit with text classification and SVM.
My understanding is that typically the way to pick up the features for the training matrix is essentially to use a "bag of words" where we essentially end up with a matrix with as many columns as different words are in our document and the values of such columns is the number of occurrences per word per document (of course each document is represented by a single row).
So that all works fine, I can train my algorithm and so on, but sometimes i get an error like
Error during wrapup: test data does not match model !
By digging it a bit, I found the answer in this question Error in predict.svm: test data does not match model which essentially says that if your model has features A, B and C, then your new data to be classified should contain columns A, B and C. Of course with text this is a bit tricky, my new documents to classify might contain words that have never been seen by the classifier with the training set.
More specifically I am using the RTextTools library whith uses SparseM and tm libraries internally, the object used to train the svm is of type "matrix.csr".
Regardless of the specifics of the library my question is, is there any technique in document classification to ensure that the fact that training documents and new documents have different words will not prevent new data from being classified?
UPDATE The solution suggested by #lejlot is very simple to achieve in RTextTools by simply making use of the originalMatrix optional parameter when using the create_matrix function. Essentially, originalMatrix should be the SAME matrix that one creates when one uses the create_matrix function for TRAINING the data. So after you have trained your data and have your models, keep also the original document matrix, when using new examples, make sure of using such object when creating the new matrix for your prediction set.
Regardless of the specifics of the library my question is, is there any technique in document classification to ensure that the fact that training documents and new documents have different words will not prevent new data from being classified?
Yes, and it is very trivial one. Before applying any training or classification you create a preprocessing object, which is supposed to map text to your vector representation. In particular - it stores whole vocabulary used for training. Later on you reuse the same preprocessing object on test documents, and you simply ignore words from outside of vocabulary stored before (OOV words, as they are often refered in the literature).
Obviously there are plenty other more "heuristic" approaches, where instead of discarding you try to map them to existing words (although it is less theoreticalyy justified). Rather - you should create intermediate representation, which will be your new "preprocessing" object which can handle OOV words (through some levenstein distance mapping etc.).

Can DOE driver results feed Metamodel component?

I am interested in exploring surrogate based optimization. I am not yet writing opendao code, just trying to figure out to what extent OpenMDAO will support this work.
I see that it has a DOE driver to generate training data (http://openmdao.readthedocs.org/en/1.5.0/usr-guide/tutorials/doe-drivers.html), I see that it has several surrogate models that can be added to a meta model (http://openmdao.readthedocs.org/en/1.5.0/usr-guide/examples/krig_sin.html). Yet, I haven't found an example where the results of the DOE are passed as training data to the Meta-model.
In many of the examples/tutorials/forum-posts it seems that the training data is created directly on or within the meta model. So it is not clear how these things work together.
Could the developers explain how training data is passed from a DOE to a meta model? Thanks!
In openmdao 1.x, this kind of process isn't directly supported (yet) via a DOE, but it is definitely possible. There are two paths that you can take, which offer different benefits depending on your eventual goal.
I will separate the different scenarios based on a single high level classification:
1) You want to do gradient based optimization around the whole DOE/Metamodel combination. This would be the case if, for example, you wanted to use CFD to predict drag at a few key points, then use a meta-model to generate a drag polar for mission analysis. A great example of this kind of modeling can be found in this paper on simultaneous aircraft-mission design optimization..
2) You don't want to do gradient based optimization around the whole model. You might want to do gradient free optimization (like a Genetic algorithm). You might want to do gradient based optimization just around the surrogate itself, with fixed training data. Or you might not want to do optimization at all...
If you're use case falls under scenario 1 (or will eventually fall under this use case in the future), then you want to use a multi-point approach. You create one instance of your model for each training case, then you can mux the results into an array you pass into meta-model. This is necessary so that derivatives can
be propagated through the full model. The multi-point approach will work well, and is very parallelizable. Depending on the structure of the model you will use for generating the training data itself, you might also consider a slightly different multi-point approach with a distributed component or a series of distributed components chained together. If your model will support it, the distributed component approach is the most efficient model structure to use in this case.
If you're use case falls into scenario 2, you can still employ the multi-point approach if you like. It will work out of the box. However, you could also consider using a regular DOE to generate the training data. In order to do this, you'll need to use a nested-problem approach, where you put the DOE training data generation in a sub-problem. This will also work, though it will take a bit of extra coding on your part to get the array of results out of the DOE because thats not currently implemented.
If you wanted to use the DOE to generate the data, then pass it downstream to a surrogate that would get optimized on, you could use a pair of problem instances. This would not necessarily require that you make nested problems at all. Instead you just build a run-script that has one problem instance that uses a DOE, when its done you collect the data into an array. Then you could manually assign that to the training inputs of a meta-model in a second problem instance. Something like the following pseudo-code:
prob1 = Problem()
prob1.driver = DOE()
#set up the DOE variables and model ...
prob1.run()
training_data = prob1.driver.results
prob2 = Problem()
prob2.driver = Optimizer()
#set up the meta-model and optimization problem
prob2['meta_model.train:x'] = training_data
prob2.run()

Resources