What is Threshold in the Evaluate Model Module? - azure-machine-learning-studio

Notice in the image below, if I increase the value of "Threshold," the accuracy of the model seems to increase (with diminishing returns after about .62).
What does this mean and can I somehow update this value such that my model will retain this setting?
For example, I am using a boosted decision tree, but I don't see any such value for "threshold."
Ref. https://learn.microsoft.com/en-us/previous-versions/azure/machine-learning/studio-module-reference/evaluate-model?redirectedfrom=MSDN

The term Threshold defines the line of separation between the classes before the evaluation metrics are computed. We first need to split the dataset into two parts with different ratios.
For example, suppose we have 9 rows in our dataset and need to split it for training and testing: the first two rows could be used for testing and the remaining rows for training. The Threshold is the hyperplane, the separation line between the categories. When we assign the data to categories after training, we need to differentiate between them with some threshold value. Based on the number of training and testing variables, the threshold value is assigned automatically by scikit-learn.
It is true that if we increase the threshold, the accuracy will increase; this will also have an impact on precision and recall.
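To see why moving the threshold changes the metrics, here is a small R sketch (not specific to Azure ML Studio; the probabilities and labels are made up for illustration) that sweeps a classification threshold over predicted probabilities and reports accuracy, precision, and recall:

set.seed(1)
prob  <- runif(1000)                           # made-up predicted probabilities
truth <- rbinom(1000, 1, prob)                 # made-up true class labels

for (thr in c(0.3, 0.5, 0.62, 0.8)) {
  pred      <- as.integer(prob >= thr)         # scores at or above the threshold become class 1
  accuracy  <- mean(pred == truth)
  precision <- sum(pred == 1 & truth == 1) / sum(pred == 1)
  recall    <- sum(pred == 1 & truth == 1) / sum(truth == 1)
  cat(sprintf("threshold %.2f: accuracy %.3f, precision %.3f, recall %.3f\n",
              thr, accuracy, precision, recall))
}

Raising the threshold typically trades recall for precision, which is why accuracy can improve up to a point and then level off.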
Check out the blog linked below on the importance of the threshold in decision trees.
Blog contribution: https://www.geeksforgeeks.org/

Related

Cross-Validation Across models in h2o in R

I am planning to run glm, lasso and randomForest across different sets of predictors to see which model combination is the best. I am going to be doing v-fold cross validation. To compare the ML algorithms consistently, the same fold has to be fed into each of the ML algorithms. Correct me if I am wrong here.
How can we achieve that in the h2o package in R? Should I set
fold_assignment = Modulo within each algo function such as h2o.glm(), h2o.randomForest() etc.
Hence, would the training set be split the same way across the ML algos?
If I use fold_assignment = Modulo, what if I also have to stratify on my outcome? Is the stratification option set through the fold_assignment parameter as well? I am not sure I can specify Modulo and Stratified at the same time.
Alternatively, if I set the same seed in each of the model, would they have the same folds as input?
I have the above questions after reading Chapter 4 of Practical Machine Learning with H2O by Darren Cook (https://www.oreilly.com/library/view/practical-machine-learning/9781491964590/ch04.html).
Further, regarding generalizability with site-level data, consider a scenario like the one in the quotation below:
For example, if you have observations (e.g., user transactions) from K cities and you want to build models on users from only K-1 cities and validate them on the remaining city (if you want to study the generalization to new cities, for example), you will need to specify the parameter “fold_column” to be the city column. Otherwise, you will have rows (users) from all K cities randomly blended into the K folds, and all K cross-validation models will see all K cities, making the validation less useful (or totally wrong, depending on the distribution of the data). (source)
In that case, since we are cross folding by a column, it would be consistent across all the different models, right?
Make sure you split the dataset the same way for all ML algorithms (same seed). However, having the same seed for each model won't necessarily give the same cross-validation fold assignments. To ensure an apples-to-apples comparison, create a fold column (.kfold_column() or .stratified_kfold_column()) and specify it during training so that all models use the same fold assignment.
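In R, a minimal sketch of that approach might look like the following; h2o.kfold_column() and h2o.stratified_kfold_column() are the R counterparts of the methods named above, and the file path, column names, and nfolds value are made up for illustration:

library(h2o)
h2o.init()

df <- h2o.importFile("my_data.csv")      # hypothetical dataset
df$outcome <- as.factor(df$outcome)      # hypothetical binary response

# Build one fold column and reuse it for every algorithm;
# the stratified variant balances the outcome across folds.
df$fold <- h2o.stratified_kfold_column(df$outcome, nfolds = 5, seed = 123)
x <- setdiff(colnames(df), c("outcome", "fold"))

glm_fit <- h2o.glm(x = x, y = "outcome", training_frame = df,
                   family = "binomial", fold_column = "fold")
rf_fit  <- h2o.randomForest(x = x, y = "outcome", training_frame = df,
                            fold_column = "fold")

Because both models are trained and scored on exactly the same folds, their cross-validation metrics are directly comparable.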

Query related to Misclassification rate in Decision Trees

I am working on a decision tree model. The dataset is related to cars. I have 80% of the data in the training set and 20% in the test set. The summary of the model (based on training data) shows a misclassification rate of around 0.02605, whereas when I run the model on the training set the rate comes out as 0.0289; the difference between them is around 0.003. Is the difference acceptable, and what is causing it? I am new to R/statistics. Please share your feedback.
An acceptable misclassification rate is more art than science. If your data are generated from a single population, then there is almost certainly some unavoidable overlap between the groups, which will make linear classification error-prone. This doesn't mean it's a problem. For instance, if you are classifying credit card charges as possibly fraudulent or not, and the recourse isn't too harsh when you classify an observation as the former, then it may be advantageous to err on the safe side and accept more false positives rather than insisting on a low misclassification rate. You could (1) visualize your data to identify overlap, or (2) compute N * 0.03 to get the number of misclassified cases; if you have an understanding of what you are classifying, you can assess the seriousness of the misclassifications that way.
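As a quick illustration of comparing the two rates, here is a small R sketch using rpart and a built-in dataset (the cars data and model from the question are not available, so the numbers are purely illustrative):

library(rpart)

set.seed(1)
idx   <- sample(nrow(iris), 0.8 * nrow(iris))    # 80/20 split
train <- iris[idx, ]
test  <- iris[-idx, ]

fit <- rpart(Species ~ ., data = train, method = "class")

train_rate <- mean(predict(fit, train, type = "class") != train$Species)
test_rate  <- mean(predict(fit, test,  type = "class") != test$Species)

train_rate              # misclassification rate on the training data
test_rate               # misclassification rate on the held-out data
test_rate - train_rate  # a small gap is expected; a large one suggests overfitting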

R H2O - Cross-validation with stratified sampling and non i.i.d. rows

I'm using H2O to analyse a dataset but I'm not sure how to correctly perform cross-validation on it. I have an unbalanced dataset, so I would like to perform stratified cross-validation (where the output variable is used to balance the groups in each partition).
However, on top of that, many of my rows are repeats (a way of implementing weights without actually having weights). Independently of the source of this problem, I have seen before that, in some cases, you can do cross-validation where some rows have to be kept together. This seems to be the purpose of fold_column. However, is it possible to do both at the same time?
If there is no H2O solution, how can I compute the fold a priori and use it on H2O?
Based on H2O-3 docs this can't be done:
Note that all three options are only suitable for datasets that are i.i.d. If the dataset requires custom grouping to perform meaningful cross-validation, then a fold_column should be created and provided instead.
One quick idea is to use weights_column instead of duplicating rows. Both balance_classes and weights_column are then available together as parameters in GBM, DRF, Deep Learning, GLM, Naïve Bayes, and AutoML; a small sketch follows.
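For instance, a minimal R sketch of the weights_column idea, where the file path and column names are made up and "weight" holds the repeat count that was previously implemented by duplicating rows:

library(h2o)
h2o.init()

df <- h2o.importFile("my_data.csv")   # hypothetical: one row per unique observation
df$outcome <- as.factor(df$outcome)   # hypothetical unbalanced response

gbm_fit <- h2o.gbm(
  x               = setdiff(colnames(df), c("outcome", "weight")),
  y               = "outcome",
  training_frame  = df,
  weights_column  = "weight",         # replaces the row duplication
  balance_classes = TRUE,             # addresses the class imbalance
  nfolds          = 5,
  fold_assignment = "Stratified",     # rows are unique now, so stratified folds are fine
  seed            = 42
)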
Otherwise, I suggest the following workflow, performed in R or H2O on your data, to achieve both stratified fold assignment and consistency of duplicates between folds (a rough R sketch appears after this list):
take the original dataset (no repeats in the data yet)
divide it into 2 sets based on the outcome field (the one that is unbalanced): one for positive and one for negative (if it's multinomial, then have as many sets as there are outcomes)
divide each set into N folds by assigning a new foldId column in both sets independently: this accomplishes stratified folds
combine (rbind) both sets back together
apply row duplication process that implements weights (which will preserve your fold assignments automatically now).
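A rough R sketch of that workflow, assuming a plain data frame original_data with a binary outcome column and an integer weight column holding the repeat counts (all names hypothetical):

set.seed(42)
df  <- original_data                  # one row per unique observation, no repeats yet
pos <- df[df$outcome == 1, ]          # split by the unbalanced outcome
neg <- df[df$outcome == 0, ]

n_folds <- 5
pos$foldId <- sample(rep_len(seq_len(n_folds), nrow(pos)))   # stratified fold ids
neg$foldId <- sample(rep_len(seq_len(n_folds), nrow(neg)))

df <- rbind(pos, neg)                 # recombine; folds are now stratified by outcome

# Re-apply the row-duplication step that implements the weights;
# duplicated copies inherit foldId, so repeats of a row always share a fold.
df <- df[rep(seq_len(nrow(df)), times = df$weight), ]

# Upload to H2O and train with fold_column pointing at the precomputed folds.
library(h2o)
h2o.init()
hf <- as.h2o(df)
hf$outcome <- as.factor(hf$outcome)
fit <- h2o.gbm(x = setdiff(colnames(hf), c("outcome", "foldId", "weight")),
               y = "outcome", training_frame = hf, fold_column = "foldId")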

Running regression tree on large dataset in R

I am working with a dataset of roughly 1.5 million observations. I am finding that running a regression tree (I am using the mob()* function from the party package) on more than a small subset of my data is taking an extremely long time (I can't run on a subset of more than 50k obs).
I can think of two main problems that are slowing down the calculation:
The splits are being calculated at each step using the whole dataset. I would be happy with results that choose the variable to split on at each node based on a random subset of the data, as long as it continues to replenish the size of the sample at each subnode in the tree.
The operation is not being parallelized. It seems to me that as soon as the tree has made its first split, it ought to be able to use two processors, so that by the time there are 16 splits each of the processors in my machine would be in use. In practice it seems like only one is getting used.
Does anyone have suggestions on either alternative tree implementations that work better for large datasets or for things I could change to make the calculation go faster**?
* I am using mob(), since I want to fit a linear regression at the bottom of each node, to split up the data based on their response to the treatment variable.
** One thing that seems to be slowing down the calculation a lot is that I have a factor variable with 16 types. Calculating which subset of the variable to split on seems to take much longer than other splits (since there are so many different ways to group them). This variable is one that we believe to be important, so I am reluctant to drop it altogether. Is there a recommended way to group the types into a smaller number of values before putting it into the tree model?
My response comes from a class I took that used these slides (see slide 20).
The statement there is that there is no easy way to deal with categorical predictors with a large number of categories. Also, I know that decision trees and random forests will automatically prefer to split on categorical predictors with a large number of categories.
A few recommended solutions (a short R sketch follows this list):
Bin your categorical predictor into fewer bins (that are still meaningful to you).
Order the predictor according to means (slide 20). This is my prof's recommendation; in R it would mean using an ordered factor.
Also, be careful about the influence of this categorical predictor. For example, one thing you can do with the randomForest package is to set the mtry parameter to a lower number. This controls the number of variables the algorithm looks through for each split. When it's set lower, your categorical predictor will be a split candidate less often relative to the other variables. This will speed up estimation times and let the decorrelation advantage of the randomForest method help ensure you don't overfit to your categorical variable.
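Here is a small R sketch of the first two ideas plus a lower mtry, using a hypothetical data frame dat with a 16-level factor f and a numeric response y:

# Order the factor levels by the mean response, then treat the predictor as ordered:
lvl_means <- tapply(dat$y, dat$f, mean)        # mean response per level
dat$f_ord <- factor(dat$f, levels = names(sort(lvl_means)), ordered = TRUE)
# An ordered factor is split with <= cuts, so only 15 split points are
# considered instead of 2^15 - 1 possible subsets of the 16 levels.

# Or collapse rare levels into an "other" bin (the count threshold is illustrative):
rare <- names(which(table(dat$f) < 1000))
dat$f_bin <- factor(ifelse(as.character(dat$f) %in% rare, "other", as.character(dat$f)))

# Lower mtry so the factor is considered for fewer splits:
library(randomForest)
rf <- randomForest(y ~ ., data = dat, mtry = 3, ntree = 200)  # mtry below the default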
Finally, I'd recommend looking at the MARS or PRIM methods. My professor has some slides on that here. I know that PRIM is known for being low in computational requirement.

Random Forest optimization with tuning and cross-validation

I'm working with a large data set, so I hope to remove extraneous variables and tune for an optimal m variables per branch. In R, there are two methods, rfcv and tuneRF, that help with these two tasks. I'm attempting to combine them to optimize parameters.
rfcv works roughly as follows:
create random forest and extract each variable's importance;
while (nvar > 1) {
remove the k (or k%) least important variables;
run random forest with remaining variables, reporting cverror and predictions
}
Presently, I've recoded rfcv to work as follows:
create random forest and extract each variable's importance;
while (nvar > 1) {
remove the k (or k%) least important variables;
tune for the best m for reduced variable set;
run random forest with remaining variables, reporting cverror and predictions;
}
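A rough R sketch of this recoded loop, assuming the randomForest package, a hypothetical data frame dat with response column y, and illustrative values for the drop fraction and tree counts:

library(randomForest)

x <- dat[, setdiff(names(dat), "y")]
y <- dat$y

while (ncol(x) > 1) {
  # Tune mtry for the current variable set.
  res <- tuneRF(x, y, mtryStart = max(1, floor(sqrt(ncol(x)))),
                stepFactor = 1.5, improve = 0.01, ntreeTry = 200,
                trace = FALSE, plot = FALSE, doBest = FALSE)
  best_mtry <- res[which.min(res[, "OOBError"]), "mtry"]

  # Fit with the tuned mtry and report the out-of-bag error.
  rf <- randomForest(x, y, mtry = best_mtry, ntree = 500, importance = TRUE)
  print(rf)

  # Drop roughly the 20% least important variables (permutation importance).
  imp  <- importance(rf, type = 1)
  keep <- order(imp, decreasing = TRUE)[seq_len(max(1, floor(0.8 * ncol(x))))]
  x    <- x[, keep, drop = FALSE]
}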
This, of course, increases the run time by an order of magnitude. My question is how necessary this is (it's been hard to get an idea using toy datasets), and whether any other way could be expected to work roughly as well in far less time.
As always, the answer is it depends on the data. On one hand, if there aren't any irrelevant features, then you can just totally skip feature elimination. The tree building process in the random forest implementation already tries to select predictive features, which gives you some protection against irrelevant ones.
Leo Breiman gave a talk where he introduced 1000 irrelevant features into some medical prediction task that had only a handful of real features from the input domain. When he eliminated 90% of the features using a single filter on variable importance, the next iteration of random forest didn't pick any irrelevant features as predictors in its trees.
