Statistical comparison of machine learning algorithms

I am working in machine learning and I am stuck on one thing.
I want to compare 4 machine learning techniques across 10 datasets. After running the experiments I obtained an Area Under the Curve (AUC) value for each technique on each dataset. I then applied an analysis of variance (ANOVA) test, which shows there is a significant difference between the 4 machine learning techniques.
My problem now is: which test will tell me that a particular algorithm performs better than the others? I want a single winner among the machine learning techniques.

A classifier's quality can be measured by the F-score, which summarizes the test's precision and recall in a single number. Comparing these respective scores will give you a simple measure.
However, if you want to test whether the difference between the classifiers' accuracies is significant, you can try a Bayesian test or, if the classifiers are trained only once, McNemar's test.
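McNemar's test is available in base R as mcnemar.test. A minimal, hedged sketch (the labels and predictions below are simulated stand-ins for two classifiers evaluated on the same test set):

    set.seed(1)
    y_test <- rbinom(200, 1, 0.5)                             # stand-in true labels
    pred_a <- ifelse(runif(200) < 0.85, y_test, 1 - y_test)   # classifier A, roughly 85% accurate
    pred_b <- ifelse(runif(200) < 0.75, y_test, 1 - y_test)   # classifier B, roughly 75% accurate
    correct_a <- pred_a == y_test                             # which cases A got right
    correct_b <- pred_b == y_test                             # which cases B got right
    mcnemar.test(table(A = correct_a, B = correct_b))         # do the two error rates differ?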
There are other possibilities, and the papers "On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach" and "Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms" are probably worth reading.

If you are gathering performance metrics (ROC AUC, accuracy, sensitivity, specificity, ...) from identically resampled data sets, then you can perform statistical tests using paired comparisons. Most statistical software implements Tukey's range test on top of the ANOVA: https://en.wikipedia.org/wiki/Tukey%27s_range_test. A formal treatment of this material is here: http://epub.ub.uni-muenchen.de/4134/1/tr030.pdf. This is the test I like to use for the purpose you describe, although there are others and people have varying opinions.
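As a rough sketch of what that looks like in R (the results data frame below is simulated purely for illustration; in practice it would hold one AUC value per technique per dataset, with the dataset acting as a blocking factor for the paired comparison):

    set.seed(1)
    results <- data.frame(
      auc       = runif(40, 0.7, 0.9),                          # stand-in AUC values
      technique = factor(rep(c("svm", "rf", "nb", "knn"), times = 10)),
      dataset   = factor(rep(paste0("d", 1:10), each = 4))
    )
    fit <- aov(auc ~ technique + dataset, data = results)       # dataset as a blocking factor
    summary(fit)
    TukeyHSD(fit, which = "technique")                          # all pairwise technique comparisons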
You will still have to choose how you will resample based on your data: k-fold, repeated k-fold, bootstrap, leave-one-out, or repeated training/test splits. Bootstrap methods tend to give you the tightest confidence intervals after leave-one-out, but leave-one-out may not be an option if your data set is huge.
That being said, you may also need to consider the problem domain. False positives may be an issue in classification, and you may need other metrics to choose the best performer for the domain; AUC is not always the best metric for a specific application. For instance, a credit card company may not want to deny transactions to its customers, so it needs a very low false positive rate on its fraud classifier.
You may also want to consider implementation. If a logistic regression performs nearly as well, it may be a better choice than a more complicated random forest implementation. Are there legal implications to the model's use (Fair Credit Reporting Act, ...)?
A common-sense approach is to begin with something like a random forest or gradient boosted trees to get an empirical sense of a performance ceiling, then build simpler models and use the simplest one that performs reasonably well compared to that ceiling.
Or you could combine all your models using something like LASSO... or some other model.
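A hedged sketch of that idea with glmnet (the base-model predictions p_rf, p_gbm, p_glm below are simulated stand-ins; in practice they would be held-out predicted probabilities from your fitted models):

    library(glmnet)
    set.seed(1)
    n <- 500
    y_valid <- rbinom(n, 1, 0.5)                     # stand-in 0/1 outcome on a validation set
    p_rf  <- plogis(2 * y_valid - 1 + rnorm(n))      # simulated predictions from base model 1
    p_gbm <- plogis(2 * y_valid - 1 + rnorm(n))      # simulated predictions from base model 2
    p_glm <- plogis(rnorm(n))                        # simulated predictions from a weak base model 3
    stack_x   <- cbind(p_rf, p_gbm, p_glm)
    stack_fit <- cv.glmnet(stack_x, y_valid, family = "binomial", alpha = 1)  # alpha = 1 is the LASSO
    coef(stack_fit, s = "lambda.min")                # shows which base models the LASSO keeps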

Related

Random forest to determine most important predictors explaining variation [duplicate]


Machine learning project: split training/test sets before or after exploratory data analysis?

Is it best to split your data into training and test sets before doing any exploratory data analysis, or do all exploration based solely on training data?
I'm working on my first full machine learning project (a recommendation system for a course capstone project) and am looking for clarification on order of operations. My rough outline is to import and clean, do exploratory analysis, train my model, and then evaluate on a test set.
I am doing exploratory data analysis now - nothing special initially, just starting with variable distributions and whatnot. But I am not sure: should I split my data into training and test sets before or after exploratory analysis?
I don't want to potentially contaminate algorithm training by inspecting the test set. However, I also don't want to miss visual trends that might reflect real signal that my poor human eye might not see after filtering, and thus potentially miss investigating an important and relevant direction while designing my algorithm.
I checked other threads, like this, but the ones I found seem to ask more about things like regularization or actual manipulation of the original data. The answers I found were mixed but prioritized splitting first. However, I don't plan to do any actual manipulation of the data before splitting it (beyond inspecting distributions and potentially doing some factor conversions).
What do you do in your own work and why?
Thanks for helping a new programmer!
To answer this question, we should remind ourselves of why, in machine learning, we split data into training, validation and testing sets (see also this question).
Training sets are used for model development. We often carefully explore this data to get ideas for feature engineering and the general structure of the machine learning model. We then train the model using the training data set.
Usually, our goal is to generate models that will perform well not only on the training data, but also on previously unseen data. Therefore, we want to avoid models that capture the peculiarities of the data we have available now rather than the general structure of the data we will see in the future ("overfitting"). To do so, we assess the quality of the models we're training by evaluating their performance on a different set of data, the validation data, and choose the model that performs best on the validation data.
Having trained our final model, we often want an unbiased estimate of its performance. Since we have already used the validation data in the process of model development (we chose the model that performed best on the validation data), we cannot be sure that our model will perform equally well on unseen data. So, to assess model quality, we test performance using a new batch of data, the testing data.
This discussion gives the answer to your question: we should not use the testing (or validation) data set for exploratory data analysis. If we did, we would run the risk of overfitting the model to the peculiarities of the data we have, for example by engineering features that work well for the testing data. At the same time, we would lose the ability to get an unbiased estimate of our model's performance.
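A minimal sketch of splitting first in base R (dat below is a built-in stand-in for your imported data; exploration then only touches train):

    set.seed(42)                                     # make the split reproducible
    dat <- iris                                      # stand-in for your imported, cleaned data
    test_idx <- sample(nrow(dat), size = round(0.2 * nrow(dat)))
    test  <- dat[test_idx, ]                         # locked away until final evaluation
    train <- dat[-test_idx, ]                        # all exploratory analysis happens here
    summary(train)                                   # e.g., start by inspecting distributions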
I would take the problem the other way round: is it bad to use the test set?
The objective of modeling is to end up with a model with low variance (and small bias): that's why the test set keeps a bunch of data aside to assess how your model behaves with new data (i.e., its variance). If you use the test set during modeling, you have nothing left to do that with, and you are overfitting your data.
The objective of EDA is to understand the data you're working with: the distributions of features, their relationships, their dynamics, etc. If you leave your test set in the data, is there a risk of "overfitting" your understanding of the data? If that were the case, you would observe on, say, 70% of your data some properties that are not valid for the remaining 30% (the test set); knowing that the split is random, this is nearly impossible unless you have been extremely unlucky.
From my understanding of the machine learning pipeline, exploratory data analysis should be done before splitting the data into train and test sets.
Here are my reasons:
The data may not be clean at the beginning. It might have missing values, mismatched data types and outliers.
You need to understand how every feature relates to the target variable in the dataset. This will help you gauge the importance of each feature with respect to the business problem and will help you derive additional features as well.
Data visualization will also help you get insights from the dataset.
Once the above operations are done, we can split the dataset into train and test sets, because the features must be distributed similarly in both.

Keras plots after model training: explanation

It is not really a programming question, I apologise for that. I have trained my network and generated these graphs.
http://imgur.com/a/zOglL
http://imgur.com/a/90JKl
I am struggling to find an answer to what Accuracy vs. Val_Accuracy and Loss vs. Val_Loss really represent. I do understand that if val_loss starts to jump up, it means that overfitting is going on and the network is starting to memorise the data rather than learn from it. Could anyone explain in a bit more detail what they all mean?
During neural network training you usually provide two sets of data: a training set and a validation set. The training algorithm takes data from the training set and, using calculus and backpropagation, tries to decrease a cost function that represents how good your model is (the smaller, the better). Alongside this, the cost is also computed on the validation set, which is not seen by the training algorithm, so you can check whether the model overfits the training data (this happens when the training loss is substantially smaller than the validation loss). Besides the loss, other metrics might be computed, e.g., accuracy. They sometimes give you better insight into how your model behaves, because the loss value can be hard to interpret; metrics give you a better sense of whether the model actually works well.
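As a hedged illustration, here is a toy example with the keras package in R (the data and network are stand-ins); the validation_split argument is what produces the val_loss / val_accuracy curves in plots like yours:

    library(keras)
    set.seed(1)
    x <- matrix(rnorm(1000 * 20), ncol = 20)         # toy inputs: 1000 samples, 20 features
    y <- rbinom(1000, size = 1, prob = 0.5)          # toy binary labels

    model <- keras_model_sequential() %>%
      layer_dense(units = 16, activation = "relu", input_shape = c(20)) %>%
      layer_dense(units = 1, activation = "sigmoid")

    model %>% compile(optimizer = "adam", loss = "binary_crossentropy", metrics = "accuracy")

    # 20% of the data is held out as a validation set; loss and accuracy are recorded
    # for both sets after every epoch.
    history <- model %>% fit(x, y, epochs = 20, batch_size = 32, validation_split = 0.2)
    plot(history)                                    # loss vs val_loss, accuracy vs val_accuracy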

How do you conduct a power analysis for logistic regression in R?

I'm familiar with G*Power as a tool for power analyses, but I have yet to find a resource on the internet describing how to compute a power analysis for logistic regression in R. The pwr package doesn't list logistic regression as an option.
You will very likely need to "roll your own".
Specify your hypothesized relationship between predictors and outcome.
Specify what values of your predictors you are likely to observe in your study. Will they be correlated?
Specify the effect size you would like to detect, e.g., odds ratios corresponding to two specific settings of your predictors.
Specify a desired power level, e.g., power = 0.80 (i.e., a Type II error rate of β = 0.20).
For different sample sizes n:
Simulate predictors as specified
Simulate outcomes
Run your analysis
Record whether you detect a statistically significant effect
Do these steps many times, on the order of 1,000 or more. Count how often you did detect an effect. If you detected an effect more than (e.g.) 80% of the time, you are overpowered - reduce n and start over. If you detected an effect less than 80% of the time, you are underpowered - increase n and start over. Rinse and repeat until you have a good n.
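A minimal simulation sketch of this recipe, under purely illustrative assumptions (a single standard-normal predictor, intercept -1, target odds ratio 1.5, alpha = 0.05):

    # Estimate power for a given sample size n by simulation (all settings are assumptions).
    power_for_n <- function(n, n_sim = 1000, beta0 = -1, odds_ratio = 1.5, alpha = 0.05) {
      beta1 <- log(odds_ratio)
      hits <- replicate(n_sim, {
        x <- rnorm(n)                                        # simulate the predictor
        p <- plogis(beta0 + beta1 * x)                       # true success probability
        y <- rbinom(n, size = 1, prob = p)                   # simulate the outcome
        fit <- glm(y ~ x, family = binomial)
        summary(fit)$coefficients["x", "Pr(>|z|)"] < alpha   # did we detect the effect?
      })
      mean(hits)                                             # proportion of significant runs = power
    }

    set.seed(1)
    sapply(c(100, 200, 400), power_for_n)                    # increase n until power exceeds 0.80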
And then think some more about whether all your assumptions really make sense. Vary them a bit. Is the resulting value of n sensitive to your assumptions?
Yes, this will be quite a bit of work. But it will be worth it. On the one hand, it will keep you from running an over- or underpowered study. On the other hand, as I wrote, this will force you to think deeply about your assumptions, and this is the path to enlightenment. (Which is a painful path to travel. Sorry.)
If you don't get any better answers specifically helping you to do this in R, you may want to look to CrossValidated for more help. Good luck!
This question and its answers on Cross Validated discuss power for logistic regression and include R code, as well as additional discussion and links for more information.

R Random Forests Variable Importance

I am trying to use the random forests package for classification in R.
The Variable Importance Measures listed are:
mean raw importance score of variable x for class 0
mean raw importance score of variable x for class 1
MeanDecreaseAccuracy
MeanDecreaseGini
Now I know what these "mean" as in I know their definitions. What I want to know is how to use them.
What I really want to know is what these values mean in practice: how they relate to accuracy, what is a good value, what is a bad value, what the maximums and minimums are, etc.
If a variable has a high MeanDecreaseAccuracy or MeanDecreaseGini does that mean it is important or unimportant? Also any information on raw scores could be useful too.
I want to know everything there is to know about these numbers that is relevant to the application of them.
An explanation that uses the words 'error', 'summation', or 'permutated' would be less helpful than a simpler explanation that didn't involve any discussion of how random forests work.
Like if I wanted someone to explain to me how to use a radio, I wouldn't expect the explanation to involve how a radio converts radio waves into sound.
How would you explain what the numbers in WKRP 100.5 FM "mean" without going into the pesky technical details of wave frequencies? Frankly, parameters and related performance issues with Random Forests are difficult to get your head around even if you understand some technical terms.
Here's my shot at some answers:
-mean raw importance score of variable x for class 0
-mean raw importance score of variable x for class 1
Simplifying from the Random Forest web page, raw importance score measures how much more helpful than random a particular predictor variable is in successfully classifying data.
-MeanDecreaseAccuracy
I think this is only in the R module, and I believe it measures how much inclusion of this predictor in the model reduces classification error.
-MeanDecreaseGini
Gini is defined as "inequity" when used to describe a society's distribution of income, or as a measure of "node impurity" in tree-based classification. A lower Gini (i.e., a higher decrease in Gini) means that a particular predictor variable plays a greater role in partitioning the data into the defined classes. It's hard to describe without talking about the fact that data in classification trees are split at individual nodes based on values of predictors. I'm not so clear on how this translates into better performance.
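For concreteness, here is a minimal sketch with the randomForest package, using the built-in iris data purely as a stand-in (importance = TRUE is what requests the accuracy-based measures alongside MeanDecreaseGini):

    library(randomForest)
    set.seed(1)
    fit <- randomForest(Species ~ ., data = iris, importance = TRUE, ntree = 500)
    importance(fit)    # per-class raw scores, MeanDecreaseAccuracy, MeanDecreaseGini
    varImpPlot(fit)    # variables sorted with the more important ones at the top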
For your immediate concern: higher values mean the variables are more important. This should be true for all the measures you mention.
Random forests give you pretty complex models, so it can be tricky to interpret the importance measures. If you want to easily understand what your variables are doing, don't use RFs. Use linear models or a (non-ensemble) decision tree instead.
You said:
An explanation that uses the words 'error', 'summation', or 'permutated' would be less helpful than a simpler explanation that didn't involve any discussion of how random forests work.
It's going to be awfully tough to explain much more than the above unless you dig in and learn about random forests themselves. I assume you're complaining about either the manual or this section of Breiman's documentation:
http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#varimp
To figure out how important a variable is, they fill it with random junk ("permute" it), then see how much predictive accuracy decreases; that is what MeanDecreaseAccuracy measures. MeanDecreaseGini instead adds up how much the splits on that variable reduce node impurity across the forest. I'm not sure what the raw importance scores are.
Interpretability is kind of tough with random forests. While RF is an extremely robust classifier, it makes its predictions democratically. By this I mean it builds hundreds or thousands of trees, each from a random subset of your variables and a random subset of your data, then makes a prediction for all the non-selected data and saves the prediction. It's robust because it deals well with the vagaries of your data set (i.e., it smooths over randomly high/low values, fortuitous plots/samples, measuring the same thing four different ways, etc.). However, if you have some highly correlated variables, both may seem important, since they are not both always included in each tree.
One potential approach is to use random forests to help whittle down your predictors, then switch to regular CART or try the party package for inference-based tree models. However, you must then be wary of data-mining issues when making inferences about parameters.
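A hedged sketch of that workflow, again with iris as a stand-in and the two retained predictors chosen purely for illustration:

    library(rpart)    # classic CART
    library(party)    # conditional inference trees

    # Suppose the random forest importances suggested keeping only these two predictors.
    cart_fit  <- rpart(Species ~ Petal.Length + Petal.Width, data = iris)
    ctree_fit <- ctree(Species ~ Petal.Length + Petal.Width, data = iris)
    print(cart_fit)
    plot(ctree_fit)   # the conditional inference tree reports p-values at each split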

Resources