I am working on a decision tree model. The dataset is related to cars. I have 80% of the data in the training set and 20% in the test set. The summary of the model (based on the training data) shows a misclassification rate of around 0.02605, whereas when I run the model on the training set it comes out to 0.0289; the difference between them is around 0.003. Is this difference acceptable, and what is causing it? I am new to R/statistics. Please share your feedback.
An acceptable misclassification rate is more art than science. If your data are generated from a single population, there is bound to be some unavoidable overlap between the groups, which will make linear classification error-prone. This doesn't mean it's a problem. For instance, if you are classifying credit card charges as possibly fraudulent or not, and your recourse isn't too harsh when you classify an observation as the former, then it may be advantageous to err on the safer side and end up with more false positives rather than a low misclassification rate. You could 1. visualize your data to identify overlap, or 2. compute N*0.03 to get the number of misclassified cases; if you have an understanding of what you are classifying, you can assess the seriousness of the misclassifications that way.
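To make the train/test comparison concrete, here is a minimal R sketch of computing the misclassification rate on both sets for a tree fit with rpart; the data frame cars_df and the outcome column class are placeholder names, not taken from the question.

    library(rpart)

    set.seed(1)
    n     <- nrow(cars_df)
    idx   <- sample(n, size = floor(0.8 * n))   # 80/20 split
    train <- cars_df[idx, ]
    test  <- cars_df[-idx, ]

    fit <- rpart(class ~ ., data = train, method = "class")

    # misclassification rate = proportion of wrong predictions
    train_err <- mean(predict(fit, train, type = "class") != train$class)
    test_err  <- mean(predict(fit, test,  type = "class") != test$class)
    c(train = train_err, test = test_err)

    # number of misclassified cases implied by a ~3% rate, as suggested above
    round(nrow(test) * 0.03)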
I'm working on a churn model; the main idea is to predict whether a customer will churn or not within 30 days.
I've been struggling with my dataset. I have 100k rows and my target variable is unbalanced: 95% no churn and 5% churn.
I'm trying GLM and RF. If I train both models with the raw data, I don't get any churn predictions, so that doesn't work for me. I have tried balancing by taking all churners and the same number of non-churners (50% churn, 50% no churn), training with that, and then testing with my data, and I get a lot of churn predictions for customers who are not churners. I have tried oversampling, undersampling, ROSE, and SMOTE, and it seems that nothing is working for me.
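For reference, the 50/50 undersampling described above can be written as a short R sketch; the data frame df and the 0/1 column churn are placeholder names, and the ROSE call is shown only as the alternative mentioned in the post.

    set.seed(42)
    churners    <- df[df$churn == 1, ]
    nonchurners <- df[df$churn == 0, ]

    # keep all churners and draw an equal number of non-churners at random
    balanced <- rbind(churners,
                      nonchurners[sample(nrow(nonchurners), nrow(churners)), ])

    table(balanced$churn)   # should now be roughly 50/50

    # alternative using the ROSE package mentioned above:
    # library(ROSE)
    # balanced <- ovun.sample(churn ~ ., data = df, method = "under", p = 0.5)$data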
At best, both models predict at most 20% of all my churners, so my gain and lift are not that good. I think I've tried everything, but I don't capture more than 20% of the churners I need to find.
I have customer behavior variables, personal information, and more. I also did an exploratory analysis, calculating the percentage of churn per age, per sex, and per behavior, and I saw that every group has the same churn percentage, so I'm thinking that maybe I lack variables that separate the groups better (this last idea is just a personal hunch).
Thank you everyone, greetings.
Currently I am working on a project whose objective is to find the customers who have the highest probability of purchasing the product. It's a classification model (0 & 1).
I have created models with both RF and XGB and calculated the gain score (the data is imbalanced). Now more than 80% of my customers are covered in the top 3 deciles for the training data, but when I run the model on the validation dataset, it falls back to 56-59% for both models.
Say I have 20 customers and, for better accuracy, I have clustered them. Now the model gives perfect results for cluster 1 customers but performs poorly on cluster 2 customers.
Any suggestions on how to tune this?
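For context, a "top 3 decile" coverage figure like the one above can be computed with a few lines of R. This is only a hedged sketch, with p_hat standing in for the predicted probabilities and y for the true 0/1 labels on a given dataset.

    # rank observations by predicted probability and cut them into 10 equal deciles
    decile <- cut(rank(-p_hat, ties.method = "first"), breaks = 10, labels = FALSE)

    gain <- tapply(y, decile, sum)      # positives captured in each decile (1 = highest scores)
    cum_gain <- cumsum(gain) / sum(y)   # cumulative gain curve
    cum_gain[3]                         # share of all positives covered by the top 3 deciles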
Firstly, if there is a large accuracy gap between your training and validation sets, your model is probably overfitting (high variance). You may need a simpler or more strongly regularized model for this problem, or more training data.
Secondly, because of the imbalance in your dataset, you may want to resample the training set. You can use under-sampling or over-sampling techniques (e.g. SMOTE).
Thirdly, you may need to use the right evaluation metrics, such as precision, recall, and F1.
Finally, in the train/validation/test split you need to be careful about the distribution of your dataset, so use a stratified split to handle this problem (see the sketch below).
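As an illustration of the last point, a stratified split can be done in R with caret. This is just a sketch; df and the factor outcome churn are placeholder names.

    library(caret)

    set.seed(1)
    in_train <- createDataPartition(df$churn, p = 0.8, list = FALSE)  # stratified on the outcome
    train <- df[in_train, ]
    valid <- df[-in_train, ]

    # class proportions should now be (nearly) identical in both pieces
    prop.table(table(train$churn))
    prop.table(table(valid$churn))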
I am using randomForest in R. I have a trained model with an R^2 of 0.94; however, its predictive performance on the test data is quite low. I would like to know whether I can still use this model just to determine which variables are most important/effective for predicting the output.
Thanks
Based on what little information you provide, the question is hard to answer (think about providing more detail and background). Low prediction quality can result from wrong algorithm tuning, or it can be inherent in the data, i.e. your predictors themselves are not very strongly related to the outcome. In the first case, the prediction could be better with different parameters, e.g. more or fewer trees, different values for mtry, etc. If this is the case, then your importance measures are just as biased as your prediction (and should be used with caution). If the predictors themselves are weak, that means your low-quality prediction is as good as it gets. In this case, I would say the importance measures can be used, but they only tell you which of your overall weak predictors are more or less weak.
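One way to check the "tuning vs. weak predictors" question in practice is to vary the main randomForest parameters and see whether out-of-bag performance (and the importance ranking) moves much. A rough sketch, assuming a numeric response y and a predictor data frame x with at least 8 columns (placeholder names):

    library(randomForest)

    set.seed(1)
    for (m in c(2, 4, 8)) {                     # candidate mtry values (assumed, adjust to your data)
      fit <- randomForest(x, y, mtry = m, ntree = 1000, importance = TRUE)
      cat("mtry =", m,
          " OOB % variance explained:", round(tail(fit$rsq, 1) * 100, 1), "\n")
    }

    # importance(fit) / varImpPlot(fit) then show %IncMSE and IncNodePurity for the last fit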
This question may seem weird, so let me explain it with an example.
We train a particular classification model to determine if an image contains a person or not.
After the model is trained, we use a new image for prediction.
The prediction result shows that there is a 94% probability that the image contains a person.
Thus, could I say that the confidence level is 94% that the image contains a person?
Your third item is not properly interpreted. The model returns a normalized score of 0.94 for the category "person". Although this score correlates reasonably well with our cognitive notions of "probability" and "confidence", do not confuse it with either of those. It's a convenient metric with some overall useful properties, but it is not a probability estimate that is accurate to two decimal places.
Granted, there may well be models for which the model's prediction is an accurate figure. For instance, the RealOdds models you'll find on 538 are built and tested to that standard. However, that is a directed effort of more than a decade; your everyday deep learning model is not held to the same standard ... unless you work to tune it to that, making the accuracy of that number a part of your training (incorporate it into the error function).
You can run a simple (although voluminous) experiment: collect all of the predictions and bin them, say with a range of 0.1 for each of 10 bins. Now, if this "prediction" is indeed a probability, then your 0.6-0.7 bin should correctly identify a person roughly 65% of the time. Check that against ground truth: did that bin get 65% correct and 35% wrong? Is the discrepancy within expected ranges? Do this for each of the 10 bins and run your favorite applicable statistical measures on the results.
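A rough R sketch of that binning experiment, assuming a vector of model scores score in [0, 1] and ground-truth labels truth (1 = the image really contains a person); both names are placeholders.

    bin <- cut(score, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)

    observed  <- tapply(truth, bin, mean)   # fraction of actual persons in each bin
    predicted <- tapply(score, bin, mean)   # average score in each bin
    counts    <- table(bin)

    # for a well-calibrated model, observed should track predicted within sampling noise
    cbind(predicted, observed, n = as.integer(counts))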
I expect that this will convince you that the inference score is neither a calibrated probability nor a confidence score. However, I'm also hoping it will give you some ideas for future work.
I am trying to use the random forests package for classification in R.
The Variable Importance Measures listed are:
mean raw importance score of variable x for class 0
mean raw importance score of variable x for class 1
MeanDecreaseAccuracy
MeanDecreaseGini
Now I know what these "mean" as in I know their definitions. What I want to know is how to use them.
What I really want to know is what these values mean purely in terms of how to use them: what is a good value, what is a bad value, what are the maximums and minimums, etc.
If a variable has a high MeanDecreaseAccuracy or MeanDecreaseGini does that mean it is important or unimportant? Also any information on raw scores could be useful too.
I want to know everything there is to know about these numbers that is relevant to the application of them.
An explanation that uses the words 'error', 'summation', or 'permuted' would be less helpful than a simpler explanation that didn't involve any discussion of how random forests work.
Like if I wanted someone to explain to me how to use a radio, I wouldn't expect the explanation to involve how a radio converts radio waves into sound.
How would you explain what the numbers in WKRP 100.5 FM "mean" without going into the pesky technical details of wave frequencies? Frankly, parameters and related performance issues with Random Forests are difficult to get your head around even if you understand some technical terms.
Here's my shot at some answers:
-mean raw importance score of variable x for class 0
-mean raw importance score of variable x for class 1
Simplifying from the Random Forest web page, raw importance score measures how much more helpful than random a particular predictor variable is in successfully classifying data.
-MeanDecreaseAccuracy
I think this is only in the R module, and I believe it measures how much inclusion of this predictor in the model reduces classification error.
-MeanDecreaseGini
Gini is defined as "inequality" when used to describe a society's distribution of income, or as a measure of "node impurity" in tree-based classification. A larger decrease in Gini (i.e. splits on that variable leave purer nodes) means that a particular predictor variable plays a greater role in partitioning the data into the defined classes. It's a hard one to describe without talking about the fact that data in classification trees are split at individual nodes based on values of predictors. I'm not so clear on how this translates into better performance.
For your immediate concern: higher values mean the variables are more important. This should be true for all the measures you mention.
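For the practical side, here is a small sketch of how you would pull these numbers out of a fitted forest in R and rank your variables by them; train_df and the outcome class are placeholder names.

    library(randomForest)

    rf  <- randomForest(class ~ ., data = train_df, importance = TRUE)
    imp <- importance(rf)   # per-class scores, MeanDecreaseAccuracy, MeanDecreaseGini

    # variables sorted from most to least important by the permutation measure
    imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ]

    varImpPlot(rf)          # the same information as a plot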
Random forests give you pretty complex models, so it can be tricky to interpret the importance measures. If you want to easily understand what your variables are doing, don't use RFs. Use linear models or a (non-ensemble) decision tree instead.
You said:
An explanation that uses the words 'error', 'summation', or 'permuted' would be less helpful than a simpler explanation that didn't involve any discussion of how random forests work.
It's going to be awfully tough to explain much more than the above unless you dig in and learn more about how random forests actually work. I assume you're complaining about either the package manual or this section of Breiman's manual:
http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#varimp
To figure out how important a variable is, they fill it with random junk ("permute" it), then see how much predictive accuracy decreases; MeanDecreaseAccuracy works this way. MeanDecreaseGini instead adds up how much each split on the variable reduces node impurity across the trees. I'm not sure what the raw importance scores are.
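If it helps, the "random junk" idea can be mimicked by hand on a held-out set. This is only a rough illustration (the real MeanDecreaseAccuracy is computed on the out-of-bag samples, not a test set), and rf, test_df, class, and some_var are placeholder names.

    # accuracy with the data as-is
    baseline <- mean(predict(rf, test_df) == test_df$class)

    # scramble one predictor and measure accuracy again
    permuted <- test_df
    permuted$some_var <- sample(permuted$some_var)
    scrambled <- mean(predict(rf, permuted) == permuted$class)

    baseline - scrambled   # a large drop suggests some_var carries real signal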
Interpretability is kinda tough with Random Forests. While RF is an extremely robust classifier, it makes its predictions democratically. By this I mean you build hundreds or thousands of trees, each from a random subset of your variables and a random subset of your data, then make a prediction for all the non-selected data and save that prediction. It's robust because it deals well with the vagaries of your data set (i.e. it smooths over randomly high/low values, fortuitous plots/samples, measuring the same thing four different ways, etc.). However, if you have some highly correlated variables, both may seem important, since they are not both always included in each tree.
One potential approach is to use the random forest to help whittle down your predictors and then switch to regular CART, or try the party package for inference-based tree models (see the sketch below). However, you must then be wary of data-mining issues and of making inferences about parameters.
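A sketch of that whittling-down workflow, assuming a training data frame train_df with a factor outcome class; the names and the cutoff of 5 variables are placeholders, not a recommendation.

    library(randomForest)
    library(rpart)

    rf  <- randomForest(class ~ ., data = train_df, importance = TRUE)
    imp <- importance(rf)[, "MeanDecreaseAccuracy"]

    top <- names(sort(imp, decreasing = TRUE))[1:5]   # keep the strongest predictors

    # refit a single, interpretable tree on just those predictors
    single_tree <- rpart(reformulate(top, response = "class"),
                         data = train_df, method = "class")

    # the party package mentioned above offers ctree() as an inference-based alternative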