Is it normal to test a model on independent data after cross-validation? - r

I want to fit a random forest model, so I split my data into 70% for training and 30% for testing. I applied a cross-validation procedure to the training data (70%) and obtained a precision from the cross-validation. After that, I tested my model on the test data (30%) and obtained a second precision.
So, I want to know whether this is a good approach to test the robustness of my model, and how to interpret these two precision values.
Thanks in advance.

You do not need to perform cross-validation when building a RF model, as RF computes its own CV-like estimate, known as the out-of-bag (OOB) score. In fact, the results that you get from the model (the confusion matrix at model_name$confusion) are based on the OOB predictions.
You can use the OOB scores (and the various metrics derived from them, such as precision, recall, etc.) to select a model from a list of candidates (for example, models with different parameters / arguments) and then use the test data to check whether the selected model generalises well.
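The workflow described above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the built-in iris data in place of the asker's data, a hypothetical 70/30 split, and mtry as the one parameter being compared.

```r
library(randomForest)

set.seed(42)
# Hypothetical 70/30 split of the built-in iris data
idx   <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
train <- iris[idx, ]
test  <- iris[-idx, ]

# Candidate values of mtry, compared by the OOB error rate of the full forest
mtry_grid <- c(1, 2, 3)
oob_err <- sapply(mtry_grid, function(m) {
  fit <- randomForest(Species ~ ., data = train, mtry = m, ntree = 500)
  fit$err.rate[fit$ntree, "OOB"]
})

# Refit with the value that had the lowest OOB error
best_mtry <- mtry_grid[which.min(oob_err)]
final <- randomForest(Species ~ ., data = train, mtry = best_mtry, ntree = 500)

final$confusion   # confusion matrix based on OOB predictions
# Accuracy on the held-out 30%, used only as a final generalisation check
test_acc <- mean(predict(final, newdata = test) == test$Species)
test_acc
```

The test set is touched only once, after the model has been chosen, so the two numbers answer different questions: the OOB error guides model selection and the test accuracy checks generalisation.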

Related

Why is the XGB model performing poorly on the validation dataset but well on the training data?

Currently I am working on a project whose objective is to find the customers who are most likely to purchase your product. It's a classification model (0 & 1).
I have created models with both RF and XGB and calculated the gain score (the data is imbalanced). Now more than 80% of customers are captured in the top 3 deciles for the training data, but when I run the model on the validation dataset, it falls back to 56-59% for both models.
Say I have 20 customers and, for better accuracy, I have clustered them. Now the model gives perfect results on cluster 1 customers but performs poorly on cluster 2 customers.
Any suggestions for tuning this?
Firstly, if there is a large accuracy gap between your training and validation sets, your model is probably overfitting (high variance rather than high bias). You may need a simpler or more strongly regularized model, or more training data.
Secondly, because your dataset is imbalanced, you may want to resample the training set. You can use under-sampling or over-sampling techniques (e.g. SMOTE).
Thirdly, you may need to use evaluation metrics suited to imbalanced data, such as precision, recall and F1.
Finally, in the train/val/test split you need to be careful about the distribution of your dataset. You can use a stratified split so that each subset keeps the same class proportions.
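As a minimal sketch of the stratified split mentioned in the last point, the following uses base R only. The data frame dat and its roughly 10% positive rate are made up for illustration, and the 70/30 ratio is an assumption.

```r
set.seed(1)
# Hypothetical imbalanced data: about 10% positives
dat <- data.frame(x = rnorm(1000), y = factor(rbinom(1000, 1, 0.1)))

# Sample 70% of the row indices separately within each class
train_idx <- unlist(lapply(split(seq_len(nrow(dat)), dat$y), function(rows) {
  rows[sample.int(length(rows), size = round(0.7 * length(rows)))]
}))

train <- dat[train_idx, ]
valid <- dat[-train_idx, ]

# Class proportions should be nearly identical in both subsets
prop.table(table(train$y))
prop.table(table(valid$y))
```

Because the sampling is done per class, the minority class cannot end up over- or under-represented in the validation set by chance.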

Machine Learning Keras accuracy model vs accuracy new data prediction

I built a deep learning model using keras. The model accuracy is 99%.
$`loss`
[1] 0.03411416
$acc
[1] 0.9952607
When I predict classes for my new data file using the model, only 87% of the classes are correctly classified. My question is: why is there a difference between the model accuracy and the prediction score?
Your 99% is on the training set. This is an indicator of how your algorithm is performing while training; you should never use it as a reference.
You should always look at your test set; this is the value that really matters.
Furthermore, your accuracies should always look like this (at least in shape):
e.g. the training set accuracy keeps growing and the testing set accuracy follows the same trend, but below the training curve.
You will never have two exactly identical sets (training and testing/validation), so it is normal to have a difference.
The objective of the training set is to let the model learn from your data.
The objective of the testing set is to see whether the model generalized well.
If your test accuracy is too far from your training accuracy, either there is a lot of difference between the two sets (mostly distribution, data types, etc.), or, if they are similar, your model overfits (which means your model is fitted too closely to your training data, so even a small difference in your testing data leads to wrong predictions).
The reason the model overfits is often that it is too complicated and you must simplify it (e.g. reduce the number of layers, reduce the number of neurons, etc.).

Feature selection and prediction accuracy in regression Forest in R

I am attempting to solve a regression problem where the input feature set is of size ~54.
Using OLS linear regression with a single predictor 'X1', I am not able to explain the variation in Y - hence I am trying to find additional important features using Regression forest (i.e., Random forest regression). The selected 'X1' is later found to be the most important feature.
My dataset has ~14500 entries. I have separated it into training and test sets in the ratio 9:1.
I have the following questions:
when trying to find the important features, should I run the regression forest on the entire dataset, or only on the training data?
Once the important features are found, should the model be re-built using the top few features to see whether feature selection speeds up the computation at a small cost to predictive power?
For now, I have built the model using the training set and all the features, and I am using it for prediction on the test set. I am calculating the MSE and R-squared from the training set. I am getting high MSE and low R2 on the training data, and reverse on the test data (shown below). Is this unusual?
forest <- randomForest(fmla, dTraining, ntree=501, importance=T)
mean((dTraining$y - predict(forest, data=dTraining))^2)
0.9371891
rSquared(dTraining$y, dTraining$y - predict(forest, data=dTraining))
0.7431078
mean((dTest$y - predict(forest, newdata=dTest))^2)
0.009771256
rSquared(dTest$y, dTest$y - predict(forest, newdata=dTest))
0.9950448
Please suggest.
Also, are R-squared and MSE good metrics for this problem, or do I need to look at other metrics to evaluate whether the model is good?
You should also try Cross Validated here
when trying to find the important features, should I run the regression forest on the entire dataset, or only on the training data?
Only on the training data. You want to prevent overfitting, which is why you do a train-test split in the first place.
Once the important features are found, should the model be re-built using the top few features to see whether feature selection speeds up the computation at a small cost to predictive power?
Yes, but the purpose of feature selection is not necessarily to speed up computation. With enough features, it is possible to fit any pattern in the data (i.e., overfitting). With feature selection, you're hoping to prevent overfitting by using only a few 'robust' features.
For now, I have built the model using the training set and all the features, and I am using it for prediction on the test set. I am calculating the MSE and R-squared from the training set. I am getting high MSE and low R2 on the training data, and reverse on the test data (shown below). Is this unusual?
Yes, it's unusual. You want low MSE and high R2 values for both your training and test data. (I would double check your calculations.) If you're getting high MSE and low R2 with your training data, it means your training was poor, which is very surprising. Also, I haven't used rSquared but maybe you want rSquared(dTest$y, predict(forest, newdata=dTest))?
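One way to double check the calculations is to compute MSE and R-squared directly from observed and predicted values, without relying on a helper such as rSquared. The sketch below uses made-up regression data in place of the asker's dTraining and dTest, and the helper name metrics is hypothetical. As far as I know, predict() on a randomForest object called without newdata returns out-of-bag predictions, and an argument named data is not one it recognises, so a call like predict(forest, data=dTraining) would give OOB predictions rather than in-sample ones.

```r
library(randomForest)

set.seed(7)
# Hypothetical regression data standing in for the asker's dTraining / dTest
n <- 500
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- 3 * d$x1 + sin(6 * d$x2) + rnorm(n, sd = 0.2)
idx <- sample(n, size = 0.9 * n)
dTraining <- d[idx, ]
dTest     <- d[-idx, ]

forest <- randomForest(y ~ ., data = dTraining, ntree = 501, importance = TRUE)

# MSE and R-squared computed from observed values and predictions
metrics <- function(obs, pred) {
  sse <- sum((obs - pred)^2)
  c(mse = sse / length(obs), r2 = 1 - sse / sum((obs - mean(obs))^2))
}

oob   <- metrics(dTraining$y, predict(forest))                       # out-of-bag
resub <- metrics(dTraining$y, predict(forest, newdata = dTraining))  # in-sample
test  <- metrics(dTest$y, predict(forest, newdata = dTest))          # held-out
rbind(oob, resub, test)
```

Comparing the three rows makes it easier to see whether a surprising training figure comes from the model or from how the predictions were obtained.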

How to consider different costs for different types of errors in SVM using R

Let Y be a binary variable.
If we use logistic regression for modeling, then we can use cv.glm for cross-validation, and there we can specify the cost function in the cost argument. By specifying the cost function, we can assign different unit costs to different types of errors: predicted Yes | reference is No, or predicted No | reference is Yes.
I am wondering if I could achieve the same in SVM. In other words, is there a way for me to specify a cost(loss) function instead of using built-in loss function?
Besides the answer by Yueguoguo, there are three more solutions: the standard wrapper approach, hyperplane tuning and the one in e1071.
The wrapper approach (available out of the box in, for example, Weka) is applicable to almost all classifiers. The idea is to over- or undersample the data in accordance with the misclassification costs. A model that is then trained to optimise accuracy on the resampled data is optimal under those costs.
The second idea is frequently used in text mining. The classifications of SVMs are derived from the distance to the hyperplane. For linearly separable problems this distance is +1 or -1 for the support vectors. The classification of a new example is then basically a question of whether the distance is positive or negative. However, one can also shift this threshold and make the decision not at 0 but at, for example, 0.8. That way the classifications are shifted in one direction or the other, while the general shape of the data is not altered.
Finally, some machine learning toolkits have a built-in parameter for class-specific costs, like class.weights in the e1071 implementation. The name is due to the fact that the term cost is already taken.
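A minimal sketch of the class.weights option in e1071 might look like this. The simulated data, the class labels "Yes"/"No" and the 5:1 weighting are all assumptions made for illustration.

```r
library(e1071)

set.seed(3)
# Hypothetical two-class data with overlapping classes
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- factor(ifelse(d$x1 + d$x2 + rnorm(n, sd = 0.8) > 0, "Yes", "No"))

# Unweighted fit for comparison
fit_plain <- svm(y ~ ., data = d, kernel = "linear")

# Errors on "Yes" cases are penalised 5 times as heavily as errors on "No" cases
fit_weighted <- svm(y ~ ., data = d, kernel = "linear",
                    class.weights = c(No = 1, Yes = 5))

table(predicted = predict(fit_plain, d),    actual = d$y)
table(predicted = predict(fit_weighted, d), actual = d$y)
```

With the heavier weight on "Yes", the weighted model should tend to predict "Yes" more often, trading extra false positives for fewer missed positives.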
The hyperplane parameters of an SVM are fitted automatically under its built-in loss function, thanks to the beautiful theoretical foundation of the algorithm. Cross-validation is used for tuning the hyperparameters. Say an RBF kernel is used: cross-validation selects the combination of C (cost) and gamma (kernel parameter) that gives the best performance, measured by a chosen metric (e.g., classification error, or mean squared error for regression). In e1071, this can be done with the tune method, where the range of hyperparameters as well as the cross-validation settings (i.e., 5-, 10- or more fold cross-validation) can be specified.
To obtain comparative cross-validation results by using Area-Under-Curve type of error measurement, one can train different models with different hyperparameter configurations and then validate the model against sets of pre-labelled data.
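As a rough sketch of how a custom cost could enter the tuning step, e1071's tune.control accepts an error.fun that receives the true and predicted values. The data, the unit costs (5 for a missed "Yes", 1 for a false "Yes") and the function name cost_fun below are hypothetical.

```r
library(e1071)

set.seed(11)
# Hypothetical two-class data
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- factor(ifelse(d$x1 + d$x2 + rnorm(n, sd = 0.8) > 0, "Yes", "No"))

# Hypothetical unit costs: a missed "Yes" costs 5, a false "Yes" costs 1
cost_fun <- function(true, predicted) {
  mean(5 * (true == "Yes" & predicted == "No") +
       1 * (true == "No"  & predicted == "Yes"))
}

# 5-fold cross-validation over a small grid of cost and gamma,
# minimising the average misclassification cost instead of the error rate
tuned <- tune(svm, y ~ ., data = d, kernel = "radial",
              ranges = list(cost = c(0.1, 1, 10), gamma = c(0.1, 1)),
              tunecontrol = tune.control(sampling = "cross", cross = 5,
                                         error.fun = cost_fun))

tuned$best.parameters    # cost / gamma with the lowest average cost
tuned$best.performance   # the corresponding cross-validated cost
```

Note that this changes only how hyperparameter settings are compared; each individual SVM is still fitted with its usual loss.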
Hope the answer helps.

Applying k-fold Cross Validation model using caret package

Let me start by saying that I have read many posts on cross-validation and it seems there is much confusion out there. My understanding is simply this:
Perform k-fold cross-validation, e.g. 10 folds, to understand the average error across the 10 folds.
If acceptable, then train the model on the complete data set.
I am attempting to build a decision tree using rpart in R and taking advantage of the caret package. Below is the code I am using.
# load libraries
library(caret)
library(rpart)
# define training control
train_control<- trainControl(method="cv", number=10)
# train the model
model<- train(resp~., data=mydat, trControl=train_control, method="rpart")
# make predictions
predictions<- predict(model,mydat)
# append predictions
mydat<- cbind(mydat,predictions)
# summarize results
confusionMatrix<- confusionMatrix(mydat$predictions,mydat$resp)
I have one question regarding the caret train application. I have read A Short Introduction to the caret Package train section which states during the resampling process the "optimal parameter set" is determined.
In my example have I coded it up correctly? Do I need to define the rpart parameters within my code or is my code sufficient?
When you perform k-fold cross-validation you are already making a prediction for each sample, just over 10 different models (presuming k = 10).
There is no need to make a prediction on the complete data, as you already have the predictions from the k different models.
What you can do is the following:
train_control<- trainControl(method="cv", number=10, savePredictions = TRUE)
Then
model<- train(resp~., data=mydat, trControl=train_control, method="rpart")
If you want to see the observed values and predictions in a nice format, simply type:
model$pred
Also, for the second part of your question, caret should handle the parameter tuning. You can tune the parameters manually if you desire.
An important thing to note here is not to confuse model selection with model error estimation.
You can use cross-validation to select the model hyper-parameters (a regularization parameter, for example).
Usually that is done with 10-fold cross-validation, because it is a good choice for the bias-variance trade-off (2-fold can lead to high bias, while leave-one-out CV can lead to high variance/over-fitting).
After that, if you don't have an independent test set, you can estimate an empirical distribution of some performance metric using cross-validation: once you have found the best hyper-parameters, you can use them to estimate the CV error.
Note that in this step the hyper-parameters are fixed, but the model parameters may differ across the cross-validation models.
In the first page of the short introduction document for caret package, it is mentioned that the optimal model is chosen across the parameters.
As a starting point, one must understand that cross-validation is a procedure for selecting the best modeling approach rather than the model itself (see CV - Final model selection). caret provides a grid search option via tuneGrid, where you can provide a grid (data frame) of parameter values to test. The final model will have the optimized parameters once training is done.
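A minimal sketch of the tuneGrid option with rpart could look like this. It uses the built-in iris data rather than the asker's mydat, and the cp values in the grid are arbitrary choices for illustration.

```r
library(caret)
library(rpart)

set.seed(123)
# 10-fold CV, keeping the held-out predictions for the selected parameter value
train_control <- trainControl(method = "cv", number = 10,
                              savePredictions = "final")

# For method = "rpart", caret tunes the complexity parameter cp
grid <- expand.grid(cp = c(0.001, 0.01, 0.05, 0.1))

model <- train(Species ~ ., data = iris, method = "rpart",
               trControl = train_control, tuneGrid = grid)

model$results      # cross-validated accuracy for each cp
model$bestTune     # cp chosen by cross-validation
head(model$pred)   # held-out predictions for the chosen cp
```

After the resampling step, train refits the model on the complete data set with the selected cp and stores it in model$finalModel, which matches the "if acceptable, train on the complete data" step in the question.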
