Feature selection and prediction accuracy in regression forest in R

I am attempting to solve a regression problem where the input feature set is of size ~54.
Using OLS linear regression with a single predictor 'X1', I am not able to explain the variation in Y, so I am trying to find additional important features using a regression forest (i.e., random forest regression). 'X1' later turns out to be the most important feature.
My dataset has ~14500 entries. I have split it into training and test sets in a 9:1 ratio.
I have the following questions:
When trying to find the important features, should I run the regression forest on the entire dataset, or only on the training data?
Once the important features are found, should the model be re-built using the top few features to see whether feature selection speeds up the computation at a small cost to predictive power?
For now, I have built the model using the training set and all the features, and I am using it for prediction on the test set. I am calculating MSE and R-squared on both the training and the test set. I am getting high MSE and low R2 on the training data, and the reverse on the test data (shown below). Is this unusual?
forest <- randomForest(fmla, dTraining, ntree=501, importance=T)
mean((dTraining$y - predict(forest, data=dTraining))^2)
0.9371891
rSquared(dTraining$y, dTraining$y - predict(forest, data=dTraining))
0.7431078
mean((dTest$y - predict(forest, newdata=dTest))^2)
0.009771256
rSquared(dTest$y, dTest$y - predict(forest, newdata=dTest))
0.9950448
Please suggest. Also, are R-squared and MSE good metrics for this problem, or should I look at other metrics to evaluate whether the model is good?

You should also try asking this on Cross Validated.
When trying to find the important features, should I run the regression forest on the entire dataset, or only on the training data?
Only on the training data. You want to prevent overfitting, which is why you do a train-test split in the first place.
Once the important features are found, should the model be re-built using the top few features to see whether feature selection speeds up the computation at a small cost to predictive power?
Yes, but the purpose of feature selection is not necessarily to speed up computation. With infinite features, it is possible to fit any pattern of data (i.e., overfitting). With feature selection, you're hoping to prevent overfitting by using only a few 'robust' features.
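If you do rebuild, a minimal sketch (using the forest, dTraining and dTest objects from the question, fit with importance=TRUE, and an arbitrary cut-off of the top 10 features) could look like this:

library(randomForest)

# Hedged sketch: rank predictors by permutation importance, refit on the top
# few, and compare test error. The cut-off of 10 features is arbitrary.
imp <- importance(forest, type = 1)                    # %IncMSE for a regression forest
top_vars <- names(sort(imp[, 1], decreasing = TRUE))[1:10]

fmla_small <- reformulate(top_vars, response = "y")
forest_small <- randomForest(fmla_small, data = dTraining, ntree = 501)

mean((dTest$y - predict(forest_small, newdata = dTest))^2)   # test MSE with fewer features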
For now, I have built the model using the training set and all the features, and I am using it for prediction on the test set. I am calculating MSE and R-squared on both the training and the test set. I am getting high MSE and low R2 on the training data, and the reverse on the test data (shown below). Is this unusual?
Yes, it's unusual. You want low MSE and high R2 for both your training and test data, so I would double-check your calculations. If you're getting high MSE and low R2 on your training data, it means the training was poor, which is very surprising. Also, I haven't used rSquared, but maybe you want rSquared(dTest$y, predict(forest, newdata=dTest))?
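One thing to check in the calculations above: predict.randomForest takes newdata=, not data=, so predict(forest, data=dTraining) most likely ignores that argument and returns out-of-bag predictions for the training set, which have higher error than the in-sample fitted values. A minimal sketch of computing the metrics consistently (using plain formulas rather than rSquared, whose argument order I am not certain about):

library(randomForest)

forest <- randomForest(fmla, data = dTraining, ntree = 501, importance = TRUE)

# In-sample (fitted) vs. test-set predictions -- note newdata=, not data=.
pred_train <- predict(forest, newdata = dTraining)
pred_test  <- predict(forest, newdata = dTest)

mse <- function(y, yhat) mean((y - yhat)^2)
r2  <- function(y, yhat) 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)

mse(dTraining$y, pred_train); r2(dTraining$y, pred_train)   # in-sample fit
mse(dTest$y, pred_test);      r2(dTest$y, pred_test)        # test-set performance

# predict(forest) with no newdata returns out-of-bag predictions, a more honest
# estimate of generalisation error than the in-sample fit.
mse(dTraining$y, predict(forest)); r2(dTraining$y, predict(forest))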

Related

How to do cross-validation in R using neuralnet?

I'm trying to build a predictive model using the neuralnet package. First I'm splitting my dataset into training (80%) and test (20%) sets. But an ANN is such a powerful technique that my model easily overfits the training set and performs poorly on the external test set.
[Plot: predicted vs. true values; training set on the right, test set on the left]
Is there a way to do cross-validation on the training set so that my model doesn't overfit it? How can I do this with my own hand-built function?
Plus, are there any other approaches when dealing with deep learning? I've heard you can tweak the weights of the model in order to improve its quality on external data.
Thanks in advance!
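One option is to run k-fold cross-validation on the training set yourself. A minimal sketch, assuming a hypothetical data frame df with numeric response y and predictors x1 and x2 (adapt the formula and hidden-layer sizes to your data):

library(neuralnet)

set.seed(42)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(df)))     # random fold assignment

cv_mse <- sapply(1:k, function(i) {
  train <- df[folds != i, ]
  test  <- df[folds == i, ]
  nn <- neuralnet(y ~ x1 + x2, data = train,
                  hidden = c(5), linear.output = TRUE)
  preds <- compute(nn, test[, c("x1", "x2")])$net.result
  mean((test$y - preds)^2)                           # held-out MSE for this fold
})

mean(cv_mse)   # cross-validated estimate of generalisation error

If the cross-validated error is much worse than the training error, the network is overfitting and you can shrink the hidden layers or add regularisation.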

Machine Learning: Keras model accuracy vs. accuracy of predictions on new data

I built a deep learning model using Keras. The model's accuracy is 99%:
$`loss`
[1] 0.03411416
$acc
[1] 0.9952607
When I predict classes on my new data file using the model, only 87% of the cases are classified correctly. My question is: why is there a difference between the model's accuracy and its prediction score on new data?
Your 99% is on the training set; it is an indicator of how your algorithm is performing while training, and you should never take it as the reference.
You should always look at your test set; that is the value that really matters.
Furthermore, your accuracy curves should generally look like this (at least in shape): the training-set accuracy keeps growing, and the test-set accuracy follows the same trend but stays below the training curve.
You will never have two exactly identical sets (training and testing/validation), so it is normal to see a difference.
The objective of training is to learn from the data and generalise from it.
The objective of the test set is to check whether you generalised well.
If your test performance is far from your training performance, then either the two sets differ a lot (mostly in distribution, data types, etc.), or, if they are similar, your model overfits (which means it sticks too closely to the training data, so even a small difference in the test data leads to wrong predictions).
The reason a model overfits is often that it is too complicated and needs to be simplified (e.g., reduce the number of layers or the number of neurons).
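As a rough sketch of what that looks like with the R interface to Keras (hypothetical objects x_train/y_train and x_test/y_test; the layer sizes and dropout rate are placeholders), monitor a validation split during training and judge the model only on held-out data:

library(keras)

# Hypothetical binary classifier; the point is validation_split during fit()
# and the final evaluate() on data the model has never seen.
model <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = ncol(x_train)) %>%
  layer_dropout(rate = 0.3) %>%                      # some regularisation against overfitting
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(optimizer = "adam",
                  loss = "binary_crossentropy",
                  metrics = "accuracy")

history <- model %>% fit(
  x_train, y_train,
  epochs = 30, batch_size = 32,
  validation_split = 0.2                             # watch this curve, not training accuracy
)
plot(history)                                        # training vs. validation curves

model %>% evaluate(x_test, y_test)                   # the number that actually matters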

My understanding of: How does cv.glmnet work to choose the optimal lambda?

I wish to confirm my understanding of the CV procedure in the glmnet package so I can explain it to a reviewer of my paper. I will be grateful if someone can add information to clarify the answer further.
Specifically, I had a binary classification problem with 29 input variables and 106 rows. Instead of splitting into training/test sets (and further shrinking the training data), I used the lasso, choosing lambda through cross-validation, as a means to minimise overfitting. After training the model with cv.glmnet, I tested its classification accuracy on the same dataset (bootstrapped x 10000 for error intervals). I acknowledge that overfitting cannot be eliminated in this setting, but the lasso, with its penalty term chosen by cross-validation, should lessen its effect.
My explanation to the reviewer (who is a doctor like me) of how cv.glmnet does this is:
In each step of 10-fold cross-validation, the data were divided randomly into two groups containing 9/10th of the data for training and 1/10th for internal validation (i.e., measuring the binomial deviance/error of the model developed with that lambda). Lambda vs. deviance was plotted. When the process was repeated 9 more times, 95% confidence intervals of lambda vs. deviance were derived. The final lambda value to go into the model was the one that gave the best compromise between high lambda and low deviance. High lambda is the factor that minimises overfitting because the regression model is not allowed to improve by assigning large coefficients to the variables. The model is then trained on the entire dataset using a least-squares approximation that minimises the model error penalised by the lambda term. Because the lambda term is chosen through cross-validation (and not from the entire dataset), the choice of lambda is somewhat independent of the data.
I suspect my explanation can be improved a lot, and that the experts reading this may point out flaws in the methodology.
Thanks in advance.
A bit late I guess, but here goes.
By default, the coef and predict methods for a cv.glmnet fit use lambda.1se: the largest λ at which the cross-validated error (deviance, in your binomial case) is within one standard error of the minimal error. Along the lines of overfitting, this usually reduces overfitting by selecting a simpler model (fewer non-zero terms) whose error is still close to that of the model with the least error. You can also check out this post. I am not sure whether this is what you mean by "The final lambda value to go into the model was the one that gave the best compromise between high lambda and low deviance."
The main issue with your approach is calculating accuracy on the same data the model was trained on. This does not tell you how well the model will perform on unseen data, and bootstrapping that same data does not correct the optimism in the accuracy estimate. For an estimate of the error, you should actually use the error from the cross-validation. If a model trained on 90% of the data does not work, I don't see how training on all of it would.
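A minimal sketch of that workflow with cv.glmnet (hypothetical predictor matrix x and binary response y; type.measure = "class" reports misclassification error instead of deviance):

library(glmnet)

set.seed(1)
cvfit <- cv.glmnet(x, y, family = "binomial", type.measure = "class", nfolds = 10)

cvfit$lambda.min    # lambda with the lowest cross-validated error
cvfit$lambda.1se    # largest lambda within one SE of that minimum (default for coef/predict)

# Cross-validated error at lambda.1se: a less optimistic performance estimate
# than re-predicting the data the model was fitted on.
cvfit$cvm[cvfit$lambda == cvfit$lambda.1se]

coef(cvfit, s = "lambda.1se")   # the sparse coefficient vector actually used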

Class importance for random forest in R

I'm using the randomForest package in R to predict a binary class from 11 numerical predictors. Of the two classes, Hit and Miss, Hit is the more important one, i.e. what matters most is how often Hit is predicted correctly.
Is there a way to give Hit higher importance when training the random forest? Currently the trained random forest predicts only 7% of the Hit cases correctly, and I would definitely like to improve that.
Higher importance? I don't know how to tell any algorithm "I'm not kidding this time: I want this analysis to be accurate."
You're always fighting the variance versus bias battle. If you improve the training accuracy too much, you run the risk of overfitting.
You can adjust a random forest by varying the size of the random sample of predictors considered at each split. If you have m predictors, the usual recommendation for classification is mtry = sqrt(m), the number of predictors randomly tried as split candidates in each tree node. You can also vary the number of trees. Plot the test classification error versus the number of trees for different values of mtry to see how you do.
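A hedged sketch of that comparison (hypothetical data frames dTrain and dTest with a factor column class holding "Hit"/"Miss" and 11 numeric predictors):

library(randomForest)

m <- 11
mtry_grid <- c(2, floor(sqrt(m)), 6)     # a few values around the sqrt(m) default

plot(NULL, xlim = c(1, 500), ylim = c(0, 0.5),
     xlab = "number of trees", ylab = "test classification error")

for (i in seq_along(mtry_grid)) {
  rf <- randomForest(x = subset(dTrain, select = -class), y = dTrain$class,
                     xtest = subset(dTest, select = -class), ytest = dTest$class,
                     ntree = 500, mtry = mtry_grid[i])
  lines(seq_len(500), rf$test$err.rate[, 1], col = i)   # overall test error by tree count
}
legend("topright", legend = paste("mtry =", mtry_grid),
       col = seq_along(mtry_grid), lty = 1)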
You can also try another algorithm, like gbm (Generalized Boosted Regression Models) or support vector machines.
How does your data look when you plot it? Any obvious groups jumping out at you when you look at them in scatterplots?
Regardless of the algorithm, I'd advise that you do n-fold cross-validation of your model.

Massive datasets with the randomForest package

I have about 300,000 rows of data and 10 features in my model and I want to fit a random forest from the randomForest package in R.
To maximise the number of trees I can fit in a fixed window of time without hurting generalisation, what are sensible ranges for the parameters?
Usually you can get away with tuning just mtry, as explained here, and the default is often best:
https://stats.stackexchange.com/questions/50210/caret-and-randomforest-number-of-trees
But there is a function tuneRF in randomForest that will help you find an optimal mtry (for a given number of trees), as explained here:
setting values for ntree and mtry for random forest regression model
How long it takes you will have to test yourself; it will scale roughly with the product of the number of folds, the number of tuning values, and the number of trees.
The only speculative point I would add is that with 300,000 rows of data you might reduce the runtime, without losing predictive accuracy, by training each tree on a smaller bootstrap sample of the data.
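For example, a hedged sketch (hypothetical data frame d with response y and 10 features; the per-tree sample size of 50,000 is arbitrary):

library(randomForest)

x <- subset(d, select = -y)

# Search for a good mtry around the default, using a modest number of trees.
tuned <- tuneRF(x, d$y, ntreeTry = 100, stepFactor = 1.5, improve = 0.01)
best_mtry <- tuned[which.min(tuned[, 2]), 1]   # column 1 = mtry, column 2 = OOB error

# sampsize draws a smaller bootstrap sample per tree, which can cut runtime
# substantially on ~300,000 rows while often costing little accuracy.
rf <- randomForest(x, d$y, ntree = 500, mtry = best_mtry, sampsize = 50000)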
