Machine Learning Keras accuracy model vs accuracy new data prediction - r

I built a deep learning model using keras. The model's reported accuracy is 99%.
$`loss`
[1] 0.03411416
$acc
[1] 0.9952607
When I predict classes on my new data file using the model, only 87% of the classes are correctly classified. My question is: why is there a difference between the model accuracy and the model prediction score?

Your 99% is on the training set. This is an indicator of how your algorithm is performing while training; you should never use it as a reference.
You should always look at your Test Set, this is the real value that matters.
Furthermore, your accuracy curves should generally look like this (at least in shape): the training set accuracy keeps growing, and the test set accuracy follows the same trend but stays below the training curve.
You will almost never have two identical sets (training and testing/validation), so some difference between them is normal.
The objective of the training set is to let the model learn from your data and generalize from it.
The objective of the testing set is to see if you generalized well.
If your test accuracy is too far from your training accuracy, either there is a lot of difference between the two sets (mostly distribution, data types, etc.), or, if they are similar, your model overfits (which means your model is fitted too closely to your training data, so even a small difference in your testing data leads to wrong predictions).
The reason the model overfits is often that it is too complex, in which case you should simplify it (e.g. reduce the number of layers, reduce the number of neurons, etc.).
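The gap between training and test performance can be reproduced on a small scale. The following is a minimal sketch in base R, not the asker's keras model: the data are made up (a noisy sine curve), and polynomial degree stands in for model complexity (layers and neurons in the original question). It only illustrates that training error keeps improving as the model gets more flexible, while the test error has to be checked separately.

```r
# Toy data, made up for illustration: a noisy sine curve.
set.seed(42)
n <- 200
x <- runif(n, -3, 3)
y <- sin(x) + rnorm(n, sd = 0.3)
dat <- data.frame(x = x, y = y)

# Hold out 25% of the rows as a test set.
test_idx <- sample(n, size = n / 4)
train <- dat[-test_idx, ]
test  <- dat[test_idx, ]

mse <- function(actual, predicted) mean((actual - predicted)^2)

# Fit polynomials of increasing degree and record both errors.
degrees <- c(1, 3, 15)
results <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(degree    = d,
    train_mse = mse(train$y, predict(fit, newdata = train)),
    test_mse  = mse(test$y,  predict(fit, newdata = test)))
}))
print(results)
```

Because the three models are nested, the training error can only go down as the degree grows; comparing the `train_mse` and `test_mse` columns shows whether the extra flexibility actually helps on unseen data.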

Related

Problems with testset and trainset being too similar

In the dataset I have been given, the independent variables are given in intervals of 50 as shown below.
Hence, when I perform a train-test split on the dataset for linear regression, I obtain very similar train sets and test sets as shown:
I believe that this has an effect that is equivalent to training on the test set, which gives misleading results. Is it right to say, the test set may obtain high predictive accuracy, but my model will not be able to detect overfitting since overfitting is often detected through a large difference between train set and test set?
Additionally, what kind of limitations will having similar train and test sets pose in my modelling?

Can training set be used to determine variable importance using randomForest in R although the prediction of testing set is quite low?

I am using randomForest in R. I have a trained model with an R^2 of 0.94; however, its prediction capacity on the testing data is quite low. I would like to know if I can still use this trained model just to determine which variables are more important/effective for predicting the output.
Thanks
Based on what little information you provide, the question is hard to answer (think about providing more detail and background). Low prediction quality can result from wrong algorithm tuning, or it can be inherent in the data, i.e. your predictors themselves are not very strongly related to the outcome. In the first case, the prediction could be better with different parameters, e.g. more or fewer trees, different values for mtry, etc. If this is the case, then your importance measures are just as biased as your prediction (and should be used with caution). If the predictors themselves are weak, that means that your low quality prediction is as good as it gets. In this case, I would say the importance measures can be used, but they only tell you which of your overall weak predictors are more or less weak.

Feature selection and prediction accuracy in regression Forest in R

I am attempting to solve a regression problem where the input feature set is of size ~54.
Using OLS linear regression with a single predictor 'X1', I am not able to explain the variation in Y - hence I am trying to find additional important features using Regression forest (i.e., Random forest regression). The selected 'X1' is later found to be the most important feature.
My dataset has ~14500 entries. I have separated it into training and test sets in the ratio 9:1.
I have the following questions:
when trying to find the important features, should I run the regression forest on the entire dataset, or only on the training data?
Once the important features are found, should the model be re-built using the top few features to see whether feature selection speeds up the computation at a small cost to predictive power?
For now, I have built the model using the training set and all the features, and I am using it for prediction on the test set. I am calculating the MSE and R-squared from the training set. I am getting high MSE and low R2 on the training data, and reverse on the test data (shown below). Is this unusual?
forest <- randomForest(fmla, dTraining, ntree=501, importance=T)
mean((dTraining$y - predict(forest, data=dTraining))^2)
0.9371891
rSquared(dTraining$y, dTraining$y - predict(forest, data=dTraining))
0.7431078
mean((dTest$y - predict(forest, newdata=dTest))^2)
0.009771256
rSquared(dTest$y, dTest$y - predict(forest, newdata=dTest))
0.9950448
Please suggest.
Any suggestion if R-squared and MSE are good metrics for this problem, or if I need to look at some other metrics to evaluate if the model is good?
You could also try asking on Cross Validated.
when trying to find the important features, should I run the regression forest on the entire dataset, or only on the training data?
Only on the training data. You want to prevent overfitting, which is why you do a train-test split in the first place.
Once the important features are found, should the model be re-built using the top few features to see whether feature selection speeds up the computation at a small cost to predictive power?
Yes, but the purpose of feature selection is not necessarily to speed up computation. With infinite features, it is possible to fit any pattern of data (i.e., overfitting). With feature selection, you're hoping to prevent overfitting by using only a few 'robust' features.
For now, I have built the model using the training set and all the features, and I am using it for prediction on the test set. I am calculating the MSE and R-squared from the training set. I am getting high MSE and low R2 on the training data, and reverse on the test data (shown below). Is this unusual?
Yes, it's unusual. You want low MSE and high R2 values for both your training and test data. (I would double check your calculations.) If you're getting high MSE and low R2 with your training data, it means your training was poor, which is very surprising. Also, I haven't used rSquared but maybe you want rSquared(dTest$y, predict(forest, newdata=dTest))?
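One way to double check the calculations is to compute both metrics by hand from the observed and predicted values. The sketch below uses base R only; the helper names `mse` and `r_squared` are hypothetical and are not the `rSquared` function from the question, whose expected arguments should be checked in its own documentation.

```r
# Mean squared error between observed and predicted values.
mse <- function(actual, predicted) mean((actual - predicted)^2)

# R-squared computed from predictions: 1 - SS_res / SS_tot.
r_squared <- function(actual, predicted) {
  ss_res <- sum((actual - predicted)^2)      # residual sum of squares
  ss_tot <- sum((actual - mean(actual))^2)   # total sum of squares
  1 - ss_res / ss_tot
}

# Tiny made-up example.
actual    <- c(1, 2, 3)
predicted <- c(1, 2, 4)
mse(actual, predicted)        # 1/3
r_squared(actual, predicted)  # 0.5
```

With the objects from the question, this would be called as `mse(dTest$y, predict(forest, newdata = dTest))` and `r_squared(dTest$y, predict(forest, newdata = dTest))`, and likewise with `newdata = dTraining` for the training set.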

Do I exclude data used in a training set when running predict() on a model?

I am very new to machine learning. I have a question about running predict on data used for training set.
Here are the details: I took a portion of my initial dataset and split that portion into 80% (train) and 20% (test). I trained the model on the 80% training set:
model <- train(name ~ ., data = train.df, method = ...)
and then ran the model on the 20% test data:
predict(model, newdata = test.df, type = "prob")
Now I want to predict using my trained model on initial dataset which also includes the training portion. Do I need to exclude that portion that was used for the training?
When you report to a third person how well your machine learning model works, you always report the accuracy you get on the data set that was not used in training (and validation).
You can report your accuracy numbers for the overall data set, but always include the remark that this data set also includes the data partition that was used for training the machine learning algorithm.
This care is taken to make sure your algorithm has not overfitted on your training set: https://en.wikipedia.org/wiki/Overfitting
Julie, I saw your comment below your original post. I would suggest you edit the original post and include your data split to be more complete in your question. It would also help to know what method of regression/classification you're using.
I'm assuming you're trying to assess the accuracy of your model with the 90% of data you left out. Depending on the number of samples you used in your training set you may or may not have the accuracy you'd like. Accuracy will also depend on your approach to the method of regression/classification you used.
To answer your question directly: you don't need to exclude anything from your dataset - the model doesn't change when you call predict().
All you're doing when you call predict is filling in the x-variables in your model with whatever data you supply. Your model was fitted to your training set, so if you supply training set data again it will still create predictions. Note though, for proving accuracy your results will be skewed if you include the set of data that you fit the model to since that's what it learned from to create predictions in the first place - kind of like watching a game, and then watching the same game again and being asked to make predictions about it.
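The point above can be sketched in base R. This is an illustration under stated assumptions, not the asker's caret workflow: the data are simulated, and `glm` stands in for whatever `method` was passed to `train()`. The idea is to keep the training row indices, predict on everything, and compute accuracy separately for the rows the model has and has not seen.

```r
set.seed(1)
# Made-up two-class data standing in for the asker's dataset.
n  <- 500
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$label <- factor(ifelse(df$x1 + df$x2 + rnorm(n) > 0, "yes", "no"))

# 80/20 split; keep the row indices so the training rows can be identified later.
train_idx <- sample(n, size = 0.8 * n)
model <- glm(label ~ x1 + x2, data = df[train_idx, ], family = binomial)

# predict() works on any rows, including the ones used for training.
prob_all <- predict(model, newdata = df, type = "response")  # P(label == "yes")
pred_all <- ifelse(prob_all > 0.5, "yes", "no")

# Only the rows outside train_idx give an honest accuracy estimate.
is_train    <- seq_len(n) %in% train_idx
acc_train   <- mean(pred_all[is_train]  == df$label[is_train])
acc_heldout <- mean(pred_all[!is_train] == df$label[!is_train])
c(train = acc_train, heldout = acc_heldout)
```

Predictions for all rows are available in `pred_all`, but only `acc_heldout` should be reported as the model's accuracy.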

class importance for random forest in r

I'm using the randomForest package in R to predict a binary class from 11 numerical predictors. Of the two classes, Hit and Miss, the class Hit is more important, i.e. I would like to know how many times Hit is predicted correctly.
Is there a way to give Hit a higher importance when training the random forest? Currently the trained random forest predicts merely 7% of the Hit cases correctly, and I would definitely like an improvement.
Higher importance? I don't know how to tell any algorithm "I'm not kidding this time: I want this analysis to be accurate."
You're always fighting the variance versus bias battle. If you improve the training accuracy too much, you run the risk of overfitting.
You can adjust random forest by varying the size of the random sample of predictors. If you have m predictors, the usual recommendation for classification is p = m^(1/2) predictors sampled as candidates at each split (the mtry argument in randomForest). You can also vary the number of trees. Plot the test classification error versus the number of trees for different values of p to see how you do.
You can also try another algorithm, like gbm (Generalized Boosted Regression Models) or support vector machines.
How does your data look when you plot it? Any obvious groups jumping out at you when you look at them in scatterplots?
Regardless of algorithm, I'd advise that you do n-fold cross-validation of your model.
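A minimal sketch of n-fold cross-validation in base R follows. The data are simulated, the class names only mirror the question, and a logistic regression (`glm`) is used as a stand-in classifier so the example runs without extra packages; any model with a `predict` method could be substituted inside the loop.

```r
set.seed(7)
# Made-up binary data; "Hit"/"Miss" mirror the class names in the question.
n  <- 300
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$class <- factor(ifelse(df$x1 - df$x2 + rnorm(n) > 1, "Hit", "Miss"),
                   levels = c("Miss", "Hit"))

k <- 5
# Randomly assign each row to one of k folds of equal size.
fold <- sample(rep(seq_len(k), length.out = n))

cv_error <- sapply(seq_len(k), function(i) {
  train <- df[fold != i, ]   # fit on the other k - 1 folds
  test  <- df[fold == i, ]   # evaluate on the held-out fold
  fit   <- glm(class ~ x1 + x2, data = train, family = binomial)
  pred  <- ifelse(predict(fit, newdata = test, type = "response") > 0.5,
                  "Hit", "Miss")
  mean(pred != test$class)   # misclassification rate on the held-out fold
})

cv_error        # one error rate per fold
mean(cv_error)  # cross-validated estimate of the test error
```

Every row is used for testing exactly once, so the averaged error is less dependent on one particular train/test split.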

Resources