Do I exclude data used in the training set when running predict()? - r

I am very new to machine learning. I have a question about running predict() on data that was used in the training set.
Here are the details: I took a portion of my initial dataset and split that portion into 80% (train) and 20% (test). I trained the model on the 80% training split:
model <- train(name ~ ., data = train.df, method = ...)
and then ran the model on the 20% test data:
predict(model, newdata = test.df, type = "prob")
Now I want to use my trained model to predict on the initial dataset, which also includes the training portion. Do I need to exclude the portion that was used for training?

When you report to a third party how well your machine learning model works, you always report the accuracy obtained on the data that was not used for training (or validation).
You can report accuracy numbers for the overall dataset, but always include the remark that this dataset also contains the partition that was used to train the algorithm.
This care is taken to make sure your algorithm has not overfit your training set: https://en.wikipedia.org/wiki/Overfitting

Julie, I saw your comment below your original post. I would suggest you edit the original post and include your data split to be more complete in your question. It would also help to know what method of regression/classification you're using.
I'm assuming you're trying to assess the accuracy of your model with the 90% of data you left out. Depending on the number of samples in your training set, you may or may not get the accuracy you'd like. Accuracy will also depend on the regression/classification method you used.
To answer your question directly: you don't need to exclude anything from your dataset - the model doesn't change when you call predict().
All you're doing when you call predict is filling in the x-variables of your model with whatever data you supply. Your model was fitted to your training set, so if you supply training set data again it will still produce predictions. Note, though, that for assessing accuracy your results will be skewed if you include the data the model was fit to, since that's what it learned from in the first place - kind of like watching a game, then watching the same game again and being asked to make predictions about it.
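For illustration, here is a minimal sketch with caret and the built-in iris data (iris, the Species outcome and the rpart method are stand-ins for your own data frame, outcome column and method):

library(caret)

set.seed(42)
idx      <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train.df <- iris[idx, ]
test.df  <- iris[-idx, ]

model <- train(Species ~ ., data = train.df, method = "rpart")

# Report accuracy on the held-out 20% only
test.pred <- predict(model, newdata = test.df)
confusionMatrix(test.pred, test.df$Species)

# Predicting on the full dataset still works (the model is unchanged),
# but the resulting "accuracy" is optimistic, because 80% of these rows
# were seen during training.
full.prob <- predict(model, newdata = iris, type = "prob")
head(full.prob)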

Related

Changing coefficients in logistic regression

I will try to explain my problem as best as I can. I am trying to externally validate a prediction model made by one of my colleagues. For the external validation, I have collected data from a set of new patients.
I want to test the accuracy of the prediction model on this new dataset. Online I have found a way to do so, using the coef.orig function to extract the coefficients of the original prediction model (see the picture I added). Here is the problem: it is impossible for me to repeat the steps my colleague took to obtain the original prediction model. He used multiple imputation and bootstrapping for the development and internal validation, making his steps very complex to reproduce. What I do have is the computed intercept and coefficients from the original model. Step 1 from the picture I added could therefore be skipped.
My question is: how can I plug these coefficients into the regression model without using the coef() function?
Steps to externally validate the prediction model: [image in original post]
The coefficients I need to use: [image in original post]
I thought the offset function might be of use; however, it does not allow me to set the intercept and all of the variable coefficients at the same time.
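One common workaround (a hedged sketch - the coefficient values, the variable names age/sex/marker, and the data frame new.patients are placeholders for your own published model and data) is to build the linear predictor directly from the fixed coefficients and then use it as an offset:

# Published intercept and coefficients (placeholder values)
intercept <- -2.31
betas     <- c(age = 0.045, sex = 0.62, marker = 1.10)

# Linear predictor for the new patients, computed from the fixed coefficients
lp        <- drop(intercept + as.matrix(new.patients[, names(betas)]) %*% betas)
pred.prob <- plogis(lp)   # predicted risk per patient (inverse logit)

# Calibration of the original model on the new data:
# calibration-in-the-large (only an intercept, linear predictor as offset)
cal.large <- glm(outcome ~ offset(lp), family = binomial, data = new.patients)
# calibration slope (coefficient of the linear predictor)
cal.slope <- glm(outcome ~ lp, family = binomial, data = new.patients)

This way the original coefficients are never re-estimated; the glm calls only quantify how well they transfer to the new patients.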

how to select the best dataset for training a model

I want to create the best training sample from a given set of data points by running all possible combinations of train and test through a model and selecting based on the best R2.
I do not want to run the model with all possible combinations; rather, I want to select something like a stratified set each time and run the model. Is there a way to do this in R?
sample dataset
df1 <- data.frame(
  sno = 1:30,
  x1 = c(14.3, 14.8, 14.8, 15, 15.1, 15.1, 15.4, 15.4, 16.1, 14.3, 14.8, 14.8, 15.2, 15.1, 15.1, 15.4, 15.4, 16.1, 14.2, 14.8, 14.7, 15.1, 15, 15, 15.3, 15.3, 15.9, 15.1, 15, 15.3),
  y1 = c(79.2, 78.7, 79, 78.2, 78.7, 79.1, 78.4, 78.7, 78.1, 79.2, 78.7, 79, 78.2, 78.6, 79.2, 78.4, 78.7, 78.1, 79.1, 78.5, 78.9, 78, 78.5, 79, 78.2, 78.5, 78, 79.2, 78.7, 78.7),
  z1 = c(219.8, 221.6, 232.5, 213.1, 231, 247.6, 230.2, 240.9, 245.5, 122.8, 124.2, 131.5, 119.1, 130.5, 141.1, 130.8, 137.7, 140.8, 25.4, 30.5, 30.5, 23.8, 29.6, 34.6, 29.5, 33.3, 35.2, 105, 170.7, 117.3)
)
This defeats the purpose of training. Ideally, you have one or more training datasets and an untouched testing data set you'll ultimately test against once your model is fit. Cherry-picking a training dataset, using R-squared or any other metric for that matter, will introduce bias. Worse still, if your model parameters are wildly different depending on which training set you use, your model likely isn't very good and results against your testing dataset are likely to be spurious.
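If what you want is stratified resampling (rather than hunting for the split that maximizes R-squared), a hedged sketch with caret could look like this; the lm formula on the sample data is only illustrative, and the point is to look at the spread of R2 across splits instead of picking the best one:

library(caret)

set.seed(1)
# createDataPartition() stratifies on the outcome (a numeric outcome is
# binned into quantile groups), so each split keeps a similar distribution
splits <- createDataPartition(df1$y1, p = 0.8, times = 5, list = TRUE)

r2 <- sapply(splits, function(idx) {
  fit  <- lm(y1 ~ x1 + z1, data = df1[idx, ])
  pred <- predict(fit, newdata = df1[-idx, ])
  cor(pred, df1$y1[-idx])^2   # R-squared on the held-out rows
})

summary(r2)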

Machine Learning Keras accuracy model vs accuracy new data prediction

I built a deep learning model using keras. The model accuracy is 99%.
$`loss`
[1] 0.03411416
$acc
[1] 0.9952607
When I predict classes on my new data file using the model, only 87% of the classes are correctly classified. My question is: why is there a difference between the model accuracy and the prediction score?
Your 99% is on the training set; it is an indicator of how your algorithm is performing while training, and you should never use it as a reference.
You should always look at your test set; that is the value that really matters.
Furthermore, your accuracy curves should generally look like this (at least in shape):
e.g. the training set accuracy keeps growing, and the test set accuracy follows the same trend but stays below the training curve.
You will almost never have two identical sets (training and testing/validation), so it is normal to have a difference.
The objective of the training set is for the model to learn from the data and generalize.
The objective of the testing set is to see whether it generalized well.
If your test accuracy is far below your training accuracy, either the two sets differ a lot (mostly in distribution, data types, etc.), or, if they are similar, your model overfits (meaning it sticks too closely to the training data, so even a small difference in the testing data leads to wrong predictions).
The reason a model overfits is often that it is too complicated, and you must simplify it (e.g. reduce the number of layers or the number of neurons).
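A hedged keras (R) sketch of this workflow; x_train/y_train and x_new/y_new are placeholders for your own matrices, and the layer sizes are arbitrary. The point is to watch validation accuracy during training and to judge the model by evaluate() on data it has never seen:

library(keras)

model <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = ncol(x_train)) %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(optimizer = "adam", loss = "binary_crossentropy",
                  metrics = "accuracy")

history <- model %>% fit(x_train, y_train, epochs = 30,
                         validation_split = 0.2)   # held-out slice monitored during training

model %>% evaluate(x_new, y_new)   # the number comparable to your 87%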

rpart: Is training data required

I have a problem understanding some basics, so I'm stuck with a regression tree.
I use a classification tree built with rpart to check the influence of environmental parameters on a tree growth factor I measured.
Long story short:
What is the purpose of splitting data into training and test sets, and (when) do I need it? My searches showed examples that either do it or don't, but I can't find the reasoning behind it. Is it just to verify the pruning?
Thank you ahead!
You need to split into training and test data before training the model. The training data helps the model learn, while the test data helps validate the model.
The split is done before running the model, and the model must be retrained whenever there is fine-tuning or a change.
As you might know, the general process for post-pruning is the following:
1) Split the data into training and test (validation) sets.
2) Build the decision tree from the training set.
3) For every non-leaf node N, prune the subtree rooted at N and replace it with the majority class, then test accuracy on a validation set. This validation set could be the one defined before, or not.
This all means that you are probably on the right track and that yes, the whole dataset has probably been used to test the accuracy of the pruning.
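A hedged sketch with rpart and the built-in iris data (a stand-in for your growth-factor data): split, grow the tree, prune with the cp that minimizes cross-validated error, then check accuracy on the held-out rows.

library(rpart)

set.seed(7)
idx   <- sample(nrow(iris), 0.7 * nrow(iris))
train <- iris[idx, ]
test  <- iris[-idx, ]

fit <- rpart(Species ~ ., data = train, method = "class")

# pick the complexity parameter with the lowest cross-validated error
best.cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best.cp)

pred <- predict(pruned, newdata = test, type = "class")
mean(pred == test$Species)   # accuracy on data the tree never saw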

Struggling to understand the complete predictive model process in R

I'm very new to all this and I have a bit of a mental block on the logic of the process. I am trying to predict customer churn using a database of current and already churned customers. So far I have:
1) Taken the complete customer database of current and already churned customers, along with customer service variables etc. to predict on.
2) Split the data set randomly 70/30 into train and test.
3) Using R, trained a random forest model to make predictions and then compared them to the actual status using a confusion matrix.
4) Run that model on the test data to check accuracy for identifying the churners.
I'm now a bit confused. What I want to do now is take all of our current customers and predict which ones will churn. Have I done this all wrong, since a lot of the current customers I need predictions for have already been seen by the model because they appear in the training set?
Was I somehow supposed to use a training and test set that will not be part of the dataset I need to make predictions on?
Many thanks for any help.
As far as I understand your question, you want to know whether you've done the right thing by having overlapping examples in your training and test sets. You need to keep the training set separate from the test set. Since the model parameters were computed from the training set, the model will give the correct prediction for examples it has already seen, so accuracy will be positively biased for those common examples, and that is not the correct way to evaluate. Your test set should always contain previously unseen examples in order to properly evaluate the performance of your algorithm.
If the current customers (on which you want to test your model) are already in the training set, you should leave them out of the testing process. I'd suggest you compare the training-set customers against the current set of customers using a unique identifier (if present), such as the Customer ID, and leave the common customers out of your fresh batch of unseen test examples.
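A hedged sketch of that check, assuming both data frames have a CustomerID column (hypothetical name):

library(dplyr)

# keep only the current customers that do not appear in the training set
unseen.customers <- anti_join(current.customers, train.df, by = "CustomerID")

churn.prob <- predict(model, newdata = unseen.customers, type = "prob")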
It looks to me like you have the standard training-test-validation set problem. If I understood correctly, you want to test the performance of your model (random forest) on all the data you have.
The standard classroom way to do this is indeed what you already did: split the dataset, for example into a 70% training set and a 30% test/validation set, train the model on the training set, and test it on the test set.
A better way to test (and to get predictions for all of the data) is to use cross-validation (https://en.wikipedia.org/wiki/Cross-validation_(statistics)). One example is 10-fold cross-validation: you split your data into 10 equal-size blocks, loop over the blocks, and in every iteration use the remaining 9 blocks to train the model and test it on the held-out block.
What you end up with is a more comprehensive picture of your model's performance, as well as predictions for all of the customers in your database. Cross-validation mitigates errors caused by the random selection of a single test set.
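A hedged caret sketch of 10-fold cross-validation; churn.df, the churn column and the rf method are placeholders for your own data and model:

library(caret)

ctrl  <- trainControl(method = "cv", number = 10, savePredictions = "final")
model <- train(churn ~ ., data = churn.df, method = "rf", trControl = ctrl)

model$resample   # accuracy per fold
head(model$pred) # out-of-fold predictions, covering every customer once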
Hope this helps!
