How to select the best dataset for training a model in R
I want to create the best training sample from a given set of data points by running all possible combinations of train and test splits through a model and selecting the split with the best R².
I don't want to run the model on every possible combination; instead I'd like to draw something like a stratified sample each time and fit the model on that. Is there a way to do this in R?
Sample dataset:
df1 <- data.frame(
  sno = 1:30,
  x1  = c(14.3,14.8,14.8,15,15.1,15.1,15.4,15.4,16.1,14.3,14.8,14.8,15.2,15.1,15.1,15.4,15.4,16.1,14.2,14.8,14.7,15.1,15,15,15.3,15.3,15.9,15.1,15,15.3),
  y1  = c(79.2,78.7,79,78.2,78.7,79.1,78.4,78.7,78.1,79.2,78.7,79,78.2,78.6,79.2,78.4,78.7,78.1,79.1,78.5,78.9,78,78.5,79,78.2,78.5,78,79.2,78.7,78.7),
  z1  = c(219.8,221.6,232.5,213.1,231,247.6,230.2,240.9,245.5,122.8,124.2,131.5,119.1,130.5,141.1,130.8,137.7,140.8,25.4,30.5,30.5,23.8,29.6,34.6,29.5,33.3,35.2,105,170.7,117.3)
)
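For reference, a minimal sketch of the loop being asked about, assuming a simple lm(y1 ~ x1 + z1) fit (the question does not say which model is used); caret::createDataPartition() draws splits stratified on the outcome, and the split with the highest held-out R² is picked at the end. The answer below explains why selecting a split this way is problematic.

library(caret)   # createDataPartition() stratifies on (binned) y1

set.seed(42)
n_splits <- 20
r2 <- numeric(n_splits)

for (i in seq_len(n_splits)) {
  idx   <- createDataPartition(df1$y1, p = 0.7, list = FALSE)  # stratified 70/30 split
  train <- df1[idx, ]
  test  <- df1[-idx, ]

  fit  <- lm(y1 ~ x1 + z1, data = train)    # placeholder model
  pred <- predict(fit, newdata = test)

  # R-squared on the held-out 30%
  r2[i] <- 1 - sum((test$y1 - pred)^2) / sum((test$y1 - mean(test$y1))^2)
}

which.max(r2)   # the "best" split by held-out R-squared
max(r2)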
This defeats the purpose of training. Ideally, you have one or more training datasets and an untouched testing data set you'll ultimately test against once your model is fit. Cherry-picking a training dataset, using R-squared or any other metric for that matter, will introduce bias. Worse still, if your model parameters are wildly different depending on which training set you use, your model likely isn't very good and results against your testing dataset are likely to be spurious.
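If the goal is simply a sound workflow in R rather than picking a winning split, a sketch with the sample data above (again assuming lm, which the question does not specify) would be: split once with stratification, do any tuning or model selection by cross-validation inside the training set only, and touch the test set exactly once at the end.

library(caret)

set.seed(42)
idx      <- createDataPartition(df1$y1, p = 0.8, list = FALSE)
train_df <- df1[idx, -1]    # drop the sno id column
test_df  <- df1[-idx, -1]

ctrl <- trainControl(method = "cv", number = 5)   # resampling inside the training set only
fit  <- train(y1 ~ x1 + z1, data = train_df, method = "lm", trControl = ctrl)

# single, final check on the untouched test set
postResample(pred = predict(fit, newdata = test_df), obs = test_df$y1)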
Related
Missing value handling with imputation in a nested resampling procedure such that there is no information bleed from train to test
I am looking through the documentation for the nested resampling procedure in the mlr3tuning package and I do not see any way for the package to handle NA values such that information bleed between the training and hold-out sets is avoided (which would otherwise result in overly optimistic performance stats). I would ideally like a way to split my data in a nested resampling procedure such that:

full_data = N
train = N - holdout
test = holdout

Then I could perform imputation on the train and test datasets separately, run the model on train, predict on test, then select a new holdout and train set from the full dataset, run the imputation on them separately, and train, predict, repeat for the number of outer loops. Is there a way of doing this? Am I missing something obvious?
mlr3 handles all of this for you if you use pipelines (see the relevant part of the mlr3 book). If you make imputation part of such a pipeline, it is trained and applied appropriately on each train/test split, just like the model itself. Briefly, as an explanation: just as with the machine learning model, you don't want to make any adjustments based on the test set; in particular you shouldn't impute based on test data. Doing so causes the same kind of problem as it does with a model, i.e. biased evaluation results that may not be representative of the true generalization error.
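For illustration, a minimal sketch of what that looks like, using the built-in pima task (which has missing values) and rpart purely as a placeholder learner; the imputation pipe-op is refitted on each training split and only applied to the corresponding test split.

library(mlr3)
library(mlr3pipelines)

# imputation + model bundled into one learner
graph   <- po("imputemedian") %>>% lrn("classif.rpart")
learner <- as_learner(graph)

task <- tsk("pima")                          # classification task with NAs
rr   <- resample(task, learner, rsmp("cv", folds = 5))
rr$aggregate(msr("classif.ce"))              # cross-validated classification error

# For nested resampling, the same GraphLearner can be wrapped in an
# mlr3tuning AutoTuner and resampled in exactly the same way.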
R package (`caret`?) for nested time series cross validation
I'm using caret's timeslice method to do step-ahead cross validation on time series data. I was surprised to find that:

- the 'best' hyperparameters chosen by caret are those with the best average performance across all train/test splits,
- similarly, the reported performance is the average across all train/test splits based on these hyperparam values, and
- caret trains a final model using all of the data available, which makes sense when fixedWindow = TRUE but perhaps not otherwise.

My data are non-stationary, so I'd like hyperparameter tuning, performance reporting and final model training to be based on recent data so that:

- optimal hyperparameter values can change as underlying relationships change,
- reported performance can account for the fact that the best hyperparam values may change across splits, and
- recent changes in underlying relationships are best captured when training the final model.

I'm wondering if the best approach for my non-stationary data would follow an approach something like:

1) Split each training fold into a training subset and validation subset, using the validation subset to pick hyperparam values within each training fold.
2) Train on the entire training fold using the hyperparam values selected in (1).
3) Report performance based on whatever hyperparameter values were selected in (1), even though these may change from fold to fold.

The final model is trained, and hyperparameter values selected, based on steps (1) and (2) using the most recent data only.

Having typed this up I've realised that I'm just describing nested CV for time series. My questions: is this best practice for training time series models when data are non-stationary? Can caret, or another R package, do this?
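No answer is recorded for this one. Purely as a reference for the mechanics, here is a sketch of how caret's rolling-origin splits are configured (synthetic data, illustrative window sizes, and lm standing in for whatever model is being tuned); the per-fold validation subset described in steps (1)-(3) doesn't appear to be something trainControl() offers directly, so it would have to be coded around this.

library(caret)

set.seed(1)
# synthetic series standing in for the real (non-stationary) data
my_ts_df <- data.frame(
  y  = as.numeric(arima.sim(list(ar = 0.7), n = 200)) + seq(0, 2, length.out = 200),
  x1 = rnorm(200),
  x2 = rnorm(200)
)

ctrl <- trainControl(
  method        = "timeslice",
  initialWindow = 100,    # size of each training window (illustrative)
  horizon       = 4,      # forecast 4 steps ahead in each split
  fixedWindow   = TRUE,   # rolling window, so the oldest observations drop out
  skip          = 3       # thin the number of splits
)

fit <- train(y ~ x1 + x2, data = my_ts_df, method = "lm", trControl = ctrl)
fit$results    # performance averaged over the rolling-origin splits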
How to do cross-validation in R using neuralnet?
I'm trying to build a predictive model using the neuralnet package. First I'm splitting my dataset into training (80%) and test (20%) sets. But an ANN is such a powerful technique that my model easily overfits the training set and performs poorly on the external test set. (Plot: predicted vs. true values; the training set is on the right and the test set on the left.) Is there a way to do cross-validation on the training set so that my model doesn't overfit? How may I do this with my own built-in function? Also, are there any other approaches when dealing with deep learning? I've heard you can tweak the weights of the model in order to improve its quality on external data. Thanks in advance!
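No answer is recorded here either. One common way to do this is to write the k-fold loop by hand around neuralnet(); the toy data below stands in for the poster's unshown training set. Note that cross-validation only measures overfitting; actually reducing it comes from things like a smaller hidden layer, regularization, or early stopping.

library(neuralnet)

set.seed(1)
# toy regression data in place of the real training set
dat <- data.frame(x1 = runif(200), x2 = runif(200))
dat$y <- 2 * dat$x1 - dat$x2 + rnorm(200, sd = 0.1)

k      <- 5
folds  <- sample(rep(1:k, length.out = nrow(dat)))
cv_mse <- numeric(k)

for (i in 1:k) {
  tr <- dat[folds != i, ]
  va <- dat[folds == i, ]

  nn   <- neuralnet(y ~ x1 + x2, data = tr, hidden = 3, linear.output = TRUE)
  pred <- compute(nn, va[, c("x1", "x2")])$net.result

  cv_mse[i] <- mean((va$y - pred)^2)
}

mean(cv_mse)   # average validation error across the folds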
Do I exclude data used in the training set when running predict() on the model?
I am very new to machine learning. I have a question about running predict() on data that was used in the training set. Here are the details: I took a portion of my initial dataset and split that portion into 80% (train) and 20% (test). I trained the model on the 80% training set, model <- train(name ~ ., data = train.df, method = ...), and then ran the model on the 20% test data: predict(model, newdata = test.df, type = "prob"). Now I want to predict using my trained model on the initial dataset, which also includes the training portion. Do I need to exclude the portion that was used for training?
When you report accuracy to a third party about how well your machine learning model works, you always report the accuracy obtained on the data that was not used in training (and validation). You can report accuracy numbers for the overall dataset, but always include the remark that this dataset also contains the partition that was used to train the machine learning algorithm. This care is taken to make sure your algorithm has not overfitted to your training set: https://en.wikipedia.org/wiki/Overfitting
Julie, I saw your comment below your original post. I would suggest you edit the original post and include your data split to make your question more complete. It would also help to know what method of regression/classification you're using. I'm assuming you're trying to assess the accuracy of your model with the 90% of data you left out. Depending on the number of samples in your training set, you may or may not have the accuracy you'd like; accuracy will also depend on the regression/classification method you used.

To answer your question directly: you don't need to exclude anything from your dataset - the model doesn't change when you call predict(). All you're doing when you call predict() is filling in the x-variables of your model with whatever data you supply. Your model was fitted to your training set, so if you supply training-set data again it will still produce predictions. Note, though, that for proving accuracy your results will be skewed if you include the data the model was fitted to, since that's what it learned from to create predictions in the first place - kind of like watching a game, and then watching the same game again and being asked to make predictions about it.
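A small illustration of that point, with iris standing in for the poster's data and rpart chosen arbitrarily as the method: predicting on the full dataset works mechanically, but only the held-out rows give an honest accuracy estimate.

library(caret)

set.seed(1)
idx      <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train.df <- iris[idx, ]
test.df  <- iris[-idx, ]

model <- train(Species ~ ., data = train.df, method = "rpart")

full_pred <- predict(model, newdata = iris)   # predicting on everything is fine mechanically

mean(predict(model, newdata = train.df) == train.df$Species)  # optimistic: the model has seen these rows
mean(predict(model, newdata = test.df)  == test.df$Species)   # honest estimate on unseen rows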
Struggling to understand the complete predictive model process in R
I'm very new to all this and I have a bit of a mental block on the logic of the process. I am trying to predict customer churn using a database of current and already-churned customers. So far I have:

1) Taken the complete customer database of current customers and already-churned customers, along with customer service variables etc., to predict on.
2) Split the data set randomly 70/30 into train and test.
3) Using R, trained a random forest model to make predictions and then compared them to the actual status using a confusion matrix.
4) Run that model on the test data to check its accuracy at identifying the churners.

I'm now a bit confused. What I want to do now is take all of our current customers and predict which ones will churn. Have I done this all wrong, since a lot of the current customers I need to predict churn for have already been seen by the model because they appear in the training set? Was I somehow supposed to use a training and test set that is not part of the dataset I need to make predictions on? Many thanks for any help.
As far as I understand your question, you want to know whether you've done the right thing by having overlapping examples in your training and test sets. First, understand that your training set needs to be kept separate from your test set. Since your model parameters were computed from the training set, the model will give correct predictions for similar examples in the test set, so your accuracy will definitely be positively biased for those common training and test set examples - but that is not the correct thing to do. Your test set should always contain previously unseen examples in order to properly evaluate the performance of your algorithm.

If the current customers on which you want to test your model are already in the training set, you should leave them out of the testing process. I'd suggest you check the training-set customers against the current set of customers using a unique identifier (if present), such as the Customer ID, and leave common customers out of your fresh batch of unseen test examples.
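A toy sketch of that check, assuming a customer_id column as the unique identifier and randomForest as in the question; dplyr::anti_join() drops any current customer who already appears in the training set before scoring.

library(dplyr)
library(randomForest)

set.seed(1)
# toy stand-in for the customer database
customers <- data.frame(
  customer_id = 1:1000,
  tenure      = runif(1000, 1, 60),
  complaints  = rpois(1000, 1),
  churned     = factor(sample(c("yes", "no"), 1000, replace = TRUE, prob = c(0.3, 0.7)))
)

idx       <- sample(nrow(customers), 0.7 * nrow(customers))
train_set <- customers[idx, ]

rf <- randomForest(churned ~ tenure + complaints, data = train_set)

# score only the customers the model has never seen
current <- customers                                    # stand-in for "all current customers"
unseen  <- anti_join(current, train_set, by = "customer_id")
unseen$churn_prob <- predict(rf, newdata = unseen, type = "prob")[, "yes"]
head(unseen)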
It looks to me like you have the standard training/test/validation set problem. If I understood correctly, you want to test the performance of your model (random forest) on all the data you have. The standard classroom way to do this is indeed what you already did: split the dataset, for example 70% training and 30% test/validation, train the model with the training set, and test with the test set.

A better way to test (and to get predictions for all of the data) is to use cross-validation (https://en.wikipedia.org/wiki/Cross-validation_(statistics)). One example is 10-fold cross-validation: you split your data into 10 equal-size blocks, loop over all the blocks, and in every iteration use the remaining 9 blocks to train your model and then test it on the held-out block. What you end up with from cross-validation is more comprehensive knowledge of the performance of your model, as well as results for all of the customers in your database. Cross-validation also mitigates errors in the analysis due to the random selection of the test set. Hope this helps!
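And a hand-rolled sketch of that 10-fold scheme (randomForest, toy churn data), which leaves every customer with a prediction from a model that never saw them:

library(randomForest)

set.seed(1)
# toy churn data standing in for the customer database
customers <- data.frame(tenure = runif(500, 1, 60), complaints = rpois(500, 1))
customers$churned <- factor(ifelse(customers$tenure - 5 * customers$complaints +
                                     rnorm(500, sd = 10) < 20, "yes", "no"))

k     <- 10
folds <- sample(rep(1:k, length.out = nrow(customers)))
oof   <- factor(rep(NA, nrow(customers)), levels = levels(customers$churned))

for (i in 1:k) {
  rf <- randomForest(churned ~ tenure + complaints, data = customers[folds != i, ])
  oof[folds == i] <- predict(rf, newdata = customers[folds == i, ])
}

mean(oof == customers$churned)                        # cross-validated accuracy over the whole database
table(predicted = oof, actual = customers$churned)    # confusion matrix for every customer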