Custom performance summary functions in caret that use original data values - r

I am using caret to train a classification model on a dataset, from which I take the outputted class probabilities and feed them into another set of calculations. Specifically, I sum the class probabilities along one dimension of the original data and use that to calculate my eventual summary statistic describing the model quality of fit.
I currently have to train the model to maximize some other metric (I'm using Kappa right now), but what I really want to do is to write a summaryFunction to pass to trainControl that will encapsulate the entire calculation from start-to-finish.
The problem is that this summaryFunction would require the original data points, since I have to aggregate the class probabilities along a dimension of the original data in order to calculate the summary statistic. The summaryFunction prototype doesn't seem to provide the data itself in any way that I can see.
Is there a simple solution here? I suppose I could just make the original data frame global and use its values in my summaryFunction, provided that the class probabilities and predictions passed to the summaryFunction have the same number of rows, and the same row order, as the original data set?
Thank you very much!!
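In case it helps, here is a minimal sketch of the global-data-frame workaround described above, written against caret's summaryFunction interface. It assumes your caret version includes a rowIndex column in the data frame handed to the summary function (worth verifying for your version); orig_df, the grouping column group, and the final statistic are placeholders for your own data and calculation.

library(caret)

orig_df <- my_training_data   # hypothetical: the full original data frame

groupedSummary <- function(data, lev = NULL, model = NULL) {
  # class probabilities of the held-out rows; lev[1] is the first class level
  probs <- data[, lev[1]]

  # map each held-out row back to the original data via rowIndex
  # (check that your caret version supplies this column)
  groups <- orig_df$group[data$rowIndex]

  # sum the class probabilities along the chosen dimension of the original data
  agg <- tapply(probs, groups, sum)

  # placeholder quality-of-fit statistic computed from the aggregates
  c(myStat = -sum((agg - mean(agg))^2))
}

ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE,
                     summaryFunction = groupedSummary)

# train(..., trControl = ctrl, metric = "myStat", maximize = TRUE)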

Related

R package (`caret`?) for nested time series cross validation

I'm using caret's timeslice method to do step-ahead cross validation on time series data. I was surprised to find that:
the 'best' hyperparameters chosen by caret are those with the best average performance across all train/test splits, and similarly the reported performance is the average across all train/test splits based on these hyperparam values, and
caret trains a final model using all of the data available - which makes sense when fixedWindow = TRUE but perhaps not otherwise.
My data are non-stationary, so I'd like hyperparameter tuning, performance reporting and final model training to be based on recent data so that:
optimal hyperparameter values can change as underlying relationships change
reported performance can account for the fact that the best hyperparam values may change across splits, and
recent changes in underlying relationships are best captured when training the final model.
I'm wondering if the best approach for my non-stationary data would follow an approach something like:
split each training fold into a training subset and validation subset - using the validation subset to pick hyperparam values within each training fold.
train on the entire training fold using the hyperparam values selected in (1)
report performance based on whatever hyperparameter values were selected in (1), even though these may change from fold to fold
The final model is trained, and hyperparameter values selected, based on steps (1) and (2) using the most recent data only.
Having typed this up I've realised that I'm just describing nested CV for time series. My questions:
is this best practice for training time series models when data are non-stationary?
can caret, or another R package, do this?
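In case it helps, a rough sketch of building this by hand around caret's createTimeSlices: outer rolling-origin splits for reporting, and an inner timeslice trainControl that re-tunes the hyperparameters inside every outer training window. The window sizes, glmnet as the model, and tuneLength are placeholders, not recommendations.

library(caret)

# toy data as a stand-in for your series
set.seed(1)
dat <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
dat$y <- 0.5 * dat$x1 - 0.3 * dat$x2 + rnorm(300)

# outer rolling-origin splits: these are the folds you report on
outer <- createTimeSlices(dat$y, initialWindow = 200, horizon = 20,
                          fixedWindow = TRUE, skip = 19)

outer_rmse <- numeric(length(outer$train))
for (i in seq_along(outer$train)) {
  tr <- dat[outer$train[[i]], ]
  te <- dat[outer$test[[i]], ]

  # inner time-slice CV on the outer training window only, so the
  # hyperparameters are re-selected for every outer fold
  inner_ctrl <- trainControl(method = "timeslice",
                             initialWindow = 150, horizon = 10,
                             fixedWindow = TRUE, skip = 9)

  fit <- train(y ~ ., data = tr, method = "glmnet",
               trControl = inner_ctrl, tuneLength = 5)

  # evaluate the re-tuned fit on the untouched outer test window
  outer_rmse[i] <- RMSE(predict(fit, te), te$y)
}

mean(outer_rmse)   # nested-CV performance estimate across outer folds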

R H20 - Cross-validation with stratified sampling and non i.i.d. rows

I'm using H2O to analyse a dataset, but I'm not sure how to correctly perform cross-validation on it. The dataset is unbalanced, so I would like to perform stratified cross-validation (where the output variable is used to balance the groups in each partition).
On top of that, many of my rows are repeats (a way of implementing weights without actually having weights). Regardless of the source of this issue, I have seen before that, in some cases, you can do cross-validation where certain rows have to be kept together; this seems to be what fold_column is for. Does that mean it is not possible to do both at the same time?
If there is no H2O solution, how can I compute the folds a priori and use them in H2O?
Based on the H2O-3 docs, this can't be done:
Note that all three options are only suitable for datasets that are i.i.d. If the dataset requires custom grouping to perform meaningful cross-validation, then a fold_column should be created and provided instead.
One quick idea is to use weights_column instead of duplicating rows. Both balance_classes and weights_column are then available together as parameters in GBM, DRF, Deep Learning, GLM, Naïve Bayes, and AutoML.
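A rough sketch of that idea, assuming dat_deduped is the de-duplicated data with a hypothetical weight column holding the former repeat counts (the predictor vector and the response name "y" are placeholders too):

library(h2o)
h2o.init()

hf <- as.h2o(dat_deduped)   # de-duplicated data; "weight" replaces row duplication

fit <- h2o.gbm(x = predictors, y = "y",
               training_frame = hf,
               weights_column = "weight",
               balance_classes = TRUE,
               nfolds = 5,
               fold_assignment = "Stratified")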
Otherwise, I suggest the following workflow, performed in R or H2O on your data, to achieve both fold assignment and consistency of duplicates between folds (see the sketch after the list):
take original dataset (no repeats in data yet)
divide it into 2 sets based on the outcome field (the one that is unbalanced): one for positive and one for negative (if it's multinomial then have as many sets as there are outcomes)
divide each set into N folds by assigning new foldId column in both sets independently: this accomplishes stratified folds
combine (rbind) both sets back together
apply row duplication process that implements weights (which will preserve your fold assignments automatically now).
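A minimal base-R sketch of those steps, assuming dat is the de-duplicated data frame with a binary outcome y and a hypothetical weight column recording how many times each row should be repeated:

set.seed(42)
n_folds <- 5

assign_folds <- function(d, k) {
  # deal the rows of one stratum round-robin into k folds, in random order
  d$foldId <- sample(rep_len(seq_len(k), nrow(d)))
  d
}

pos <- assign_folds(dat[dat$y == 1, ], n_folds)   # one stratum per outcome level
neg <- assign_folds(dat[dat$y == 0, ], n_folds)

dat_folded <- rbind(pos, neg)                     # recombine the strata

# re-apply the duplication that implements the weights; every copy of a row
# inherits the same foldId, so duplicates can never straddle folds
dat_expanded <- dat_folded[rep(seq_len(nrow(dat_folded)), dat_folded$weight), ]

# dat_expanded$foldId can now be passed to H2O via the fold_column argument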

Is it normal to test a model on independent data after cross-validation?

I want to build a random forest model, so I split my data into 70% for training and 30% for testing. I applied a cross-validation procedure on my training data (70%) and obtained a precision value for the cross-validation. After that, I tested my model on the test data (30%) and obtained another precision value.
So, I want to know whether this is a good approach for testing the robustness of my model, and how to interpret these two precision values.
Thanks in advance.
You do not need to perform cross-validation when building an RF model, as RF calculates its own CV-like score known as the OOB (out-of-bag) score. In fact, the results that you get from the model (the confusion matrix at model_name$confusion) are based on the OOB predictions.
You can use the OOB scores (and the various metrics derived from them, such as precision, recall, etc.) to select a model from a list of models (for example, models with different parameters/arguments) and then use the test data to check whether the selected model generalises well.
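A small illustration of the two numbers involved, using randomForest and iris as a stand-in dataset:

library(randomForest)

set.seed(1)
idx   <- sample(nrow(iris), 0.7 * nrow(iris))   # 70/30 split
train <- iris[idx, ]
test  <- iris[-idx, ]

rf <- randomForest(Species ~ ., data = train)

rf$confusion   # OOB-based confusion matrix: the built-in "cross-validation-like" estimate

# independent check on the 30% hold-out
pred <- predict(rf, newdata = test)
table(observed = test$Species, predicted = pred)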

R caret use custom probability threshold during cross validation

I'm training a SVM using the cross validation method in caret on data with a high class imbalance. I've made a custom summaryFunction that maximises the F-measure and using classProbs=TRUE I've found an optimal class probability threshold that further maximises the F-measure.
Now I'd like to use this threshold during cross validation but I can't figure out how. The only thing I can think of is re-labeling the prediction column in the summaryFunction, before calculating the F-measure, using my custom threshold and the probabilities from classProbs=TRUE, but I feel like this would be less effective than actually changing the threshold.
Is there a way to actually change the threshold and if not would this re-labeling be effective?
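For reference, a sketch of the re-labelling idea described in the question: overwrite pred inside the summary function using the class-probability column and a custom cutoff before computing the F-measure. The 0.25 cutoff and treating lev[1] as the positive class are placeholders.

library(caret)

thresholdedF <- function(data, lev = NULL, model = NULL) {
  cutoff <- 0.25   # placeholder threshold found outside CV
  data$pred <- factor(ifelse(data[, lev[1]] > cutoff, lev[1], lev[2]),
                      levels = lev)
  c(F1 = F_meas(data = data$pred, reference = data$obs, relevant = lev[1]))
}

ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE,
                     summaryFunction = thresholdedF)

# train(..., trControl = ctrl, metric = "F1", maximize = TRUE)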

Do I exclude data used in the training set when running predict() on the model?

I am very new to machine learning. I have a question about running predict on data that was used in the training set.
Here are the details: I took a portion of my initial dataset and split that portion into 80% (train) and 20% (test). I trained the model on the 80% training set
model <- train(name ~ ., data = train.df, method = ...)
and then ran the model on the 20% test data:
predict(model, newdata = test.df, type = "prob")
Now I want to predict using my trained model on initial dataset which also includes the training portion. Do I need to exclude that portion that was used for the training?
When you report to a third party how well your machine learning model works, you always report the accuracy you get on the data that was not used in training (or validation).
You can report accuracy numbers for the overall data set, but always include the remark that this data set also contains the partition that was used to train the machine learning algorithm.
This care is taken to make sure your algorithm has not overfitted to your training set: https://en.wikipedia.org/wiki/Overfitting
Julie, I saw your comment below your original post. I would suggest you edit the original post and include your data split to be more complete in your question. It would also help to know what method of regression/classification you're using.
I'm assuming you're trying to assess the accuracy of your model with the 90% of data you left out. Depending on the number of samples you used in your training set you may or may not have the accuracy you'd like. Accuracy will also depend on your approach to the method of regression/classification you used.
To answer your question directly: you don't need to exclude anything from your dataset - the model doesn't change when you call predict().
All you're doing when you call predict is filling in the x-variables in your model with whatever data you supply. Your model was fitted to your training set, so if you supply training-set data again it will still produce predictions. Note, though, that when assessing accuracy your results will be skewed if you include the data you fit the model to, since that is what it learned from in the first place - kind of like watching a game, then watching the same game again and being asked to make predictions about it.
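To make that last point concrete, a small sketch with caret and iris as a stand-in:

library(caret)

set.seed(1)
in_train <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train.df <- iris[in_train, ]
test.df  <- iris[-in_train, ]

model <- train(Species ~ ., data = train.df, method = "rpart")

# predicting on everything is fine - the fitted model does not change -
# but only the rows the model never saw give an honest accuracy estimate
all_pred <- predict(model, newdata = iris)

mean(all_pred[in_train]  == iris$Species[in_train])    # optimistic: seen during training
mean(all_pred[-in_train] == iris$Species[-in_train])   # honest hold-out accuracy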
