R package (`caret`?) for nested time series cross validation

I'm using caret's timeslice method to do step-ahead cross validation on time series data. I was surprised to find that:
the 'best' hyperparameters chosen by caret are those with the best average performance across all train/test splits, and similarly the reported performance is the average across all train/test splits using those hyperparameter values, and
caret trains a final model using all of the available data, which makes sense when fixedWindow = TRUE but perhaps not otherwise.
My data are non-stationary, so I'd like hyperparameter tuning, performance reporting and final model training to be based on recent data so that:
optimal hyperparameter values can change as underlying relationships change
reported performance can account for the fact that the best hyperparameter values may change across splits, and
recent changes in underlying relationships are best captured when training the final model.
I'm wondering whether the best approach for my non-stationary data would be something like the following:
(1) split each training fold into a training subset and a validation subset, using the validation subset to pick hyperparameter values within each training fold;
(2) train on the entire training fold using the hyperparameter values selected in (1);
(3) report performance based on whatever hyperparameter values were selected in (1), even though these may change from fold to fold.
The final model is trained, and its hyperparameter values selected, following steps (1) and (2) using the most recent data only.
Having typed this up I've realised that I'm just describing nested CV for time series. My questions:
Is this best practice for training time series models when data are non-stationary?
Can caret, or another R package, do this?
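For concreteness, here is roughly what I have in mind, sketched as a manual rolling-origin loop with caret::createTimeSlices() generating the indices. This is only a hedged sketch: dat (ordered by time, with response y), the hyperparameter grid grid, the window sizes, and the helpers fit_model() / rmse() are all placeholders for whatever data, model, and error metric are actually in use.

library(caret)

# Rough sketch of steps (1)-(3) as nested, rolling-origin CV.
# 'dat', 'y', 'grid', fit_model() and rmse() are hypothetical placeholders.
n <- nrow(dat)
outer <- createTimeSlices(1:n, initialWindow = 200, horizon = 20,
                          fixedWindow = TRUE, skip = 19)

fold_rmse <- sapply(seq_along(outer$train), function(i) {
  tr <- outer$train[[i]]
  te <- outer$test[[i]]

  # (1) hold out the last part of the training window as a validation subset
  val   <- tail(tr, 20)
  inner <- setdiff(tr, val)
  scores <- sapply(seq_len(nrow(grid)), function(g) {
    fit <- fit_model(dat[inner, ], grid[g, , drop = FALSE])   # hypothetical helper
    rmse(predict(fit, dat[val, ]), dat$y[val])                # hypothetical helper
  })
  best <- grid[which.min(scores), , drop = FALSE]

  # (2) refit on the entire training window with the chosen hyperparameters
  fit <- fit_model(dat[tr, ], best)

  # (3) report performance on the outer test window
  rmse(predict(fit, dat[te, ]), dat$y[te])
})
mean(fold_rmse)  # performance averaged over folds, each tuned separately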

Related

Missing value handling with imputation in a nested resampling procedure such that there is no information bleed from train to test

I am looking through the documentation for the nested resampling procedure in the mlr3tuning package and I do not see any way for the package to handle NA values so that information bleed between the training and hold-out sets (which would result in overly optimistic performance stats) is avoided. I would ideally like a way to split my data in a nested resampling procedure such that:
full_data = N
train = N - holdout
test = holdout
Then I could impute the train and test datasets separately, fit the model on train and predict on test, then select a new holdout and train set from the full dataset, impute them separately again, and fit and predict, repeating for the number of outer_loops.
Is there a way of doing this? Am I missing something obvious?
mlr3 handles all of this for you if you use pipelines (see the relevant part of the mlr3 book). If you make imputation part of such a pipeline, it is trained and applied within each resampling split, just like the model itself.
Briefly, as an explanation: just as with the machine learning model, you don't want to make any adjustments based on the test set; in particular, you shouldn't impute based on the test data. Doing so causes similar problems as it does for the model, i.e. biased evaluation results that may not be representative of the true generalization error.
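A minimal sketch of that idea, using the built-in pima task (which contains missing values); the choice of imputation PipeOp and learner here is purely illustrative:

library(mlr3)
library(mlr3pipelines)

# Wrap imputation into the learner as a pipeline step: it is then fitted on
# each training split only and applied to the matching test split, so there
# is no information bleed into the hold-out data.
task <- tsk("pima")                              # built-in task with NAs

graph <- po("imputemedian") %>>%                 # imputation learned from train only
  po("learner", lrn("classif.rpart"))
graph_learner <- as_learner(graph)

# Plain resampling shown here; the same GraphLearner can be wrapped in an
# mlr3tuning AutoTuner for full nested resampling.
rr <- resample(task, graph_learner, rsmp("cv", folds = 5))
rr$aggregate(msr("classif.ce"))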

Cross-Validation Across models in h2o in R

I am planning to run glm, lasso and randomForest across different sets of predictors to see which model combination is the best. I am going to be doing v-fold cross-validation. To compare the ML algorithms consistently, the same folds have to be fed into each of the ML algorithms. Correct me if I am wrong here.
How can we achieve that with the h2o package in R? Should I set
fold_assignment = Modulo within each algo function such as h2o.glm(), h2o.randomForest(), etc.?
If so, would the training set be split the same way across the ML algos?
If I use fold_assignment = Modulo, what if I have to stratify my outcome? The stratification option is on the fold_assignment parameter as well, and I am not sure I can specify Modulo and Stratified at the same time.
Alternatively, if I set the same seed in each of the models, would they have the same folds as input?
I have the above questions after reading Chapter 4 of [Practical Machine Learning with H2O by Darren Cook](https://www.oreilly.com/library/view/practical-machine-learning/9781491964590/ch04.html).
Further, for generalizability with site-level data, consider a scenario like the one in the quotation below:
For example, if you have observations (e.g., user transactions) from K cities and you want to build models on users from only K-1 cities and validate them on the remaining city (if you want to study the generalization to new cities, for example), you will need to specify the parameter “fold_column” to be the city column. Otherwise, you will have rows (users) from all K cities randomly blended into the K folds, and all K cross-validation models will see all K cities, making the validation less useful (or totally wrong, depending on the distribution of the data). (source)
In that case, since we are assigning folds by a column, the folds would be consistent across all the different models, right?
Make sure the dataset is split the same way for all ML algos. Simply setting the same seed for each model won't necessarily give the same cross-validation assignments. To ensure an apples-to-apples comparison, create a fold column (.kfold_column() or .stratified_kfold_column()) and specify it during training so all models use the same fold assignment.
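In R, a hedged sketch of that approach might look like the following. iris is just a stand-in dataset, and h2o.stratified_kfold_column() is assumed to be the R counterpart of the method above (h2o.kfold_column() for unstratified folds); check the docs for your h2o release.

library(h2o)
h2o.init()

hf <- as.h2o(iris)

# One fold column created up front, stratified on the outcome so class
# proportions are preserved in each fold. (Assumes h2o.stratified_kfold_column()
# is available in your h2o version.)
hf$fold <- h2o.stratified_kfold_column(hf$Species, nfolds = 5, seed = 42)

preds <- setdiff(names(iris), "Species")

m_glm <- h2o.glm(x = preds, y = "Species", training_frame = hf,
                 family = "multinomial", fold_column = "fold")
m_rf  <- h2o.randomForest(x = preds, y = "Species", training_frame = hf,
                          fold_column = "fold")

# Both models now share identical CV assignments, so their cross-validation
# metrics are directly comparable.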

how to select the best dataset for training a model

I want to create the best possible training sample from a given set of data points by running all possible combinations of train and test through a model and selecting the combination with the best R2.
I do not want to run the model with every possible combination; rather, I want to select something like a stratified set each time and run the model. Is there a way to do this in R?
sample dataset
df1 <- data.frame(
  sno = 1:30,
  x1 = c(14.3,14.8,14.8,15,15.1,15.1,15.4,15.4,16.1,14.3,14.8,14.8,15.2,15.1,15.1,15.4,15.4,16.1,14.2,14.8,14.7,15.1,15,15,15.3,15.3,15.9,15.1,15,15.3),
  y1 = c(79.2,78.7,79,78.2,78.7,79.1,78.4,78.7,78.1,79.2,78.7,79,78.2,78.6,79.2,78.4,78.7,78.1,79.1,78.5,78.9,78,78.5,79,78.2,78.5,78,79.2,78.7,78.7),
  z1 = c(219.8,221.6,232.5,213.1,231,247.6,230.2,240.9,245.5,122.8,124.2,131.5,119.1,130.5,141.1,130.8,137.7,140.8,25.4,30.5,30.5,23.8,29.6,34.6,29.5,33.3,35.2,105,170.7,117.3)
)
This defeats the purpose of training. Ideally, you have one or more training datasets and an untouched testing data set you'll ultimately test against once your model is fit. Cherry-picking a training dataset, using R-squared or any other metric for that matter, will introduce bias. Worse still, if your model parameters are wildly different depending on which training set you use, your model likely isn't very good and results against your testing dataset are likely to be spurious.
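If the goal is simply a representative (rather than cherry-picked) split, here is a hedged sketch of the conventional alternative using the df1 above: a single train/test split stratified on the response binned into quartiles. Treating z1 as the response, binning it into quartiles, and fitting lm() are assumptions made purely for illustration.

library(caret)

set.seed(123)
# Stratify on z1 binned into quartiles (binning choice is an assumption)
strata <- cut(df1$z1,
              breaks = quantile(df1$z1, probs = seq(0, 1, 0.25)),
              include.lowest = TRUE)
train_idx <- createDataPartition(strata, p = 0.7, list = FALSE)

train <- df1[train_idx, ]
test  <- df1[-train_idx, ]

fit  <- lm(z1 ~ x1 + y1, data = train)   # any model of interest
pred <- predict(fit, newdata = test)
cor(pred, test$z1)^2                      # out-of-sample R2, evaluated once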

R H20 - Cross-validation with stratified sampling and non i.i.d. rows

I'm using H2O to analyse a dataset but I'm not sure how to correctly perform cross-validation on it. I have an unbalanced dataset, so I would like to perform stratified cross-validation (where the output variable is used to balance the groups in each partition).
However, on top of that, many of my rows are repeats (a way of implementing weights without actually having weights). Independently of the source of this problem, I have seen before that, in some cases, you can do cross-validation where some rows have to be kept together. This seems to be the purpose of fold_column. However, is it possible to do both at the same time?
If there is no H2O solution, how can I compute the folds a priori and use them in H2O?
Based on the H2O-3 docs, this can't be done:
Note that all three options are only suitable for datasets that are i.i.d. If the dataset requires custom grouping to perform meaningful cross-validation, then a fold_column should be created and provided instead.
One quick idea is to use weights_column instead of duplicating rows. Both balance_classes and weights_column are available together as parameters in GBM, DRF, Deep Learning, GLM, Naïve-Bayes, and AutoML (a sketch of this follows below).
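A hedged sketch of that first idea; the column names outcome and weight, the predictor vector predictors, and my_data are assumptions standing in for your data:

library(h2o)
h2o.init()

hf <- as.h2o(my_data)   # one row per original observation, 'weight' column instead of repeats

m <- h2o.gbm(x = predictors, y = "outcome", training_frame = hf,
             weights_column = "weight",     # replaces row duplication
             balance_classes = TRUE,        # handles the class imbalance
             nfolds = 5,
             fold_assignment = "Stratified",
             seed = 42)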
Otherwise, I suggest the following workflow, performed in R or H2O on your data, to achieve both stratified fold assignment and consistency of duplicates between folds (sketched after the steps):
take the original dataset (no repeats in the data yet)
divide it into 2 sets based on the outcome field (the unbalanced one): one for positives and one for negatives (if it's multinomial, have as many sets as there are outcomes)
divide each set into N folds by assigning a new foldId column in each set independently: this accomplishes stratified folds
combine (rbind) the sets back together
apply the row-duplication process that implements the weights (which will now preserve your fold assignments automatically).
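A hedged base-R sketch of those steps; the data frame original and the column names outcome and weight (integer frequency weights) are assumptions to adapt to your data:

set.seed(1)
nfolds <- 5

# 1) original data: one row per observation, no repeats yet
# 2) split by the unbalanced outcome
parts <- split(original, original$outcome)

# 3) assign folds within each outcome level independently -> stratified folds
parts <- lapply(parts, function(d) {
  d$foldId <- sample(rep_len(seq_len(nfolds), nrow(d)))
  d
})

# 4) recombine
strat <- do.call(rbind, parts)

# 5) duplicate rows according to their (integer) weight; each copy keeps its
#    foldId, so duplicates of the same row never end up in different folds
expanded <- strat[rep(seq_len(nrow(strat)), times = strat$weight), ]

# Then upload and pass foldId as the fold_column, e.g.:
# hf <- as.h2o(expanded)
# h2o.gbm(..., training_frame = hf, fold_column = "foldId")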

Build GBM classification model with customer post-stratification weights

I am attempting to produce a classification model based on qualitative survey data. About 10K of our customers were surveyed; as a result a segmentation model was built and each customer was subsequently categorised into 1 of 8 customer segments. The challenge is now to classify the TOTAL customer base into those segments. As only certain customers responded, the researcher used overall demographics to apply post-stratification weights (or frequency weights).
My task is to now use our customer data as explanatory variables on this 10K in order to build a classification model for the whole base.
In order to handle the customer weights I simply duplicated each customer record by its respective frequency weight, and the data set exploded to about 72K rows. I then split this data into train and test, used the R caret package to train a GBM, and classified my hold-out test set using the final chosen model.
I was getting 82% accuracy and thought the results were too good to be true. After thinking about it, I believe the issue is that the model is inadvertently seeing records in train that are exactly the same as records in test (some records might be duplicated up to 10 times).
I know that the GLM model function lets you use the weight parameter to supply a vector of weights, but my question is how to use case weights with other machine learning algorithms, such as GBM or Random Forests, in R?
Thanks
You can use case weights with gbm via train (a minimal sketch follows). In general, the list of models in caret that can use case weights is here.
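A minimal sketch of that, with assumed names: a predictor data frame X, the factor outcome seg (the 8 segments), and the post-stratification weights wts from the survey.

library(caret)

set.seed(42)
ctrl <- trainControl(method = "cv", number = 5)

# Case weights are passed straight through to gbm, so there is no need to
# duplicate rows (and no duplicated records leaking between train and test).
fit <- train(x = X, y = seg,
             method = "gbm",
             weights = wts,
             trControl = ctrl,
             verbose = FALSE)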
