I am looking through the documentation for the nested resampling procedure in the mlr3tuning package, and I do not see any way for the package to handle NA values so that information bleed between the training and hold-out sets is avoided (such bleed would result in overly optimistic performance statistics). Ideally, I would like a way to split my data in a nested resampling procedure such that:
full_data = N
train = N - holdout
test = holdout
Then I could perform the imputation on the train and test datasets separately, fit the model on train, and predict on test. Next I would select a new holdout and train set from the full dataset, run the imputation on them separately again, train, predict, and repeat for the number of outer_loops.
Is there a way of doing this? Am I missing something obvious?
mlr3 handles all of this for you if you use pipelines (see the relevant part of the mlr3 book). If you make imputation part of such a pipeline, mlr3 makes sure the imputation is trained and applied appropriately on the train/test splits, just like the model itself.
Briefly, as an explanation: just as with the machine learning model, you don't want to make any adjustments based on the test set; in particular, you shouldn't impute based on test data. Doing so causes problems similar to fitting the model on test data, i.e. biased evaluation results that may not be representative of the true generalization error.
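As a minimal sketch (the built-in pima task and the rpart learner are just example choices for illustration, not anything from your setup), you can chain an imputation PipeOp and a learner into one GraphLearner; resampling then fits the imputation on each training split only and applies it to the matching test split:

library(mlr3)
library(mlr3pipelines)

# Imputation becomes part of the learner, so it is fitted on each
# training split and merely applied to the corresponding test split.
graph <- po("imputemean") %>>% lrn("classif.rpart")
graph_learner <- as_learner(graph)

task <- tsk("pima")  # built-in binary classification task with missing values
rr <- resample(task, graph_learner, rsmp("cv", folds = 5))
rr$aggregate(msr("classif.ce"))

The same GraphLearner can be wrapped in an AutoTuner from mlr3tuning, so in nested resampling the imputation is refit inside every inner and outer training split.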
I am planning to run glm, lasso and randomForest across different sets of predictors to see which model combination is the best. I am going to be doing v-fold cross-validation. To compare the ML algorithms consistently, the same folds have to be fed into each of the ML algorithms. Correct me if I am wrong here.
How can we achieve that with the h2o package in R? Should I set
fold_assignment = Modulo within each algorithm function, such as h2o.glm(), h2o.randomForest(), etc.?
Hence, would the training set be split the same way across the ML algos?
And if I use fold_assignment = Modulo, what if I also have to stratify on my outcome? The stratification option is set via the fold_assignment parameter as well, and I am not sure I can specify Modulo and Stratified at the same time.
Alternatively, if I set the same seed in each of the models, would they have the same folds as input?
I have the above questions after reading Chapter 4 of [Practical Machine Learning with H2O by Darren Cook](https://www.oreilly.com/library/view/practical-machine-learning/9781491964590/ch04.html).
Further, regarding generalizability for site-level data, in a scenario like the one in the quotation below:
For example, if you have observations (e.g., user transactions) from K cities and you want to build models on users from only K-1 cities and validate them on the remaining city (if you want to study the generalization to new cities, for example), you will need to specify the parameter “fold_column” to be the city column. Otherwise, you will have rows (users) from all K cities randomly blended into the K folds, and all K cross-validation models will see all K cities, making the validation less useful (or totally wrong, depending on the distribution of the data). (source)
In that case, since we are cross folding by a column, it would be consistent across all the different models, right?
Make sure you split the dataset the same way for all ML algorithms (same seed). However, using the same seed for each model won't necessarily give the same cross-validation fold assignments. To ensure an apples-to-apples comparison, create a fold column (.kfold_column() or .stratified_kfold_column()) and specify it during training so that all the models use the same fold assignment.
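For example, a minimal sketch in R, assuming a categorical outcome (my_training_data and outcome are placeholder names, and caret::createFolds is just one convenient way to build a stratified fold assignment outside of H2O):

library(h2o)
library(caret)  # used only to construct a stratified fold assignment
h2o.init()

# One shared, stratified fold assignment, attached as a column so that every
# algorithm is trained and validated on identical folds.
set.seed(42)
my_training_data$fold <- createFolds(my_training_data$outcome, k = 5, list = FALSE)

train_hf <- as.h2o(my_training_data)

glm_fit <- h2o.glm(y = "outcome", training_frame = train_hf, fold_column = "fold",
                   family = "binomial")  # adjust the family to match your response
rf_fit  <- h2o.randomForest(y = "outcome", training_frame = train_hf, fold_column = "fold")

Because the fold assignment lives in the data itself, fold_assignment and the model seeds no longer affect comparability: every algorithm sees exactly the same partitions.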
I want to create the best training sample from a given set of data points by running all possible combinations of train and test through a model and selecting the one with the best R2.
I do not want to run the model with all possible combinations; rather, I want to select something like a stratified set each time and run the model. Is there a way to do this in R?
sample dataset
df1 <- data.frame(
  sno = 1:30,
  x1 = c(14.3,14.8,14.8,15,15.1,15.1,15.4,15.4,16.1,14.3,14.8,14.8,15.2,15.1,15.1,15.4,15.4,16.1,14.2,14.8,14.7,15.1,15,15,15.3,15.3,15.9,15.1,15,15.3),
  y1 = c(79.2,78.7,79,78.2,78.7,79.1,78.4,78.7,78.1,79.2,78.7,79,78.2,78.6,79.2,78.4,78.7,78.1,79.1,78.5,78.9,78,78.5,79,78.2,78.5,78,79.2,78.7,78.7),
  z1 = c(219.8,221.6,232.5,213.1,231,247.6,230.2,240.9,245.5,122.8,124.2,131.5,119.1,130.5,141.1,130.8,137.7,140.8,25.4,30.5,30.5,23.8,29.6,34.6,29.5,33.3,35.2,105,170.7,117.3)
)
This defeats the purpose of training. Ideally, you have one or more training datasets and an untouched testing data set you'll ultimately test against once your model is fit. Cherry-picking a training dataset, using R-squared or any other metric for that matter, will introduce bias. Worse still, if your model parameters are wildly different depending on which training set you use, your model likely isn't very good and results against your testing dataset are likely to be spurious.
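If what you actually want is a representative (stratified) split rather than a search over all splits, here is a minimal sketch with caret, assuming y1 from your df1 is the variable you want to stratify on:

library(caret)

set.seed(42)
# createDataPartition stratifies the split on the supplied variable
# (numeric responses are binned into quantile groups internally).
train_idx <- createDataPartition(df1$y1, p = 0.7, list = FALSE)
train_set <- df1[train_idx, ]
test_set  <- df1[-train_idx, ]

Then fit the model once on train_set and report performance on the untouched test_set, rather than choosing the split that happens to maximise R2.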
I'm using H2O to analyse a dataset but I'm not sure how to correctly perform cross-validation on it. I have an unbalanced dataset, so I would like to perform stratified cross-validation (where the output variable is used to balance the groups in each partition).
However, on top of that, I also have the issue that many of my rows are repeats (a way of implementing weights without actually having weights). Independently of the source of this problem, I have seen before that, in some cases, you can do cross-validation where some rows have to be kept together. This seems to be the purpose of fold_column. However, is it not possible to do both at the same time?
If there is no H2O solution, how can I compute the fold a priori and use it on H2O?
Based on the H2O-3 docs, this can't be done:
Note that all three options are only suitable for datasets that are i.i.d. If the dataset requires custom grouping to perform meaningful cross-validation, then a fold_column should be created and provided instead.
One quick idea is to use weights_column instead of duplicating rows. Both balance_classes and weights_column are available together as parameters in
GBM, DRF, Deep Learning, GLM, Naïve-Bayes, and AutoML.
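A minimal sketch of that idea (my_data, outcome, and weight are placeholder names; the weight column holds what would otherwise have been the duplication counts):

library(h2o)
h2o.init()

# One row per original observation plus a numeric weight column,
# instead of physically repeating rows.
hf <- as.h2o(my_data)

fit <- h2o.gbm(y = "outcome",
               training_frame = hf,
               weights_column = "weight",       # replaces the row duplication
               nfolds = 5,
               fold_assignment = "Stratified",  # stratify folds on the categorical outcome
               seed = 42)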
Otherwise, I suggest the following workflow, performed in R or H2O on your data, to achieve both stratified fold assignment and consistency of duplicates between folds (a sketch in R follows the list):
take the original dataset (no repeats in the data yet)
divide it into 2 sets based on the outcome field (the unbalanced one): one for positives and one for negatives (if it's multinomial, have as many sets as there are outcomes)
divide each set into N folds by assigning a new foldId column in each set independently: this accomplishes stratified folds
combine (rbind) the two sets back together
apply the row-duplication process that implements your weights (which will now preserve your fold assignments automatically).
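A minimal sketch of that workflow in base R, assuming a data frame orig with an unbalanced factor column outcome and an integer weight column holding the duplication counts (all placeholder names); steps 2-4 are collapsed into a single grouped fold assignment:

set.seed(42)
n_folds <- 5

# Steps 2-4: assign folds within each outcome class, which yields stratified folds.
orig$foldId <- ave(
  seq_len(nrow(orig)),
  orig$outcome,
  FUN = function(idx) sample(rep_len(seq_len(n_folds), length(idx)))
)

# Step 5: duplicate rows according to their weights; every copy inherits the
# foldId of its source row, so duplicates never straddle folds.
expanded <- orig[rep(seq_len(nrow(orig)), times = orig$weight), ]

# Hand the expanded frame to H2O and point fold_column at "foldId".
library(h2o)
h2o.init()
hf <- as.h2o(expanded)
fit <- h2o.gbm(y = "outcome", training_frame = hf, fold_column = "foldId")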
I am attempting to produce a classification model based on qualitative survey data. About 10K of our customers were surveyed, and as a result a segmentation model was built and each respondent was subsequently categorised into 1 of 8 customer segments. The challenge is now to classify the TOTAL customer base into those segments. As only certain customers responded, the researcher used overall demographics to apply post-stratification weights (or frequency weights).
My task is now to use our customer data as explanatory variables for these 10K respondents in order to build a classification model for the whole base.
In order to handle the customer weights, I simply duplicated each customer record by its respective frequency weight, and the data set exploded to about 72K rows. I then split this data into train and test, used the R caret package to train a GBM, and classified my hold-out test set using the final chosen model.
I was getting 82% accuracy and thought the results were too good to be true. After thinking about it, I believe the issue is that the model is inadvertently seeing records in train that are exactly the same as records in test (some records might be duplicated up to 10 times).
I know that the GLM model function allows you to use the weights parameter to supply a vector of case weights, but my question is how to do this with other machine learning algorithms, such as GBM or random forests, in R?
Thanks
You can use case weights with gbm and train. In general, the list of models in caret that can use case weights is here.
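For instance, a minimal sketch (survey_df, segment, and freq_weight are placeholder names for your 10K respondents, the 8-level segment label, and the post-stratification weight):

library(caret)

set.seed(42)
fit <- train(
  segment ~ .,
  data = subset(survey_df, select = -freq_weight),  # keep the weight out of the predictors
  method = "gbm",
  weights = survey_df$freq_weight,                  # case weights instead of duplicated rows
  trControl = trainControl(method = "cv", number = 5),
  verbose = FALSE                                   # passed through to gbm
)

Because each respondent then appears only once, the resampling hold-outs no longer contain exact copies of training rows, so the accuracy estimate is not inflated the way it was with the 72K duplicated set.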