I am planning to run glm, lasso and randomForest across different sets of predictors to see which model combination is the best. I am going to be doing v-fold cross-validation. To compare the ML algorithms consistently, the same folds have to be fed into each of the ML algorithms. Correct me if I am wrong here.
How can we achieve that in the h2o package in R? Should I set
fold_assignment = Modulo within each algo function, such as h2o.glm(), h2o.randomForest(), etc.?
If so, would the training data be split into the same folds across the ML algos?
If I use fold_assignment = Modulo, what if I have to stratify my outcome? The stratification option is set with the fold_assignment parameter as well, and I am not sure I can specify Modulo and Stratified at the same time.
Alternatively, if I set the same seed in each of the model, would they have the same folds as input?
I have the above questions after reading Chapter 4 of [Practical Machine Learning with H2O by Darren Cook](https://www.oreilly.com/library/view/practical-machine-learning/9781491964590/ch04.html).
Further, consider generalizability for site-level data in a scenario like the one in the quotation below:
For example, if you have observations (e.g., user transactions) from K cities and you want to build models on users from only K-1 cities and validate them on the remaining city (if you want to study the generalization to new cities, for example), you will need to specify the parameter “fold_column” to be the city column. Otherwise, you will have rows (users) from all K cities randomly blended into the K folds, and all K cross-validation models will see all K cities, making the validation less useful (or totally wrong, depending on the distribution of the data). (source)
In that case, since the folds are defined by a column, the fold assignment would be consistent across all the different models, right?
Make sure you split the dataset into train/test the same way for all ML algos (same seed). However, having the same seed for each model won't necessarily give the same cross-validation fold assignments. To ensure apples-to-apples comparisons, create a fold column (e.g., via .kfold_column() or .stratified_kfold_column()) and specify it during training so they all use the same fold assignment.
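For instance, a minimal sketch in R, assuming a local data frame `df` with a binary factor outcome column "y" (the column names, fold count, and seed are placeholders): build one stratified fold-assignment column up front and pass it as fold_column to every algorithm.

```r
library(h2o)
h2o.init()

n_folds <- 5
set.seed(42)

# Stratified fold ids: shuffle fold labels within each outcome level
df$fold_id <- NA_integer_
for (lvl in levels(df$y)) {
  idx <- which(df$y == lvl)
  df$fold_id[idx] <- sample(rep(seq_len(n_folds), length.out = length(idx)))
}

hf <- as.h2o(df)
predictors <- setdiff(colnames(hf), c("y", "fold_id"))

# Every model trained with the same fold_column sees identical folds
glm_fit <- h2o.glm(x = predictors, y = "y", training_frame = hf,
                   family = "binomial", fold_column = "fold_id")
rf_fit  <- h2o.randomForest(x = predictors, y = "y", training_frame = hf,
                            fold_column = "fold_id")
```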
Related
I'm using caret's timeslice method to do step-ahead cross validation on time series data. I was surprised to find that:
the 'best' hyperparameters chosen by caret are those with the best average performance across all train/test splits, and similarly the reported performance is the average across all train/test splits based on these hyperparam values, and
caret trains a final model using all of the data available - which makes sense when fixedWindow = TRUE but perhaps not otherwise.
My data are non-stationary, so I'd like hyperparameter tuning, performance reporting and final model training to be based on recent data so that:
optimal hyperparameter values can change as underlying relationships change
reported performance can account for the fact that the best hyperparam values may change across splits, and
recent changes in underlying relationships are best captured when training the final model.
I'm wondering if the best approach for my non-stationary data would be something like this:
split each training fold into a training subset and validation subset - using the validation subset to pick hyperparam values within each training fold.
train on the entire training fold using the hyperparam values selected in (1)
report performance based on whatever hyperparameter values were selected in (1), even though these may change from fold to fold
The final model is trained, and hyperparameter values selected, based on steps (1) and (2) using the most recent data only.
Having typed this up I've realised that I'm just describing nested CV for time series. My questions:
is this best practice for training time series models when data are non-stationary?
can caret, or another R package, do this?
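For what it's worth, one way to sketch that nested scheme with caret is an explicit outer loop over createTimeSlices() with an inner "timeslice" trainControl doing the tuning. The window sizes, the "gbm" method, and the data frame/outcome names (dat, y) below are illustrative assumptions, not a recommendation.

```r
library(caret)

# Outer step-ahead splits
outer <- createTimeSlices(seq_len(nrow(dat)),
                          initialWindow = 200, horizon = 20,
                          fixedWindow = TRUE, skip = 19)

outer_perf <- vector("list", length(outer$train))
for (i in seq_along(outer$train)) {
  train_idx <- outer$train[[i]]
  test_idx  <- outer$test[[i]]

  # Inner loop: tune hyperparameters on step-ahead splits inside this window
  inner_ctrl <- trainControl(method = "timeslice",
                             initialWindow = 150, horizon = 20,
                             fixedWindow = TRUE)
  fit <- train(y ~ ., data = dat[train_idx, ],
               method = "gbm", trControl = inner_ctrl, verbose = FALSE)

  # Outer performance uses whatever hyperparameters the inner loop picked,
  # so the chosen values are free to change from split to split
  preds <- predict(fit, newdata = dat[test_idx, ])
  outer_perf[[i]] <- postResample(preds, dat$y[test_idx])
}
```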
I'm using H2O to analyse a dataset but I'm not sure how to correctly perform cross-validation on it. I have an unbalanced dataset, so I would like to perform stratified cross-validation (where the output variable is used to balance the groups in each partition).
However, on top of that, many of my rows are repeats (a way of implementing weights without actually having weights). Independently of the source of this problem, I have seen before that, in some cases, you can do cross-validation where certain rows have to be kept together. This seems to be the purpose of fold_column. However, is it not possible to do both at the same time?
If there is no H2O solution, how can I compute the folds a priori and use them in H2O?
Based on H2O-3 docs this can't be done:
Note that all three options are only suitable for datasets that are i.i.d. If the dataset requires custom grouping to perform meaningful cross-validation, then a fold_column should be created and provided instead.
One quick idea is to use weights_column instead of duplicating rows. Both balance_classes and weights_column are then available together as parameters in GBM, DRF, Deep Learning, GLM, Naïve Bayes, and AutoML.
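A brief sketch of that idea (assuming an H2OFrame `hf` with an outcome column "outcome" and a numeric column "weight" holding each row's intended repeat count; names and settings are placeholders):

```r
library(h2o)
h2o.init()

fit <- h2o.gbm(x = setdiff(colnames(hf), c("outcome", "weight")),
               y = "outcome", training_frame = hf,
               weights_column = "weight",     # replaces duplicating rows
               balance_classes = TRUE,        # handles the class imbalance
               nfolds = 5, fold_assignment = "Stratified")
```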
Otherwise, I suggest the following workflow, performed in R or H2O on your data, to achieve both stratified fold assignment and consistency of duplicates between folds:
take original dataset (no repeats in data yet)
divide it into 2 sets based on the outcome field (the one that is unbalanced): one for positive and one for negative (if it's multinomial then have as many sets as there are outcomes)
divide each set into N folds by assigning a new foldId column in both sets independently: this accomplishes stratified folds
combine (rbind) both sets back together
apply the row duplication process that implements the weights (which will now preserve your fold assignments automatically), as in the sketch below.
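A plain-R sketch of those steps (assuming a data frame `orig` with one row per unique observation, a binary outcome column "outcome", and an integer column "weight" holding the intended repeat count; all names are placeholders):

```r
n_folds <- 5
set.seed(123)

# Steps 2-4: assign stratified fold ids within each outcome level
# (splitting into per-outcome sets and rbind-ing back is done implicitly here)
orig$foldId <- NA_integer_
for (lvl in unique(orig$outcome)) {
  idx <- which(orig$outcome == lvl)
  orig$foldId[idx] <- sample(rep(seq_len(n_folds), length.out = length(idx)))
}

# Step 5: duplicate rows according to their weight -- the copies inherit the
# fold id, so duplicates of a row always end up in the same fold
expanded <- orig[rep(seq_len(nrow(orig)), times = orig$weight), ]

# Use foldId as the fold_column in H2O
library(h2o)
h2o.init()
hf  <- as.h2o(expanded)
fit <- h2o.gbm(x = setdiff(colnames(hf), c("outcome", "foldId", "weight")),
               y = "outcome", training_frame = hf, fold_column = "foldId")
```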
I have clustered a mixed dataset containing numerical and categorical features (the heart dataset from UCI) using two clustering methods, k-prototypes and PAM.
My question is: how to validate the results of clustering?
I have found different methods in R, such as the Rand Index, SSE, Purity, clValid, and pvclust, but all of them work with numeric data.
Is there any method that can be used in the case of mixed data?
Yes, you can compare the clustering results with the CV index. For more, you can read up on the CV index.
The CV formula combines CU (Category Utility) for the categorical attributes and variance for the numeric attributes.
You can still use the Adjusted Rand Index. This index only compares two partitions; it does not matter whether the partitions were built from categorical or continuous features.
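For example, a minimal sketch (assuming `df` is the preprocessed heart data; k = 3 is just a placeholder) comparing the k-prototypes and PAM partitions with the Adjusted Rand Index:

```r
library(clustMixType)  # kproto() for k-prototypes
library(cluster)       # pam() and daisy() for PAM on a Gower dissimilarity
library(mclust)        # adjustedRandIndex()

kp      <- kproto(df, k = 3)
pam_fit <- pam(daisy(df, metric = "gower"), k = 3)

# ARI only needs the two label vectors, regardless of the feature types
adjustedRandIndex(kp$cluster, pam_fit$clustering)
```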
How many observations (n) and dimensions (d) are you studying?
You are probably in the n >> d case, but more recently the d >> n setting has become a hot topic.
Variable selection is something that needs to be done beforehand. Check for feature correlation, as this can affect the number of clusters that you detect. If two features are correlated and the relationship happens to be linear, you can use the gradient instead of the two variables.
There is no absolute answer to your question. Many methods exist because of this. Clustering is explorative by nature. The better you know your data the better you can design tests.
You need to define what you want to test: the stability of the partition, or the stability of the clustering recipe. There are different ways to deal with each of these problems. For the first, resampling is key; for the second, comparison indexes that measure how many observations were left out of a certain partition are often used.
Recommended reading:
[1] Meila, M. (2016). Criteria for Comparing Clusterings. Handbook of Cluster Analysis. C. Hennig, M. Meila, F. Murtagh and R. Rocci: 619-635.
[2] Leisch, F. (2016). Resampling Methods for Exploring Cluster Stability. Handbook of Cluster Analysis. C. Hennig, M. Meila, F. Murtagh and R. Rocci: 637-652.
I am attempting to produce a classification model based on qualitative survey data. About 10K of our customers were surveyed, and as a result a segmentation model was built and each customer was subsequently categorised into 1 of 8 customer segments. The challenge is now to classify the TOTAL customer base into those segments. As only certain customers responded, the researcher used overall demographics to apply post-stratification weights (or frequency weights).
My task is to now use our customer data as explanatory variables on this 10K in order to build a classification model for the whole base.
In order to handle the customer weights I simply duplicated each customer record by its respective frequency weight, and the data set exploded to about 72K rows. I then split this data into train and test sets and used the R caret package to train a GBM, and with the final chosen model I classified my hold-out test set.
I was getting 82% accuracy and thought the results were too good to be true. After thinking about it I think the issue is that the model is inadvertently seeing records in train that are exactly the same in test (some records might be exactly duplicated up to 10 times).
I know that the glm() function allows you to supply a vector of weights via its weights argument, but my question is: how can I use such weights with other machine learning algorithms, such as GBM or random forests, in R?
Thanks
You can use case weights with gbm and train. In general, the list of models in caret that can use case weights is here.
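A hedged sketch of that route (assuming a data frame `survey` with the segment label in "segment" and the frequency weight in "freq_wt"; names and tuning settings are placeholders), passing the weights to train() instead of duplicating rows:

```r
library(caret)

set.seed(1)
idx      <- createDataPartition(survey$segment, p = 0.8, list = FALSE)
train_df <- survey[idx, ]
test_df  <- survey[-idx, ]

fit <- train(segment ~ . - freq_wt, data = train_df,
             method = "gbm",
             trControl = trainControl(method = "cv", number = 5),
             weights = train_df$freq_wt,   # case weights instead of row duplication
             verbose = FALSE)

confusionMatrix(predict(fit, test_df), test_df$segment)
```

This keeps the 10K respondents as 10K rows, so identical copies of a record can no longer land on both sides of the train/test split.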
I'm working with a large data set, so I hope to remove extraneous variables and tune for an optimal m (mtry) variables per split. In R, there are two methods, rfcv and tuneRF, that help with these two tasks. I'm attempting to combine them to optimize parameters.
rfcv works roughly as follows:
```
create random forest and extract each variable's importance;
while (nvar > 1) {
    remove the k (or k%) least important variables;
    run random forest with remaining variables, reporting cverror and predictions;
}
```
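In code, the stock routine is just a call like the following (a sketch; `trainx`/`trainy` and the step size are placeholders):

```r
library(randomForest)

# rfcv() shrinks the predictor set by `step` at each iteration and reports the
# cross-validated error for each predictor-set size
res <- rfcv(trainx, trainy, cv.fold = 5, step = 0.5)
res$error.cv   # CV error indexed by number of variables retained
```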
Presently, I've recoded rfcv to work as follows:
```
create random forest and extract each variable's importance;
while (nvar > 1) {
    remove the k (or k%) least important variables;
    tune for the best m for reduced variable set;
    run random forest with remaining variables, reporting cverror and predictions;
}
```
This, of course, increases the run time by an order of magnitude. My question is how necessary this is (it's been hard to get an idea using toy datasets), and whether any other way could be expected to work roughly as well in far less time.
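A rough sketch of that modified loop (assuming a classification task with a predictor data frame `x` and factor response `y`; the 20% drop rate and the use of tuneRF() inside the loop are illustrative, not a drop-in replacement for rfcv()):

```r
library(randomForest)

vars    <- colnames(x)
results <- list()

while (length(vars) > 1) {
  # Tune mtry for the current (reduced) variable set
  tuned     <- tuneRF(x[, vars, drop = FALSE], y, trace = FALSE, plot = FALSE)
  best_mtry <- tuned[which.min(tuned[, "OOBError"]), "mtry"]

  # Refit with the tuned mtry and record the OOB error for this set size
  rf <- randomForest(x[, vars, drop = FALSE], y,
                     mtry = best_mtry, importance = TRUE)
  results[[length(results) + 1]] <-
    list(nvar = length(vars), mtry = best_mtry,
         oob_error = rf$err.rate[nrow(rf$err.rate), "OOB"])

  # Drop the 20% least important variables
  keep <- max(1, floor(0.8 * length(vars)))
  imp  <- importance(rf, type = 1)[, 1]
  vars <- names(sort(imp, decreasing = TRUE))[seq_len(keep)]
}
```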
As always, the answer is it depends on the data. On one hand, if there aren't any irrelevant features, then you can just totally skip feature elimination. The tree building process in the random forest implementation already tries to select predictive features, which gives you some protection against irrelevant ones.
Leo Breiman gave a talk where he introduced 1000 irrelevant features into some medical prediction task that had only a handful of real features from the input domain. When he eliminated 90% of the features using a single filter on variable importance, the next iteration of random forest didn't pick any irrelevant features as predictors in its trees.