Cross-Validation in R using the vtreat Package

I'm currently learning about cross-validation through a course on DataCamp. They start the process by creating an n-fold cross-validation plan, which is done with the kWayCrossValidation() function from the vtreat package. They call it as follows:
splitPlan <- kWayCrossValidation(nRows, nSplits, dframe, y)
Then, they suggest running a for loop as follows:
dframe$pred.cv <- 0
# k is the number of folds
# splitPlan is the cross validation plan
for(i in 1:k) {
# Get the ith split
split <- splitPlan[[i]]
# Build a model on the training data
# from this split
# (lm, in this case)
model <- lm(fmla, data = dframe[split$train,])
# make predictions on the
# application data from this split
dframe$pred.cv[split$app] <- predict(model, newdata = dframe[split$app,])
}
This results in a new column in the data frame with the predictions, per the last line of the above chunk of code.
My question is whether the predicted values in the data frame are in fact averages across the 3 folds, or just the values from the 3rd run of the for loop.
Am I missing a detail, or is that exactly what this code does? If so, wouldn't it defeat the purpose of 3-fold (or any-fold) cross-validation, since it would simply output the results of the last iteration? Shouldn't we be outputting the average over all the folds, as laid out in splitPlan?
Thank you.

I see there is some confusion about the purpose of K-fold cross-validation. The idea is not to average predictions over different folds, but rather to average some measure of the prediction error across folds, so as to estimate the test error.
First of all, as you are new to SO, note that you should always provide some data to work with. Since your question is not data-contingent, I just simulated some. Still, providing data is good practice: it helps us help you.
Check the following code, which slightly modifies what you have provided in the post:
library(vtreat)
# Simulating data.
set.seed(1986)
X = matrix(rnorm(2000, 0, 1), nrow = 1000, ncol = 2)
epsilon = matrix(rnorm(1000, 0, 0.01), nrow = 1000)
y = X[, 1] + X[, 2] + epsilon
dta = data.frame(X, y, pred.cv = NA)
# Folds.
nRows = dim(dta)[1]
nSplits = 3
splitPlan = kWayCrossValidation(nRows, nSplits)
# Fitting model on all folds but i-th.
for(i in 1:nSplits)
{
# Get the i-th split.
split = splitPlan[[i]]
# Build a model on the training data from this split.
model = lm(y ~ ., data = dta[split$train, -4])
# Make predictions on the application data from this split.
dta$pred.cv[split$app] = predict(model, newdata = dta[split$app, -4])
}
# Now compute an estimate of the test error using pred.cv.
mean((dta$y - dta$pred.cv)^2)
What the for loop does is fit a linear model on all folds but the i-th (i.e., on dta[split$train, -4]), and then use the fitted model to make predictions on the i-th fold (i.e., dta[split$app, -4]). At least, I am assuming that split$train and split$app serve those roles, as the documentation is really lacking (which is usually a bad sign). Notice I am removing the 4th column (dta$pred.cv), as it just pre-allocates memory to store the predictions (it is not a feature!).
At each iteration we are not filling the whole of dta$pred.cv, but only the subset of rows belonging to the i-th fold (stored each time in split$app). Thus, at the end, each row of that column holds the out-of-sample prediction from the one iteration in which that row was in the application fold; nothing is averaged and nothing gets overwritten.
The real rationale for cross-validation comes in here. Let me introduce the concepts of training, validation, and test sets. In data analysis, the ideal is to have a data set so large that we can divide it into three subsamples: the first is used to train the algorithms (fitting models), the second to validate the models (tuning them), and the third to choose the best model in terms of some performance measure (usually mean squared error, or MSE, for regression).
However, we often do not have that many data points (especially if you are an economist). Thus, we seek an estimator of the test MSE, so that the need for a separate test split disappears. This is what K-fold cross-validation does: in turn, each fold is treated as the test set and the union of all the others as the training set. Then we make predictions as in your code (in the loop) and save them. What you were missing is the last line of the code I provided: the average of the squared errors, i.e. the cross-validated MSE. That provides an estimate of the test MSE, and we choose the model yielding the lowest value.
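If you prefer the per-fold formulation, here is a minimal sketch reusing dta and splitPlan from the code above: compute the MSE within each fold and then average the K values (with folds of roughly equal size this is essentially the same number as the pooled mean above).
# Per-fold MSE, then the average across the K folds.
fold_mse <- sapply(splitPlan, function(split) {
  mean((dta$y[split$app] - dta$pred.cv[split$app])^2)
})
mean(fold_mse)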
That being said, I had never heard of the vtreat package before. If you are into data analysis, I suggest having a look at the tidyverse and caret packages. As far as I know (and from what I see here on SO), they are widely used and very well documented. It may be worth learning them.


From cv.glmnet get confusion matrix

Explanation of the Problem
I am comparing a few models, and my dataset is so small that I would much rather use cross-validation than split out a validation set. One of my models is built with glm ("GLM"), another with cv.glmnet ("GLMNET"). In pseudocode, what I'd like to be able to do is the following:
initialize empty 2x2 matrices GLM_CONFUSION and GLMNET_CONFUSION
# Cross validation loop
For each data point VAL in my dataset X:
Let TRAIN be the rest of X (not including VAL)
Train GLM on TRAIN, use it to predict VAL
Depending on if it were a true positive, false positive, etc...
add 1 to the correct entry in GLM_CONFUSION
Train GLMNET on TRAIN, use it to predict VAL
Depending on if it were a true positive, false positive, etc...
add 1 to the correct entry in GLMNET_CONFUSION
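(For concreteness, the GLM half of that loop could be rendered in R roughly as the sketch below; the data frame X with a binary factor column y is made up for illustration.)
# Sketch: leave-one-out loop accumulating a 2x2 confusion matrix for glm.
GLM_CONFUSION <- matrix(0, 2, 2, dimnames = list(pred = c("0", "1"), obs = c("0", "1")))
for (i in seq_len(nrow(X))) {
  fit  <- glm(y ~ ., data = X[-i, ], family = binomial)
  p    <- predict(fit, newdata = X[i, , drop = FALSE], type = "response")
  pred <- ifelse(p > 0.5, "1", "0")
  obs  <- as.character(X$y[i])
  GLM_CONFUSION[pred, obs] <- GLM_CONFUSION[pred, obs] + 1
}
GLM_CONFUSION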
This is not hard to do; the problem lies in cv.glmnet already using cross-validation to deduce the best value of the penalty lambda. It would be convenient if I could have cv.glmnet automatically build up the confusion matrix of the best model, i.e. my code would look like:
initialize empty 2x2 matrices GLM_CONFUSION and GLMNET_CONFUSION
Train GLMNET on X using cv.glmnet
Set GLMNET_CONFUSION to be the confusion matrix of lambda.1se (or lambda.min)
# Cross validation loop
For each data point VAL in my dataset X:
Let TRAIN be the rest of X (not including VAL)
Train GLM on TRAIN, use it to predict VAL
Depending on if it were a true positive, false positive, etc...
add 1 to the correct entry in GLM_CONFUSION
Not only would it be convenient, it is somewhat of a necessity - there are two alternatives:
Use cv.glmnet to find a new lambda.1se on TRAIN at every iteration of the cross validation loop. (i.e. a nested cross-validation)
Use cv.glmnet to find lambda.1se on X, and then 'fix' that value and treat it like a normal model to train during the cross validation loop. (two parallel cross-validations)
The second one is philosophically incorrect as it means GLMNET would have information on what it is trying to predict in the cross validation loop. The first would take a large chunk of time - I could in theory do it, but it might take half an hour and I feel as if there should be a better way.
What I've Looked At So Far
I've looked at the documentation of cv.glmnet - it does not seem like you can do what I am asking, but I am very new to R and data science in general so it is perfectly possible that I have missed something.
I have also looked on this website and seen some posts that at first glance appeared to be relevant, but in fact are asking for something different - for example, this post: tidy predictions and confusion matrix with glmnet
The above post appears similar to what I want, but it is not quite what I am looking for - it appears they are using predict.cv.glmnet to make new predictions, and then creating the confusion matrix of that - whereas I want the confusion matrix of the predictions made during the cross validation step.
I'm hoping that someone is able to either
Explain if and how it is possible to create the confusion matrix as described
Show that there is a third alternative separate to the two I proposed
"Hand-implement cv.glmnet" is not a viable alternative :P
Conclusively state that what I want is not possible and that I need to do one of the two alternatives I mentioned.
Any one of those would be a perfectly fine answer to this question (although I'm hoping for option 1!)
Apologies if there is something simple I have missed!
Thanks to @missuse's advice, I was able to get a solution that works for me! It corresponds to option 2 in my post, with the alternative being to use the caret package.
In essence, we need to attach a custom summary function to caret's model trainer. I mostly bumbled about for a couple of hours until I got it to work; there may be better ways to do this, and I encourage others to post alternate answers if they know of any! My code is at the bottom (it has been slightly modified so it is not specific to the task I was working on).
Hopefully if anyone has a similar problem then this will help. Another resource that I found useful in solving this was the following post: https://stats.stackexchange.com/questions/299653/caret-glmnet-vs-cv-glmnet, as in it you can see very clearly how to convert a call to cv.glmnet into a call to caret's train version of glmnet.
library(caret)
# Confusion Matrix of model outputs
CM <- function(model) {
  # Need to find index of best tune found by
  # cross validation
  idx <- 1
  for (i in 1:nrow(model$results)) {
    check <- model$results[i,]
    foundBest <- TRUE
    for (col in colnames(model$bestTune)) {
      if (check[,col] != model$bestTune[,col]) {
        foundBest <- FALSE
        break
      }
    }
    if (foundBest) {
      idx <- i
      break
    }
  }
  # They are averaged w.r.t. the number of folds (ctrl$number)
  # hence the multiplication
  c(
    model$results[idx,]$true_pos,
    model$results[idx,]$false_pos,
    model$results[idx,]$false_neg,
    model$results[idx,]$true_neg
  ) * model$control$number
}
# Summary function from the training to give confusion metric
SummaryFunc <- function (data, lev = NULL, model = NULL) {
  # This puts our output in the right format
  out <- postResample(data$pred, data$obs)
  # Get the confusion matrix
  cm <- confusionMatrix(
    factor(data$pred, levels=c(0, 1)),
    factor(data$obs, levels=c(0, 1))
  )$table
  # Add those details to the output
  oldnames <- names(out)
  out <- c(out, cm[1, 1], cm[2, 1], cm[1, 2], cm[2, 2])
  names(out) <- c(oldnames, "true_pos", "false_pos", "false_neg", "true_neg")
  out
}
# 10-fold cross validation, as in cv.glmnet implementation
ctrl <- trainControl(
  method="cv",
  number=10,
  summaryFunction=SummaryFunc
)
# Example of standard glm
our.glm <- train(
your_formula,
data=your_data,
method="glm",
family=gaussian(link="identity"),
trControl=ctrl,
metric="RMSE"
)
# Example of what used to be cv.glmnet
our.glmnet <- train(
your_feature_matrix,
your_label_matrix,
method="glmnet",
family=gaussian(link="identity"),
trControl=ctrl,
metric="RMSE",
tuneGrid = expand.grid(
alpha = 1,
lambda = seq(0.001, 0.1, by=0.001)
)
)
CM(our.glm)
CM(our.glmnet)
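A quick way to view those four counts as a 2x2 table (just a reshaping sketch; the pos/neg labels follow the names used in SummaryFunc above):
to_matrix <- function(cm) {
  matrix(cm, nrow = 2, byrow = TRUE,
         dimnames = list(predicted = c("pos", "neg"), actual = c("pos", "neg")))
}
to_matrix(CM(our.glm))
to_matrix(CM(our.glmnet))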

number of trees in h2o.gbm

In the traditional gbm package, we can use
predict.gbm(model, newdata=..., n.trees=...)
so that I can compare results for different numbers of trees on the test data.
In h2o.gbm, although I can pass n.tree, it seems to have no effect on the result. The predictions are all the same as with the default model:
h2o.test.pred <- as.vector(h2o.predict(h2o.gbm.model, newdata=test.frame, n.tree=100))
R2(h2o.test.pred, test.mat$y)
[1] -0.00714109
h2o.test.pred <- as.vector(h2o.predict(h2o.gbm.model, newdata=test.frame, n.tree=10))
R2(h2o.test.pred, test.mat$y)
[1] -0.00714109
Does anybody have a similar problem? How can it be solved? h2o.gbm is much faster than gbm, so it would be great if I could get detailed results for each number of trees.
I don't think H2O supports what you are describing.
BUT, if what you are after is to get the performance against the number of trees used, that can be done at model building time.
library(h2o)
h2o.init()
iris <- as.h2o(iris)
parts <- h2o.splitFrame(iris,c(0.8,0.1))
train <- parts[[1]]
valid <- parts[[2]]
test <- parts[[3]]
m <- h2o.gbm(1:4, 5, train,
validation_frame = valid,
ntrees = 100, #Max desired
score_tree_interval = 1)
h2o.scoreHistory(m)
plot(m)
The score history will show the evaluation after adding each new tree. plot(m) will show a chart of this. Looks like 20 is plenty for iris!
BTW, if your real purpose was to find out the optimum number of trees to use, then switch early stopping on, and it will do that automatically for you. (Just make sure you are using both validation and test data frames.)
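A minimal sketch of what that might look like (the stopping_* argument names are my assumption based on recent h2o releases; check ?h2o.gbm for your version):
m2 <- h2o.gbm(1:4, 5, train,
  validation_frame = valid,
  ntrees = 1000,                 # generous upper bound
  score_tree_interval = 1,
  stopping_rounds = 5,           # stop after 5 scoring rounds without improvement
  stopping_metric = "logloss",   # iris is a classification problem
  stopping_tolerance = 1e-4)
h2o.scoreHistory(m2)             # shows where training actually stopped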
As of 3.20.0.6 H2O does support this. The method you are looking for is
staged_predict_proba. For classification models it produces predicted class probabilities after each iteration (tree), for every observation in your testing frame. For regression models (i.e. when response is numerical), although not really documented, it produces the actual prediction for every observation in your testing frame.
From these predictions it is also easy to compute various performance metrics (AUC, r2 etc), assuming that's what you're after.
Python API:
staged_predict_proba = model.staged_predict_proba(test)
R API:
staged_predict_proba <- h2o.staged_predict_proba(model, prostate.test)
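From there, computing a metric is plain R. As a small illustration for the regression case (a sketch; how you extract a given iteration's column from the staged-prediction frame depends on your h2o version, so that part is left as a hypothetical comment):
# R^2 for one vector of staged predictions against the true response.
r2 <- function(pred, actual) {
  1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)
}
# Hypothetical usage, once iteration k's predictions are in a plain numeric vector:
# pred_k <- as.vector(staged_predict_proba[[k]])
# r2(pred_k, actual)   # actual = numeric response vector from the test frame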

Successive training in neuralnet

I have a huge trainData set and I want to draw random subsets out of it (let's say 1000 times) and use them to train a neural network object successively. Is this possible with the neuralnet R package? What I am thinking about is something like:
library(neuralnet)
for (i in 1:1000) {
classA <- 2000
classB <- 2000
dataB <- trainData[sample(which(trainData$class == "B"), classB, replace=TRUE),] #withdraw 2000 samples from class B
dataU <- trainData[sample(which(trainData$class == "A"), classA, replace=TRUE),] #withdraw 2000 samples from class A
subset <- rbind(dataB, dataU) #bind them to make a subset
and then feed this subset of actual trainData to train the neuralnet object again and again like:
nn <- neuralnet(formula, data=subset, hidden=c(3,5), linear.output = F, stepmax = 2147483647) #use that subset for training the neural network
}
My question is: will this neuralnet object named nn be trained in every iteration of the loop, and when the loop finishes, will I get a fully trained neural network object? Secondly, what will be the effect of non-convergence in the cases where neuralnet is unable to converge for a particular subset? Will it affect the prediction results?
The shortest answer - No
More nuanced answer - Sort of ...
Why? - Because the neuralnet::neuralnet function is not designed to return the weights if the threshold is not reached within stepmax. However, if the threshold is reached, the resulting object will contain the final weights. These weights could then be fed to the neuralnet function as the startweights argument allowing for successive learning. Your call would look like the following:
# nn.prior = previously run neuralnet object
nn <- neuralnet(formula, data=subset, hidden=c(3,5), linear.output = F, stepmax = 2147483647, startweights = nn.prior$weights)
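Put together with the sampling loop from the question, a minimal sketch might look like this (it assumes trainData and the model formula from the question exist, and it only carries weights forward when a run actually converges and returns them):
library(neuralnet)
weights <- NULL
for (i in 1:1000) {
  dataB <- trainData[sample(which(trainData$class == "B"), 2000, replace = TRUE), ]
  dataA <- trainData[sample(which(trainData$class == "A"), 2000, replace = TRUE), ]
  subset <- rbind(dataB, dataA)
  nn <- neuralnet(formula, data = subset, hidden = c(3, 5), linear.output = FALSE,
                  stepmax = 2147483647, startweights = weights)
  # Keep the new weights only if this run converged and returned them.
  if (!is.null(nn$weights)) weights <- nn$weights
}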
However, I initially answered 'No' because choosing a threshold that extracts a suitable amount of information from a subset, while also making sure the fit 'converges' before stepmax, would likely be a guessing game and not very objective.
You have essentially four options I can think of:
Find another package that allows for this explicitly
Get the neuralnet source code and modify it to return the weights even when 'convergence' isn't achieved (i.e. reaching threshold).
Take a suitably sized random subset, build your model on that, and test its performance. (This is actually quite common practice AFAIK.)
Take all your subsets, build a model on each and look into combining them as an 'ensemble' model.
I would recommend using k-fold cross-validation to train many nets, using library(e1071) and the tune function.

R - Random Forest - Importance / varImpPlot

I have an issue with Random Forest and the importance / varImpPlot functions; I hope someone can help me with it.
I tried two versions of the code but am confused about the (different) results:
1.)
rffit = randomForest(price~.,data=train,mtry=x,ntree=500)
rfvalpred = predict(rffit,newdata=test)
varImpPlot(rffit)
importance(rffit)
This shows the plot and the importance data, but only "IncNodePurity". Also, the values differ between the plot and the printed data; I tried the scale argument but that did not work.
2.)
rf.analyzed_data = randomForest(price~.,data=train,mtry=x,ntree=500,importance=TRUE)
yhat.rf = predict(rf.analyzed_data,newdata=test)
varImpPlot(rf.analyzed_data)
importance(rf.analyzed_data)
In that case it no longer produces any plot, and the importance data shows both "%IncMSE" and "IncNodePurity", but the "IncNodePurity" values are different from those in the first version. Why?
Questions:
1.) Any idea why data is different for “IncNodePurity”?
2.) Any idea why no “%IncMSE” is shown in the first version?
3.) Why no plot is shown in the second version?
Many thanks!!
Ed
1) IncNodePurity is derived from the loss function, and you get that measure for free just by training the model. On the downside, it is a more unstable estimate, as results may vary between model runs. It is also more biased, as it favors variables with many levels. I guess the differences you found are due to randomness.
2) The permutation variable importance, %IncMSE, takes a little extra time to compute and is therefore optional. Roughly, each variable's values in the data set need to be shuffled in turn, and every OOB sample needs to be predicted once per tree per variable. As the randomForest package is designed, you have to compute this importance during training: importance must be set to TRUE, otherwise varImpPlot cannot plot it, because it has not been computed.
3) Not sure. In this code example I see both plots at least.
library(randomForest)
#data
X = data.frame(replicate(6,rnorm(1000)))
y = with(X, X1^2 + sin(X2*pi) + X3*X4)
train = data.frame(y=y,X=X)
#training
rf1=randomForest(y~.,data=train,importance=F)
rf2=randomForest(y~.,data=train, importance=T)
#plotting importance
varImpPlot(rf1) #plot only with IncNodePurity
varImpPlot(rf2) #bi-plot also with %IncMSE
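As a small follow-up sketch (my addition, not from the original answer): fixing the seed before training makes the IncNodePurity values reproducible across runs (point 1), and both importance() and varImpPlot() accept a scale argument, which helps when comparing the printed values with the plot:
set.seed(42)                      # reproducible IncNodePurity across runs
rf3 = randomForest(y~., data=train, importance=T)
importance(rf3, scale=FALSE)      # raw (unscaled) %IncMSE and IncNodePurity
varImpPlot(rf3, scale=FALSE)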

Custom parameter tuning for KNN in caret

I have a k-nearest-neighbors implementation that lets me compute, in a single pass, predictions for multiple values of k and for multiple subsets of training and test data (e.g. all the folds in K-fold cross-validation, i.e. resampling metrics). My implementation can also leverage multiple cores.
I would like to interface my method with the caret package. I can easily build a custom method for the train function, but this results in multiple calls to the model fit (one for each combination of parameter and fold).
As far as I know, I can't specify tuning strategies when using trainControl. The source code of train mentions something about "seq" model fitting:
## There are two types of methods to build the models: "basic" means that each tuning parameter
## combination requires it's own model fit and "seq" where a single model fit can be used to
## get predictions for multiple tuning parameters.
But I can't see any way to actually use that with custom models.
Any clue on how to approach this?
More generally, suppose you have a model class where you can estimate prediction errors across multiple parameters from a single model fit (e.g. à la the linear regression LOOCV trick, but for multiple parameter values too): how would you interface it with caret?
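(For reference, the LOOCV trick mentioned above is the hat-matrix shortcut: the leave-one-out residuals of a linear model can be recovered from a single fit, as in this small sketch on built-in data.)
# Leave-one-out residuals from a single lm fit, via the hat values.
fit <- lm(mpg ~ wt + hp, data = mtcars)
h <- hatvalues(fit)
loo_resid <- residuals(fit) / (1 - h)
mean(loo_resid^2)   # LOOCV estimate of the MSE, with no refitting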
Here's some example code to set up an (empty) custom model in caret:
# Custom caret
library(caret)
learning_data = data.frame(y=sample(c("one","two","three"),200,replace=T))
learning_data = cbind(learning_data,matrix(runif(3*200),ncol=3))
testRatio=0.75
inTrain <- createDataPartition(learning_data$y, p = testRatio, list = FALSE)
trainExpr <- learning_data[inTrain,]
testExpr <- learning_data[-inTrain,]
trainClass <- trainExpr$y
testClass <- testExpr$y
trainExpr$y<-NULL
testExpr$y<-NULL
cv_opts = trainControl(method="cv", number=4,verboseIter=T)
my_knn <- function(data,weight,parameter,levels,last,...){
print("training")
# print(dim(data))
# str(parameter)
# list(fit=rdist(data$,data))
list(fit=NA)
}
my_knn_pred <- function(object,newdata){
print("testing")
# str(object)
# print(dim(newdata))
return("one")
}
sortFunc <- function(x) x[order(x$k),]
# Values of K to test
knn_opts = data.frame(.k=c(seq(7,11, 2))) #odd to avoid ties
custom_tr = trainControl(method="cv", number=4,verboseIter=T, custom=list(parameters=knn_opts,model=my_knn,prediction=my_knn_pred,probability=NULL,sort=sortFunc))
# This will result in 12 calls, 6 to my_knn, 6 to my_knn_pred, one per combination of fold and parameter value
custom_knn_performances <- train(x = trainExpr, y = trainClass,method = "custom",trControl=custom_tr,tuneGrid=knn_opts)
I would like to control the training procedure so as to generate predictions for all folds and parameter values in a single call.
The current custom model fit parts of train don't allow for sequential parameters.
The next release will. All of the specific model code will no longer be hard-coded and will be modularized (including the sequential parameters).
The work is about 80% done and I hope to have it out before the end of the year. I want to do a lot of testing on this version.
Drop me an email if you would like to kick it around before it is released (no warranty though).
Max
