How to resample and compare the results when I just want to predict the last row of the data using survival functions in the mlr package in R?

I have just started trying the R package mlr, and I am wondering if I can customize the training set and the test set. For example, all the data of a time sequence except the last row form the training set, and the last row is the test set.
Here is my example:
library(mlr)
library(survival)
library(dplyr)   # needed for %>% and select()
data(lung)
myData2 <- lung %>%
  select(time, status, age)
myData2$status = (myData2$status == 2)
myTrain <- c(1:(nrow(myData2) - 1))
myTest <- nrow(myData2)
The lung data come from the survival package. I use only three columns: time, status, and age. Now, let's suppose they do not mean the patients' ages and how long they survive; say this is the ink purchase history of one customer.
age=74 means this customer bought 74 bottles of ink on that day, and time=306 means the customer ran out of ink after 306 days. So, I want to build a survival model using all the data except the last row. Then, when I have the data of the last row, which is age=58 (implying the customer bought 58 bottles of ink on that day), I can make a prediction on time; a number close to 177 would be a good estimate. So my training set and test set are fixed, and they do not need to be resampled.
In addition, I need to change the hyperparameters for a comparison. Here is my code:
surv.task <- makeSurvTask(data = myData2, target = c('time', 'status'))
surv.lrn <- makeLearner("surv.cforest")
ps <- makeParamSet(
  makeDiscreteParam('mincriterion', values = c(1.281552, 2, 3)),
  makeDiscreteParam('ntree', values = c(100, 200, 300))
)
ctrl <- makeTuneControlGrid()
rdesc <- makeResampleDesc('Holdout', split = 1, predict = 'train')
lrn = makeTuneWrapper(surv.lrn, control = ctrl, resampling = rdesc, par.set = ps,
                      measures = list(setAggregation(cindex, train.mean)))
mod <- train(learner = lrn, task = surv.task, subset = myTrain)
surv.pred <- predict(mod, task = surv.task, subset = myTest)
surv.pred
You can see that I use split=1 in makeResampleDesc because I have a fixed training set that does not need to be resampled. The measures argument in makeTuneWrapper is currently not meaningful to me, as I need to define my own measure. Because of the fixed data split, I cannot use functions like resample or tuneParams to get an evaluation on the test data for different hyperparameters.
So, my question is: when the training set and test set are fixed, can mlr provide a comprehensive comparison for every hyperparameter setting? If so, how do I do it?
Incidentally, it looks like the function makeFixedHoldoutInstance might be able to do this, but I do not know how to use it. For example, when I use makeFixedHoldoutInstance in this way, I get the following error:
> f <- makeFixedHoldoutInstance(train.inds=myTrain,test.inds=myTest,size=length(myTrain)+1)
> lrn = makeTuneWrapper(surv.lrn,control=ctrl,resampling=f,par.set=ps)
> resample(learner=lrn,task=surv.task,resampling=f)
[Resample] holdout iter 1: [Tune] Started tuning learner surv.cforest for parameter set:
Type len Def Constr Req Tunable Trafo
mincriterion discrete - - 1.281552,2,3 - TRUE -
ntree discrete - - 100,200,300 - TRUE -
With control class: TuneControlGrid
Imputation value: -0
[Tune-x] 1: mincriterion=1.281552; ntree=100
Error in resample.fun(learner2, task, resampling, measures = measures, :
Size of data set: 227 and resampling instance: 228 differ!

With makeFixedHoldoutInstance you get the resampling you asked for.
But you cannot use the same fixed resampling indices both for the tuning inside the tuning wrapper and for the outer resampling.
This is because resample first splits the data according to the fixed holdout instance f. The tuning inside the tuning wrapper then also needs a resampling method to calculate the performance for a given configuration. As the tuning only sees the data remaining after the split done by resample, it cannot apply the same fixed resampling to it.
From reading your question, I guess you don't want to use the tune wrapper but want to tune your learner directly. So you should simply use tuneParams:
tr = tuneParams(learner = surv.lrn, task = surv.task, resampling = cv2, par.set = ps, control = ctrl)
Note: this does not work on the given example because the cindex needs at least one uncensored observation, and even then it does not make much sense, because the cindex is only meaningful for a larger test set.
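If you do want the tuning itself to respect the fixed split, a minimal sketch (my own addition, not part of the original answer, and subject to the cindex caveat above) would pass the fixed holdout instance straight to tuneParams:
# Sketch: tune on the fixed train/test split by handing the
# FixedHoldoutInstance to tuneParams (objects as defined in the question).
f <- makeFixedHoldoutInstance(train.inds = myTrain, test.inds = myTest,
                              size = nrow(myData2))
tr <- tuneParams(learner = surv.lrn, task = surv.task, resampling = f,
                 par.set = ps, control = ctrl)
tr$x                         # best hyperparameter combination on that split
as.data.frame(tr$opt.path)   # performance of every grid point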

Related

Cross-Validation in R using vtreat Package

I'm currently learning about cross-validation through a course on DataCamp. They start the process by creating an n-fold cross-validation plan. This is done with the kWayCrossValidation() function from the vtreat package. They call it as follows:
splitPlan <- kWayCrossValidation(nRows, nSplits, dframe, y)
Then, they suggest running a for loop as follows:
dframe$pred.cv <- 0
# k is the number of folds
# splitPlan is the cross validation plan
for(i in 1:k) {
# Get the ith split
split <- splitPlan[[i]]
# Build a model on the training data
# from this split
# (lm, in this case)
model <- lm(fmla, data = dframe[split$train,])
# make predictions on the
# application data from this split
dframe$pred.cv[split$app] <- predict(model, newdata = dframe[split$app,])
}
This results in a new column in the data frame with the predictions, per the last line of the above chunk of code.
My doubt is thus whether the predicted values in the data frame will in fact be averages of the 3 folds, or whether they will just be those of the 3rd run of the for loop.
Am I missing a detail here, or is this exactly what this code is doing, which would then defeat the purpose of 3-fold cross-validation (or any-fold cross-validation, for that matter), as it would simply output the results of the last iteration? Shouldn't we be looking to output the average of all the folds, as laid out in splitPlan?
Thank you.
I see there is some confusion about the scope of K-fold cross-validation. The idea is not to average predictions over different folds, but rather to average some measure of the prediction error across folds, in order to estimate the test error.
First of all, as you are new on SO, note that you should always provide some data to work with. As your question is not data-contingent in this case, I just simulated some. Still, it is good practice that helps us help you.
Check the following code, which slightly modifies what you provided in the post:
library(vtreat)
# Simulating data.
set.seed(1986)
X = matrix(rnorm(2000, 0, 1), nrow = 1000, ncol = 2)
epsilon = matrix(rnorm(1000, 0, 0.01), nrow = 1000)
y = X[, 1] + X[, 2] + epsilon
dta = data.frame(X, y, pred.cv = NA)
# Folds.
nRows = dim(dta)[1]
nSplits = 3
splitPlan = kWayCrossValidation(nRows, nSplits)
# Fitting model on all folds but i-th.
for(i in 1:nSplits)
{
# Get the i-th split.
split = splitPlan[[i]]
# Build a model on the training data from this split.
model = lm(y ~ ., data = dta[split$train, -4])
# Make predictions on the application data from this split.
dta$pred.cv[split$app] = predict(model, newdata = dta[split$app, -4])
}
# Now compute an estimate of the test error using pred.cv.
mean((dta$y - dta$pred.cv)^2)
What the for loop does is fit a linear model on all folds but the i-th (i.e., on dta[split$train, -4]), and then use the fitted model to make predictions on the i-th fold (i.e., dta[split$app, -4]). At least, I am assuming that split$train and split$app serve those roles, as the documentation is really lacking (which is usually a bad sign). Notice that I am removing the 4th column (dta$pred.cv), as it just pre-allocates memory to store all the predictions (it is not a feature!).
At each iteration we are not filling the whole of dta$pred.cv, but only a subset of it (the rows of the i-th fold, stored each time in split$app). Thus, at the end, that column stores for each row the single prediction made when that row was in the held-out fold; across the K iterations every row gets filled exactly once.
The real rationale for cross-validation comes in here. Let me introduce the concepts of training, validation, and test sets. In data analysis, the ideal is to have a data set large enough to divide into three subsamples: the first is used to train the algorithms (fitting models), the second to validate the models (tuning them), and the third to choose the best model in terms of some performance measure (usually mean squared error, MSE, for regression).
However, we often do not have that many data points (especially if you are an economist). Thus, we seek an estimator of the test MSE, so that the need for a separate test set disappears. This is what K-fold cross-validation does: in turn, each fold is treated as the test set and the union of all the others as the training set. Then we make predictions as in your code (in the loop) and save them. What you were missing is the last line in the code I provided: the mean squared error computed from pred.cv (equivalently, the average of the per-fold MSEs when the folds have equal size). That gives us an estimate of the test MSE, and we choose the model yielding the lowest value.
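If you prefer to see the fold-by-fold version explicitly, here is a small sketch (my own addition, reusing the objects defined above) that averages the per-fold MSEs:
# Compute the MSE within each fold and average across folds.
fold_mse <- sapply(seq_len(nSplits), function(i) {
  idx <- splitPlan[[i]]$app
  mean((dta$y[idx] - dta$pred.cv[idx])^2)
})
mean(fold_mse)   # K-fold CV estimate of the test MSE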
That being said, I had never heard of the vtreat package before. If you are into data analysis, I suggest having a look at the tidyverse and caret packages. As far as I know (and from what I see here on SO), they are widely used and very well documented. They may be worth learning.

Meaning of alpha and beta parameters in function makeFeatSelControlSequential (MLR library in R)

For deterministic forward or backward search, I'm used to giving thresholds for the p-values of the coefficients of individual features. In the documentation of makeFeatSelControlSequential in R/mlr (https://www.rdocumentation.org/packages/mlr/versions/2.13/topics/FeatSelControl), the alpha and beta parameters are described as follows:
alpha
(numeric(1)): Parameter of the sequential feature selection. Minimal required value of improvement difference for a forward / adding step. Default is 0.01.
beta
(numeric(1)): Parameter of the sequential feature selection. Minimal required value of improvement difference for a backward / removing step. Negative values imply that you allow a slight decrease for the removal of a feature. Default is -0.001.
It is, however, not clear what "improvement difference" means here. In the example below, I gave 0 as the threshold for a backward selection (the beta parameter). If this parameter related to a threshold on a p-value, I would expect to get the model without any features, but that is not the case, as I get an AUC of 0.9886302 instead of 0.5.
# 1. Find a synthetic dataset for supervised learning (two classes)
###################################################################
library(mlbench)
data(BreastCancer)
# generate 1000 rows, 21 quantitative candidate predictors and 1 target variable
p<-mlbench.waveform(1000)
# convert list into dataframe
dataset<-as.data.frame(p)
# drop third class to get 2 classes
dataset2 = subset(dataset, classes != 3)
dataset2 <- droplevels(dataset2)
# 2. Perform cross validation with embedded feature selection using logistic regression
##########################################################################################
library(BBmisc)
library(mlr)
set.seed(123, "L'Ecuyer")
set.seed(21)
# Choice of data
mCT <- makeClassifTask(data =dataset2, target = "classes")
# Choice of algorithm
mL <- makeLearner("classif.logreg", predict.type = "prob")
# Choice of cross-validations for folds
outer = makeResampleDesc("CV", iters = 10,stratify = TRUE)
# Choice of feature selection method
ctrl = makeFeatSelControlSequential(method = "sbs", maxit = NA,beta = 0)
# Choice of sampling between training and test within the fold
inner = makeResampleDesc("Holdout",stratify = TRUE)
lrn = makeFeatSelWrapper(mL, resampling = inner, control = ctrl)
r = resample(lrn, mCT, outer, extract = getFeatSelResult,measures = list(mlr::auc,mlr::acc,mlr::brier),models=TRUE)
The parameters control what difference in performance (for whatever performance measure you choose) is acceptable to proceed with a step in a forward or backward search. mlr doesn't compute any p-values, and no p-values are used in this process.
As the parameters only control what happens in a single step, they also don't directly control the final outcome. What happens under the hood is that, e.g. for forward search, mlr computes the performance of every feature set that extends the current one by a single feature and chooses the best one, as long as it provides at least the improvement specified by alpha (or beta for removal steps). This procedure repeats until all features (forward search) or no features (backward search) are included, or until no minimum improvement as specified by the parameters can be achieved.
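To see this stepwise behaviour directly, here is a small sketch (my own addition, reusing the objects defined in the question) that runs the selection on its own and inspects the search path:
# Run the sequential backward search standalone and look at every step it took.
res <- selectFeatures(learner = mL, task = mCT, resampling = inner,
                      control = ctrl, measures = mlr::auc)
res$x                        # the feature set the search ended with
as.data.frame(res$opt.path)  # every evaluated feature set and its AUC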

number of trees in h2o.gbm

In traditional gbm, we can use
predict.gbm(model, newdata = ..., n.trees = ...)
so that we can compare results for different numbers of trees on the test data.
In h2o.gbm, although there is an n.tree argument to set, it seems to have no effect on the result. It is always the same as the default model:
h2o.test.pred <- as.vector(h2o.predict(h2o.gbm.model, newdata=test.frame, n.tree=100))
R2(h2o.test.pred, test.mat$y)
[1] -0.00714109
h2o.test.pred <- as.vector(h2o.predict(h2o.gbm.model, newdata=test.frame, n.tree=10))
> R2(h2o.test.pred, test.mat$y)
[1] -0.00714109
Does anybody have a similar problem? How can I solve it? h2o.gbm is much faster than gbm, so if it could give detailed results for each number of trees, that would be great.
I don't think H2O supports what you are describing.
BUT, if what you are after is the performance as a function of the number of trees used, that can be done at model building time.
library(h2o)
h2o.init()
iris <- as.h2o(iris)
parts <- h2o.splitFrame(iris,c(0.8,0.1))
train <- parts[[1]]
valid <- parts[[2]]
test <- parts[[3]]
m <- h2o.gbm(1:4, 5, train,
             validation_frame = valid,
             ntrees = 100,            # max desired
             score_tree_interval = 1)
h2o.scoreHistory(m)
plot(m)
The score history will show the evaluation after each new tree is added, and plot(m) will show a chart of this. It looks like 20 trees are plenty for iris!
BTW, if your real purpose is to find the optimum number of trees to use, then switch early stopping on and it will do that automatically for you. (Just make sure you are using both validation and test data frames.)
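For instance, a rough sketch of the early-stopping idea (my own illustration using standard h2o.gbm arguments; the thresholds are arbitrary):
m <- h2o.gbm(1:4, 5, train,
             validation_frame = valid,
             ntrees = 1000,                # generous upper bound
             score_tree_interval = 1,
             stopping_rounds = 5,          # stop after 5 scoring rounds without improvement
             stopping_metric = "deviance",
             stopping_tolerance = 1e-4)
h2o.scoreHistory(m)                        # shows where training actually stopped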
As of 3.20.0.6, H2O does support this. The method you are looking for is staged_predict_proba. For classification models it produces the predicted class probabilities after each iteration (tree), for every observation in your test frame. For regression models (i.e. when the response is numerical), although not really documented, it produces the actual prediction for every observation in your test frame.
From these predictions it is also easy to compute various performance metrics (AUC, R2, etc.), assuming that's what you're after.
Python API:
staged_predict_proba = model.staged_predict_proba(test)
R API:
staged_predict_proba <- h2o.staged_predict_proba(model, prostate.test)

Successive training in neuralnet

I have a huge trainData set and I want to draw random subsets from it (let's say 1000 times) and use them to train the neural network object successively. Is it possible to do this with the neuralnet R package? What I am thinking of is something like:
library(neuralnet)
for (i in 1:1000) {
classA <- 2000
classB <- 2000
dataB <- trainData[sample(which(trainData$class == "B"), classB, replace=TRUE),] #withdraw 2000 samples from class B
dataU <- trainData[sample(which(trainData$class == "A"), classA, replace=TRUE),] #withdraw 2000 samples from class A
subset <- rbind(dataB, dataU) #bind them to make a subset
and then feed this subset of actual trainData to train the neuralnet object again and again like:
nn <- neuralnet(formula, data=subset, hidden=c(3,5), linear.output = F, stepmax = 2147483647) #use that subset for training the neural network
}
My question is: will this neuralnet object named nn be trained in every iteration of the loop, and when the loop finishes, will I get a fully trained neural network object? Secondly, what will be the effect of non-convergence in the cases where neuralnet is unable to converge for a particular subset? Will it affect the prediction results?
The shortest answer - No
More nuanced answer - Sort of ...
Why? - Because the neuralnet::neuralnet function is not designed to return the weights if the threshold is not reached within stepmax. However, if the threshold is reached, the resulting object will contain the final weights. These weights could then be fed to the neuralnet function as the startweights argument allowing for successive learning. Your call would look like the following:
# nn.prior = previously run neuralnet object
nn <- neuralnet(formula, data=subset, hidden=c(3,5), linear.output = F, stepmax = 2147483647, startweights = nn.prior$weights)
However, I initially answered 'No' because choosing a threshold that extracts a suitable amount of information from a subset while also making sure the network 'converges' before stepmax would likely be a guessing game and not very objective.
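As a rough sketch of what that could look like (my own elaboration, reusing trainData and formula from the question; untested):
# Carry the weights from one subset to the next; skip iterations where
# neuralnet does not reach the threshold and therefore returns no weights.
prev_weights <- NULL
for (i in 1:1000) {
  idxB <- sample(which(trainData$class == "B"), 2000, replace = TRUE)
  idxA <- sample(which(trainData$class == "A"), 2000, replace = TRUE)
  subset_i <- rbind(trainData[idxB, ], trainData[idxA, ])
  fit <- neuralnet(formula, data = subset_i, hidden = c(3, 5),
                   linear.output = FALSE, stepmax = 1e6,
                   startweights = prev_weights)
  if (!is.null(fit$weights)) {   # converged: keep this net and reuse its weights
    nn <- fit
    prev_weights <- nn$weights
  }
}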
You have essentially four options I can think of:
Find another package that allows for this explicitly
Get the neuralnet source code and modify it to return the weights even when 'convergence' isn't achieved (i.e. reaching threshold).
Take a suitably sized random subset, just build your model on that, and test its performance. (This is actually quite common practice, AFAIK.)
Take all your subsets, build a model on each and look into combining them as an 'ensemble' model.
I would recommend using k-fold cross-validation to train many nets, for example with library(e1071) and the tune function.

Custom parameter tuning for KNN in caret

I have a k-nearest-neighbors implementation that lets me compute, in a single pass, predictions for multiple values of k and for multiple subsets of training and test data (e.g. all the folds in K-fold cross-validation, a.k.a. resampling metrics). My implementation can also leverage multiple cores.
I would like to interface my method with the caret package. I can easily build a custom method for the train function, but this will result in multiple calls to the model fit (one for each combination of parameter and fold).
As far as I know, I can't indicate tuning strategies when using trainControl. The source code of train mentions something about "seq" model fitting:
## There are two types of methods to build the models: "basic" means that each tuning parameter
## combination requires it's own model fit and "seq" where a single model fit can be used to
## get predictions for multiple tuning parameters.
But I can't see any way to actually use that with custom models.
Any clue on how to approach this?
More generally, suppose you have a model class where you can estimate prediction errors across multiple parameter values using a single model fit (e.g. à la the linear regression LOOCV trick, but for multiple parameter values too); how would you interface it with caret?
Here's some example code to set up an (empty) custom model in caret:
# Custom caret
library(caret)
learning_data = data.frame(y=sample(c("one","two","three"),200,replace=T))
learning_data = cbind(learning_data,matrix(runif(3*200),ncol=3))
testRatio=0.75
inTrain <- createDataPartition(learning_data$y, p = testRatio, list = FALSE)
trainExpr <- learning_data[inTrain,]
testExpr <- learning_data[-inTrain,]
trainClass <- trainExpr$y
testClass <- testExpr$y
trainExpr$y<-NULL
testExpr$y<-NULL
cv_opts = trainControl(method="cv", number=4,verboseIter=T)
my_knn <- function(data, weight, parameter, levels, last, ...) {
  print("training")
  # print(dim(data))
  # str(parameter)
  # list(fit=rdist(data$,data))
  list(fit = NA)
}
my_knn_pred <- function(object, newdata) {
  print("testing")
  # str(object)
  # print(dim(newdata))
  return("one")
}
sortFunc <- function(x) x[order(x$k), ]
# Values of K to test
knn_opts = data.frame(.k=c(seq(7,11, 2))) #odd to avoid ties
custom_tr = trainControl(method="cv", number=4,verboseIter=T, custom=list(parameters=knn_opts,model=my_knn,prediction=my_knn_pred,probability=NULL,sort=sortFunc))
# This will result in 12 calls, 6 to my_knn, 6 to my_knn_pred, one per combination of fold and parameter value
custom_knn_performances <- train(x = trainExpr, y = trainClass,method = "custom",trControl=custom_tr,tuneGrid=knn_opts)
I would like to control the training procedure so as to generate predictions for all folds and parameter values in a single call.
The current custom model fit parts of train don't allow for sequential parameters.
The next release will. All of the specific model code will no longer be hard-coded and will be modularized (including the sequential parameters).
The work is about 80% done and I hope to have it out before the end of the year. I want to do a lot of testing on this version.
Drop me an email if you would like to kick it around before it is released (no warranty though).
Max
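For readers on later caret versions: as far as I know, the modularized interface Max describes ended up allowing a custom method to be passed to train as a list, whose optional loop element covers exactly this sub-model case. The sketch below is my own, untested illustration of that later interface for a simple kNN (built on class::knn and the trainExpr/trainClass objects from the question); it is not the API that existed at the time of this answer.
library(caret)
library(class)   # class::knn() does the actual neighbor search

knnSeq <- list(
  label      = "kNN with sub-models",
  library    = "class",
  type       = "Classification",
  parameters = data.frame(parameter = "k", class = "numeric", label = "#Neighbors"),
  grid       = function(x, y, len = NULL, search = "grid")
                 data.frame(k = seq(5, 5 + 2 * (len - 1), by = 2)),
  # Fit only the largest k; the smaller k values become "submodels" whose
  # predictions are produced later from the same stored training data.
  loop = function(grid) {
    grid <- grid[order(grid$k, decreasing = TRUE), , drop = FALSE]
    list(loop = grid[1, , drop = FALSE],
         submodels = list(grid[-1, , drop = FALSE]))
  },
  # "Fitting" kNN just stores the training data and the chosen k.
  fit = function(x, y, wts, param, lev, last, classProbs, ...)
          list(x = x, y = y, k = param$k),
  # Predict for the main k, then once per submodel k, reusing the stored data.
  predict = function(modelFit, newdata, submodels = NULL) {
    out <- as.character(class::knn(modelFit$x, newdata, modelFit$y, k = modelFit$k))
    if (!is.null(submodels)) {
      tmp <- lapply(submodels$k, function(kk)
        as.character(class::knn(modelFit$x, newdata, modelFit$y, k = kk)))
      out <- c(list(out), tmp)
    }
    out
  },
  prob = NULL,
  sort = function(x) x[order(x$k), ]
)

fit <- train(trainExpr, trainClass, method = knnSeq,
             trControl = trainControl(method = "cv", number = 4),
             tuneLength = 3)
fit$results   # one row per k, but only one "fit" per fold
With the loop element, each resampling fold fits only one model (the largest k) and fills in the remaining k values from that single fit, which is the single-pass behaviour the question asks for.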
